Researchers have developed a method that allows a robot to learn a new pick-and-place task with only a few human demonstrations. This could allow a human to reprogram a robot to grasp previously unseen objects presented in random poses in about 15 minutes.
With e-commerce orders pouring in, a warehouse robot pulls mugs from a shelf and places them in boxes for shipping. Everything is running smoothly until the warehouse processes a change and the robot is forced to grasp taller, narrower mugs that are stored upside down.
Reprogramming that robot entails hand-labeling thousands of images that show it how to grasp these new mugs, followed by retraining the system. However, a new technique developed by MIT researchers would only require a few human demonstrations to reprogram the robot. This machine-learning method enables a robot to pick up and place previously unseen objects in previously unseen poses. The robot would be ready to perform a new pick-and-place task in 10 to 15 minutes.
The method employs a neural network that has been specifically designed to reconstruct the shapes of 3D objects. With just a few demonstrations, the system uses what the neural network has learned about 3D geometry to grasp new objects that are similar to those in the demos.
Using simulations and a real robotic arm, the researchers demonstrate that their system can manipulate never-before-seen mugs, bowls, and bottles arranged in random poses with only 10 demonstrations.
“Our main contribution is the general ability to provide new skills to robots that need to operate in more unstructured environments with a high degree of variability. Because this problem is typically much more difficult, the concept of generalization by construction is a fascinating capability,” says Anthony Simeonov, co-lead author of the paper and a graduate student in electrical engineering and computer science (EECS).
Grasping geometry
A robot may be trained to pick up a specific item, but if that object is lying on its side (perhaps it fell over), the robot sees this as a completely new scenario. This is one reason it is so hard for machine-learning systems to generalize to new object orientations.
To address this issue, the researchers developed a new type of neural network model called a Neural Descriptor Field (NDF), which learns the 3D geometry of a class of items. The model computes the geometric representation for a specific item using a 3D point cloud, which is a set of data points or coordinates in three dimensions. The data points can be obtained from a depth camera, which provides information on the distance between the object and a viewpoint. Although the network was trained in simulation on a large dataset of synthetic 3D shapes, it can be applied to real-world objects.
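To make the idea of a point cloud from a depth camera concrete, here is a minimal sketch in Python. It assumes a simple pinhole camera model and treats each depth value as the distance along the camera's viewing axis; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the researchers' system.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image into a list of (x, y, z) points.

    depth  : list of rows, each a list of distances in meters
    fx, fy : focal lengths in pixels (assumed pinhole camera)
    cx, cy : principal point in pixels (assumed pinhole camera)
    Pixels with non-positive depth are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v indexes image rows
        for u, z in enumerate(row):        # u indexes image columns
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth image with one missing reading (0.0).
depth = [[1.0, 1.0],
         [0.0, 2.0]]
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each valid pixel becomes one 3D coordinate, so the result is exactly the kind of "set of data points in three dimensions" described above.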
The team designed the NDF with a property known as equivariance. With this property, if the model is shown an image of an upright mug, and then shown an image of the same mug on its side, it understands that the second mug is the same object, just rotated.
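The property can be illustrated with a toy example. The sketch below is not the researchers' neural network: it swaps in a hand-written descriptor (the sorted distances from a query point to every point in the cloud) that happens to stay the same when the object and the query point are rotated and shifted together. All function names are hypothetical.

```python
import math


def rotate_z(p, angle):
    """Rotate point p about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def rotate_x(p, angle):
    """Rotate point p about the x axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (x, c * y - s * z, s * y + c * z)


def move(p, angle_z, angle_x, shift):
    """Apply a rigid motion: two rotations followed by a translation."""
    x, y, z = rotate_x(rotate_z(p, angle_z), angle_x)
    return (x + shift[0], y + shift[1], z + shift[2])


def descriptor(query, cloud):
    """Toy stand-in for a learned descriptor: sorted query-to-cloud distances."""
    return sorted(math.dist(query, p) for p in cloud)


# A made-up "upright mug" point cloud and a query point near its handle.
upright = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.2, 0.0),
           (0.0, 0.0, 0.3), (0.15, 0.05, 0.1)]
handle = (0.12, 0.02, 0.08)

# The same object tipped over and shifted, with the query moved along with it.
pose = (0.7, 1.3, (0.4, -0.2, 0.1))
tipped = [move(p, *pose) for p in upright]
tipped_handle = move(handle, *pose)

before = descriptor(handle, upright)
after = descriptor(tipped_handle, tipped)
max_change = max(abs(a - b) for a, b in zip(before, after))
```

Here `max_change` is zero up to floating-point rounding: the description of the handle does not depend on how the object happens to be lying.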
“This equivariance is what allows us to much more effectively handle cases where the object you observe is in some arbitrary orientation,” Simeonov says. As the NDF learns to reconstruct shapes of similar objects, it also learns to associate related parts of those objects. For instance, it learns that the handles of mugs are similar, even if some mugs are taller or wider than others, or have shorter or longer handles.
“If you wanted to do this with another approach, you’d have to hand-label all the parts. Instead, our approach automatically discovers these parts from the shape reconstruction,” says Yilun Du, the paper’s other co-lead author.
The researchers use this trained NDF model to teach a robot a new skill with only a few physical examples. They move the hand of the robot onto the part of an object they want it to grip, like the rim of a bowl or the handle of a mug, and record the locations of the fingertips.
Because the NDF has learned so much about 3D geometry and how to reconstruct shapes, it can infer the structure of a new shape, which enables the system to transfer the demonstrations to new objects in arbitrary poses, Du explains.
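A rough sketch of how a recorded fingertip location might be carried over to an object in a new pose follows. It is a toy under stated assumptions: the learned NDF is replaced by a hand-written descriptor (sorted distances to the point cloud), the new object is simply the demonstration object in a different pose, and the search runs over a short list of hypothetical candidate points rather than the pose optimization a real system would need.

```python
import math


def descriptor(query, cloud):
    """Toy stand-in for a learned descriptor: sorted query-to-cloud distances."""
    return sorted(math.dist(query, p) for p in cloud)


def descriptor_gap(a, b):
    """Squared difference between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transfer_point(demo_point, demo_cloud, new_cloud, candidates):
    """Pick the candidate whose descriptor best matches the demonstration."""
    target = descriptor(demo_point, demo_cloud)
    return min(candidates,
               key=lambda c: descriptor_gap(descriptor(c, new_cloud), target))


def new_pose(p):
    """Hypothetical new pose: rotate 90 degrees about z, then translate."""
    x, y, z = p
    return (-y + 0.3, x - 0.1, z + 0.2)


# Demonstration: made-up object points and one recorded fingertip location.
demo_cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.2, 0.0),
              (0.0, 0.0, 0.3), (0.15, 0.05, 0.1)]
demo_fingertip = (0.12, 0.02, 0.08)

# The object observed later in a different pose, plus candidate grasp points.
new_cloud = [new_pose(p) for p in demo_cloud]
candidates = [new_pose(demo_fingertip), (0.3, -0.1, 0.2),
              (0.5, 0.5, 0.5), (0.25, 0.0, 0.4)]

best = transfer_point(demo_fingertip, demo_cloud, new_cloud, candidates)
```

In this toy setup `best` is the first candidate, the point that sits on the moved object exactly where the fingertip sat during the demonstration.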
Picking a winner
The researchers tested their model in simulations and on a real robotic arm, using mugs, bowls, and bottles. On pick-and-place tasks with new objects in new orientations, their method had an 85 percent success rate, while the best baseline had a 45 percent success rate. Success entails grasping a new object and placing it in a specific location, such as hanging mugs on a rack.
Many baselines use 2D image information rather than 3D geometry, making it more difficult for these methods to integrate equivariance. This is one of the reasons why the NDF technique performed so much better.
While the researchers were pleased with its performance, their method is limited to the object category on which it is trained. A robot trained to pick up mugs will not be able to pick up boxes or headphones because their geometric features are too different from what the network was trained on.
“In the future, it would be ideal to scale it up to many categories or to completely abandon the concept of category altogether,” Simeonov says. They also intend to modify the system to accommodate nonrigid objects and, in the long run, to allow the system to perform pick-and-place tasks when the target area changes.