A new technique allows an AI agent, as it learns to complete a task through reinforcement learning, to be guided by data crowdsourced asynchronously from nonexpert human users. The method trains the robot faster and more effectively than other approaches.
Researchers frequently use reinforcement learning to teach an AI agent a new task, such as how to open a kitchen cabinet. Reinforcement learning is a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to the goal.
In many cases, a human expert must carefully design a reward function, an incentive mechanism that motivates the agent to explore. The expert must then iteratively update that reward function as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that does not rely on an expertly designed reward function. Instead, it uses crowdsourced feedback, gathered from many nonexpert users, to guide the agent as it learns to reach its goal.
While other methods also attempt to use nonexpert feedback, this new approach enables the AI agent to learn more quickly, even though data crowdsourced from users is often inaccurate, noise that can cause other methods to fail. Furthermore, because this new approach allows feedback to be gathered asynchronously, nonexpert users around the world can contribute to teaching the agent.
“One of the most time-consuming and challenging parts in designing a robotic agent today is engineering the reward function. Today reward functions are designed by expert researchers — a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of reward function and by making it possible for nonexperts to provide useful feedback,” says Pulkit Agrawal, an assistant professor in the MIT Department of Electrical Engineering and Computer Science (EECS) who leads the Improbable AI Lab in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot learn to perform specific tasks in a user’s home quickly, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced nonexpert feedback guiding its exploration.
“In our method, the reward function guides the agent to what it should explore, instead of telling it exactly what it should do to complete the task. So, even if the human supervision is somewhat inaccurate and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant in the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.
Noisy feedback
One method for gathering user feedback for reinforcement learning is to show the user two photos of states the agent has reached and then ask which state is closer to the goal. For example, a robot’s goal could be to open a kitchen cabinet. One image could show the robot opening the cabinet, while the other could show it opening the microwave. The user would pick the photo of the “better” state.
Some previous approaches used this crowdsourced binary feedback to optimize a reward function that the agent would then use to learn the task. But because nonexperts are prone to making errors, the learned reward function can become very noisy, which may cause the agent to get stuck and never reach its goal.
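To make that failure mode concrete, here is a minimal sketch of one common way such a reward function is fit from pairwise comparisons, in the style of a Bradley-Terry preference model. The two-dimensional state features, the linear reward model, and the 25 percent label-flip rate are illustrative assumptions for this sketch, not details from the paper; the point is simply that flipped answers pull the learned reward away from the true one.

```python
# Minimal sketch (not the authors' code): fitting a reward model from noisy
# pairwise comparisons. The 2-D state features, linear reward model, and 25%
# label-flip rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -0.5])              # hidden "true" reward weights
states = rng.normal(size=(200, 2))          # achieved states, 2-D features

def true_reward(s):
    return s @ true_w

def noisy_label(s_a, s_b, noise=0.25):
    """Nonexpert answer to 'which state is better?', flipped 25% of the time."""
    better_is_a = true_reward(s_a) > true_reward(s_b)
    return better_is_a if rng.random() > noise else not better_is_a

# Fit a linear reward model w by logistic regression on the comparisons,
# i.e., maximize the Bradley-Terry likelihood of the (noisy) preferences.
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    i, j = rng.integers(len(states), size=2)
    s_a, s_b = states[i], states[j]
    y = 1.0 if noisy_label(s_a, s_b) else 0.0
    diff = s_a - s_b
    p = 1.0 / (1.0 + np.exp(-diff @ w))     # P(a preferred over b)
    w += lr * (y - p) * diff                # gradient step on log-likelihood

print("true weights:   ", true_w)
print("learned weights:", w)                # label noise blurs the estimate
```

An agent that optimizes this blurred reward directly inherits every labeling mistake, which is the behavior the researchers set out to avoid.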
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So, instead of directly optimizing over the reward function, we just use it to tell the robot which areas it should be exploring,” Torne says.
He and his collaborators decoupled the process into two separate parts, each directed by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On one side, a goal selector algorithm is continuously updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, the nonexpert users drop breadcrumbs that incrementally lead the agent toward its goal.
On the other side, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of actions that it tries, which are then sent to humans and used to update the goal selector.
This narrows down the area for the agent to explore, leading it to more promising areas that are closer to its goal. But if there is no feedback, or if feedback takes a while to arrive, the agent will keep learning on its own, albeit in a slower manner. This enables feedback to be gathered infrequently and asynchronously.
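As a rough illustration of that decoupling, the toy sketch below runs the two loops on a simple two-dimensional point world. The simulated comparisons, the noise rate, the linear goal-selector score, and the random-walk exploration are all stand-in assumptions rather than the HuGE implementation; what matters is the structure, in which feedback updates the goal selector only occasionally while the exploration loop keeps running regardless.

```python
# Toy illustration (an assumption, not the authors' code) of two decoupled
# loops in the spirit of HuGE: a goal selector trained from noisy comparisons,
# and a self-supervised exploration loop that it gently steers toward the goal.
import numpy as np

rng = np.random.default_rng(1)
GOAL = np.array([5.0, 5.0])

def human_feedback(s_a, s_b, noise=0.3):
    """Stand-in for a crowdsourced comparison: which state looks closer to
    the goal? Wrong 30 percent of the time, like a nonexpert labeler."""
    better_is_a = np.linalg.norm(s_a - GOAL) < np.linalg.norm(s_b - GOAL)
    return better_is_a if rng.random() > noise else not better_is_a

# --- Loop 1: goal selector, updated asynchronously from comparisons ---------
def update_goal_selector(w, reached):
    """Fit a linear score over reached states from pairwise human answers."""
    for _ in range(50):
        i, j = rng.integers(len(reached), size=2)
        a, b = reached[i], reached[j]
        y = 1.0 if human_feedback(a, b) else 0.0
        diff = a - b
        p = 1.0 / (1.0 + np.exp(-diff @ w))      # P(a preferred over b)
        w = w + 0.05 * (y - p) * diff
    return w

def select_frontier_goal(w, reached):
    """Pick the reached state the humans rate as most promising so far."""
    scores = np.array(reached) @ w
    return reached[int(np.argmax(scores))]

# --- Loop 2: self-supervised exploration guided by the goal selector --------
w = np.zeros(2)
reached = [np.zeros(2)]                          # states visited so far
for episode in range(200):
    frontier = select_frontier_goal(w, reached)
    s = frontier.copy()
    for _ in range(10):                          # random exploration near the frontier
        s = s + rng.normal(scale=0.5, size=2)
        reached.append(s)
    if episode % 20 == 0:                        # feedback arrives only occasionally
        w = update_goal_selector(w, reached)

closest = min(reached, key=lambda s: np.linalg.norm(s - GOAL))
print("closest state reached to the goal:", closest)
```

Even if the feedback loop stalls, the exploration loop above keeps collecting new states on its own; incoming comparisons simply bias which frontier it explores from next.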
“The exploration loop can keep going autonomously because it is just going to explore and learn new things. And then when you get a better signal, it is going to explore in more concrete ways. You can just keep them turning at their own pace,” adds Torne. And because the feedback is just gently guiding the agent’s behavior, it will eventually learn to complete the task even if users provide incorrect answers.
Faster learning
The researchers put this method to the test on a variety of simulated and real-world tasks. In simulation, they used HuGE to effectively learn tasks with long sequences of actions, such as stacking blocks in a specific order or navigating a large maze.
In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. They gathered data from 109 nonexpert users across three continents for these tests. In both simulated and real-world experiments, HuGE helped agents learn to achieve the goal faster than other methods did.
The researchers also found that data crowdsourced from nonexperts outperformed synthetic data produced and labeled by the researchers. For nonexpert users, labeling 30 images or videos took less than two minutes.
“This makes it very promising in terms of being able to scale up this method,” Torne adds.
In a related paper, presented at the recent Conference on Robot Learning, the researchers improved HuGE so that an AI agent could learn to perform the task and then autonomously reset the environment to continue learning. For example, if the agent learns to open a cabinet, the method will also instruct the agent on how to close the cabinet.
“Now we can have it learn completely autonomously without needing human resets,” Torne says.
The researchers also emphasize the importance of aligning AI agents with human values in this and other learning approaches. They intend to continue refining HuGE in the future so that the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to teach multiple agents at once.