Tag Archives: Robotics

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping



From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.

Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information — if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and therefore has been removed from the scene. By using this form of self supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.
Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm can then pick up user-commanded objects.
Constructing a Perceptual Reward Function
In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is easy when success can be measured by simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.

However, engineering a reward function is much more difficult when our success criteria depends on perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering the question of object recognition: Do these objects match?
On the left, the gripper is holding the brush and there are some objects (yellow cup, blue plastic block) in the background. On the right, the gripper is holding the yellow cup and the brush is in the background. If the left image was the desired outcome, a good reward function should “understand” that the two images above correspond to different objects.
In order to solve this recognition problem, we need a perception system that extracts meaningful object concepts from unstructured image data (without any human annotations), learning the visual perception of objects in an unsupervised fashion. At their core, unsupervised learning algorithms work because they make structural assumptions about data. It is common to assume that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.

What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.
Left: Objects before the grasp. Center: Objects after the grasp. Right: The Grasped object.
If we then consider an embedding function that extracts “the set of objects” from images, it should preserve the following subtractive relation:
objects_before_grasp - objects_after_grasp = grasped_object
We implement this equality relation using a fully convolutional architecture and a simple metric learning algorithm. At training time, the architecture shown below embeds the pre-grasp images and post-grasp images into a dense spatial feature map. The maps are mean-pooled into vectors and the difference between the “before grasp” and “after grasp” vectors represents a set of objects. This vector and the corresponding vector representation of the grasped object are pushed to equivalence via the N-Pairs objective.
Add caption
Once trained, two useful properties emerge naturally from our model.

1. Object Similarity
The first property is that a cosine distance between vector embeddings allows us to compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, and allow robots to learn instance grasping without human-provided labels.
2. Localizing Target Objects
The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.
Using Grasp2Vec embeddings to localize objects in a scene. The image on the top left shows the objects in the bin. On the bottom left is the query object we wish to grasp. By taking the dot product of the query object vector with the spatial features of the scene image, we get a per-pixel “activation map” (top right image) of how similar that region of the image is to the query. This response map can be used to approach the object for grasping.
Our method also works when there are multiple objects that match the query object, or even if the query consists of multiple objects (the average of two vectors). For example, here is a scenario where it detects multiple orange blocks in a scene.
The resulting “heatmap” can be used to plan the robot approach to the target object(s). We combine Grasp2Vec’s localization and instance recognition capabilities with our “grasp anything” policies to obtain a success rate of 80% on objects seen during data collection and 59% on novel objects the robot hasn’t encountered before.

Conclusion
In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We then can use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.

Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.

Acknowledgements
This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.

Source: Google AI Blog


A Structured Approach to Unsupervised Depth Learning from Monocular Videos



Perceiving the depth of a scene is an important task for an autonomous robot — the ability to accurately estimate how far from the robot objects are, is crucial for obstacle avoidance, safe planning and navigation. While depth can be obtained (and learned) from sensor data, such as LIDAR, it is also possible to learn it in an unsupervised manner from a monocular camera only, relying on the motion of the robot and the resulting different views of the scene. In doing so, the “ego-motion” (the motion of the robot/camera between two frames) is also learned, which provides localization of the robot itself. While this approach has a long history — coming from the structure-from-motion and multi-view geometry paradigms — new learning based techniques, more specifically for unsupervised learning of depth and ego-motion by using deep neural networks, have advanced the state of the art, including work by Zhou et al., and our own prior research which aligns 3D point clouds of the scene during training.

Despite these efforts, learning to predict scene depth and ego-motion remains an ongoing challenge, specifically when handling highly dynamic scenes and estimating proper depth of moving objects. Because previous research efforts for unsupervised monocular learning do not model moving objects, it can result in consistent misestimation of objects’ depth, often resulting in mapping their depth to infinity.

In “Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos”, to appear in AAAI 2019, we propose a novel approach which is able to model moving objects and produces high quality depth estimation results. Our approach is able to recover the correct depth for moving objects compared to previous methods for unsupervised learning from monocular videos. In our paper, we also propose a seamless online refinement technique that can further improve quality and be applied for transfer across datasets. Furthermore, to encourage even more advanced approaches of onboard robotics learning, we have open sourced the code in TensorFlow.
Previous work (middle row) has not been able to correctly estimate depth of moving objects mapping them to infinity (dark blue regions in the heatmap). Our approach (right) provides much better depth estimates.
Structure
A key idea in our approach is to introduce structure into the learning framework. That is, instead of relying on a neural network to learn depth directly, we treat the monocular scene as 3D, composed of moving objects, including the robot itself. The respective motions are modeled as independent transformations — rotations and translations — in the scene, which is then used to model the 3D geometry and estimate all the objects’ motions. Additionally, knowing which objects may potentially move (e.g., cars, people, bicycles, etc.) helps us learn separate motion vectors for them even if they may be static. By decomposing the scene into 3D and individual objects, better depth and ego-motion in the scene is learned, especially on very dynamic scenes.

We tested this method on both KITTI and Cityscapes urban driving datasets, and found that it outperforms state-of-the-art approaches, and is approaching in quality methods which used stereo pair videos as training supervision. Importantly, we are able to recover correctly the depth of a car moving at the same speed as the ego-motion vehicle. This has been challenging previously — in this case, the moving vehicle appears (in a monocular input) as static, exhibiting the same behavior as the static horizon, resulting in an inferred infinite depth. While stereo inputs can solve that ambiguity, our approach is the first one that is able to correctly infer that from a monocular input.
Previous work with monocular inputs were not able to extract moving objects and incorrectly map them to infinity.
Furthermore, since objects are treated individually in our method, the algorithm is able to provide for the motion vectors for each individual object, i.e. which is an estimate of where it is heading:
Example depth results for a dynamic scene together with estimates of the motion vectors of the individual objects (rotation angles are estimated too, but for simplicity are not shown).
In addition to these results, this research provides motivation for further exploring what an unsupervised learning approach can achieve, as monocular inputs are cheaper and easier to deploy than stereo or LIDAR sensors. As can be seen in the figures below, in both the KITTI and Cityscapes datasets, the supervision sensor (be it stereo or LIDAR) is missing values and may occasionally be misaligned with the camera input, which happens due to time delay.
Depth prediction from monocular video input on the KITTI dataset, middle row, compared to ground truth depth from a Lidar sensor; the latter does not cover the full scene and has missing and noisy values. Ground truth depth is not used during training.
Depth prediction on the Cityscapes dataset. Left to right: image, baseline, our method and ground truth provided by stereo. Note the missing values in the stereo ground truth. Also note that our algorithm is able to achieve these results without any ground truth depth supervision.
Ego-motion
Our results also provide the best among the state-of-the-art estimates in ego-motion, which is crucial for autonomous robots, as it provides localization of the robots while moving in the environment. The video below shows results from our method that visualizes the speed and turning angle, obtained from the inferred ego-motion. While the outputs of both depth and ego-motion are valid up to a scalar, we can see that it is able to estimate its relative speed when slowing down and stopping.
Depth and ego-motion prediction. Follow the speed and the turning angle indicator to see the estimates when the car is taking a turn or stopping for a red light.
Transfer Across Domains
An important characteristic of a learning algorithm is its adaptability when moved to an unknown environment. In this work we further introduce an online refinement approach which continues to learn online while collecting new data. Below are examples of improvement of the estimated depth quality, after training on Cityscapes and online refinement on KITTI.
Online refinement when training on the Cityscapes Data and testing on KITTI. The images show depth prediction of the trained model, and of the trained model with online refinement. Depth prediction with online refinement better outlines the objects in the scene.
We further tested on a notably different dataset and setting, i.e. on an indoor dataset collected by the Fetch robot, while the training is done on the outdoor urban driving Cityscapes dataset. As to be expected, there is a large discrepancy between these datasets. Despite this, we observe that the online learning technique is able to obtain better depth estimates than the baseline.
Results of online adaptation when transferring the learning model from Cityscapes (an outdoors dataset collected from a moving car) to a dataset collected indoors by the Fetch robot. The bottom row shows improved depth after applying online refinement.
In summary, this work addresses unsupervised learning of depth and ego-motion from a monocular camera, and tackles the problem in highly dynamic scenes. It achieves high quality depth and ego-motion results and with quality comparable to stereo and sets forward the idea of incorporating structure in the learning process. More notably, our proposed combination of unsupervised learning of depth and ego-motion from monocular video only and online adaptation demonstrates a powerful concept, because not only can it learn in unsupervised manner from simple video, but it can also be transferred easily to other datasets.

Acknowledgements
This research was conducted by Vincent Casser, Soeren Pirk, Reza Mahjourian and Anelia Angelova. We would like to thank Ayzaan Wahid for his help with data collection and Martin Wicke and Vincent Vanhoucke for their support and encouragement.

Source: Google AI Blog


Scalable Deep Reinforcement Learning for Robotic Manipulation



How can robots acquire skills that generalize effectively to diverse, real-world objects and situations? While designing robotic systems that effectively perform repetitive tasks in controlled environments, like building products on an assembly line, is fairly routine, designing robots that can observe their surroundings and decide the best course of action while reacting to unexpected outcomes is exceptionally difficult. However, there are two tools that can help robots acquire such skills from experience: deep learning, which is excellent at handling unstructured real-world scenarios, and reinforcement learning, which enables longer-term reasoning while exhibiting more complex and robust sequential decision making. Combining these two techniques has the potential to enable robots to learn continuously from their experience, allowing them to master basic sensorimotor skills using data rather than manual engineering.

Designing reinforcement learning algorithms for robot learning introduces its own set of challenges: real-world objects span a wide variety of visual and physical properties, subtle differences in contact forces can make predicting object motion difficult and objects of interest can be obstructed from view. Furthermore, robotic sensors are inherently noisy, adding to the complexity. All of these factors makes it incredibly difficult to learn a general solution, unless there is enough variety in the training data, which takes time to collect. This motivates exploring learning algorithms that can effectively reuse past experience, similar to our previous work on grasping which benefited from large datasets. However, this previous work could not reason about the long-term consequences of its actions, which is important for learning how to grasp. For example, if multiple objects are clumped together, pushing one of them apart (called “singulation”) will make the grasp easier, even if doing so does not directly result in a successful grasp.
Examples of singulation.

To be more efficient, we need to use off-policy reinforcement learning, which can learn from data that was collected hours, days, or weeks ago. To design such an off-policy reinforcement learning algorithm that can benefit from large amounts of diverse experience from past interactions, we combined large-scale distributed optimization with a new fitted deep Q-learning algorithm that we call QT-Opt. A preprint is available on arXiv.

QT-Opt is a distributed Q-learning algorithm that supports continuous action spaces, making it well-suited to robotics problems. To use QT-Opt, we first train a model entirely offline, using whatever data we’ve already collected. This doesn’t require running the real robot, making it easier to scale. We then deploy and finetune that model on the real robot, further training it on newly collected data. As we run QT-Opt, we accumulate more offline data, letting us train better models, which lets us collect better data, and so on.

To apply this approach to robotic grasping, we used 7 real-world robots, which ran for 800 total robot hours over the course of 4 months. To bootstrap collection, we started with a hand-designed policy that succeeded 15-30% of the time. Data collection switched to the learned model when it started performing better. The policy takes a camera image and returns how the arm and gripper should move. The offline data contained grasps on over 1000 different objects.
Some of the training objects used.
In the past, we’ve seen that sharing experience across robots can accelerate learning. We scaled this training and data gathering process to ten GPUs, seven robots, and many CPUs, allowing us to collect and process a large dataset of over 580,000 grasp attempts. At the end of this process, we successfully trained a grasping policy that runs on a real world robot and generalizes to a diverse set of challenging objects that were not seen at training time.
Seven robots collecting grasp data.
Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.
The objects used at evaluation time. To make the task challenging, we aimed for a large variety of object sizes, textures, and shapes.

Notably, the policy exhibits a variety of closed-loop, reactive behaviors that are often not found in standard robotic grasping systems:
  • When presented with a set of interlocking blocks that cannot be picked up together, the policy separates one of the blocks from the rest before picking it up.
  • When presented with a difficult-to-grasp object, the policy figures out it should reposition the gripper and regrasp it until it has a firm hold.
  • When grasping in clutter, the policy probes different objects until the fingers hold one of them firmly, before lifting.
  • When we perturbed the robot by intentionally swatting the object out of the gripper -- something it had not seen during training -- it automatically repositioned the gripper for another attempt.
Crucially, none of these behaviors were engineered manually. They emerged automatically from self-supervised training with QT-Opt, because they improve the model’s long-term grasp success.
Examples of the learned behaviors. In the left GIF, the policy corrects for the moved ball. In the right GIF, the policy tries several grasps until it succeeds at picking up the tricky object.

Additionally, we’ve found that QT-Opt reaches this higher success rate using less training data, albeit with taking longer to converge. This is especially exciting for robotics, where the bottleneck is usually collecting real robot data, rather than training time. Combining this with other data efficiency techniques (such as our prior work on domain adaptation for grasping) could open several interesting avenues in robotics. We’re also interested in combining QT-Opt with recent work on learning how to self-calibrate, which could further improve the generality.

Overall, the QT-Opt algorithm is a general reinforcement learning approach that’s giving us good results on real world robots. Besides the reward definition, nothing about QT-Opt is specific to robot grasping. We see this as a strong step towards more general robot learning algorithms, and are excited to see what other robotics tasks we can apply it to. You can learn more about this work in the short video below.
Acknowledgements
This research was conducted by Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. We’d also like to give special thanks to Iñaki Gonzalo and John-Michael Burke for overseeing the robot operations, Chelsea Finn, Timothy Lillicrap, and Arun Nair for valuable discussions, and other people at Google and X who’ve contributed their expertise and time towards this research. A preprint is available on arXiv.

Source: Google AI Blog


Teaching Uncalibrated Robots to Visually Self-Adapt



People are remarkably proficient at manipulating objects without needing to adjust their viewpoint to a fixed or specific pose. This capability (referred to as visual motor integration) is learned during childhood from manipulating objects in various situations, and governed by a self-adaptation and mistake correction mechanism that uses rich sensory cues and vision as feedback. However, this capability is quite difficult for vision-based controllers in robotics, which until now have been built on a rigid setup for reading visual input data from a fixed mounted camera which should not be moved or repositioned at train and test time. The ability to quickly acquire visual motor control skills under large viewpoint variation would have substantial implications for autonomous robotic systems — for example, this capability would be particularly desirable for robots that can help rescue efforts in emergency or disaster zones.

In “Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control” presented at CVPR 2018 this week, we study a novel deep network architecture (consisting of two fully convolutional networks and a long short-term memory unit) that learns from a past history of actions and observations to self-calibrate. Using diverse simulated data consisting of demonstrated trajectories and reinforcement learning objectives, our visually-adaptive network is able to control a robotic arm to reach a diverse set of visually-indicated goals, from various viewpoints and independent of camera calibration.
Viewpoint invariant manipulation for visually indicated goal reaching with a physical robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from drastically different camera viewpoints. First row shows the visually indicated goals.

The Challenge
Discovering how the controllable degrees of freedom (DoF) affect visual motion can be ambiguous and underspecified from a single image captured from an unknown viewpoint. Identifying the effect of actions on image-space motion and successfully performing the desired task requires a robust perception system augmented with the ability to maintain a memory of past actions. To be able to tackle this challenging problem, we had to address the following essential questions:
  • How can we make it feasible to provide the right amount of experience for the robot to learn the self-adaptation behavior based on pure visual observations that simulate a lifelong learning paradigm?
  • How can we design a model that integrates robust perception and self-adaptive control such that it can quickly transfer to unseen environments?
To do so, we devised a new manipulation task where a seven-DoF robot arm is provided with an image of an object and is directed to reach that particular goal amongst a set of distractor objects, while viewpoints change drastically from one trial to another. In doing so, we were able to simulate both the learning of complex behaviors and the transfer to unseen environments.
Visually indicated goal reaching task with a physical robotic arm and diverse camera viewpoints.
Harnessing Simulation to Learn Complex Behaviors
Collecting robot experience data is difficult and time-consuming. In a previous post, we showed how to scale up learning skills by distributing the data collection and trials to multiple robots. Although this approach expedited learning, it is still not feasibly extendable to learning complex behaviors such as visual self-calibration, where we need to expose robots to a huge space of various viewpoints. Instead, we opt to learn such complex behavior in simulation where we can collect unlimited robot trials and easily move the camera to various random viewpoints. In addition to fast data collection in simulation, we can also surpass hardware limitations requiring the installation of multiple cameras around a robot.
We use domain randomization technique to learn generalizable policies in simulation.
To learn visually robust features to transfer to unseen environments, we used a technique known as domain randomization (a.k.a. simulation randomization) introduced by Sadeghi & Levine (2017), that enables robots to learn vision-based policies entirely in simulation such that they can generalize to the real world. This technique was shown to work well for various robotic tasks such as indoor navigation, object localization, pick and placing, etc. In addition, to learn complex behaviors like self-calibration, we harnessed the simulation capabilities to generate synthetic demonstrations and combined reinforcement learning objectives to learn a robust controller for the robotic arm.
Viewpoint invariant manipulation for visually indicated goal reaching with a simulated seven-DoF robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from dramatically different camera viewpoints.

Disentangling Perception from Control
To enable fast transfer to unseen environments, we devised a deep neural network that combines perception and control trained end-to-end simultaneously, while also allowing each to be learned independently if needed. This disentanglement between perception and control eases transfer to unseen environments, and makes the model both flexible and efficient in that each of its parts (i.e. 'perception' or 'control') can be independently adapted to new environments with small amounts of data. Additionally, while the control portion of the network was entirely trained by the simulated data, the perception part of our network was complemented by collecting a small amount of static images with object bounding boxes without needing to collect the whole action sequence trajectory with a physical robot. In practice, we fine-tuned the perception part of our network with only 76 object bounding boxes coming from 22 images.
Real-world robot and moving camera setup. First row shows the scene arrangements and the second row shows the visual sensory input to the robot.
Early Results
We tested the visually-adapted version of our network on a physical robot and on real objects with drastically different appearances than the ones used in simulation. Experiments were performed with both one or two objects on a table — “seen objects” (as labeled in the figure below) were used for visual adaptation using small collection of real static images, while “unseen objects” had not been seen during visual adaptation. During the test, the robot arm was directed to reach a visually indicated object from various viewpoints. For the two object experiments the second object was to "fool" the robotic arm. While the simulation-only network has good generalization capability (due to being trained with domain randomization technique), the very small amount of static visual data to visually adapt the controller boosted the performance, due to the flexible architecture of our network.
After adapting the visual features with the small amount of real images, performance was boosted by more than 10%. All used real objects are drastically different from the objects seen in simulation.
We believe that learning online visual self-adaptation is an important and yet challenging problem with the goal of learning generalizable policies for robots that can act in diverse and unstructured real world setup. Our approach can be extended to any sort of automatic self-calibration. See the video below for more information on this work.
Acknowledgements
This research was conducted by Fereshteh Sadeghi, Alexander Toshev, Eric Jang and Sergey Levine. We would also like to thank Erwin Coumans and Yunfei Bai for providing pybullet, and Vincent Vanhoucke for insightful discussions.




Source: Google AI Blog


IoT Developer Story: Deeplocal

Posted by Dave Smith, Developer Advocate for IoT

Deeplocal is a Pittsburgh-based innovation studio that makes inventions as marketing to help the world's most loved brands tell their stories. The team at Deeplocal built several fun and engaging robotics projects using Android Things. Leveraging the developer ecosystem surrounding the Android platform and the compute power of Android Things hardware, they were able to quickly and easily create robots powered by computer vision and machine learning.

DrawBot

DrawBot is a DIY drawing robot that transforms your selfies into physical works of art.

"The Android Things platform helped us move quickly from an idea, to prototype, to final product. Switching from phone apps to embedded code was easy in Android Studio, and we were able to pull in OpenCV modules, motor drivers, and other libraries as needed. The final version of our prototype was created two weeks after unboxing our first Android Things developer kit."

- Brian Bourgeois, Producer, Deeplocal

Want to build your own DrawBot? See the Hackster.io project for all the source code, schematics, and 3D models.

HandBot

A robotic hand that learns and reacts to hand gestures, HandBot visually recognizes gestures and applies machine learning.

"The Android Things platform made integration work for Handbot a breeze. Using TensorFlow, we were able to train a neural network to recognize hand gestures. Once this was created, we were able to use Android Things drivers to implement games in easy-to-read Android code. In a matter of weeks, we went from a fresh developer kit to competing against a robot hand in Rock, Paper, Scissors."

- Mike Derrick, Software Engineer, Deeplocal

Want to build your own HandBot? See the Hackster.io project for all the source code, schematics, and 3D models.

Visit the Google Hackster community to explore more inspiring ideas just like these, and join Google's IoT Developers Community on Google+ to get the latest platform updates, ask questions, and discuss ideas.

A Summary of the First Conference on Robot Learning



Whether in the form of autonomous vehicles, home assistants or disaster rescue units, robotic systems of the future will need to be able to operate safely and effectively in human-centric environments. In contrast to to their industrial counterparts, they will require a very high level of perceptual awareness of the world around them, and to adapt to continuous changes in both their goals and their environment. Machine learning is a natural answer to both the problems of perception and generalization to unseen environments, and with the recent rapid progress in computer vision and learning capabilities, applying these new technologies to the field of robotics is becoming a very central research question.

This past November, Google helped kickstart and host the first Conference on Robot Learning (CoRL) at our campus in Mountain View. The goal of CoRL was to bring machine learning and robotics experts together for the first time in a single-track conference, in order to foster new research avenues between the two disciplines. The sold-out conference attracted 350 researchers from many institutions worldwide, who collectively presented 74 original papers, along with 5 keynotes by some of the most innovative researchers in the field.
Prof. Sergey Levine, CoRL 2017 co-chair, answering audience questions.
Sayna Ebrahimi (UC Berkeley) presenting her research.
Videos of the inaugural CoRL are available on the conference website. Additionally, we are delighted to announce that next year, CoRL moves to Europe! CoRL 2018 will be chaired by Professor Aude Billard from the École Polytechnique Fédérale de Lausanne, and will tentatively be held in the Eidgenössische Technische Hochschule (ETH) in Zürich on October 29th-31st, 2018. Looking forward to seeing you there!
Prof. Ken Goldberg, CoRL 2017 co-chair, and Jeffrey Mahler (UC Berkeley) during a break.

Closing the Simulation-to-Reality Gap for Deep Robotic Learning



Each of us can learn remarkably complex skills that far exceed the proficiency and robustness of even the most sophisticated robots, when it comes to basic sensorimotor skills like grasping. However, we also draw on a lifetime of experience, learning over the course of multiple years how to interact with the world around us. Requiring such a lifetime of experience for a learning-based robot system is quite burdensome: the robot would need to operate continuously, autonomously, and initially at a low level of proficiency before it can become useful. Fortunately, robots have a powerful tool at their disposal: simulation.

Simulating many years of robotic interaction is quite feasible with modern parallel computing, physics simulation, and rendering technology. Moreover, the resulting data comes with automatically-generated annotations, which is particularly important for tasks where success is hard to infer automatically. The challenge with simulated training is that even the best available simulators do not perfectly capture reality. Models trained purely on synthetic data fail to generalize to the real world, as there is a discrepancy between simulated and real environments, in terms of both visual and physical properties. In fact, the more we increase the fidelity of our simulations, the more effort we have to expend in order to build them, both in terms of implementing complex physical phenomena and in terms of creating the content (e.g., objects, backgrounds) to populate these simulations. This difficulty is compounded by the fact that powerful optimization methods based on deep learning are exceptionally proficient at exploiting simulator flaws: the more powerful the machine learning algorithm, the more likely it is to discover how to "cheat" the simulator to succeed in ways that are infeasible in the real world. The question then becomes: how can a robot utilize simulation to enable it to perform useful tasks in the real world?

The difficulty of transferring simulated experience into the real world is often called the "reality gap." The reality gap is a subtle but important discrepancy between reality and simulation that prevents simulated robotic experience from directly enabling effective real-world performance. Visual perception often constitutes the widest part of the reality gap: while simulated images continue to improve in fidelity, the peculiar and pathological regularities of synthetic pictures, and the wide, unpredictable diversity of real-world images, makes bridging the reality gap particularly difficult when the robot must use vision to perceive the world, as is the case for example in many manipulation tasks. Recent advances in closing the reality gap with deep learning in computer vision for tasks such as object classification and pose estimation provide promising solutions.  For example,  Shrivastava et al. and Bousmalis et al. explored pixel-level domain adaptation. Ganin et al. and Bousmalis and Trigeorgis et al. focus on feature-level domain adaptation. These advances required a rethinking of the approaches used to solve the simulation-to-reality domain shift problem for robotic manipulation as well. Although a number of recent works have sought to address the reality gap in robotics, through techniques such as machine learning-based domain adaptation (Tzeng et al.) and randomization of simulated environments (Sadeghi and Levine), effective transfer in robotic manipulation has been limited to relatively simple tasks, such as grasping rectangular, brightly-colored objects (Tobin et al. and James et al.) and free-space motion (Christiano et al.).  In this post, we describe how learning in simulation, in our case PyBullet, and using domain adaptation methods such as machine learning methods that deal with the simulation-to-reality domain shift, can accelerate learning of robotic grasping in the real world. This approach can enable real robots to grasp a large of variety physical objects, unseen during training, with a high degree of proficiency.

The performance effect of using 8 million simulated samples of procedural objects with no randomization and various amounts of real data.

Before we consider introducing simulated experience, what does it take for our robots to learn to reliably grasp such not-before-seen objects with only real-world experience? In a previous post, we discussed how the Google Brain team and X’s robotics teams teach robots how to grasp a variety of ordinary objects by just using images from a single monocular camera. It takes tens to hundreds of thousands of grasp attempts, the equivalent of thousands of robot-hours of real-world experience. Although distributing the learning across multiple robots expedites this, the realities of real-world data collection, including maintenance and wear-and-tear, mean that these kinds of data collection efforts still take a significant amount of real time. As mentioned above, an appealing alternative is to use off-the-shelf simulators and learn basic sensorimotor skills like grasping in a virtual environment. Training a robot how to grasp in simulation can be parallelized easily over any number of machines, and can provide large amounts of experience in dramatically less time (e.g., hours rather than months) and at a fraction of the cost.


If the goal is to bridge the reality gap for vision-based robotic manipulation, we must answer a few critical questions. First, how do we design simulation so that simulated experience appears realistic to a neural network? And second, how should we integrate simulated and real experience in a way that maximizes transfer to the real world? We studied these questions in the context of a particularly challenging and important robotic manipulation task: vision-based grasping of diverse objects. We extensively evaluated the effect of various simulation design decisions in combination with various techniques for integrating simulated and real experience for maximal performance.
The setup we used for collecting the simulated and real-world datasets.

Images used during training of simulated grasping experience with procedurally generated objects (left) and of real-world experience with a varied collection of everyday physical objects (right). In both cases, we see pairs of image inputs with and without the robot arm present.
When it comes to simulation, there are a number of choices we have to make: the type of objects to use for simulated grasping, whether to use appearance and/or dynamics randomization, and whether to extract any additional information from the simulator that could aid adaptation to the real world. The types of objects we use in simulation is a particularly important one, and there are a number of choices. A question that comes naturally is: how realistic do the objects used in simulation need to be? Using randomly generated procedural objects is the most desirable choice, because these objects are generated effortlessly on demand, and are easy to parameterize if we change the requirements of the task. However, they are not realistic and one could imagine they might not be useful for transferring the experience of grasping them to the real world. Using realistic 3D object models from a publicly available model library, such as the widely used ShapeNet, is another choice, which however restricts our findings to be related to the characteristics of the specific models we are using. In this work, we compared the effect of using procedurally-generated and realistic objects from the ShapeNet model repository, and found that simply using random objects generated programmatically was not just sufficient for efficient experience transfer from simulation to reality, but also generalized better to the real world than using ShapeNet ones.

Some of the procedurally-generated objects used in simulation.

Some of the ShapeNet objects used in simulation.

Some of the physical objects used to collect real grasping experience.
Another decision about our simulated environment has to do with the randomization of the simulation. Simulation randomization has shown promise in providing generalization to real-world environments in previous work. We further evaluate randomization as a way to provide generalization by separately evaluating the effect of using appearance randomization (randomly changing textures of different visual components of the virtual environment), and dynamics randomization (randomly changing object mass, and friction properties). For our task, visual randomization had a positive effect when we did not use domain adaptation methods to aid with generalization, and had no effect when we included domain adaptation. Using dynamics randomization did not show a significant improvement for this particular task, however it is possible that dynamics randomization might be more relevant in other tasks. These results suggest that, although randomization can be an important part of simulation-to-real-world transfer, the inclusion of effective domain adaptation can have a substantially more pronounced impact for vision-based manipulation tasks.

Appearance randomization in simulation.
Finally, the information we choose to extract and use for our domain adaptation methods has a significant impact on performance. In one of our proposed methods, we utilize the extracted semantic map of the simulated image, ie the description of each pixel in the simulated image, and use it to ground our proposed domain adaptation approach to produce semantically-meaningful realistic samples, as we discuss below.

Our main proposed approach to integrating simulated and real experience, which we call GraspGAN, takes as input synthetic images generated by a simulator, along with their semantic maps, and produces adapted images that look similar to real-world ones. This is possible with adversarial training, a powerful idea proposed by Goodfellow et al. In our framework, a convolutional neural network, the generator, takes as input synthetic images and generates images that another neural network, the discriminator, cannot distinguish from actual real images. The generator and discriminator networks are trained simultaneously and improve together, resulting in a generator that can produce images that are both realistic and useful for learning a grasping model that will generalize to the real world. One way to make sure that these images are useful is the use of the semantic maps of the synthetic images to ground the generator. By using the prediction of these masks as an auxiliary task, the generator is encouraged to produce meaningful adapted images that correspond to the original label attributed to the simulated experience. We train a deep vision-based grasping model with both visually-adapted simulated and real images, and attempt to account for the domain shift further by using a feature-level domain adaptation technique which helps produce a domain-invariant model. See below the GraspGAN adapting simulated images to realistic ones and a semantic map it infers.


By using synthetic data and domain adaptation we are able to reduce the number of real-world samples required to achieve a given level of performance by up to 50 times, using only randomly generated objects in simulation. This means that we have no prior information about the objects in the real world, other than pre-specified size limits for the graspable objects. We have shown that we are able to increase performance with various amounts of real-world data, and also that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with hundreds of thousands of labeled real-world samples. This suggests that, instead of collecting labeled experience, it may be sufficient in the future to simply record raw unlabeled images, use them to train a GraspGAN model, and then learn the skills themselves in simulation.

Although this work has not addressed all the issues around closing the reality gap, we believe that our results show that using simulation and domain adaptation to integrate simulated and real robotic experience is an attractive choice for training robots. Most importantly, we have extensively evaluated the performance gains for different available amounts of labeled real-world samples, and for the different design choices for both the simulator and the domain adaptation methods used. This evaluation can hopefully serve as a guide for practitioners to use for their own design decisions and for weighing the advantages and disadvantages of incorporating such an approach in their experimental design.

This research was conducted by K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M, Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, V. Vanhoucke, with special thanks to colleagues at Google Research and X who've contributed their expertise and time to this research. An early preprint is available on arXiv.

The collection of procedurally-generated objects we used in simulation was made publicly available here by Laura Downs.











Teaching Robots to Understand Semantic Concepts



Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means.

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, to get them to follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our is aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must understand what is the semantically salient event that constitutes task success, and then use reinforcement learning to perform it.
Examples of human demonstrations (left) and the corresponding robotic imitation (right).
Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.
After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.
Learning progression.
Emulating human movements with self-supervision and imitation.
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations.

In a pose imitation task for example, different dimensions of the representation may encode for different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.
Self-supervised human pose imitation by a robot.
A striking evidence of the benefits of learning end-to-end is the many-to-one and highly non-linear joints mapping shown above. In this example, the up-down motion involves many joints for the human while only one joint is needed for the robot. We show that the robot has discovered this highly complex mapping on its own, without any explicit human pose information.

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robots must interpret the semantics of the task -- salient events and relevant features of the pose. What if instead of showing the task, the human simply wants to tell it to what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.”
In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).
To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous post and prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.
The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.
A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary data of grasping that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on desired semantic category, as illustrated in the video below:
Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understanding, robotic perception, grasping, and imitation learning has considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work into combining self-supervised and human-labeled data in the context of autonomous robotic systems.

Acknowledgements
The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.

Unsupervised Perceptual Rewards for Imitation Learning was presented at RSS 2017 by Kelvin Xu, and Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation will be presented this week at the CVPR Workshop on Deep Learning for Robotic Vision.

Introducing Cartographer

We are happy to announce the open source release of Cartographer, a real-time simultaneous localization and mapping (SLAM) library in 2D and 3D with ROS support.

SLAM algorithms combine data from various sensors (e.g. LIDAR, IMU and cameras) to simultaneously compute the position of the sensor and a map of the sensor’s surroundings. For example, consider this approach to drawing a floor plan of your living room:
  • Grab a laser rangefinder, stand in the middle of the room, and draw an X on a piece of paper.
  • Measure the distance from where you’re standing to any wall.
  • Draw a line on the paper where the wall is and write down the distance between the X (your position) and the wall.
  • Measure the distance from where you’re standing to another wall and add it to the drawing as well.
  • Now, move to another part of the room.
  • Since the walls (hopefully) haven’t moved, you can measure your distance to the same two walls to determine your new position.


SLAM is an essential component of autonomous platforms such as self driving cars, automated forklifts in warehouses, robotic vacuum cleaners, and UAVs.

Cartographer builds globally consistent maps in real-time across a broad range of sensor configurations common in academia and industry. The following video is a demonstration of Cartographer’s real-time loop closure:


A detailed description of Cartographer’s 2D algorithms can be found in our ICRA 2016 paper.

Thanks to ROS integration and support from external contributors, Cartographer is ready to use on several robot platforms with ROS support:
At Google, Cartographer has enabled a range of applications from mapping museums and transit hubs to enabling new visualizations of famous buildings.

We recognize the value of high quality datasets to the research community. That’s why, thanks to cooperation with the Deutsches Museum (the largest tech museum in the world), we are also releasing three years of LIDAR and IMU data collected using our 2D and 3D mapping backpack platforms during the development and testing of Cartographer.


Our focus is on advancing and democratizing SLAM as a technology. Currently, Cartographer is heavily focused on LIDAR SLAM. Through continued development and community contributions, we hope to add both support for more sensors and platforms as well as new features, such as lifelong mapping and localizing in a pre-existing map.

By Damon Kohler, Wolfgang Hess, and Holger Rapp, Google Engineering

How Robots Can Acquire New Skills from Their Shared Experience



The ability to learn from experience will likely be a key in enabling robots to help with complex real-world tasks, from assisting the elderly with chores and daily activities, to helping us in offices and hospitals, to performing jobs that are too dangerous or unpleasant for people. However, if each robot must learn its full repertoire of skills for these tasks only from its own experience, it could take far too long to acquire a rich enough range of behaviors to be useful. Could we bridge this gap by making it possible for robots to collectively learn from each other’s experiences?

While machine learning algorithms have made great strides in natural language understanding and speech recognition, the kind of symbolic high-level reasoning that allows people to communicate complex concepts in words remains out of reach for machines. However, robots can instantaneously transmit their experience to other robots over the network - sometimes known as "cloud robotics" - and it is this ability that can let them learn from each other.

This is true even for seemingly simple low-level skills. Humans and animals excel at adaptive motor control that integrates their senses, reflexes, and muscles in a closely coordinated feedback loop. Robots still struggle with these basic skills in the real world, where the variability and complexity of the environment demands well-honed behaviors that are not easily fooled by distractors. If we enable robots to transmit their experiences to each other, could they learn to perform motion skills in close coordination with sensing in realistic environments?

We previously wrote about how multiple robots could pool their experiences to learn a grasping task. Here, we will discuss new experiments that we conducted to investigate three possible approaches for general-purpose skill learning across multiple robots: learning motion skills directly from experience, learning internal models of physics, and learning skills with human assistance. In all three cases, multiple robots shared their experiences to build a common model of the skill. The skills learned by the robots are still relatively simple -- pushing objects and opening doors -- but by learning such skills more quickly and efficiently through collective learning, robots might in the future acquire richer behavioral repertoires that could eventually make it possible for them to assist us in our daily lives.

Learning from raw experience with model-free reinforcement learning.
Perhaps one of the simplest ways for robots to teach each other is to pool information about their successes and failures in the world. Humans and animals acquire many skills by direct trial-and-error learning. During this kind of ‘model-free’ learning -- so called because there is no explicit model of the environment formed -- they explore variations on their existing behavior and then reinforce and exploit the variations that give bigger rewards. In combination with deep neural networks, model-free algorithms have recently proved to be surprisingly effective and have been key to successes with the Atari video game system and playing Go. Having multiple robots allows us to experiment with sharing experiences to speed up this kind of direct learning in the real world.

In these experiments we tasked robots with trying to move their arms to goal locations, or reaching to and opening a door. Each robot has a copy of a neural network that allows it to estimate the value of taking a given action in a given state. By querying this network, the robot can quickly decide what actions might be worth taking in the world. When a robot acts, we add noise to the actions it selects, so the resulting behavior is sometimes a bit better than previously observed, and sometimes a bit worse. This allows each robot to explore different ways of approaching a task. Records of the actions taken by the robots, their behaviors, and the final outcomes are sent back to a central server. The server collects the experiences from all of the robots and uses them to iteratively improve the neural network that estimates value for different states and actions. The model-free algorithms we employed look across both good and bad experiences and distill these into a new network that is better at understanding how action and success are related. Then, at regular intervals, each robot takes a copy of the updated network from the server and begins to act using the information in its new network. Given that this updated network is a bit better at estimating the true value of actions in the world, the robots will produce better behavior. This cycle can then be repeated to continue improving on the task. In the video below, a robot explores the door opening task.
With a few hours of practice, robots sharing their raw experience learn to make reaches to targets, and to open a door by making contact with the handle and pulling. In the case of door opening, the robots learn to deal with the complex physics of the contacts between the hook and the door handle without building an explicit model of the world, as can be seen in the example below:
Learning how the world works by interacting with objects.
Direct trial-and-error reinforcement learning is a great way to learn individual skills. However, humans and animals don’t learn exclusively by trial and error. We also build mental models about our environment and imagine how the world might change in response to our actions.

We can start with the simplest of physical interactions, and have our robots learn the basics of cause and effect from reflecting on their own experiences. In this experiment, we had the robots play with a wide variety of common household objects by randomly prodding and pushing them inside a tabletop bin. The robots again shared their experiences with each other and together built a single predictive model that attempted to forecast what the world might look like in response to their actions. This predictive model can make simple, if slightly blurry, forecasts about future camera images when provided with the current image and a possible sequence of actions that the robot might execute:
Top row: robotic arms interacting with common household items.
Bottom row: Predicted future camera images given an initial image and a sequence of actions.
Once this model is trained, the robots can use it to perform purposeful manipulations, for example based on user commands. In our prototype, a user can command the robot to move a particular object simply by clicking on that object, and then clicking on the point where the object should go:
The robots in this experiment were not told anything about objects or physics: they only see that the command requires a particular pixel to be moved to a particular place. However, because they have seen so many object interactions in their shared past experiences, they can forecast how particular actions will affect particular pixels. In order for such an implicit understanding of physics to emerge, the robots must be provided with a sufficient breadth of experience. This requires either a lot of time, or sharing the combined experiences of many robots. An extended video on this project may be found here.
Learning with the help of humans.
So far, we discussed how robots can learn entirely on their own. However, human guidance is important, not just for telling the robot what to do, but also for helping the robots along. We have a lot of intuition about how various manipulation skills can be performed, and it only seems natural that transferring this intuition to robots can help them learn these skills a lot faster. In the next experiment, we provided each robot with a different door, and guided each of them by hand to show how these doors can be opened. These demonstrations are encoded into a single combined strategy for all robots, called a policy. The policy is a deep neural network which converts camera images to robot actions, and is maintained on a central server. The following video shows the instructor demonstrating the door-opening skill to a robot:
Next, the robots collectively improve this policy through a trial-and-error learning process. Each robot attempts to open its own door using the latest available policy, with some added noise for exploration. These attempts allow each robot to plan a better strategy for opening the door the next time around, and improve the policy accordingly:
Not surprisingly, we find that robots learn more effectively if they are trained on a curriculum of tasks that are gradually increasing in difficulty. In our experiment, each robot starts off by practicing the door-opening skill on a specific position and orientation of the door that the instructor had previously shown it. As it gets better at performing the task, the instructor starts to alter the position and orientation of the door to be just a bit beyond the current capabilities of the policy, but not so difficult that it fails entirely. This allows the robots to gradually increase their skill level over time, and expands the range of situations they can handle. The combination of human-guidance with trial-and-error learning allowed the robots to collectively learn the skill of door-opening in just a couple of hours. Since the robots were trained on doors that look different from each other, the final policy succeeds on a door with a handle that none of the robots had seen before:
In all three of the experiments described above, the ability to communicate and exchange their experiences allows the robots to learn more quickly and effectively. This becomes particularly important when we combine robotic learning with deep learning, as is the case in all of the experiments discussed above. We’ve seen before that deep learning works best when provided with ample training data. For example, the popular ImageNet benchmark uses over 1.5 million labeled examples. While such a quantity of data is not impossible for a single robot to gather over a few years, it is much more efficient to gather the same volume of experience from multiple robots over the course of a few weeks. Besides faster learning times, this approach might benefit from the greater diversity of experience: a real-world deployment might involve multiple robots in different places and different settings, sharing heterogeneous, varied experiences to build a single highly generalizable representation.

Of course, the kinds of behaviors that robots today can learn are still quite limited. Even basic motion skills, such as picking up objects and opening doors, remain in the realm of cutting edge research. In all of these experiments, a human engineer is still needed to tell the robots what they should learn to do by specifying a detailed objective function. However, as algorithms improve and robots are deployed more widely, their ability to share and pool their experiences could be instrumental for enabling them to assist us in our daily lives.

The experiments on learning by trial-and-error were conducted by Shixiang (Shane) Gu and Ethan Holly from the Google Brain team, and Timothy Lillicrap from DeepMind. Work on learning predictive models was conducted by Chelsea Finn from the Google Brain team, and the research on learning from demonstration was conducted by Yevgen Chebotar, Ali Yahya, Adrian Li, and Mrinal Kalakrishnan from X. We would also like to acknowledge contributions by Peter Pastor, Gabriel Dulac-Arnold, and Jon Scholz. Articles about each of the experiments discussed in this blog post can be found below:

Deep Reinforcement Learning for Robotic Manipulation. Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine. [video]

Deep Visual Foresight for Planning Robot Motion. Chelsea Finn, Sergey Levine. [video] [data]

Collective Robot Reinforcement Learning with Distributed Asynchronous Guided Policy Search.
Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, Sergey Levine.  [video]

Path Integral Guided Policy Search. Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, Sergey Levine. [video]