Tag Archives: reinforcement learning

PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

Evolution strategy (ES) is a family of optimization techniques inspired by the ideas of natural selection: a population of candidate solutions are usually evolved over generations to better adapt to an optimization objective. ES has been applied to a variety of challenging decision making problems, such as legged locomotion, quadcopter control, and even power system control.

Compared to gradient-based reinforcement learning (RL) methods like proximal policy optimization (PPO) and soft actor-critic (SAC), ES has several advantages. First, ES directly explores in the space of controller parameters, while gradient-based methods often explore within a limited action space, which indirectly influences the controller parameters. More direct exploration has been shown to boost learning performance and enable large scale data collection with parallel computation. Second, a major challenge in RL is long-horizon credit assignment, e.g., when a robot accomplishes a task in the end, determining which actions it performed in the past were the most critical and should be assigned a greater reward. Since ES directly considers the total reward, it relieves researchers from needing to explicitly handle credit assignment. In addition, because ES does not rely on gradient information, it can naturally handle highly non-smooth objectives or controller architectures where gradient computation is non-trivial, such as meta–reinforcement learning. However, a major weakness of ES-based algorithms is their difficulty in scaling to problems that require high-dimensional sensory inputs to encode the environment dynamics, such as training robots with complex vision inputs.

In this work, we propose “PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations”, a learning algorithm that combines representation learning and ES to effectively solve high dimensional problems in a scalable way. The core idea is to leverage predictive information, a representation learning objective, to obtain a compact representation of the high-dimensional environment dynamics, and then apply Augmented Random Search (ARS), a popular ES algorithm, to transform the learned compact representation into robot actions. We tested PI-ARS on the challenging problem of visual-locomotion for legged robots. PI-ARS enables fast training of performant vision-based locomotion controllers that can traverse a variety of difficult environments. Furthermore, the controllers trained in simulated environments successfully transfer to a real quadruped robot.

PI-ARS trains reliable visual-locomotion policies that are transferable to the real world.

Predictive Information
A good representation for policy learning should be both compressive, so that ES can focus on solving a much lower dimensional problem than learning from raw observations would entail, and task-critical, so the learned controller has all the necessary information needed to learn the optimal behavior. For robotic control problems with high-dimensional input space, it is critical for the policy to understand the environment, including the dynamic information of both the robot itself and its surrounding objects.

As such, we propose an observation encoder that preserves information from the raw input observations that allows the policy to predict the future states of the environment, thus the name predictive information (PI). More specifically, we optimize the encoder such that the encoded version of what the robot has seen and planned in the past can accurately predict what the robot might see and be rewarded in the future. One mathematical tool to describe such a property is that of mutual information, which measures the amount of information we obtain about one random variable X by observing another random variable Y. In our case, X and Y would be what the robot saw and planned in the past, and what the robot sees and is rewarded in the future. Directly optimizing the mutual information objective is a challenging problem because we usually only have access to samples of the random variables, but not their underlying distributions. In this work we follow a previous approach that uses InfoNCE, a contrastive variational bound on mutual information to optimize the objective.

Left: We use representation learning to encode PI of the environment. Right: We train the representation by replaying trajectories from the replay buffer and maximize the predictability between the observation and motion plan in the past and the observation and reward in the future of the trajectory.

Predictive Information with Augmented Random Search
Next, we combine PI with Augmented Random Search (ARS), an algorithm that has shown excellent optimization performance for challenging decision-making tasks. At each iteration of ARS, it samples a population of perturbed controller parameters, evaluates their performance in the testing environment, and then computes a gradient that moves the controller towards the ones that performed better.

We use the learned compact representation from PI to connect PI and ARS, which we call PI-ARS. More specifically, ARS optimizes a controller that takes as input the learned compact representation PI and predicts appropriate robot commands to achieve the task. By optimizing a controller with smaller input space, it allows ARS to find the optimal solution more efficiently. Meanwhile, we use the data collected during ARS optimization to further improve the learned representation, which is then fed into the ARS controller in the next iteration.

An overview of the PI-ARS data flow. Our algorithm interleaves between two steps: 1) optimizing the PI objective that updates the policy, which is the weights for the neural network that extracts the learned representation; and 2) sampling new trajectories and updating the controller parameters using ARS.

Visual-Locomotion for Legged Robots
We evaluate PI-ARS on the problem of visual-locomotion for legged robots. We chose this problem for two reasons: visual-locomotion is a key bottleneck for legged robots to be applied in real-world applications, and the high-dimensional vision-input to the policy and the complex dynamics in legged robots make it an ideal test-case to demonstrate the effectiveness of the PI-ARS algorithm. A demonstration of our task setup in simulation can be seen below. Policies are first trained in simulated environments, and then transferred to hardware.

An illustration of the visual-locomotion task setup. The robot is equipped with two cameras to observe the environment (illustrated by the transparent pyramids). The observations and robot state are sent to the policy to generate a high-level motion plan, such as feet landing location and desired moving speed. The high-level motion plan is then achieved by a low-level Motion Predictive Control (MPC) controller.

Experiment Results
We first evaluate the PI-ARS algorithm on four challenging simulated tasks:

  • Uneven stepping stones: The robot needs to walk over uneven terrain while avoiding gaps.
  • Quincuncial piles: The robot needs to avoid gaps both in front and sideways.
  • Moving platforms: The robot needs to walk over stepping stones that are randomly moving horizontally or vertically. This task illustrates the flexibility of learning a vision-based policy in comparison to explicitly reconstructing the environment.
  • Indoor navigation: The robot needs to navigate to a random location while avoiding obstacles in an indoor environment.

As shown below, PI-ARS is able to significantly outperform ARS in all four tasks in terms of the total task reward it can obtain (by 30-50%).

Left: Visualization of PI-ARS policy performance in simulation. Right: Total task reward (i.e., episode return) for PI-ARS (green line) and ARS (red line). The PI-ARS algorithm significantly outperforms ARS on four challenging visual-locomotion tasks.

We further deploy the trained policies to a real Laikago robot on two tasks: random stepping stone and indoor navigation. We demonstrate that our trained policies can successfully handle real-world tasks. Notably, the success rate of the random stepping stone task improved from 40% in the prior work to 100%.

PI-ARS trained policy enables a real Laikago robot to navigate around obstacles.

Conclusion
In this work, we present a new learning algorithm, PI-ARS, that combines gradient-based representation learning with gradient-free evolutionary strategy algorithms to leverage the advantages of both. PI-ARS enjoys the effectiveness, simplicity, and parallelizability of gradient-free algorithms, while relieving a key bottleneck of ES algorithms on handling high-dimensional problems by optimizing a low-dimensional representation. We apply PI-ARS to a set of challenging visual-locomotion tasks, among which PI-ARS significantly outperforms the state of the art. Furthermore, we validate the policy learned by PI-ARS on a real quadruped robot. It enables the robot to walk over randomly-placed stepping stones and navigate in an indoor space with obstacles. Our method opens the possibility of incorporating modern large neural network models and large-scale data into the field of evolutionary strategy for robotics control.

Acknowledgements
We would like to thank our paper co-authors: Ofir Nachum, Tingnan Zhang, Sergio Guadarrama, and Jie Tan. We would also like to thank Ian Fischer and John Canny for valuable feedback.

Source: Google AI Blog


Table Tennis: A Research Platform for Agile Robotics

Robot learning has been applied to a wide range of challenging real world tasks, including dexterous manipulation, legged locomotion, and grasping. It is less common to see robot learning applied to dynamic, high-acceleration tasks requiring tight-loop human-robot interactions, such as table tennis. There are two complementary properties of the table tennis task that make it interesting for robotic learning research. First, the task requires both speed and precision, which puts significant demands on a learning algorithm. At the same time, the problem is highly-structured (with a fixed, predictable environment) and naturally multi-agent (the robot can play with humans or another robot), making it a desirable testbed to investigate questions about human-robot interaction and reinforcement learning. These properties have led to several research groups developing table tennis research platforms [1, 2, 3, 4].

The Robotics team at Google has built such a platform to study problems that arise from robotic learning in a multi-player, dynamic and interactive setting. In the rest of this post we introduce two projects, Iterative-Sim2Real (to be presented at CoRL 2022) and GoalsEye (IROS 2022), which illustrate the problems we have been investigating so far. Iterative-Sim2Real enables a robot to hold rallies of over 300 hits with a human player, while GoalsEye enables learning goal-conditioned policies that match the precision of amateur humans.

Iterative-Sim2Real policies playing cooperatively with humans (top) and a GoalsEye policy returning balls to different locations (bottom).

Iterative-Sim2Real: Leveraging a Simulator to Play Cooperatively with Humans
In this project, the goal for the robot is cooperative in nature: to carry out a rally with a human for as long as possible. Since it would be tedious and time-consuming to train directly against a human player in the real world, we adopt a simulation-based (i.e., sim-to-real) approach. However, because it is difficult to simulate human behavior accurately, applying sim-to-real learning to tasks that require tight, close-loop interaction with a human participant is difficult.

In Iterative-Sim2Real, (i.e., i-S2R), we present a method for learning human behavior models for human-robot interaction tasks, and instantiate it on our robotic table tennis platform. We have built a system that can achieve rallies of up to 340 hits with an amateur human player (shown below).

A 340-hit rally lasting over 4 minutes.

Learning Human Behavior Models: a Chicken and Egg Problem
The central problem in learning accurate human behavior models for robotics is the following: if we do not have a good-enough robot policy to begin with, then we cannot collect high-quality data on how a person might interact with the robot. But without a human behavior model, we cannot obtain robot policies in the first place. An alternative would be to train a robot policy directly in the real world, but this is often slow, cost-prohibitive, and poses safety-related challenges, which are further exacerbated when people are involved. i-S2R, visualized below, is a solution to this chicken and egg problem. It uses a simple model of human behavior as an approximate starting point and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

i-S2R Methodology.

Results
To evaluate i-S2R, we repeated the training process five times with five different human opponents and compared it with a baseline approach of ordinary sim-to-real plus fine-tuning (S2R+FT). When aggregated across all players, the i-S2R rally length is higher than S2R+FT by about 9% (below on the left). The histogram of rally lengths for i-S2R and S2R+FT (below on the right) shows that a large fraction of the rallies for S2R+FT are shorter (i.e., less than 5), while i-S2R achieves longer rallies more frequently.

Summary of i-S2R results. Boxplot details: The white circle is the mean, the horizontal line is the median, box bounds are the 25th and 75th percentiles.

We also break down the results based on player type: beginner (40% players), intermediate (40% of players) and advanced (20% players). We see that i-S2R significantly outperforms S2R+FT for both beginner and intermediate players (80% of players).

i-S2R Results by player type.

More details on i-S2R can be found on our preprint, website, and also in the following summary video.

GoalsEye: Learning to Return Balls Precisely on a Physical Robot
While we focused on sim-to-real learning in i-S2R, it is sometimes desirable to learn using only real-world data — closing the sim-to-real gap in this case is unnecessary. Imitation learning (IL) provides a simple and stable approach to learning in the real world, but it requires access to demonstrations and cannot exceed the performance of the teacher. Collecting expert human demonstrations of precise goal-targeting in high speed settings is challenging and sometimes impossible (due to limited precision in human movements). While reinforcement learning (RL) is well-suited to such high-speed, high-precision tasks, it faces a difficult exploration problem (especially at the start), and can be very sample inefficient. In GoalsEye, we demonstrate an approach that combines recent behavior cloning techniques [5, 6] to learn a precise goal-targeting policy, starting from a small, weakly-structured, non-targeting dataset.

Here we consider a different table tennis task with an emphasis on precision. We want the robot to return the ball to an arbitrary goal location on the table, e.g. “hit the back left corner" or ''land the ball just over the net on the right side" (see left video below). Further, we wanted to find a method that can be applied directly on our real world table tennis environment with no simulation involved. We found that the synthesis of two existing imitation learning techniques, Learning from Play (LFP) and Goal-Conditioned Supervised Learning (GCSL), scales to this setting. It is safe and sample efficient enough to train a policy on a physical robot which is as accurate as amateur humans at the task of returning balls to specific goals on the table.

 
GoalsEye policy aiming at a 20cm diameter goal (left). Human player aiming at the same goal (right).

The essential ingredients of success are:

  1. A minimal, but non-goal-directed “bootstrap” dataset of the robot hitting the ball to overcome an initial difficult exploration problem.
  2. Hindsight relabeled goal conditioned behavioral cloning (GCBC) to train a goal-directed policy to reach any goal in the dataset.
  3. Iterative self-supervised goal reaching. The agent improves continuously by setting random goals and attempting to reach them using the current policy. All attempts are relabeled and added into a continuously expanding training set. This self-practice, in which the robot expands the training data by setting and attempting to reach goals, is repeated iteratively.
GoalsEye methodology.

Demonstrations and Self-Improvement Through Practice Are Key
The synthesis of techniques is crucial. The policy’s objective is to return a variety of incoming balls to any location on the opponent’s side of the table. A policy trained on the initial 2,480 demonstrations only accurately reaches within 30 cm of the goal 9% of the time. However, after a policy has self-practiced for ~13,500 attempts, goal-reaching accuracy rises to 43% (below on the right). This improvement is clearly visible as shown in the videos below. Yet if a policy only self-practices, training fails completely in this setting. Interestingly, the number of demonstrations improves the efficiency of subsequent self-practice, albeit with diminishing returns. This indicates that demonstration data and self-practice could be substituted depending on the relative time and cost to gather demonstration data compared with self-practice.

Self-practice substantially improves accuracy. Left: simulated training. Right: real robot training. The demonstration datasets contain ~2,500 episodes, both in simulation and the real world.
 
Visualizing the benefits of self-practice. Left: policy trained on initial 2,480 demonstrations. Right: policy after an additional 13,500 self-practice attempts.

More details on GoalsEye can be found in the preprint and on our website.

Conclusion and Future Work
We have presented two complementary projects using our robotic table tennis research platform. i-S2R learns RL policies that are able to interact with humans, while GoalsEye demonstrates that learning from real-world unstructured data combined with self-supervised practice is effective for learning goal-conditioned policies in a precise, dynamic setting.

One interesting research direction to pursue on the table tennis platform would be to build a robot “coach” that could adapt its play style according to the skill level of the human participant to keep things challenging and exciting.

Acknowledgements
We thank our co-authors, Saminda Abeyruwan, Alex Bewley, Krzysztof Choromanski, David B. D’Ambrosio, Tianli Ding, Deepali Jain, Corey Lynch, Pannag R. Sanketi, Pierre Sermanet and Anish Shankar. We are also grateful for the support of many members of the Robotics Team who are listed in the acknowledgement sections of the papers.

Source: Google AI Blog


Quantization for Fast and Environmentally Sustainable Reinforcement Learning

Deep reinforcement learning (RL) continues to make great strides in solving real-world sequential decision-making problems such as balloon navigation, nuclear physics, robotics, and games. Despite its promise, one of its limiting factors is long training times. While the current approach to speed up RL training on complex and difficult tasks leverages distributed training scaling up to hundreds or even thousands of computing nodes, it still requires the use of significant hardware resources which makes RL training expensive, while increasing its environmental impact. However, recent work [1, 2] indicates that performance optimizations on existing hardware can reduce the carbon footprint (i.e., total greenhouse gas emissions) of training and inference.

RL can also benefit from similar system optimization techniques that can reduce training time, improve hardware utilization and reduce carbon dioxide (CO2) emissions. One such technique is quantization, a process that converts full-precision floating point (FP32) numbers to lower precision (int8) numbers and then performs computation using the lower precision numbers. Quantization can save memory storage cost and bandwidth for faster and more energy-efficient computation. Quantization has been successfully applied to supervised learning to enable edge deployments of machine learning (ML) models and achieve faster training. However, there remains an opportunity to apply quantization to RL training.

To that end, we present “QuaRL: Quantization for Fast and Environmentally SustainableReinforcement Learning”, published in the Transactions of Machine Learning Research journal, which introduces a new paradigm called ActorQ that applies quantization to speed up RL training by 1.5-5.4x while maintaining performance. Additionally, we demonstrate that compared to training in full-precision, the carbon footprint is also significantly reduced by a factor of 1.9-3.8x.

Applying Quantization to RL Training
In traditional RL training, a learner policy is applied to an actor, which uses the policy to explore the environment and collect data samples. The samples collected by the actor are then used by the learner to continuously refine the initial policy. Periodically, the policy trained on the learner side is used to update the actor's policy. To apply quantization to RL training, we develop the ActorQ paradigm. ActorQ performs the same sequence described above, with one key difference being that the policy update from learner to actors is quantized, and the actor explores the environment using the int8 quantized policy to collect samples.

Applying quantization to RL training in this fashion has two key benefits. First, it reduces the memory footprint of the policy. For the same peak bandwidth, less data is transferred between learners and actors, which reduces the communication cost for policy updates from learners to actors. Second, the actors perform inference on the quantized policy to generate actions for a given environment state. The quantized inference process is much faster when compared to performing inference in full precision.

An overview of traditional RL training (left) and ActorQ RL training (right).

In ActorQ, we use the ACME distributed RL framework. The quantizer block performs uniform quantization that converts the FP32 policy to int8. The actor performs inference using optimized int8 computations. Though we use uniform quantization when designing the quantizer block, we believe that other quantization techniques can replace uniform quantization and produce similar results. The samples collected by the actors are used by the learner to train a neural network policy. Periodically the learned policy is quantized by the quantizer block and broadcasted to the actors.

Quantization Improves RL Training Time and Performance
We evaluate ActorQ in a range of environments, including the Deepmind Control Suite and the OpenAI Gym. We demonstrate the speed-up and improved performance of D4PG and DQN. We chose D4PG as it was the best learning algorithm in ACME for Deepmind Control Suite tasks, and DQN is a widely used and standard RL algorithm.

We observe a significant speedup (between 1.5x and 5.41x) in training RL policies. More importantly, performance is maintained even when actors perform int8 quantized inference. The figures below demonstrate this for the D4PG and DQN agents for Deepmind Control Suite and OpenAI Gym tasks.

A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for D4PG agents on various Deepmind Control Suite tasks. Quantization achieves speed-ups of 1.5x to 3.06x.
A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for DQN agents in the OpenAI Gym environment. Quantization achieves a speed-up of 2.2x to 5.41x.

Quantization Reduces Carbon Emission
Applying quantization in RL using ActorQ improves training time without affecting performance. The direct consequence of using the hardware more efficiently is a smaller carbon footprint. We measure the carbon footprint improvement by taking the ratio of carbon emission when using the FP32 policy during training over the carbon emission when using the int8 policy during training.

In order to measure the carbon emission for the RL training experiment, we use the experiment-impact-tracker proposed in prior work. We instrument the ActorQ system with carbon monitor APIs to measure the energy and carbon emissions for each training experiment.

Compared to the carbon emission when running in full precision (FP32), we observe that the quantization of policies reduces the carbon emissions anywhere from 1.9x to 3.76x, depending on the task. As RL systems are scaled to run on thousands of distributed hardware cores and accelerators, we believe that the absolute carbon reduction (measured in kilograms of CO2) can be quite significant.

Carbon emission comparison between training using a FP32 policy and an int8 policy. The X-axis scale is normalized to the carbon emissions of the FP32 policy. Shown by the red bars greater than 1, ActorQ reduces carbon emissions.

Conclusion and Future Directions
We introduce ActorQ, a novel paradigm that applies quantization to RL training and achieves speed-up improvements of 1.5-5.4x while maintaining performance. Additionally, we demonstrate that ActorQ can reduce RL training’s carbon footprint by a factor of 1.9-3.8x compared to training in full-precision without quantization.

ActorQ demonstrates that quantization can be effectively applied to many aspects of RL, from obtaining high-quality and efficient quantized policies to reducing training times and carbon emissions. As RL continues to make great strides in solving real-world problems, we believe that making RL training sustainable will be critical for adoption. As we scale RL training to thousands of cores and GPUs, even a 50% improvement (as we have experimentally demonstrated) will generate significant savings in absolute dollar cost, energy, and carbon emissions. Our work is the first step toward applying quantization to RL training to achieve efficient and environmentally sustainable training.

While our design of the quantizer in ActorQ relied on simple uniform quantization, we believe that other forms of quantization, compression and sparsity can be applied (e.g., distillation, sparsification, etc.). We hope that future work will consider applying more aggressive quantization and compression methods, which may yield additional benefits to the performance and accuracy tradeoff obtained by the trained RL policies.

Acknowledgments
We would like to thank our co-authors Max Lam, Sharad Chitlangia, Zishen Wan, and Vijay Janapa Reddi (Harvard University), and Gabriel Barth-Maron (DeepMind), for their contribution to this work. We also thank the Google Cloud team for providing research credits to seed this work.

Source: Google AI Blog


Training Generalist Agents with Multi-Game Decision Transformers

Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision-making on various individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made to extend these results to generalist agents that would not only be capable of performing many different tasks, but also upon a variety of environments with potentially distinct embodiments.

Looking across recent progress in the fields of natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in making general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder, can a similar strategy be used in building generalist agents for sequential decision making? Can such models also enable fast adaptation to new tasks, similar to PaLM and Flamingo?

As an initial step to answer these questions, in our recent paper “Multi-Game Decision Transformers” we explore how to build a generalist agent to play many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can also be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives to learning multi-game agents, such as temporal difference (TD) learning or behavioral cloning (BC).

A Multi-Game Decision Transformer (MGDT) can play multiple games at desired level of competency from training on a range of trajectories spanning all levels of expertise.

Don’t Optimize for Return, Just Ask for Optimality
In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to cumulative rewards in a course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc) are trained to optimize decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider the interactions that happened in the past) and decides what action to take to help itself achieve a higher return magnitude in future interactions.

In this work, we use Decision Transformers as our backbone approach to training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and the surrounding environment, and (most importantly) a desired return to be achieved in future interactions. Instead of learning a policy to achieve high return magnitude as in traditional reinforcement learning, Decision Transformers map diverse experiences, ranging from expert-level to beginner-level, to their corresponding return magnitude during training. The idea is that training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider range of variations in gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. So during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.

But, how do you know if a return is both optimal and stably achievable in a given environment? Previous applications of Decision Transformers relied on customized definitions of the desired return for each individual task, which required manually defining a plausible and informative range of scalar values that are appropriately interpretable signals for each specific game — a task that is non-trivial and rather unscalable. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions that are associated with higher returns.

To more comprehensively capture spatial-temporal patterns of agent-environment interactions, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps model game specific information in further detail.

These pieces together give us the backbone of Multi-Game Decision Transformers:

Each observation image is divided into a set of M patches of pixels which are denoted O. Return R, action a, and reward r follows these image patches in each input casual sequence. A Decision Transformer is trained to predict the next input (except for the image patches) to establish causality.

Training a Multi-Game Decision Transformer to Play 41 Games at Once
We train one Decision Transformer agent on a large (~1B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), clearly outperforms existing reinforcement learning and behavioral cloning methods — by almost 2 times — on learning to play 41 games simultaneously and performs near human-level competency (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing across training methods in both settings where a policy must be learned from static datasets (offline) as well as those where new data can be gathered from interacting with the environment (online).

Each bar is a combined score across 41 games, where 100% indicates human-level performance. Each blue bar is from a model trained on 41 games simultaneously, whereas each gray bar is from 41 specialist agents. Multi-Game Decision Transformer achieves human-level performance, significantly better than other multi-game agents, even comparable to specialist agents.

This result indicates that Decision Transformers are well-suited for multi-task, multi-environment, and multi-embodiment agents.

A concurrent work, “A Generalist Agent”, shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. In addition, their work and our work have nicely complementary findings: They show it’s possible to train across a wide range of environments beyond Atari games, while we show it’s possible and useful to train across a wide range of experiences.

In addition to the performance shown above, empirically we found that MGDT trained on a wide variety of experience is better than MDGT trained only on expert-level demonstrations or simply cloning demonstration behaviors.

Scaling Up Multi-Game Model Size to Achieve Better Performance
Argurably, scale has become the main driving force in many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observation on Multi-Game Decision Transformers is similar: the performance increases predictably with larger model size. In particular, its performance appears to have not yet hit a ceiling, and compared to other learning systems performance gains are more significant with increases in model size.

Performance of Multi-Game Decision Transformer (shown by the blue line) increases predictably with larger model size, whereas other models do not.

Pre-trained Multi-Game Decision Transformers Are Fast Learners
Another benefit of MGDTs is that they can learn how to play a new game from very few gameplay demonstrations (which don’t need to all be expert-level). In that sense, MGDTs can be considered pre-trained models capable of being fine-tuned rapidly on small new gameplay data. Compared with other popular pre-training methods, it clearly shows consistent advantages in obtaining higher scores.

Multi-Game Decision Transformer pre-training (DT pre-training, shown in light blue) demonstrates consistent advantages over other popular models in adaptation to new tasks.

Where Is the Agent Looking?
In addition to the quantitative evaluation, it’s insightful (and fun) to visualize the agent’s behavior. By probing the attention heads, we find that the MGDT model consistently places weight in its field of view to areas of the observed images that contain meaningful game entities. We visualize the model’s attention when predicting the next action for various games and find it consistently attends to entities such as the agent’s on screen avatar, agent’s free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as expecting and/or planning over future unknowns (e.g., negative space). This diverse allocation of attention to many key components of each environment ultimately improves performance.

Here we can see the amount of weight the model places on each key asset of the game scene. Brighter red indicates more emphasis on that patch of pixels.

The Future of Large-Scale Generalist Agents
This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential with further scaling. These findings seem to point to a generalization narrative similar to other domains like vision and language — we look forward to exploring the great potential of scaling data and learning from diverse experiences.

We look forward to future research towards developing performant agents for multi-environment and multi-embodiment settings. Our code and model checkpoints can soon be accessed here.

Acknowledgements
We’d like to thank all remaining authors of the paper including Igor Mordatch, Ofir Nachum Menjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski.

Source: Google AI Blog


MLGO: A Machine Learning Framework for Compiler Optimization

The question of how to compile faster and smaller code arose together with the birth of modem computers. Better code optimization can significantly reduce the operational cost of large datacenter applications. The size of compiled code matters the most to mobile and embedded systems or software deployed on secure boot partitions, where the compiled binary must fit in tight code size budgets. With advances in the field, the headroom has been heavily squeezed with increasingly complicated heuristics, impeding maintenance and further improvements.

Recent research has shown that machine learning (ML) can unlock more opportunities in compiler optimization by replacing complicated heuristics with ML policies. However, adopting ML in general-purpose, industry-strength compilers remains a challenge.

To address this, we introduce “MLGO: a Machine Learning Guided Compiler Optimizations Framework”, the first industrial-grade general framework for integrating ML techniques systematically in LLVM (an open-source industrial compiler infrastructure that is ubiquitous for building mission-critical, high-performance software). MLGO uses reinforcement learning (RL) to train neural networks to make decisions that can replace heuristics in LLVM. We describe two MLGO optimizations for LLVM: 1) reducing code size with inlining; and 2) improving code performance with register allocation (regalloc). Both optimizations are available in the LLVM repository, and have been deployed in production.

How Does MLGO Work? With Inlining-for-Size As a Case Study
Inlining helps reduce code size by making decisions that enable the removal of redundant code. In the example below, the caller function foo() calls the callee function bar(), which itself calls baz(). Inlining both callsites returns a simple foo() function that reduces the code size.

Inlining reduces code size by removing redundant code.

In real code, there are thousands of functions calling each other, and thus comprise a call graph. During the inlining phase, the compiler traverses over the call graph on all caller-callee pairs, and makes decisions on whether to inline a caller-callee pair or not. It is a sequential decision process as previous inlining decisions will alter the call graph, affecting later decisions and the final result. In the example above, the call graph foo()bar()baz() needs a “yes” decision on both edges to make the code size reduction happen.

Before MLGO, the inline / no-inline decision was made by a heuristic that, over time, became increasingly difficult to improve. MLGO substitutes the heuristic with an ML model. During the call graph traversal, the compiler seeks advice from a neural network on whether to inline a particular caller-callee pair by feeding in relevant features (i.e., inputs) from the graph, and executes the decisions sequentially until the whole call graph is traversed.

Illustration of MLGO during inlining. “#bbs”, “#users”, and “callsite height” are example caller-callee pair features.

MLGO trains the decision network (policy) with RL using policy gradient and evolution strategies algorithms. While there is no ground truth about best decisions, online RL iterates between training and running compilation with the trained policy to collect data and improve the policy. In particular, given the current model under training, the compiler consults the model for inline / no-inline decision making during the inlining stage. After the compilation finishes, it produces a log of the sequential decision process (state, action, reward). The log is then passed to the trainer to update the model. This process repeats until we obtain a satisfactory model.

Compiler behavior during training. The compiler compiles the source code foo.cpp to an object file foo.o with a sequence of optimization passes, one of which is the inline pass.

The trained policy is then embedded into the compiler to provide inline / no-inline decisions during compilation. Unlike the training scenario, the policy does not produce a log. The TensorFlow model is embedded with XLA AOT, which converts the model into executable code. This avoids TensorFlow runtime dependency and overhead, minimizing the extra time and memory cost introduced by ML model inference at compilation time.

Compiler behavior in production.

We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction. In addition to the generalizability across software, generalizability across time is also important — both the software and compiler are under active development so the trained policy needs to retain good performance for a reasonable time. We evaluated the model’s performance on the same set of software three months later and found only slight degradation.

Inlining-for-size policy size reduction percentages. The x-axis presents different software and the y-axis represents the percentage size reduction. “Training” is the software on which the model was trained and “Infra[1|2|3]” are different internal software packages.

The MLGO inlining-for-size training has been deployed on Fuchsia — a general purpose open source operating system designed to power a diverse ecosystem of hardware and software, where binary size is critical. Here, MLGO showed a 6.3% size reduction for C++ translation units.

Register-Allocation (for performance)
As a general framework, we used MLGO to improve the register allocation pass, which improves the code performance in LLVM. Register Allocation solves the problem of assigning physical registers to live ranges (i.e., variables).

As the code executes, different live ranges are completed at different times, freeing up registers for use by subsequent processing stages. In the example below, each “add” and “multiply” instruction requires all operands and the result to be in physical registers. The live range x is allocated to the green register and is completed before either live ranges in the blue or yellow registers. After x is completed, the green register becomes available and is assigned to live range t.

Register allocation example.

When it's time to allocate live range q, there are no available registers, so the register allocation pass must decide which (if any) live range can be "evicted" from its register to make room for q. This is referred to as the “live range eviction” problem, and is the decision for which we train the model to replace original heuristics. In this particular example, it evicts z from the yellow register, and assigns it to q and the first half of z.

We now consider the unassigned second half of live range z. We have a conflict again, and this time the live range t is evicted and split, and the first half of t and the final part of z end up using the green register. The middle part of z corresponds to the instruction q = t * y, where z is not being used, so it is not assigned to any register and its value is stored in the stack from the yellow register, which later gets reloaded to the green register. The same happens to t. This adds extra load/store instructions to the code and degrades performance. The goal of the register allocation algorithm is to reduce such inefficiencies as much as possible. This is used as the reward to guide RL policy training.

Similar to the inlining-for-size policy, the register allocation (regalloc-for-performance) policy is trained on a large Google internal software package, and is generalizable across different software, with 0.3% ~1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications. The QPS improvement has persisted for months after its deployment, showing the model’s generalizability across the time horizon.

Conclusion and Future Work
We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler, LLVM. MLGO is a general framework that can be expanded to be: 1) deeper, e.g., adding more features, and applying better RL algorithms; and 2) broader, by applying it to more optimization heuristics beyond inlining and regalloc. We are enthusiastic about the possibilities MLGO can bring to the compiler optimization domain and look forward to its further adoption and to future contributions from the research community.

Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on github and a demo that uses policy gradient to train an inlining-for-size policy.

Acknowledgements
We’d like to thank MLGO’s contributors and collaborators Eugene Brevdo, Jacob Hegna, Gaurav Jain, David Li, Zinan Lin, Kshiteej Mahajan, Jack Morris, Girish Mururu, Jin Xin Ng, Robert Ormandi, Easwaran Raman, Ondrej Sykora, Maruf Zaber, Weiye Zhao. We would also like to thank Petr Hosek, Yuqian Li, Roland McGrath, Haowei Wu for trusting us and deploying MLGO in Fuchsia as MLGO’s very first customer; thank David Blaikie, Eric Christopher, Brooks Moses, Jordan Rupprecht for helping to deploy MLGO in Google internal large-scale datacenter applications; and thank Ed Chi, Tipp Moseley for their leadership support.

Source: Google AI Blog


Learning Locomotion Skills Safely in the Real World

The promise of deep reinforcement learning (RL) in solving complex, high-dimensional problems autonomously has attracted much interest in areas such as robotics, game playing, and self-driving cars. However, effectively training an RL policy requires exploring a large set of robot states and actions, including many that are not safe for the robot. This is a considerable risk, for example, when training a legged robot. Because such robots are inherently unstable, there is a high likelihood of the robot falling during learning, which could cause damage.

The risk of damage can be mitigated to some extent by learning the control policy in computer simulation and then deploying it in the real world. However, this approach usually requires addressing the difficult sim-to-real gap, i.e., the policy trained in simulation can not be readily deployed in the real world for various reasons, such as sensor noise in deployment or the simulator not being realistic enough during training. Another approach to solve this issue is to directly learn or fine-tune a control policy in the real world. But again, the main challenge is to assure safety during learning.

In “Safe Reinforcement Learning for Legged Locomotion”, we introduce a safe RL framework for learning legged locomotion while satisfying safety constraints during training. Our goal is to learn locomotion skills autonomously in the real world without the robot falling during the entire learning process. Our learning framework adopts a two-policy safe RL framework: a “safe recovery policy” that recovers robots from near-unsafe states, and a “learner policy” that is optimized to perform the desired control task. The safe learning framework switches between the safe recovery policy and the learner policy to enable robots to safely acquire novel and agile motor skills.

The Proposed Framework
Our goal is to ensure that during the entire learning process, the robot never falls, regardless of the learner policy being used. Similar to how a child learns to ride a bike, our approach teaches an agent a policy while using "training wheels", i.e., a safe recovery policy. We first define a set of states, which we call a “safety trigger set”, where the robot is close to violating safety constraints but can still be saved by a safe recovery policy. For example, the safety trigger set can be defined as a set of states with the height of the robots being below a certain threshold and the roll, pitch, yaw angles being too large, which is an indication of falls. When the learner policy results in the robot being within the safety trigger set (i.e., where it is likely to fall), we switch to the safe recovery policy, which drives the robot back to a safe state. We determine when to switch back to the learner policy by leveraging an approximate dynamics model of the robot to predict the future robot trajectory. For example, based on the position of the robot’s legs and the current angle of the robot based on sensors for roll, pitch, and yaw, is it likely to fall in the future? If the predicted future states are all safe, we hand the control back to the learner policy, otherwise, we keep using the safe recovery policy.

The state diagram of the proposed approach. (1) If the learner policy violates the safety constraint, we switch to the safe recovery policy. (2) If the learner policy cannot ensure safety in the near future after switching to the safe recovery policy, we keep using the safe recovery policy. This allows the robot to explore more while ensuring safety.

This approach ensures safety in complex systems without resorting to opaque neural networks that may be sensitive to distribution shifts in application. In addition, the learner policy is able to explore states that are near safety violations, which is useful for learning a robust policy.

Because we use “approximated” dynamics to predict the future trajectory, we also examine how much safer a robot would be if we used a much more accurate model for its dynamics. We provide a theoretical analysis of this problem and show that our approach can achieve minimal safety performance loss compared to one with a full knowledge about the system dynamics.

Legged Locomotion Tasks
To demonstrate the effectiveness of the algorithm, we consider learning three different legged locomotion skills:

  1. Efficient Gait: The robot learns how to walk with low energy consumption and is rewarded for consuming less energy.
  2. Catwalk: The robot learns a catwalk gait pattern, in which the left and right two feet are close to each other. This is challenging because by narrowing the support polygon, the robot becomes less stable.
  3. Two-leg Balance: The robot learns a two-leg balance policy, in which the front-right and rear-left feet are in stance, and the other two are lifted. The robot can easily fall without delicate balance control because the contact polygon degenerates into a line segment.
Locomotion tasks considered in the paper. Top: efficient gait. Middle: catwalk. Bottom: two-leg balance.

Implementation Details
We use a hierarchical policy framework that combines RL and a traditional control approach for the learner and safe recovery policies. This framework consists of a high-level RL policy, which produces gait parameters (e.g., stepping frequency) and feet placements, and pairs it with a low-level process controller called model predictive control (MPC) that takes in these parameters and computes the desired torque for each motor in the robot. Because we do not directly command the motors’ angles, this approach provides more stable operation, streamlines the policy training due to a smaller action space, and results in a more robust policy. The input of the RL policy network includes the previous gait parameters, the height of the robot, base orientation, linear, angular velocities, and feedback to indicate whether the robot is approaching the safety trigger set. We use the same setup for each task.

We train a safe recovery policy with a reward for reaching stability as soon as possible. Furthermore, we design the safety trigger set with inspiration from capturability theory. In particular, the initial safety trigger set is defined to ensure that the robot’s feet can not fall outside of the positions from which the robot can safely recover using the safe recovery policy. We then fine-tune this set on the real robot with a random policy to prevent the robot from falling.

Real-World Experiment Results
We report the real-world experimental results showing the reward learning curves and the percentage of safe recovery policy activations on the efficient gait, catwalk, and two-leg balance tasks. To ensure that the robot can learn to be safe, we add a penalty when triggering the safe recovery policy. Here, all the policies are trained from scratch, except for the two-leg balance task, which was pre-trained in simulation because it requires more training steps.

Overall, we see that on these tasks, the reward increases, and the percentage of uses of the safe recovery policy decreases over policy updates. For instance, the percentage of uses of the safe recovery policy decreases from 20% to near 0% in the efficient gait task. For the two-leg balance task, the percentage drops from near 82.5% to 67.5%, suggesting that the two-leg balance is substantially harder than the previous two tasks. Still, the policy does improve the reward. This observation implies that the learner can gradually learn the task while avoiding the need to trigger the safe recovery policy. In addition, this suggests that it is possible to design a safe trigger set and a safe recovery policy that does not impede the exploration of the policy as the performance increases.

The reward learning curve (blue) and the percentage of safe recovery policy activations (red) using our safe RL algorithm in the real world.

In addition, the following video shows the learning process for the two-leg balance task, including the interplay between the learner policy and the safe recovery policy, and the reset to the initial position when an episode ends. We can see that the robot tries to catch itself when falling by putting down the lifted legs (front left and rear right) outward, creating a support polygon. After the learning episode ends, the robot walks back to the reset position automatically. This allows us to train policy autonomously and safely without human supervision.

Early training stage.
Late training stage.
Without a safe recovery policy.

Finally, we show the clips of learned policies. First, in the catwalk task, the distance between two sides of the legs is 0.09m, which is 40.9% smaller than the nominal distance. Second, in the two-leg balance task, the robot can maintain balance by jumping up to four times via two legs, compared to one jump from the policy pre-trained from simulation.

Final learned two-leg balance.

Conclusion
We presented a safe RL framework and demonstrated how it can be used to train a robotic policy with no falls and without the need for a manual reset during the entire learning process for the efficient gait and catwalk tasks. This approach even enables training of a two-leg balance task with only four falls. The safe recovery policy is triggered only when needed, allowing the robot to more fully explore the environment. Our results suggest that learning legged locomotion skills autonomously and safely is possible in the real world, which could unlock new opportunities including offline dataset collection for robot learning.

No model is without limitation. We currently ignore the model uncertainty from the environment and non-linear dynamics in our theoretical analysis. Including these would further improve the generality of our approach. In addition, some hyper-parameters of the switching criteria are currently being heuristically tuned. It would be more efficient to automatically determine when to switch based on the learning progress. Furthermore, it would be interesting to extend this safe RL framework to other robot applications, such as robot manipulation. Finally, designing an appropriate reward when incorporating the safe recovery policy can impact learning performance. We use a penalty-based approach that obtained reasonable results in these experiments, but we plan to investigate this in future work to make further performance improvements.

Acknowledgements
We would like to thank our paper co-authors: Tingnan Zhang, Linda Luu, Sehoon Ha, Jie Tan, and Wenhao Yu. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Source: Google AI Blog


Extracting Skill-Centric State Abstractions from Value Functions

Advances in reinforcement learning (RL) for robotics have enabled robotic agents to perform increasingly complex tasks in challenging environments. Recent results show that robots can learn to fold clothes, dexterously manipulate a rubik’s cube, sort objects by color, navigate complex environments and walk on difficult, uneven terrain. But "short-horizon" tasks such as these, which require very little long-term planning and provide immediate failure feedback, are relatively easy to train compared to many tasks that may confront a robot in a real-world setting. Unfortunately, scaling such short-horizon skills to the abstract, long horizons of real-world tasks is difficult. For example, how would one train a robot capable of picking up objects to rearrange a room?

Hierarchical reinforcement learning (HRL), a popular way of solving this problem, has achieved some success in a variety of long-horizon RL tasks. HRL aims to solve such problems by reasoning over a bank of low-level skills, thus providing an abstraction for actions. However, the high-level planning problem can be further simplified by abstracting both states and actions. For example, consider a tabletop rearrangement task, where a robot is tasked with interacting with objects on a desk. Using recent advances in RL, imitation learning, and unsupervised skill discovery, it is possible to obtain a set of primitive manipulation skills such as opening or closing drawers, picking or placing objects, etc. However, even for the simple task of putting a block into the drawer, chaining these skills together is not straightforward. This may be attributed to a combination of (i) challenges with planning and reasoning over long horizons, and (ii) dealing with high dimensional observations while parsing the semantics and affordances of the scene, i.e., where and when the skill can be used.

In “Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning”, presented at ICLR 2022, we address the task of learning suitable state and action abstractions for long-range problems. We posit that a minimal, but complete, representation for a higher-level policy in HRL must depend on the capabilities of the skills available to it. We present a simple mechanism to obtain such a representation using skill value functions and show that such an approach improves long-horizon performance in both model-based and model-free RL and enables better zero-shot generalization.

Our method, VFS, can compose low-level primitives (left) to learn complex long-horizon behaviors (right).

Building a Value Function Space
The key insight motivating this work is that the abstract representation of actions and states is readily available from trained policies via their value functions. The notion of “value” in RL is intrinsically linked to affordances, in that the value of a state for skill reflects the probability of receiving a reward for successfully executing the skill. For any skill, its value function captures two key properties: 1) the preconditions and affordances of the scene, i.e., where and when the skill can be used, and 2) the outcome, which indicates whether the skill executed successfully when it was used.

Given a decision process with a finite set of k skills trained with sparse outcome rewards and their corresponding value functions, we construct an embedding space by stacking these skill value functions. This gives us an abstract representation that maps a state to a k-dimensional representation that we call the Value Function Space, or VFS for short. This representation captures functional information about the exhaustive set of interactions that the agent can have with the environment, and is thus a suitable state abstraction for downstream tasks.

Consider a toy example of the tabletop rearrangement setup discussed earlier, with the task of placing the blue object in the drawer. There are eight elementary actions in this environment. The bar plot on the right shows the values of each skill at any given time, and the graph at the bottom shows the evolution of these values over the course of the task.

Value functions corresponding to each skill (top-right; aggregated in bottom) capture functional information about the scene (top-left) and aid decision-making.

At the beginning, the values corresponding to the “Place on Counter” skill are high since the objects are already on the counter; likewise, the values corresponding to “Close Drawer” are high. Through the trajectory, when the robot picks up the blue cube, the corresponding skill value peaks. Similarly, the values corresponding to placing the objects in the drawer increase when the drawer is open and peak when the blue cube is placed inside it. All the functional information required to affect each transition and predict its outcome (success or failure) is captured by the VFS representation, and in principle, allows a high-level agent to reason over all the skills and chain them together — resulting in an effective representation of the observations.

Additionally, since VFS learns a skill-centric representation of the scene, it is robust to exogenous factors of variation, such as background distractors and appearances of task-irrelevant components of the scene. All configurations shown below are functionally equivalent — an open drawer with the blue cube in it, a red cube on the countertop, and an empty gripper — and can be interacted with identically, despite apparent differences.

The learned VFS representation can ignore task-irrelevant factors such as arm pose, distractor objects (green cube) and background appearance (brown desk).

Robotic Manipulation with VFS
This approach enables VFS to plan out complex robotic manipulation tasks. Take, for example, a simple model-based reinforcement learning (MBRL) algorithm that uses a simple one-step predictive model of the transition dynamics in value function space and randomly samples candidate skill sequences to select and execute the best one in a manner similar to the model-predictive control. Given a set of primitive pushing skills of the form “move Object A near Object B” and a high-level rearrangement task, we find that VFS can use MBRL to reliably find skill sequences that solve the high-level task.

A rollout of VFS performing a tabletop rearrangement task using a robotic arm. VFS can reason over a sequence of low-level primitives to achieve the desired goal configuration.

To better understand the attributes of the environment captured by VFS, we sample the VFS-encoded observations from a large number of independent trajectories in the robotic manipulation task and project them onto a two-dimensional axis using the t-SNE technique, which is useful for visualizing clusters in high-dimensional data. These t-SNE embeddings reveal interesting patterns identified and modeled by VFS. Looking at some of these clusters closely, we find that VFS can successfully capture information about the contents (objects) in the scene and affordances (e.g., a sponge can be manipulated when held by the robot’s gripper), while ignoring distractors like the relative positions of the objects on the table and the pose of the robotic arm. While these factors are certainly important to solve the task, the low-level primitives available to the robot abstract them away and hence, make them functionally irrelevant to the high-level controller.

Visualizing the 2D t-SNE projections of VFS embeddings show emergent clustering of equivalent configurations of the environment while ignoring task-irrelevant factors like arm pose.

Conclusions and Connections to Future Work
Value function spaces are representations built on value functions of underlying skills, enabling long-horizon reasoning and planning over skills. VFS is a compact representation that captures the affordances of the scene and task-relevant information while robustly ignoring distractors. Empirical experiments reveal that such a representation improves planning for model-based and model-free methods and enables zero-shot generalization. Going forward, this representation has the promise to continue improving along with the field of multitask reinforcement learning. The interpretability of VFS further enables integration into fields such as safe planning and grounding language models.

Acknowledgements
We thank our co-authors Sergey Levine, Ted Xiao, Alex Toshev, Peng Xu and Yao Lu for their contributions to the paper and feedback on this blog post. We also thank Tom Small for creating the informative visualizations used in this blog post.

Source: Google AI Blog


Efficiently Initializing Reinforcement Learning With Prior Policies

Reinforcement learning (RL) can be used to train a policy to perform a task via trial and error, but a major challenge in RL is learning policies from scratch in environments with hard exploration challenges. For example, consider the setting depicted in the door-binary-v0 environment from the adroit manipulation suite, where an RL agent must control a hand in 3D space to open a door placed in front of it.

An RL agent must control a hand in 3D space to open a door placed in front of it. The agent receives a reward signal only when the door is completely open.

Since the agent receives no intermediary rewards, it cannot measure how close it is to completing the task, and so must explore the space randomly until it eventually opens the door. Given how long the task takes and the precise control required, this is extremely unlikely.

For tasks like this, we can avoid exploring the state space randomly by using prior information. This prior information helps the agent understand which states of the environment are good, and should be further explored. We could use offline data (i.e., data collected by human demonstrators, scripted policies, or other RL agents) to train a policy, then use it to initialize a new RL policy. In the case where we use neural networks to represent the policies, this would involve copying the pre-trained policy’s neural network over to the new RL policy. This procedure makes the new RL policy behave like the pre-trained policy. However, naïvely initializing a new RL policy like this often works poorly, especially for value-based RL methods, as shown below.

A policy is pre-trained on the antmaze-large-diverse-v0 D4RL environment with offline data (negative steps correspond to pre-training). We then use the policy to initialize actor-critic fine-tuning (positive steps starting from step 0) with this pre-trained policy as the initial actor. The critic is initialized randomly. The actor’s performance immediately drops and does not recover, as the untrained critic provides a poor learning signal and causes the good initial policy to be forgotten.

With the above in mind, in “Jump-Start Reinforcement Learning” (JSRL), we introduce a meta-algorithm that can use a pre-existing policy of any form to initialize any type of RL algorithm. JSRL uses two policies to learn tasks: a guide-policy, and an exploration-policy. The exploration-policy is an RL policy that is trained online with new experience that the agent collects from the environment, and the guide-policy is a pre-existing policy of any form that is not updated during online training. In this work, we focus on scenarios where the guide-policy is learned from demonstrations, but many other kinds of guide-policies can be used. JSRL creates a learning curriculum by rolling in the guide-policy, which is then followed by the self-improving exploration-policy, resulting in performance that compares to or improves on competitive IL+RL methods.

The JSRL Approach
The guide-policy can take any form: it could be a scripted policy, a policy trained with RL, or even a live human demonstrator. The only requirements are that the guide-policy is reasonable (i.e., better than random exploration), and it can select actions based on observations of the environment. Ideally, the guide-policy can reach poor or medium performance in the environment, but cannot further improve itself with additional fine-tuning. JSRL then allows us to leverage the progress of this guide-policy to take the performance even higher.

At the beginning of training, we roll out the guide-policy for a fixed number of steps so that the agent is closer to goal states. The exploration-policy then takes over and continues acting in the environment to reach these goals. As the performance of the exploration-policy improves, we gradually reduce the number of steps that the guide-policy takes, until the exploration-policy takes over completely. This process creates a curriculum of starting states for the exploration-policy such that in each curriculum stage, it only needs to learn to reach the initial states of prior curriculum stages.

Here, the task is for the robot arm to pick up the blue block. The guide-policy can move the arm to the block, but it cannot pick it up. It controls the agent until it grips the block, then the exploration-policy takes over, eventually learning to pick up the block. As the exploration-policy improves, the guide-policy controls the agent less and less.

Comparison to IL+RL Baselines
Since JSRL can use a prior policy to initialize RL, a natural comparison would be to imitation and reinforcement learning (IL+RL) methods that train on offline datasets, then fine-tune the pre-trained policies with new online experience. We show how JSRL compares to competitive IL+RL methods on the D4RL benchmark tasks. These tasks include simulated robotic control environments, along with datasets of offline data from human demonstrators, planners, and other learned policies. Out of the D4RL tasks, we focus on the difficult ant maze and adroit dexterous manipulation environments.

Example ant maze (left) and adroit dexterous manipulation (right) environments.

For each experiment, we train on an offline dataset and then run online fine-tuning. We compare against algorithms designed specifically for each setting, which include AWAC, IQL, CQL, and behavioral cloning. While JSRL can be used in combination with any initial guide-policy or fine-tuning algorithm, we use our strongest baseline, IQL, as a pre-trained guide and for fine-tuning. The full D4RL dataset includes one million offline transitions for each ant maze task. Each transition is a sequence of format (S, A, R, S’) which specifies what state the agent started in (S), the action the agent took (A), the reward the agent received (R), and the state the agent ended up in (S’) after taking action A. We find that JSRL performs well with as few as ten thousand offline transitions.

Average score (max=100) on the antmaze-medium-diverse-v0 environment from the D4RL benchmark suite. JSRL can improve even with limited access to offline transitions.

Vision-Based Robotic Tasks
Utilizing offline data is especially challenging in complex tasks such as vision-based robotic manipulation due to the curse of dimensionality. The high dimensionality of both the continuous-control action space and the pixel-based state space present scaling challenges for IL+RL methods in terms of the amount of data required to learn good policies. To study how JSRL scales to such settings, we focus on two difficult simulated robotic manipulation tasks: indiscriminate grasping (i.e., lifting any object) and instance grasping (i.e., lifting a specific target object).

A simulated robot arm is placed in front of a table with various categories of objects. When the robot lifts any object, a sparse reward is given for the indiscriminate grasping task. For the instance grasping task, a sparse reward is only given when a specific target object is grasped.

We compare JSRL against methods that are able to scale to complex vision-based robotics settings, such as QT-Opt and AW-Opt. Each method has access to the same offline dataset of successful demonstrations and is allowed to run online fine-tuning for up to 100,000 steps.

In these experiments, we use behavioral cloning as a guide-policy and combine JSRL with QT-Opt for fine-tuning. The combination of QT-Opt+JSRL improves faster than all other methods while achieving the highest success rate.

Mean grasping success for indiscriminate and instance grasping environments using 2k successful demonstrations.

Conclusion
We proposed JSRL, a method for leveraging a prior policy of any form to improve exploration for initializing RL tasks. Our algorithm creates a learning curriculum by rolling in a pre-existing guide-policy, which is then followed by the self-improving exploration-policy. The job of the exploration-policy is greatly simplified since it starts exploring from states closer to the goal. As the exploration-policy improves, the effect of the guide-policy diminishes, leading to a fully capable RL policy. In the future, we plan to apply JSRL to problems such as Sim2Real, and explore how we can leverage multiple guide-policies to train RL agents.

Acknowledgements
This work would not have been possible without Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, and Karol Hausman. Special thanks to Tom Small for creating the animations for this post.

Source: Google AI Blog


The Balloon Learning Environment

Benchmark challenges have been a driving force in the advancement of machine learning (ML). In particular, difficult benchmark environments for reinforcement learning (RL) have been crucial for the rapid progress of the field by challenging researchers to overcome increasingly difficult tasks. The Arcade Learning Environment, Mujoco, and others have been used to push the envelope in RL algorithms, representation learning, exploration, and more.

In “Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning”, published in Nature two years ago, we demonstrated how deep RL can be used to create a high-performing flight agent that can control stratospheric balloons in the real world. This research confirmed that deep RL can be successfully applied outside of simulated environments, and contributed practical knowledge for integrating RL algorithms with complex dynamical systems.

Today we are excited to announce the open-source release of the Balloon Learning Environment (BLE), a new benchmark emulating the real-world problem of controlling stratospheric balloons. The BLE is a high-fidelity simulator, which we hope will provide researchers with a valuable resource for deep RL research.

Station-Keeping Stratospheric Balloons
Stratospheric balloons are filled with a buoyant gas that allows them to float for weeks or months at a time in the stratosphere, about twice as high as a passenger plane’s cruising altitude. Though there are many potential variations of stratospheric balloons, the kind emulated in the BLE are equipped with solar panels and batteries, which allow them to adjust their altitude by controlling the weight of air in their ballast using an electric pump. However, they have no means to propel themselves laterally, which means that they are subject to wind patterns in the air around them.

By changing its altitude, a stratospheric balloon can surf winds moving in different directions.

The goal of an agent in the BLE is to station-keep — i.e., to control a balloon to stay within 50km of a fixed ground station — by changing its altitude to catch winds that it finds favorable. We measure how successful an agent is at station-keeping by measuring the fraction of time the balloon is within the specified radius, denoted TWR50 (i.e., the time within a radius of 50km).

A station-seeking balloon must navigate a changing wind field to stay above a ground station. Left: Side elevation of a station-keeping balloon. Right: Birds-eye-view of the same balloon.

The Challenges of Station-Keeping
To create a realistic simulator (without including copious amounts of historical wind data), the BLE uses a variational autoencoder (VAE) trained on historical data to generate wind forecasts that match the characteristics of real winds. A wind noise model is then used to make the windfields more realistic to match what a balloon would encounter in real-world conditions.

Navigating a stratospheric balloon through a wind field can be quite challenging. The winds at any given altitude rarely remain ideal for long, and a good balloon controller will need to move up and down through its wind column to discover more suitable winds. In RL parlance, the problem of station-keeping is partially observable because the agent only has access to forecasted wind data to make those decisions. An agent has access to wind forecasts at every altitude and the true wind at its current altitude. The BLE returns an observation which includes a notion of wind uncertainty.

A stratospheric balloon must explore winds at different altitudes in order to find favorable winds. The observation returned by the BLE includes wind predictions and a measure of uncertainty, made by mixing a wind forecast and winds measured at the balloon’s altitude.

In some situations, there may not be suitable winds anywhere in the balloon’s wind column. In this case, an expert agent is still able to fly towards the station by taking a more circuitous route through the wind field (a common example is when the balloon moves in a zig-zag fashion, akin to tacking on a sailboat). Below we demonstrate that even just remaining in range of the station usually requires significant acrobatics.

An agent must handle long planning horizons to succeed in station-keeping. In this case, StationSeeker (an expert-designed controller) heads directly to the center of the station-keeping area and is pushed out, while Perciatelli44 (an RL agent) is able to plan ahead and stay in range longer by hugging the edge of the area.

Night-time adds a fresh element of difficulty to station-keeping in the BLE, which reflects the reality of night-time changes in physical conditions and power availability. While during the day the air pump is powered by solar panels, at night the balloon relies on its on-board batteries for energy. Using too much power early in the night typically results in limited maneuverability in the hours preceding dawn. This is where RL agents can discover quite creative solutions — such as reducing altitude in the afternoon in order to store potential energy.

An agent needs to balance the station-keeping objective with a finite energy allowance at night.

Despite all these challenges, our research demonstrates that agents trained with reinforcement learning can learn to perform better than expert-designed controllers at station-keeping. Along with the BLE, we are releasing the main agents from our research: Perciatelli44 (an RL agent) and StationSeeker (an expert-designed controller). The BLE can be used with any reinforcement learning library, and to showcase this we include Dopamine’s DQN and QR-DQN agents, as well as Acme’s QR-DQN agent (supporting both standalone and distributed training with Launchpad).

Evaluation performance by the included benchmark agents on the BLE. “Finetuned” is a fine-tuned Perciatelli44 agent, and Acme is a QR-DQN agent trained with the Acme library.

The BLE source code contains information on how to get started with the BLE, including training and evaluating agents, documentation on the various components of the simulator, and example code. It also includes the historical windfield data (as a TensorFlow DataSet) used to train the VAE to allow researchers to experiment with their own models for windfield generation. We are excited to see the progress that the community will make on this benchmark.

Acknowledgements
We would like to thank the Balloon Learning Environment team: Sal Candido, Marc G. Bellemare, Vincent Dumoulin, Ross Goroshin, and Sam Ponda. We’d also like to thank Tom Small for his excellent animation in this blog post and graphic design help, along with our colleagues, Bradley Rhodes, Daniel Eisenberg, Piotr Staczyk, Anton Raichuk, Nikola Momchev, Geoff Hinton, Hugo Larochelle, and the rest of the Brain team in Montreal.

Source: Google AI Blog


Robot See, Robot Do

People learn to do things by watching others — from mimicking new dance moves, to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers & roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use Youtube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in physique, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., “grippers”) (i.e., the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In “XIRL: Cross-Embodiment Inverse RL”, presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert the reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.

X-MAGICAL Benchmark
To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, which is a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with pip install x-magical.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to “clear the debris”. For example, the long-stick can clear them all in one fell swoop, whereas the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.

Highlights
In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. Below, in the gripper subplot (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably when compared with other baseline reward functions, trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).

What Do The Learned Reward Functions Look Like?
To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.

Below, we show rewards extracted from a successful (top) and unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.

Conclusion
We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.

Acknowledgments
Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.

Source: Google AI Blog