Tag Archives: reinforcement learning

Chip Design with Deep Reinforcement Learning



The revolution of modern computing has been largely enabled by remarkable advances in computer systems and hardware. With the slowing of Moore’s Law and Dennard scaling, the world is moving toward specialized hardware to meet the exponentially growing demand for compute. However, today’s chips take years to design, resulting in the need to speculate about how to optimize the next generation of chips for the machine learning (ML) models of 2-5 years from now. Dramatically shortening the chip design cycle would allow hardware to adapt to the rapidly advancing field of ML. What if ML itself could provide the means to shorten the chip design cycle, creating a more integrated relationship between hardware and ML, with each fueling advances in the other?

In “Chip Placement with Deep Reinforcement Learning”, we pose chip placement as a reinforcement learning (RL) problem, where we train an agent (i.e, an RL policy) to optimize the quality of chip placements. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. Whereas existing baselines require human experts in the loop and take several weeks to generate, our method can generate placements in under six hours that outperform or match their manually designed counterparts. While we show that we can generate optimized placements for Google accelerator chips (TPUs), our methods are applicable to any kind of chip (ASIC).

The Chip Floorplanning Problem
A computer chip is divided into dozens of blocks, each of which is an individual module, such as a memory subsystem, compute unit, or control logic system. These blocks can be described by a netlist, a graph of circuit components, such as macros (memory components) and standard cells (logic gates like NAND, NOR, and XOR), all of which are connected by wires. Determining the layout of a chip block, a process called chip floorplanning, is one of the most complex and time-consuming stages of the chip design process and involves placing the netlist onto a chip canvas (a 2D grid), such that power, performance, and area (PPA) are minimized, while adhering to constraints on density and routing congestion. Despite decades of research on this topic, it is still necessary for human experts to iterate for weeks to produce solutions that meet multi-faceted design criteria. This problem’s complexity arises from the size of the netlist graph (millions to billions of nodes), the granularity of the grid onto which that graph must be placed, and the exorbitant cost of computing the true target metrics, which can take many hours (sometimes over a day) using industry-standard electronic design automation tools.

The Deep Reinforcement Learning Model
The input to our model is the chip netlist (node types and graph adjacency information), the ID of the current node to be placed, and some netlist metadata, such as the total number of wires, macros, and standard cell clusters. The netlist graph and the current node are passed through an edge-based graph neural network that we developed to encode the input state. This generates embeddings of the partially placed graph and the candidate node.
A graph neural network generates embeddings that are concatenated with the metadata embeddings to form the input to the policy and value networks.
The edge, macro and netlist metadata embeddings are then concatenated to form a single state embedding, which is passed to a feedforward neural network. The output of the feedforward network is a learned representation that captures the useful features and serves as input to the policy and value networks. The policy network generates a probability distribution over all possible grid cells onto which the current node could be placed.

In each iteration of training, the macros are sequentially placed by the RL agent, after which the standard cell clusters are placed by a force-directed method, which models the circuit as a system of springs to minimize wirelength. RL training is guided by a fast-but-approximate reward signal calculated for each of the agent’s chip placements using the weighted average of approximate wirelength (i.e., the half-perimeter wirelength, HPWL) and approximate congestion (the fraction of routing resources consumed by the placed netlist).
During each training iteration, the macros are placed by the policy one at a time and the standard cell clusters are placed by a force-directed method. The reward is calculated from the weighted combination of approximate wirelength and congestion.
Results
To our knowledge, this method is the first chip placement approach that has the ability to generalize, meaning that it can leverage what it has learned while placing previous netlists to generate better placements for new unseen netlists. We show that as we increase the number of chip netlists on which we perform pre-training (i.e., as our method becomes more experienced in placement optimization), our policy better generalizes to new netlists.

For example, the pre-trained policy organically identifies an arrangement that places the macros near the edges of the chip with a convex space in the center in which to place the standard cells. This results in lower wirelength between the macros and standard cells without introducing excessive routing congestion. In contrast, the policy trained from scratch starts with random placements and takes much longer to converge to a high-quality solution, rediscovering the need to leave an opening in the center of the chip canvas. This is demonstrated in the animation below.
Macro placements of Ariane, an open-source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement. Notice how the cavity discovered by the from-scratch policy is already present from the outset in the pre-trained policy’s placement.
We observe that pre-training improves sample efficiency and placement quality. We compare the quality of placements generated using pre-trained policies to those generated by training the policy from scratch. To generate placements for previously unseen chip blocks, we use a zero-shot method, meaning that we simply use a pre-trained policy (with no fine-tuning) to place a new block, yielding a placement in less than a second. The results can be further improved by fine-tuning the policy on the new block. The policy trained from scratch takes much longer to converge, and even after 24 hours, its chip placements are worse than what the fine-tuned policy achieves after 12 hours.
Convergence plots for two policies on Ariane blocks. One is training from scratch and the other is finetuning a pre-trained policy.
The performance of our approach improves as we train on a larger dataset. We observed that as we increase the training set from two blocks to five blocks, and then to 20 blocks, the policy generates better placements, both at zero-shot and after being fine-tuned for the same training wall-clock time.
Training data size vs. fine-tuning performance.
The ability of our approach to learn from experience and improve over time unlocks new possibilities for chip designers. As the agent is exposed to a greater volume and variety of chips, it becomes both faster and better at generating optimized placements for new chip blocks. A fast, high-quality, automatic chip placement method could greatly accelerate chip design and enable co-optimization with earlier stages of the chip design process. Although we evaluate primarily on accelerator chips, our proposed method is broadly applicable to any chip placement problem. After all that hardware has done for machine learning, we believe that it is time for machine learning to return the favor.

Acknowledgements
This project was a collaboration between Google Research and Google Hardware and Architecture teams. We would like to thank our coauthors: Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc Le, James Laudon, Roger Carpenter, Richard Ho, and Jeff Dean for their support and contributions to this work.

Source: Google AI Blog


Exploring Evolutionary Meta-Learning in Robotics



Rapid development of more accurate simulator engines has given robotics researchers a unique opportunity to generate sufficient amounts of data that can be used to train robotic policies for real-world deployment. However, moving trained policies from “sim-to-real” remains one of the greatest challenges of modern robotics, due to the subtle differences encountered between the simulation and real domains, termed the “reality gap”. While some recent approaches leverage existing data, such as imitation learning and offline reinforcement learning, to prepare a policy for the reality gap, a more common approach is to simply provide more data by varying properties of the simulated environment, a process called domain randomization.

However, domain randomization can sacrifice performance for stability, as it seeks to optimize for a decent, stable policy across all tasks, but offers little room for improving the policy on a specific task. This lack of a common optimal policy between simulation and reality is frequently a problem in robotic locomotion applications, where there are varying physical forces at play, such as leg friction, body mass, and terrain differences. For example, given the same initial conditions for the robot’s position and balance, the surface type will determine the optimal policy — for an incoming flat surface encountered in simulation, the robot could accelerate to a higher speed, while for an incoming rugged and bumpy surface encountered in the real world, it should walk slowly and carefully to prevent falling.

In “Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning”, we present a particular type of meta-learning based on evolutionary strategies (ES), an approach generally believed to only work well in simulation, we can effectively and efficiently adapt a policy to a real-world robot in a completely model-free manner. Compared to previous approaches for adapting meta-policies, such as standard policy gradients which do not allow sim-to-teal adaptation, ES enables a robot to quickly overcome the reality gap and adapt to dynamic changes in the real world, some of which may not be encountered in simulation. This represents the first instance of successfully using ES for on-robot adaptation.
Our algorithm quickly adapts a legged robot’s policy to dynamics changes. In this example, the battery voltage dropped from 16.8V to 10V which reduced motor power, and a 500g mass was also placed on the robot's side, causing it to turn rather than walk straight. The policy is able to adapt in only 50 episodes (or 150s of real-world data).
Meta-Learning
This research falls under the general class of meta-learning techniques, and is demonstrated on a legged robot. At a high level, meta-learning learns to solve an incoming task quickly without completely retraining from scratch, by combining past experiences with small amounts of experience from the incoming task. This is especially beneficial in the sim-to-real case, where most of the past experiences come cheaply from simulation, while a minimal, yet necessary amount of experience is generated from the real world task. The simulation experiences allow the policy to possess a general level of behavior for solving a distribution of tasks, while the real-world experiences allow the policy to fine-tune specifically to the real-world task at hand.

In order to train a policy to meta-learn, it is necessary to encourage a policy to adapt during simulation. Normally, this can be achieved by applying model-agnostic meta-learning (MAML), which searches for a meta-policy that can adapt to a specific task quickly using small amounts of task-specific data. The standard approach to computing such meta-policies is by using policy gradient methods, which seek to improve the likelihood of selecting the same action given the same state. In order to determine the likelihood of a given action, the policy must be stochastic, allowing for the action selected by the policy to have a randomized component. The real-world environment for deploying such robotic policies is also highly stochastic, as there can be slight differences in motion arising naturally, even if starting from the exact same state and action sequence. The combination of using a stochastic policy inside a stochastic environment creates two conflicting objectives:
  1. Decreasing the policy’s stochasticity may be crucial, as otherwise the high-noise problem might be exacerbated by the additional randomness from the policy’s actions.

  2. However, increasing the policy’s stochasticity may also benefit exploration, as the policy needs to use random actions to probe the type of environment to which it adapts.
These two competing objectives, which have been noted before, seek to both decrease and increase the policy’s stochasticity and may cause complications.

Evolutionary Strategies in Robotics
Instead, we resolve these challenges by applying ES-MAML, an algorithm that leverages a drastically different paradigm for high-dimensional optimization — evolutionary strategies. The ES-MAML approach updates the policy based solely on the sum of rewards collected by the agent in the environment. The function used for optimizing the policy is black-box, mapping the policy parameters directly to this reward. Unlike policy gradient methods, this approach does not need to collect state/action/reward tuples and does not need to estimate action likelihoods. This allows the use of deterministic policies and exploration based on parameter changes and avoiding the conflict between stochasticity in the policy and in the environment.

In this paradigm, querying usually involves running episodes in the simulator, but we show that ES can be applied also for episodes collected on real hardware. ES optimization can be easily distributed and also works well for training efficient compact policies, a phenomenon with profound robotic implications, since policies with fewer parameters can be easier deployed on real hardware and often lead to more efficient inference and power usage. We confirm the effectiveness of ES in training compact policies by learning adaptable meta-policies with <130 parameters.

The ES optimization paradigm is very flexible. It can be used to optimize non-differentiable objectives, such as the total reward objective in our robotics case. It also works in the presence of substantial (potentially adversarial) noise. In addition, the most recent forms of ES methods (e.g., guided ES) are much more sample-efficient than previous versions.

This flexibility is critical for efficient adaptation of locomotion meta-policies. Our results show that adaptation with ES can be conducted with a small number of additional on-robot episodes. Thus, ES is no longer just an attractive alternative to the state-of-the-art algorithms, but defines a new state of the art for several challenging RL tasks.

Adaptation in Simulation
We first examine the types of adaptation that emerge when training with ES-MAML in simulation. When testing the policy in simulation, we found that the meta-policy forces the robot to fall down when the dynamics become too unstable, whereas the adapted policy allows the robot to re-stabilize and walk again. Furthermore, when the robot’s leg settings change, the meta-policy de-synchronizes the robot’s legs causing the robot to turn sharply, while the adapted policy corrects the robot so it can walk straight again.
The meta-policy’s gait, which experiences issues when facing a difficult dynamics task. Left: The meta-policy lets the robot fall down. Center: The adapted policy ensures the robot continues to walk correctly. Right: Comparative measurement of the robot’s height.
The meta-policy’s gait, under changes to the robot’s leg settings. Left: The meta-policy allows the robot veer to the right. Center: The adapted policy ensures the robot continues to walk in a straight line. Right: Comparative measurement of the robot’s walking direction.
Adaptation in the Real World
Despite the good performance of ES-MAML in simulation, applying it to a real robot is still a challenge. To effectively adapt in the noisy environment of the real world while requiring as little real-world data as possible, we introduce batch hill-climbing, an add-on to ES-MAML based on previous work for zeroth-order blackbox optimization. Rather than performing hill-climbing which iteratively updates the input one-by-one according to a deterministic objective, batch hill-climbing samples a parallel batch of queries to determine the next input, making it robust to large amounts of noise in the objective.

We then test our method on the following 2 tasks, which are designed to significantly change the dynamics from the normal setting of the robot:
In the mass-voltage task (left), a 500g weight is placed on the robot’s side and the voltage is dropped to 10.0V from 16.8V. In the friction task (right), we replaced the rubber feet with tennis balls, to significantly reduce friction and hinder walking.
For the mass-voltage task, the initial meta-policy steered the robot significantly to the right due to the extra mass and voltage change, which caused an imbalance in the robot’s body and leg motors. However, after 30 episodes of adaptation using our method, the robot straightens the walking pose, and after 50 episodes, the robot is able to balance its body completely and is able to walk longer distances. In comparison, training from scratch on an easier, noiseless task from only simulation required approximately 90,000 episodes, showing that our method significantly reduces sample complexity on expensive real world data.
Qualitative changes during the adaptation phase under the mass-voltage task.
We compared our method to domain randomization and the standard policy gradient approach to MAML (PG-MAML) only, presenting the final policies qualitatively, as well as metrics from the real robot to show how our method adapts. We found that both domain randomization and PG-MAML baselines do not adapt as well as our method.
Comparisons between Domain Randomization and PG-MAML, and metric differences between our method’s meta-policy and adapted policy. Top: Comparison for the mass-voltage task. Our method stabilizes the robot’s roll angle. Bottom: Comparison for the friction task. Our method results in longer trajectories.
Future Work
This work exposes several avenues for future development. One option is to make algorithmic improvements to reduce the number of real-world rollouts required for adaptation. Another area for advancement is the use of model-based reinforcement learning techniques for a lifelong learning system, in which the robot can continuously collect data and quickly adjust its policy to learn new skills and to operate optimally in new environments.

Acknowledgements
This research was conducted by the core ES-MAML team: Xingyou Song, Yuxiang Yang, Krzysztof Choromanski, Ken Caluwaerts, Wenbo Gao, Chelsea Finn, and Jie Tan. We would like to give special thanks to Vikas Sindhwani for his support on ES methods, and Daniel Seita for feedback on our paper.

Source: Google AI Blog


Off-Policy Estimation for Infinite-Horizon Reinforcement Learning



In conventional reinforcement learning (RL) settings, an agent interacts with an environment in an online fashion, meaning that it collects data from its interaction with the environment that is then used to inform changes to the policy governing its behavior. In contrast, offline RL refers to the setting where historical data are used to either learn good policies for acting in an environment, or to evaluate the performance of new policies. As RL is increasingly applied to crucial real-life problems like robotics and recommendation systems, evaluating new policies in the offline setting — estimating the expected reward of a target policy given historical data generated from actions that are based on a behavior policy becomes more critical. However, despite its importance, evaluating the overall effectiveness of a target policy based on historical behavior policies can be a bit tricky, due to the difficulty in building high-fidelity simulators and also the mismatch in data distributions.
Agent-environment interaction in reinforcement learning. At each step, an agent takes an action based on a policy, receives a reward and makes a transition to a new state.
As a simple example, consider the game Pong: one might like to predict if a new strategy (the target policy) increases the chance of winning when considering only historical data collected from previous strategies (behavior policies) and without actually playing the game. If one were interested only in the performance of the behavior policy, a good metric might be to average the rewards of all the time steps from the historical data. However, since historical data is based on actions determined by the behavior policy and not the target policy, this simple average of rewards in the off-policy data would not yield a good estimate of the target policy’s long-term reward. Instead, proper correction must be made to remove the bias resulting from having two different policies (i.e., the difference in data distribution).
In off-policy evaluation, unlike the behavior policy, we do not have any data from the target policy. Therefore, we cannot compute the expected reward of the target policy without using information from the behavior policy.
In “Black-Box Off-Policy Estimation for Infinite-Horizon Reinforcement Learning”, accepted at ICLR 2020, we propose a new approach to evaluate a given policy from offline data based on estimating the expected reward of the target policy as a weighted average of rewards in off-policy data. Since meaningful weights for the off-policy data are not known a priori, we propose a novel way of learning them. Unlike most of previous works, our method is particularly suitable when we plan to use historical data where trajectories are significantly lengthy or have infinite horizons. We empirically demonstrate the effectiveness of this approach using a number of classical control benchmarks.

Background
In general, one approach to solve the off-policy evaluation problem is to build a simulator that mimics the interaction of the agent with the environment, and then evaluate the target policy against the simulation. While the idea is natural, building a high-fidelity simulator for many domains can be extremely challenging, particularly those that involve human interactions.

An alternative approach is to use the weighted average of rewards from the off-policy data as an estimate of the average reward of the target policy. This approach can be more robust than using a simulator as it does not require modeling assumptions about real world dynamics. Indeed, most previous efforts using this approach have found success on short-horizon problems where the number of time steps (i.e., the length of data trajectory) is limited. However, as the horizon is extended, the variance in predictions made by most of the previous estimators often grows exponentially, necessitating novel solutions for long-horizon problems, and even more so in the extreme case of the infinite-horizon problem.

Our Approach for Infinite-Horizon RL
Our method of OPE leverages a well-known statistical technique called importance sampling through which one can estimate the properties of a particular distribution (e.g., the mean) from samples generated by another distribution. In particular, we estimate the long-term average reward of the target policy using the weighted average of rewards from the behavior policy data. The difficulty in this approach is how to choose the weights in order to remove the bias between the off-policy data distribution and that of the target policy while achieving the best estimate of the target policy’s average reward.

One important point is that if the weights are normalized to be positive and sum up to one, then they define a probability distribution over the set of possible states and actions of the agent. On the other hand, an individual policy defines a distribution on how often an agent visits a particular state or performs a particular action. In other words, it defines a unique distribution on states and actions. Under reasonable assumptions, this distribution does not change over time, and is called a stationary distribution. Since we are using importance sampling, we naturally want to optimize weights of the estimator such that the stationary distribution of the target policy matches the distribution induced by the weights of our estimator. However, the problem remains that we do not know the stationary distribution of the target policy, since we do not have any data generated by that policy.

One way to overcome this problem is to make sure that the distribution of weights satisfies properties that the target policy distribution has, without actually knowing what this distribution is. Luckily, we can take advantage of some mathematical "trickery" to solve this. While the full details are found in our paper, the upshot is that while we do not know the stationary distribution of the target policy (since we have no data collected from it) we can determine that distribution by solving an optimization problem involving a backward operator, which describes how an agent transitions from other states and actions to a particular state and action using probability distributions as both input and output. Once we are done, the weighted average of rewards from historic data gives us an estimate of the expected reward of the target policy.

Experimental Results
Using a toy environment called ModelWin that has three states and two actions, we compare our work with a previous state-of-the-art approach (labeled “IPS”), along with a naive method in which we simply average rewards from the behavior policy data. The figure below shows the log of the root-mean-square error (RMSE) with respect to the target policy reward as we change the number of steps collected by the behavior policy. The naive method suffers from a large bias and its error does not change even with more data collected by increasing the length of the episode. The estimation error of the IPS method decreases with increasing horizon length. On the other hand, the error exhibited by our method is small, even for short horizon length.
Left: The ModelWin environment with three states (s1-s3) and two actions (a1, a2). Right: RMSE of different approaches on the ModelWin problem in logarithmic scale. Our approach has converged very quickly in this simple problem, compared to the previous state-of-the-art method (IPS) that needs longer horizon length to converge.
We also compare the performance of our approach with other approaches (including naive estimator, IPS, and model-based estimator) on several classic control problems. As we can see in figures below, the naive averaging performance is almost independent of the number of trajectories. Our method outperforms other approaches in three sample environments: CartPole, Pendulum, and MountainCar.

Comparison of different methods on three environments: Cartpole, Pendulum, and Mountaincar. The left column shows the environments. The right column shows the log of the RMSE with respect to the target policy reward as the number of trajectories collected by the behavior policy changes. The results are based on 50 runs.
To summarize, in this post we described how one can use historic data gathered according to a behavior policy to assess the quality of a new target policy. An interesting future direction of this work is to use structural domain knowledge to improve the algorithm. We invite interested readers to read our paper to learn more about this work.

Acknowledgements.
Special thanks to Qiang Liu and Denny Zhou for contributing to this project.

Source: Google AI Blog


An Optimistic Perspective on Offline Reinforcement Learning



“The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.” — Sutton & Barto

Most reinforcement learning (RL) algorithms assume that an agent actively interacts with an online environment to learn from its own collected experience. These algorithms are challenging to apply to complex real-world problems (such as robotics and autonomous driving) since extensive data collection from the real world can be extremely sample inefficient and lead to unintended behavior, while those operating in simulation require high-fidelity simulators that are challenging to build. However, for many real-world RL applications, there already exist a large amount of previously collected interaction data which can be utilized to make RL feasible for those problems, and enable better generalization by incorporating diverse prior experiences.

Existing interaction data can be used effectively using offline RL, which is the fully off-policy RL setting in which an agent is trained from a fixed dataset of logged experiences, without any further interactions with the environment. Offline RL can help (1) pretrain an RL agent using existing data, (2) empirically evaluate RL algorithms based on their ability to utilize a fixed dataset of interactions, and (3) deliver real-world impact. However, offline RL is considered challenging due to the distribution mismatch between online interactions and any fixed dataset of logged interactions, i.e., when the learned agent takes an action different from the data collection agent, we don’t know the reward that should be provided.
RL with online interactions vs. Offline RL.
In “An Optimistic Perspective on Offline RL”, we propose a simple experimental setup for offline RL on Atari 2600 games, based on logged experiences of a DQN agent. We demonstrate that it is possible to train agents with high returns that outperform the data collection agents using standard off-policy RL algorithms, without explicitly correcting for any distribution mismatch. We also develop a robust RL algorithm, called random ensemble mixture (REM), which shows promising results on offline RL. Overall, we present an optimistic perspective that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high quality behaviour, strengthening the emerging data-driven RL paradigm. To facilitate the development and evaluation of offline RL methods, we are also publicly releasing the DQN Replay Dataset and have open-sourced our code. More details can be found at offline-rl.github.io.

A Primer on Off-policy and Offline RL
We summarize various approaches to RL below:
Online, off-policy RL agents, such as DQN, achieve human-level performance on Atari 2600 games by just observing the game screen, without any explicit knowledge about the game. DQN estimates the effectiveness of an action at a given state of the environment in terms of maximum achievable future rewards (i.e., Q-values). Furthermore, recent distributional RL agents, such as QR-DQN, model the entire distribution of probable future rewards, rather than a single expected value for each state-action pair. Agents such as DQN and QR-DQN are considered “online” because they alternate between optimizing a policy (how an agent acts at a given state) and using that policy to collect more data.

In principle, off-policy RL agents can learn from data collected by any policy, not just the policy being optimized. However, in the offline RL setting, recent work presents a discouraging view that standard off-policy agents diverge or otherwise yield poor performance. To fix this, previous work proposes remedies by regularizing the learned policy to stay close to the dataset of offline interactions.

The DQN Replay Dataset for Offline RL
In this work, we revisit offline RL by first creating the DQN Replay Dataset. This dataset is generated using DQN agents trained on 60 Atari 2600 games for 200 million frames each, while using sticky actions (with 25% probability that the agent’s previous action is executed instead of the current action) to make the problem more challenging. For each of the 60 games, we train 5 DQN agents with different random initializations, and store all of the (state, action, reward, next state) tuples encountered during training into 5 replay datasets per game, resulting in a total of 300 datasets.
Offline RL on Atari games using the DQN Replay Dataset.
The DQN Replay Dataset can then be used for training offline RL agents, without any interaction with the environment during training. Each game replay dataset is approximately 3.5 times larger than ImageNet and includes samples from all of the intermediate policies seen during the optimization of online DQN.

Training Offline Agents on the DQN Replay Dataset
We trained offline variants of DQN and distributional QR-DQN on the DQN Replay Dataset. Although the offline datasets contain data experienced by a DQN agent improving over time as training progresses, we compared the performance of offline agents against the best performing online DQN agent obtained after training (i.e., a fully-trained DQN). For each game, we evaluated the 5 offline agents trained (one per dataset), using online returns, reporting the best averaged performance.

Offline DQN underperforms fully-trained online DQN on all except a few games, where it achieves higher scores with the same amount of data. Offline QR-DQN, on the other hand, outperforms offline DQN and fully-trained DQN on most of the games. These results demonstrate that it is possible to optimize strong agents offline using standard deep RL algorithms. Furthermore, the disparity between the performance of offline QR-DQN and DQN indicates the difference in their ability to exploit offline data.
Offline DQN. Normalized improvement over a fully-trained DQN, per game, of offline DQN trained using DQN replay. On the normalized scale, fully-trained DQN corresponds to 100% performance while random agent corresponds to 0%.
Offline QR-DQN. Normalized performance improvement (in %) over a fully-trained DQN agent, per game, of offline QR-DQN trained offline using DQN replay.
Introducing Two Robust Offline RL Agents
In online RL, an agent chooses actions that it thinks will lead to high rewards, and then receives corrective feedback. Since it is not possible to collect additional data in offline RL, it is essential to reason about generalization using a fixed dataset. Leveraging methods from supervised learning that use an ensemble of models to improve generalization, we present two new offline RL agents:
  • Ensemble-DQN is a simple extension of DQN that trains multiple Q-value estimates and averages them for evaluation.
  • Random Ensemble Mixture (REM) is an easy to implement extension of DQN inspired by Dropout. The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.
Neural Network architectures for DQN, distributional QR-DQN and the expected RL variants with the same multi-head QR-DQN architecture, i.e., Ensemble-DQN and REM. In QR-DQN, each head (red rectangles) corresponds to a specific fraction of the return distribution, while in the proposed variants, each head approximates the Q-function.
To utilize the DQN Replay Dataset more efficiently, we train offline agents for five times as many training iterations as online DQN and report their performance below. Offline REM outperforms offline DQN and offline QR-DQN. The comparison with fully-trained online C51, a strong distributional agent, illustrates that the gains from offline REM are more than the gains from C51.
Offline REM vs. baselines. Median normalized scores averaged over 5 runs across 60 Atari games of offline agents trained using DQN replay for 5x iterations, compared to online DQN.
Using the standard training protocols on Atari, online REM performs on par with QR-DQN in the standard online RL setting. This suggests that we can use the insights gained from the DQN Replay Dataset and the offline RL setting to build effective online RL methods.
Online REM vs. baselines. Median normalized evaluation scores averaged over 5 runs (shown as traces) across stochastic 60 Atari 2600 games of online agents trained for 200 million frames. Online REM with 4 Q-networks performs comparably to online QR-DQN.
Comparison of Results: Important Factors in Offline RL
The discrepancy between these results and prior work that reports failure of standard RL agents in the offline setting could be attributed to the following factors:
  • Offline Dataset Size. We trained offline QR-DQN and REM with reduced data obtained via randomly subsampling the entire DQN Replay Dataset, maintaining the same data distribution. Analogous to supervised learning, performance tends to increase as the size of data increases. With only 10% of the entire dataset, REM and QR-DQN approximately recover the performance of fully-trained DQN.
  • Offline Dataset Composition. We trained offline RL agents on the first 20 million frames per game in the DQN Replay Dataset. Offline REM and QR-DQN outperform the best policy in this lower quality dataset, indicating that standard RL agents work well in the offline setting with sufficiently diverse datasets.
    Offline RL with Lower Quality Dataset. REM and QR-DQN trained on offline data collected from DQN trained for 20 iterations (by using the first 20M frames from each game replay dataset). The horizontal line shows the performance of best policy in this dataset, which is significantly worse than fully-trained DQN.
  • Offline Algorithm Choice. There are claims that standard off-policy agents are ineffective on continuous control tasks when trained offline. However, we found that recent continuous control agents, such as TD3, perform comparably to a sophisticated offline agent when trained on large and diverse offline datasets.
Future Work. Our results emphasize the need for a rigorous characterization of the role of generalization due to neural networks when learning from offline data collected from a large mixture of diverse policies. Benchmarking offline RL with various data collection strategies by subsampling DQN replay (e.g., first / last k million frames) is another important direction. We currently employ online policy evaluation, however, “true” offline RL requires offline policy evaluation for hyperparameter tuning and early stopping. Finally, model-based RL and self-supervised learning approaches are also promising for offline RL.

Acknowledgements.
This research was conducted in collaboration with Dale Schuurmans. We’d like to thank members of the Google Research, Brain Team for valuable discussions. A prior version of this work was presented as a contributed talk at NeurIPS 2019 DRL Workshop.

Source: Google AI Blog


Exploring Nature-Inspired Robot Agility



Whether it’s a dog chasing after a ball or a horse jumping over obstacles, animals can effortlessly perform an incredibly rich repertoire of agile skills. Developing robots that are able to replicate these agile behaviors can open opportunities to deploy robots for sophisticated tasks in the real world. But designing controllers that enable legged robots to perform these agile behaviors can be a very challenging task. While reinforcement learning (RL) is an approach often used for automating development of robotic skills, a number of technical hurdles remain and, in practice, there is still substantial manual overhead. Designing reward functions that lead to effective skills can itself require a great deal of expert insight, and often involves a lengthy reward tuning process for each desired skill. Furthermore, applying RL to legged robots requires not only efficient algorithms, but also mechanisms to enable the robots to remain safe and recover after falling, without frequent human assistance.

In this post, we will discuss two of our recent projects aimed at addressing these challenges. First, we describe how robots can learn agile behaviors by imitating motions from real animals, producing fast and fluent movements like trotting and hopping. Then, we discuss a system for automating the training of locomotion skills in the real world, which allows robots to learn to walk on their own, with minimal human assistance.

Learning Agile Robotic Locomotion Skills by Imitating Animals
In “Learning Agile Robotic Locomotion Skills by Imitating Animals”, we present a framework that takes a reference motion clip recorded from an animal (a dog, in this case) and uses RL to train a control policy that enables a robot to imitate the motion in the real world. By providing the system with different reference motions, we are able to train a quadruped robot to perform a diverse set of agile behaviors, ranging from fast walking gaits to dynamic hops and turns. The policies are trained primarily in simulation, and then transferred to the real world using a latent space adaptation technique that can efficiently adapt a policy using only a few minutes of data from the real robot.

Motion Imitation
We start by collecting motion capture clips of a real dog performing various locomotion skills. Then, we use RL to train a control policy to imitate the dog’s motions. The policies are trained in a physics simulation to track the pose of the reference motion at each timestep. Then, by using different reference motions in the reward function, we can train a simulated robot to imitate a variety of different skills.
Reinforcement learning is used to train a simulated robot to imitate the reference motions from a dog. All simulations are performed using PyBullet.
However, since simulators generally provide only a coarse approximation of the real world, policies trained in simulation often perform poorly when deployed on a real robot. Therefore, we use a sample-efficient latent space adaptation technique to transfer a policy trained in simulation to the real world.

First, to encourage the policy to learn behaviors that are robust to variations in the dynamics, we randomize the dynamics of the simulation by varying physical quantities, such as the robot’s mass and friction. Since we have access to the values of these parameters during training in simulation, we can also map them to a low-dimensional representation using a learned encoder. This encoding is then passed as an additional input to the policy during training. Since the physical parameters of the real robot are not known a priori, when deploying the policy to a real robot, we remove the encoder and directly search for a set of parameters in the latent space that enables the robot to successfully execute the desired skills in the real world. This technique is often able to adapt a policy to the real world using less than 8 minutes of real-world data.
Comparison of policies before and after adaptation on the real robot. Before adaptation, the robot is prone to falling. But after adaptation, the policies are able to more consistently execute the desired skills.
Results
Using this approach, the robot learns to imitate various locomotion skills from a dog, including different walking gaits, such as pacing and trotting, as well as an agile spinning motion.
Robot imitating various skills from a dog.
In addition to imitating motions from real dogs, it is also possible to imitate artist-animated keyframe motions, including a dynamic hop-turn:
Skills learned by imitating artist-animated keyframe motions: side-steps, turn, and hop-turn.
More details are available in the following video:
Learning to Walk in the Real World with Minimal Human Effort
The above approach is able to train policies in simulation and then adapt them to the real world. However, when the task involves complex and diverse physical phenomena, it is also necessary to directly learn from real-world experience. Although learning on real robots has achieved state-of-the-art performance for manipulation tasks (e.g., QT-Opt), applying the same methods to legged robots is difficult since the robot may fall and damage itself, or leave the training area, which can then require human intervention.
An automated learning system for legged robots must resolve safety and automation challenges.
In “Learning to Walk in the Real World with Minimal Human Effort”, we developed an automated learning system with software and hardware components, using a multi-task learning procedure, a safety-constrained learner, and several carefully designed hardware and software components. Multi-task learning prevents the robot from leaving the training area by generating a learning schedule that drives the robot towards the center of the workspace. We also reduce the number of falls by designing a safety constraint, which we solve with dual gradient descent.

For each roll-out, the scheduler selects a task in which the desired walking direction is pointing towards the center. For instance, assuming we have two tasks, forward and backward walking, the scheduler will select the forward task if the robot is at the back of the workspace, and vice-versa for the backward task. In the middle of the episode, the learner takes dual gradient descent steps to iteratively optimize both the task objective and safety constraints, rather than treating them as a single goal. If the robot has fallen, we invoke an automated get-up controller and proceed to the next episode.
We solve automation and safety challenges with multi-task learning, a safety-constrained SAC algorithm, and an automatic reset controller.
Results
This framework successfully trains policies from scratch to walk in different directions without any human intervention.
Snapshots of the training process on the flat surface with zero human resets.
Once trained, it is possible to steer the robot with a remote controller. Notice how it's possible to command the robot to turn in place using the controller. This action would be difficult to manually design due to the planar leg structure of the robot, but is discovered automatically using our automated multi-instance learner.
We train locomotion policies to walk in four directions, which allow us to interactively control the robot with a game controller.
The system also enables the robot to navigate more challenging surfaces, such as a memory foam mattress and a doormat with crevices.
Learned locomotion gaits on challenging terrains.
More details can be found in the following video:
Conclusion
In these two papers, we present methods to reproduce a diverse corpus of behaviors with quadruped robots. Extending this line of work to learn skills from videos would also be an exciting direction, which can substantially increase the volume of data from which robots can learn. We are also interested in applying the automated training system to more complex real-world environments and tasks.

Acknowledgments
We would like to thank our coauthors, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, Sergey Levine, Peng Xu and Zhenyu Tan. We would also like to thank Julian Ibarz, Byron David, Thinh Nguyen, Gus Kouretas, Krista Reymann, and Bonny Ho for their support and contributions to this work.

Source: Google AI Blog


Massively Scaling Reinforcement Learning with SEED RL



Reinforcement learning (RL) has seen impressive advances over the last few years as demonstrated by the recent success in solving games such as Go and Dota 2. Models, or agents, learn by exploring an environment, such as a game, while optimizing for specified goals. However, current RL techniques require increasingly large amounts of training to successfully learn even simple games, which makes iterating research and product ideas computationally expensive and time consuming.

In “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference”, we present an RL agent that scales to thousands of machines, which enables training at millions of frames per second, and significantly improves computational efficiency. This is achieved with a novel architecture that takes advantage of accelerators (GPUs or TPUs) at scale by centralizing model inference and introducing a fast communication layer. We demonstrate the performance of SEED RL on popular RL benchmarks, such as Google Research Football, Arcade Learning Environment and DeepMind Lab, and show that by using larger models, data efficiency can be increased. The code has been open sourced on Github together with examples for running on Google Cloud with GPUs.

Current Distributed Architectures
The previous generation of distributed reinforcement learning agents, such as IMPALA, made use of accelerators specialized for numerical calculations, taking advantage of the speed and efficiency from which (un)supervised learning has benefited for years. The architecture of an RL agent is usually separated into actors and learners. The actors typically run on CPUs and iterate between taking steps in the environment and running inference on the model to predict the next action. Frequently the actor will update the parameters of the inference model, and after collecting a sufficient amount of observations, will send a trajectory of observations and actions to the learner, which then optimizes the model. In this architecture, the learner trains the model on GPUs using input from distributed inference on hundreds of machines.

Example architecture for an earlier generation RL agent, IMPALA. Inference is done on the actors, usually using inefficient CPUs. Updated model parameters are frequently sent from the learner to the actors increasing bandwidth requirements.
The architecture of RL agents (such as IMPALA) have a number of drawbacks:
  1. Using CPUs for neural network inference is much less efficient and slower than using accelerators and becomes problematic as models become larger and more computationally expensive.
  2. The bandwidth required for sending parameters and intermediate model states between the actors and learner can be a bottleneck.
  3. Handling two completely different tasks on one machine (i.e., environment rendering and inference) is unlikely to utilize machine resources optimally.
SEED RL Architecture
The SEED RL architecture is designed to solve these drawbacks. With this approach, neural network inference is done centrally by the learner on specialized hardware (GPUs or TPUs), enabling accelerated inference and avoiding the data transfer bottleneck by ensuring that the model parameters and state are kept local. While observations are sent to the learner at every environment step, latency is kept low due to a very efficient network library based on the gRPC framework with asynchronous streaming RPCs. This makes it possible to achieve up to a million queries per second on a single machine. The learner can be scaled to thousands of cores (e.g., up to 2048 on Cloud TPUs) and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. SEED RL is based on the TensorFlow 2 API and, in our experiments, was accelerated by TPUs.
Overview of the architecture of SEED RL. In contrast to the IMPALA architecture, the actors only take actions in environments. Inference is executed centrally by the learner on accelerators using batches of data from multiple actors.
In order for this architecture to be successful, two state-of-the-art algorithms are integrated into SEED RL. The first is V-trace, a policy gradient-based method, first introduced with IMPALA. In general, policy gradient-based methods predict an action distribution from which an action can be sampled. However, because the actors and the learner execute asynchronously in SEED RL, the policy of actors is slightly behind the policy of the learner, i.e., they become off-policy. The usual policy gradient-based methods are on-policy, meaning that they have the same policy for actors and learner, and suffer from convergence and numerical issues in off-policy settings. V-trace is an off-policy method and thus works well in the asynchronous SEED RL architecture.

The second algorithm is R2D2, a Q-learning method that selects an action based on the predicted future value of that action using recurrent distributed replay. This approach allows the Q-learning algorithm to be run at scale, while still allowing the use of recurrent neural networks that can predict future values based on the information of all past frames in an episode.

Experiments
SEED RL is benchmarked on the commonly used Arcade Learning Environment, DeepMind Lab environments, and on the recently released Google Research Football environment.
Frames per second comparing IMPALA and various configurations of SEED RL on DeepMind Lab. SEED RL achieves 2.4M frames per second using 4,160 CPUs. Assuming the same speed, IMPALA would need 14,000 CPUs.
On DeepMind Lab, we achieve 2.4 million frames per second with 64 Cloud TPU cores, which represents an improvement of 80x over the previous state-of-the-art distributed agent, IMPALA. This results in a significant speed-up in wall-clock time and computational efficiency. IMPALA requires 3-4x as many CPUs as SEED RL for the same speed.
Episode return (i.e., the sum of rewards) over time on the DeepMind Lab game “explore_goal_locations_small” using IMPALA and SEED RL. With SEED RL, the time to train is significantly reduced.
With an architecture optimized for use on modern accelerators, it’s natural to increase the model size in an attempt to increase data efficiency. We show that by increasing the size of the model and the input resolution, we are able to solve a previously unsolved Google Research Football task, “Hard”.
The score of different architectures on the Google Research Football “Hard” task. We show that by using an input resolution and a larger model, the score is improved, and with more training, the model can significantly outperform the builtin AI.
Additional details are provided in the paper, including our results on the Arcade Learning Environment. We believe SEED RL and the results presented, demonstrate that reinforcement learning has once again caught up with the rest of the deep learning field in terms of taking advantage of accelerators.

Acknowledgements
This project was done in collaboration with Raphaël Marinier, Piotr Stanczyk, Ke Wang, Marcin Andrychowicz and Marcin Michalski. We would also like to thank Tom Small for the visualizations.

Source: Google AI Blog


Introducing Dreamer: Scalable Reinforcement Learning Using World Models



Research into how artificial agents can choose actions to achieve goals is making rapid progress in large part due to the use of reinforcement learning (RL). Model-free approaches to RL, which learn to predict successful actions through trial and error, have enabled DeepMind's DQN to play Atari games and AlphaStar to beat world champions at Starcraft II, but require large amounts of environment interaction, limiting their usefulness for real-world scenarios.

In contrast, model-based RL approaches additionally learn a simplified model of the environment. This world model lets the agent predict the outcomes of potential action sequences, allowing it to play through hypothetical scenarios to make informed decisions in new situations, thus reducing the trial and error necessary to achieve goals. In the past, it has been challenging to learn accurate world models and leverage them to learn successful behaviors. While recent research, such as our Deep Planning Network (PlaNet), has pushed these boundaries by learning accurate world models from images, model-based approaches have still been held back by ineffective or computationally expensive planning mechanisms, limiting their ability to solve difficult tasks.

Today, in collaboration with DeepMind, we present Dreamer, an RL agent that learns a world model from images and uses it to learn long-sighted behaviors. Dreamer leverages its world model to efficiently learn behaviors via backpropagation through model predictions. By learning to compute compact model states from raw images, the agent is able to efficiently learn from thousands of predicted sequences in parallel using just one GPU. Dreamer achieves a new state-of-the-art in performance, data efficiency and computation time on a benchmark of 20 continuous control tasks given raw image inputs. To stimulate further advancement of RL, we are releasing the source code to the research community.

How Does Dreamer Work?
Dreamer consists of three processes that are typical for model-based methods: learning the world model, learning behaviors from predictions made by the world model, and executing its learned behaviors in the environment to collect new experience. To learn behaviors, Dreamer uses a value network to take into account rewards beyond the planning horizon and an actor network to efficiently compute actions. The three processes, which can be executed in parallel, are repeated until the agent has achieved its goals:
The three processes of the Dreamer agent. The world model is learned from past experience. From predictions of this model, the agent then learns a value network to predict future rewards and an actor network to select actions. The actor network is used to interact with the environment.
Learning the World Model
Dreamer leverages the PlaNet world model, which predicts outcomes based on a sequence of compact model states that are computed from the input images, instead of directly predicting from one image to the next. It automatically learns to produce model states that represent concepts helpful for predicting future outcomes, such as object types, positions of objects, and the interaction of the objects with their surroundings. Given a sequence of images, actions, and rewards from the agent's dataset of past experience, Dreamer learns the world model as shown:
Dreamer learns a world model from experience. Using past images (o1–o3) and actions (a1–a2), it computes a sequence of compact model states (green circles) from which it reconstructs the images (ô1–ô3) and predicts the rewards (r̂1–r̂3).
An advantage to using the PlaNet world model is that predicting ahead using compact model states instead of images greatly improves the computational efficiency. This enables the model to predict thousands of sequences in parallel on a single GPU. The approach can also facilitate generalization, leading to accurate long-term video predictions. To gain insights into how the model works, we can visualize the predicted sequences by decoding the compact model states back into images, as shown below for a task of the DeepMind Control Suite and for a task of the DeepMind Lab environment:
Predicting ahead using compact model states enables long-term predictions in complex environments. Shown here are two sequences that the agent has not encountered before. Given five input images, the model reconstructs them and predicts the future images up to time step 50.
Efficient Behavior Learning
Previously developed model-based agents typically select actions either by planning through many model predictions or by using the world model in place of a simulator to reuse existing model-free techniques. Both designs are computationally demanding and do not fully leverage the learned world model. Moreover, even powerful world models are limited in how far ahead they can accurately predict, rendering many previous model-based agents shortsighted. Dreamer overcomes these limitations by learning a value network and an actor network via backpropagation through predictions of its world model.

Dreamer efficiently learns the actor network to predict successful actions by propagating gradients of rewards backwards through predicted state sequences, which is not possible for model-free approaches. This tells Dreamer how small changes to its actions affect what rewards are predicted in the future, allowing it to refine the actor network in the direction that increases the rewards the most. To consider rewards beyond the prediction horizon, the value network estimates the sum of future rewards for each model state. The rewards and values are then backpropagated to refine the actor network to select improved actions:
Dreamer learns long-sighted behaviors from predicted sequences of model states. It first learns the long-term value (v̂2–v̂3) of each state, and then predicts actions (â1–â2) that lead to high rewards and values by backpropagating them through the state sequence to the actor network.
Dreamer differs from PlaNet in several ways. For a given situation in the environment, PlaNet searches for the best action among many predictions for different action sequences. In contrast, Dreamer side-steps this expensive search by decoupling planning and acting. Once its actor network has been trained on predicted sequences, it computes the actions for interacting with the environment without additional search. In addition, Dreamer considers rewards beyond the planning horizon using a value function and leverages backpropagation for efficient planning.

Performance on Control Tasks
We evaluated Dreamer on a standard benchmark of 20 diverse tasks with continuous actions and image inputs. The tasks include balancing and catching objects, as well as locomotion of various simulated robots. The tasks are designed to pose a variety of challenges to the RL agent, including difficult to predict collisions, sparse rewards, chaotic dynamics, small but relevant objects, high degrees of freedom, and 3D perspectives:
Dreamer learns to solve 20 challenging continuous control tasks with image inputs, 5 of which are displayed here. The visualizations show the same 64x64 images that the agent receives from the environment.
We compare the performance of Dreamer to that of PlaNet, the previous best model-based agent, the popular model-free agent, A3C, as well as the current best model-free agent on this benchmark, D4PG, which combines several advances of model-free RL. The model-based agents learn efficiently in under 5 million frames, corresponding to 28 hours inside the simulation. The model-free agents learn more slowly and require 100 million frames, corresponding to 23 days inside the simulation.

On the benchmark of 20 tasks, Dreamer outperforms the best model-free agent (D4PG) with an average score of 823 compared to 786, while learning from 20 times fewer environment interactions. Moreover, it exceeds the final performance of the previously best model-based agent (PlaNet) across almost all of the tasks. The computation time of 16 hours for training Dreamer is less than the 24 hours required for the other methods. The final performance of the four agents is shown below:
Dreamer outperforms the previous best model-free (D4PG) and model-based (PlaNet) methods on the benchmark of 20 tasks in terms of final performance, data efficiency, and computation time.
In addition to our main experiments on continuous control tasks, we demonstrate the generality of Dreamer by applying it to tasks with discrete actions. For this, we select Atari games and DeepMind Lab levels that require both reactive and long-sighted behavior, spatial awareness, and understanding of visually more diverse scenes. The resulting behaviors are visualized below, showing that Dreamer also efficiently learns to solve these more challenging tasks:
Dreamer learns successful behaviors on Atari games and DeepMind Lab levels, which feature discrete actions and visually more diverse scenes, including 3D environments with multiple objects.
Conclusion
Our work demonstrates that learning behaviors from sequences predicted by world models alone can solve challenging visual control tasks from image inputs, surpassing the performance of previous model-free approaches. Moreover, Dreamer demonstrates that learning behaviors by backpropagating value gradients through predicted sequences of compact model states is successful and robust, solving a diverse collection of continuous and discrete control tasks. We believe that Dreamer offers a strong foundation for further pushing the limits of reinforcement learning, including better representation learning, directed exploration with uncertainty estimates, temporal abstraction, and multi-task learning.

Acknowledgements
This project is a collaboration with Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. We further thank everybody in the Brain Team and beyond who commented on our paper draft and provided feedback at any point throughout the project.

Source: Google AI Blog


RecSim: A Configurable Simulation Platform for Recommender Systems

Originally posted on the Google AI Blog

Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs)—recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user's needs—have emerged as a tangible goal for online services.

Despite this, the deployment of CIRs has been limited by challenges in developing algorithms and models that reflect the qualitative characteristics of sequential user interaction. Reinforcement learning (RL) is the de facto standard ML approach for addressing sequential decision problems, and as such is a natural paradigm for modeling and optimizing sequential interaction in recommender systems. However, it remains under-investigated and under-utilized for use in CIRs in both research and practice. One major impediment is the lack of general-purpose simulation platforms for sequential recommender settings, whereas simulation has been one of the primary means for developing and evaluating RL algorithms in real-world applications like robotics.

To address this, we have developed RᴇᴄSɪᴍ (available here), a configurable platform for authoring simulation environments to facilitate the study of RL algorithms in recommender systems (and CIRs in particular). RᴇᴄSɪᴍ allows both researchers and practitioners to test the limits of existing RL methods in synthetic recommender settings. RecSim’s aim is to support simulations that mirror specific aspects of user behavior found in real recommender systems and serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms, especially RL systems designed for sequential user-system interaction.

As an open-source platform, RᴇᴄSɪᴍ: (i) facilitates research at the intersection of RL and recommender systems; (ii) encourages reproducibility and model-sharing; (iii) aids the recommender-systems practitioner, interested in applying RL to rapidly test and refine models and algorithms in simulation, before incurring the potential cost (e.g., time, user impact) of live experiments; and (iv) serves as a resource for academic-industry collaboration through the release of “realistic” stylized models of user behavior without revealing user data or sensitive industry strategies.

Reinforcement Learning and Recommendation Systems

One challenge in applying RL to recommenders is that most recommender research is developed and evaluated using static datasets that do not reflect the sequential, repeated interaction a recommender has with its users. Even those with temporal extent, such as MovieLens 1M, do not (easily) support predictions about the long-term performance of novel recommender policies that differ significantly from those used to collect the data, as many of the factors that impact user choice are not recorded within the data. This makes the evaluation of even basic RL algorithms very difficult, especially when it comes to reasoning about the long-term consequences of some new recommendation policy—research shows changes in policy can have long-term, cumulative impact on user behavior. The ability to model such user behaviors in a simulated environment, and devise and test new recommendation algorithms, including those using RL, can greatly accelerate the research and development cycle for such problems.

Overview of RᴇᴄSɪᴍ

RᴇᴄSɪᴍ simulates a recommender agent’s interaction with an environment consisting of a user model, a document model and a user choice model. The agent interacts with the environment by recommending sets or lists of documents (known as slates) to users, and has access to observable features of simulated individual users and documents to make recommendations. The user model samples users from a distribution over (configurable) user features (e.g., latent features, like interests or satisfaction; observable features, like user demographic; and behavioral features, such as visit frequency or time budget). The document model samples items from a prior distribution over document features, both latent (e.g., quality) and observable (e.g., length, popularity). This prior, as all other components of RᴇᴄSɪᴍ, can be specified by the simulation developer, possibly informed (or learned) from application data.

The level of observability for both user and document features is customizable. When the agent recommends documents to a user, the response is determined by a user-choice model, which can access observable document features and all user features. Other aspects of a user’s response (e.g., time spent engaging with the recommendation) can depend on latent document features, such as document topic or quality. Once a document is consumed, the user state undergoes a transition through a configurable user transition model, since user satisfaction or interests might change.

We note that RᴇᴄSɪᴍ provides the ability to easily author specific aspects of user behavior of interest to the researcher or practitioner, while ignoring others. This can provide the critical ability to focus on modeling and algorithmic techniques designed for novel phenomena of interest (as we illustrate in two applications below). This type of abstraction is often critical to scientific modeling. Consequently, high-fidelity simulation of all elements of user behavior is not an explicit goal of RᴇᴄSɪᴍ. That said, we expect that it may also serve as a platform that supports “sim-to-real” transfer in certain cases (see below).
Data Flow through components of RᴇᴄSɪᴍ. Colors represent different model components — user and user-choice models (green), document model (blue), and the recommender agent (red)

Applications

We have used RᴇᴄSɪᴍ to investigate several key research problems that arise in the use of RL in recommender systems. For example, slate recommendations can result in RL problems, since the parameter space for action grows exponentially with slate size, posing challenges for exploration, generalization and action optimization. We used RᴇᴄSɪᴍ to develop a novel decomposition technique that exploits simple, widely applicable assumptions about user choice behavior to tractably compute Q-values of entire recommendation slates. In particular, RᴇᴄSɪᴍ was used to test a number of experimental hypotheses, such as algorithm performance and robustness to different assumptions about user behavior.

Future Work

While RᴇᴄSɪᴍ provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/recommender algorithms in stylized environments, we are developing several important extensions. These include: (i) methodologies to fit stylized user models to usage logs to partially address the “sim-to-real” gap; (ii) the development of natural APIs using TensorFlow’s probabilistic APIs to facilitate model specification and learning, as well as scaling up simulation and inference algorithms using accelerators and distributed execution; and (iii) the extension to full-factor, mixed-mode interaction models that will be the hallmark of modern CIRs—e.g., language-based dialogue, preference elicitation, explanations, etc.

Our hope is that RᴇᴄSɪᴍ will serve as a valuable resource that bridges the gap between recommender systems and RL research — the use cases above are examples of how it can be used in this fashion. We also plan to pursue it as a platform to support academic-industry collaborations, through the sharing of stylized models of user behavior that, at suitable levels of abstraction, reflect a degree of realism that can drive useful model and algorithm development.

Further details of the RᴇᴄSɪᴍ framework can be found in the white paper, while code and colabs/tutorials are available here.

Acknowledgements
We thank our collaborators and early adopters of RᴇᴄSɪᴍ, including the other members of the RᴇᴄSɪᴍ team: Eugene Ie, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu and Craig Boutilier.

By Martin Mladenov, Research Scientist and Chih-wei Hsu, Software Engineer, Google Research

RecSim: A Configurable Simulation Platform for Recommender Systems



Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs) recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user's needs have emerged as a tangible goal for online services.

Despite this, the deployment of CIRs has been limited by challenges in developing algorithms and models that reflect the qualitative characteristics of sequential user interaction. Reinforcement learning (RL) is the de facto standard ML approach for addressing sequential decision problems, and as such is a natural paradigm for modeling and optimizing sequential interaction in recommender systems. However, it remains under-investigated and under-utilized for use in CIRs in both research and practice. One major impediment is the lack of general-purpose simulation platforms for sequential recommender settings, whereas simulation has been one of the primary means for developing and evaluating RL algorithms in real-world applications like robotics.

To address this, we have developed RᴇᴄSɪᴍ (available here), a configurable platform for authoring simulation environments to facilitate the study of RL algorithms in recommender systems (and CIRs in particular). RᴇᴄSɪᴍ allows both researchers and practitioners to test the limits of existing RL methods in synthetic recommender settings. RecSim’s aim is to support simulations that mirror specific aspects of user behavior found in real recommender systems and serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms, especially RL systems designed for sequential user-system interaction.

As an open-source platform, RᴇᴄSɪᴍ: (i) facilitates research at the intersection of RL and recommender systems; (ii) encourages reproducibility and model-sharing; (iii) aids the recommender-systems practitioner, interested in applying RL to rapidly test and refine models and algorithms in simulation, before incurring the potential cost (e.g., time, user impact) of live experiments; and (iv) serves as a resource for academic-industry collaboration through the release of “realistic” stylized models of user behavior without revealing user data or sensitive industry strategies.

Reinforcement Learning and Recommendation Systems
One challenge in applying RL to recommenders is that most recommender research is developed and evaluated using static datasets that do not reflect the sequential, repeated interaction a recommender has with its users. Even those with temporal extent, such as MovieLens 1M, do not (easily) support predictions about the long-term performance of novel recommender policies that differ significantly from those used to collect the data, as many of the factors that impact user choice are not recorded within the data. This makes the evaluation of even basic RL algorithms very difficult, especially when it comes to reasoning about the long-term consequences of some new recommendation policy — research shows changes in policy can have long-term, cumulative impact on user behavior. The ability to model such user behaviors in a simulated environment, and devise and test new recommendation algorithms, including those using RL, can greatly accelerate the research and development cycle for such problems.

Overview of RᴇᴄSɪᴍ
RᴇᴄSɪᴍ simulates a recommender agent’s interaction with an environment consisting of a user model, a document model and a user choice model. The agent interacts with the environment by recommending sets or lists of documents (known as slates) to users, and has access to observable features of simulated individual users and documents to make recommendations. The user model samples users from a distribution over (configurable) user features (e.g., latent features, like interests or satisfaction; observable features, like user demographic; and behavioral features, such as visit frequency or time budget). The document model samples items from a prior distribution over document features, both latent (e.g., quality) and observable (e.g., length, popularity). This prior, as all other components of RᴇᴄSɪᴍ, can be specified by the simulation developer, possibly informed (or learned) from application data.

The level of observability for both user and document features is customizable. When the agent recommends documents to a user, the response is determined by a user-choice model, which can access observable document features and all user features. Other aspects of a user’s response (e.g., time spent engaging with the recommendation) can depend on latent document features, such as document topic or quality. Once a document is consumed, the user state undergoes a transition through a configurable user transition model, since user satisfaction or interests might change.

We note that RᴇᴄSɪᴍ provides the ability to easily author specific aspects of user behavior of interest to the researcher or practitioner, while ignoring others. This can provide the critical ability to focus on modeling and algorithmic techniques designed for novel phenomena of interest (as we illustrate in two applications below). This type of abstraction is often critical to scientific modeling. Consequently, high-fidelity simulation of all elements of user behavior is not an explicit goal of RᴇᴄSɪᴍ. That said, we expect that it may also serve as a platform that supports “sim-to-real” transfer in certain cases (see below).
Data Flow through components of RᴇᴄSɪᴍ. Colors represent different model components — user and user-choice models (green), document model (blue), and the recommender agent (red).
Applications
We have used RᴇᴄSɪᴍ to investigate several key research problems that arise in the use of RL in recommender systems. For example, slate recommendations can result in RL problems, since the parameter space for action grows exponentially with slate size, posing challenges for exploration, generalization and action optimization. We used RᴇᴄSɪᴍ to develop a novel decomposition technique that exploits simple, widely applicable assumptions about user choice behavior to tractably compute Q-values of entire recommendation slates. In particular, RᴇᴄSɪᴍ was used to test a number of experimental hypotheses, such as algorithm performance and robustness to different assumptions about user behavior.

Future Work
While RᴇᴄSɪᴍ provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/recommender algorithms in stylized environments, we are developing several important extensions. These include: (i) methodologies to fit stylized user models to usage logs to partially address the “sim-to-real” gap; (ii) the development of natural APIs using TensorFlow’s probabilistic APIs to facilitate model specification and learning, as well as scaling up simulation and inference algorithms using accelerators and distributed execution; and (iii) the extension to full-factor, mixed-mode interaction models that will be the hallmark of modern CIRs — e.g., language-based dialogue, preference elicitation, explanations, etc.

Our hope is that RᴇᴄSɪᴍ will serve as a valuable resource that bridges the gap between recommender systems and RL research — the use cases above are examples of how it can be used in this fashion. We also plan to pursue it as a platform to support academic-industry collaborations, through the sharing of stylized models of user behavior that, at suitable levels of abstraction, reflect a degree of realism that can drive useful model and algorithm development.

Further details of the RᴇᴄSɪᴍ framework can be found in the white paper, while code and colabs/tutorials are available here.

Acknowledgements
We thank our collaborators and early adopters of RᴇᴄSɪᴍ, including the other members of the RᴇᴄSɪᴍ team: Eugene Ie, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu and Craig Boutilier.

Source: Google AI Blog


ROBEL: Robotics Benchmarks for Learning with Low-Cost Robots



Learning-based methods for solving robotic control problems have recently seen significant momentum, driven by the widening availability of simulated benchmarks (like dm_control or OpenAI-Gym) and advancements in flexible and scalable reinforcement learning techniques (DDPG, QT-Opt, or Soft Actor-Critic). While learning through simulation is effective, these simulated environments often encounter difficulty in deploying to real-world robots due to factors such as inaccurate modeling of physical phenomena and system delays. This motivates the need to develop robotic control solutions directly in the real world, on real physical hardware.

The majority of current robotics research on physical hardware is conducted on high-cost, industrial-quality robots (PR2, Kuka-arms, ShadowHand, Baxter, etc.) intended for precise, monitored operation in controlled environments. Furthermore, these robots are designed around traditional control methods that focus on precision, repeatability, and ease of characterization. This stands in sharp contrast with the learning-based methods that are robust to imperfect sensing and actuation, and demand (a) a high degree of resilience to allow real-world trial-and-error learning, (b) low cost and ease of maintenance to enable scalability through replication and (c) a reliable reset mechanism to alleviate strict human monitoring requirements.

In “ROBEL: Robotics Benchmarks for Learning with Low-Cost Robots”, to be presented at CoRL 2019, we introduce an open-source platform of cost-effective robots and curated benchmarks designed primarily to facilitate research and development on physical hardware in the real world. Analogous to an optical table in the field of optics, ROBEL serves as a rapid experimentation platform, supporting a wide range of experimental needs and the development of new reinforcement learning and control methods. ROBEL consists of D'Claw, a three-fingered hand robot that facilitates learning of dexterous manipulation tasks and D'Kitty, a four-legged robot that enables the learning of agile legged locomotion tasks. The robotic platforms are low-cost, modular, easy to maintain, and are robust enough to sustain on-hardware reinforcement learning from scratch.
Left: The 12 DoF D’Kitty; Middle: The 9 DoF D’Claw; Right: A functional D’Claw setup D’Lantern.
In order to make the robots relatively inexpensive and easy to build, we based ROBEL’s designs on off-the-shelf components and commonly-available prototyping tools (3D-printed or laser cut). Designs are easy to assemble and require only a few hours to build. Detailed part lists (with CAD details), assembly instructions, and software instructions for getting started are available here.

ROBEL Benchmarks
We devised a set of tasks suitable for each platform, D’Claw and D’Kitty, which can be used for benchmarking real-world robotic learning. ROBEL’s task definitions include both dense and sparse task objectives, and introduce metrics for hardware-safety in the task definition, which for example, indicate if joints are exceeding “safe” operating bounds or force thresholds. ROBEL also supports a simulator for all tasks to facilitate algorithmic development and rapid prototyping. D’Claw tasks are centered around three commonly observed manipulation behaviors — Pose, Turn, and Screw.
Left: Pose — Conform to the shape of the environment. Center: Turn — Turn the object to a specified angle. Right: Screw — Continuously rotate the object. (Click images for video.)
D’Kitty tasks are centered around three commonly observed locomotion behaviors — Stand, Orient, and Walk.
Left: Stand — Stand upright. Center: Orient — Align heading with the target. Right: Walk — Move to the target. (Click images for video.)
We evaluated several classes (on-policy, off policy, demo-accelerated, supervised) of deep reinforcement learning methods on each of these benchmark tasks. The evaluation results and the final policies are included as baselines in the software package for comparison. Full task details and baseline performances are available in the technical report.

Reproducibility & Robustness
ROBEL platforms are robust to sustain direct hardware training, and have clocked over 14,000 hours of real-world experience to-date. The platforms have significantly matured over the year. Owing to the modularity of the design, repairs are trivial and require minimal to no domain expertise, making the overall system easy to maintain.

To establish the replicability of the platforms and reproducibility of the benchmarks, ROBEL was studied in isolation by two different research labs. Only software distribution and documentation was used in this study. No in-person visits were allowed. Using ROBEL’s design files and assembly instructions both sites were able to replicate both hardware platforms. Benchmark tasks were trained on robots built at both sites. In the figure below we see that two D’Claw robots built at two different sites not only exhibit similar training progress but also converge to the same final performance, establishing reproducibility of the ROBEL benchmarks.
SAC training performance of a task on two real D’Claw robots developed at different laboratory locations.
Results Gallery
ROBEL has been useful in a variety of reinforcement learning studies so far. Below we highlight a few of the key results, and you can find all our results in this comprehensive gallery. D’Claw platforms are completely autonomous and can sustain reliable experimentation for an extended period of time, and has facilitated experimentation with a wide variety of reinforcement learning paradigms and tasks using both rigid and flexible objects.
Left: Flexible Objects — On-hardware training with DAPG effectively learns to turn flexible objects. We observe manipulation targeting the center of the valve where there is more rigidity. D'Claw is robust to on-hardware training, facilitating successful outcomes on hard to simulate tasks. Center: Disturbance Rejection — A Sim2Real policy trained via Natural Policy Gradient on MuJoCo simulation with object perturbations (amongst others) being tested on hardware. We observe fingers working together to resist external disturbances. Right: Obstructed Finger — A Sim2Real policy trained via Natural Policy Gradient on MuJoCo simulation with external perturbations (amongst others) being tested on hardware. We observe that free fingers fill in for the missing finger.
Importantly, D’Claw platforms are modular and easy to replicate, which facilitates scalable experimentation. With our scaled setup, we find that multiple D’Claws can collectively learn tasks faster by sharing experience.
On-hardware training with distributed version of SAC leaning to turn multiple objects to arbitrary angles in conjunction by sharing experience. Five tasks only need twice the amount of experience of single tasks, thanks to the multi-task formulation. In the video we observe five D'Claws turning different objects to 180 degrees (picked for visual effectiveness, actual policy can turn to any angle).
We have also been successful in deploying robust locomotion policies on the D’Kitty platform. Below we show a blind D’Kitty walking over indoor and outdoor terrains exhibiting the robustness of its gait in presence of unseen disturbances.
Left: Indoor – Walking in Clutter — A Sim2Real policy trained via Natural Policy Gradient on MuJoCo simulation with randomized perturbations learns to walk in clutter and step over objects. Center: Outdoor – Gravel and Branches — A Sim2Real policy trained via Natural Policy Gradient on MuJoCo simulation with randomized height field learns to walk outdoors over gravel and branches. Right: Outdoor – Slope and Grass — A Sim2Real policy trained via Natural Policy Gradient on MuJoCo simulation with randomized height field learns to handle moderate slopes.
When presented with information about its torso and objects present in the scene, D’Kitty can learn to interact with these objects exhibiting complex behaviors.
Left: Avoid Moving Obstacles — Policy trained via Hierarchical Sim2Real learns to avoid a moving block and reach the target (marked by the controller on the floor). Center: Push to Moving Goal — Policy trained via Hierarchical Sim2Real learns to push block towards a moving target (marked by the controller in the hand). Right: Co-ordinate — Policy trained via Hierarchical Sim2Real learns to coordinate two D'Kitties to push a heavy block towards a target (marked by two + signs on the floor).
In conclusion, ROBEL platforms are low cost, robust, reliable and are designed to accommodate the needs of the emerging learning-based paradigms that need scalability and resilience. We are proud to announce the release of ROBEL to the open source community and are excited to learn about the diversity of research and experimentation they will enable. For getting started on ROBEL platforms and ROBEL benchmarks refer to roboticsbenchmarks.org.

Acknowledgments
Google's ROBEL D'Claw evolved from earlier designs Vikash Kumar developed at the Universities of Washington and Berkeley. Multiple people across organizations have contributed towards the ROBEL projects. We thank our co-authors Henry Zhu (UC Berkeley), Kristian Hartikainen (UC Berkeley), Abhishek Gupta (UC Berkeley) and Sergey Levine (Google and UC Berkeley) for their contributions and extensive feedback throughout the project. We would like to acknowledge Matt Neiss (Google) and Chad Richards (Google) for their significant contribution to the platform designs. We would also like to thank Aravind Rajeshwaran (U-Washington), Emo Todorov (U-Washington), and Vincent Vanhoucke (Google) for their helpful discussions and comments throughout the project.

Source: Google AI Blog