We are announcing the release of our state-of-the-art off-policy model-free
reinforcement learning algorithm, soft actor-critic (SAC). This algorithm has
been developed jointly at UC Berkeley and Google, and we have been using
it internally for our robotics experiments. Soft actor-critic is, to our
knowledge, one of the most efficient model-free algorithms available today,
making it especially well-suited for real-world robotic learning. In this post,
we will benchmark SAC against state-of-the-art model-free RL algorithms and
showcase a spectrum of real-world robot examples, ranging from manipulation to
locomotion. We also release our implementation of SAC, which is designed
specifically for real-world robotic systems.
Desired Features for Deep RL for Real Robots
What makes an ideal deep RL algorithm for real-world systems? Real-world
experimentation brings additional challenges, such as constant interruptions in
the data stream, the need for low-latency inference, and smooth exploration to
avoid mechanical wear and tear on the robot. These challenges place additional
requirements on both the algorithm and its implementation.
Regarding the algorithm, several properties are desirable:
- Sample Efficiency. Learning skills in the real world can take a substantial amount of time. Prototyping a new task takes several trials, and the total time required to learn a new skill quickly adds up. Good sample complexity is therefore the first prerequisite for successful skill acquisition.
- No Sensitive Hyperparameters. In the real world, we want to avoid parameter tuning for obvious reasons. Maximum entropy RL provides a robust framework that minimizes the need for hyperparameter tuning.
- Off-Policy Learning. An algorithm is off-policy if we can reuse data collected for another task. In a typical scenario, we need to adjust parameters and shape the reward function when prototyping a new task, and an off-policy algorithm allows us to reuse the already collected data.
Soft actor-critic (SAC), described below, is an off-policy model-free deep RL
algorithm that is well aligned with these requirements. In particular, we show
that it is sample efficient enough to solve real-world robot tasks in only a
handful of hours, that it is robust to hyperparameters, and that it works on a
variety of simulated environments with a single set of hyperparameters.
In addition to the desired algorithmic properties, experimentation in the
real world places additional requirements on the implementation. Our release
supports many of the features that we have found crucial when learning with
real robots, perhaps most importantly:
- Asynchronous Sampling. Inference needs to be fast to minimize delay in the control loop, and we typically want to keep training during environment resets too. Therefore, data sampling and training should run in independent threads or processes (a minimal sketch of this pattern follows after the list).
- Stop / Resume Training. When working with real hardware, whatever can go wrong will go wrong, so we should expect constant interruptions in the data stream.
- Action Smoothing. Typical Gaussian exploration makes the actuators jitter at high frequency, potentially damaging the hardware, so temporally correlating the exploration is important.
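To make the asynchronous sampling pattern concrete, here is a minimal sketch in Python: environment interaction and gradient updates run in separate threads that share a thread-safe replay buffer. The environment, policy, and update step below are placeholders, and this is only an illustrative pattern, not the softlearning implementation.

```python
# Minimal sketch of asynchronous sampling: data collection and training run in
# separate threads sharing a thread-safe replay buffer. Environment, policy,
# and update step are placeholders.
import random
import threading
import time
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self._storage = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def __len__(self):
        return len(self._storage)

    def add(self, transition):
        with self._lock:
            self._storage.append(transition)

    def sample(self, batch_size):
        with self._lock:
            if len(self._storage) < batch_size:
                return None
            return random.sample(list(self._storage), batch_size)

def sampler(buffer, stop_event):
    """Collects transitions from the (placeholder) environment."""
    state = 0.0
    while not stop_event.is_set():
        action = random.gauss(0.0, 1.0)          # placeholder policy
        next_state, reward = state + action, -abs(state)
        buffer.add((state, action, reward, next_state))
        state = next_state
        time.sleep(0.01)                          # ~100 Hz control loop

def trainer(buffer, stop_event):
    """Runs gradient updates whenever a batch is available."""
    while not stop_event.is_set():
        batch = buffer.sample(batch_size=256)
        if batch is not None:
            pass  # placeholder for one gradient step on `batch`
        time.sleep(0.001)

buffer, stop = ReplayBuffer(), threading.Event()
threads = [threading.Thread(target=sampler, args=(buffer, stop)),
           threading.Thread(target=trainer, args=(buffer, stop))]
for t in threads:
    t.start()
time.sleep(1.0)   # placeholder training duration
stop.set()
for t in threads:
    t.join()
print(f"collected {len(buffer)} transitions")
```

Because the training thread only reads from the buffer, it can keep running through environment resets and interruptions in the data stream, which is exactly what the requirements above call for.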
Soft Actor-Critic
Soft actor-critic is based on the maximum entropy reinforcement learning
framework, which considers the entropy augmented objective
$$J(\pi) = \mathbb{E}_{\pi} \left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) - \alpha \log \pi(\mathbf{a}_t \mid \mathbf{s}_t) \right],$$
where $\mathbf{s}_t$ and $\mathbf{a}_t$ are the state and the action, and the
expectation is taken over the policy and the true dynamics of the system. In
other words, the optimal policy not only maximizes the expected return (first
summand) but also its own expected entropy (second summand). The trade-off
between the two is controlled by the non-negative temperature parameter
$\alpha$, and we can always recover the conventional, maximum expected return
objective by setting $\alpha = 0$. In a technical report, we show that we can
view this objective as an entropy constrained maximization of the expected
return, and learn the temperature parameter automatically instead of treating
it as a hyperparameter.
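As a rough illustration of that constrained view (the exact formulation and derivation are in the technical report), the temperature can be adjusted by minimizing an objective of roughly the following form, where $\bar{\mathcal{H}}$ denotes a target entropy:

$$J(\alpha) = \mathbb{E}_{\mathbf{a}_t \sim \pi} \left[ -\alpha \log \pi(\mathbf{a}_t \mid \mathbf{s}_t) - \alpha \bar{\mathcal{H}} \right].$$

Intuitively, the temperature increases when the policy's entropy falls below the target and decreases when it exceeds it.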
This objective can be interpreted in several ways. We can view the entropy term
as an uninformative (uniform) prior over the policy, but we can also view it as
a regularizer or as an attempt to trade off between exploration (maximize
entropy) and exploitation (maximize return). In our previous post, we gave
a broader overview and proposed applications that are unique to maximum entropy
RL, and a probabilistic view of the objective is discussed in a recent
tutorial. Soft actor-critic maximizes this objective by parameterizing a
Gaussian policy and a Q-function with a neural network, and optimizing them
using approximate dynamic programming. We defer further details of soft
actor-critic to the technical report. In this post, we will view the objective as
a grounded way to derive better reinforcement learning algorithms that perform
consistently and are sample efficient enough to be applied to real-world
robotic tasks, and, perhaps surprisingly, can yield state-of-the-art
performance under the conventional, maximum expected return objective (without
entropy regularization) in simulated benchmarks.
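For intuition about what optimizing this objective with approximate dynamic programming amounts to, here is a heavily simplified sketch of the two updates, written in PyTorch. It omits target networks, twin Q-functions, the tanh squashing of the Gaussian policy, and automatic temperature tuning, and it uses a randomly generated batch in place of replay data; see the technical report and the released implementations for the actual algorithm.

```python
# Simplified sketch of SAC's critic and actor updates (fixed temperature,
# no target networks, no tanh squashing). Not the released implementation.
import torch
import torch.nn as nn

obs_dim, act_dim, alpha, gamma = 8, 2, 0.2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))

def sample_action(obs):
    """Gaussian policy: outputs mean and log-std, samples with reparameterization."""
    mean, log_std = policy_net(obs).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.rsample()
    return action, dist.log_prob(action).sum(-1, keepdim=True)

# A fake batch of transitions (in practice sampled from the replay buffer).
obs = torch.randn(256, obs_dim)
act = torch.randn(256, act_dim)
rew = torch.randn(256, 1)
next_obs = torch.randn(256, obs_dim)

# Critic: regress Q(s, a) toward the entropy-augmented (soft) Bellman target.
with torch.no_grad():
    next_act, next_logp = sample_action(next_obs)
    target = rew + gamma * (q_net(torch.cat([next_obs, next_act], -1)) - alpha * next_logp)
q_loss = ((q_net(torch.cat([obs, act], -1)) - target) ** 2).mean()

# Actor: maximize the soft Q-value, i.e. expected Q plus policy entropy.
new_act, logp = sample_action(obs)
policy_loss = (alpha * logp - q_net(torch.cat([obs, new_act], -1))).mean()
```

The entropy term appears in both losses: it softens the Bellman backup for the critic and rewards the actor for staying stochastic, which is where the exploration benefits discussed above come from.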
Simulated Benchmarks
Before we jump into real-world experiments, we compare SAC on standard benchmark
tasks to other popular deep RL algorithms, deep deterministic policy gradient
(DDPG), twin delayed deep deterministic policy gradient (TD3), and proximal
policy optimization (PPO). The figures below compare the algorithms on three
challenging locomotion tasks, HalfCheetah, Ant, and Humanoid, from OpenAI Gym.
The solid lines depict the total average return and the shadings correspond to
the best and the worst trial over five random seeds. Soft actor-critic, shown
in blue, achieves the best performance and, even more importantly for
real-world applications, it also performs well in the worst case.
We have included more benchmark results in the technical report.
Deep RL in the Real World
We tested soft actor-critic in the real world by solving three tasks from
scratch without relying on simulation or demonstrations.
Our first real-world task involves the Minitaur robot, a small-scale quadruped
with eight direct-drive actuators. The action space consists of the swing angle
and the extension of each leg, which are then mapped to desired motor positions
and tracked with a PD controller. The observations include the motor angles as
well as roll and pitch angles and angular velocities of the base. This learning
task presents substantial challenges for real-world reinforcement learning. The
robot is underactuated, and must therefore delicately balance contact forces on
the legs to make forward progress. An untrained policy can lose balance and
fall, and too many falls will eventually damage the robot, making
sample-efficient learning essential. The video below illustrates the learned
skill. Although we trained our policy only on flat terrain, we then tested it on
varied terrains and obstacles. Because soft actor-critic learns robust policies
as a result of entropy maximization at training time, the policy readily
generalizes to these perturbations without any additional learning.
The Minitaur robot (Google, Tuomas Haarnoja, Sehoon Ha, Jie Tan, and
Sergey Levine).
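To illustrate the action interface described above, here is a rough, hypothetical sketch of how leg-space commands could be tracked by a PD controller at the motor level. The leg-to-motor mapping and the gains below are illustrative stand-ins, not the actual Minitaur controller.

```python
# Hypothetical illustration of tracking leg-space actions (swing, extension)
# with a PD controller at the motor level. Mapping and gains are stand-ins.
import numpy as np

KP, KD = 1.2, 0.015  # illustrative PD gains

def leg_to_motor_positions(swing, extension):
    """Hypothetical mapping from one leg's (swing, extension) to its two motor angles."""
    return np.array([extension + swing, extension - swing])

def pd_torques(desired_pos, motor_pos, motor_vel):
    """PD law: drive motors toward the desired positions while damping velocity."""
    return KP * (desired_pos - motor_pos) - KD * motor_vel

# One control step for a single leg (dummy state).
action = np.array([0.1, 0.6])                    # (swing, extension) from the policy
motor_pos, motor_vel = np.zeros(2), np.zeros(2)  # current motor state (placeholder)
desired = leg_to_motor_positions(*action)
print(pd_torques(desired, motor_pos, motor_vel))
```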
Our second real-world robotic task involves training a 3-finger dexterous
robotic hand to manipulate an object. The hand is based on the Dynamixel Claw
hand, discussed in another post. This hand has 9 DoFs, each controlled by a
Dynamixel servo-motor. The policy controls the hand by sending target joint
angle positions to the on-board PID controller. The manipulation task requires
the hand to rotate a "valve"-like object, as shown in the animation below. In
order to perceive the valve, the robot must use raw RGB images shown in the
inset at the bottom right. The robot must rotate the valve so that the colored
peg faces the right (see video below). The initial position of the valve is reset
uniformly at random for each episode, forcing the policy to learn to use the raw
RGB images to perceive the current valve orientation. A small motor is attached
to the valve to automate resets and to provide the ground truth position for the
determination of the reward function. The position of this motor is not provided
to the policy. This task is exceptionally challenging due to both the perception
challenges and the need to control a hand with 9 degrees of freedom.
Rotating a valve with a dexterous hand, learned directly from raw pixels
(UC Berkeley, Kristian Hartikainen, Vikash Kumar, Henry Zhu, Abhishek Gupta,
Tuomas Haarnoja, and Sergey Levine).
In the final task, we trained a 7-DoF Sawyer robot to stack Lego blocks. The
policy receives the joint positions and velocities, as well as the end-effector
force, as input and outputs torque commands for each of the seven joints. The
biggest challenge is to accurately align the studs before exerting a
downward force to overcome the friction between them.
Stacking Legos with Sawyer (UC Berkeley, Aurick Zhou, Tuomas Haarnoja, and
Sergey Levine).
Soft actor-critic solves all of these tasks quickly: the Minitaur
locomotion and the block-stacking tasks both take 2 hours, and the valve-turning
task from image observations takes 20 hours. We also learned a policy for the
valve-turning task without images by providing the actual valve position as an
observation to the policy. Soft actor-critic can learn this easier version of
the valve task in 3 hours. For comparison, prior work has used PPO to learn
the same task without images in 7.4 hours.
Conclusion
Soft actor-critic is a step towards feasible deep RL with real-world robots.
Work still needs to be done to scale up these methods to more challenging tasks,
but we believe we are getting closer to the critical point where deep RL can
become a practical solution for robotic tasks. Meanwhile, you can connect your
robot to our toolbox and get learning started!
Acknowledgements
We would like to thank the amazing teams at Google and UC
Berkeley—specifically Pieter Abbeel, Abhishek Gupta, Sehoon Ha, Vikash Kumar,
Sergey Levine, Jie Tan, George Tucker, Vincent Vanhoucke, Henry Zhu—who
contributed to the development of the algorithm, spent long days running
experiments, and provided the support and resources that made the project
possible.
Links:
- Project website
- Technical description of SAC
- softlearning (our robot learning toolbox, including a SAC implementation in TensorFlow)
- rlkit (another SAC implementation from UC Berkeley in PyTorch)