Proximal Policy Optimization (PPO)

Overview

PPO is one of the most popular deep reinforcement learning (DRL) algorithms. It runs reasonably fast by leveraging vectorized (parallel) environments and works naturally with different action spaces, so it supports a variety of games. It also has good sample efficiency compared to algorithms such as DQN.

Original paper:

Proximal Policy Optimization Algorithms

Reference resources:

All our PPO implementations below include the same code-level optimizations as openai/baselines' PPO. For how we matched these implementation details, see our blog post The 37 Implementation Details of Proximal Policy Optimization.

Implemented Variants

Variants Implemented	Description
`ppo.py`, docs	For classic control tasks like `CartPole-v1`.
`ppo_atari.py`, docs	For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
`ppo_continuous_action.py`, docs	For continuous action spaces. Also implements MuJoCo-specific code-level optimizations.
`ppo_atari_lstm.py`, docs	For Atari games using LSTM without stacked frames.
`ppo_atari_envpool.py`, docs	Uses the blazing fast Envpool Atari vectorized environment.
`ppo_procgen.py`, docs	For the Procgen environments.

Below are our single-file implementations of PPO:

`ppo.py`

ppo.py has the following features:

Works with the Box observation space of low-level features
Works with the Discrete action space
Works with envs like CartPole-v1

Usage

poetry install
python cleanrl/ppo.py --help
python cleanrl/ppo.py --env-id CartPole-v1

Explanation of the logged metrics

Running `python cleanrl/ppo.py` automatically records various metrics, such as actor and value losses, in TensorBoard. Below is the documentation for these metrics:

charts/episodic_return: episodic return of the game
charts/episodic_length: episodic length of the game
charts/SPS: number of steps per second
charts/learning_rate: the current learning rate
losses/value_loss: the mean value loss across all data points
losses/policy_loss: the mean policy loss across all data points
losses/entropy: the mean entropy value across all data points
losses/old_approx_kl: the approximate Kullback–Leibler divergence, measured by (-logratio).mean(), which corresponds to the k1 estimator in John Schulman's blog post on approximating KL
losses/approx_kl: a better alternative to old_approx_kl, measured by ((logratio.exp() - 1) - logratio).mean(), which corresponds to the k3 estimator in the same blog post
losses/clipfrac: the fraction of the training data that triggered the clipped objective
losses/explained_variance: the explained variance for the value function
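The debug metrics above can be reproduced on a toy minibatch. The following is a minimal sketch in plain Python rather than the tensor code used in ppo.py; the function names and the default `clip_coef=0.2` are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance


def ppo_debug_metrics(new_logprobs, old_logprobs, clip_coef=0.2):
    """Return (old_approx_kl, approx_kl, clipfrac) for one minibatch.

    clip_coef is an assumed clipping coefficient, not a value taken from the docs.
    """
    logratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    ratios = [math.exp(lr) for lr in logratios]
    # k1 estimator: mean of -logratio (can be negative on a finite sample)
    old_approx_kl = mean(-lr for lr in logratios)
    # k3 estimator: mean of (ratio - 1) - logratio (each term is non-negative)
    approx_kl = mean((r - 1.0) - lr for r, lr in zip(ratios, logratios))
    # fraction of samples whose ratio falls outside [1 - clip_coef, 1 + clip_coef]
    clipfrac = mean(1.0 if abs(r - 1.0) > clip_coef else 0.0 for r in ratios)
    return old_approx_kl, approx_kl, clipfrac


def explained_variance(value_preds, returns):
    """1 - Var[returns - value_preds] / Var[returns]; NaN if returns are constant."""
    var_y = pvariance(returns)
    if var_y == 0:
        return float("nan")
    residuals = [y - v for y, v in zip(returns, value_preds)]
    return 1.0 - pvariance(residuals) / var_y
```

An explained variance of 1 means the value function predicts the returns perfectly, while 0 means it does no better than predicting a constant.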

Implementation details

ppo.py is based on the "13 core implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:

Vectorized architecture (common/cmd_util.py#L22)
Orthogonal Initialization of Weights and Constant Initialization of biases (a2c/utils.py#L58)
The Adam Optimizer's Epsilon Parameter (ppo2/model.py#L100)
Adam Learning Rate Annealing (ppo2/ppo2.py#L133-L135)
Generalized Advantage Estimation (ppo2/runner.py#L56-L65)
Mini-batch Updates (ppo2/ppo2.py#L157-L166)
Normalization of Advantages (ppo2/model.py#L139)
Clipped surrogate objective (ppo2/model.py#L81-L86)
Value Function Loss Clipping (ppo2/model.py#L68-L75)
Overall Loss and Entropy Bonus (ppo2/model.py#L91)
Global Gradient Clipping (ppo2/model.py#L102-L108)
Debug variables (ppo2/model.py#L115-L116)
Separate MLP networks for policy and value functions (common/policies.py#L156-L160, baselines/common/models.py#L75-L103)
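Three of the details above (Generalized Advantage Estimation, normalization of advantages, and the clipped surrogate objective) can be sketched in plain Python for a single environment. This is an illustration under stated assumptions, not the code in ppo.py: the function names, the `terminated` convention (1.0 when the episode ends after step t), and the defaults `gamma=0.99`, `gae_lambda=0.95`, and `clip_coef=0.2` are all assumed.

```python
from statistics import mean, stdev


def compute_gae(rewards, values, terminated, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout of one environment.

    terminated[t] == 1.0 means the episode ended after the action at step t,
    so no value is bootstrapped across that boundary (assumed convention).
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_adv = 0.0
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        nonterminal = 1.0 - terminated[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        last_adv = delta + gamma * gae_lambda * nonterminal * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    mu = mean(advantages)
    sigma = stdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]


def clipped_policy_loss(ratios, advantages, clip_coef=0.2):
    """Mean of max(-A * r, -A * clip(r, 1 - c, 1 + c)) over the minibatch."""
    losses = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_coef), 1.0 + clip_coef)
        losses.append(max(-a * r, -a * clipped_r))
    return mean(losses)
```

Taking the maximum of the two negated objectives is the pessimistic choice: the policy gains nothing from pushing the probability ratio beyond the clipping range in the direction the advantage favors.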

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment	`ppo.py`	`openai/baselines`' PPO (Huang et al., 2022)¹
CartPole-v1	492.40 ± 13.05	497.54 ± 4.02
Acrobot-v1	-89.93 ± 6.34	-81.82 ± 5.58
MountainCar-v0	-200.00 ± 0.00	-200.00 ± 0.00

Learning curves:

Tracked experiments and game play videos:

Video tutorial

If you'd like to learn ppo.py in-depth, consider checking out the following video tutorial:

`ppo_atari.py`

ppo_atari.py has the following features:

For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
Works with Atari's pixel Box observation space of shape (210, 160, 3)
Works with the Discrete action space

Usage

poetry install -E atari
python cleanrl/ppo_atari.py --help
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4

Explanation of the logged metrics

See related docs for ppo.py.

Implementation details

ppo_atari.py is based on the "9 Atari implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:

The Use of NoopResetEnv (common/atari_wrappers.py#L12)
The Use of MaxAndSkipEnv (common/atari_wrappers.py#L97)
The Use of EpisodicLifeEnv (common/atari_wrappers.py#L61)
The Use of FireResetEnv (common/atari_wrappers.py#L41)
The Use of WarpFrame (Image transformation) (common/atari_wrappers.py#L134)
The Use of ClipRewardEnv (common/atari_wrappers.py#L125)
The Use of FrameStack (common/atari_wrappers.py#L188)
Shared Nature-CNN network for the policy and value functions (common/policies.py#L157, common/models.py#L15-L26)
Scaling the Images to Range [0, 1] (common/models.py#L19)
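The core arithmetic behind three of these details (max-and-skip pooling, reward clipping, and pixel scaling) is small enough to sketch in plain Python. The helper names below are hypothetical and frames are represented as flat lists of pixel values; the real wrappers operate on image arrays inside the environment loop.

```python
def max_and_skip(frames):
    """MaxAndSkipEnv-style pooling over one skip window.

    `frames` holds the raw frames produced while repeating a single action;
    the returned observation is the element-wise max of the last two frames.
    """
    last_two = frames[-2:]
    return [max(pixels) for pixels in zip(*last_two)]


def clip_reward(reward):
    """ClipRewardEnv-style clipping: keep only the sign of the reward (-1, 0, or 1)."""
    return (reward > 0) - (reward < 0)


def scale_pixels(frame):
    """Map 8-bit pixel values from [0, 255] to [0, 1] before the network sees them."""
    return [p / 255.0 for p in frame]
```

Pooling over the last two frames guards against sprites that are drawn only on alternating frames, and sign clipping keeps reward magnitudes comparable across games.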

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

Below are the average episodic returns for ppo_atari.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment	`ppo_atari.py`	`openai/baselines`' PPO (Huang et al., 2022)¹
BreakoutNoFrameskip-v4	416.31 ± 43.92	406.57 ± 31.554
PongNoFrameskip-v4	20.59 ± 0.35	20.512 ± 0.50
BeamRiderNoFrameskip-v4	2445.38 ± 528.91	2642.97 ± 670.37

Learning curves:

Tracked experiments and game play videos:

Video tutorial

If you'd like to learn ppo_atari.py in-depth, consider checking out the following video tutorial:

`ppo_continuous_action.py`

ppo_continuous_action.py has the following features:

For continuous action spaces. Also implements MuJoCo-specific code-level optimizations.
Works with the Box observation space of low-level features
Works with the Box (continuous) action space

Usage

poetry install -E mujoco
python cleanrl/ppo_continuous_action.py --help
python cleanrl/ppo_continuous_action.py --env-id Hopper-v2

Explanation of the logged metrics

See related docs for ppo.py.

Implementation details

ppo_continuous_action.py is based on the "9 details for continuous action domains (e.g. Mujoco)" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:

Continuous actions via normal distributions (common/distributions.py#L103-L104)
State-independent log standard deviation (common/distributions.py#L104)
Independent action components (common/distributions.py#L238-L246)
Separate MLP networks for policy and value functions (common/policies.py#L160, baselines/common/models.py#L75-L103)
Handling of action clipping to valid range and storage (common/cmd_util.py#L99-L100)
Normalization of Observation (common/vec_env/vec_normalize.py#L4)
Observation Clipping (common/vec_env/vec_normalize.py#L39)
Reward Scaling (common/vec_env/vec_normalize.py#L28)
Reward Clipping (common/vec_env/vec_normalize.py#L32)
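The first few details above describe a diagonal Gaussian policy: each action component is sampled from its own normal distribution, the log standard deviation is a parameter that does not depend on the state, and the joint log-probability is the sum over components. A minimal sketch in plain Python follows; the function names are hypothetical, and the choice to clip only the action sent to the environment (while keeping the unclipped sample for the update) is this sketch's reading of the "action clipping and storage" detail.

```python
import math
import random


def sample_action(mean, logstd, rng):
    """Draw each action component independently from N(mean_i, exp(logstd_i)^2)."""
    return [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(mean, logstd)]


def diag_gaussian_logprob(action, mean, logstd):
    """Sum of per-component normal log-densities (independent components)."""
    total = 0.0
    for a, mu, ls in zip(action, mean, logstd):
        std = math.exp(ls)
        total += -0.5 * ((a - mu) / std) ** 2 - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def clip_action(action, low, high):
    """Clip each component to the valid range before passing it to the environment."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


rng = random.Random(0)
mean = [0.0, 0.5]            # would come from the policy network for the current state
logstd = [0.0, -1.0]         # state-independent parameter, shared across all states
raw_action = sample_action(mean, logstd, rng)               # stored in the rollout
env_action = clip_action(raw_action, [-1.0, -1.0], [1.0, 1.0])  # sent to the env
logprob = diag_gaussian_logprob(raw_action, mean, logstd)
```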

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

Below are the average episodic returns for ppo_continuous_action.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment	`ppo_continuous_action.py`	`openai/baselines`' PPO (Huang et al., 2022)¹
Hopper-v2	2231.12 ± 656.72	2518.95 ± 850.46
Walker2d-v2	3050.09 ± 1136.21	3208.08 ± 1264.37
HalfCheetah-v2	1822.82 ± 928.11	2152.26 ± 1159.84

Learning curves:

Tracked experiments and game play videos:

Video tutorial

If you'd like to learn ppo_continuous_action.py in-depth, consider checking out the following video tutorial:

`ppo_atari_lstm.py`

ppo_atari_lstm.py has the following features:

For Atari games using LSTM without stacked frames. It uses convolutional layers and common Atari-based pre-processing techniques.
Works with Atari's pixel Box observation space of shape (210, 160, 3)
Works with the Discrete action space

Usage

poetry install -E atari
python cleanrl/ppo_atari_lstm.py --help
python cleanrl/ppo_atari_lstm.py --env-id BreakoutNoFrameskip-v4

Explanation of the logged metrics

See related docs for ppo.py.

Implementation details

ppo_atari_lstm.py is based on the "5 LSTM implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:

Layer initialization for LSTM layers ( a2c/utils.py#L84-L86)
Initialize the LSTM states to be zeros ( common/models.py#L179)
Reset LSTM states at the end of the episode ( common/models.py#L141)
Prepare sequential rollouts in mini-batches ( a2c/utils.py#L81)
Reconstruct LSTM states during training ( a2c/utils.py#L81)

To help test the memory, we remove the 4 stacked frames from the observation (i.e., we use env = gym.wrappers.FrameStack(env, 1) instead of env = gym.wrappers.FrameStack(env, 4) as in ppo_atari.py).
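Resetting the recurrent state at episode boundaries, and reconstructing the same states during training, can be illustrated with a toy recurrent cell. In this sketch `math.tanh(h + x)` stands in for the LSTM cell, and the convention that `dones[t] == 1.0` marks observation t as the start of a new episode is an assumption; the function name is hypothetical.

```python
import math


def run_recurrent_rollout(observations, dones, initial_state=0.0):
    """Replay a sequence through a toy recurrent cell, zeroing state on episode starts.

    dones[t] == 1.0 means observation t begins a new episode (assumed convention).
    """
    h = initial_state
    states = []
    for x, d in zip(observations, dones):
        h = (1.0 - d) * h        # reset the state to zeros at an episode boundary
        h = math.tanh(h + x)     # toy stand-in for the LSTM cell update
        states.append(h)
    return states


# During the rollout, keep the state that the sequence started from...
rollout_states = run_recurrent_rollout([1.0, 1.0, 1.0], [0.0, 0.0, 1.0])
# ...so that training can replay the same sequence and reconstruct identical states.
training_states = run_recurrent_rollout([1.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```

Because the replay depends on step order, mini-batches for a recurrent policy have to keep each environment's steps together as a sequence rather than shuffling individual transitions.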

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

Below are the average episodic returns for ppo_atari_lstm.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment	`ppo_atari_lstm.py`	`openai/baselines`' PPO (Huang et al., 2022)¹
BreakoutNoFrameskip-v4	128.92 ± 31.10	138.98 ± 50.76
PongNoFrameskip-v4	19.78 ± 1.58	19.79 ± 0.67
BeamRiderNoFrameskip-v4	1536.20 ± 612.21	1591.68 ± 372.95

Learning curves:

Tracked experiments and game play videos:

`ppo_atari_envpool.py`

ppo_atari_envpool.py has the following features:

Uses the blazing fast Envpool vectorized environment.
For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
Works with Atari's pixel Box observation space of shape (210, 160, 3)
Works with the Discrete action space

Usage

poetry install -E envpool
python cleanrl/ppo_atari_envpool.py --help
python cleanrl/ppo_atari_envpool.py --env-id Breakout-v5

Explanation of the logged metrics

See related docs for ppo.py.

Implementation details

ppo_atari_envpool.py uses a customized RecordEpisodeStatistics to work with Envpool but otherwise has the same implementation details as ppo_atari.py (see related docs).
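The idea behind an episode-statistics tracker for a batched environment can be sketched as follows. This is not the customized RecordEpisodeStatistics used in ppo_atari_envpool.py; the class name `VecEpisodeStats` and its interface are hypothetical and only show the bookkeeping involved when rewards and done flags arrive for all environments at once.

```python
class VecEpisodeStats:
    """Hypothetical helper that accumulates per-environment returns and lengths."""

    def __init__(self, num_envs):
        self.returns = [0.0] * num_envs
        self.lengths = [0] * num_envs

    def step(self, rewards, dones):
        """Record one vectorized step.

        Returns a list of (env_index, episodic_return, episodic_length) tuples
        for the environments whose episode finished on this step.
        """
        finished = []
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.returns[i] += reward
            self.lengths[i] += 1
            if done:
                finished.append((i, self.returns[i], self.lengths[i]))
                # start counting the next episode from zero
                self.returns[i] = 0.0
                self.lengths[i] = 0
        return finished
```

The finished-episode tuples are the values that would feed metrics such as charts/episodic_return and charts/episodic_length.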

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

Below are the average episodic returns for ppo_atari_envpool.py. Note that it has similar sample efficiency to ppo_atari.py but runs about 3x faster.

Environment	`ppo_atari_envpool.py` (~80 mins)	`ppo_atari.py` (~220 mins)
BreakoutNoFrameskip-v4	389.57 ± 29.62	416.31 ± 43.92
PongNoFrameskip-v4	20.55 ± 0.37	20.59 ± 0.35
BeamRiderNoFrameskip-v4	2039.83 ± 1146.62	2445.38 ± 528.91

Learning curves:

Tracked experiments and game play videos:

`ppo_procgen.py`

ppo_procgen.py has the following features:

For the Procgen environments
Uses an IMPALA-style neural network
Works with the Discrete action space

Usage

poetry install -E procgen
python cleanrl/ppo_procgen.py --help
python cleanrl/ppo_procgen.py --env-id starpilot

Explanation of the logged metrics

See related docs for ppo.py.

Implementation details

ppo_procgen.py is based on the details in the "Appendix" of The 37 Implementation Details of Proximal Policy Optimization, which are as follows:

IMPALA-style Neural Network ( common/models.py#L28)

Experiment results

To run benchmark experiments, see benchmark/ppo.sh.

We try to match the default settings in openai/train-procgen, except that we use the easy distribution mode and total_timesteps=25e6 to save compute. Note that openai/train-procgen has the following settings:

Learning rate annealing is turned off by default
Reward scaling and reward clipping are used
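Reward scaling and clipping of this kind divide each reward by a running estimate of the standard deviation of the discounted return and then clip the result. The sketch below is a simplified single-environment version in plain Python; the class name, the Welford-style variance update, and the defaults `gamma=0.99` and `clip=10.0` are assumptions, and the initialization differs from the running statistics used in openai/baselines.

```python
import math


class RewardScaler:
    """Simplified sketch: scale rewards by the running std of the discounted return."""

    def __init__(self, gamma=0.99, clip=10.0, eps=1e-8):
        self.gamma, self.clip, self.eps = gamma, clip, eps
        self.ret = 0.0                      # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def __call__(self, reward, done):
        self.ret = self.ret * self.gamma + reward
        # Welford update of the mean and variance of the discounted return
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        var = self.m2 / self.count
        scaled = reward / math.sqrt(var + self.eps)
        if done:
            self.ret = 0.0                  # the next episode starts a fresh return
        return max(-self.clip, min(self.clip, scaled))
```

Note that the reward is only divided by the standard deviation and is not mean-centered, so its sign is preserved.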

Below are the average episodic returns for ppo_procgen.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.

Environment	`ppo_procgen.py`	`openai/baselines`' PPO (Huang et al., 2022)¹
StarPilot	31.40 ± 11.73	33.97 ± 7.86
BossFight	9.09 ± 2.35	9.35 ± 2.04
BigFish	21.44 ± 6.73	20.06 ± 5.34

Learning curves:

Tracked experiments and game play videos:

¹ Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR 2022 Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/