Training speed

This page reports end-to-end training wall-clock time when using the default Stable-Baselines3 buffers versus the compressed buffer classes from this package. The runs mirror the example scripts example_train_rollout.py and example_train_replay.py, with only the buffer class and compression method changed for the comparison.

Benchmark setup

Hardware

  • Mac mini (Apple M4, 16 GB RAM)

  • device="mps"

Libraries

  • Stable-Baselines3 2.8.0

  • PyTorch 2.12.0

  • sb3-extra-buffers 0.5.1

Each run trained for 10M environment steps. Compressed buffers used zstd-3 with dtypes from find_buffer_dtypes().

PPO on PongNoFrameskip-v4

Hyperparameters follow the preset on Huggingface: sb3/ppo-PongNoFrameskip-v4 and check example_train_rollout.py for code:

  • Frame stack: 1 (no stacking)

  • n_envs: 8 (train and eval)

  • n_steps: 128

  • batch_size: 256

  • n_epochs: 4

  • learning_rate: linear schedule from 2.5e-4 to 0

  • clip_range: linear schedule from 0.1 to 0

  • ent_coef: 0.01

  • vf_coef: 0.5

  • gae_lambda: 0.9

  • gamma: 0.99

  • max_grad_norm: 0.5

  • Policy: CnnPolicy with normalize_images=False

Buffer

Wall-clock time

SB3 RolloutBuffer

2:56:16

CompressedRolloutBuffer (zstd-3)

4:48:41

On this setup, zstd-3 rollout compression added roughly 65% training time over the default buffer while keeping the same SB3 training loop.

DQN on MsPacmanNoFrameskip-v4

Hyperparameters follow the preset on Huggingface: sb3/dqn-MsPacmanNoFrameskip-v4 and check example_train_replay.py for code:

  • Frame stack: 4

  • n_envs: 1 (train), 8 (eval)

  • buffer_size: 100_000

  • batch_size: 32

  • learning_starts: 100_000

  • train_freq: 4

  • gradient_steps: 1

  • target_update_interval: 1000

  • learning_rate: 1e-4

  • exploration_fraction: 0.1

  • exploration_final_eps: 0.01

  • Policy: CnnPolicy

Buffer

Wall-clock time

SB3 ReplayBuffer

12:33:26

CompressedReplayBuffer (zstd-3)

12:44:16

For DQN replay compression, zstd-3 added only about 11 minutes (~1.4%) on top of a 12.5-hour run. Off-policy algorithms spend more time in gradient updates than in buffer I/O, so compression overhead is smaller than for PPO.

See also

  • Benchmarks for per-transition buffer memory and sampling latency

  • Validation for reward results from the example training scripts