Validation

The repository includes example scripts for training and evaluating SB3 models with compressed buffers. They are intended to verify that the buffer classes can be used with minimal change to normal SB3 training code. Browse the examples in the examples directory.

Training setup

The runs below used the same Atari environments and hyperparameters as the presets on Huggingface: sb3/ppo-PongNoFrameskip-v4, sb3/ppo-MsPacmanNoFrameskip-v4, and sb3/dqn-MsPacmanNoFrameskip-v4.

Hardware: M4 Macbook Air
Software: Stable-Baselines3 2.7.0, sb3-extra-buffers 0.4.3
Device: mps
Training length: 10M environment steps

For end-to-end wall-clock time comparing default SB3 buffers with CompressedRolloutBuffer / CompressedReplayBuffer using zstd-3, see Training speed.

Evaluation results for example training scripts

The example scripts have been run and evaluated to confirm they train correctly. Each run below used rle-jit compression.

PPO on PongNoFrameskip-v4, no frame stack:

(Best ) Evaluated 10000 episodes, mean reward: 21.0 +/- 0.00
Q1:   21 | Q2:   21 | Q3:   21 | Relative IQR: 0.00 | Min: 21 | Max: 21
(Final) Evaluated 10000 episodes, mean reward: 21.0 +/- 0.02
Q1:   21 | Q2:   21 | Q3:   21 | Relative IQR: 0.00 | Min: 20 | Max: 21

PPO on MsPacmanNoFrameskip-v4, with frame stack 4:

(Best ) Evaluated 10000 episodes, mean reward: 2667.0 +/- 290.00
Q1: 2300 | Q2: 2490 | Q3: 3000 | Relative IQR: 0.28 | Min: 2300 | Max: 3000
(Final) Evaluated 10000 episodes, mean reward: 2500.9 +/- 221.03
Q1: 2300 | Q2: 2390 | Q3: 2490 | Relative IQR: 0.08 | Min: 1420 | Max: 3000

DQN on MsPacmanNoFrameskip-v4, with frame stack 4:

(Best ) Evaluated 10000 episodes, mean reward: 3300.0 +/- 770.79
Q1: 2490 | Q2: 4020 | Q3: 4020 | Relative IQR: 0.38 | Min: 2460 | Max: 4020
(Final) Evaluated 10000 episodes, mean reward: 3379.2 +/- 453.78
Q1: 2690 | Q2: 3400 | Q3: 3880 | Relative IQR: 0.35 | Min: 1230 | Max: 4090