Benchmarks

This benchmark compared compressed buffers against the Stable-Baselines3 (pre-2.7.1) baseline on MsPacmanNoFrameskip-v4.

Benchmark setup

Frame stack: 4.
Vectorized environments: 4.
Buffer size: 40,000 transitions, split across vectorized environments.
Hardware: M4 MacBook Air.
Saving test: the same observations are added to each buffer for fairness, uses replay buffer. Check example_eval_memory_saving.py.
Loading test: rollout-buffer trajectories are sampled with batch size 64, uses rollout buffer. Check example_eval_memory_loading.py.
Baseline: SB3 ReplayBuffer or RolloutBuffer without compression.

Important caveats:

SB3’s RolloutBuffer stores observations as np.float32, which is 4x larger than np.uint8 observations. (Fixed in SB3 2.7.1)
igzip does not benefit from Intel SIMD acceleration on Apple Silicon.
Transfer latency between CPU and GPU may differ on other systems.

Summary

The main takeaway is that zstd gives a strong balance between memory saving and latency. zstd-1 through zstd-5 are good first choices, and we recommend zstd-3 as a practical default. gzip0 should usually be avoided because it keeps similar latency to zstd-5 while using much more memory.

Run-length encoding is more data-dependent. On the MsPacman benchmark, the 84x84 observations are noisy enough that rle saves much less memory than zstd, although decompression remains usable.

Results

Compression	Save Mem	Save Mem %	Save Latency	Load Mem	Load Mem %	Load Latency
baseline	1.05GB	100.0%	0.9	4.21GB	100.0%	5.21
none	1.05GB	100.1%	1.2	1.05GB	25.0%	8.70
zstd-100	387MB	36.0%	1.8	413MB	9.6%	9.08
zstd-50	306MB	28.4%	1.9	326MB	7.6%	8.95
zstd-5	82.9MB	7.7%	2.1	89.1MB	2.1%	8.80
lz4-frame/1	118MB	10.9%	2.1	127MB	2.9%	8.86
zstd-20	181MB	16.8%	2.2	189MB	4.4%	8.91
zstd-3	73.9MB	6.9%	2.3	78.7MB	1.8%	8.81
zstd-1	66.0MB	6.1%	2.3	70.0MB	1.6%	8.79
zstd1	61.3MB	5.7%	2.7	64.7MB	1.5%	8.90
zstd3	59.4MB	5.5%	3.0	63.1MB	1.5%	8.91
igzip0	129MB	12.0%	3.4	136MB	3.1%	9.60
rle	811MB	75.3%	4.0	849MB	19.7%	14.7
rle-jit	811MB	75.3%	4.0	849MB	19.7%	9.10
rle-old	811MB	75.3%	4.0	849MB	19.7%	104
lz4-block/1	83.2MB	7.7%	4.6	89.8MB	2.1%	8.73
igzip1	114MB	10.6%	5.0	121MB	2.8%	9.66
zstd5	55.9MB	5.2%	5.4	59.3MB	1.4%	8.90
lz4-block/5	75.1MB	7.0%	6.3	80.1MB	1.9%	8.76
lz4-frame/5	75.9MB	7.0%	6.5	80.8MB	1.9%	8.72
gzip1	104MB	9.6%	7.6	108MB	2.5%	9.75
gzip3	81.9MB	7.6%	8.3	85.9MB	2.0%	9.44
igzip3	81.5MB	7.6%	10.5	87.0MB	2.0%	9.59
zstd10	52.8MB	4.9%	10.8	56.5MB	1.3%	8.89
lz4-block/9	72.0MB	6.7%	20.0	76.9MB	1.8%	8.69
lz4-frame/9	72.7MB	6.8%	20.0	77.6MB	1.8%	8.74
lz4-block/16	71.3MB	6.6%	57.9	76.2MB	1.8%	8.69
lz4-frame/12	72.0MB	6.7%	58.4	77.0MB	1.8%	8.77
zstd15	48.5MB	4.5%	99.8	52.0MB	1.2%	8.86
zstd22	47.6MB	4.4%	590.7	51.0MB	1.2%	8.96

Training validation

The table above measures buffer memory and sampling latency. To confirm that compressed buffers work end-to-end in SB3 training, see Validation for evaluation results from the example PPO and DQN scripts (10M steps on Atari with rle-jit). For wall-clock training time with zstd-3 versus the default SB3 buffers, see Training speed.