Benchmarks
==========

This benchmark compared compressed buffers against the Stable-Baselines3
(pre-2.7.1) baseline on ``MsPacmanNoFrameskip-v4``.

Benchmark setup
---------------

- Frame stack: ``4``.
- Vectorized environments: ``4``.
- Buffer size: ``40,000`` transitions, split across vectorized environments.
- Hardware: M4 MacBook Air.
- Saving test: the same observations are added to each buffer for fairness, uses replay buffer. Check :download:`example_eval_memory_saving.py <../examples/example_eval_memory_saving.py>`.
- Loading test: rollout-buffer trajectories are sampled with batch size ``64``, uses rollout buffer. Check :download:`example_eval_memory_loading.py <../examples/example_eval_memory_loading.py>`.
- Baseline: SB3 ``ReplayBuffer`` or ``RolloutBuffer`` without compression.

Important caveats:

- SB3's ``RolloutBuffer`` stores observations as ``np.float32``, which is 4x
  larger than ``np.uint8`` observations. (Fixed in SB3 2.7.1)
- ``igzip`` does not benefit from Intel SIMD acceleration on Apple Silicon.
- Transfer latency between CPU and GPU may differ on other systems.

Summary
-------

The main takeaway is that ``zstd`` gives a strong balance between memory saving
and latency. ``zstd-1`` through ``zstd-5`` are good first choices, and we
recommend ``zstd-3`` as a practical default. ``gzip0`` should usually be
avoided because it keeps similar latency to ``zstd-5`` while using much more
memory.

Run-length encoding is more data-dependent. On the MsPacman benchmark, the
``84x84`` observations are noisy enough that ``rle`` saves much less memory than
``zstd``, although decompression remains usable.

Results
-------

.. list-table::
   :header-rows: 1

   * - Compression
     - Save Mem
     - Save Mem %
     - Save Latency
     - Load Mem
     - Load Mem %
     - Load Latency
   * - baseline
     - 1.05GB
     - 100.0%
     - 0.9
     - 4.21GB
     - 100.0%
     - 5.21
   * - none
     - 1.05GB
     - 100.1%
     - 1.2
     - 1.05GB
     - 25.0%
     - 8.70
   * - zstd-100
     - 387MB
     - 36.0%
     - 1.8
     - 413MB
     - 9.6%
     - 9.08
   * - zstd-50
     - 306MB
     - 28.4%
     - 1.9
     - 326MB
     - 7.6%
     - 8.95
   * - zstd-5
     - 82.9MB
     - 7.7%
     - 2.1
     - 89.1MB
     - 2.1%
     - 8.80
   * - lz4-frame/1
     - 118MB
     - 10.9%
     - 2.1
     - 127MB
     - 2.9%
     - 8.86
   * - zstd-20
     - 181MB
     - 16.8%
     - 2.2
     - 189MB
     - 4.4%
     - 8.91
   * - zstd-3
     - 73.9MB
     - 6.9%
     - 2.3
     - 78.7MB
     - 1.8%
     - 8.81
   * - zstd-1
     - 66.0MB
     - 6.1%
     - 2.3
     - 70.0MB
     - 1.6%
     - 8.79
   * - zstd1
     - 61.3MB
     - 5.7%
     - 2.7
     - 64.7MB
     - 1.5%
     - 8.90
   * - zstd3
     - 59.4MB
     - 5.5%
     - 3.0
     - 63.1MB
     - 1.5%
     - 8.91
   * - igzip0
     - 129MB
     - 12.0%
     - 3.4
     - 136MB
     - 3.1%
     - 9.60
   * - rle
     - 811MB
     - 75.3%
     - 4.0
     - 849MB
     - 19.7%
     - 14.7
   * - rle-jit
     - 811MB
     - 75.3%
     - 4.0
     - 849MB
     - 19.7%
     - 9.10
   * - rle-old
     - 811MB
     - 75.3%
     - 4.0
     - 849MB
     - 19.7%
     - 104
   * - lz4-block/1
     - 83.2MB
     - 7.7%
     - 4.6
     - 89.8MB
     - 2.1%
     - 8.73
   * - igzip1
     - 114MB
     - 10.6%
     - 5.0
     - 121MB
     - 2.8%
     - 9.66
   * - zstd5
     - 55.9MB
     - 5.2%
     - 5.4
     - 59.3MB
     - 1.4%
     - 8.90
   * - lz4-block/5
     - 75.1MB
     - 7.0%
     - 6.3
     - 80.1MB
     - 1.9%
     - 8.76
   * - lz4-frame/5
     - 75.9MB
     - 7.0%
     - 6.5
     - 80.8MB
     - 1.9%
     - 8.72
   * - gzip1
     - 104MB
     - 9.6%
     - 7.6
     - 108MB
     - 2.5%
     - 9.75
   * - gzip3
     - 81.9MB
     - 7.6%
     - 8.3
     - 85.9MB
     - 2.0%
     - 9.44
   * - igzip3
     - 81.5MB
     - 7.6%
     - 10.5
     - 87.0MB
     - 2.0%
     - 9.59
   * - zstd10
     - 52.8MB
     - 4.9%
     - 10.8
     - 56.5MB
     - 1.3%
     - 8.89
   * - lz4-block/9
     - 72.0MB
     - 6.7%
     - 20.0
     - 76.9MB
     - 1.8%
     - 8.69
   * - lz4-frame/9
     - 72.7MB
     - 6.8%
     - 20.0
     - 77.6MB
     - 1.8%
     - 8.74
   * - lz4-block/16
     - 71.3MB
     - 6.6%
     - 57.9
     - 76.2MB
     - 1.8%
     - 8.69
   * - lz4-frame/12
     - 72.0MB
     - 6.7%
     - 58.4
     - 77.0MB
     - 1.8%
     - 8.77
   * - zstd15
     - 48.5MB
     - 4.5%
     - 99.8
     - 52.0MB
     - 1.2%
     - 8.86
   * - zstd22
     - 47.6MB
     - 4.4%
     - 590.7
     - 51.0MB
     - 1.2%
     - 8.96

Training validation
-------------------

The table above measures buffer memory and sampling latency. To confirm that
compressed buffers work end-to-end in SB3 training, see
:doc:`validation` for evaluation results from the example PPO and DQN scripts
(10M steps on Atari with ``rle-jit``). For wall-clock training time with
``zstd-3`` versus the default SB3 buffers, see :doc:`speed`.