GPU Buffers
===========

.. warning::

   **Advanced / experimental.** :mod:`sb3_extra_buffers.gpu_buffers` targets users who
   want observation tensors to live on a Torch device (CPU, CUDA, or MPS) with optional
   compression. The API is less stable than :mod:`sb3_extra_buffers.compressed`, behaviour
   varies by device and codec, and not every CPU compression backend has a GPU-native
   implementation yet.

.. caution::

   **You are responsible for heap sizing.** Compressed observations are stored in a single
   packed byte heap (:class:`~sb3_extra_buffers.gpu_buffers.raw_buffer.RawBuffer`) with
   per-cell ``start_idx`` / ``lengths``. If the heap runs out of space, compression raises
   an error. By default, buffers set heap capacity via
   :func:`~sb3_extra_buffers.gpu_buffers.utils.estimate_total_heap_bytes` (per-cell
   heuristic × number of cells). For large or high-entropy observations, set
   ``heap_bytes`` explicitly on :class:`~sb3_extra_buffers.gpu_buffers.gpu_replay.GpuReplayBuffer`
   or :class:`~sb3_extra_buffers.gpu_buffers.gpu_rollout.GpuRolloutBuffer` and validate with
   your observation distribution. The deprecated ``max_slot_bytes`` argument is interpreted
   as per-cell capacity and multiplied by the cell count.

Overview
--------

GPU buffers mirror the CPU compressed replay/rollout API but keep flattened observations on
``buffer_device`` end-to-end (add → store → sample/get). Non-observation fields (actions,
rewards, dones, etc.) remain NumPy arrays like the CPU buffers.

**Public entry points**

- :class:`~sb3_extra_buffers.gpu_buffers.gpu_replay.GpuReplayBuffer` — off-policy (DQN, SAC, …)
- :class:`~sb3_extra_buffers.gpu_buffers.gpu_rollout.GpuRolloutBuffer` — on-policy (PPO, …)
- :func:`~sb3_extra_buffers.gpu_buffers.base.find_gpu_buffer_dtypes` — Torch ``elem_type`` / ``runs_type`` helpers
- :data:`~sb3_extra_buffers.gpu_buffers.compression_methods.COMPRESSION_METHOD_MAP` — registered codecs
- :func:`~sb3_extra_buffers.gpu_buffers.compression_methods.has_zstd` — optional Zstd backend probe

**Compression methods**

+------------------+--------------------+--------------------------------------------------+
| Method           | Storage            | Notes                                            |
+==================+====================+==================================================+
| ``none``         | Dense tensor       | Full-size flat obs on ``buffer_device``           |
+------------------+--------------------+--------------------------------------------------+
| ``rle``          | Packed raw heap    | Run-length encoding in device byte storage       |
+------------------+--------------------+--------------------------------------------------+
| ``zstd``         | Packed raw heap    | CPU Zstd round-trip into heap bytes; needs       |
|                  |                    | ``pip install "sb3-extra-buffers[zstd]"``        |
+------------------+--------------------+--------------------------------------------------+
| ``zstd3``, etc.  | Packed raw heap    | Shorthand for ``compresslevel`` (same as CPU)    |
+------------------+--------------------+--------------------------------------------------+

Example scripts (Pong, ``PongNoFrameskip-v4``):

- ``examples/example_train_gpu_replay.py`` / ``examples/example_watch_gpu_replay.py``
- ``examples/example_train_gpu_rollout.py`` / ``examples/example_watch_gpu_rollout.py``

.. automodule:: sb3_extra_buffers.gpu_buffers
   :members:
   :undoc-members:

Metadata
--------

.. automodule:: sb3_extra_buffers.gpu_buffers.metadata
   :members:

Raw storage
-----------

.. automodule:: sb3_extra_buffers.gpu_buffers.raw_buffer
   :members:

Observation stores
------------------

.. automodule:: sb3_extra_buffers.gpu_buffers.observation_store
   :members:

Base helpers
------------

.. automodule:: sb3_extra_buffers.gpu_buffers.base
   :members:

Replay buffer
-------------

.. automodule:: sb3_extra_buffers.gpu_buffers.gpu_replay
   :members:

Rollout buffer
--------------

.. automodule:: sb3_extra_buffers.gpu_buffers.gpu_rollout
   :members:

Utilities
---------

.. automodule:: sb3_extra_buffers.gpu_buffers.utils
   :members:

Compression methods
-------------------

.. automodule:: sb3_extra_buffers.gpu_buffers.compression_methods
   :members:

Usage
-----

Replay buffer with RLE on CUDA:

.. code-block:: python

   import torch as th
   from stable_baselines3 import DQN
   from sb3_extra_buffers.gpu_buffers import GpuReplayBuffer, find_gpu_buffer_dtypes
   from sb3_extra_buffers.gpu_buffers.utils import numpy_dtype_to_torch

   device = "cuda" if th.cuda.is_available() else "cpu"
   dtypes = find_gpu_buffer_dtypes(
       env.observation_space.shape,
       elem_dtype=numpy_dtype_to_torch(env.observation_space.dtype),
       compression_method="rle",
   )
   model = DQN(
       "CnnPolicy",
       env,
       replay_buffer_class=GpuReplayBuffer,
       replay_buffer_kwargs=dict(
           dtypes=dtypes,
           compression_method="rle",
           buffer_device=device,
           # heap_bytes=...,  # override if estimate_total_heap_bytes is too small
       ),
       device=device,
   )

Rollout buffer with optional Zstd (when installed):

.. code-block:: python

   from sb3_extra_buffers.gpu_buffers import GpuRolloutBuffer, has_zstd

   compression_method = "zstd" if has_zstd() else "rle"
   dtypes = find_gpu_buffer_dtypes(obs_shape, compression_method=compression_method)
   model = PPO(
       "CnnPolicy",
       env,
       rollout_buffer_class=GpuRolloutBuffer,
       rollout_buffer_kwargs=dict(
           dtypes=dtypes,
           compression_method=compression_method,
           buffer_device=device,
       ),
       device=device,
   )

Heap layout and compaction
--------------------------

Compressed observations share one :class:`~sb3_extra_buffers.gpu_buffers.raw_buffer.RawBuffer`
backed by :class:`~sb3_extra_buffers.gpu_buffers.raw_buffer.SharedRawHeap`. Each cell has a
``start_idx`` and ``lengths`` entry; codec metadata (``pos_runs``, ``pos_elem``, ``run_length``)
is stored relative to that cell's byte offset.

**Compaction** packs live payloads and resets ``data_end`` to roughly ``sum(lengths)``:

- :class:`~sb3_extra_buffers.gpu_buffers.gpu_replay.GpuReplayBuffer` compacts when the write
  cursor wraps after the buffer is full.
- :class:`~sb3_extra_buffers.gpu_buffers.gpu_rollout.GpuRolloutBuffer` compacts when the
  rollout buffer becomes full (before ``get()``).

Default heap capacity is ``n_cells * estimate_max_slot_bytes(...)`` where ``n_cells`` depends
on buffer geometry (replay vs rollout, ``n_envs``, shared next-obs storage). Per-cell size uses
MsPacman **Save Mem %** ratios from the README benchmark, fitted per codec family and level
(see ``scripts/fit_heap_heuristic.py``), scaled by ``flat_len * elem_size * overalloc_factor``
(default **1.5**). All documented levels are accepted; unmeasured levels use polynomial
extrapolation clamped to each family's benchmark min/max ratio.

Supported level ranges (invalid levels raise ``ValueError``):

- ``gzip``: 0–9
- ``igzip``: 0–3
- ``zstd``: 1–22 or -100–-1 (bare ``zstd`` uses level -3)
- ``lz4-frame``: negatives and 0–16
- ``lz4-block``: negatives, 0, and 1–12

Re-fit after updating the benchmark table::

   uv pip install scipy
   uv run scripts/fit_heap_heuristic.py

Copy the printed constants into :mod:`sb3_extra_buffers.gpu_buffers.size_estimation`.

After compaction, ``data_end`` reflects actual compressed usage; peak allocation remains ``heap_bytes``.

If compression fails with a heap capacity error, increase ``heap_bytes``. When in doubt,
log compressed sizes from your environment and add headroom.