PufferLib - Reinforcement Learning at 1 Million Steps / Second.
We came across this post and immediately had Grok 4 write us a detailed dissertation on it. It sounded amazing that consumer hardware could train LLMs, which typically cost hundreds of millions of dollars, for a fraction of the cost. For your reading pleasure:
What is PufferLib LLM?
- Correction: the original article referenced 'PufferFish', which threw off the first draft; we have since reviewed and corrected it.
- Here is a short preamble of the moving parts in PufferLib LLM (a compact end-to-end sketch follows this list).
Environment Binding Creation: The code begins by defining a Binding object using pufferlib.emulation.Binding. This wraps the environment creator (e.g., gym.make('PongNoFrameskip-v4') for Atari) to standardize observation and action spaces, supporting single- or multi-agent setups.
Vectorization: The environment is vectorized with pufferlib.vector.make, creating multiple parallel instances (e.g., num_envs=32) via backends like 'Multiprocessing'. This allows simultaneous data collection across instances, accelerating training by leveraging hardware parallelism (e.g., CPU or GPU).
Policy Definition: A neural network policy is instantiated, such as pufferlib.models.NatureCNN for image-based observations (e.g., Atari) or pufferlib.models.DefaultMLP for vector observations (e.g., NetHack). The policy maps observations to actions, with output dimensions matching the action space.
Trainer Configuration and Initialization: A PPO configuration is set using pufferlib.frameworks.cleanrl.PPOConfig, specifying hyperparameters like learning rate or batch size. The trainer (pufferlib.frameworks.cleanrl.PPOTrainer) is then created, integrating the vectorized environment and policy.
Training Loop: The loop runs for a specified number of timesteps (e.g., 1,000,000). It alternates between:
trainer.evaluate(): rolls out episodes to collect trajectories (observations, actions, rewards, etc.).
trainer.train(): computes the PPO losses (policy, value, entropy) and updates the model via gradient descent.
Periodic logging with trainer.mean_and_log() tracks metrics such as rewards and losses.
Finally, trainer.close() releases resources.
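Putting those pieces together, the workflow looks roughly like the sketch below. This is a compact illustration using the Binding/PPOTrainer-style interface described above, with the standard CartPole-v1 task as a stand-in environment; exact class and argument names may differ between PufferLib releases, so treat it as a shape rather than a verbatim recipe.
import gymnasium as gym
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl as cleanrl
# 1. Binding: standardize the raw environment's observation and action spaces
binding = pufferlib.emulation.Binding(env_creator=lambda: gym.make('CartPole-v1'))
# 2. Vectorization: run several copies in parallel for faster data collection
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=8, backend='Multiprocessing')
# 3. Policy: a small MLP for vector observations
policy = pufferlib.models.DefaultMLP(
    input_size=vecenv.single_observation_space.shape[0],
    output_size=vecenv.single_action_space.n,
)
# 4. Trainer: PPO with default hyperparameters
trainer = cleanrl.PPOTrainer(vecenv, policy, cleanrl.PPOConfig())
# 5. Loop: collect rollouts, update the policy, log periodically, then clean up
for i in range(1000):
    trainer.evaluate()
    trainer.train()
    if i % 100 == 0:
        trainer.mean_and_log()
trainer.close()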
These examples work on "datasets" in the RL sense—trajectories generated from interactions with simulated environments—rather than static data. For instance, the Atari example trains on pixel-based game states from Pong, while NetHack uses structured dungeon observations. Training converges toward optimal policies, with performance depending on hyperparameters, environment complexity, and hardware. PufferLib's optimizations (e.g., custom kernels) ensure scalability, making it suitable for research or production RL workflows. If adapting for LLM-related RL (e.g., RLHF), custom language-based environments would be required, though the examples focus on classic RL benchmarks.
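To make "custom language-based environments" concrete, here is a minimal, hypothetical Gymnasium environment whose actions are token IDs and whose reward comes from matching a fixed target sequence. The class name, vocabulary size, and reward rule are illustrative assumptions; this is the kind of environment you would wrap with PufferLib's emulation layer, not part of PufferLib itself.
import gymnasium as gym
from gymnasium import spaces

class ToyTextEnv(gym.Env):
    """Hypothetical sketch: the agent emits one token per step and is
    rewarded for matching a fixed target token sequence."""

    def __init__(self, vocab_size=16, target=(3, 1, 4, 1, 5)):
        self.target = target
        # Observation: the current position in the sequence
        self.observation_space = spaces.Discrete(len(target) + 1)
        # Action: which token to emit next
        self.action_space = spaces.Discrete(vocab_size)
        self.pos = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        reward = 1.0 if action == self.target[self.pos] else 0.0
        self.pos += 1
        terminated = self.pos >= len(self.target)
        return min(self.pos, len(self.target)), reward, terminated, False, {}
An environment like this vectorizes and trains just like the classic benchmarks below; a real RLHF-style setup would swap the toy reward for a learned reward model and the tiny vocabulary for a language model's token space.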
Installation Guide for PufferLib
PufferLib is an open-source reinforcement learning library designed to simplify compatibility between complex environments and RL algorithms, while providing high-performance vectorization and emulation layers. It supports integration with libraries like CleanRL and Stable Baselines 3, and includes bindings for numerous environments. Below is a comprehensive installation guide based on official documentation.
Prerequisites
- Python 3.11 or higher.
- PyTorch (compatible with CUDA for GPU acceleration; install via pip install torch or specify a CUDA version). A quick check that PyTorch can see the GPU is shown after this list.
- For faster training, install NVCC (the NVIDIA CUDA compiler) to enable custom kernels.
- Avoid using Conda environments, as they may slow down environment compilation. Use UV or virtualenv instead.
- Docker Desktop (optional, for PufferTank containerized setup).
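A quick way to confirm that PyTorch can actually see a GPU before installing PufferLib is the standard CUDA availability check below (plain PyTorch calls, nothing PufferLib-specific):
import torch
# True only if a CUDA-capable GPU and a matching driver are visible to PyTorch
print(torch.cuda.is_available())
# Name of the first visible device, if any
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))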
Installation via Pip
The primary method is to install PufferLib from PyPI. Use the following commands:
Basic installation:
pip install pufferlib
Installation with specific environment bindings (e.g., for Atari or NetHack):
pip install pufferlib[atari]
pip install pufferlib[nethack]
Supported extras include: atari, procgen, nethack, neural_mmo, magent, minigrid, minihack, crafter, griddly, pokemon, and others. For a broad set of common environments:
pip install pufferlib[common]
Installation from source (for development):
git clone https://github.com/PufferAI/PufferLib.git
cd PufferLib
pip install -e .
If you encounter PyTorch CUDA incompatibilities, use:
pip install pufferlib --no-build-isolation
Installation via Docker (PufferTank)
PufferTank is a prebuilt Docker image with GPU support, pre-installed dependencies (CUDA 12.1, PyTorch 2.4, Python 3.11), and PufferLib. It is recommended for complex setups.
Clone the repository:
git clone https://github.com/PufferAI/PufferTank.git
cd PufferTank
Build and test the container:
./docker.sh test
For Visual Studio Code users, install the Dev Container extension and open the repository to automatically set up the environment. NeoVim is pre-installed for editing.
Additional Setup for Custom Environments
- For C extensions in custom environments:
python setup.py build_ext --inplace --force
- To enable debugging (address sanitizer):
DEBUG=1 python setup.py build_ext --inplace --force
CUDA_VISIBLE_DEVICES=None LD_PRELOAD=$(gcc -print-file-name=libasan.so) python -m pufferlib.pufferl train --train.device cpu --vec.backend Serial
- For Ocean environments (included by default), compile custom ones:
scripts/build_ocean.sh local
Verification
After installation, test with a simple command-line training run (e.g., on an Ocean environment):
puffer train puffer_breakout --train.device cpu
This confirms the setup. For full documentation, refer to https://puffer.ai/docs.html.
10 Code Examples for Training on Different Environments
Below are 10 Python code examples demonstrating how to use PufferLib to train models on 10 different environments using Proximal Policy Optimization (PPO) via the PuffeRL trainer. Each example assumes PufferLib is installed with the relevant extras (e.g., pufferlib[atari]). The code structure involves creating a vectorized environment, defining a policy, initializing the trainer, and running a training loop. Hyperparameters are set to reasonable defaults for illustration; adjust as needed for production. These examples focus on single-agent setups for simplicity, though PufferLib supports multi-agent scenarios.
1. Atari (PongNoFrameskip-v4)
import gymnasium as gym
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Atari environment
binding = pufferlib.emulation.Binding(
env_creator=lambda: gym.make('PongNoFrameskip-v4'),
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=32, backend='Multiprocessing')
# Define policy (CNN for image observations)
policy = pufferlib.models.NatureCNN(
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig() # Default PPO config
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(1000000): # Total timesteps
trainer.evaluate() # Collect interactions
trainer.train() # Update policy
if step % 10000 == 0:
trainer.mean_and_log() # Log metrics
trainer.close()
2. Procgen (Coinrun)
import procgen
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Procgen environment
binding = pufferlib.emulation.Binding(
env_creator=lambda: procgen.ProcgenEnv(env_name='coinrun', num_envs=1),
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=64, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.ResNet( # Suitable for Procgen
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(learning_rate=0.0003)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(25000000):
trainer.evaluate()
trainer.train()
if step % 50000 == 0:
trainer.mean_and_log()
trainer.close()
3. NetHack
import nle
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for NetHack
binding = pufferlib.emulation.Binding(
env_creator=lambda: nle.env.NLE(),
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=16, backend='Multiprocessing')
# Define policy (MLP for structured observations)
policy = pufferlib.models.DefaultMLP(
input_size=vecenv.single_observation_space.shape[0],
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(batch_size=256)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(100000000):
trainer.evaluate()
trainer.train()
if step % 100000 == 0:
trainer.mean_and_log()
trainer.close()
4. Neural MMO
import neuralmmo
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Neural MMO (multi-agent)
binding = pufferlib.emulation.Binding(
env_creator=lambda: neuralmmo.RLlibEnv(),
multiagent=True,
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=8, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.DefaultMLP(
input_size=vecenv.single_observation_space.shape[0],
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(num_agents=128) # Adjust for multi-agent
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(50000000):
trainer.evaluate()
trainer.train()
if step % 50000 == 0:
trainer.mean_and_log()
trainer.close()
5. MiniGrid (Empty-8x8-v0)
import minigrid
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for MiniGrid
binding = pufferlib.emulation.Binding(
env_creator=lambda: minigrid.wrappers.FullyObsWrapper(minigrid.envs.EmptyEnv(size=8)),
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=32, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.NatureCNN(
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig()
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(1000000):
trainer.evaluate()
trainer.train()
if step % 10000 == 0:
trainer.mean_and_log()
trainer.close()
6. Crafter
import crafter
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Crafter
binding = pufferlib.emulation.Binding(
env_creator=crafter.Env,
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=64, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.ResNet(
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(learning_rate=0.0001)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(10000000):
trainer.evaluate()
trainer.train()
if step % 20000 == 0:
trainer.mean_and_log()
trainer.close()
7. Griddly (Custom Level)
import griddly
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Griddly (example with a GDY file)
binding = pufferlib.emulation.Binding(
env_creator=lambda: griddly.GymWrapper('path/to/gdy/file.yaml'),
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=16, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.DefaultMLP(
input_size=vecenv.single_observation_space.shape[0],
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig()
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(5000000):
trainer.evaluate()
trainer.train()
if step % 50000 == 0:
trainer.mean_and_log()
trainer.close()
8. Pokemon Red
import gym_pokemon_red
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for Pokemon Red
binding = pufferlib.emulation.Binding(
env_creator=gym_pokemon_red.PokemonRedEnv,
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=8, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.NatureCNN(
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(batch_size=512)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(20000000):
trainer.evaluate()
trainer.train()
if step % 100000 == 0:
trainer.mean_and_log()
trainer.close()
9. MAgent (Battle)
import magent
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for MAgent (multi-agent battle)
binding = pufferlib.emulation.Binding(
env_creator=lambda: magent.GridWorld('battle'),
multiagent=True,
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=16, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.DefaultMLP(
input_size=vecenv.single_observation_space.shape[0],
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(num_agents=100)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(10000000):
trainer.evaluate()
trainer.train()
if step % 50000 == 0:
trainer.mean_and_log()
trainer.close()
10. Gym MicroRTS
import microrts
import pufferlib
import pufferlib.emulation
import pufferlib.vector
import pufferlib.models
import pufferlib.frameworks.cleanrl
# Create binding for MicroRTS
binding = pufferlib.emulation.Binding(
env_creator=microrts.MicroRTSEnv,
)
# Vectorize the environment
vecenv = pufferlib.vector.make(binding.env_creator, num_envs=32, backend='Multiprocessing')
# Define policy
policy = pufferlib.models.ResNet(
input_shape=vecenv.single_observation_space.shape,
output_size=vecenv.single_action_space.n,
)
# Configure and initialize trainer
config = pufferlib.frameworks.cleanrl.PPOConfig(learning_rate=0.00025)
trainer = pufferlib.frameworks.cleanrl.PPOTrainer(vecenv, policy, config)
# Training loop
for step in range(5000000):
trainer.evaluate()
trainer.train()
if step % 10000 == 0:
trainer.mean_and_log()
trainer.close()