Continual Learning Bench

Continual Learning Bench measures how well AI agents learn from past environment interactions, the defining challenge for agents that operate over extended horizons and are expected to improve online.

Quickstart

Continual Learning Bench requires Python 3.13 or later and uv. Tasks and systems that run containerized workspaces also need a local Docker installation.

git clone https://github.com/pgasawa/continual-learning-bench && cd continual-learning-bench
uv sync --all-extras && source .venv/bin/activate
pre-commit install
clbench setup --all
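
If any of the commands above fail, a quick prerequisite check can narrow down the cause. The following is a minimal sketch, not part of the repository: `check_prerequisites` is a hypothetical helper that only confirms the Python version and that `uv` and `docker` are on the PATH. It does not verify that the Docker daemon is actually running.

```python
import shutil
import sys

MIN_PYTHON = (3, 13)  # minimum version stated in the Quickstart


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # Only checks that the executables are discoverable, nothing more.
    for tool in ("uv", "docker"):
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


for problem in check_prerequisites():
    print(problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the local environment.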

Add any model provider keys you need to a .env file in the project root, then verify the CLI and run a small benchmark:

clbench list
clbench run exploitable_poker --schedule quick_test --system icl

The Quickstart Guide walks through the same flow in more detail, including how to inspect tasks, systems, schedules, and run outputs.

Further Documentation

Installation — Full setup notes and prerequisites
Task Gallery — Browse all available tasks
Leaderboard — See how models compare
Viewers — Viewing and comparing results
Docs — Concepts, metrics, contribution guides, and more

Core Components

Dataset of Tasks

Each task spans multiple episodes in a shared environment, so agents that remember feedback and adapt should outperform those that solve each instance from scratch. Each task includes:

a set of constructed task instances with a learnable structure,
an evaluation script and reward metric measuring improvement over episodes, and
schedules and variants for repeatable comparisons across systems.
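
To make "improvement over episodes" concrete, here is one simple way such a quantity could be computed. This is an illustrative assumption, not the benchmark's actual metric: the hypothetical `improvement` function compares the mean reward of the last few episodes with the mean reward of the first few.

```python
from statistics import mean


def improvement(rewards: list[float], window: int = 3) -> float:
    """Mean reward of the last `window` episodes minus the first `window`.

    Hypothetical illustration only; the benchmark defines its own metrics.
    """
    if window <= 0 or len(rewards) < 2 * window:
        raise ValueError("need at least 2 * window episodes and window > 0")
    return mean(rewards[-window:]) - mean(rewards[:window])


# An agent that adapts: rewards rise across episodes.
learner = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
# An agent that solves each instance from scratch: rewards stay flat.
memoryless = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

print(improvement(learner))     # positive: later episodes score higher
print(improvement(memoryless))  # zero: no change across episodes
```

Under this sketch, a positive value means later episodes scored higher than early ones, which is the behavior the tasks are built to reward.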

Tasks live in src/tasks/. See the Task Gallery for a browsable overview.

Systems

Systems are the agents evaluated by the benchmark. Built-in baselines and model-backed systems live in src/systems/, and custom systems can be added to test new memory, retrieval, prompting, or tool-use strategies.
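
As a rough picture of what a memory strategy might look like, the sketch below keeps a running average reward per observation and action. The class name, the `act` and `update` methods, and the action and observation strings are all hypothetical; the real interface is defined by the code in src/systems/ and the contribution guide, not by this example.

```python
from collections import defaultdict


class RememberBestAction:
    """Hypothetical system: tracks average reward per (observation, action)."""

    def __init__(self, actions):
        self.actions = list(actions)
        # observation -> action -> [total reward, number of tries]
        self.stats = defaultdict(lambda: {a: [0.0, 0] for a in self.actions})

    def act(self, observation):
        stats = self.stats[observation]
        # Explore: try every action once before trusting the averages.
        for action in self.actions:
            if stats[action][1] == 0:
                return action
        # Exploit: pick the action with the best average reward so far.
        return max(self.actions, key=lambda a: stats[a][0] / stats[a][1])

    def update(self, observation, action, reward):
        entry = self.stats[observation][action]
        entry[0] += reward
        entry[1] += 1


system = RememberBestAction(["fold", "call"])
payoff = {"fold": 0.0, "call": 1.0}  # made-up rewards for illustration
choices = []
for _ in range(4):
    action = system.act("opponent_raises")
    system.update("opponent_raises", action, payoff[action])
    choices.append(action)
print(choices)
```

The system tries each action once, then settles on the one that paid off, which is the kind of cross-episode adaptation a memoryless baseline cannot show.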

Execution Harness

The harness connects systems to task environments, manages multi-episode rollouts, records traces, and writes viewer artifacts for analysis. After installing, explore available options with:

clbench run --help
clbench run-all --help
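
Conceptually, a multi-episode rollout is a loop in which the system acts, the environment returns a reward, the system sees that feedback, and each step is recorded. The sketch below shows that loop with toy stand-ins. `BiasedCoinEnv`, `CountingSystem`, `Step`, and `run_rollout` are hypothetical names invented here and do not reflect the harness's real classes or trace format.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class Step:
    episode: int
    action: str
    reward: float


class BiasedCoinEnv:
    """Toy environment: the coin lands 'heads' about 80% of the time."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        outcome = "heads" if self.rng.random() < 0.8 else "tails"
        return 1.0 if action == outcome else 0.0


class CountingSystem:
    """Toy system: plays the action with the highest total reward so far."""

    def __init__(self):
        self.totals = {"heads": 0.0, "tails": 0.0}

    def act(self):
        return max(self.totals, key=self.totals.get)

    def update(self, action, reward):
        self.totals[action] += reward


def run_rollout(system, env, episodes):
    """Run `episodes` one-step episodes and return the recorded trace."""
    trace = []
    for episode in range(episodes):
        action = system.act()
        reward = env.step(action)
        system.update(action, reward)  # feedback carries into later episodes
        trace.append(Step(episode, action, reward))
    return trace


trace = run_rollout(CountingSystem(), BiasedCoinEnv(seed=0), episodes=20)
# Serialize the trace so it could be saved and inspected later.
print(json.dumps([asdict(step) for step in trace[:3]], indent=2))
```

The key design point is that the same system object persists across episodes, so whatever it stores in `update` is available the next time `act` is called.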

Contribution

Roadmap — See what we're working on and where you can help
Contributing a New Task — Add a benchmark environment
Contributing a New System — Add an evaluated agent

Contributions are welcome, especially new tasks that stress-test long-horizon learning. Reach out on Discord to discuss ideas before diving in.

Citing Us

If you find Continual Learning Bench useful, please cite it as:

@misc{clbench2026,
      title={Continual Learning Bench},
      author={Parth Asawa and Chris Glaze and Gabe Orlanski and Ramya Ramakrishnan and Benji Xu and Asim Biswal and Vincent Sunn Chen and Frederic Sala and Matei Zaharia and Joseph E. Gonzalez},
      year={2026},
}
