Continual Learning Bench measures how well AI agents learn from past environment interactions, the defining challenge for agents that operate over extended horizons and are expected to improve online.
Continual Learning Bench requires Python 3.13 or later, uv, and a local Docker installation for tasks and systems that run containerized workspaces.
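The prerequisites above can be checked before installing. The following is a minimal sketch using only the Python standard library; the function name `missing_prerequisites` is hypothetical and not part of the clbench CLI, and it assumes the tools are invoked as `uv` and `docker` on your PATH.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 13)          # minimum version stated above
REQUIRED_TOOLS = ("uv", "docker")  # assumed executable names


def missing_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version_info[:2]) < REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Finding `docker` on PATH only shows that the client is installed; it does not confirm that the Docker daemon is running.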
git clone https://github.com/pgasawa/continual-learning-bench && cd continual-learning-bench
uv sync --all-extras && source .venv/bin/activate
pre-commit install
clbench setup --all

Add any model provider keys you need to a .env file in the project root, then verify the CLI and run a small benchmark:
clbench list
clbench run exploitable_poker --schedule quick_test --system icl

The Quickstart Guide walks through the same flow in more detail, including how to inspect tasks, systems, schedules, and run outputs.
- Installation — Full setup notes and prerequisites
- Task Gallery — Browse all available tasks
- Leaderboard — See how models compare
- Viewers — Viewing and comparing results
- Docs — Concepts, metrics, contribution guides, and more
Each task spans multiple episodes in a shared environment, so agents that remember feedback and adapt should outperform those that solve each instance from scratch. Each task includes:
- a set of constructed task instances with a learnable structure,
- an evaluation script and reward metric measuring improvement over episodes,
- schedules and variants for repeatable comparisons across systems.
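To make "improvement over episodes" concrete, here is one simple way such a score could be computed from per-episode rewards. This is an illustrative sketch only: the `improvement` function and the example reward lists are hypothetical, and the metrics each task actually uses are defined by its own evaluation script.

```python
from statistics import mean


def improvement(rewards):
    """Mean reward in the second half of the run minus the mean in the first half.

    A positive value suggests the agent got better as episodes accumulated.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two episodes to measure improvement")
    half = len(rewards) // 2
    return mean(rewards[half:]) - mean(rewards[:half])


# Hypothetical per-episode rewards for two agents on the same schedule
learning_agent = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
flat_agent = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

print(round(improvement(learning_agent), 3))  # 0.667
print(improvement(flat_agent))                # 0.0
```

Both agents here have nonzero average reward, but only the first shows a gain between early and late episodes, which is the kind of difference a multi-episode benchmark is designed to surface.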
Tasks live in src/tasks/. See the Task Gallery for a browsable overview.
Systems are the agents evaluated by the benchmark. Built-in baselines and model-backed systems live in src/systems/, and custom systems can be added to test new memory, retrieval, prompting, or tool-use strategies.
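As a toy illustration of why a memory strategy matters, the sketch below compares an agent that ignores feedback with one that remembers it. Everything here is hypothetical: the `act`/`observe` interface, the action set, and the `run_episodes` loop are stand-ins, not the interface in src/systems/. See Contributing a New System for the real one.

```python
ACTIONS = ("a", "b", "c")  # hypothetical action set


class StatelessAgent:
    """Solves every episode from scratch and ignores all feedback."""

    def act(self):
        return ACTIONS[0]

    def observe(self, action, reward):
        pass


class MemoryAgent:
    """Remembers which actions were tried and repeats one that paid off."""

    def __init__(self):
        self.tried = []
        self.known_good = None

    def act(self):
        if self.known_good is not None:
            return self.known_good
        untried = [a for a in ACTIONS if a not in self.tried]
        return untried[0] if untried else ACTIONS[0]

    def observe(self, action, reward):
        self.tried.append(action)
        if reward > 0:
            self.known_good = action


def run_episodes(agent, best_action, num_episodes):
    """Toy rollout: reward is 1.0 only when the agent picks the hidden best action."""
    rewards = []
    for _ in range(num_episodes):
        action = agent.act()
        reward = 1.0 if action == best_action else 0.0
        agent.observe(action, reward)
        rewards.append(reward)
    return rewards


stateless_rewards = run_episodes(StatelessAgent(), "c", 6)
memory_rewards = run_episodes(MemoryAgent(), "c", 6)
print(stateless_rewards)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(memory_rewards)     # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

The memory agent spends two episodes exploring, then exploits what it learned for the rest of the run; the stateless agent never recovers because nothing carries over between episodes.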
The harness connects systems to task environments, manages multi-episode rollouts, records traces, and writes viewer artifacts for analysis. After installing, explore available options with:
clbench run --help
clbench run-all --help

- Roadmap — See what we're working on and where you can help
- Contributing a New Task — Add a benchmark environment
- Contributing a New System — Add an evaluated agent
Contributions are welcome, especially new tasks that stress-test long-horizon learning. Reach out on Discord to discuss ideas before diving in.
If you find Continual Learning Bench useful, please cite it as:
@misc{clbench2026,
title={Continual Learning Bench},
author={Parth Asawa and Chris Glaze and Gabe Orlanski and Ramya Ramakrishnan and Benji Xu and Asim Biswal and Vincent Sunn Chen and Frederic Sala and Matei Zaharia and Joseph E. Gonzalez},
year={2026},
}