Continual Learning Bench


Continual Learning Bench measures how well AI agents learn from past environment interactions, the defining challenge for agents that operate over extended horizons and are expected to improve online.

Quickstart

Continual Learning Bench requires Python 3.13 or later, uv, and a local Docker installation for tasks and systems that run containerized workspaces.

git clone https://github.com/pgasawa/continual-learning-bench && cd continual-learning-bench
uv sync --all-extras && source .venv/bin/activate
pre-commit install
clbench setup --all
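
Before running the commands above, it can help to confirm that the stated prerequisites are present. The following is a minimal sketch, not part of the repository: the helper name `check_prerequisites` is hypothetical, and it only checks that `uv` and `docker` are on the PATH, not that the Docker daemon is actually running.

```python
import shutil
import sys

MIN_PYTHON = (3, 13)               # minimum version stated in the Quickstart
REQUIRED_TOOLS = ("uv", "docker")  # executables the Quickstart expects on PATH


def check_prerequisites(version=None, which=shutil.which):
    """Return a mapping of requirement name -> True/False.

    `version` and `which` are injectable so the check can be tested
    without changing the real interpreter or PATH.
    """
    version = sys.version_info[:2] if version is None else tuple(version)
    report = {"python>=3.13": version >= MIN_PYTHON}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not found.
        report[tool] = which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```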

Add any model provider keys you need to a .env file in the project root, then verify the CLI and run a small benchmark:

clbench list
clbench run exploitable_poker --schedule quick_test --system icl
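
A missing provider key is an easy mistake to make at this step. The sketch below shows one way to check a .env file for empty or absent keys before a run. It assumes plain `KEY=VALUE` lines and does not handle `export` prefixes or multi-line values. The key name `EXAMPLE_PROVIDER_API_KEY` is a placeholder, since the keys you need depend on the systems you run.

```python
from pathlib import Path


def read_env_keys(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and simple single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_text, required):
    """Return the required keys that are absent or set to an empty value."""
    found = read_env_keys(env_text)
    return [key for key in required if not found.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    # Placeholder key name; replace with the keys your chosen systems need.
    print(missing_keys(text, ["EXAMPLE_PROVIDER_API_KEY"]))
```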

The Quickstart Guide walks through the same flow in more detail, including how to inspect tasks, systems, schedules, and run outputs.

Further Documentation

  • Installation — Full setup notes and prerequisites
  • Task Gallery — Browse all available tasks
  • Leaderboard — See how models compare
  • Viewers — Viewing and comparing results
  • Docs — Concepts, metrics, contribution guides, and more

Core Components

Dataset of Tasks

Each task spans multiple episodes in a shared environment, so agents that remember feedback and adapt should outperform those that solve each instance from scratch. Each task includes:

  • a set of constructed task instances with a learnable structure,
  • an evaluation script and reward metric measuring improvement over episodes,
  • schedules and variants for repeatable comparisons across systems.

Tasks live in src/tasks/. See the Task Gallery for an overview that's easy to browse.
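
To make "improvement over episodes" concrete, here is a toy illustration that is not taken from the repository. `HiddenBiasTask`, both agents, and the `improvement` metric (late-half mean reward minus early-half mean reward) are all hypothetical. The learnable structure is a hidden coin bias: an agent that counts past outcomes should score higher than one that guesses afresh each episode.

```python
import random


class HiddenBiasTask:
    """Toy task: guess the outcome of a coin with a hidden bias."""

    def __init__(self, p_heads=0.8, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def run_episode(self, guess):
        outcome = "heads" if self.rng.random() < self.p_heads else "tails"
        return outcome, (1.0 if guess == outcome else 0.0)


class MemorylessAgent:
    """Guesses uniformly at random and ignores all feedback."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self):
        return self.rng.choice(["heads", "tails"])

    def observe(self, outcome):
        pass


class CountingAgent:
    """Guesses whichever outcome it has seen most often so far."""

    def __init__(self):
        self.counts = {"heads": 0, "tails": 0}

    def act(self):
        return max(self.counts, key=self.counts.get)

    def observe(self, outcome):
        self.counts[outcome] += 1


def rollout(task, agent, episodes):
    """Run the agent for `episodes` episodes and return per-episode rewards."""
    rewards = []
    for _ in range(episodes):
        outcome, reward = task.run_episode(agent.act())
        agent.observe(outcome)  # feedback the agent may or may not use
        rewards.append(reward)
    return rewards


def improvement(rewards):
    """Mean reward of the later half minus mean reward of the earlier half."""
    if len(rewards) < 2:
        raise ValueError("need at least two episodes")
    half = len(rewards) // 2
    early = sum(rewards[:half]) / half
    late = sum(rewards[half:]) / (len(rewards) - half)
    return late - early


if __name__ == "__main__":
    for agent in (MemorylessAgent(), CountingAgent()):
        rewards = rollout(HiddenBiasTask(), agent, episodes=200)
        print(type(agent).__name__, sum(rewards) / len(rewards))
```

The benchmark's real tasks define their own reward metrics; this sketch only shows why remembering feedback pays off when instances share structure.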

Systems

Systems are the agents evaluated by the benchmark. Built-in baselines and model-backed systems live in src/systems/, and custom systems can be added to test new memory, retrieval, prompting, or tool-use strategies.
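
As a rough picture of what a custom memory strategy might look like, the sketch below keeps the most recent feedback notes and prepends them to each prompt. It does not use the repository's actual system interface: `NotebookSystem`, `act`, and `record_feedback` are hypothetical names, and the model is any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NotebookSystem:
    """Hypothetical system that carries short feedback notes across episodes."""

    model: Callable[[str], str]  # stand-in for a model call
    max_notes: int = 3           # assumed to be >= 1
    notes: list[str] = field(default_factory=list)

    def build_prompt(self, observation: str) -> str:
        lines = ["Lessons from earlier episodes:"]
        lines += [f"- {note}" for note in self.notes] or ["- (none yet)"]
        lines += ["", "Current observation:", observation]
        return "\n".join(lines)

    def act(self, observation: str) -> str:
        """Query the model with the observation plus remembered notes."""
        return self.model(self.build_prompt(observation))

    def record_feedback(self, feedback: str) -> None:
        """Store feedback, keeping only the most recent `max_notes` entries."""
        self.notes.append(feedback)
        self.notes = self.notes[-self.max_notes:]
```

Swapping the note list for a retrieval index or a summarizer would change the memory strategy without changing how the system is called.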

Execution Harness

The harness connects systems to task environments, manages multi-episode rollouts, records traces, and writes viewer artifacts for analysis. After installing, explore available options with:

clbench run --help
clbench run-all --help
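
The harness responsibilities described above (connecting a system to an environment, running several episodes, recording traces) can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's harness: `run_rollout` and its callback arguments are hypothetical, and the trace format (one JSON object per line) is chosen here only for simplicity.

```python
import io
import json


def run_rollout(env_step, system_act, system_learn, episodes, trace_file):
    """Run `episodes` episodes and write one JSON trace record per episode.

    env_step(episode, action) -> (reward, feedback)
    system_act(episode)       -> action
    system_learn(feedback)    -> None
    Returns the mean reward across episodes.
    """
    total = 0.0
    for episode in range(episodes):
        action = system_act(episode)
        reward, feedback = env_step(episode, action)
        system_learn(feedback)  # the system may update its memory here
        total += reward
        record = {
            "episode": episode,
            "action": action,
            "reward": reward,
            "feedback": feedback,
        }
        trace_file.write(json.dumps(record) + "\n")
    return total / episodes if episodes else 0.0


if __name__ == "__main__":
    memory = []
    buffer = io.StringIO()  # a real run would write to a file instead
    mean_reward = run_rollout(
        env_step=lambda ep, action: (1.0 if action == "b" else 0.0, "b"),
        system_act=lambda ep: memory[-1] if memory else "a",
        system_learn=memory.append,
        episodes=3,
        trace_file=buffer,
    )
    print(mean_reward)
    print(buffer.getvalue())
```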

Contribution

Contributions are welcome, especially new tasks that stress-test long-horizon learning. Reach out on Discord to discuss ideas before diving in.

Citing Us

If you find Continual Learning Bench useful, please cite it as:

@misc{clbench2026,
      title={Continual Learning Bench},
      author={Parth Asawa and Chris Glaze and Gabe Orlanski and Ramya Ramakrishnan and Benji Xu and Asim Biswal and Vincent Sunn Chen and Frederic Sala and Matei Zaharia and Joseph E. Gonzalez},
      year={2026},
}
