AnwarDebes/lucid: A glass-box, safety-shielded agent for the ARC-AGI-3 interactive reasoning benchmark: object-centric perception, a human-readable learned world model, BFS planning, and an online-learned safety shield. CPU-only, no hosted LLM, honest live-API evaluation.

ARC-AGI-3 drops an agent into a never-before-seen turn-based game with no instructions and asks it to explore, infer the goal, model the dynamics, and plan. At release, every frontier model scored below 1% (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Opus 4.6 0.25%, Grok-4 0.00%), while humans solve 100%.

LUCID is the opposite of a frontier model on every axis that matters for transparency and safety:

Self-contained: no hosted LLM, no GPU, no internet at decision time (so it is valid under the competition's sandbox rule). Pure NumPy on a CPU.
Interpretable: object-centric perception, a human-readable learned world model (for example, ACTION1: disp=(-4,0) = up), and a per-step decision trace (which rule fired, why, and what happened).
Defensive: an explicit safety shield whose unsafe-action set is learned online and which provably never repeats an action that caused the terminal GAME_OVER (Proposition 1 in the paper).
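
To make the shield idea concrete, here is a minimal sketch of an online-learned unsafe-action set. It is an illustration, not the code in lucidkit/agent.py: the class name `Shield`, its methods, the `"OK"` outcome label, and the choice to key unsafe actions by state signature are all assumptions. Only the `GAME_OVER` outcome comes from the text above.

```python
from collections import defaultdict


class Shield:
    """Hypothetical shield: remembers (state, action) pairs that ended the game."""

    def __init__(self):
        # state signature -> set of actions observed to be fatal from that state
        self.unsafe = defaultdict(set)

    def record(self, state_sig, action, outcome):
        """Update the unsafe set after observing the outcome of one step."""
        if outcome == "GAME_OVER":
            self.unsafe[state_sig].add(action)

    def allowed(self, state_sig, actions):
        """Return the actions not known to be fatal in this state."""
        blocked = self.unsafe.get(state_sig, set())
        return [a for a in actions if a not in blocked]


shield = Shield()
actions = ["ACTION1", "ACTION2", "ACTION3"]
shield.record("s0", "ACTION2", "GAME_OVER")  # ACTION2 was fatal in state s0
shield.record("s0", "ACTION1", "OK")         # "OK" is a placeholder label; nothing is stored
print(shield.allowed("s0", actions))         # ACTION2 is filtered out in s0
print(shield.allowed("s1", actions))         # nothing is known about s1 yet
```

Because an action is added to the unsafe set the first time it ends the game and is filtered out on every later call for that state, a policy that only picks from `allowed(...)` cannot repeat it there. What to do when every action is blocked is left to the caller in this sketch.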

The policy is a transparent cascade: shield → exploit → goal-seek (BFS over the learned movement model) → curiosity (count-based novelty).
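
The cascade can be sketched as a single function that returns the first rule that fires. Everything below is a simplified assumption for illustration: the `MOVES` table uses unit steps on a small grid rather than the learned pixel displacements, "exploit" is read as "replay an action that previously paid off from this position", and the names `choose`, `bfs_first_action`, `known_good`, and `visits` are hypothetical.

```python
from collections import Counter, deque

# Hypothetical learned movement model: action -> (d_row, d_col) displacement.
MOVES = {"ACTION1": (-1, 0), "ACTION2": (1, 0), "ACTION3": (0, -1), "ACTION4": (0, 1)}


def bfs_first_action(start, goal, walls, size):
    """Breadth-first search over MOVES; returns the first action of a shortest path."""
    queue = deque([(start, None)])
    seen = {start}
    while queue:
        pos, first = queue.popleft()
        if pos == goal:
            return first
        for action, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, first or action))
    return None  # goal unreachable under the current model


def choose(pos, goal, walls, size, unsafe, known_good, visits):
    """Return (rule, action) from the first rule in the cascade that fires."""
    # 1. Shield: drop actions already known to be fatal from this position.
    safe = [a for a in MOVES if a not in unsafe.get(pos, set())]
    if not safe:
        return "shield", None  # every action is blocked; the caller must decide
    # 2. Exploit: replay an action that previously paid off here, if still safe.
    if known_good.get(pos) in safe:
        return "exploit", known_good[pos]
    # 3. Goal-seek: BFS toward the hypothesised goal, if one is known.
    if goal is not None:
        action = bfs_first_action(pos, goal, walls, size)
        if action in safe:
            return "goal-seek", action

    # 4. Curiosity: prefer the safe action whose predicted cell was visited least.
    def visit_count(action):
        dr, dc = MOVES[action]
        return visits[(pos[0] + dr, pos[1] + dc)]

    return "curiosity", min(safe, key=visit_count)


# A 3x3 grid with one wall: the only way from (2, 0) to (0, 2) starts by moving right.
print(choose((2, 0), (0, 2), {(1, 0)}, 3, {}, {}, Counter()))
# If the shield has blocked that move, the cascade falls through to curiosity.
print(choose((2, 0), (0, 2), {(1, 0)}, 3, {(2, 0): {"ACTION4"}}, {}, Counter()))
```

Returning the rule name alongside the action is what makes a per-step trace cheap to produce: the caller can log which rule fired without re-deriving it.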

The accompanying paper is at paper/lucid.pdf. All reported numbers are measured live against the ARC-AGI-3 API; nothing is simulated.

Quick start

pip install -e ".[experiments,dev]"
cp .env.example .env          # paste your ARC-AGI-3 key (kept out of git)

python -m pytest                                   # offline unit tests
python demo/quickstart.py LS20                     # play one game; print the
                                                   # learned model + decision trace
python experiments/run_all.py --agent lucid  --budget 800   # all 25 public games
python experiments/run_all.py --agent random --budget 800   # baseline
python scripts/compute_scores.py                   # RHAE (official methodology)
python scripts/make_figures.py                     # paper figures

from lucidkit import ArcClient, LucidAgent, run_game
c = ArcClient()                                     # reads ARC_API_KEY
game = c.list_games()[0]
card = c.open_scorecard(tags=["demo"])
agent = LucidAgent(seed=0)
res = run_game(c, agent, game, card, budget=800)
print(res.max_levels, res.model_summary["action_effects"])   # interpretable model
for d in agent.trace[:10]:
    print(d.rule, d.reason, "->", d.effect)                  # auditable trace

What's in the box

Path	What it is
`lucidkit/client.py`	robust ARC-AGI-3 API client (X-API-Key, cookie affinity, retries)
`lucidkit/perception.py`	object segmentation, online HUD removal, state signatures
`lucidkit/agent.py`	the glass-box policy cascade + learned dynamics + BFS planner + safety shield
`lucidkit/baselines.py`	random baseline (the benchmark's reference)
`lucidkit/runner.py`	evaluation runner (per-level action accounting for RHAE)
`experiments/`	live exploration + full evaluation drivers
`scripts/`	RHAE scoring, figures, paper numbers
`tests/`	offline unit tests (perception + policy, no network)
`paper/lucid.tex`	the paper source
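
As a rough picture of what object segmentation and state signatures involve, the sketch below groups same-coloured, 4-connected cells into objects and hashes a frame into a short key. It is an assumption-laden illustration, not the contents of lucidkit/perception.py: the function names, the 4-connectivity rule, the background colour 0, and the SHA-1 signature are all hypothetical, HUD removal is omitted, and plain Python lists stand in for NumPy arrays to keep the example dependency-free.

```python
import hashlib
from collections import deque


def segment(grid, background=0):
    """Group same-coloured, 4-connected cells into objects via flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] == background:
                continue
            colour, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and grid[ny][nx] == colour:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"colour": colour, "cells": sorted(cells)})
    return objects


def signature(grid):
    """Short, stable hash of a frame, usable as a key for visited-state counts."""
    text = repr([list(row) for row in grid])
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


frame = [
    [0, 3, 3, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 5],
]
for obj in segment(frame):
    print(obj["colour"], obj["cells"])
print(signature(frame))
```

Two frames with identical cell values produce the same signature, which is what lets a count-based novelty rule or a per-state shield recognise a state it has seen before.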

Results (real, measured, not a SOTA claim)

Across all 25 public games via the live API (curiosity agent, 300-action budget):

Metric	LUCID	Random	Frontier LLMs (reported)
Levels completed	2 (VC33, LP85)	~0 (defeated by design)	~0
Public-set RHAE	0.083% (server) / 0.05% (mine)	~0%	<1% (best 0.37%)
Best per-level efficiency	25% on VC33 L1 (14 actions vs human 7)	n/a	n/a
Exploration	far broader than random (equal budget)	baseline	n/a
Repeated fatal actions	0 (shield guarantee)	many	n/a
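
The 25% figure for VC33 L1 is consistent with scoring a completed level as the squared ratio of human to agent action counts, capped at 1: (7/14)^2 = 0.25. The sketch below assumes that formula. It is inferred from the numbers in the table rather than taken from scripts/compute_scores.py, and the function name is hypothetical.

```python
def level_efficiency(human_actions, agent_actions):
    """Assumed per-level score: squared human/agent action ratio, capped at 1."""
    if agent_actions <= 0:
        return 0.0  # treat a level with no recorded actions as not completed
    return min(1.0, human_actions / agent_actions) ** 2


print(f"{level_efficiency(7, 14):.0%}")  # 25%
```

Under this assumed formula, taking twice as many actions as the human baseline costs three quarters of the level's score, which is why charging exploration actions to the total is a conservative choice.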

Honest reading: LUCID completes real ARC-AGI-3 levels that opaque models clear only rarely, but its overall RHAE sits in the same sub-1% regime as frontier models: it does not beat them. ARC-AGI-3 is hard by design (random ~0, frontier <1%, humans 100%). LUCID's value is transparency, online safety, and self-containment, with an honest live-API evaluation, not a leaderboard win. Numbers regenerate via scripts/compute_scores.py.

Honest scope

These are results on the 25 public development games via the live API: a transparent development-set measurement, not the official private leaderboard, and I do not claim to solve ARC-AGI-3. To be conservative, I charge all actions, including exploration, to RHAE. I keep the API key out of git and release every decision trace so the numbers are independently checkable. See the paper's Limitations section.

Citing

@misc{debes2026lucid,
  title  = {LUCID: A Glass-Box, Safety-Shielded Agent for Open-World Interactive
            Reasoning on ARC-AGI-3},
  author = {Debes, Anwar},
  year   = {2026},
  note   = {Reference implementation, v1.0,
            \url{https://github.com/AnwarDebes/lucid}}
}

License

MIT. See LICENSE.

