ARC-AGI-3 drops an agent into a never-before-seen turn-based game with no instructions and asks it to explore, infer the goal, model the dynamics, and plan. At release every frontier model scored below 1% (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Opus 4.6 0.25%, Grok-4 0.00%) while humans solve 100%.
LUCID is the opposite of a frontier model on every axis that matters for transparency and safety:
- Self-contained: no hosted LLM, no GPU, no internet at decision time (so it is valid under the competition's sandbox rule). Pure NumPy on a CPU.
- Interpretable: object-centric perception, a human-readable learned world model (`ACTION1: disp=(-4,0) = up`), and a per-step decision trace (which rule fired, why, and what happened).
- Defensive: an explicit safety shield whose unsafe-action set is learned online and which provably never repeats an action that caused the terminal `GAME_OVER` (Proposition 1 in the paper).
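
The shield idea in the last bullet can be illustrated with a minimal sketch. It assumes the unsafe set is keyed on a (state signature, action) pair and that outcomes arrive as plain strings; `SafetyShield`, `record`, `allowed`, and the `"CONTINUE"` label are hypothetical names for illustration, not the `lucidkit` API.

```python
from typing import Hashable, Iterable, List, Set, Tuple


class SafetyShield:
    """Remembers (state, action) pairs that led to GAME_OVER and filters them out."""

    def __init__(self) -> None:
        self.unsafe: Set[Tuple[Hashable, str]] = set()

    def record(self, state: Hashable, action: str, outcome: str) -> None:
        # Learn online: only a terminal GAME_OVER marks the pair as unsafe.
        if outcome == "GAME_OVER":
            self.unsafe.add((state, action))

    def allowed(self, state: Hashable, actions: Iterable[str]) -> List[str]:
        # Preserve the input order so later rules stay deterministic.
        return [a for a in actions if (state, a) not in self.unsafe]


shield = SafetyShield()
actions = ["ACTION1", "ACTION2", "ACTION3"]
shield.record("s0", "ACTION2", "GAME_OVER")  # fatal in state s0
shield.record("s0", "ACTION3", "CONTINUE")   # any non-terminal label: stays allowed
print(shield.allowed("s0", actions))  # ['ACTION1', 'ACTION3']
print(shield.allowed("s1", actions))  # all three: nothing fatal seen in s1
```

In this sketch, a policy that only ever picks from `allowed(...)` cannot choose a recorded pair again, which is the informal content of the no-repeat property. The formal statement and its exact conditions are in the paper's Proposition 1.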
The policy is a transparent cascade: shield → exploit → goal-seek (BFS over the learned movement model) → curiosity (count-based novelty).
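
The cascade above can be sketched as a single function that tries each rule in order and reports which one fired. This is an illustration under stated assumptions, not the code in `lucidkit/agent.py`: the 64×64 grid bound, the `choose` and `bfs_first_action` names, the `known_good` table, and every displacement except `ACTION1` are hypothetical.

```python
from collections import Counter, deque
from typing import Dict, Hashable, List, Optional, Set, Tuple

Pos = Tuple[int, int]


def bfs_first_action(start: Pos, goal: Pos, moves: Dict[str, Pos],
                     size: int = 64) -> Optional[str]:
    """Breadth-first search over positions reachable via learned displacements.

    Returns the first action of a shortest path, or None if there is none.
    """
    queue = deque([(start, None)])  # (position, first action on the path)
    seen = {start}
    while queue:
        pos, first = queue.popleft()
        if pos == goal:
            return first
        for action, (dr, dc) in moves.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, first or action))
    return None


def choose(state: Hashable, pos: Pos, goal: Optional[Pos], actions: List[str],
           unsafe: Set[Tuple[Hashable, str]], known_good: Dict[Hashable, str],
           moves: Dict[str, Pos], counts: Counter) -> Tuple[Optional[str], str]:
    """Return (action, rule) using shield -> exploit -> goal-seek -> curiosity."""
    # 1. Shield: drop actions already seen to cause GAME_OVER in this state.
    safe = [a for a in actions if (state, a) not in unsafe]
    if not safe:
        return None, "shield"  # assumption: report "no safe action" to the caller
    # 2. Exploit: replay an action that previously made progress from this state.
    if known_good.get(state) in safe:
        return known_good[state], "exploit"
    # 3. Goal-seek: plan with BFS over the learned movement model.
    if goal is not None:
        safe_moves = {a: d for a, d in moves.items() if a in safe}
        step = bfs_first_action(pos, goal, safe_moves)
        if step is not None:
            return step, "goal-seek"
    # 4. Curiosity: take the least-tried safe action (count-based novelty).
    return min(safe, key=lambda a: counts[(state, a)]), "curiosity"


# ACTION1 = up by 4 cells follows the example above; the rest are made up.
moves = {"ACTION1": (-4, 0), "ACTION2": (4, 0), "ACTION3": (0, -4), "ACTION4": (0, 4)}
actions = list(moves)
counts = Counter({("s0", "ACTION1"): 3, ("s0", "ACTION2"): 1})
unsafe = {("s0", "ACTION3")}

print(choose("s0", (8, 8), (0, 8), actions, unsafe, {}, moves, counts))
# goal is two steps up: ('ACTION1', 'goal-seek')
print(choose("s0", (8, 8), None, actions, unsafe, {}, moves, counts))
# no goal inferred yet: ('ACTION4', 'curiosity')
```

Returning the rule name alongside the action is what makes a per-step trace cheap to keep: each decision already carries the reason it was made.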
The accompanying paper is at paper/lucid.pdf. All reported
numbers are measured live against the ARC-AGI-3 API; nothing is simulated.
```bash
pip install -e ".[experiments,dev]"
cp .env.example .env              # paste your ARC-AGI-3 key (kept out of git)
python -m pytest                  # offline unit tests
python demo/quickstart.py LS20    # play one game; print the learned model + decision trace
python experiments/run_all.py --agent lucid --budget 800    # all 25 public games
python experiments/run_all.py --agent random --budget 800   # baseline
python scripts/compute_scores.py  # RHAE (official methodology)
python scripts/make_figures.py    # paper figures
```

```python
from lucidkit import ArcClient, LucidAgent, run_game

c = ArcClient()  # reads ARC_API_KEY
game = c.list_games()[0]
card = c.open_scorecard(tags=["demo"])
agent = LucidAgent(seed=0)
res = run_game(c, agent, game, card, budget=800)
print(res.max_levels, res.model_summary["action_effects"])  # interpretable model
for d in agent.trace[:10]:
    print(d.rule, d.reason, "->", d.effect)  # auditable trace
```

| Path | What it is |
|---|---|
| `lucidkit/client.py` | robust ARC-AGI-3 API client (X-API-Key, cookie affinity, retries) |
| `lucidkit/perception.py` | object segmentation, online HUD removal, state signatures |
| `lucidkit/agent.py` | the glass-box policy cascade + learned dynamics + BFS planner + safety shield |
| `lucidkit/baselines.py` | random baseline (the benchmark's reference) |
| `lucidkit/runner.py` | evaluation runner (per-level action accounting for RHAE) |
| `experiments/` | live exploration + full evaluation drivers |
| `scripts/` | RHAE scoring, figures, paper numbers |
| `tests/` | offline unit tests (perception + policy, no network) |
| `paper/lucid.tex` | the paper source |

Across all 25 public games via the live API (curiosity agent, 300-action budget):
| Metric | LUCID | Random | Frontier LLMs (report) |
|---|---|---|---|
| Levels completed | 2 (VC33, LP85) | ~0 (defeated by design) | ~0 |
| Public-set RHAE | 0.083% (server) / 0.05% (mine) | ~0% | <1% (best 0.37%) |
| Best per-level efficiency | 25% on VC33 L1 (14 actions vs human 7) | n/a | n/a |
| Exploration | far broader than random (equal budget) | baseline | n/a |
| Repeated fatal actions | 0 (shield guarantee) | many | n/a |
Honest reading: LUCID completes real ARC-AGI-3 levels that opaque models
clear only rarely, but its overall RHAE sits in the same sub-1% regime as
frontier models: it does not beat them. ARC-AGI-3 is hard by design (random
~0, frontier <1%, humans 100%). LUCID's value is transparency, online safety,
and self-containment, with an honest live-API evaluation, not a leaderboard
win. Numbers regenerate via scripts/compute_scores.py.
These are results on the 25 public development games via the live API: a transparent development-set measurement, not the official private leaderboard, and I do not claim to solve ARC-AGI-3. I charge all actions (incl. exploration) to RHAE (conservative), keep the API key out of git, and release every decision trace so the numbers are independently checkable. See the paper's Limitations section.
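
As a worked check of the per-level figure in the table, the 25% reported for VC33 L1 is what a squared human-to-agent action ratio, capped at 1, would give. That formula is an assumption made here because it reproduces the number; `level_efficiency` is a hypothetical helper, and `scripts/compute_scores.py` together with the official ARC-AGI-3 methodology remain the authoritative definition.

```python
def level_efficiency(human_actions: int, agent_actions: int) -> float:
    """Squared human/agent action ratio, capped at 1.0 (assumed formula)."""
    if human_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, (human_actions / agent_actions) ** 2)


# VC33 L1: 14 agent actions against a human baseline of 7.
print(f"{level_efficiency(7, 14):.0%}")  # 25%
```

Under this reading, every exploratory action counted toward `agent_actions` lowers the score quadratically, which is why charging exploration to RHAE is the conservative choice.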
```bibtex
@misc{debes2026lucid,
  title  = {LUCID: A Glass-Box, Safety-Shielded Agent for Open-World Interactive
            Reasoning on ARC-AGI-3},
  author = {Debes, Anwar},
  year   = {2026},
  note   = {Reference implementation, v1.0,
            \url{https://github.com/AnwarDebes/lucid}}
}
```

MIT. See LICENSE.