research build log · ARC-AGI-3 interactive grids
ARC-AGI-3 games are 64×64 grids with hidden rules; frontier LLMs score under 1%. This is the honest log of a minimal agent that plays by writing Python against the frame — plus a “sleep” phase that mines reusable skills and confirms them by letting the game itself be the judge. It is built, it is genuinely code-native, and it does not yet solve a level from scratch. Each failure taught the next fix.
status: build log · in progress
The whole thing is one loop, repeated until the level is won or the budget runs out: the agent observes a frame (numbers + a rendered image), writes a block of Python that decodes and probes the board, the sandbox executes it (taking up to three real actions), and the resulting stdout + a fresh image come back. No orchestrator, no subagents, no world-model files. The palette below is the only color on this page: it appears solely inside the rendered grids, because the grids are the game.
Our agent is a compression of two existing systems. Both share one decisive trait, and that
shared trait is the whole premise: neither plays from raw text. The model writes
Python against a numpy frame, sees the board as an image, and calls submit_action.
Our engine/ is a fork of symbolica-ai/ARC-AGI-3-Agents.
An orchestrator LLM is told “you are a manager, not a player” and
dynamically spawns specialized subagents: an explorer (pokes & diffs),
a theorist (text-only reasoning), a tester (tight-budget hypothesis check),
and a solver (executes), all over a shared memories store
(add/query). It is a Python REPL: the model writes free-form
Python in a namespace holding frame (numpy, with
.find/.diff/.change_summary),
submit_action, spawn_agent, and memories.
It renders the grid and looks at it. Reported ft09 = 77.59%.
The second baseline (arc-3-agents-baseline1) is a Python
state machine — protocols normal / stuck /
reset / trouble — wrapping a single Codex CLI agent
(no subagents). Codex works in a workspace: it loads frames as numpy via
session_tools, then writes an executable
world_model_engine.py (a transition function
state, action → new_state), a planner, and a
world_model.md, verifying with ASCII and seeing with PNG.
Common to both: the model writes Python against a numpy frame (find/diff), sees the board image,
and calls submit_action. Neither plays from raw text.
That is the one thing we keep.
We strip both ancestors down to their shared core. There is a single actor agent:
no orchestrator, no subagents, no world-model files. One round is exactly this: we send the model
[system prompt + conversation so far + latest "output:<stdout>, state, levels, available" + the board as a PNG];
it replies with a Python code block; we extract and run it in a restricted sandbox; the code’s
stdout plus a fresh board PNG are fed back; repeat until win or budget.
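The round described above can be sketched roughly as follows. This is a minimal illustration, not the harness itself: `extract_code`, `play`, the observation dict keys, and the `"WIN"` state label are assumptions made for the example, the model and the sandbox are passed in as plain callables, and the board PNG is left out.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # built at runtime so this example never contains a literal fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> Optional[str]:
    """Return the body of the first fenced code block in a model reply, or None."""
    match = CODE_BLOCK.search(reply)
    return match.group(1).strip() if match else None


def play(model: Callable[[list], str],
         run_block: Callable[[str], dict],
         max_rounds: int = 10) -> dict:
    """One actor, one loop. `model` maps the conversation to reply text;
    `run_block` executes code in the sandbox and returns the next observation."""
    conversation = []
    obs = {"output": "", "state": "NOT_FINISHED", "levels": 0}
    for _ in range(max_rounds):
        conversation.append(("user", obs))      # the real harness also attaches the board PNG
        reply = model(conversation)
        conversation.append(("assistant", reply))
        code = extract_code(reply)
        if code is None:
            # No runnable block: tell the model and try again next round.
            obs = dict(obs, output="no code block found in reply")
            continue
        obs = run_block(code)
        if obs["state"] == "WIN":               # assumed label for a won game
            break
    return obs
```

Keeping the model and the sandbox as injected callables means the loop can be exercised with stubs, without an LLM or a live game.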
The namespace is the whole interface: frame — a numpy 64×64 array of
ints 0–15 with helpers .find(*v), .diff(),
.change_summary(), .color_counts() and attributes
.state / .levels_completed / .available;
submit_action(name, x=0, y=0) which performs one real action
and returns the new frame (budget-limited; capped at 3 per block to force a re-check);
np; and notes, a list that persists across rounds.
The action set is ACTION1=Up, 2=Down, 3=Left, 4=Right, 5=interact, 6=CLICK(x,y), 7=undo, RESET.
A fork() copy was tried and removed — see Testing.
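To make the frame helpers concrete, here is a small sketch of what such an object could look like. The class below is a hypothetical stand-in written for this log, not the engine's implementation: the `previous` field and the exact return types are assumptions, and only the helper names (`find`, `diff`, `change_summary`, `color_counts`) come from the description above.

```python
import numpy as np


class Frame:
    """Hypothetical stand-in for the frame object: an int grid plus helpers."""

    def __init__(self, grid, previous=None):
        self.grid = np.asarray(grid, dtype=np.int8)
        self.previous = previous  # grid before the last action, if known

    def find(self, *values):
        """(row, col) coordinates of every cell whose color is in `values`."""
        return np.argwhere(np.isin(self.grid, values))

    def diff(self):
        """Coordinates of cells that changed since the previous grid."""
        if self.previous is None:
            return np.empty((0, 2), dtype=np.intp)
        return np.argwhere(self.grid != self.previous)

    def color_counts(self):
        """Map color -> number of cells, for the colors present on the board."""
        colors, counts = np.unique(self.grid, return_counts=True)
        return {int(c): int(n) for c, n in zip(colors, counts)}

    def change_summary(self):
        """One-line description of the bounding box of all changed cells."""
        changed = self.diff()
        if len(changed) == 0:
            return "no change"
        r0, c0 = (int(v) for v in changed.min(axis=0))
        r1, c1 = (int(v) for v in changed.max(axis=0))
        return f"{len(changed)} cells changed in rows {r0}-{r1}, cols {c0}-{c1}"


# Usage: two cells flip from color 0 to color 3 between frames.
before = np.zeros((64, 64), dtype=np.int8)
after = before.copy()
after[10, 5] = 3
after[12, 7] = 3
frame = Frame(after, previous=before)
print(frame.find(3).tolist())
print(frame.change_summary())
```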
(Rendered as CSS-grid colored cells from the ARC-16 palette — no image files. The live games are 64×64; these 16×16 crops keep the structure legible.)
From an actual run, the model wrote a forking comparator and a region renderer — decoding the
grid entirely in code via frame.find / frame.diff / np:
```python
# compare two candidate action sequences on throwaway forks
def compare_sequences(seqA, seqB):
    for tag, seq in (("A", seqA), ("B", seqB)):
        g = game.fork()
        for a in seq:
            f = g.submit_action(a)
        before = frame.grid
        after = f.grid
        changed = np.argwhere(before != after)
        print(tag, "changed_cells=", len(changed), "levels=", f.levels_completed)

# custom region renderer to read a sub-window of the board
def show(r0, r1, c0, c1):
    for r in range(r0, r1):
        print("".join(str(frame.grid[r, c]) for c in range(c0, c1)))

compare_sequences([1,1,5], [4,4,5])
show(10, 16, 0, 16)
```
This is the proof the agent is genuinely code-native: it reads the board with
np.argwhere and slicing, not by eyeballing a text dump.
About 921 lines total. The six main files are harness 222, sandbox 116, frame 102, run 56, game 155, and sleep 256 (907 combined).
Every file is under 500 lines. We kept the code-native foundation (model writes Python,
numpy frame, image, submit_action) and dropped symbolica’s
orchestrator + subagents and astroseger’s world-model files + state machine.
When the actor is stuck (no level gain for K rounds), the harness runs
sleep. A miner prompt reads the trace + notes and proposes up to three candidate
Skill{when, do, expect}. Then execute_and_score runs
each skill’s do on a throwaway game.fork() and
confirms it iff fork.levels_completed increases —
no LLM and no privileged oracle in the gate. The game’s own level-completion is
the judge. Confirmed skills are committed and injected into the actor’s context.
execute_and_score
confirms a hand-written true-mechanic 48-action clearing plan
(fork levels_completed 2→3) and rejects a “stencil”
wiggle skill (levels stay at 2). The stencil’s grid changed on every step — a dense
signal of 1.0 — yet it is rejected, because the gate is level-completion,
not grid-change. This is the direct fix for the old LLM-judge that committed a
plausible-but-wrong skill at 0.88 confidence with 0 real value.
A Skill is just {when: the trigger,
do: an action sequence, expect: the predicted change}.
Grounding it in the game’s own success signal is what makes a confirmed skill trustworthy.
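The gate can be illustrated with a toy example. Everything below is a sketch under stated assumptions: `ToyGame` and its win rule are invented for the illustration, and this `execute_and_score` only mirrors the described behavior (run `do` on a fork, confirm iff `levels_completed` increases), not the actual function in sleep.

```python
import copy
from dataclasses import dataclass


@dataclass
class Skill:
    when: str    # the trigger
    do: list     # an action sequence
    expect: str  # the predicted change


@dataclass
class ToyGame:
    """Hypothetical game: action 5 completes a level only at position 3."""
    position: int = 0
    levels_completed: int = 0

    def fork(self):
        return copy.deepcopy(self)  # throwaway copy; the real game is untouched

    def submit_action(self, action):
        if action == 4:                                  # move right
            self.position += 1
        elif action == 3:                                # move left
            self.position = max(0, self.position - 1)
        elif action == 5 and self.position == 3:         # interact at the goal
            self.levels_completed += 1
            self.position = 0
        return self


def execute_and_score(game, skill):
    """Confirm a skill iff running it on a fork raises levels_completed."""
    fork = game.fork()
    start = fork.levels_completed
    for action in skill.do:
        fork.submit_action(action)
    return fork.levels_completed > start


game = ToyGame()
real = Skill(when="at start", do=[4, 4, 4, 5], expect="level completes")
wiggle = Skill(when="at start", do=[4, 3, 4, 3], expect="board changes")
print(execute_and_score(game, real), execute_and_score(game, wiggle))
```

The wiggle skill changes the toy state on every step yet is rejected, which is the same distinction as the stencil case: the gate scores level completion, not activity.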
Each rung below is a fix that revealed the next failure. Nothing here is a finished result; it is a detective story where every fix was real and every remaining gap is named.
Hex grid dumped as text; the model eyeballs it. 124 actions, 0 levels on ls20. It fixated on a salient-but-wrong “bottom band” and built a wrong text theory.
Verdict: “can’t see.”
The model decodes perfectly in code (find/diff/compare_sequences)
but runs every experiment on throwaway forks and never commits a real action
→ 0 real actions, 0 levels, on both ft09 and ls20. “fork-paralysis.”
Verdict: “sees but never commits.”
Now it commits real actions (0 → 50 → 150) but still 0 levels. It understands ft09 (ACTION1–5 deplete a countdown bar → GAME_OVER at step 32, which it correctly avoids) yet cannot find the winning action, and commits depleting actions → dies.
Verdict: “acts but can’t find the win.”
Adds the rendered PNG, the explicit action set, and a simpler code-block-in-text loop (not function-calling). Testing whether seeing the board reveals the visual win.
Verdict: in progress.
58 unit tests green (frame, sandbox, harness, judge, sleep, control). Sandbox safety is
live-verified — import / os / eval
are all blocked. U2 (above) is the LLM-free decisive test.
```python
# all three raise inside the restricted sandbox
import os          # -> blocked (no __import__)
os.system("ls")    # -> blocked (name 'os' undefined)
eval("1+1")        # -> blocked (eval not in namespace)
```
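One way to get this behavior is to `exec` the model's code with a whitelisted `__builtins__` mapping. The sketch below is an assumption about how such a sandbox could be built, not the project's sandbox module: `SAFE_BUILTINS`, `run_restricted`, and `CappedAction` are hypothetical names. Restricting builtins this way blocks the obvious routes but is not, on its own, a hardened security boundary.

```python
import contextlib
import io

# Hypothetical whitelist: no __import__, eval, exec, or open.
SAFE_BUILTINS = {
    "print": print, "len": len, "range": range, "str": str, "int": int,
    "min": min, "max": max, "sum": sum, "sorted": sorted, "enumerate": enumerate,
}


class CappedAction:
    """Wrap submit_action so one code block can take at most `cap` real actions."""

    def __init__(self, submit, cap=3):
        self.submit, self.cap, self.used = submit, cap, 0

    def __call__(self, name, x=0, y=0):
        if self.used >= self.cap:
            raise RuntimeError("action cap reached; end the block and re-check the frame")
        self.used += 1
        return self.submit(name, x, y)


def run_restricted(code, namespace):
    """Execute `code` with whitelisted builtins; return captured stdout.
    Exceptions are printed into the captured output instead of propagating."""
    env = {"__builtins__": dict(SAFE_BUILTINS), **namespace}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, env)
        except Exception as exc:
            print(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue()


print(run_restricted("import os", {}), end="")
print(run_restricted("eval('1+1')", {}), end="")
```

Returning the error text as stdout means a blocked call reaches the model as ordinary feedback in the next round rather than crashing the harness.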
| | symbolica | astroseger | ours |
|---|---|---|---|
| structure | orchestrator + 4 dynamic subagents | state machine + 1 Codex agent | single actor, no orchestrator |
| IO | Python REPL, free-form | workspace + session_tools | code-block-in-text round loop |
| code | writes Python on a numpy frame | writes world_model_engine.py | writes Python on a numpy frame |
| image | renders + looks (PNG) | PNG (see) + ASCII (verify) | PNG every round |
| memory | shared memories add/query | world_model.md + planner files | notes list + confirmed skills |
| size | thousands of lines | ~2K lines + Codex CLI | ~921 lines, every file < 500 |
The point of the table is not that smaller is better — it is that the entire code-native foundation survives the compression, while the orchestration, subagents, state machine, and world-model files turn out to be optional scaffolding we could drop and still decode the board.