research build log · ARC-AGI-3 interactive grids

An agent that writes code to play a game it has never seen.

ARC-AGI-3 games are 64×64 grids with hidden rules; frontier LLMs score under 1%. This is the honest log of a minimal agent that plays by writing Python against the frame — plus a “sleep” phase that mines reusable skills and confirms them by letting the game itself be the judge. It is built, it is genuinely code-native, and it does not yet solve a level from scratch. Each failure taught the next fix.

status: build log · in progress

The whole thing is one loop, repeated until the level is won or the budget runs out: the agent observes a frame (numbers + a rendered image), writes a block of Python that decodes and probes the board, the sandbox executes it (taking up to three real actions), and the resulting stdout + a fresh image come back. No orchestrator, no subagents, no world-model files. The palette below is the only color on this page — it appears solely inside the rendered grids, because the grids are the game.

ARC-16 palette indices: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
01

Baselines — two code-native ancestors

where we forked from

Our agent is a compression of two existing systems. Both share one decisive trait, and that shared trait is the whole premise: neither plays from raw text. The model writes Python against a numpy frame, sees the board as an image, and calls submit_action.

symbolica — an orchestrator that spawns specialists

Our engine/ is a fork of symbolica-ai/ARC-AGI-3-Agents. An orchestrator LLM is told “you are a manager, not a player” and dynamically spawns specialized subagents: an explorer (pokes & diffs), a theorist (text-only reasoning), a tester (tight-budget hypothesis check), and a solver (executes), all over a shared memories store (add/query). It is a Python REPL: the model writes free-form Python in a namespace holding frame (numpy, with .find/.diff/.change_summary), submit_action, spawn_agent, and memories. It renders the grid and looks at it. Reported ft09 = 77.59%.

Diagram 2 · symbolica. One orchestrator (“manager, not a player”) dynamically spawns four specialized subagents (explorer, theorist, tester, solver) via spawn_agent(...) over a shared memories store (add/query) and a common REPL base (frame + submit_action). Heavy: thousands of lines.

astroseger — a state machine wrapping one Codex agent

The second baseline (arc-3-agents-baseline1) is a Python state machine — protocols normal / stuck / reset / trouble — wrapping a single Codex CLI agent (no subagents). Codex works in a workspace: it loads frames as numpy via session_tools, then writes an executable world_model_engine.py (a transition function state, action → new_state), a planner, and a world_model.md, verifying with ASCII and seeing with PNG.

Diagram 3 · astroseger. A four-protocol state machine (normal / stuck / reset / trouble) drives a single Codex CLI agent that writes executable world-model files (world_model_engine.py, planner, world_model.md), fed by ASCII (verify) + PNG (see). Heavy: ~2K lines + a Codex CLI.
the common foundation (the key point)
Both ancestors are code-native: the model writes Python against a numpy frame (find/diff), sees the board image, and calls submit_action. Neither plays from raw text. That is the one thing we keep.
02

Method — one autonomous CodeAct loop

~921 lines, every file < 500

We strip both ancestors down to their shared core. There is a single actor agent: no orchestrator, no subagents, no world-model files. One round is exactly this: we send the model [system prompt + conversation so far + latest "output:<stdout>, state, levels, available" + the board as a PNG]; it replies with a Python code block; we extract and run it in a restricted sandbox; the code’s stdout plus a fresh board PNG are fed back; repeat until win or budget.
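
The round described above can be sketched in a few lines. This is a minimal illustration, not the harness itself: `ask_model`, `is_won`, `play`, and `run_block` are hypothetical names, the image channel is left out, and the model call is a plain callable so the sketch runs without an API.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built programmatically so the sketch never contains a literal fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply):
    """Return the body of the first fenced code block in a model reply, or None."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None


def run_block(code, namespace):
    """Execute one block and capture its stdout; errors come back as text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:  # feed the error back instead of crashing the loop
        buffer.write(f"{type(exc).__name__}: {exc}\n")
    return buffer.getvalue()


def play(ask_model, namespace, is_won, max_rounds=10):
    """Repeat: model writes code -> we run it -> stdout is fed back, until win or budget."""
    conversation = []
    output = ""
    for round_no in range(1, max_rounds + 1):
        reply = ask_model(conversation, output)
        code = extract_code(reply)
        output = run_block(code, namespace) if code else "no code block found\n"
        conversation.append((reply, output))
        if is_won(namespace):
            return round_no, conversation
    return None, conversation
```

Because the namespace dict is reused across rounds, anything the model stores in it (such as a notes list) persists from one block to the next.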

Diagram 1 · our CodeAct loop. A single actor: observe (frame numbers + PNG image, plus state / levels / available) → model writes a run_python code block → sandbox executes it (submit_action, capped at 3 real actions per block to force a re-check) → stdout + a fresh board PNG feed back, repeating until WIN or budget. The boxed sandbox namespace is the agent’s entire API.

What lives in the sandbox

The namespace is the whole interface:

  • frame: a 64×64 numpy array of ints 0–15, with helpers .find(*v), .diff(), .change_summary(), .color_counts() and attributes .state / .levels_completed / .available.
  • submit_action(name, x=0, y=0): performs one real action and returns the new frame. It is budget-limited and capped at 3 calls per block to force a re-check.
  • np: numpy.
  • notes: a list that persists across rounds.

The action set is ACTION1 = Up, ACTION2 = Down, ACTION3 = Left, ACTION4 = Right, ACTION5 = interact, ACTION6 = CLICK(x, y), ACTION7 = undo, plus RESET. A fork() copy was tried in the actor's namespace and removed (see Testing).
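
To make the namespace concrete, here is a hedged sketch of what such helpers could look like. The helper names come from the list above, but their exact semantics and return types are assumptions, and `Frame`, `previous`, and `make_capped_submit` are hypothetical stand-ins rather than the harness code.

```python
import numpy as np


class Frame:
    """Hypothetical stand-in for the sandbox's frame object."""

    def __init__(self, grid, previous=None, state="NOT_FINISHED", levels_completed=0):
        self.grid = np.asarray(grid, dtype=int)
        self.previous = previous  # grid before the last action, if any
        self.state = state
        self.levels_completed = levels_completed

    def find(self, *values):
        """(row, col) coordinates of every cell holding any of the given colors."""
        return np.argwhere(np.isin(self.grid, values))

    def diff(self):
        """(row, col) coordinates of cells that changed since the previous grid."""
        if self.previous is None:
            return np.empty((0, 2), dtype=int)
        return np.argwhere(self.grid != self.previous)

    def color_counts(self):
        """Map each color present on the board to its cell count."""
        values, counts = np.unique(self.grid, return_counts=True)
        return {int(v): int(c) for v, c in zip(values, counts)}

    def change_summary(self):
        """One-line description of what the last action changed."""
        changed = self.diff()
        if len(changed) == 0:
            return "no change"
        (r0, c0), (r1, c1) = changed.min(axis=0), changed.max(axis=0)
        return f"{len(changed)} cells changed in rows {r0}-{r1}, cols {c0}-{c1}"


def make_capped_submit(real_submit, cap=3):
    """Wrap submit_action so one code block can take at most `cap` real actions."""
    used = 0

    def submit_action(name, x=0, y=0):
        nonlocal used
        if used >= cap:
            raise RuntimeError(f"action cap reached ({cap} per block); re-check the board")
        used += 1
        return real_submit(name, x, y)

    return submit_action
```

In this sketch the harness would build a fresh capped wrapper for every code block, so the counter resets each round and the model has to look at the new frame before acting again.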

Real grids the agent decodes in code (not by eye)

ls20·L2 stuck frame. The orange band (12) along the bottom is the salient-but-wrong feature the text-IO agent fixated on; the real target is the green socket (14) ringed in magenta (6).
ft09 board. The yellow bar (11) up top is a countdown that ACTION1–5 deplete (spent = grey 1) → GAME_OVER; the red agent (8) must reach the sky cell (10) on the purple pad (15).

(Rendered as CSS-grid colored cells from the ARC-16 palette — no image files. The live games are 64×64; these 16×16 crops keep the structure legible.)

Real code from a trace

From an actual run (the fork() era, rung 2 in Testing), the model wrote a forking comparator and a region renderer, decoding the grid entirely in code via frame.find / frame.diff / np:

run_python — actor trace (verbatim)
# compare two candidate action sequences on throwaway forks
def compare_sequences(seqA, seqB):
    for tag, seq in (("A", seqA), ("B", seqB)):
        g = game.fork()
        for a in seq:
            f = g.submit_action(a)
        before = frame.grid
        after  = f.grid
        changed = np.argwhere(before != after)
        print(tag, "changed_cells=", len(changed),
              "levels=", f.levels_completed)

# custom region renderer to read a sub-window of the board
def show(r0, r1, c0, c1):
    for r in range(r0, r1):
        print("".join(str(frame.grid[r, c]) for c in range(c0, c1)))

compare_sequences([1,1,5], [4,4,5])
show(10, 16, 0, 16)

This is the proof the agent is genuinely code-native: it reads the board with np.argwhere and slicing, not by eyeballing a text dump.

Size — a compression, not a new mountain

About 921 lines total. The per-file counts are harness 222, sandbox 116, frame 102, run 56, game 155, and sleep 256, which together account for 907 of them. Every file is under 500 lines. We kept the code-native foundation (model writes Python, numpy frame, image, submit_action) and dropped symbolica’s orchestrator + subagents and astroseger’s world-model files + state machine.

03

Sleep — the novelty: an LLM-free skill gate

the game is the judge

When the actor is stuck (no level gain for K rounds), the harness runs sleep. A miner prompt reads the trace + notes and proposes up to three candidate Skill{when, do, expect}. Then execute_and_score runs each skill’s do on a throwaway game.fork() and confirms it iff fork.levels_completed increases: there is no LLM and no privileged oracle in the gate. The game’s own level-completion is the judge. Confirmed skills are committed and injected into the actor’s context.
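
The stuck trigger is a simple window check. A minimal sketch, assuming the harness records levels_completed once per round; `is_stuck` and `levels_history` are hypothetical names, and the value of K is not stated in this log.

```python
def is_stuck(levels_history, k):
    """True when levels_completed has not risen over the last k rounds.

    levels_history holds one levels_completed reading per round, oldest first.
    """
    if len(levels_history) <= k:
        return False  # fewer than k completed rounds to compare against
    return levels_history[-1] <= levels_history[-1 - k]
```

With k = 3, a history of [0, 0, 0, 0] is stuck, while [0, 0, 1, 1] is not, because a level was gained inside the window.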

Diagram 4 · sleep. actor plays → [stuck for K rounds?] → mine (≤ 3 skills {when, do, expect}) → fork · execute · judge (run ‘do’ on game.fork(); commit only if the fork’s real levels_completed rises) → if yes, inject confirmed skills back into the actor’s context; if no, loop. The gate has no LLM and no oracle.
U2 — the decisive LLM-free result
On the real ls20-L2 stuck frame, execute_and_score confirms a hand-written true-mechanic 48-action clearing plan (fork levels_completed 2→3) and rejects a “stencil” wiggle skill (levels stay at 2). The stencil’s grid changed on every step — a dense signal of 1.0 — yet it is rejected, because the gate is level-completion, not grid-change. This is the direct fix for the old LLM-judge that committed a plausible-but-wrong skill at 0.88 confidence with 0 real value.

A Skill is just {when: the trigger, do: an action sequence, expect: the predicted change}. Grounding it in the game’s own success signal is what makes a confirmed skill trustworthy.
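
The gate itself fits in a few lines. The sketch below assumes a game object with fork(), submit_action(), and levels_completed; `ToyGame` is a hypothetical stand-in invented for illustration, and this `execute_and_score` shows the idea rather than reproducing the harness function of the same name.

```python
import copy
from dataclasses import dataclass


@dataclass
class Skill:
    when: str    # the trigger
    do: list     # an action sequence
    expect: str  # the predicted change


class ToyGame:
    """Hypothetical stand-in: a level is cleared once the marker reaches `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0
        self.levels_completed = 2

    def fork(self):
        return copy.deepcopy(self)  # throwaway copy; the real game is untouched

    def submit_action(self, name):
        # every action moves the marker, but only reaching the goal clears a level
        self.position += 1 if name == "ACTION4" else -1
        if self.position == self.goal:
            self.levels_completed += 1
            self.position = 0


def execute_and_score(game, skill):
    """Confirm a skill iff running its `do` on a fork raises levels_completed."""
    fork = game.fork()
    before = fork.levels_completed
    for action in skill.do:
        fork.submit_action(action)
    return fork.levels_completed > before
```

A skill that wiggles back and forth changes the toy board on every step yet is rejected, while a skill that actually reaches the goal is confirmed. Neither outcome touches the real game, because all actions land on the fork.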

04

Testing — the honest failure-mode chain

the heart of the log

Each rung below is a fix that revealed the next failure. Nothing here is a finished result; it is a detective story where every fix was real and every remaining gap is named.

Diagram 5 · the failure ladder. (1) TEXT-IO: 124 actions, 0 levels, “can’t see”. (2) CODE-NATIVE + free fork(): decodes perfectly, never commits, “fork-paralysis”. (3) NO-FORK + action-pressure: commits 0→50→150, 0 levels, “can’t find the win”. (4) + IMAGE (PNG) + action-set: does seeing reveal the visual win? Running. Each rung is a fix that exposed the next failure. ✗ = failed and superseded, ⌛ = currently running. Detail for each rung below.
  1. TEXT-IO ✗ fail

    Hex grid dumped as text; the model eyeballs it. 124 actions, 0 levels on ls20. It fixated on a salient-but-wrong “bottom band” and built a wrong text theory.

    Verdict: “can’t see.”

  2. CODE-NATIVE + free fork() ✗ fail

    The model decodes perfectly in code (find/diff/compare_sequences) but runs every experiment on throwaway forks and never commits a real action → 0 real actions, 0 levels, on both ft09 and ls20. “fork-paralysis.”

    Verdict: “sees but never commits.”

  3. NO-FORK + action-pressure prompt ✗ fail

    Now it commits real actions (0 → 50 → 150) but still 0 levels. It understands ft09 (ACTION1–5 deplete a countdown bar → GAME_OVER at step 32, which it correctly avoids) yet cannot find the winning action, and commits depleting actions → dies.

    Verdict: “acts but can’t find the win.”

  4. + IMAGE (PNG) + action-set + simpler loop ⌛ running

    Adds the rendered PNG, the explicit action set, and a simpler code-block-in-text loop (not function-calling). Testing whether seeing the board reveals the visual win.

    Verdict: in progress.

58 unit tests green (frame, sandbox, harness, judge, sleep, control). Sandbox safety is live-verified — import / os / eval are all blocked. U2 (above) is the LLM-free decisive test.

sandbox safety — live-verified blocks
# all three raise inside the restricted sandbox
import os            # -> blocked (no __import__)
os.system("ls")      # -> blocked (name 'os' undefined)
eval("1+1")         # -> blocked (eval not in namespace)
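
One way to get exactly these three failures is to exec the model's code with a whitelisted builtins dict. This is a sketch under that assumption and is not claimed to be how the harness sandbox is built; `SAFE_BUILTINS` and `run_restricted` are hypothetical names.

```python
# Assumption: only these builtins are exposed; everything else is simply absent.
SAFE_BUILTINS = {"print": print, "len": len, "range": range, "str": str}


def run_restricted(code, extra_names=None):
    """Exec code with whitelisted builtins; return the exception type name, or None."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    namespace.update(extra_names or {})
    try:
        exec(code, namespace)
    except Exception as exc:
        return type(exc).__name__
    return None
```

With no __import__ in the builtins dict, an import statement raises ImportError, and names such as os or eval raise NameError because they were never placed in the namespace. A builtins whitelist on its own is generally not treated as a hard security boundary, so this only illustrates the blocking behaviour listed above.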
05

Results — honest current state

proven vs open

✓ proven

  • The agent is genuinely code-native — it decodes the grid in code, not by eye.
  • The execution-grounded sleep judge is LLM-free (U2: confirms the true-mechanic skill, rejects the stencil).
  • fork-paralysis diagnosed and fixed.
  • 58 unit tests green; sandbox safety live-verified.
  • The whole harness is ~921 lines (vs symbolica’s thousands / astroseger ~2K + Codex).

□ open

  • It does NOT yet solve ft09 level 0 from scratch — it understands the game but can’t find the win without vision; the image test is ⌛ running.
  • It is slow (LLM latency × many rounds).
  • A strategic fork ahead: a stronger model (gpt-5.5 was down; we ran the weaker gpt-5.2/5.4) vs. adding structure (a built world-model like astroseger’s).

The three systems, side by side

| | symbolica | astroseger | ours |
|---|---|---|---|
| structure | orchestrator + 4 dynamic subagents | state machine + 1 Codex agent | single actor, no orchestrator |
| IO | Python REPL, free-form | workspace + session_tools | code-block-in-text round loop |
| code | writes Python on a numpy frame | writes world_model_engine.py | writes Python on a numpy frame |
| image | renders + looks (PNG) | PNG (see) + ASCII (verify) | PNG every round |
| memory | shared memories add/query | world_model.md + planner files | notes list + confirmed skills |
| size | thousands of lines | ~2K lines + Codex CLI | ~921 lines, every file < 500 |

The point of the table is not that smaller is better — it is that the entire code-native foundation survives the compression, while the orchestration, subagents, state machine, and world-model files turn out to be optional scaffolding we could drop and still decode the board.

honest bottom line
Built, code-native, and LLM-free where it counts (the skill gate). It does not yet win a level from scratch. The next bit hangs on whether seeing closes the gap between “understands the game” and “finds the win” — that test is running now.