research build log · ARC-AGI-3 · ft09 (key-pattern click puzzle)

Weeks at 0 levels — a harness bug ate every click.
Then: level 4, RHAE 17.9.

Our code-native agent spent weeks finishing every ft09 run at 0 levels cleared. The model’s reasoning was often right; a harness coordinate bug was silently mangling the ACTION6(x,y) clicks it asked for. This page is the log of what happened once the plumbing was fixed and an execution-grounded sleep loop was bolted on: eight autoresearch fixes later, run win2_s31 clears level 4 with a Relative Human-Action Efficiency of 17.9. Every run below is replayable from raw traces.

21 harvested runs · grid-replay validated
4 levels · best run 17.9 RHAE (win2_s31) · 21 harvested runs · 8 autoresearch fixes · 3/4 of win2’s level-ups happened in-sleep
[Chart: levels cleared per major run]
g1 baseline (bug era): 0
g2 early-ON (bug era): 0
v3 · s2: 1
v5 · s20: 1
v5c · s23: 2
v5d · s24: 3
v5f · s28: 3
win2 · s31: 4

Levels cleared per major run, in chronological order. The first two columns are the wall; each step after them comes from the eight fixes in section 02. What pinned the score at 0 for weeks was plumbing, not reasoning: a single coordinate bug that ate the clicks.

01

How it works — one loop, five absorbed ideas

6 animated diagrams · pure CSS

The mechanism that finally moved levels is diagram (a): wake plays until stuck, sleep mines rival hypotheses, an engine fork executes each one, and only what raises the fork’s real levels_completed gets applied to the live game. Diagrams (b)–(f) are the five systems we strip-mined for it — each reduced to its single load-bearing insight and where it landed in our design.

(a) OUR LOOP — wake → stuck → hypotheses → fork-test → apply ✓ got level 4

[Diagram (a): WAKE (actor writes code, clicks) → STUCK → sleep mines 4 rivals (H1: ACTION5 submits as-is · H2: key is the complement map · H3: 0/2 key sets block colors · H4: copy the example’s ring) → fork · execute · judge (run on game.fork(); confirm iff levels_completed rises) → apply the confirmed recipe on the live game (rule → recipe ✓, LEVEL 0 → 1).]

Stuck is a signal: mine rival hypotheses, let the game itself judge them on a fork, apply only what survives.

Being stuck is a signal: dig up hypotheses, let the game itself referee, and apply to the board only what survives.

this is the loop — every idea in (b)–(f) below was absorbed into one of its parts.

static architecture reference (kept from v1): the CodeAct wake loop + the sleep gate
[Diagram: observe (frame numbers + PNG image; state / levels / available) → model writes a run_python code block → sandbox runs the code via submit_action (≤ 3 real actions per block) → output: <stdout> + fresh board PNG → repeat until WIN or budget. Arrow labels: prompt, extract. Sandbox namespace: frame (.find(*v), .diff(), .change_summary(), .color_counts(), .state, .levels_completed) · submit_action(name, x, y) → new frame · np · notes (list, persists).]
The CodeAct wake loop (v1 diagram, still accurate). A single actor: observe (frame + image) → model writes run_python → sandbox executes → stdout + new image feed back. The boxed sandbox namespace is the agent’s entire API.
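
A minimal sketch of that observe → write code → execute → feed back cycle may make the caption concrete. Only `submit_action`, `notes`, the WIN state, and the three-actions-per-block cap come from the diagram; `ToyGame`, `scripted_model`, `run_block`, `wake_loop`, and the `NOT_FINISHED` state name are hypothetical stand-ins, not the harness's real code.

```python
import contextlib
import io

MAX_ACTIONS_PER_BLOCK = 3  # the "<= 3 real actions / block" cap from the diagram


class ToyGame:
    """Hypothetical stand-in for the real engine: clicking (1, 1) wins."""

    def __init__(self):
        self.state = "NOT_FINISHED"  # assumed state name
        self.levels_completed = 0

    def step(self, name, x, y):
        if name == "ACTION6" and (x, y) == (1, 1):
            self.levels_completed += 1
            self.state = "WIN"
        return {"state": self.state, "levels_completed": self.levels_completed}


def run_block(code, game, notes):
    """Execute one model-written block and return its captured stdout."""
    used = 0

    def submit_action(name, x=0, y=0):
        nonlocal used
        if used >= MAX_ACTIONS_PER_BLOCK:
            raise RuntimeError("action cap reached for this block")
        used += 1
        return game.step(name, x, y)

    namespace = {"submit_action": submit_action, "notes": notes}
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        try:
            exec(code, namespace)
        except Exception as exc:  # errors go back to the model as text
            print(f"error: {exc}")
    return out.getvalue()


def wake_loop(model, game, budget=5):
    """Alternate model turns and sandbox runs until WIN or the budget ends."""
    notes, feedback = [], ""
    for _ in range(budget):
        code = model(feedback, notes)
        feedback = run_block(code, game, notes)
        if game.state == "WIN":
            break
    return feedback, notes


def scripted_model(feedback, notes):
    """Hypothetical model: records a guess first, then clicks it."""
    if not notes:
        return "notes.append((1, 1)); print('probing')"
    return "x, y = notes[0]\nprint(submit_action('ACTION6', x, y))"


game = ToyGame()
last_output, notes = wake_loop(scripted_model, game)
```

The persistent `notes` list is what lets the second block act on what the first one recorded, which is the role the caption assigns to the sandbox namespace.
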
[Diagram: actor plays → sleep? (every 4 rounds) → mine ≤ 4 rival rules {rule, recipe} → fork · execute · judge (run the recipe on game.fork(); confirm iff levels_completed rises) → yes: apply the confirmed recipe to the LIVE game and inject it into the actor context · no: loop back to the actor.]
The sleep gate (v1 diagram, updated labels). Mine ≤4 rival rules → fork-execute-judge (confirm only if the fork’s real level count rises) → auto-apply + inject. The gate has no LLM and no oracle.

(b) Piriyakulkij & Ellis — rules as a posterior

[Diagram: bars r1 to r4 show the posterior over candidate rules; a discriminating experiment moves the mass onto one rule (✓ 0.9).]

“Hypotheses live as a distribution; experiments are chosen to discriminate between them.”

Hypotheses stay alive as a distribution, and experiments are what split that distribution.

absorbed into our design as: each sleep pass proposes ~4 rival rules and fork-tests them against each other — confirm/refute over the set, never a single best guess.
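
One way to read "confirm/refute over the set" quantitatively is as an entropy drop over the rival rules, the quantity the sleep passes log as realized information gain. The sketch below assumes a uniform prior and assumes that refuted rules lose all their mass; the page does not say how the harness actually computes it, and the rule names and verdicts are illustrative only.

```python
import math


def entropy_bits(weights):
    """Shannon entropy in bits of a non-negative weight vector."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def realized_information_gain(prior, verdicts):
    """Entropy drop once fork verdicts are known (assumed update rule).

    prior:    dict of rule name -> prior weight
    verdicts: dict of rule name -> "confirmed", "refuted", or "open"
    """
    if "confirmed" in verdicts.values():
        # A confirmed rule takes all the mass.
        posterior = [w if verdicts[r] == "confirmed" else 0.0
                     for r, w in prior.items()]
    else:
        # Otherwise refuted rules drop out and the rest renormalize.
        posterior = [0.0 if verdicts[r] == "refuted" else w
                     for r, w in prior.items()]
    return entropy_bits(list(prior.values())) - entropy_bits(posterior)


prior = {"r1": 1.0, "r2": 1.0, "r3": 1.0, "r4": 1.0}  # uniform prior (assumption)

ig_confirm = realized_information_gain(
    prior, {"r1": "refuted", "r2": "confirmed", "r3": "refuted", "r4": "refuted"})
ig_partial = realized_information_gain(
    prior, {"r1": "refuted", "r2": "open", "r3": "open", "r4": "open"})
```

Under these assumptions, confirming one of four equally likely rules is worth 2 bits, while a pass that only refutes one rule is worth log2(4/3), about 0.415 bits.
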

(c) Microsoft SkillOpt — the document is the parameter

[Diagram: a rollout yields a diagnosis (“re-clicked cleared cells”); a patch (then patch′) to skill.md passes through a gate before reaching the library.]

“The skill document IS the parameter — optimization edits text, and a validation gate decides what ships.”

The skill document itself is the parameter: learning edits the text, and a gate decides what gets deployed.

absorbed into our design as: confirmed rules are plain-text artifacts, committed only after the execution gate (fork · judge) passes them.

(d) SkillGrad — textual gradients with momentum

[Diagram: loss (failed episode) → momentum buffer holding three textual gradients (∇ submit once · ∇ read key first · ∇ don’t re-click) → merged patch → skill.md. Gradients accumulate; one coherent patch lands, not three twitchy ones.]

“Loss becomes a textual gradient; momentum accumulates gradients before committing one patch.”

Failure becomes a textual gradient, and the gradients pile up like momentum into a single patch.

absorbed into our design as: refuted rules + hindsight credit accumulate as revision seeds across sleep passes, instead of each pass overwriting the last.

(e) symbolica bestiary — a posterior on every record

[Diagram: three records, each with a posterior gauge: “key 0↦8, 2↦9 (colors)” · “ACTION5 resets the timer” · “ring copies the example”; plus a quarantine area. Every record carries a posterior gauge; when it decays, the record is quarantined, not deleted.]

“A posterior on every piece of knowledge — and quarantine, not deletion, for what decays.”

Attach a posterior to every piece of knowledge; when a record looks doubtful, quarantine it instead of deleting it.

absorbed into our design as: a confirmed/refuted status on every mined rule; refuted rules stay in the ledger so the miner cannot re-propose them.
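
As an illustration of "refuted rules stay in the ledger", here is a small sketch of a ledger that filters re-proposals. `RuleLedger`, its methods, and the normalization step are hypothetical; the page only states that every mined rule carries a confirmed/refuted status and that refuted rules cannot be re-proposed. The rule texts are made up for the example.

```python
class RuleLedger:
    """Per-level record of mined rules and their verdicts (hypothetical)."""

    def __init__(self):
        self.status = {}  # normalized rule text -> "confirmed" or "refuted"

    @staticmethod
    def _key(rule):
        # Collapse case and whitespace so trivial rewordings share one entry.
        return " ".join(rule.lower().split())

    def record(self, rule, verdict):
        if verdict not in ("confirmed", "refuted"):
            raise ValueError(f"unknown verdict: {verdict}")
        self.status[self._key(rule)] = verdict

    def admit(self, proposals):
        """Drop proposals that were already refuted; keep everything else."""
        return [r for r in proposals
                if self.status.get(self._key(r)) != "refuted"]


ledger = RuleLedger()
ledger.record("Clicking the corner clears the board", "refuted")
ledger.record("The key row sets block colors", "confirmed")

fresh = ledger.admit([
    "clicking the  corner clears the BOARD",   # reworded repeat of a refuted rule
    "The ring copies the example",             # genuinely new
])
```

Exact-text matching after normalization is the simplest possible policy; a real miner would probably need a looser notion of "same rule", which this sketch does not attempt.
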

lineage (v1 diagram): the symbolica orchestrator we forked our engine from
[Diagram: an orchestrator LLM (“manager, not a player”) calls spawn_agent(...) to create four subagents: explorer (pokes + diffs), theorist (text-only reasoning), tester (tight-budget check), solver (executes actions). Shared: memories (add/query) and a REPL base with frame (numpy) and submit_action.]
symbolica (lineage). One orchestrator dynamically spawns four specialized subagents over a shared memories store and a common REPL base. We kept the code-native REPL foundation and dropped the orchestration; what survived conceptually is the posterior-per-record idea above.

(f) astroseger — the runnable world model

[Diagram: world_model.py → predict → predicted vs observed → on mismatch, refactor and re-predict until they match (✓). The world model is code: run it, diff it against reality, fix it.]

“Make the world model runnable — predict, diff against reality, refactor.”

A world model should be code that runs, not a document that gets read.

absorbed into our design as: the engine fork is our runnable world model; recipes are validated by execution, and unexplained OBSERVED transitions become revision seeds (fix #8).
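
The predict, diff, revise idea can be sketched in a few lines: run a candidate rule as code, compare its prediction with what the game actually did, and keep the unexplained transitions as revision seeds. Everything here is a toy under stated assumptions: boards are flat tuples, `toggle_rule` is an invented rule, and the seed format is hypothetical rather than the harness's ledger schema.

```python
def toggle_rule(board, click):
    """Invented candidate rule: a click flips a cell between 0 and 8."""
    cells = list(board)
    cells[click] = 8 if cells[click] == 0 else 0
    return tuple(cells)


def unexplained_transitions(rule, transitions):
    """Return the observed transitions that the rule fails to predict."""
    seeds = []
    for before, click, observed in transitions:
        predicted = rule(before, click)
        if predicted != observed:
            seeds.append({
                "click": click,
                "before": before,
                "predicted": predicted,
                "observed": observed,
            })
    return seeds


# (board before, clicked index, board actually observed afterwards)
transitions = [
    ((0, 0, 0), 1, (0, 8, 0)),  # matches the rule
    ((0, 8, 0), 1, (0, 9, 0)),  # rule predicts 0, the game showed 9
]

seeds = unexplained_transitions(toggle_rule, transitions)
```

Each seed carries both the prediction and the observation, so the next mining pass can see exactly where the rule broke rather than just that it did.
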

lineage (v1 diagram): astroseger’s state machine + Codex world-model files
[Diagram: a state machine (normal, stuck, reset, trouble) drives a single Codex CLI agent, which writes world_model_engine.py, planner, and world_model.md. Inputs: ASCII (verify) + PNG (see).]
astroseger (lineage). A four-protocol state machine drives a single Codex agent that writes executable world-model files. We dropped the state machine and the files; what survived is the runnable-world-model insight — our fork plays that role with zero extra files.
technical detail: what exactly a sleep pass does

Every 4 wake rounds the harness runs a sleep pass: a miner prompt reads the trace + notes + the per-level ledger (observed transitions, refuted rules, confirmed mechanics from earlier levels) and proposes up to 4 rival {rule, recipe} candidates. execute_and_score runs each recipe on a throwaway game.fork() and confirms it iff fork.levels_completed rises — no LLM and no privileged oracle in the gate. Confirmed recipes are applied to the live game and injected into the actor’s context; each pass logs realized information gain (entropy drop over the rule set) and a [confirmed]/[refuted] verdict per rule. See the amber cards in the run viewer for real passes.
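
The gate described above can be sketched as follows. The names `execute_and_score`, `game.fork()`, `levels_completed`, and the {rule, recipe} shape come from the text; `ToyGame`, its level-clear condition, the `sleep_pass` wrapper, and the choice to apply only the first confirmed recipe are assumptions made to keep the example runnable.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical engine stand-in: clicking (0,0) and (2,2) clears a level."""
    levels_completed: int = 0
    clicked: set = field(default_factory=set)

    def fork(self):
        # A throwaway copy; nothing done to it touches the live game.
        return copy.deepcopy(self)

    def submit_action(self, name, x=0, y=0):
        if name == "ACTION6":
            self.clicked.add((x, y))
            if {(0, 0), (2, 2)} <= self.clicked:
                self.levels_completed += 1
                self.clicked.clear()


def execute_and_score(game, recipe):
    """Confirm a recipe iff it raises levels_completed on a fork."""
    fork = game.fork()
    before = fork.levels_completed
    for action in recipe:
        fork.submit_action(*action)
    return fork.levels_completed > before


def sleep_pass(game, candidates, max_rivals=4):
    """Judge up to max_rivals candidates on forks, then apply one winner."""
    verdicts = {}
    winner = None
    for cand in candidates[:max_rivals]:
        ok = execute_and_score(game, cand["recipe"])
        verdicts[cand["rule"]] = "confirmed" if ok else "refuted"
        if ok and winner is None:
            winner = cand
    if winner is not None:
        for action in winner["recipe"]:  # applied to the LIVE game
            game.submit_action(*action)
    return verdicts


live = ToyGame()
verdicts = sleep_pass(live, [
    {"rule": "center click clears", "recipe": [("ACTION6", 1, 1)]},
    {"rule": "both corners clear", "recipe": [("ACTION6", 0, 0), ("ACTION6", 2, 2)]},
])
```

Note that the judge is a plain integer comparison on the fork, which is what "no LLM and no privileged oracle in the gate" amounts to in this sketch.
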

02

Fix timeline — eight autoresearch fixes, before/after

every chart = a real run from results.json

Each fix below came out of the autoresearch loop: a run failed in a specific way, the failure was diagnosed from the trace, one change shipped, and the next run tells you whether it mattered. The mini-charts are level-over-rounds trajectories of real harvested runs (before-run vs after-run). Configs evolved between seeds, so read these as before/after snapshots, not controlled A/Bs (see limitations). Eight failures, eight diagnoses, eight fixes; each chart is the level trajectory of a real run.

  1. coord-bug 34bb011

    Problem: the harness re-mapped ACTION6(x,y) before submitting — every click the model reasoned about landed on the wrong cell. Weeks of 0-level runs were plumbing, not reasoning. Fix: pass grid coordinates through to the engine untouched.

  2. exposure → auto-apply 34bb011 (G2 escape runner)

    Problem: confirmed rules were only shown to the wake actor as context (“exposure”), and the weak wake model ignored them. Fix: sleep executes its own confirmed recipe directly on the live game — the first level ever cleared happened during sleep.

  3. evidence-conditioned mining 98574ee

    Problem: the miner free-associated theories disconnected from what the game had actually done. Fix: the miner must condition on a per-level ledger of observed transitions and refuted rules; each pass is scored by realized information gain. (Levels didn’t jump immediately — this fix made passes measurable, which the later fixes needed.)

  4. relational lens miner prompt

    Problem: hypotheses were pixel-local (“cell (12,34) turns red”), but ft09’s mechanic is a relation: a small key region dictates colors of other regions. Fix: the miner prompt explicitly asks for relational/constraint hypotheses across regions.

  5. RESET guard 8efed24

    Problem: the wake actor occasionally issued RESET, which restarts the whole game and erases cleared levels. Fix: RESET is blocked unless the game is actually GAME_OVER.

  6. analogy transfer e1b5030

    Problem: every new level started from zero knowledge, even though ft09 levels repeat the key mechanic with a twist. Fix: confirmed mechanics from cleared levels are injected into the miner as analogy seeds (“last level the key meant X — what does it mean here?”).

  7. noop reset + fresh-strategy apply cc712e7 · 64e4250

    Problem: the v5e collapse — two empty sleep passes at level 0 tripped the noop-guard and silently disabled sleep for the entire 36-round episode; stale strategies from earlier levels were also being re-applied. Fix: the noop-disable resets on level clear (a new level is a new hypothesis space), only this pass’s confirmed strategies are applied, and max_noops 2→4.

  8. observed-residual f9e5d10

    Problem: transitions the game exhibited that no hypothesis predicted (OBSERVED events) were logged and then dropped. Fix: feed OBSERVED click transitions into the next pass’s revision seeds — the missing residual. The next campaign produced the level-4 run.
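
Fixes 5 and 7 are both small guards, and a sketch may help show how they interact. The class and method names below are hypothetical, and two details are assumptions: a sleep pass counts as a noop when it confirms nothing, and the counter is cumulative rather than consecutive. Only the RESET rule, the reset-on-level-clear behavior, and the max_noops values come from the list above.

```python
class SleepGuards:
    """Hypothetical container for the RESET guard and the noop guard."""

    def __init__(self, max_noops=4):
        self.max_noops = max_noops
        self.noops = 0
        self.last_level = 0

    def allow_action(self, name, state):
        # Fix 5: RESET is blocked unless the game is actually GAME_OVER.
        return name != "RESET" or state == "GAME_OVER"

    def sleep_enabled(self):
        return self.noops < self.max_noops

    def after_sleep_pass(self, confirmed_count):
        # Assumption: a pass that confirms nothing counts as a noop.
        if confirmed_count == 0:
            self.noops += 1

    def on_round_end(self, levels_completed):
        # Fix 7: a cleared level opens a new hypothesis space,
        # so the noop counter starts over.
        if levels_completed > self.last_level:
            self.noops = 0
        self.last_level = levels_completed


# Replaying the v5e failure shape with the old limit of 2.
guards = SleepGuards(max_noops=2)
guards.after_sleep_pass(confirmed_count=0)
guards.after_sleep_pass(confirmed_count=0)
disabled_after_two_noops = not guards.sleep_enabled()

guards.on_round_end(levels_completed=1)  # a level clears later in wake
re_enabled_after_clear = guards.sleep_enabled()
```

Without the `on_round_end` reset, the first two lines of the replay are enough to switch sleep off for the rest of the episode, which is the collapse fix 7 describes.
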

03

Run viewer — five runs, round by round

grids replayed in the real engine

Every card below is one wake round: the board (an exact engine replay of the trace, not a screenshot and not a mock), the model’s reasoning and the code it wrote, and level/action badges. Amber cards are sleep passes with their mined rules (confirmed/refuted), OBSERVED transitions, applied recipes, and realized IG. Where a grid is missing, that’s honesty, not laziness: once sleep applies clicks outside the trace, later rounds can’t be replayed exactly, so they are shown blank rather than approximated.

technical detail: replay fidelity & provenance

Grids are produced by replaying each trace’s actions in ttso.core.game Game('ft09'), validating both levels_completed and state per round against the trace — zero mismatches in every replayed round.

  • g1_neutral — FULL (12/12 rounds exact).
  • v5e_s25 — FULL (36/36 rounds exact; no sleep-applies, so the whole noop-guard failure is replayable).
  • g2_on_s2 — exact up to round 8; the first sleep-apply lands clicks outside the trace, so rounds 9–12 show a placeholder with the apply event (“levels 0→1 during sleep”).
  • v5f_s28 — exact up to round 8; applies after rounds 8/12/16.
  • win2_s31 — exact up to round 4; transitions 0→1, 1→2, 2→3 happened in-sleep, 3→4 in wake (round 34), all annotated as events.
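
The replay check behind those "exact up to round N" figures can be sketched like this. The trace schema (a list of rounds with `actions`, `levels_completed`, and `state`) and `ToyGame` are hypothetical; the real traces in data/raw/ and the real `Game('ft09')` interface may differ. What the sketch keeps from the text is the rule: replay each round's actions and compare both `levels_completed` and `state` against the trace.

```python
class ToyGame:
    """Hypothetical engine stand-in: clicking (3, 3) clears a level."""

    def __init__(self):
        self.levels_completed = 0
        self.state = "NOT_FINISHED"  # assumed state name

    def submit_action(self, name, x=0, y=0):
        if name == "ACTION6" and (x, y) == (3, 3):
            self.levels_completed += 1


def validate_replay(trace_rounds, make_game):
    """Replay rounds in order; stop at the first round that disagrees.

    Returns (number of exactly reproduced rounds, total rounds).
    """
    game = make_game()
    exact = 0
    for rnd in trace_rounds:
        for name, x, y in rnd["actions"]:
            game.submit_action(name, x, y)
        replayed = (game.levels_completed, game.state)
        recorded = (rnd["levels_completed"], rnd["state"])
        if replayed != recorded:
            break
        exact += 1
    return exact, len(trace_rounds)


trace = [
    {"actions": [("ACTION6", 1, 1)], "levels_completed": 0, "state": "NOT_FINISHED"},
    {"actions": [("ACTION6", 3, 3)], "levels_completed": 1, "state": "NOT_FINISHED"},
    # Recorded level rose through clicks that are not in the trace
    # (the shape of a sleep-apply), so replay cannot reproduce it.
    {"actions": [("ACTION6", 2, 2)], "levels_completed": 2, "state": "NOT_FINISHED"},
]

exact, total = validate_replay(trace, ToyGame)
```

Stopping at the first disagreement, rather than continuing and counting matches, mirrors the viewer's behavior of showing placeholders for every round after an off-trace apply.
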

Sleep-pass positions (after_round) are inferred from the fixed every-4-rounds cadence and validated against levels_before/levels_after for all 21 runs (strict on levels_before; soft on levels_after because a wake-round GAME_OVER reset can drop the level right after a sleep, as in v5_full_s20). Data: data/results.json (21 runs), raw traces in data/raw/, rebuild script data/build_results.py.
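
A sketch of the strict/soft position check, under assumptions: `levels_by_round` maps each wake round to the level count recorded at its end, and the soft check only flags a drop in the round right after a pass. The field names `levels_before` and `levels_after` come from the text; the function names and data layout are hypothetical, and the real build script may validate differently.

```python
def infer_after_rounds(num_passes, cadence=4):
    """Positions implied by a fixed every-`cadence`-rounds sleep schedule."""
    return [cadence * i for i in range(1, num_passes + 1)]


def validate_positions(levels_by_round, passes, cadence=4):
    """Strict on levels_before, soft on levels_after.

    Raises ValueError on a levels_before mismatch; returns the list of
    after_round positions where the following wake round shows fewer
    levels than the pass reported (for example after a GAME_OVER reset).
    """
    soft_warnings = []
    positions = infer_after_rounds(len(passes), cadence)
    for after_round, p in zip(positions, passes):
        if levels_by_round[after_round] != p["levels_before"]:
            raise ValueError(
                f"pass after round {after_round}: levels_before mismatch")
        next_level = levels_by_round.get(after_round + 1)
        if next_level is not None and next_level < p["levels_after"]:
            soft_warnings.append(after_round)
    return soft_warnings


# Invented example: the second pass reports 1 -> 2, but the level count
# recorded at the end of round 9 has dropped to 0.
levels_by_round = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1, 9: 0}
passes = [
    {"levels_before": 0, "levels_after": 1},
    {"levels_before": 1, "levels_after": 2},
]

warnings = validate_positions(levels_by_round, passes)
```
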

04

RHAE — how efficient, relative to a human?

win2_s31 = 17.9

Clearing levels says whether; RHAE (Relative Human-Action Efficiency) says how wastefully. Per level, compare the agent’s action count to a human baseline; square the ratio so waste hurts quadratically; cap at 115 so a lucky short level can’t dominate; weight later levels more; uncleared levels score 0.

score = min( (baseline / actions)² × 100,  115 )   ·   uncleared ⇒ 0
RHAE = Σ ℓ·score  /  Σ ℓ   (ℓ = 1…6, level-number-weighted)
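| level ℓ | human baseline (actions) | win2_s31 (actions) | score_ℓ | weight ℓ |
|---|---|---|---|---|
| 1 | 15 | 14 | 114.8 | 1 |
| 2 | 7 | 15 | 21.8 | 2 |
| 3 | 15 | 24 | 39.1 | 3 |
| 4 | 16 | 32 | 25.0 | 4 |
| 5 | 21 | uncleared | 0 | 5 |
| 6 | 17 | uncleared | 0 | 6 |
| RHAE | (114.8 + 2·21.8 + 3·39.1 + 4·25.0) / 21 = 17.9 | | | Σ = 21 |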

Reading: level 1 is essentially human-efficient (14 actions vs the 15-action baseline); level 2 is the ugliest (15 actions vs a human’s 7, because the agent re-derives the key instead of recognizing the repeat); and everything from level 5 up is simply unreached. 17.9 is a low score by construction, and that is the point of publishing it: the metric is designed to come out low, and it is reported as is.
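
The arithmetic can be reproduced directly from the formula and the table. The function names are hypothetical and `None` is used here to mark an uncleared level; the baselines, action counts, cap, and weighting are the ones given above.

```python
def level_score(baseline, actions, cap=115.0):
    """Squared baseline/actions ratio times 100, capped; uncleared scores 0."""
    if actions is None:
        return 0.0
    return min((baseline / actions) ** 2 * 100.0, cap)


def rhae(baselines, actions):
    """Level-number-weighted mean of per-level scores (weights 1..N)."""
    weighted = sum(
        level * level_score(b, a)
        for level, (b, a) in enumerate(zip(baselines, actions), start=1)
    )
    return weighted / sum(range(1, len(baselines) + 1))


human_baselines = [15, 7, 15, 16, 21, 17]
win2_s31_actions = [14, 15, 24, 32, None, None]  # levels 5 and 6 uncleared

scores = [round(level_score(b, a), 1)
          for b, a in zip(human_baselines, win2_s31_actions)]
result = round(rhae(human_baselines, win2_s31_actions), 1)
print(scores, result)  # [114.8, 21.8, 39.1, 25.0, 0.0, 0.0] 17.9
```

Because the denominator always includes the weights of uncleared levels (5 + 6 of the 21 here), more than half the achievable weight is forfeited before efficiency on levels 1 to 4 is even considered.
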

the honest fork-vs-API caveat
Sleep tests candidate recipes on engine forks — free, invisible lookahead that the official ARC-AGI-3 HTTP API does not offer. Wake actions and applied recipe clicks are counted in RHAE; fork probes are not. A human (or an API-only agent) pays for every experiment with real actions, so this RHAE is not directly comparable to API-scored agents. It is a within-lab efficiency ledger, not a leaderboard number.
05

Limitations — what this does not show yet

read before believing

□ weak-wake confound

The wake actor is a weak model that rarely progresses on its own, so “sleep cleared the level” is confounded with “wake couldn’t have.” 3 of win2_s31’s 4 level-ups happened inside sleep (the 4th in wake, round 34) — strong evidence the mechanism works, weak evidence it would still matter over a frontier wake model that might just solve the game directly.

□ single game so far

Everything on this page is ft09 only. ft09’s mechanic (a key region dictating block colors, repeated across levels) is unusually friendly to rule-mining and analogy transfer. No claim survives contact with ls20 or vc33 until the same loop is run there.

□ config evolution across seeds

The 21 runs span 11 config variants; each fix changed the config between seed groups. The before/after pairs in section 02 are chronological snapshots, not matched-seed A/Bs. The one clean contrast is v5e’s noop-guard failure vs the v5f family — same mechanism, one guard flag apart.

✓ what does hold

The gate is LLM-free (fork level-count only), and 26 of 27 applied recipes across all 21 runs raised the live level (the exception: one v5c_s23 apply, levels 2→2). Grids in the viewer are exact engine replays with zero validated mismatches, and the level-4 run is fully reproducible from data/raw/win2_s31.jsonl.