research build log · ARC-AGI-3 · ft09 (key-pattern click puzzle)
Our code-native agent spent weeks finishing every ft09 run at 0 levels cleared.
The model’s reasoning was often right; a harness coordinate bug was silently mangling the
ACTION6(x,y) clicks it asked for. This page is the log of what happened once the
plumbing was fixed and an execution-grounded sleep loop was bolted on: eight autoresearch
fixes later, run win2_s31 clears level 4 with a
Relative Human-Action Efficiency of 17.9. Every run below is replayable from raw traces.
Levels cleared per major run, in chronological order. The first two columns are the wall; every step after them is one of the eight fixes in section 02. What pinned us at 0 for weeks was plumbing, not reasoning: a single coordinate bug that swallowed the clicks.
The mechanism that finally moved levels is diagram (a): wake plays until stuck, sleep mines
rival hypotheses, an engine fork executes each one, and only what raises the fork’s real
levels_completed gets applied to the live game. Diagrams (b)–(f) are the five
systems we strip-mined for it — each reduced to its single load-bearing insight and where it landed in our design.
Stuck is a signal: mine rival hypotheses, let the game itself judge them on a fork, apply only what survives.
This is the loop: every idea in (b)–(f) below was absorbed into one of its parts.
“Hypotheses live as a distribution; experiments are chosen to discriminate between them.”
Absorbed into our design as: each sleep pass proposes up to 4 rival rules and fork-tests them against each other; the verdict is confirm/refute over the set, never a single best guess.
“The skill document IS the parameter — optimization edits text, and a validation gate decides what ships.”
Absorbed into our design as: confirmed rules are plain-text artifacts, committed only after the execution gate (fork · judge) passes them.
“Loss becomes a textual gradient; momentum accumulates gradients before committing one patch.”
Absorbed into our design as: refuted rules + hindsight credit accumulate as revision seeds across sleep passes, instead of each pass overwriting the last.
“A posterior on every piece of knowledge — and quarantine, not deletion, for what decays.”
Absorbed into our design as: a confirmed/refuted status on every mined rule; refuted rules stay in the ledger so the miner cannot re-propose them.
“Make the world model runnable — predict, diff against reality, refactor.”
A world model should be code that runs, not a document that is read. Absorbed into our design as: the engine fork is our runnable world model; recipes are validated by execution, and unexplained OBSERVED transitions become revision seeds (fix #8).
Every 4 wake rounds the harness runs a sleep pass: a miner prompt reads the trace + notes
+ the per-level ledger (observed transitions, refuted rules, confirmed mechanics from earlier levels) and
proposes up to 4 rival {rule, recipe} candidates. execute_and_score
runs each recipe on a throwaway game.fork() and confirms it iff
fork.levels_completed rises — no LLM and no privileged oracle in the gate.
Confirmed recipes are applied to the live game and injected into the actor’s context; each pass logs
realized information gain (entropy drop over the rule set) and a [confirmed]/[refuted]
verdict per rule. See the amber cards in the run viewer for real passes.
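The gate described above can be sketched in a few lines. This is a minimal illustration, not the harness code: `ToyGame` is a hypothetical stand-in for the engine (only `fork()`, a step-like call, and `levels_completed` are assumed from the text), the candidate dict layout is invented for the example, and applying only the first confirmed recipe is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical stand-in for the engine; a level clears when the last
    clicks match `target`. Not the real ft09 mechanic."""
    target: tuple = ((1, 2), (3, 4))
    clicks: list = field(default_factory=list)
    levels_completed: int = 0

    def fork(self):
        # Throwaway copy: nothing done on the fork touches the live game.
        return copy.deepcopy(self)

    def step(self, action):
        self.clicks.append(action)
        if tuple(self.clicks[-len(self.target):]) == self.target:
            self.levels_completed += 1
            self.clicks.clear()


def execute_and_score(game, candidates):
    """Run each candidate recipe on its own fork; confirm iff the fork's
    levels_completed rises. No LLM is consulted anywhere in this function."""
    verdicts = []
    for cand in candidates:
        fork = game.fork()
        before = fork.levels_completed
        for action in cand["recipe"]:
            fork.step(action)
        status = "confirmed" if fork.levels_completed > before else "refuted"
        verdicts.append({**cand, "status": status})
    return verdicts


def sleep_pass(game, candidates):
    """Score all candidates, then apply the first confirmed recipe to the
    live game (assumption: one apply per pass)."""
    verdicts = execute_and_score(game, candidates)
    for v in verdicts:
        if v["status"] == "confirmed":
            for action in v["recipe"]:
                game.step(action)
            break
    return verdicts
```

The point of the design is that a rule's wording never enters the verdict: two candidates with very different explanations get the same treatment, and only the fork's level count decides.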
Each fix below came out of the autoresearch loop: a run failed in a specific way, the failure was diagnosed from the trace, one change shipped, and the next run showed whether it mattered. The mini-charts are level-over-rounds trajectories of real harvested runs (before-run vs after-run). Configs evolved between seeds, so read these as before/after snapshots, not controlled A/Bs (see limitations). Eight failures, eight diagnoses, eight fixes.
Problem: the harness re-mapped ACTION6(x,y) before submitting —
every click the model reasoned about landed on the wrong cell. Weeks of 0-level runs were plumbing, not reasoning.
Fix: pass grid coordinates through to the engine untouched.
Problem: confirmed rules were only shown to the wake actor as context (“exposure”), and the weak wake model ignored them. Fix: sleep executes its own confirmed recipe directly on the live game. The first level ever cleared was cleared during sleep.
Problem: the miner free-associated theories disconnected from what the game had actually done. Fix: the miner must condition on a per-level ledger of observed transitions and refuted rules; each pass is scored by realized information gain. (Levels didn’t jump immediately — this fix made passes measurable, which the later fixes needed.)
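One way to read "realized information gain (entropy drop over the rule set)" is as Shannon entropy before a pass minus entropy after refuted rules are zeroed out. The sketch below assumes exactly that; the weighting scheme, function names, and dict layout are illustrative guesses, not the harness's actual scoring.

```python
import math


def entropy_bits(weights):
    """Shannon entropy (bits) of a distribution given as unnormalized weights."""
    weights = list(weights)
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def realized_information_gain(prior, verdicts):
    """prior: {rule: weight}; verdicts: {rule: 'confirmed' | 'refuted'}.
    Assumption: a refuted rule drops to weight 0, everything else keeps
    its prior weight."""
    before = entropy_bits(prior.values())
    posterior = {
        rule: 0.0 if verdicts.get(rule) == "refuted" else weight
        for rule, weight in prior.items()
    }
    return before - entropy_bits(posterior.values())
```

Under this reading, four equally weighted rivals carry 2 bits; a pass that refutes two of them realizes 1 bit, and a pass that refutes nothing realizes 0, which is what makes an empty pass visible in the log.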
Problem: hypotheses were pixel-local (“cell (12,34) turns red”), but ft09’s mechanic is a relation: a small key region dictates colors of other regions. Fix: the miner prompt explicitly asks for relational/constraint hypotheses across regions.
Problem: the wake actor occasionally issued RESET, which
restarts the whole game and erases cleared levels. Fix: RESET is blocked unless the game
is actually GAME_OVER.
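A guard like this can be a one-line filter in front of the engine. The sketch below is an assumption about shape only: `guard_action` is a hypothetical name, actions and states are represented as plain strings, and `"NOT_FINISHED"` in the usage note is a placeholder for any state other than GAME_OVER.

```python
def guard_action(action, game_state):
    """Return the action to submit, or None if it must be dropped.
    RESET passes through only when the game is actually GAME_OVER."""
    if action == "RESET" and game_state != "GAME_OVER":
        return None  # blocked: a RESET here would erase cleared levels
    return action
```

For example, `guard_action("RESET", "NOT_FINISHED")` is dropped, while `guard_action("RESET", "GAME_OVER")` and any non-RESET action pass through unchanged.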
Problem: every new level started from zero knowledge, even though ft09 levels repeat the key mechanic with a twist. Fix: confirmed mechanics from cleared levels are injected into the miner as analogy seeds (“last level the key meant X — what does it mean here?”).
Problem: the v5e collapse — two empty sleep passes at level 0 tripped the noop-guard and silently disabled sleep for the entire 36-round episode; stale strategies from earlier levels were also being re-applied. Fix: the noop-disable resets on level clear (a new level is a new hypothesis space), only this pass’s confirmed strategies are applied, and max_noops 2→4.
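The guard change can be pictured as a small counter with a reset hook. This is a sketch under assumptions: the class and method names are hypothetical, and counting empty passes per level (rather than strictly consecutive ones) is a guess at the original behaviour. Only `max_noops` and the 2→4 change come from the log.

```python
class NoopGuard:
    """Disables sleep after too many empty passes, but only within one level."""

    def __init__(self, max_noops=4):
        self.max_noops = max_noops
        self.noops = 0

    def record_pass(self, n_confirmed):
        # An empty pass (nothing confirmed) counts against the budget.
        if n_confirmed == 0:
            self.noops += 1

    def on_level_clear(self):
        # A new level is a new hypothesis space, so the budget starts over.
        self.noops = 0

    @property
    def sleep_enabled(self):
        return self.noops < self.max_noops
```

With the old budget of 2 and no reset hook, two empty passes at level 0 are enough to switch sleep off for the rest of the episode, which is the v5e failure described above.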
Problem: transitions the game exhibited that no hypothesis predicted (OBSERVED events) were logged and then dropped. Fix: feed OBSERVED click transitions into the next pass’s revision seeds — the missing residual. The next campaign produced the level-4 run.
Every card below is one wake round: the board (an exact engine replay of the trace, not a screenshot and not a mock), the model’s reasoning and the code it wrote, and level/action badges. Amber cards are sleep passes with their mined rules (confirmed/refuted), OBSERVED transitions, applied recipes, and realized IG. Where a grid is missing, that’s honesty, not laziness: once sleep applies clicks outside the trace, later rounds can’t be replayed exactly, so those rounds are shown blank.
Grids are produced by replaying each trace’s actions in ttso.core.game Game('ft09'),
validating both levels_completed and state per round against
the trace — zero mismatches in every replayed round.
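The per-round check amounts to replaying actions and comparing two fields. The sketch below shows the idea only: `CountingGame` is a toy stand-in (the real `Game('ft09')` API is not shown in this log), and the trace field names `round`, `actions`, `levels_completed`, and `state` are assumed for illustration.

```python
class CountingGame:
    """Toy stand-in engine: every third action clears a level."""

    def __init__(self):
        self.n_actions = 0
        self.levels_completed = 0
        self.state = "NOT_FINISHED"  # placeholder state name

    def step(self, action):
        self.n_actions += 1
        if self.n_actions % 3 == 0:
            self.levels_completed += 1


def replay_and_validate(make_game, trace_rounds):
    """Replay each round's actions on a fresh game and return the round
    numbers whose (levels_completed, state) disagree with the trace."""
    game = make_game()
    mismatches = []
    for rnd in trace_rounds:
        for action in rnd["actions"]:
            game.step(action)
        replayed = (game.levels_completed, game.state)
        recorded = (rnd["levels_completed"], rnd["state"])
        if replayed != recorded:
            mismatches.append(rnd["round"])
    return mismatches
```

An empty return value corresponds to the "zero mismatches" claim; any round in the list is one whose grid should not be shown as an exact replay.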
g1_neutral: FULL (12/12 rounds exact).
v5e_s25: FULL (36/36 rounds exact; no sleep-applies, so the whole noop-guard failure is replayable).
g2_on_s2: exact up to round 8; the first sleep-apply lands clicks outside the trace, so rounds 9–12 show a placeholder with the apply event (“levels 0→1 during sleep”).
v5f_s28: exact up to round 8; applies after rounds 8/12/16.
win2_s31: exact up to round 4; transitions 0→1, 1→2, 2→3 happened in-sleep, 3→4 in wake (round 34), all annotated as events.

Sleep-pass positions (after_round) are inferred from the fixed every-4-rounds cadence and
validated against levels_before/levels_after for all 21 runs (strict on levels_before; soft on
levels_after because a wake-round GAME_OVER reset can drop the level right after a sleep, as in
v5_full_s20). Data: data/results.json (21 runs), raw traces in
data/raw/, rebuild script data/build_results.py.
Clearing levels says whether; RHAE (Relative Human-Action Efficiency) says how wastefully. Per level, compare the agent’s action count to a human baseline; square the ratio so waste hurts quadratically; cap at 115 so a lucky short level can’t dominate; weight later levels more; uncleared levels score 0.
| level ℓ | human baseline (actions) | win2_s31 (actions) | score_ℓ | weight (= ℓ) |
|---|---|---|---|---|
| 1 | 15 | 14 | 114.8 | 1 |
| 2 | 7 | 15 | 21.8 | 2 |
| 3 | 15 | 24 | 39.1 | 3 |
| 4 | 16 | 32 | 25.0 | 4 |
| 5 | 21 | — (uncleared) | 0 | 5 |
| 6 | 17 | — (uncleared) | 0 | 6 |
| RHAE | | | (114.8 + 2·21.8 + 3·39.1 + 4·25.0) / 21 = 17.9 | Σ = 21 |
Reading: level 1 slightly beats the human baseline (14 actions vs 15); level 2 is the ugliest (15 actions vs a human’s 7, because the agent re-derives the key instead of recognizing the repeat); and everything from level 5 up is simply unreached. 17.9 is a low score by construction: the metric is designed to come out low, and publishing it as-is is the point.
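The table can be reproduced with a short calculation. This sketch assumes the per-level score is 100 × (human actions / agent actions)², capped at 115, which is what the table's numbers imply (for example (15/14)² ≈ 1.148 → 114.8); the function names are illustrative.

```python
def level_score(human_actions, agent_actions, cap=115.0):
    """Per-level score: squared human/agent action ratio on a 0-100 scale,
    capped at `cap`. An uncleared level (agent_actions is None) scores 0."""
    if agent_actions is None:
        return 0.0
    return min(cap, 100.0 * (human_actions / agent_actions) ** 2)


def rhae(levels):
    """levels: list of (human_baseline, agent_actions or None), level 1 first.
    Level l carries weight l; the result is the weighted mean of level scores."""
    weighted = sum(
        weight * level_score(human, agent)
        for weight, (human, agent) in enumerate(levels, start=1)
    )
    total_weight = sum(range(1, len(levels) + 1))
    return weighted / total_weight


# Action counts from the table above.
win2_s31 = [(15, 14), (7, 15), (15, 24), (16, 32), (21, None), (17, None)]
print(round(rhae(win2_s31), 1))  # 17.9
```

Because the two unreached levels carry weights 5 and 6 out of a total of 21, more than half of the available score is zeroed before efficiency on the cleared levels is even considered.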
The wake actor is a weak model that rarely progresses on its own, so “sleep cleared the level” is confounded with “wake couldn’t have.” Three of win2_s31’s 4 level-ups happened inside sleep (the 4th in wake, round 34). That is strong evidence the mechanism works, but weak evidence it would still matter over a frontier wake model that might just solve the game directly.
Everything on this page is ft09 only. ft09’s mechanic (a key region dictating block colors, repeated across levels) is unusually friendly to rule-mining and analogy transfer. No claim survives contact with ls20 or vc33 until the same loop is run there.
The 21 runs span 11 config variants; each fix changed the config between seed groups. The before/after pairs in section 02 are chronological snapshots, not matched-seed A/Bs. The one clean contrast is v5e’s noop-guard failure vs the v5f family — same mechanism, one guard flag apart.
The gate is LLM-free (fork level-count only), and 26 of 27 applied recipes across all 21 runs raised the live
level (the exception: one v5c_s23 apply, levels 2→2). Grids in the viewer are exact engine replays with zero
validated mismatches, and the level-4 run is fully reproducible from data/raw/win2_s31.jsonl.