research build log · ARC-AGI-3 · ft09 (key-pattern click puzzle) · week of 2026‑07‑03
Same model, same wake: with the sleep harness attached, 4 levels; without it, 0.
[Animation: v5f_s28, rounds 1→8, fork.levels_completed rising. A replay of real engine frames with that pass’s actual rule texts; the confirmed rule’s apply escaped level 0 during sleep.]
[Chart: levels cleared per major run, chronological; each step after the wall is one of the eight fixes.]
Our code-native agent spent weeks finishing every ft09 run at 0 levels cleared.
The model’s reasoning was often right; a harness coordinate bug was silently mangling the
ACTION6(x,y) clicks it asked for. This page is the log of what happened once the
plumbing was fixed and an execution-grounded sleep loop was bolted on: eight autoresearch
fixes later, run win2_s31 clears level 4 with a
Relative Human-Action Efficiency of 17.9. All 21 harvested runs are replayable from raw
traces, grid-replay validated.
What pinned us at 0 for weeks was plumbing, not reasoning: a single coordinate bug that swallowed the clicks.
Wake plays until stuck; sleep mines ≤4 rival rules; an engine fork executes each
recipe; only what raises the fork’s real levels_completed touches the live game.
Getting stuck is a signal: mine hypotheses, let the game itself be the judge, and apply only what survives.
gate = LLM-free (fork level count only) · 26/27 applies raised the live level · sleep every 4 rounds · ≤4 rival rules / pass · confirmed rules injected + auto-applied
Every 4 wake rounds the harness runs a sleep pass: a miner prompt reads the trace + notes
+ the per-level ledger (observed transitions, refuted rules, confirmed mechanics from earlier levels) and
proposes up to 4 rival {rule, recipe} candidates. execute_and_score
runs each recipe on a throwaway game.fork() and confirms it iff
fork.levels_completed rises — no LLM and no privileged oracle in the gate.
Confirmed recipes are applied to the live game and injected into the actor’s context; each pass logs
realized information gain (entropy drop over the rule set) and a [confirmed]/[refuted]
verdict per rule. Step through real passes in the trace explorer.
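The gate described above can be sketched in a few lines. This is a minimal illustration, not our harness code: `ToyGame`, `Candidate`, and `sleep_pass` are hypothetical stand-ins, the signature of `execute_and_score` is assumed, and applying only the first confirmed recipe is a simplification made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Action = Tuple[int, int]  # an ACTION6-style (x, y) click


@dataclass
class ToyGame:
    """Hypothetical stand-in for the engine: clicking `secret` clears a level."""
    secret: Action = (3, 5)
    levels_completed: int = 0

    def step(self, action: Action) -> None:
        if action == self.secret:
            self.levels_completed += 1

    def fork(self) -> "ToyGame":
        # Throwaway copy; nothing done to it touches the live game.
        return ToyGame(self.secret, self.levels_completed)


@dataclass
class Candidate:
    rule: str             # plain-text hypothesis from the miner
    recipe: List[Action]  # clicks that should clear the level if the rule holds


def execute_and_score(game: ToyGame, cand: Candidate) -> bool:
    """Confirm iff the fork's levels_completed rises. No LLM in the gate."""
    fork = game.fork()
    before = fork.levels_completed
    for action in cand.recipe:
        fork.step(action)
    return fork.levels_completed > before


def sleep_pass(game: ToyGame, candidates: List[Candidate], max_rivals: int = 4):
    """Fork-test up to `max_rivals` rivals, then apply a confirmed recipe live."""
    rivals = candidates[:max_rivals]
    verdicts = [
        (c.rule, "confirmed" if execute_and_score(game, c) else "refuted")
        for c in rivals
    ]
    for cand, (_, verdict) in zip(rivals, verdicts):
        if verdict == "confirmed":
            for action in cand.recipe:
                game.step(action)
            break  # assumption for this sketch: apply only the first confirmed recipe
    return verdicts
```

All rivals are scored on forks before anything is applied, so a refuted recipe never costs a live action.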
Each card is a system we strip-mined for the loop: its single load-bearing insight, and where it landed in our design. Six systems, one load-bearing insight mined from each and absorbed into the loop.
card = hypothesis / record · ✓ = confirmed · ✗ = refuted · door = validation gate
“Hypotheses live as a distribution; experiments are chosen to discriminate between them.”
Hypotheses stay alive as a distribution, and experiments split that distribution.
Each sleep pass proposes up to 4 rival rules and fork-tests them against each other: confirm/refute over the set, never a single best guess.
“The skill document IS the parameter — optimization edits text, and a validation gate decides what ships.”
The skill document itself is the parameter: learning edits text, and the gate decides what gets deployed.
Confirmed rules are plain-text artifacts, committed only after the execution gate (fork · judge) passes them.
“Loss becomes a textual gradient; momentum accumulates gradients before committing one patch.”
Failure becomes a textual gradient and, like momentum, accumulates into a single patch.
Refuted rules + hindsight credit accumulate as revision seeds across sleep passes, instead of each pass overwriting the last.
“A posterior on every piece of knowledge — and quarantine, not deletion, for what decays.”
Attach a posterior to every piece of knowledge; when in doubt, quarantine rather than delete.
A confirmed/refuted status on every mined rule; refuted rules stay in the ledger so the miner cannot re-propose them.
“Make the world model runnable — predict, diff against reality, refactor.”
A world model should be code that runs, not a document that is read.
The engine fork is our runnable world model; recipes are validated by execution, and unexplained OBSERVED transitions become revision seeds (fix #8).
“Memory management itself is a trainable skill (metamemory) — writes and reads are first-class actions, and an outer loop retrains that faculty.”
Memory management itself is a learnable skill: metamemory.
Orthogonal to ours: AutoMem trains the memory faculty across episodes; we verify the memory contents at test time (and our fork-verdicts could supply its training signal). arXiv 2607.01224
Eight failures → eight diagnoses → eight fixes; each mini-chart is a real run’s level-over-rounds trajectory, before-run vs after-run.
snapshots, not controlled A/Bs · configs evolved between seeds (see limits)
Problem: the harness re-mapped ACTION6(x,y) before submitting —
every click the model reasoned about landed on the wrong cell. Weeks of 0-level runs were plumbing, not reasoning.
Fix: pass grid coordinates through to the engine untouched.
Problem: confirmed rules were only shown to the wake actor as context (“exposure”), and the weak wake model ignored them. Fix: sleep executes its own confirmed recipe directly on the live game — the first level ever cleared happened during sleep.
Problem: the miner free-associated theories disconnected from what the game had actually done. Fix: the miner must condition on a per-level ledger of observed transitions and refuted rules; each pass is scored by realized information gain. (Levels didn’t jump immediately — this fix made passes measurable, which the later fixes needed.)
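One way to read “realized information gain” is the drop in Shannon entropy over the rival rule set once the fork verdicts are in. The sketch below assumes a weight per rule and that a refuted rule loses all of its mass; both the weighting scheme and the function names are assumptions made for illustration, not the logged formula.

```python
import math
from typing import Dict, Iterable


def entropy_bits(weights: Iterable[float]) -> float:
    """Shannon entropy (bits) of the distribution obtained by normalizing `weights`."""
    ws = [w for w in weights if w > 0]
    total = sum(ws)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in ws)


def realized_information_gain(prior: Dict[str, float], verdicts: Dict[str, str]) -> float:
    """Entropy before the pass minus entropy after refuted rules are zeroed out."""
    posterior = {
        rule: (0.0 if verdicts.get(rule) == "refuted" else weight)
        for rule, weight in prior.items()
    }
    return entropy_bits(prior.values()) - entropy_bits(posterior.values())
```

With four equally weighted rivals and three refuted, the pass removes 2 bits of uncertainty; a pass that refutes nothing scores 0.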
Problem: hypotheses were pixel-local (“cell (12,34) turns red”), but ft09’s mechanic is a relation: a small key region dictates colors of other regions. Fix: the miner prompt explicitly asks for relational/constraint hypotheses across regions.
Problem: the wake actor occasionally issued RESET, which
restarts the whole game and erases cleared levels. Fix: RESET is blocked unless the game
is actually GAME_OVER.
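The RESET block is a one-line predicate. The sketch below assumes actions and game states are passed around as plain strings; `is_action_allowed` is a hypothetical name, and what the harness does with a blocked action is not shown here.

```python
def is_action_allowed(action: str, game_state: str) -> bool:
    """Allow RESET only when the game is actually over; every other action passes."""
    if action == "RESET":
        return game_state == "GAME_OVER"
    return True
```

The effect is that a stray RESET mid-level can no longer wipe out cleared levels.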
Problem: every new level started from zero knowledge, even though ft09 levels repeat the key mechanic with a twist. Fix: confirmed mechanics from cleared levels are injected into the miner as analogy seeds (“last level the key meant X — what does it mean here?”).
Problem: the v5e collapse — two empty sleep passes at level 0 tripped the noop-guard and silently disabled sleep for the entire 36-round episode; stale strategies from earlier levels were also being re-applied. Fix: the noop-disable resets on level clear (a new level is a new hypothesis space), only this pass’s confirmed strategies are applied, and max_noops 2→4.
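The repaired guard can be sketched as a counter that resets whenever the level rises. `NoopGuard` and its method names are hypothetical, and treating an “empty” pass as one with no confirmed rule is an assumption; only `max_noops` and its 2→4 change come from the log.

```python
from dataclasses import dataclass


@dataclass
class NoopGuard:
    max_noops: int = 4   # was 2 before the fix
    noops: int = 0       # consecutive empty sleep passes on the current level
    level_seen: int = 0  # highest level observed so far

    def observe_level(self, level: int) -> None:
        # A new level is a new hypothesis space, so the noop count starts over.
        if level > self.level_seen:
            self.level_seen = level
            self.noops = 0

    def record_empty_pass(self) -> None:
        self.noops += 1

    @property
    def sleep_enabled(self) -> bool:
        return self.noops < self.max_noops
```

Setting `max_noops=2` and never calling `observe_level` reproduces the v5e failure shape: two empty passes and sleep stays off for the rest of the episode.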
Problem: transitions the game exhibited that no hypothesis predicted (OBSERVED events) were logged and then dropped. Fix: feed OBSERVED click transitions into the next pass’s revision seeds — the missing residual. The next campaign produced the level-4 run.
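The residual in fix #8 is a set difference: whatever the game did that no hypothesis predicted. The sketch below assumes a transition is a tuple `(x, y, color_before, color_after)`; that encoding and the function name are illustrative assumptions, not the trace schema.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[int, int, int, int]  # (x, y, color_before, color_after), hypothetical


def unexplained_transitions(
    observed: Iterable[Transition], predicted: Set[Transition]
) -> List[Transition]:
    """Return observed transitions that no hypothesis predicted, in first-seen order."""
    seen: Set[Transition] = set()
    residual: List[Transition] = []
    for t in observed:
        if t not in predicted and t not in seen:
            seen.add(t)
            residual.append(t)
    return residual
```

The returned list is what would be handed to the next pass as revision seeds instead of being logged and dropped.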
Every grid behind these cards is an exact engine replay; open a card in the trace explorer and step it with ←/→. The grids are real engine replays, not screenshots.
Grids are produced by replaying each trace’s actions in ttso.core.game Game('ft09'),
validating both levels_completed and state per round against
the trace — zero mismatches in every replayed round.
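The per-round check can be sketched as follows. `ToyEngine` stands in for the real game class, and the trace row schema (`round`, `actions`, `levels_completed`, `state`) and the state string `NOT_FINISHED` are assumptions made for the example; the real layout lives in the raw traces.

```python
from typing import Callable, Dict, List


class ToyEngine:
    """Hypothetical stand-in for the engine, just enough to replay a trace."""

    def __init__(self) -> None:
        self.levels_completed = 0
        self.state = "NOT_FINISHED"  # placeholder state name

    def step(self, action: str) -> None:
        if action == "win":
            self.levels_completed += 1
        elif action == "die":
            self.state = "GAME_OVER"


def validate_replay(make_game: Callable[[], ToyEngine], trace: List[Dict]) -> List[int]:
    """Replay each round's actions and return the rounds whose outcome differs from the trace."""
    game = make_game()
    mismatches: List[int] = []
    for row in trace:
        for action in row["actions"]:
            game.step(action)
        replayed = (game.levels_completed, game.state)
        recorded = (row["levels_completed"], row["state"])
        if replayed != recorded:
            mismatches.append(row["round"])
    return mismatches
```

An empty return value corresponds to “zero mismatches”; any round number in the list marks where the replay and the trace diverge.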
g1_neutral: FULL (12/12 rounds exact).
v5e_s25: FULL (36/36 rounds exact; no sleep-applies, so the whole noop-guard failure is replayable).
g2_on_s2: exact up to round 8; the first sleep-apply lands clicks outside the trace, so rounds 9–12 show a placeholder with the apply event (“levels 0→1 during sleep”).
v5f_s28: exact up to round 8; applies after rounds 8/12/16.
win2_s31: exact up to round 4; transitions 0→1, 1→2, 2→3 happened in-sleep, 3→4 in wake (round 34), all annotated as events.
Sleep-pass positions (after_round) are inferred from the fixed every-4-rounds cadence and
validated against levels_before/levels_after for all 21 runs (strict on levels_before; soft on
levels_after because a wake-round GAME_OVER reset can drop the level right after a sleep, as in
v5_full_s20). Data: data/results.json (21 runs), raw traces in
data/raw/, rebuild script data/build_results.py.
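A rough sketch of the strict/soft position check, under stated assumptions: pass k is placed after round 4·k, `levels_at_round_end` maps a round number to the level count at the end of that round, and the soft check compares `levels_after` with the following round. The function name and both data shapes are hypothetical; `data/build_results.py` is the authoritative version.

```python
from typing import Dict, List, Tuple


def check_pass_positions(
    passes: List[Dict[str, int]],
    levels_at_round_end: Dict[int, int],
    cadence: int = 4,
) -> Tuple[List[int], List[int]]:
    """Return (errors, warnings) as lists of inferred after_round values."""
    errors: List[int] = []
    warnings: List[int] = []
    for k, sleep in enumerate(passes, start=1):
        after_round = cadence * k  # inferred from the fixed cadence
        # Strict: the level going into sleep must match the trace exactly.
        if levels_at_round_end.get(after_round) != sleep["levels_before"]:
            errors.append(after_round)
        # Soft: a GAME_OVER reset in the next wake round may legitimately lower the level.
        nxt = levels_at_round_end.get(after_round + 1)
        if nxt is not None and nxt != sleep["levels_after"]:
            warnings.append(after_round)
    return errors, warnings
```

Only `errors` would invalidate an inferred position; `warnings` flag cases like v5_full_s20 for a manual look.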
Clearing levels says whether; RHAE says how wastefully. Level 1 is human-efficient (score 114.8); the overall 17.9 is low by construction because unreached levels score 0. 17.9 is a low score: we publish the metric as-is, even though it is designed to come out low.
L1: 14 actions vs human 15 → 114.8 · L2 ugliest: 15 vs 7 → 21.8 · L5–6 unreached → 0 · not leaderboard-comparable (fork caveat below)
| level ℓ | human baseline (actions) | win2_s31 (actions) | scoreℓ | weight ℓ |
|---|---|---|---|---|
| 1 | 15 | 14 | 114.8 | 1 |
| 2 | 7 | 15 | 21.8 | 2 |
| 3 | 15 | 24 | 39.1 | 3 |
| 4 | 16 | 32 | 25.0 | 4 |
| 5 | 21 | — (uncleared) | 0 | 5 |
| 6 | 17 | — (uncleared) | 0 | 6 |
| RHAE | | | (114.8 + 2·21.8 + 3·39.1 + 4·25.0) / 21 = 17.9 | Σ = 21 |
Reading: level 1 is essentially human-efficient (14 actions vs the 15-action baseline); level 2 is
the ugliest (15 actions vs a human’s 7 — the agent re-derives the key instead of recognizing the repeat);
and everything from level 5 up is simply unreached. Per level:
min((baseline/actions)²×100, 115), level-number-weighted, uncleared ⇒ 0.
17.9 is a low score by construction — that is the point of publishing it.
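The table can be recomputed directly from the per-level rule stated above. The function and variable names below are ours for illustration; the action counts and baselines are the ones in the table.

```python
from typing import List, Optional, Tuple

# (level, human baseline actions, agent actions or None if uncleared), from the table
ROWS: List[Tuple[int, int, Optional[int]]] = [
    (1, 15, 14),
    (2, 7, 15),
    (3, 15, 24),
    (4, 16, 32),
    (5, 21, None),
    (6, 17, None),
]


def level_score(baseline: int, actions: Optional[int], cap: float = 115.0) -> float:
    """min((baseline/actions)^2 * 100, cap); an uncleared level scores 0."""
    if actions is None:
        return 0.0
    return min((baseline / actions) ** 2 * 100.0, cap)


def rhae(rows: List[Tuple[int, int, Optional[int]]]) -> float:
    """Level-number-weighted mean of the per-level scores."""
    total_weight = sum(level for level, _, _ in rows)
    weighted = sum(level * level_score(b, a) for level, b, a in rows)
    return weighted / total_weight


for level, baseline, actions in ROWS:
    print(level, round(level_score(baseline, actions), 1))
print("RHAE", round(rhae(ROWS), 1))
```

Because the weights grow with the level number, the two unreached levels carry 11 of the 21 weight units, which is why four cleared levels still land below 18.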
Three open confounds, one thing that does hold; the honest version of each follows.
“Sleep cleared it” is confounded with “wake couldn’t have.”
The wake actor is a weak model that rarely progresses on its own, so “sleep cleared the level” is confounded with “wake couldn’t have.” 3 of win2_s31’s 4 level-ups happened inside sleep (the 4th in wake, round 34) — strong evidence the mechanism works, weak evidence it would still matter over a frontier wake model that might just solve the game directly.
Everything here is ft09 only — a rule-mining-friendly game.
ft09’s mechanic (a key region dictating block colors, repeated across levels) is unusually friendly to rule-mining and analogy transfer. No claim survives contact with ls20 or vc33 until the same loop is run there.
Before/after pairs are chronological snapshots, not matched-seed A/Bs.
The 21 runs span 11 config variants; each fix changed the config between seed groups. The before/after pairs in the fix timeline are chronological snapshots, not matched-seed A/Bs. The one clean contrast is v5e’s noop-guard failure vs the v5f family — same mechanism, one guard flag apart.
LLM-free gate; 26/27 applies raised the live level; replays exact.
The gate is LLM-free (fork level-count only), and 26 of 27 applied recipes across all 21 runs raised the live
level (the exception: one v5c_s23 apply, levels 2→2). Grids in the explorer are exact engine replays with zero
validated mismatches, and the level-4 run is fully reproducible from data/raw/win2_s31.jsonl.