week of 2026‑07‑03 · v2 (latest) · prev: 2026‑06‑30 (v1)

research build log · ARC-AGI-3 · ft09 (key-pattern click puzzle) · week of 2026‑07‑03

A frozen gpt-5.5 with our test-time sleep harness climbs 0 → 4 levels; the same wake without it stays at 0.

Same model, same wake: with the sleep harness attached it clears 4 levels; without it, 0.

0 → 4 levels · best run win2_s31
27 vs 0 sleep-escapes · ON (18 runs) vs OFF
17.9 RHAE · level-1 score 114.8 ≈ human
[hero animation · run v5f_s28] wake rounds r1, r3, r5, r7 and r8 all sit at L0; after r8 the actor is stuck at L0 → sleep. LEVEL 0 → 1 during sleep.
☽ sleep after round 8: mine rivals · fork-test · apply
Rival rules mined in this pass:
ACTION1 is the submit/check action and the current grid is already in a completed state.
Clicking the leftmost color-12 cell in the bottom status strip is the level-exit or submit control.
ACTION2, not ACTION1, is the submit/check action — first swap the southwest and southeast surrounding blocks implied by the key.
fork · execute · judge: confirm iff fork.levels_completed rises
✓ confirmed rule → applied to the live game
Real replay, not a mock: exact engine frames from run v5f_s28 (rounds 1→8) with that pass’s actual rule texts; the confirmed rule’s apply escaped level 0 during sleep.
  • g1 baseline (bug era): 0
  • g2 early-ON (bug era): 0
  • v3 · s2: 1
  • v5 · s20: 1
  • v5c · s23: 2
  • v5d · s24: 3
  • v5f · s28: 3
  • win2 · s31: 4

Levels cleared per major run, chronological — each step after the wall is one of the eight fixes.

the back-story in one paragraph — weeks at 0 levels was a harness bug, not reasoning

Our code-native agent spent weeks finishing every ft09 run at 0 levels cleared. The model’s reasoning was often right; a harness coordinate bug was silently mangling the ACTION6(x,y) clicks it asked for. This page is the log of what happened once the plumbing was fixed and an execution-grounded sleep loop was bolted on: eight autoresearch fixes later, run win2_s31 clears level 4 with a Relative Human-Action Efficiency of 17.9. All 21 harvested runs are replayable from raw traces, grid-replay validated. What pinned us at 0 for weeks was plumbing, not reasoning: a single coordinate bug that swallowed the clicks.

01

How it works — wake → stuck → sleep → fork-test → apply

the hero animation above is this loop, on real frames

Wake plays until stuck; sleep mines ≤4 rival rules; an engine fork executes each recipe; only what raises the fork’s real levels_completed touches the live game. Being stuck is a signal: mine hypotheses, let the game itself be the judge, and apply only what survives.

gate = LLM-free (fork level count only) · 26/27 applies raised the live level · sleep every 4 rounds · ≤4 rival rules / pass · confirmed rules injected + auto-applied

what exactly a sleep pass does (miner · fork gate · apply)

Every 4 wake rounds the harness runs a sleep pass: a miner prompt reads the trace + notes + the per-level ledger (observed transitions, refuted rules, confirmed mechanics from earlier levels) and proposes up to 4 rival {rule, recipe} candidates. execute_and_score runs each recipe on a throwaway game.fork() and confirms it iff fork.levels_completed rises — no LLM and no privileged oracle in the gate. Confirmed recipes are applied to the live game and injected into the actor’s context; each pass logs realized information gain (entropy drop over the rule set) and a [confirmed]/[refuted] verdict per rule. Step through real passes in the trace explorer.
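The sleep pass described above can be sketched in a few lines. This is a minimal illustration, not the harness code: `ToyGame`, its `step` method, the action name `SWAP`, and the choice to apply only the first confirmed recipe are all assumptions made for the example. Only `fork()`, `levels_completed`, and the name `execute_and_score` come from the text.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyGame:
    """Hypothetical stand-in engine: a level clears when ACTION2 follows a SWAP."""
    levels_completed: int = 0
    swapped: bool = False

    def fork(self):
        # A throwaway copy; nothing done to it touches the live game.
        return copy.deepcopy(self)

    def step(self, action):
        if action == "SWAP":
            self.swapped = True
        elif action == "ACTION2" and self.swapped:
            self.levels_completed += 1
            self.swapped = False


def execute_and_score(game, recipe):
    """Run a recipe on a fork; confirm iff the fork's level count rises."""
    fork = game.fork()
    before = fork.levels_completed
    for action in recipe:
        fork.step(action)
    return fork.levels_completed > before


def sleep_pass(game, candidates, max_rivals=4):
    """Judge up to max_rivals candidates on forks, then apply one confirmed recipe."""
    rivals = candidates[:max_rivals]
    verdicts = [
        (c["rule"], "confirmed" if execute_and_score(game, c["recipe"]) else "refuted")
        for c in rivals
    ]
    for cand, (_, verdict) in zip(rivals, verdicts):
        if verdict == "confirmed":
            for action in cand["recipe"]:
                game.step(action)  # apply to the LIVE game
            break  # assumption: one apply per pass keeps the sketch simple
    return verdicts


game = ToyGame()
candidates = [
    {"rule": "ACTION1 is the submit action", "recipe": ["ACTION1"]},
    {"rule": "swap, then ACTION2 submits", "recipe": ["SWAP", "ACTION2"]},
]
verdicts = sleep_pass(game, candidates)
```

The gate is a plain integer comparison on the fork, so no model call sits between a proposed rule and its verdict.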

static architecture reference (kept from v1): the CodeAct wake loop + the sleep gate
[diagram: CodeAct wake loop] observe (frame numbers + PNG image; state / levels / available) → model writes a run_python code block → sandbox runs the code (submit_action, ≤ 3 real actions / block) → prompt extract output: <stdout> + fresh board PNG → ↻ repeat until WIN / budget.
Sandbox namespace: frame (.find(*v), .diff(), .change_summary(), .color_counts(), .state, .levels_completed) · submit_action(name, x, y) → new frame · np · notes (list, persists)
The CodeAct wake loop (v1 diagram, still accurate). A single actor: observe (frame + image) → model writes run_python → sandbox executes → stdout + new image feed back. The boxed sandbox namespace is the agent’s entire API.
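One round of that loop, with the per-block cap of 3 real actions, might look like the sketch below. The names `make_submit_action`, `run_block`, `ActionBudgetExceeded`, and `fake_step` are hypothetical, and plain `exec` stands in for the real sandbox, which it is not. Only `submit_action(name, x, y)`, `notes`, and the cap itself come from the diagram.

```python
import contextlib
import io


class ActionBudgetExceeded(RuntimeError):
    """Raised when a code block asks for more real actions than the cap allows."""


def make_submit_action(step_fn, max_actions=3):
    """Wrap an engine step so one block can take at most max_actions real actions."""
    used = 0

    def submit_action(name, x=None, y=None):
        nonlocal used
        if used >= max_actions:
            raise ActionBudgetExceeded(f"block already used {max_actions} real actions")
        used += 1
        return step_fn(name, x, y)

    return submit_action


def run_block(code, step_fn, notes):
    """Execute one model-written block and return its captured stdout."""
    namespace = {"submit_action": make_submit_action(step_fn), "notes": notes}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, namespace)  # illustration only: exec is not a sandbox
        except ActionBudgetExceeded as err:
            print(f"[harness] {err}")
    return buffer.getvalue()


clicks = []


def fake_step(name, x, y):
    # Hypothetical engine step: records the click and returns a fake frame label.
    clicks.append((name, x, y))
    return f"frame after {len(clicks)} action(s)"


block = (
    "notes.append('probing row 0')\n"
    "for x in range(5):\n"
    "    print(submit_action('ACTION6', x, 0))\n"
)
notes = []
stdout = run_block(block, fake_step, notes)
```

Because `notes` is passed in by reference, whatever the block appends survives into the next round, which matches the "persists" label in the diagram.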
[diagram: sleep gate] actor plays → sleep? (every 4 rounds) → mine ≤ 4 rival rules {rule, recipe} → fork · execute · judge: run recipe on game.fork(), confirm IFF levels_completed↑ → yes: apply confirmed recipe to the LIVE game + inject into actor context · no: ↻
The sleep gate (v1 diagram, updated labels). Mine ≤4 rival rules → fork-execute-judge (confirm only if the fork’s real level count rises) → auto-apply + inject. The gate has no LLM and no oracle.
02

Prior works — six systems, one load-bearing insight each

Each card is a system we strip-mined for the loop: its single load-bearing insight, and where it landed in our design. From each of the six we took exactly one insight and absorbed it into the loop.

legend: card = hypothesis / record · ✓ = confirmed · ✗ = refuted · door = validation gate

(b) Piriyakulkij & Ellis — rules as a posterior

[diagram: bars r1–r4 = posterior over candidate rules; a discriminating experiment moves the mass (✓ 0.9)]

“Hypotheses live as a distribution; experiments are chosen to discriminate between them.”

Hypotheses stay alive as a distribution, and experiments are what split it.
absorbed into our design as…

each sleep pass proposes ~4 rival rules and fork-tests them against each other — confirm/refute over the set, never a single best guess.
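The posterior view, and the "realized information gain (entropy drop over the rule set)" logged per pass, can be made concrete with a small worked example. This is a sketch under assumptions: the rule set is treated as a normalized distribution, and the likelihood numbers are invented for illustration, not taken from any run.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def bayes_update(prior, likelihoods):
    """Posterior ∝ prior × likelihood of the observed outcome under each rule."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation is impossible under every candidate rule")
    return [u / total for u in unnormalized]


# Four rival rules r1..r4, initially equally plausible.
prior = [0.25, 0.25, 0.25, 0.25]

# Illustrative likelihoods: the fork outcome is what r3 predicted and is
# unlikely under the other three rules (made-up numbers).
likelihoods = [0.05, 0.05, 0.90, 0.05]

posterior = bayes_update(prior, likelihoods)
information_gain = entropy(prior) - entropy(posterior)
```

A discriminating experiment is one whose outcomes have very different likelihoods under the rivals, so the entropy drop is large. An experiment that every rule predicts equally well leaves the posterior, and the gain, unchanged.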

(c) Microsoft SkillOpt — the document is the parameter

[diagram: rollout → diagnosis (“re-clicked cleared cells”) → patch to skill.md → gate (door) → library; patch, then patch′]

“The skill document IS the parameter — optimization edits text, and a validation gate decides what ships.”

The skill document itself is the parameter: learning edits the text, and the gate decides what gets deployed.
absorbed into our design as…

confirmed rules are plain-text artifacts, committed only after the execution gate (fork · judge) passes them.

(d) SkillGrad — textual gradients with momentum

[diagram: loss (failed episode) → momentum buffer holding ∇ “submit once”, ∇ “read key first”, ∇ “don’t re-click” → merged patch → skill.md; gradients accumulate, and one coherent patch lands, not three twitchy ones]

“Loss becomes a textual gradient; momentum accumulates gradients before committing one patch.”

Failure becomes a textual gradient, and the gradients pile up like momentum into a single patch.
absorbed into our design as…

refuted rules + hindsight credit accumulate as revision seeds across sleep passes, instead of each pass overwriting the last.

(e) symbolica bestiary — a posterior on every record

[diagram: records “key 0↦8, 2↦9 (colors)”, “ACTION5 resets the timer”, “ring copies the example”, plus a quarantine bin; every record carries a posterior gauge, and when it decays the record is quarantined, not deleted]

“A posterior on every piece of knowledge — and quarantine, not deletion, for what decays.”

Attach a posterior to every piece of knowledge; when in doubt, quarantine it instead of deleting it.
absorbed into our design as…

a confirmed/refuted status on every mined rule; refuted rules stay in the ledger so the miner cannot re-propose them.
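A ledger with a status on every rule can be as small as a dictionary. The sketch below is an assumption about shape, not the project’s data structure: the class name `Ledger`, its methods, and the whitespace/case normalization are all hypothetical. Exact-text matching only blocks literal repeats; a paraphrased rule would slip through and would have to be caught in the miner prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Per-level record of mined rules and their verdicts (hypothetical shape)."""
    status: dict = field(default_factory=dict)  # normalized rule text -> verdict

    @staticmethod
    def _key(rule):
        # Normalize case and whitespace so trivial variations map to one entry.
        return " ".join(rule.lower().split())

    def record(self, rule, verdict):
        if verdict not in ("confirmed", "refuted"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        self.status[self._key(rule)] = verdict

    def filter_new(self, proposals):
        """Drop proposals that were already refuted; refuted entries are never deleted."""
        return [r for r in proposals if self.status.get(self._key(r)) != "refuted"]


ledger = Ledger()
ledger.record("ACTION1 is the submit action", "refuted")
fresh = ledger.filter_new([
    "action1  is the SUBMIT action",   # literal repeat, filtered out
    "ACTION2 is the submit action",    # new rival, kept
])
```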

lineage (v1 diagram): the symbolica orchestrator we forked our engine from
[diagram: orchestrator LLM (“manager, not a player”) calls spawn_agent(...) to create four subagents: explorer (pokes + diffs), theorist (text-only reasoning), tester (tight-budget check), solver (executes actions); shared memories store (add/query); REPL base: frame (numpy) · submit_action]
symbolica (lineage). One orchestrator dynamically spawns four specialized subagents over a shared memories store and a common REPL base. We kept the code-native REPL foundation and dropped the orchestration; what survived conceptually is the posterior-per-record idea above.

(f) astroseger — the runnable world model

[diagram: world_model.py → predict → predicted vs observed → mismatch → refactor → re-predict ✓; the world model is code: run it, diff it against reality, fix it]

“Make the world model runnable — predict, diff against reality, refactor.”

A world model should be code you execute, not a document you read.
absorbed into our design as…

the engine fork is our runnable world model; recipes are validated by execution, and unexplained OBSERVED transitions become revision seeds (fix #8).
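The "unexplained OBSERVED transitions" idea reduces to a set difference: anything the game did that no live hypothesis predicted is residual. In the sketch below the transition encoding `(cell, old_color, new_color)`, the function name, and the seed wording are hypothetical; the source only says that such transitions are fed into the next pass as revision seeds.

```python
def observed_residual(observed, predictions_by_rule):
    """Return observed transitions that no candidate rule predicted, in order."""
    predicted = set().union(*predictions_by_rule.values())
    return [t for t in observed if t not in predicted]


# Transitions encoded as (cell, old_color, new_color); an assumed format.
observed = [
    ((3, 4), 9, 8),
    ((5, 4), 9, 12),
]
predictions_by_rule = {
    "clicking a 9-cell turns it 8": {((3, 4), 9, 8)},
    "clicking never changes column 5": set(),
}

residual = observed_residual(observed, predictions_by_rule)

# Turn each unexplained transition into a text seed for the next mining pass.
revision_seeds = [
    f"OBSERVED but unexplained: cell {cell} went {old} -> {new}"
    for cell, old, new in residual
]
```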

lineage (v1 diagram): astroseger’s state machine + Codex world-model files
[diagram: state machine (normal · stuck · reset · trouble) drives a single Codex CLI agent, which writes world_model_engine.py, planner and world_model.md; inputs: ASCII (verify) + PNG (see)]
astroseger (lineage). A four-protocol state machine drives a single Codex agent that writes executable world-model files. We dropped the state machine and the files; what survived is the runnable-world-model insight — our fork plays that role with zero extra files.

(g) AutoMem — memory ops as first-class actions arXiv 2607.01224

[diagram: agent actions = env ops + memory ops; write() is an action, read() comes before acting, on mem/rules.md and mem/failures.md; outer meta-loop: episode reward R retrains the memory policy itself. Memory files are read and written like moves, then the outer loop trains the habit.]

“Memory management itself is a trainable skill (metamemory) — writes and reads are first-class actions, and an outer loop retrains that faculty.”

Memory management is itself a learnable skill: metamemory.
relation to ours

orthogonal to ours — AutoMem trains the memory faculty across episodes; we verify the memory contents at test time (and our fork-verdicts could supply its training signal). arXiv 2607.01224

03

Fixes — eight autoresearch fixes, before/after

every chart = a real run from results.json

Eight failures → eight diagnoses → eight fixes; each mini-chart is a real run’s level-over-rounds trajectory, before-run vs after-run.

snapshots, not controlled A/Bs · configs evolved between seeds (see limits)

  1. coord-bug 34bb011

    clicks landed on the wrong cell → pass coordinates through untouched

    Problem: the harness re-mapped ACTION6(x,y) before submitting — every click the model reasoned about landed on the wrong cell. Weeks of 0-level runs were plumbing, not reasoning. Fix: pass grid coordinates through to the engine untouched.

  2. exposure → auto-apply 34bb011 (G2 escape runner)

    confirmed rules were only shown → sleep applies them itself

    Problem: confirmed rules were only shown to the wake actor as context (“exposure”), and the weak wake model ignored them. Fix: sleep executes its own confirmed recipe directly on the live game — the first level ever cleared happened during sleep.

  3. evidence-conditioned mining 98574ee

    the miner free-associated → condition on the observed-transition ledger

    Problem: the miner free-associated theories disconnected from what the game had actually done. Fix: the miner must condition on a per-level ledger of observed transitions and refuted rules; each pass is scored by realized information gain. (Levels didn’t jump immediately — this fix made passes measurable, which the later fixes needed.)

  4. relational lens miner prompt

    pixel-local hypotheses → ask for relations between regions

    Problem: hypotheses were pixel-local (“cell (12,34) turns red”), but ft09’s mechanic is a relation: a small key region dictates colors of other regions. Fix: the miner prompt explicitly asks for relational/constraint hypotheses across regions.

  5. RESET guard 8efed24

    RESET erased cleared levels → blocked unless GAME_OVER

    Problem: the wake actor occasionally issued RESET, which restarts the whole game and erases cleared levels. Fix: RESET is blocked unless the game is actually GAME_OVER.

  6. analogy transfer e1b5030

    every level restarted from zero → seed the miner with cleared-level mechanics

    Problem: every new level started from zero knowledge, even though ft09 levels repeat the key mechanic with a twist. Fix: confirmed mechanics from cleared levels are injected into the miner as analogy seeds (“last level the key meant X — what does it mean here?”).

  7. noop reset + fresh-strategy apply cc712e7 · 64e4250

    noop-guard silently disabled sleep → reset it on level clear

    Problem: the v5e collapse — two empty sleep passes at level 0 tripped the noop-guard and silently disabled sleep for the entire 36-round episode; stale strategies from earlier levels were also being re-applied. Fix: the noop-disable resets on level clear (a new level is a new hypothesis space), only this pass’s confirmed strategies are applied, and max_noops 2→4.

  8. observed-residual f9e5d10

    unpredicted transitions were dropped → feed them back as revision seeds

    Problem: transitions the game exhibited that no hypothesis predicted (OBSERVED events) were logged and then dropped. Fix: feed OBSERVED click transitions into the next pass’s revision seeds — the missing residual. The next campaign produced the level-4 run.
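The two guard fixes in the list above (5 and 7) are small enough to sketch together. The function and class names are hypothetical, and whether the real counter tracks consecutive or total empty passes is not stated in the log; this version counts empty passes and clears the count only on a level clear, which is the behaviour the fix describes.

```python
def guard_action(action, game_state):
    """Fix 5: block RESET unless the game is actually over."""
    if action == "RESET" and game_state != "GAME_OVER":
        return None  # blocked: a RESET here would erase cleared levels
    return action


class NoopGuard:
    """Fix 7: disable sleep after too many empty passes, re-arm on level clear."""

    def __init__(self, max_noops=4):
        self.max_noops = max_noops
        self.noops = 0
        self.best_level = 0

    def record_pass(self, confirmed_count):
        # An empty pass is one that confirmed no rule.
        if confirmed_count == 0:
            self.noops += 1

    def observe_level(self, levels_completed):
        # A new level is a new hypothesis space, so the count starts over.
        if levels_completed > self.best_level:
            self.best_level = levels_completed
            self.noops = 0

    @property
    def sleep_enabled(self):
        return self.noops < self.max_noops


# Reproduce the v5e failure shape with the old limit of 2, then re-arm.
guard = NoopGuard(max_noops=2)
guard.record_pass(0)
guard.record_pass(0)
disabled_after_two_empty = not guard.sleep_enabled
guard.observe_level(1)
rearmed_after_level_clear = guard.sleep_enabled
```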

04

Explorer — five runs, one story each

step round-by-round in the trace explorer

Every grid behind these cards is an exact engine replay, not a screenshot: open a card in the trace explorer and step it with /.

technical detail: replay fidelity & provenance

Grids are produced by replaying each trace’s actions in ttso.core.game Game('ft09'), validating both levels_completed and state per round against the trace — zero mismatches in every replayed round.

  • g1_neutral — FULL (12/12 rounds exact).
  • v5e_s25 — FULL (36/36 rounds exact; no sleep-applies, so the whole noop-guard failure is replayable).
  • g2_on_s2 — exact up to round 8; the first sleep-apply lands clicks outside the trace, so rounds 9–12 show a placeholder with the apply event (“levels 0→1 during sleep”).
  • v5f_s28 — exact up to round 8; applies after rounds 8/12/16.
  • win2_s31 — exact up to round 4; transitions 0→1, 1→2, 2→3 happened in-sleep, 3→4 in wake (round 34), all annotated as events.
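The per-round check behind those "exact" labels can be sketched as follows. `StubGame`, the action name `CLEAR`, the state label `PLAYING`, and the trace schema (one dict per round with `actions`, `levels_completed`, `state`) are assumptions for illustration; the real replay uses `Game('ft09')` from `ttso.core.game`, whose interface is not shown in this log.

```python
class StubGame:
    """Hypothetical engine stand-in: the action 'CLEAR' completes a level."""

    def __init__(self):
        self.levels_completed = 0
        self.state = "PLAYING"  # placeholder state label

    def step(self, action):
        if action == "CLEAR":
            self.levels_completed += 1


def validate_replay(game, rounds):
    """Replay each round's actions; return 1-based round numbers that mismatch."""
    mismatches = []
    for number, rnd in enumerate(rounds, start=1):
        for action in rnd["actions"]:
            game.step(action)
        replayed = (game.levels_completed, game.state)
        recorded = (rnd["levels_completed"], rnd["state"])
        if replayed != recorded:
            mismatches.append(number)
    return mismatches


trace = [
    {"actions": ["ACTION6"], "levels_completed": 0, "state": "PLAYING"},
    {"actions": ["CLEAR"], "levels_completed": 1, "state": "PLAYING"},
]
clean = validate_replay(StubGame(), trace)

# A trace whose second round claims a level the replay never reaches.
bad_trace = [trace[0], {"actions": ["ACTION6"], "levels_completed": 1, "state": "PLAYING"}]
broken = validate_replay(StubGame(), bad_trace)
```

An empty list corresponds to the "zero mismatches" claim; a sleep-apply whose clicks are missing from the trace would show up as a mismatch from that round on, which is why those runs are only exact up to the first apply.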

Sleep-pass positions (after_round) are inferred from the fixed every-4-rounds cadence and validated against levels_before/levels_after for all 21 runs (strict on levels_before; soft on levels_after because a wake-round GAME_OVER reset can drop the level right after a sleep, as in v5_full_s20). Data: data/results.json (21 runs), raw traces in data/raw/, rebuild script data/build_results.py.
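The cadence inference with its strict and soft checks could look like this. The schema is an assumption: `levels_by_round[i]` is taken to be `levels_completed` at the end of wake round `i + 1`, and each pass is a dict with `levels_before` and `levels_after`. The actual logic lives in data/build_results.py and may differ.

```python
def infer_sleep_positions(levels_by_round, passes, cadence=4):
    """Place pass k after round cadence*(k+1); strict on before, soft on after."""
    placed = []
    for index, sleep_pass in enumerate(passes):
        after_round = cadence * (index + 1)
        if after_round > len(levels_by_round):
            raise ValueError(f"pass {index} falls beyond the last round")
        # Strict: the level at the end of the preceding wake round must match.
        if levels_by_round[after_round - 1] != sleep_pass["levels_before"]:
            raise ValueError(f"pass {index}: levels_before mismatch at round {after_round}")
        # Soft: the next wake round may differ (e.g. a GAME_OVER reset), so only flag it.
        soft_ok = True
        if after_round < len(levels_by_round):
            soft_ok = levels_by_round[after_round] == sleep_pass["levels_after"]
        placed.append({"after_round": after_round, "soft_ok": soft_ok})
    return placed


levels_by_round = [0, 0, 0, 0, 1, 1, 1, 1, 2, 0, 0, 0]
passes = [
    {"levels_before": 0, "levels_after": 1},
    {"levels_before": 1, "levels_after": 2},
]
placed = infer_sleep_positions(levels_by_round, passes)
```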

05

RHAE — how efficient, relative to a human?

win2_s31 = 17.9

Clearing levels says whether; RHAE says how wastefully: level 1 is human-efficient (score 114.8), while the overall 17.9 is low by construction because unreached levels score 0. 17.9 is a low score; the metric is designed to come out low, and we publish it as is.

L1: 14 actions vs human 15 → 114.8 · L2 ugliest: 15 vs 7 → 21.8 · L5–6 unreached → 0 · not leaderboard-comparable (fork caveat below)

score = min( (baseline / actions)² × 100,  115 )   ·   uncleared ⇒ 0
RHAE = Σ ℓ·score  /  Σ ℓ   (ℓ = 1…6, level-number-weighted)
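| level ℓ | human baseline (actions) | win2_s31 (actions) | score_ℓ | weight ℓ |
|---|---|---|---|---|
| 1 | 15 | 14 | 114.8 | 1 |
| 2 | 7 | 15 | 21.8 | 2 |
| 3 | 15 | 24 | 39.1 | 3 |
| 4 | 16 | 32 | 25.0 | 4 |
| 5 | 21 | (uncleared) | 0 | 5 |
| 6 | 17 | (uncleared) | 0 | 6 |

RHAE = (114.8 + 2·21.8 + 3·39.1 + 4·25.0) / 21 = 17.9 (Σ weights = 21)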
how to read the table + the honest fork-vs-API caveat

Reading: level 1 is essentially human-efficient (14 actions vs the 15-action baseline); level 2 is the ugliest (15 actions vs a human’s 7 — the agent re-derives the key instead of recognizing the repeat); and everything from level 5 up is simply unreached. Per level: min((baseline/actions)²×100, 115), level-number-weighted, uncleared ⇒ 0. 17.9 is a low score by construction — that is the point of publishing it.
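The arithmetic is easy to recompute from the table. The sketch below uses only the baselines and action counts listed there; the function names are hypothetical.

```python
# Human baselines (actions) per level, and win2_s31's action counts for the
# levels it cleared. Levels 5 and 6 were not reached.
BASELINE = {1: 15, 2: 7, 3: 15, 4: 16, 5: 21, 6: 17}
ACTIONS = {1: 14, 2: 15, 3: 24, 4: 32}


def level_score(baseline, actions, cap=115.0):
    """min((baseline / actions)^2 * 100, cap); an uncleared level scores 0."""
    if actions is None:
        return 0.0
    return min((baseline / actions) ** 2 * 100.0, cap)


def rhae(baseline, actions):
    """Level-number-weighted mean of the per-level scores."""
    weighted = sum(
        level * level_score(baseline[level], actions.get(level))
        for level in baseline
    )
    return weighted / sum(baseline.keys())


scores = {
    level: round(level_score(BASELINE[level], ACTIONS.get(level)), 1)
    for level in BASELINE
}
overall = round(rhae(BASELINE, ACTIONS), 1)
print(scores)
print(overall)
```

Because the weights grow with the level number, the two unreached levels carry 11 of the 21 weight units, which is what holds the overall figure down even though level 1 scores near the cap.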

the honest fork-vs-API caveat
Sleep tests candidate recipes on engine forks — free, invisible lookahead that the official ARC-AGI-3 HTTP API does not offer. Wake actions and applied recipe clicks are counted in RHAE; fork probes are not. A human (or an API-only agent) pays for every experiment with real actions, so this RHAE is not directly comparable to API-scored agents. It is a within-lab efficiency ledger, not a leaderboard number.
06

Limits — what this does not show yet

read before believing

Three open confounds, and one thing that does hold.

□ weak-wake confound

“Sleep cleared it” is confounded with “wake couldn’t have.”

The wake actor is a weak model that rarely progresses on its own, so “sleep cleared the level” is confounded with “wake couldn’t have.” 3 of win2_s31’s 4 level-ups happened inside sleep (the 4th in wake, round 34) — strong evidence the mechanism works, weak evidence it would still matter over a frontier wake model that might just solve the game directly.

□ single game so far

Everything here is ft09 only — a rule-mining-friendly game.

ft09’s mechanic (a key region dictating block colors, repeated across levels) is unusually friendly to rule-mining and analogy transfer. No claim survives contact with ls20 or vc33 until the same loop is run there.

□ config evolution across seeds

Before/after pairs are chronological snapshots, not matched-seed A/Bs.

The 21 runs span 11 config variants; each fix changed the config between seed groups. The before/after pairs in the fix timeline are chronological snapshots, not matched-seed A/Bs. The one clean contrast is v5e’s noop-guard failure vs the v5f family — same mechanism, one guard flag apart.

✓ what does hold

LLM-free gate; 26/27 applies raised the live level; replays exact.

The gate is LLM-free (fork level-count only), and 26 of 27 applied recipes across all 21 runs raised the live level (the exception: one v5c_s23 apply, levels 2→2). Grids in the explorer are exact engine replays with zero validated mismatches, and the level-4 run is fully reproducible from data/raw/win2_s31.jsonl.