code-native agent — frozen wake + sleep harness climbs 0 → 4 levels (ARC-AGI-3 ft09)

How it works — wake → stuck → sleep → fork-test → apply

the hero animation above is this loop, on real frames

Wake plays until stuck; sleep mines ≤4 rival rules; an engine fork executes each recipe; only what raises the fork’s real levels_completed touches the live game. Being stuck is a signal: mine hypotheses, let the game itself be the judge, and apply only what survives.

gate = LLM-free (fork level count only) · 26/27 applies raised the live level · sleep every 4 rounds · ≤4 rival rules / pass · confirmed rules injected + auto-applied

what exactly a sleep pass does (miner · fork gate · apply)

Every 4 wake rounds the harness runs a sleep pass: a miner prompt reads the trace + notes + the per-level ledger (observed transitions, refuted rules, confirmed mechanics from earlier levels) and proposes up to 4 rival {rule, recipe} candidates. execute_and_score runs each recipe on a throwaway game.fork() and confirms it iff fork.levels_completed rises — no LLM and no privileged oracle in the gate. Confirmed recipes are applied to the live game and injected into the actor’s context; each pass logs realized information gain (entropy drop over the rule set) and a [confirmed]/[refuted] verdict per rule. Step through real passes in the trace explorer.
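
The gate described above can be sketched in a few lines. This is a minimal illustration, not the harness code: `execute_and_score`, `fork()` and `levels_completed` are named in the text, while `ToyGame`, its `click` method, the single-target win rule and `sleep_pass` are hypothetical stand-ins chosen so the example runs on its own.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical stand-in for the engine: clicking the target cell clears a level."""
    target: tuple = (2, 3)
    levels_completed: int = 0
    history: list = field(default_factory=list)

    def click(self, x, y):
        self.history.append((x, y))
        if (x, y) == self.target:
            self.levels_completed += 1

    def fork(self):
        # Throwaway copy: nothing done to the fork reaches the live game.
        return copy.deepcopy(self)


def execute_and_score(game, recipe):
    """Confirm a recipe iff the fork's level count rises. No LLM is consulted."""
    fork = game.fork()
    before = fork.levels_completed
    for x, y in recipe:
        fork.click(x, y)
    return fork.levels_completed > before


def sleep_pass(game, candidates, max_rules=4):
    """Fork-test up to max_rules rival {rule, recipe} pairs; apply confirmed ones live."""
    verdicts = {}
    for rule, recipe in candidates[:max_rules]:
        confirmed = execute_and_score(game, recipe)
        verdicts[rule] = "confirmed" if confirmed else "refuted"
        if confirmed:
            for x, y in recipe:
                game.click(x, y)  # only confirmed recipes touch the live game
    return verdicts
```

Refuted recipes leave no trace in the live game's `history`, which is the property the gate is meant to guarantee.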

static architecture reference (kept from v1): the CodeAct wake loop + the sleep gate

The CodeAct wake loop (v1 diagram, still accurate). A single actor: observe (frame + image) → model writes run_python → sandbox executes → stdout + new image feed back. The boxed sandbox namespace is the agent’s entire API.
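
One way to picture a single turn of this loop, as a sketch under stated assumptions: `run_python` is the tool name from the text, but its signature here, the `wake_round` helper and the `model` callable are hypothetical. Plain `exec` is used only to keep the example short and is not a real sandbox.

```python
import contextlib
import io


def run_python(code, namespace):
    """Execute model-written code in the given namespace and capture its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, namespace)  # illustration only: exec provides no isolation
        except Exception as exc:
            # Errors are fed back to the actor like any other output.
            print(f"error: {exc!r}")
    return buffer.getvalue()


def wake_round(model, observation, namespace):
    """observe -> model writes code -> sandbox executes -> stdout feeds back."""
    code = model(observation)
    stdout = run_python(code, namespace)
    return {"code": code, "stdout": stdout}
```

Whatever objects are placed in `namespace` are the only things the generated code can reach, which mirrors the idea that the sandbox namespace is the agent's entire API.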

The sleep gate (v1 diagram, updated labels). Mine ≤4 rival rules → fork-execute-judge (confirm only if the fork’s real level count rises) → auto-apply + inject. The gate has no LLM and no oracle.

Prior works — six systems, one load-bearing insight each

animated · pure CSS · hover to pause

Each card is a system we strip-mined for the loop: its single load-bearing insight, and where it landed in our design. From each of the six systems we took only one load-bearing insight and absorbed it into the loop.

card = hypothesis / record · ✓ = confirmed · ✗ = refuted · door = validation gate · hover any diagram to pause

(b) Piriyakulkij & Ellis — rules as a posterior

“Hypotheses live as a distribution; experiments are chosen to discriminate between them.”

Hypotheses live as a distribution, and experiments split that distribution.

absorbed into our design as…

each sleep pass proposes ~4 rival rules and fork-tests them against each other — confirm/refute over the set, never a single best guess.

(c) Microsoft SkillOpt — the document is the parameter

“The skill document IS the parameter — optimization edits text, and a validation gate decides what ships.”

The skill document itself is the parameter: learning edits the text, and the gate decides what is deployed.

absorbed into our design as…

confirmed rules are plain-text artifacts, committed only after the execution gate (fork · judge) passes them.

(d) SkillGrad — textual gradients with momentum

“Loss becomes a textual gradient; momentum accumulates gradients before committing one patch.”

Failures become textual gradients and, like momentum, accumulate into a single patch.

absorbed into our design as…

refuted rules + hindsight credit accumulate as revision seeds across sleep passes, instead of each pass overwriting the last.

(e) symbolica bestiary — a posterior on every record

“A posterior on every piece of knowledge — and quarantine, not deletion, for what decays.”

Attach a posterior to every piece of knowledge; when in doubt, quarantine rather than delete.

absorbed into our design as…

a confirmed/refuted status on every mined rule; refuted rules stay in the ledger so the miner cannot re-propose them.

lineage (v1 diagram): the symbolica orchestrator we forked our engine from

symbolica (lineage). One orchestrator dynamically spawns four specialized subagents over a shared memories store and a common REPL base. We kept the code-native REPL foundation and dropped the orchestration; what survived conceptually is the posterior-per-record idea above.

(f) astroseger — the runnable world model

“Make the world model runnable — predict, diff against reality, refactor.”

A world model should be code that runs, not a document that is read.

absorbed into our design as…

the engine fork is our runnable world model; recipes are validated by execution, and unexplained OBSERVED transitions become revision seeds (fix #8).

lineage (v1 diagram): astroseger’s state machine + Codex world-model files

astroseger (lineage). A four-protocol state machine drives a single Codex agent that writes executable world-model files. We dropped the state machine and the files; what survived is the runnable-world-model insight — our fork plays that role with zero extra files.

(g) AutoMem — memory ops as first-class actions arXiv 2607.01224

“Memory management itself is a trainable skill (metamemory) — writes and reads are first-class actions, and an outer loop retrains that faculty.”

Memory management itself is a learnable skill: metamemory.

relation to ours

orthogonal to ours — AutoMem trains the memory faculty across episodes; we verify the memory contents at test time (and our fork-verdicts could supply its training signal). arXiv 2607.01224

Fixes — eight autoresearch fixes, before/after

every chart = a real run from results.json

Eight failures → eight diagnoses → eight fixes. Each mini-chart is a real run’s level-over-rounds trajectory, before-run vs after-run.

snapshots, not controlled A/Bs · configs evolved between seeds (see limits)

coord-bug 34bb011

✗ clicks landed on the wrong cell → pass coordinates through untouched ✓

Problem: the harness re-mapped ACTION6(x,y) before submitting — every click the model reasoned about landed on the wrong cell. Weeks of 0-level runs were plumbing, not reasoning. Fix: pass grid coordinates through to the engine untouched.
exposure → auto-apply 34bb011 (G2 escape runner)

✗ confirmed rules were only shown → sleep applies them itself ✓

Problem: confirmed rules were only shown to the wake actor as context (“exposure”), and the weak wake model ignored them. Fix: sleep executes its own confirmed recipe directly on the live game — the first level ever cleared happened during sleep.
evidence-conditioned mining 98574ee

✗ the miner free-associated → condition on the observed-transition ledger ✓

Problem: the miner free-associated theories disconnected from what the game had actually done. Fix: the miner must condition on a per-level ledger of observed transitions and refuted rules; each pass is scored by realized information gain. (Levels didn’t jump immediately — this fix made passes measurable, which the later fixes needed.)
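
The text does not spell out how realized information gain is computed. A minimal sketch, assuming it is the drop in Shannon entropy of a normalized weight vector over the rival rules (the function names and the weighting scheme are assumptions, not the harness's definitions):

```python
import math


def entropy_bits(weights):
    """Shannon entropy in bits of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one rule must carry positive weight")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def information_gain(prior, posterior):
    """Entropy drop over the rule set from before a sleep pass to after it."""
    return entropy_bits(prior) - entropy_bits(posterior)
```

Under this assumption, four equally plausible rivals collapsing to one confirmed rule is a gain of 2 bits, and a pass that rules nothing out gains 0.
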
relational lens miner prompt

✗ pixel-local hypotheses → ask for relations between regions ✓

Problem: hypotheses were pixel-local (“cell (12,34) turns red”), but ft09’s mechanic is a relation: a small key region dictates colors of other regions. Fix: the miner prompt explicitly asks for relational/constraint hypotheses across regions.
RESET guard 8efed24

✗ RESET erased cleared levels → blocked unless GAME_OVER ✓

Problem: the wake actor occasionally issued RESET, which restarts the whole game and erases cleared levels. Fix: RESET is blocked unless the game is actually GAME_OVER.
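
The guard amounts to a one-line check. A sketch, assuming actions and game states are plain strings; `RESET` and `GAME_OVER` come from the text, while `guard_action` and the `IN_PROGRESS` placeholder state are hypothetical:

```python
def guard_action(action, game_state):
    """Return the action to submit, or None when the guard blocks it."""
    if action == "RESET" and game_state != "GAME_OVER":
        return None  # a RESET here would erase cleared levels
    return action
```

Every other action passes through unchanged, so the guard cannot affect normal play.
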
analogy transfer e1b5030

✗ every level restarted from zero → seed the miner with cleared-level mechanics ✓

Problem: every new level started from zero knowledge, even though ft09 levels repeat the key mechanic with a twist. Fix: confirmed mechanics from cleared levels are injected into the miner as analogy seeds (“last level the key meant X — what does it mean here?”).
noop reset + fresh-strategy apply cc712e7 · 64e4250

✗ noop-guard silently disabled sleep → reset it on level clear ✓

Problem: the v5e collapse. Two empty sleep passes at level 0 tripped the noop-guard and silently disabled sleep for the remaining 28 rounds of the 36-round episode; stale strategies from earlier levels were also being re-applied. Fix: the noop-disable resets on level clear (a new level is a new hypothesis space), only the current pass’s confirmed strategies are applied, and max_noops is raised from 2 to 4.

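
A sketch of the repaired guard, under assumptions: `max_noops` is the parameter named above, but the `NoopGuard` class, its method names and the choice to count empty passes per level are illustrative guesses at the behaviour, not the harness code.

```python
class NoopGuard:
    """Disable sleep after too many empty passes; re-arm when a level is cleared."""

    def __init__(self, max_noops=4):
        self.max_noops = max_noops
        self.noops = 0
        self.level = 0

    def observe_level(self, level):
        if level > self.level:
            # A new level is a new hypothesis space: forget earlier empty passes.
            self.level = level
            self.noops = 0

    def record_pass(self, confirmed_count):
        if confirmed_count == 0:
            self.noops += 1

    def sleep_enabled(self):
        return self.noops < self.max_noops
```

With `max_noops=2` and no reset, two empty passes at level 0 switch sleep off for good, which is the v5e failure; calling `observe_level` after a level clear switches it back on.
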
observed-residual f9e5d10

✗ unpredicted transitions were dropped → feed them back as revision seeds ✓

Problem: transitions the game exhibited that no hypothesis predicted (OBSERVED events) were logged and then dropped. Fix: feed OBSERVED click transitions into the next pass’s revision seeds — the missing residual. The next campaign produced the level-4 run.
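
The residual can be pictured as a set difference. A sketch, assuming each transition is a hashable tuple such as `((x, y), colour_before, colour_after)`; that record layout and the `revision_seeds` helper are hypothetical:

```python
def revision_seeds(observed, predicted):
    """Keep observed click transitions that no current hypothesis predicted."""
    predicted = set(predicted)
    seen = set()
    seeds = []
    for transition in observed:
        if transition not in predicted and transition not in seen:
            seen.add(transition)  # de-duplicate while preserving first-seen order
            seeds.append(transition)
    return seeds
```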

Explorer — five runs, one story each

step round-by-round in the trace explorer

Every grid behind these cards is an exact engine replay, not a screenshot: open a card in the trace explorer and step it with ←/→.

win2_s31: the level-4 run; 3 of 4 level-ups happened in-sleep · L4 · RHAE 17.9 · 48 rounds · 98 actions · 11 sleeps
v5f_s28: three sleep-escapes chain 0→1→2→3 by analogy · L3 · 18 rounds · 80 actions · 4 sleeps
g2_on_s2: the first escape ever; auto-apply lands, 0→1 during sleep · L1 · 12 rounds · 29 actions · 3 sleeps
v5e_s25: the noop-guard failure; sleep silently off for 28 rounds · L0 · 36 rounds · 44 actions · 2 sleeps
g1_neutral: sleep OFF baseline; never leaves level 0 · L0 · 12 rounds · 25 actions · 0 sleeps

technical detail: replay fidelity & provenance

Grids are produced by replaying each trace’s actions in ttso.core.game Game('ft09'), validating both levels_completed and state per round against the trace — zero mismatches in every replayed round.

g1_neutral — FULL (12/12 rounds exact).
v5e_s25 — FULL (36/36 rounds exact; no sleep-applies, so the whole noop-guard failure is replayable).
g2_on_s2 — exact up to round 8; the first sleep-apply lands clicks outside the trace, so rounds 9–12 show a placeholder with the apply event (“levels 0→1 during sleep”).
v5f_s28 — exact up to round 8; applies after rounds 8/12/16.
win2_s31 — exact up to round 4; transitions 0→1, 1→2, 2→3 happened in-sleep, 3→4 in wake (round 34), all annotated as events.
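
The per-round check can be sketched as follows. The trace record layout (`round`, `actions`, `levels_completed`, `state`), the `step` method and the `CounterGame` toy engine are assumptions for illustration; the real replay uses `Game('ft09')` from `ttso.core.game`, whose interface is not shown in the text.

```python
class CounterGame:
    """Hypothetical toy engine: every third action clears a level."""

    def __init__(self):
        self.actions = 0
        self.levels_completed = 0
        self.state = "IN_PROGRESS"

    def step(self, action):
        self.actions += 1
        if self.actions % 3 == 0:
            self.levels_completed += 1


def validate_replay(make_game, trace):
    """Replay a trace and return the rounds where the engine disagrees with it."""
    game = make_game()
    mismatches = []
    for entry in trace:
        for action in entry["actions"]:
            game.step(action)
        replayed = (game.levels_completed, game.state)
        recorded = (entry["levels_completed"], entry["state"])
        if replayed != recorded:
            mismatches.append(entry["round"])
    return mismatches
```

An empty return value corresponds to the "zero mismatches" claim for the replayed rounds.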

Sleep-pass positions (after_round) are inferred from the fixed every-4-rounds cadence and validated against levels_before/levels_after for all 21 runs (strict on levels_before; soft on levels_after because a wake-round GAME_OVER reset can drop the level right after a sleep, as in v5_full_s20). Data: data/results.json (21 runs), raw traces in data/raw/, rebuild script data/build_results.py.
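
A sketch of the inference, assuming passes run consecutively from round 4 with none skipped; `after_round` and `levels_before` are field names from the text, while both helper functions and the `levels_by_round` mapping (round number to level at the end of that round) are hypothetical:

```python
def sleep_positions(num_sleeps, cadence=4):
    """Infer after_round for each sleep pass from the fixed cadence."""
    return [cadence * (k + 1) for k in range(num_sleeps)]


def check_levels_before(levels_by_round, passes, cadence=4):
    """Strict check: return indices of passes whose levels_before disagrees."""
    failures = []
    for k, after_round in enumerate(sleep_positions(len(passes), cadence)):
        if levels_by_round[after_round] != passes[k]["levels_before"]:
            failures.append(k)
    return failures
```

For a run with 3 sleeps this places the passes after rounds 4, 8 and 12.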

RHAE — how efficient, relative to a human?

win2_s31 = 17.9

Clearing levels says whether; RHAE says how wastefully. Level 1 is human-efficient (score 114.8); the overall 17.9 is low by construction because unreached levels score 0. We publish the metric as is, even though it is designed to come out low.

L1: 14 actions vs human 15 → 114.8 · L2 (ugliest): 15 vs 7 → 21.8 · L5–6 unreached → 0 · not leaderboard-comparable (fork caveat below)

score_ℓ = min( (baseline_ℓ / actions_ℓ)² × 100, 115 ) · uncleared ⇒ 0
RHAE = Σ ℓ·score_ℓ / Σ ℓ (ℓ = 1…6, level-number-weighted)

level ℓ	human baseline (actions)	win2_s31 (actions)	score_ℓ	weight ℓ
1	15	14	114.8	1
2	7	15	21.8	2
3	15	24	39.1	3
4	16	32	25.0	4
5	21	— (uncleared)	0	5
6	17	— (uncleared)	0	6
RHAE			(114.8 + 2·21.8 + 3·39.1 + 4·25.0) / 21 = 17.9	Σ=21
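
The table can be reproduced directly from the two formulas. A short sketch: the baselines and action counts are copied from the table, and the function names are illustrative.

```python
def level_score(baseline, actions, cap=115.0):
    """Per-level score: min((baseline / actions)^2 * 100, cap); uncleared levels score 0."""
    if actions is None:
        return 0.0
    return min((baseline / actions) ** 2 * 100.0, cap)


def rhae(baselines, actions):
    """Level-number-weighted mean of the per-level scores."""
    pairs = zip(baselines, actions)
    weighted = sum(level * level_score(b, a) for level, (b, a) in enumerate(pairs, start=1))
    total_weight = sum(range(1, len(baselines) + 1))
    return weighted / total_weight


human_baselines = [15, 7, 15, 16, 21, 17]
win2_s31_actions = [14, 15, 24, 32, None, None]  # None = level not cleared

print(round(rhae(human_baselines, win2_s31_actions), 1))  # 17.9
```

Because the weights grow with the level number, the two unreached levels account for 11 of the 21 weight units, which is why the overall score stays low even with a near-cap level 1.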

how to read the table + the honest fork-vs-API caveat

Reading: level 1 is essentially human-efficient (14 actions vs the 15-action baseline); level 2 is the ugliest (15 actions vs a human’s 7 — the agent re-derives the key instead of recognizing the repeat); and everything from level 5 up is simply unreached. Per level: min((baseline/actions)²×100, 115), level-number-weighted, uncleared ⇒ 0. 17.9 is a low score by construction — that is the point of publishing it.

the honest fork-vs-API caveat

Sleep tests candidate recipes on engine forks — free, invisible lookahead that the official ARC-AGI-3 HTTP API does not offer. Wake actions and applied recipe clicks are counted in RHAE; fork probes are not. A human (or an API-only agent) pays for every experiment with real actions, so this RHAE is not directly comparable to API-scored agents. It is a within-lab efficiency ledger, not a leaderboard number.

Limits — what this does not show yet

read before believing

Three open confounds, one thing that does hold — expand each for the honest version.

□ weak-wake confound

“Sleep cleared it” is confounded with “wake couldn’t have.”

detail

The wake actor is a weak model that rarely progresses on its own, so “sleep cleared the level” is confounded with “wake couldn’t have.” 3 of win2_s31’s 4 level-ups happened inside sleep (the 4th in wake, round 34) — strong evidence the mechanism works, weak evidence it would still matter over a frontier wake model that might just solve the game directly.

□ single game so far

Everything here is ft09 only — a rule-mining-friendly game.

detail

ft09’s mechanic (a key region dictating block colors, repeated across levels) is unusually friendly to rule-mining and analogy transfer. No claim survives contact with ls20 or vc33 until the same loop is run there.

□ config evolution across seeds

Before/after pairs are chronological snapshots, not matched-seed A/Bs.

detail

The 21 runs span 11 config variants; each fix changed the config between seed groups. The before/after pairs in the fix timeline are chronological snapshots, not matched-seed A/Bs. The one clean contrast is v5e’s noop-guard failure vs the v5f family — same mechanism, one guard flag apart.

✓ what does hold

LLM-free gate; 26/27 applies raised the live level; replays exact.

detail

The gate is LLM-free (fork level-count only), and 26 of 27 applied recipes across all 21 runs raised the live level (the exception: one v5c_s23 apply, levels 2→2). Grids in the explorer are exact engine replays with zero validated mismatches, and the level-4 run is fully reproducible from data/raw/win2_s31.jsonl.
