code-native agent — weekly build-log archive (ARC-AGI-3)

experiment journal (latest): ▸ leak-free pipeline & code map · ▸ RQ card journal

2026‑07‑30 · latest

Meeting deck: ft09 6/6 ×2 re-audited — the nine-row component ledger, the four axes, and the priors re-indexed

Every changed component gets a ledger row with its owner file:line, its isolating pair, and a verdict on whether the 6/6 actually used it. The second leak-free WIN (bwsvfull, RHAE 19.7) is not credited to the skill loop; every positive sits adjacent to its control. Four axes are measured on bwsvfull/ls20a (quality best_f1 0.813/0.832 · improvement 0 of 244 & 0 of 124 records ever revised · retrieval best 0.813 vs USED 0.267), and a companion page re-indexes prior systems on the same four axes with evidence-tier badges and the recorded predictions.

component ledger: 9 rows · four axes × 2 games · priors: evidence-tiered

read the week →

2026‑07‑17

0716–0717 repair battery, compressed + the full interactive set (framework·implementation·history·trace)

A compressed strip of 7 runs (h4–h10) with prior work, a click-node pipeline, a module map, an abstract insight-arc timeline, and a trace viewer showing 5 representative turns’ hypothesis·THINK·reasoning·skill claims uncut. Three 0717 amendments measured live: sleep fire rate 4/4→1/4 · rounds 16→5 min · GAME_OVER now refutes.

trace viewer: 5 uncut turns · 2 critic rounds × 5 pages · 3 GAME_OVERs honestly logged

read the week →

2026‑07‑16

Two root causes confirmed: briefing-claim truncation + the judge predictor’s numpy boundary

The wake briefing truncated winner-skill claims at [:160], starving the L3 winning rule (head_capped 83ea9e5), and judge._Predictor passed boards as lists of lists, so the 13/21 skills that assume numpy raised silently and all scored 0 (fixed by a one-line np.asarray, a524931). Verification run h9: credence 0.25→0.89 (first τ breach) · first live observation of an L0→L1 rule carry-over. Constitution 1:1 audit 23/29 OK · 1 blocker sealed. 14 commits.
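The numpy boundary failure can be reproduced in miniature. The sketch below is illustrative only: `skill_count_nonzero` and `score_skill` are hypothetical names, not the project's code, and the real judge._Predictor is assumed (not confirmed) to swallow skill exceptions in a similar way. It shows how a list-of-lists board makes a numpy-assuming skill score 0 without any visible error, and how coercing with np.asarray at the boundary restores it.

```python
import numpy as np


def skill_count_nonzero(board):
    # Hypothetical skill that assumes a numpy board: it reads .shape
    # and uses an elementwise comparison, neither of which a list supports.
    rows, cols = board.shape
    return int((board != 0).sum()), rows * cols


def score_skill(skill, board, coerce):
    """Return 1 if the skill runs, 0 if it raises (the silent-failure path)."""
    if coerce:
        # The one-line boundary fix: accept lists or arrays, hand skills an array.
        board = np.asarray(board)
    try:
        skill(board)
        return 1
    except Exception:
        # Swallowing the exception is what hides the root cause:
        # a crash and a genuinely wrong prediction both look like 0.
        return 0


board = [[0, 1, 0], [2, 0, 3]]  # boards arrive as lists of lists

before = score_skill(skill_count_nonzero, board, coerce=False)
after = score_skill(skill_count_nonzero, board, coerce=True)
print(before, after)
```

Coercing once at the predictor boundary, rather than inside each skill, keeps the skills unchanged; logging the swallowed exception would additionally make this class of failure visible.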

0/21 → 13/21 (a524931) · credence 0.25 → 0.89 · constitution audit 23/29 OK

read the week →

2026‑07‑14

Carrying skills beyond a level as memory — a frozen LLM solves by compressing experience into skills

The post-0710-WIN phase: the week the transfer axis was reorganized so that a skill does not end inside one level but becomes the next level’s memory via its when(board) condition. (This card was missing from the original list and was added retroactively on 07-16; its folder and contents are unchanged from that time.)
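One way to picture a skill acting as cross-level memory is a record that carries its own applicability predicate. The sketch below is an assumption-laden illustration: only the name when(board) comes from the log, while the `Skill` fields, the `recall` helper, and the example skills and boards are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

Board = List[List[int]]


@dataclass
class Skill:
    name: str
    when: Callable[[Board], bool]  # applicability condition, checked per board
    claim: str                     # what the skill asserts (hypothetical text)
    learned_at_level: int          # where it was learned, kept for bookkeeping


def recall(memory: List[Skill], board: Board) -> List[Skill]:
    """Return every stored skill whose when(board) holds on this board.

    The level a skill was learned on is not part of the filter, so a
    skill from an earlier level can fire on a later one.
    """
    return [s for s in memory if s.when(board)]


memory = [
    Skill("toggle_marked_cells", lambda b: any(9 in row for row in b),
          "cells valued 9 are clickable (hypothetical)", learned_at_level=0),
    Skill("wide_board_rule", lambda b: len(b[0]) > 4,
          "wide boards scroll (hypothetical)", learned_at_level=1),
]

level2_board = [[0, 9, 0], [0, 0, 0]]
names = [s.name for s in recall(memory, level2_board)]
print(names)
```

Filtering on the board rather than on the level index is the design choice being illustrated: the condition, not the place of origin, decides whether a skill is carried forward.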

read the week →

2026‑07‑11

TTSO: first leak-free full-game WIN, twice — ft09 6/6 ×2, fork=0

Two independent runs clear every ft09 level with predict-verify alone — no engine cloning (fork). An interactive pipeline (click a node = the real prompt/trace), a six-crises timeline, four prior-work lanes. (The 07-09 combined edition was absorbed into this page — original preserved in git history.)

6/6 levels × 2 runs · fork_dependence = 0 · RHAE 40.9 / 55.7

read the week →

2026‑07‑03 · fork-era snapshot

v2: sleep harness, escapes 0 → 4 levels, RHAE, trace explorer

A harness coordinate bug was silently eating every click. Eight fixes later, the agent clears level 4 on ft09 — with a data-driven run viewer over the real traces. Correction (07-09): this “WIN” was later judged a fork-bruteforce artifact — preserved as a historical record.

level 4 cleared (fork-era) · RHAE 17.9 · 21 runs · 8 fixes

read the week →

2026‑06‑30

v1: code-native baseline, failure ladder, first architecture

The first honest log: one autonomous CodeAct loop, an LLM-free skill gate, and the failure-mode chain — built and genuinely code-native, not yet solving a level from scratch.

0 levels from scratch · failure-mode ladder · LLM-free skill gate

read the week →