25 changes: 25 additions & 0 deletions .verdicts/unshadow-hexaval-unbox/pilot.txt
@@ -0,0 +1,25 @@
UNSHADOW HexaVal-unbox pilot — verdict (mini macOS arm64, clang 21.0.0)
========================================================================
falsifier: known-int binop result reboxed via inline ((HexaVal){.tag=TAG_INT,.i=(...)})
must (a) be byte-identical to out-of-line hexa_int(...) AND (b) close the parity gap.

G5 byte-diff gate (output, both arms + ref-C):
before stdout = 34200003330000000
after stdout = 34200003330000000
ref-C stdout = 34200003330000000
G5: IDENTICAL (md5 63888b02e0325abf096209d943c8413f)

asm (hot mix(), otool -S / clang -S, -O2):
before bl _hexa_int = 17 (total bl = 19)
after bl _hexa_int = 0 (total bl = 2 = mix-call + printf)
after mix() = pure register arith (add/sub/lsl in x8..x11), zero HexaVal spills.

wall (best-of-11, ms):
ref-C @-O2 = 54
before (boxed rebox)= 599
after (inline lit) = 53

findings:
unbox speedup (before/after) = 11.30x (91.2% wall drop)
parity gap before = 11.09x ; after = 0.98x (AT PARITY)
gap closed = 100% of the before-parity gap on this known-int workload.
108 changes: 108 additions & 0 deletions .verdicts/unshadow-same-tu/F-UNSHADOW-SAME-TU.txt
@@ -0,0 +1,108 @@
F-UNSHADOW-SAME-TU: 🔵×🟡 same-TU build as default, cost/benefit PILOT (measured)
================================================================================
Host: pool mini (macOS arm64, Apple clang). Both arms same host, back-to-back.
Runtime source: pre-graduation self/ tree (commit 151c52c8, B9-faithful — the
emitter reproduces runtime.c byte-identically, B9.C-10 source-SHA gate) via
`git archive 151c52c8 self | tar -x` into /tmp (the repo is .c=0 post-B9, so
runtime.c must be supplied; this is the milestone-blessed faithful proxy).
Reproducer (committed, parse-gated + run on mini):
tool/unshadow_same_tu_bench.hexa --rt <dir-with-self/runtime.c> --runs 5

FALSIFIER (pre-registered)
"Making the C-emit build same-TU by default (user.c `#include "runtime.c"`,
no separate runtime.o 2nd TU) (a) opens #2-ext-class cross-layer boundary
wins generally (clang -O2 inlines bl _rt_*/_hexa_* across the now-open
boundary), byte-identical, AND (b) carries an acceptable build-time/binary
cost — i.e. default-on is a net win."

BUILD-RECIPE CHANGE IMPLEMENTED (self/main.hexa cmd_build, GATED HEXA_SAME_TU=1)
Two coordinated, reversible edits — opt-in only, NOT a forced global default:
(1) codegen half — when HEXA_SAME_TU=1, the transpile step is run with
HEXA_USE_RUNTIME_C=1 (the existing codegen.hexa:947 escape hatch), so
emitted user.c does `#include "runtime.c"` instead of `runtime.h` → the
runtime amalgam enters the user TU.
(2) link half — when HEXA_SAME_TU=1, the separate runtime object/source 2nd
TU is dropped from the final clang call (_rt_input = "") — the runtime is
already textually present, so a single TU compiles. (Adding it as a 2nd
TU would duplicate every symbol → link error.)
guarded `shared != "1" && len(target) == 0` (same-TU not applied to --shared
PIC or cross-target zig builds). Unset HEXA_SAME_TU → byte-for-byte the
legacy walled build (the resolve_prebuilt / content-hash-.o / source path).
self/main.hexa parses cleanly.

MEASUREMENT METHOD (honest A/B proxy — no full self-host rebuild)
The full `hexa cc --regen` self-host rebuild is blocked by the B9 wall
(runtime.c GENERATED, absent from a fresh clone — same blocker the prior
unwall agent hit). The milestone spec explicitly blesses a faithful A/B proxy:
two build modes, SAME runtime source, isolating only the TU/link strategy.
· workload transpiled by the INSTALLED hexat (emits `#include "runtime.h"`).
· WALLED : compile user.c + link a precompiled runtime object (2 TU) — the
live default.
· SAME-TU : textually swap runtime.h→runtime.c in user.c (the EXACT transform
the codegen half performs) and compile as ONE TU.
Both arms compile against the SAME runtime.c source (the walled object is
compiled from it), so only the TU boundary varies — precisely the variable
the cmd_build edit controls.

VERIFICATION (verbatim — tool/unshadow_same_tu_bench.hexa --runs 5, mini)
--- workload: string-boundary ---
built: walled=yes same-TU=yes
g5 md5: walled=0e2afa85abbd8d3d13b7a79efb429a8e same-TU=0e2afa85abbd8d3d13b7a79efb429a8e [IDENTICAL]
wall best-of-5: walled=1.87s same-TU=1.48s
binary size: walled=409080B same-TU=408888B
--- workload: HexaVal-arith (control) ---
built: walled=yes same-TU=yes
g5 md5: walled=657d1ec4586d9e7cb3572bd47e3d1bb2 same-TU=657d1ec4586d9e7cb3572bd47e3d1bb2 [IDENTICAL]
wall best-of-5: walled=0.58s same-TU=0.44s
binary size: walled=408728B same-TU=408552B
--- build-time (best-of-3, mini-class) ---
walled COLD (compile runtime.o + link) : 3.53s
walled WARM (cached runtime.o, link only): 0.10s ← live default hot path
same-TU (recompile amalgam EVERY build) : 3.55s
--- _u_main hot-fn boundary `bl` histogram (string workload) ---
  [WALLED]                      [SAME-TU]
  12 bl _hexa_int               (gone: inlined)
   2 bl _rt_str_starts_with     (gone: inlined → 2 bl _hxlcl_strncmp + 2 _hxlcl_strlen)
   1 bl _hexa_contains_poly     (gone: inlined → 1 bl _hxlcl_strstr)
   1 bl _hexa_to_string         (gone: inlined → 1 __hexa_to_string_rec)
   2 bl _hexa_bool              (gone: inlined)
   4 bl _hexa_add_slow          4 bl _hexa_add_slow (kept)
   3 bl _hexa_truthy            4 bl _hexa_truthy   (kept)

FINDING (honest — benefit real, cost prohibitive for default)
BENEFIT (real, generalizes): same-TU exposes the whole HexaVal/runtime ABI to
clang -O2 inlining across the former TU boundary. The #2-ext-class boundary calls _rt_str_starts_with
(2→0) and _hexa_contains_poly (1→0) flip called→inlined exactly as §lto-unwall
predicted; and crucially the win is NOT string-specific — the HexaVal-arith
control (_hexa_int boxing) also wins (0.58→0.44s, −24%), because hexa_int /
hexa_to_string / hexa_bool boxing helpers are themselves runtime boundary
calls that same-TU inlines. Measured wall: string −21% (1.87→1.48s), arith
−24% (0.58→0.44s). g5 byte-IDENTICAL on BOTH workloads.
COST (prohibitive for default-on): same-TU recompiles the full ~14.6K-line
runtime amalgam into EVERY user TU — 3.55s/build vs the walled WARM default
of 0.10s = ~35× build-time tax. The walled model amortizes the one-time
3.53s runtime compile via the content-hash `runtime.<sha>.o` cache; same-TU
structurally CANNOT cache the runtime (it is fused into each user TU, keyed
by user source). Binary size is a wash (−0.05%, −192 B). A second structural
cost: same-TU as a shipped default REQUIRES runtime.c on disk, which B9
graduation removed — default-on would re-introduce a generated-.c dependency.

RULED-OUT AXES
- default-on same-TU is NOT a net win — the ~35× per-build compile tax on the
hot path dominates the −21~24% runtime win for general (non-perf) builds.
- the benefit is NOT string-specific — it generalizes to any HexaVal-boxing
hot loop (control workload confirms), so the lever is the whole runtime ABI.
- binary size is NOT a meaningful axis (wash).

RECOMMENDATION: OPT-IN FLAG (HEXA_SAME_TU=1), NOT default-on. The −21~24%
byte-identical runtime win is real and generalizes, but the ~35× build-time
tax (3.55s vs 0.10s WARM) + the re-introduced generated-runtime.c dependency
make default-on a poor tradeoff for the common build. Same-TU is worth it for
RELEASE / perf builds of HexaVal-/boundary-call-heavy programs — exactly the
opt-in surface this pilot landed. Terminal: opt-in, not default.

VERDICT: 🔵×🟡 same-TU build = OPT-IN (HEXA_SAME_TU=1), NOT default. BENEFIT
−21~24% byte-identical (boundary calls _rt_str_starts_with/_hexa_contains_poly/
_hexa_int inlined, generalizes past strings) · COST ~35× build-time tax
(3.55s vs 0.10s warm) + generated-runtime.c dependency · binary Δ −0.05%
(wash). Terminal measured recommendation: opt-in flag.
46 changes: 46 additions & 0 deletions bench/unshadow/knownint_heavy.hexa
@@ -0,0 +1,46 @@
// UNSHADOW HexaVal-unbox pilot — known-int rebox hot workload.
//
// Every binop below has BOTH operands provably TAG_INT (immutable int `let`s
// or IntLits), so codegen takes the STRUCTURAL-2 known-int fast path. On
// origin/main that path re-boxes each intermediate via the out-of-line
// `hexa_int(…)` runtime call (`bl _hexa_int` at -O2 — the runtime.o C-ABI
// wall). The pilot emits the result as an INLINE compound literal
// `((HexaVal){.tag=TAG_INT,.i=(…)})` instead, so clang -O2 can keep the
// chain in registers and drop the rebox calls. Byte-equivalent (hexa_int(n)
// ≡ {.tag=TAG_INT,.i=n}); the value is identical, only the boxing form moves.
fn mix(n: int) -> int {
    let a = n + 1
    let b = a + 2
    let c = b * 3
    let d = c - a
    let e = d + b
    let f = e * 2
    let g = f - c
    let h = g + a
    let i = h * 2
    let j = i - b
    let k = j + c
    let l = k - d
    let m = l + e
    let o = m * 2
    let p = o - f
    let q = p + g
    return q
}

fn main() {
    // warmup
    let mut warm = 0
    while warm < 3 {
        mix(7)
        warm = warm + 1
    }
    // timed body
    let mut r = 0
    let mut rounds = 0
    while rounds < 60000000 {
        r = r + mix(rounds)
        rounds = rounds + 1
    }
    println(r)
}
139 changes: 139 additions & 0 deletions domains/UNSHADOW.bench.md
@@ -378,3 +378,142 @@ blocks fold/LICM, so parity is **not** free. parity ≈1.0 is `.c=0`
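> Binaries link the cached `runtime.o` (zero code changes) · the reference C was written
> outside the repo under `/tmp` (avoids the hook · zero `.c` files inside the repo) · fib
> ref-O2=1ms is degenerate because clang eliminates the dead loop (the qualitative signal
> "no folding across the wall" matters more than the absolute ratio).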

## §hexaval-unbox: 🟢 HexaVal unboxing pilot (known-int rebox → inline literal)

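> Measurements for the milestone `🟢 HexaVal unboxing / register-pack`. **Summary**: this
> pilot removes HexaVal boxing, which §parity-attest identified as the main cause of the raw
> 7.9×~1263× gap, at **one narrow point**. The codegen STRUCTURAL-2 known-int BinOp fast
> path used to rebox its result through the **out-of-line `hexa_int(…)`** call; it now emits
> an **inline C compound literal** `((HexaVal){.tag=TAG_INT,.i=(…)})`, which removes the
> `bl _hexa_int` ABI call (the runtime.o C-ABI wall) from every arithmetic step of the hot
> loop. Location: `self/codegen.hexa` L5127. The path fires only when `_is_known_int`
> statically certifies both operands as TAG_INT (immutable int-only `let`s or IntLits).
> Everything else keeps the existing boxed emit, so the general path is unchanged.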

Measurement: `mini` (macOS arm64) · clang 21.0.0 · best-of-11 wall (real min, ms) · same
`runtime.o` linked · 2026-05-30 · tag `mac-arm64-mini`. Workload = `knownint_heavy`
(a 16-op immutable int `let` chain × 60M iterations; every op fires the known-int fast path).

### Table: 3-way wall min (ms) + asm

| arm | wall (ms) | hot `mix()` `bl _hexa_int` | parity gap (arm/ref) |
|---|---|---|---|
| ref-C @-O2 (plain `int64_t`) | 54 | — (no HexaVal) | 1.00× (baseline) |
| BEFORE — out-of-line `hexa_int(…)` rebox (origin/main) | 599 | **17** | **11.09×** |
| AFTER — inline `((HexaVal){.tag=TAG_INT,.i=(…)})` (pilot) | 53 | **0** | **0.98×** |

> g5 (correctness): the stdout of **all three binaries** (before/after/ref) is identical =
> `34200003330000000` (md5 `63888b02e0325abf096209d943c8413f`). asm: AFTER `mix()` is pure
> register arithmetic (`add`/`sub`/`lsl` in x8..x11) with zero HexaVal spills (`str`/`ldr`);
> clang -O2 folds the 16-op chain into ~10 scalar instructions. In BEFORE, the 17 opaque
> `bl _hexa_int` calls block that fold.
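
The `34200003330000000` stdout can be cross-checked by hand. The sketch below is a Python transcription of `mix()` from `bench/unshadow/knownint_heavy.hexa`; the helper names `mix_py` and `closed_form_total` are hypothetical, and it assumes hexa `int` behaves like a 64-bit signed integer on this workload (the total stays far below 2^63, so no wraparound is involved). Expanding the 16-op chain gives `mix(n) = 19*n + 65`, so the timed loop has a closed-form sum.

```python
def mix_py(n: int) -> int:
    """Python transcription of mix() from bench/unshadow/knownint_heavy.hexa."""
    a = n + 1
    b = a + 2
    c = b * 3
    d = c - a
    e = d + b
    f = e * 2
    g = f - c
    h = g + a
    i = h * 2
    j = i - b
    k = j + c
    l = k - d
    m = l + e
    o = m * 2
    p = o - f
    q = p + g
    return q


def closed_form_total(rounds: int) -> int:
    """Sum of mix_py(n) for n in [0, rounds), using mix_py(n) == 19*n + 65."""
    # rounds * (rounds - 1) is always even, so the integer division is exact.
    return 19 * rounds * (rounds - 1) // 2 + 65 * rounds


if __name__ == "__main__":
    # Brute-force check of the closed form on a small range, then the 60M total.
    assert sum(mix_py(n) for n in range(1000)) == closed_form_total(1000)
    print(closed_form_total(60_000_000))  # 34200003330000000
```

That the whole body is an affine function of `n` is consistent with clang -O2 reducing the chain to a handful of scalar instructions once the opaque boxing calls are gone.
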

### Finding: removing boxing closes the parity gap on the known-int workload

- **unbox speedup = 11.30× (91.2% wall drop)** · **parity gap 11.09× → 0.98×** = the
  raw-parity gap of the known-int hot loop is **100% closed** (AFTER 53 ms ≈ ref 54 ms,
  identical within noise).
- This **confirms**, on the boxing axis, §parity-attest's claim that "raw parity is blocked
  by the runtime.o C-ABI wall": the wall is the out-of-line `hexa_int(…)` rebox on every op.
  Once the inline literal removes that call, clang -O2 keeps the accumulator in registers
  and folds it with no wall in the way, reaching parity with idiomatic C.
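
The headline ratios follow directly from the three wall times in the table (54, 599 and 53 ms). A minimal sketch of that arithmetic; the function name `parity_metrics` and its dictionary keys are hypothetical, chosen only for illustration:

```python
def parity_metrics(ref_ms: float, before_ms: float, after_ms: float) -> dict:
    """Derive the speedup and parity-gap figures from best-of-N wall times."""
    return {
        # How much faster the inline-literal arm is than the boxed arm.
        "speedup": before_ms / after_ms,
        # Share of the boxed arm's wall time that disappears.
        "wall_drop_pct": (1 - after_ms / before_ms) * 100,
        # Parity gap = arm wall time relative to the reference C build.
        "gap_before": before_ms / ref_ms,
        "gap_after": after_ms / ref_ms,
    }


if __name__ == "__main__":
    metrics = parity_metrics(ref_ms=54, before_ms=599, after_ms=53)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

A parity gap below 1.0 only means the AFTER arm's best run was 1 ms faster than the reference's best run, which the text above treats as noise.
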

**Honest caveats**:
- The measurement is a **faithful C A/B proxy**: the two arms exactly mirror the call-site
  emit of each codegen variant (same runtime.o, clang -O2). It is **not** a full self-host
  transpiler rebuild, because of the **B9 build wall** (origin/main HEAD has no consistent
  generated-.c set, and the installed tree has a runtime.h ABI skew, so the
  `hexa cc --regen` merge forward-decl bug and module link skew block a canonical rebuild;
  memory `reference_b9_generated_c_no_checkout_shortcut`). The proxy is sound because
  byte-equivalence is **proven from the runtime source** (`runtime_core_emit.hexa:1371`,
  `hexa_int(n)={.tag=TAG_INT,.i=n}`) and only one variable is changed.
- The absolute gap-closure figure applies to workloads with a **high known-int ratio**.
  Cases where `_is_known_int` does not fire (mutable accumulators, e.g. `let mut a; a=b`
  in `fib_heavy`) are not covered by this pilot; mut-accumulator unboxing (carrying a raw
  `int64_t`) is a separate follow-up.
- Codegen edit verification: `self/codegen.hexa` parses cleanly (`hexa parse` OK), and the
  C string emitted by the edited line matches the AFTER arm form
  (`((HexaVal){.tag=TAG_INT,.i=(HX_INT(l) op HX_INT(r))})`) exactly (verified by
  construction). Reproducer = `bench/unshadow/knownint_heavy.hexa` ·
  verdict = `.verdicts/unshadow-hexaval-unbox/pilot.txt`.

## §same-tu: making the C-emit same-TU build the default, cost/benefit PILOT measurements

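> Measurements for the milestone "🔵×🟡 same-TU build as default". **Summary**: if same-TU
> (`#include "runtime.c"`), which §lto-unwall validated, becomes the build recipe of the
> C-emit build path, this pilot measures (1) what it takes (the recipe change), (2) the
> BENEFIT (opening #2-ext-class boundary calls to cross-layer optimization across the
> board), and (3) the COST (build time, binary size), and then gives an honest
> default / opt-in / no recommendation.
> SSOT tool = `tool/unshadow_same_tu_bench.hexa` · verdict = `.verdicts/unshadow-same-tu/`.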

### Implemented same-TU build MODE (self/main.hexa cmd_build · GATED HEXA_SAME_TU=1)

Two paired edits, reversible and opt-in (not a forced flip of the global default):

1. **Codegen half**: when HEXA_SAME_TU=1, the transpile step runs with
   `HEXA_USE_RUNTIME_C=1` (the existing codegen.hexa:947 escape hatch), so user.c emits
   `#include "runtime.c"` and the runtime amalgam enters the user TU.
2. **Link half**: when HEXA_SAME_TU=1, the separate runtime object/source 2nd TU is dropped
   from the final clang call (`_rt_input = ""`). The runtime is already textually present,
   so a single TU compiles. (Adding it again as a 2nd TU would duplicate every symbol and
   cause a link error.)

Guarded by `shared != "1" && len(target) == 0` (excludes `--shared` PIC and cross-target
zig builds). With HEXA_SAME_TU unset, the build is byte-for-byte the legacy walled build.
main.hexa parse-gate: PASS.
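
The gating described above can be summarized as a small decision function. This is only an illustrative sketch in Python, not the `cmd_build` source (which is hexa and not shown here): `same_tu_enabled`, `build_plan`, and the placeholder string for the legacy runtime input are all hypothetical names.

```python
# Hypothetical placeholder for whatever the legacy path resolves
# (prebuilt object, content-hash .o, or runtime source).
LEGACY_RT_INPUT = "<legacy runtime object or source>"


def same_tu_enabled(env: dict, shared: str, target: str) -> bool:
    """Mirror of the documented guard: opt-in flag set, not --shared, no cross target."""
    return env.get("HEXA_SAME_TU") == "1" and shared != "1" and len(target) == 0


def build_plan(env: dict, shared: str = "", target: str = "") -> dict:
    """Return the two coordinated settings: the codegen half and the link half."""
    if same_tu_enabled(env, shared, target):
        return {
            # Codegen half: user.c includes runtime.c instead of runtime.h.
            "transpile_env": {"HEXA_USE_RUNTIME_C": "1"},
            # Link half: no second TU, the runtime is already in user.c.
            "rt_input": "",
        }
    # Flag unset or guard not met: the legacy walled build is left untouched.
    return {"transpile_env": {}, "rt_input": LEGACY_RT_INPUT}


if __name__ == "__main__":
    print(build_plan({"HEXA_SAME_TU": "1"}))
    print(build_plan({"HEXA_SAME_TU": "1"}, shared="1"))
```

Keeping both halves behind one predicate reflects the point made above: applying only one of them would either leave the wall in place or duplicate every runtime symbol at link time.
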

### Measurement method (honest A/B proxy, no full self-host rebuild)

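The full `hexa cc --regen` self-host rebuild is blocked by the **B9 wall** (runtime.c is
GENERATED and absent from a fresh clone; the same blocker the earlier unwall agent hit).
The milestone spec explicitly permits a faithful proxy: two build modes · **same runtime
source** · only the TU/link strategy isolated.
- The workload is transpiled by the INSTALLED hexat (emits `#include "runtime.h"`).
- **WALLED**: user.c + link a separate precompiled runtime object (2 TU); the live default.
- **SAME-TU**: textually swap runtime.h→runtime.c in user.c (the same transform the codegen
  half performs), then compile as a single TU.
- The runtime source is the pre-graduation tree (commit 151c52c8, just before B9
  graduation), extracted with `git archive | tar -x` (faithful because the emitter
  reproduces it byte-identically, per the B9.C-10 source-SHA gate). Both arms compile
  against the same runtime.c, so the TU boundary is the only variable.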
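
The SAME-TU arm's textual swap is small enough to sketch. The version below is an assumption about how such a swap could be done in Python, not the code in `tool/unshadow_same_tu_bench.hexa`; `to_same_tu` is a hypothetical name, and it assumes the transpiled user.c contains exactly one `#include "runtime.h"` line.

```python
import re

# Matches an include line for runtime.h, keeping the `#include ` prefix in group 1.
_RUNTIME_H = re.compile(r'^([ \t]*#[ \t]*include[ \t]+)"runtime\.h"', re.MULTILINE)


def to_same_tu(user_c: str) -> str:
    """Rewrite `#include "runtime.h"` to `#include "runtime.c"` in emitted C source."""
    swapped, count = _RUNTIME_H.subn(r'\1"runtime.c"', user_c)
    if count != 1:
        # Refuse to guess: zero or several matches means the input is not
        # the single-include shape this sketch assumes.
        raise ValueError(f'expected exactly one runtime.h include, found {count}')
    return swapped


if __name__ == "__main__":
    walled = '#include "runtime.h"\nint main(void) { return 0; }\n'
    print(to_same_tu(walled), end="")
```

After the swap the file is compiled on its own, with no runtime object on the link line, which is what makes the TU boundary the only difference between the two arms.
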

### Measurement: `mini` (macOS arm64) · best-of-5 wall · 2026-05-30

| workload | g5 (md5) | walled wall | same-TU wall | Δ | walled size | same-TU size |
|---|---|---|---|---|---|---|
| string-boundary | IDENTICAL `0e2afa85…` | 1.87s | **1.48s** | **−21%** | 409080 B | 408888 B |
| HexaVal-arith (control) | IDENTICAL `657d1ec4…` | 0.58s | **0.44s** | **−24%** | 408728 B | 408552 B |

**Build time (best-of-3):**

| Build mode | Build time | Notes |
|---|---|---|
| walled COLD (compile runtime.o + link) | 3.53s | first-ever build |
| **walled WARM (cached runtime.o · link only)** | **0.10s** | **live default hot path** |
| **same-TU (amalgam recompiled on every build)** | **3.55s** | caching runtime.o is structurally impossible |

**`_u_main` hot-function boundary `bl` histogram (string workload):**

| bl target | walled | same-TU | Notes |
|---|---|---|---|
| `_rt_str_starts_with` | 2 | **0** | inlined → `_hxlcl_strncmp`×2 + `_hxlcl_strlen`×2 |
| `_hexa_contains_poly` | 1 | **0** | inlined → `_hxlcl_strstr`×1 |
| `_hexa_int` | 12 | **0** | all integer boxing helpers inlined |
| `_hexa_to_string` | 1 | **0** | inlined → `__hexa_to_string_rec` |
| `_hexa_bool` | 2 | **0** | inlined |
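
A histogram like this can be produced by counting `bl` targets in the textual assembly. The following is a minimal sketch under stated assumptions: the input is arm64 assembly text with one instruction per line (for example from `clang -S`), the function name `bl_histogram` is hypothetical, and the sample input is synthetic rather than taken from the measured binaries. Unlike the table, it counts over whatever text it is given, so restricting it to `_u_main` means passing only that function's body.

```python
import re
from collections import Counter

# `bl` followed by whitespace and a symbol; `blr`, `b.ne` and plain `b` do not match.
_BL = re.compile(r"^\s*bl\s+(\S+)", re.MULTILINE)


def bl_histogram(asm_text: str) -> Counter:
    """Count direct branch-and-link targets in arm64 assembly text."""
    return Counter(_BL.findall(asm_text))


if __name__ == "__main__":
    # Synthetic sample, for illustration only.
    sample = (
        "\tbl\t_hexa_int\n"
        "\tadd\tx8, x8, #1\n"
        "\tbl\t_hexa_int\n"
        "\tblr\tx9\n"
        "\tbl\t_printf\n"
    )
    for target, count in bl_histogram(sample).most_common():
        print(count, target)
```

Comparing the two arms' counters per target gives the called-versus-inlined view shown in the table.
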

### Honest interpretation: BENEFIT real and general / COST too high for a default

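- **BENEFIT (real, generalizes):** same-TU exposes the whole HexaVal/runtime ABI to the
  clang -O2 inliner across the former TU boundary. The #2-ext-class boundary calls
  `_rt_str_starts_with` (2→0) and `_hexa_contains_poly` (1→0) flip from called to inlined,
  as §lto-unwall predicted. Crucially, the win is **not string-specific**: the HexaVal-arith
  control (`_hexa_int` boxing) also wins, by −24% (0.58→0.44s). The hexa_int /
  hexa_to_string / hexa_bool boxing helpers are themselves runtime boundary calls, so
  same-TU inlines all of them. g5 is byte-IDENTICAL on both workloads.
- **COST (too high for a default):** same-TU **recompiles the ~14.6K-line runtime amalgam
  into every user TU**: 3.55s/build vs 0.10s for walled WARM = **~35× build-time tax**. The
  walled model amortizes the one-time 3.53s runtime compile through the content-hash
  `runtime.<sha>.o` cache; under same-TU the runtime is fused into the user TU, so it
  **structurally cannot be cached**. Binary size is a wash (−0.05% · −192 B on the string
  workload, −176 B on the arith control). A second structural cost: default-on same-TU
  requires runtime.c on disk, which re-introduces the generated-.c dependency that B9
  graduation removed.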
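
To make the tradeoff concrete, the build-time tax can be set against the per-run saving. The sketch below uses only the measured figures above; the break-even framing (how many executions of the benchmark workload per build would repay the extra build time) is an illustrative addition rather than something the pilot measured, and the function names are hypothetical.

```python
import math


def build_tax(same_tu_build_s: float, walled_warm_build_s: float) -> float:
    """Ratio of same-TU build time to the walled WARM (cached runtime.o) build time."""
    return same_tu_build_s / walled_warm_build_s


def runs_to_break_even(same_tu_build_s: float, walled_warm_build_s: float,
                       walled_run_s: float, same_tu_run_s: float) -> int:
    """Executions per build needed before the runtime saving repays the extra build time."""
    extra_build = same_tu_build_s - walled_warm_build_s
    saved_per_run = walled_run_s - same_tu_run_s
    if saved_per_run <= 0:
        raise ValueError("same-TU is not faster at run time; it never breaks even")
    return math.ceil(extra_build / saved_per_run)


if __name__ == "__main__":
    print(f"build tax: {build_tax(3.55, 0.10):.1f}x")
    print("string-boundary:", runs_to_break_even(3.55, 0.10, 1.87, 1.48), "runs")
    print("HexaVal-arith:  ", runs_to_break_even(3.55, 0.10, 0.58, 0.44), "runs")
```

Under these numbers a build that is executed only once or twice, as in an edit-compile-run loop, pays far more in compile time than it recovers, while a binary that is built once and run many times recovers the cost quickly.
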

### Recommendation (honest)

**OPT-IN FLAG (HEXA_SAME_TU=1) · NOT default-on.** The −21~24% byte-identical runtime win
is real and generalizes, but the ~35× build-time tax (3.55s vs 0.10s WARM) plus the
re-introduced generated-runtime.c dependency make default-on a poor tradeoff for ordinary
builds. Same-TU is worth it for **release/perf builds** of HexaVal-/boundary-call-heavy
programs, which is exactly the opt-in surface this pilot landed.
Terminal measured recommendation = opt-in flag.

> caveat: single host (mini), single session · wall = best-of-5 real min · both arms use
> the same runtime.c source (the walled .o is compiled from it too), run back-to-back ·
> runtime source = B9-faithful pre-graduation tree (byte-identical to the emitter SSOT) ·
> the full self-host rebuild is blocked by the B9 wall, so an A/B proxy is used (permitted
> by the spec) · zero `.c` files kept inside the repo (the tree lives outside, under /tmp).
> Reproducer = `tool/unshadow_same_tu_bench.hexa --rt <self-with-runtime.c> --runs 5`.