Releases: dmriding/kaio
v0.5.0
v0.5.0 — FlashAttention Backward, bf16 Tensor Cores, ldmatrix
Phase 9 aggregate release. Workspace crates (kaio, kaio-core,
kaio-macros, kaio-ops, kaio-runtime) move to 0.5.0;
kaio-candle moves to 0.2.0.
Highlights
FlashAttention backward — attention_flash_bwd /
attention_flash_bwd_causal plus _with_stats forward variants that
export the per-row softmax logsumexp. Three dedicated PTX kernels
(D-term preprocess, dK/dV, dQ) rebuild the softmax from the saved
logsumexp instead of materializing the O(seq²) probability matrix —
no atomics, 18–22 registers/thread on sm_89. Validated against a CPU
f64 analytical oracle that is itself self-checked against central
finite differences before judging any GPU output.
bf16 tensor-core matmul family — matmul_tc_bf16,
matmul_tc_bf16_async, and the matmul_auto_tc_bf16 auto-tuner.
Operand-dtype siblings of the f16 kernels via a dedicated
mma.sync.bf16 IR variant (no runtime conversion). Performance parity
with f16 verified by interleaved per-iteration A/B gates that ship
with the kernels; see docs/performance.md for the parity record.
candle bridge (kaio-candle 0.2.0) — twelve forward CustomOp
bindings; backward (autograd) for all four matmul TC variants
(f16 + bf16, sync + async) and FlashAttention (plain + causal),
f32 end-to-end on the flash path.
ldmatrix IR primitive — TensorCoreOp::LdMatrix
(ldmatrix.sync.aligned.m8n8, .x2/.x4, optional .trans),
min_sm = 75 — the first sub-Ampere tensor-core op in the IR. The
matmul_tc fragment-A loader built on it is proven bit-identical to
the shipped loader and measured at the bench noise floor at the
current tile layout, so it ships built-and-parked behind the proven
ld.shared default; an XOR-swizzled tile layout is the named
follow-up that makes it pay.
Also in this release
- First-shipped previously-unreleased work: the
kaio-pyPyO3
scaffold (in-repo, unpublished, demand-gated) and the ten-harness
cargo xtask benchroster. - Hardening: the FlashAttention backward building-block helpers now
validate dimensions and buffer lengths before launch, matching
every other launch wrapper in kaio-ops. - CI: new no-CUDA check/doc leg for kaio-candle; weekly
candle-HEAD compatibility workflow repaired (its CUDA toolkit
install step had been failing before any candle code was checked). - Test coverage: 94.67% line coverage across the workspace,
measured host + GPU suites merged on RTX 4090 (sm_89).
Install
kaio = "0.5"
kaio-ops = "0.5"
kaio-candle = { version = "0.2", features = ["cuda"] } # candle bridgeNVIDIA driver only — no CUDA toolkit required to build or run
(cudarc dynamic-loading). Tensor-core ops need SM 8.0+; MSRV 1.94.
Full detail in CHANGELOG.md; measured numbers and
methodology in docs/performance.md.
What's Changed
Sprint 8.0.5 — bench coverage extension by @dmriding in #16
Sprint 8.1 — kaio-py PyO3 scaffold by @dmriding in #17
Phase 9 — FlashAttention backward, bf16 TC family, ldmatrix by @dmriding in #18
Full Changelog: v0.4.1...v0.5.0
v0.4.1
v0.4.0
v0.3.0
v0.2.2
v0.2.1
v0.2.0
v0.1.0
Phase5 (#5) * readme * docs: Phase 5 master plan and sprint log - phase5_master_plan.md: architecture decisions, 8-sprint breakdown, success criteria, risk analysis - PHASE_5_LOG.md: sprint index with dependency graph - Key blocker: 2D block reductions (Sprint 5.1 hard gate) - FlashAttention is stretch goal, standard attention is the floor * feat: 2D block reductions — linear tid for block_reduce_sum/max Sprint 5.1: Hard gate for fused attention cleared. - Compute linear_tid = tidx + tidy * block_dim_x for 2D kernels in lower_block_reduce() — all warp/lane/thread-0 logic flows from #tid automatically. 1D kernels unchanged (zero extra PTX). - Remove 2D rejection guard in builtins.rs - Delete cf11 compile-fail test (2D+reduce now works) - Add 4 GPU tests: 16x16 sum, 16x16 max, 32x8 asymmetric, identity-based (tidx*100+tidy catches row aliasing) - Zero regression: all 1D reduction, softmax, shared_mem tests pass - Update tech_debt.md, phases.md (Phase 5 actual plan + roadmap) * docs: Phase 4 retrospective cleanup, success criteria alignment Codex review identified drift in planning docs: - phase4_master_plan.md: status -> Complete, launch model updated to grid tuple (not LaunchConfig), perf target updated to actual 31% result, Sprint 4.9 added - success-criteria.md: Phase 4 criteria updated to match what shipped (shared_mem! not block_load, 31% not 60%), rescoped items documented explicitly. Phase 5 criteria updated to match actual plan (CPU reference not PyTorch). Publication checklist updated to current state. * gitignore * feat: standard attention — softmax(Q*K^T/sqrt(d_k)) * V Sprint 5.2: correctness baseline for FlashAttention. - Three-kernel decomposition: qk_scaled_matmul (naive 16x16 with transposed K indexing), row_softmax (Phase 3 pattern), reused matmul() for P*V - kaio_ops::attention() host API with validation - 7 GPU tests: tiny/16x16/non-aligned/medium/identity/zero-dim - Softmax grid fix: block_size=(256,1) for explicit grid control - DSL friction documented: no &&/||, 1D grid inference, no sqrt() * feat: causal masking + DSL friction report Sprint 5.3: causal mask for autoregressive attention. - apply_causal_mask kernel: S[i,j] = -FLT_MAX where j > i - attention_causal() host API (unmasked attention() unchanged) - 6 GPU tests: 4 sizes, row0-self-only, direct mask verification (V=identity -> out=P, verify lower-triangular weights sum to 1) - DSL friction report: 5 documented friction points from 5.2-5.3 (no &&/||, 1D grid inference, no sqrt(), no compound shared assign, no -inf literal) * feat: FlashAttention — online softmax, O(d_k) memory, no materialization Sprint 5.4: the stretch goal landed. - BLOCK_M=1 design: one query per block, 256 threads. Avoids per-row reduction limitation. Online softmax with running (m, l, O) state rescaled per K/V tile. - flash_attn_kernel + flash_attn_causal_kernel (block_size=(256,1)) - attention_flash() + attention_flash_causal() host APIs - d_k <= 256 runtime validation - 9 GPU tests: 4 sizes, causal, flash_matches_standard, causal_first_rows, single-tile, partial last tile (seq=257) - Standard attention unchanged as correctness oracle - All 22 attention tests pass, zero regression * fix: flash_attention_tiny now tests flash path, add d_k>256 guard tests Codex review findings: - flash_attention_tiny was calling check_attention() (standard path) instead of check_flash_vs_standard() — fixed - Added flash_rejects_dk_over_256 and flash_causal_rejects_dk_over_256 regression tests for the d_k <= 256 safety guard * feat: auto-tuner — benchmark variants, JSON cache, auto-dispatch Sprint 5.5: benchmark-backed dispatch for matmul and attention. - tune_matmul() / tune_attention() benchmark variants, cache as JSON - matmul_auto() / attention_auto() dispatch from cache, fall back to default. Pure dispatch — no side effects. - Internal enums (MatmulVariant, AttentionVariant) for exhaustive match, strings only for serialization - Cache: versioned JSON, SM target in key, causal/non-causal separated, duplicate overwrite, corrupt-file recovery - Variant filtering: flash skipped when d_k > 256 - 7 tests: tune, auto dispatch, cache roundtrip, fallback, d_k guard * cargo.lock * ci: Windows + Linux matrix, doc build job Sprint 5.6: CI/CD platform coverage. - Test matrix: ubuntu-latest + windows-latest - New doc job: cargo doc --no-deps --workspace - Fix rustdoc broken link warning (S[i,j] escape) - Verified passing on both Windows and WSL * chore: v0.1.0 prep — version bump, CHANGELOG, README, publish dry-run Sprint 5.7: packaging for v0.1.0 release. - Version: 0.0.4 -> 0.1.0 (workspace + all inter-crate deps) - CHANGELOG: Phase 5 section (attention, FlashAttention, auto-tuner, Windows CI, 2D reductions, DSL friction report) - README: feature table updated (attention, flash, auto-tuner), Phase 5 checked in roadmap, limitations updated - kaio/README.md: status -> Phase 5 complete - kaio-ops/README.md: attention + auto-tuner sections - docs/phases.md: Phase 5 status -> Complete - Publish dry-run: kaio-core + kaio-macros pass clean
v0.0.4
docs: adoption polish — examples, README rewrite, patterns, limitations Sprint 4.9: Make KAIO frictionless for first-time users. - Add 4 runnable examples: vector_add, saxpy, reduction, matmul - README: add "When to Use KAIO", comparison table, examples table, 3 copy-paste patterns, limitations, gotchas, feedback CTA - README: quickstart now shows printed output + expected terminal result - kaio/README.md: expanded crates.io landing page (<150 lines) - reduction example uses real array data, not synthetic constants
v0.0.3-beta — Phase 3 Complete
KAIO v0.0.3-beta — Loops, Reductions & Softmax
Early development release. Name reservation + Phase 3 milestone.
What's in this release
#[gpu_kernel]proc macro — write GPU kernels in Rust syntax, compile to PTX at build time- Arithmetic, comparisons,
if/else,for/whileloops - Shared memory (
shared_mem![f32; 256]), barrier sync, warp shuffle - Block-level reductions (
block_reduce_sum,block_reduce_max) - Softmax kernel validated on RTX 4090 (< 1e-5 error vs CPU reference)
- 19 built-in functions: thread indices, math (
sqrt,exp,log,tanh,sin,cos) - Cross-platform: Windows + Linux
- 200 host tests + 24 GPU tests, zero clippy warnings
Not ready for production use
This is a work-in-progress. API will change. Missing features include tiled matmul (Phase 4) and fused attention (Phase 5). See roadmap.
Crates
kaiov0.0.3kaio-corev0.0.3kaio-macrosv0.0.3kaio-runtimev0.0.3