Skip to content

Releases: dmriding/kaio

v0.5.0

12 Jun 18:36
dabd57d

Choose a tag to compare

v0.5.0 — FlashAttention Backward, bf16 Tensor Cores, ldmatrix

Phase 9 aggregate release. Workspace crates (kaio, kaio-core,
kaio-macros, kaio-ops, kaio-runtime) move to 0.5.0;
kaio-candle moves to 0.2.0.

Highlights

FlashAttention backwardattention_flash_bwd /
attention_flash_bwd_causal plus _with_stats forward variants that
export the per-row softmax logsumexp. Three dedicated PTX kernels
(D-term preprocess, dK/dV, dQ) rebuild the softmax from the saved
logsumexp instead of materializing the O(seq²) probability matrix —
no atomics, 18–22 registers/thread on sm_89. Validated against a CPU
f64 analytical oracle that is itself self-checked against central
finite differences before judging any GPU output.

bf16 tensor-core matmul familymatmul_tc_bf16,
matmul_tc_bf16_async, and the matmul_auto_tc_bf16 auto-tuner.
Operand-dtype siblings of the f16 kernels via a dedicated
mma.sync.bf16 IR variant (no runtime conversion). Performance parity
with f16 verified by interleaved per-iteration A/B gates that ship
with the kernels; see docs/performance.md for the parity record.

candle bridge (kaio-candle 0.2.0) — twelve forward CustomOp
bindings; backward (autograd) for all four matmul TC variants
(f16 + bf16, sync + async) and FlashAttention (plain + causal),
f32 end-to-end on the flash path.

ldmatrix IR primitiveTensorCoreOp::LdMatrix
(ldmatrix.sync.aligned.m8n8, .x2/.x4, optional .trans),
min_sm = 75 — the first sub-Ampere tensor-core op in the IR. The
matmul_tc fragment-A loader built on it is proven bit-identical to
the shipped loader and measured at the bench noise floor at the
current tile layout, so it ships built-and-parked behind the proven
ld.shared default; an XOR-swizzled tile layout is the named
follow-up that makes it pay.

Also in this release

  • First-shipped previously-unreleased work: the kaio-py PyO3
    scaffold (in-repo, unpublished, demand-gated) and the ten-harness
    cargo xtask bench roster.
  • Hardening: the FlashAttention backward building-block helpers now
    validate dimensions and buffer lengths before launch, matching
    every other launch wrapper in kaio-ops.
  • CI: new no-CUDA check/doc leg for kaio-candle; weekly
    candle-HEAD compatibility workflow repaired (its CUDA toolkit
    install step had been failing before any candle code was checked).
  • Test coverage: 94.67% line coverage across the workspace,
    measured host + GPU suites merged on RTX 4090 (sm_89).

Install

kaio = "0.5"
kaio-ops = "0.5"
kaio-candle = { version = "0.2", features = ["cuda"] }  # candle bridge

NVIDIA driver only — no CUDA toolkit required to build or run
(cudarc dynamic-loading). Tensor-core ops need SM 8.0+; MSRV 1.94.

Full detail in CHANGELOG.md; measured numbers and
methodology in docs/performance.md.

What's Changed
Sprint 8.0.5 — bench coverage extension by @dmriding in #16
Sprint 8.1 — kaio-py PyO3 scaffold by @dmriding in #17
Phase 9 — FlashAttention backward, bf16 TC family, ldmatrix by @dmriding in #18
Full Changelog: v0.4.1...v0.5.0

v0.4.1

24 Apr 15:43
e8b7ae6

Choose a tag to compare

What's Changed

Full Changelog: v0.4.0...v0.4.1

v0.4.0

17 Apr 23:48
c5ed152

Choose a tag to compare

What's Changed

Full Changelog: v0.3.0...v0.4.0

v0.3.0

15 Apr 02:24
fa7dac0

Choose a tag to compare

update test coverage badge

v0.2.2

13 Apr 23:26
83eacb8

Choose a tag to compare

What's Changed

Full Changelog: v0.2.1...v0.2.2

v0.2.1

13 Apr 22:33
76fefb7

Choose a tag to compare

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0

13 Apr 09:54
866f788

Choose a tag to compare

What's Changed

Full Changelog: v0.1.0...v0.2.0

v0.1.0

11 Apr 22:57
bbc1c4d

Choose a tag to compare

Phase5 (#5)

* readme

* docs: Phase 5 master plan and sprint log

- phase5_master_plan.md: architecture decisions, 8-sprint breakdown,
  success criteria, risk analysis
- PHASE_5_LOG.md: sprint index with dependency graph
- Key blocker: 2D block reductions (Sprint 5.1 hard gate)
- FlashAttention is stretch goal, standard attention is the floor

* feat: 2D block reductions — linear tid for block_reduce_sum/max

Sprint 5.1: Hard gate for fused attention cleared.

- Compute linear_tid = tidx + tidy * block_dim_x for 2D kernels
  in lower_block_reduce() — all warp/lane/thread-0 logic flows
  from #tid automatically. 1D kernels unchanged (zero extra PTX).
- Remove 2D rejection guard in builtins.rs
- Delete cf11 compile-fail test (2D+reduce now works)
- Add 4 GPU tests: 16x16 sum, 16x16 max, 32x8 asymmetric,
  identity-based (tidx*100+tidy catches row aliasing)
- Zero regression: all 1D reduction, softmax, shared_mem tests pass
- Update tech_debt.md, phases.md (Phase 5 actual plan + roadmap)

* docs: Phase 4 retrospective cleanup, success criteria alignment

Codex review identified drift in planning docs:

- phase4_master_plan.md: status -> Complete, launch model updated
  to grid tuple (not LaunchConfig), perf target updated to actual
  31% result, Sprint 4.9 added
- success-criteria.md: Phase 4 criteria updated to match what
  shipped (shared_mem! not block_load, 31% not 60%), rescoped
  items documented explicitly. Phase 5 criteria updated to match
  actual plan (CPU reference not PyTorch). Publication checklist
  updated to current state.

* gitignore

* feat: standard attention — softmax(Q*K^T/sqrt(d_k)) * V

Sprint 5.2: correctness baseline for FlashAttention.

- Three-kernel decomposition: qk_scaled_matmul (naive 16x16 with
  transposed K indexing), row_softmax (Phase 3 pattern), reused
  matmul() for P*V
- kaio_ops::attention() host API with validation
- 7 GPU tests: tiny/16x16/non-aligned/medium/identity/zero-dim
- Softmax grid fix: block_size=(256,1) for explicit grid control
- DSL friction documented: no &&/||, 1D grid inference, no sqrt()

* feat: causal masking + DSL friction report

Sprint 5.3: causal mask for autoregressive attention.

- apply_causal_mask kernel: S[i,j] = -FLT_MAX where j > i
- attention_causal() host API (unmasked attention() unchanged)
- 6 GPU tests: 4 sizes, row0-self-only, direct mask verification
  (V=identity -> out=P, verify lower-triangular weights sum to 1)
- DSL friction report: 5 documented friction points from 5.2-5.3
  (no &&/||, 1D grid inference, no sqrt(), no compound shared
  assign, no -inf literal)

* feat: FlashAttention — online softmax, O(d_k) memory, no materialization

Sprint 5.4: the stretch goal landed.

- BLOCK_M=1 design: one query per block, 256 threads. Avoids
  per-row reduction limitation. Online softmax with running
  (m, l, O) state rescaled per K/V tile.
- flash_attn_kernel + flash_attn_causal_kernel (block_size=(256,1))
- attention_flash() + attention_flash_causal() host APIs
- d_k <= 256 runtime validation
- 9 GPU tests: 4 sizes, causal, flash_matches_standard,
  causal_first_rows, single-tile, partial last tile (seq=257)
- Standard attention unchanged as correctness oracle
- All 22 attention tests pass, zero regression

* fix: flash_attention_tiny now tests flash path, add d_k>256 guard tests

Codex review findings:
- flash_attention_tiny was calling check_attention() (standard path)
  instead of check_flash_vs_standard() — fixed
- Added flash_rejects_dk_over_256 and flash_causal_rejects_dk_over_256
  regression tests for the d_k <= 256 safety guard

* feat: auto-tuner — benchmark variants, JSON cache, auto-dispatch

Sprint 5.5: benchmark-backed dispatch for matmul and attention.

- tune_matmul() / tune_attention() benchmark variants, cache as JSON
- matmul_auto() / attention_auto() dispatch from cache, fall back
  to default. Pure dispatch — no side effects.
- Internal enums (MatmulVariant, AttentionVariant) for exhaustive
  match, strings only for serialization
- Cache: versioned JSON, SM target in key, causal/non-causal
  separated, duplicate overwrite, corrupt-file recovery
- Variant filtering: flash skipped when d_k > 256
- 7 tests: tune, auto dispatch, cache roundtrip, fallback, d_k guard

* cargo.lock

* ci: Windows + Linux matrix, doc build job

Sprint 5.6: CI/CD platform coverage.

- Test matrix: ubuntu-latest + windows-latest
- New doc job: cargo doc --no-deps --workspace
- Fix rustdoc broken link warning (S[i,j] escape)
- Verified passing on both Windows and WSL

* chore: v0.1.0 prep — version bump, CHANGELOG, README, publish dry-run

Sprint 5.7: packaging for v0.1.0 release.

- Version: 0.0.4 -> 0.1.0 (workspace + all inter-crate deps)
- CHANGELOG: Phase 5 section (attention, FlashAttention, auto-tuner,
  Windows CI, 2D reductions, DSL friction report)
- README: feature table updated (attention, flash, auto-tuner),
  Phase 5 checked in roadmap, limitations updated
- kaio/README.md: status -> Phase 5 complete
- kaio-ops/README.md: attention + auto-tuner sections
- docs/phases.md: Phase 5 status -> Complete
- Publish dry-run: kaio-core + kaio-macros pass clean

v0.0.4

11 Apr 21:43
d359b0c

Choose a tag to compare

v0.0.4 Pre-release
Pre-release
docs: adoption polish — examples, README rewrite, patterns, limitations

Sprint 4.9: Make KAIO frictionless for first-time users.

- Add 4 runnable examples: vector_add, saxpy, reduction, matmul
- README: add "When to Use KAIO", comparison table, examples table,
  3 copy-paste patterns, limitations, gotchas, feedback CTA
- README: quickstart now shows printed output + expected terminal result
- kaio/README.md: expanded crates.io landing page (<150 lines)
- reduction example uses real array data, not synthetic constants

v0.0.3-beta — Phase 3 Complete

11 Apr 01:07
cda6031

Choose a tag to compare

Pre-release

KAIO v0.0.3-beta — Loops, Reductions & Softmax

Early development release. Name reservation + Phase 3 milestone.

What's in this release

  • #[gpu_kernel] proc macro — write GPU kernels in Rust syntax, compile to PTX at build time
  • Arithmetic, comparisons, if/else, for/while loops
  • Shared memory (shared_mem![f32; 256]), barrier sync, warp shuffle
  • Block-level reductions (block_reduce_sum, block_reduce_max)
  • Softmax kernel validated on RTX 4090 (< 1e-5 error vs CPU reference)
  • 19 built-in functions: thread indices, math (sqrt, exp, log, tanh, sin, cos)
  • Cross-platform: Windows + Linux
  • 200 host tests + 24 GPU tests, zero clippy warnings

Not ready for production use

This is a work-in-progress. API will change. Missing features include tiled matmul (Phase 4) and fused attention (Phase 5). See roadmap.

Crates