12 Jun 18:36

dmriding

dabd57d

v0.5.0 Latest

Latest

v0.5.0 — FlashAttention Backward, bf16 Tensor Cores, ldmatrix

Phase 9 aggregate release. Workspace crates (kaio, kaio-core,
kaio-macros, kaio-ops, kaio-runtime) move to 0.5.0;
kaio-candle moves to 0.2.0.

Highlights

FlashAttention backward — attention_flash_bwd /
attention_flash_bwd_causal plus _with_stats forward variants that
export the per-row softmax logsumexp. Three dedicated PTX kernels
(D-term preprocess, dK/dV, dQ) rebuild the softmax from the saved
logsumexp instead of materializing the O(seq²) probability matrix —
no atomics, 18–22 registers/thread on sm_89. Validated against a CPU
f64 analytical oracle that is itself self-checked against central
finite differences before judging any GPU output.

bf16 tensor-core matmul family — matmul_tc_bf16,
matmul_tc_bf16_async, and the matmul_auto_tc_bf16 auto-tuner.
Operand-dtype siblings of the f16 kernels via a dedicated
mma.sync.bf16 IR variant (no runtime conversion). Performance parity
with f16 verified by interleaved per-iteration A/B gates that ship
with the kernels; see docs/performance.md for the parity record.

candle bridge (kaio-candle 0.2.0) — twelve forward CustomOp
bindings; backward (autograd) for all four matmul TC variants
(f16 + bf16, sync + async) and FlashAttention (plain + causal),
f32 end-to-end on the flash path.

ldmatrix IR primitive — TensorCoreOp::LdMatrix
(ldmatrix.sync.aligned.m8n8, .x2/.x4, optional .trans),
min_sm = 75 — the first sub-Ampere tensor-core op in the IR. The
matmul_tc fragment-A loader built on it is proven bit-identical to
the shipped loader and measured at the bench noise floor at the
current tile layout, so it ships built-and-parked behind the proven
ld.shared default; an XOR-swizzled tile layout is the named
follow-up that makes it pay.

Also in this release

First-shipped previously-unreleased work: the kaio-py PyO3
scaffold (in-repo, unpublished, demand-gated) and the ten-harness
cargo xtask bench roster.
Hardening: the FlashAttention backward building-block helpers now
validate dimensions and buffer lengths before launch, matching
every other launch wrapper in kaio-ops.
CI: new no-CUDA check/doc leg for kaio-candle; weekly
candle-HEAD compatibility workflow repaired (its CUDA toolkit
install step had been failing before any candle code was checked).
Test coverage: 94.67% line coverage across the workspace,
measured host + GPU suites merged on RTX 4090 (sm_89).

Install

kaio = "0.5"
kaio-ops = "0.5"
kaio-candle = { version = "0.2", features = ["cuda"] }  # candle bridge

NVIDIA driver only — no CUDA toolkit required to build or run
(cudarc dynamic-loading). Tensor-core ops need SM 8.0+; MSRV 1.94.

Full detail in CHANGELOG.md; measured numbers and
methodology in docs/performance.md.

What's Changed
Sprint 8.0.5 — bench coverage extension by @dmriding in #16
Sprint 8.1 — kaio-py PyO3 scaffold by @dmriding in #17
Phase 9 — FlashAttention backward, bf16 TC family, ldmatrix by @dmriding in #18
Full Changelog: v0.4.1...v0.5.0

Contributors

dmriding

Assets 2

24 Apr 15:43

dmriding

v0.4.1

e8b7ae6

v0.4.1

What's Changed

Phase8 by @dmriding in #15

Full Changelog: v0.4.0...v0.4.1

Contributors

dmriding

Assets 2

17 Apr 23:48

dmriding

v0.4.0

c5ed152

v0.4.0

What's Changed

Phase7 wrap by @dmriding in #14

Full Changelog: v0.3.0...v0.4.0

Contributors

dmriding

Assets 2

15 Apr 02:24

dmriding

v0.3.0

fa7dac0

v0.3.0

update test coverage badge

Assets 2

13 Apr 23:26

dmriding

v0.2.2

83eacb8

v0.2.2

What's Changed

Sprint 7 0 5 by @dmriding in #9

Full Changelog: v0.2.1...v0.2.2

Contributors

dmriding

Assets 2

13 Apr 22:33

dmriding

v0.2.1

76fefb7

v0.2.1

What's Changed

Cleanup/open threads by @dmriding in #7
Phase7 by @dmriding in #8

Full Changelog: v0.2.0...v0.2.1

Contributors

dmriding

Assets 2

13 Apr 09:54

dmriding

v0.2.0

866f788

v0.2.0

What's Changed

Phase6 by @dmriding in #6

Full Changelog: v0.1.0...v0.2.0

Contributors

dmriding

Assets 2

11 Apr 22:57

dmriding

v0.1.0

bbc1c4d

v0.1.0

Phase5 (#5)

* readme

* docs: Phase 5 master plan and sprint log

- phase5_master_plan.md: architecture decisions, 8-sprint breakdown,
  success criteria, risk analysis
- PHASE_5_LOG.md: sprint index with dependency graph
- Key blocker: 2D block reductions (Sprint 5.1 hard gate)
- FlashAttention is stretch goal, standard attention is the floor

* feat: 2D block reductions — linear tid for block_reduce_sum/max

Sprint 5.1: Hard gate for fused attention cleared.

- Compute linear_tid = tidx + tidy * block_dim_x for 2D kernels
  in lower_block_reduce() — all warp/lane/thread-0 logic flows
  from #tid automatically. 1D kernels unchanged (zero extra PTX).
- Remove 2D rejection guard in builtins.rs
- Delete cf11 compile-fail test (2D+reduce now works)
- Add 4 GPU tests: 16x16 sum, 16x16 max, 32x8 asymmetric,
  identity-based (tidx*100+tidy catches row aliasing)
- Zero regression: all 1D reduction, softmax, shared_mem tests pass
- Update tech_debt.md, phases.md (Phase 5 actual plan + roadmap)

* docs: Phase 4 retrospective cleanup, success criteria alignment

Codex review identified drift in planning docs:

- phase4_master_plan.md: status -> Complete, launch model updated
  to grid tuple (not LaunchConfig), perf target updated to actual
  31% result, Sprint 4.9 added
- success-criteria.md: Phase 4 criteria updated to match what
  shipped (shared_mem! not block_load, 31% not 60%), rescoped
  items documented explicitly. Phase 5 criteria updated to match
  actual plan (CPU reference not PyTorch). Publication checklist
  updated to current state.

* gitignore

* feat: standard attention — softmax(Q*K^T/sqrt(d_k)) * V

Sprint 5.2: correctness baseline for FlashAttention.

- Three-kernel decomposition: qk_scaled_matmul (naive 16x16 with
  transposed K indexing), row_softmax (Phase 3 pattern), reused
  matmul() for P*V
- kaio_ops::attention() host API with validation
- 7 GPU tests: tiny/16x16/non-aligned/medium/identity/zero-dim
- Softmax grid fix: block_size=(256,1) for explicit grid control
- DSL friction documented: no &&/||, 1D grid inference, no sqrt()

* feat: causal masking + DSL friction report

Sprint 5.3: causal mask for autoregressive attention.

- apply_causal_mask kernel: S[i,j] = -FLT_MAX where j > i
- attention_causal() host API (unmasked attention() unchanged)
- 6 GPU tests: 4 sizes, row0-self-only, direct mask verification
  (V=identity -> out=P, verify lower-triangular weights sum to 1)
- DSL friction report: 5 documented friction points from 5.2-5.3
  (no &&/||, 1D grid inference, no sqrt(), no compound shared
  assign, no -inf literal)

* feat: FlashAttention — online softmax, O(d_k) memory, no materialization

Sprint 5.4: the stretch goal landed.

- BLOCK_M=1 design: one query per block, 256 threads. Avoids
  per-row reduction limitation. Online softmax with running
  (m, l, O) state rescaled per K/V tile.
- flash_attn_kernel + flash_attn_causal_kernel (block_size=(256,1))
- attention_flash() + attention_flash_causal() host APIs
- d_k <= 256 runtime validation
- 9 GPU tests: 4 sizes, causal, flash_matches_standard,
  causal_first_rows, single-tile, partial last tile (seq=257)
- Standard attention unchanged as correctness oracle
- All 22 attention tests pass, zero regression

* fix: flash_attention_tiny now tests flash path, add d_k>256 guard tests

Codex review findings:
- flash_attention_tiny was calling check_attention() (standard path)
  instead of check_flash_vs_standard() — fixed
- Added flash_rejects_dk_over_256 and flash_causal_rejects_dk_over_256
  regression tests for the d_k <= 256 safety guard

* feat: auto-tuner — benchmark variants, JSON cache, auto-dispatch

Sprint 5.5: benchmark-backed dispatch for matmul and attention.

- tune_matmul() / tune_attention() benchmark variants, cache as JSON
- matmul_auto() / attention_auto() dispatch from cache, fall back
  to default. Pure dispatch — no side effects.
- Internal enums (MatmulVariant, AttentionVariant) for exhaustive
  match, strings only for serialization
- Cache: versioned JSON, SM target in key, causal/non-causal
  separated, duplicate overwrite, corrupt-file recovery
- Variant filtering: flash skipped when d_k > 256
- 7 tests: tune, auto dispatch, cache roundtrip, fallback, d_k guard

* cargo.lock

* ci: Windows + Linux matrix, doc build job

Sprint 5.6: CI/CD platform coverage.

- Test matrix: ubuntu-latest + windows-latest
- New doc job: cargo doc --no-deps --workspace
- Fix rustdoc broken link warning (S[i,j] escape)
- Verified passing on both Windows and WSL

* chore: v0.1.0 prep — version bump, CHANGELOG, README, publish dry-run

Sprint 5.7: packaging for v0.1.0 release.

- Version: 0.0.4 -> 0.1.0 (workspace + all inter-crate deps)
- CHANGELOG: Phase 5 section (attention, FlashAttention, auto-tuner,
  Windows CI, 2D reductions, DSL friction report)
- README: feature table updated (attention, flash, auto-tuner),
  Phase 5 checked in roadmap, limitations updated
- kaio/README.md: status -> Phase 5 complete
- kaio-ops/README.md: attention + auto-tuner sections
- docs/phases.md: Phase 5 status -> Complete
- Publish dry-run: kaio-core + kaio-macros pass clean

Assets 2

11 Apr 21:43

dmriding

v0.0.4

d359b0c

v0.0.4 Pre-release

Pre-release

docs: adoption polish — examples, README rewrite, patterns, limitations

Sprint 4.9: Make KAIO frictionless for first-time users.

- Add 4 runnable examples: vector_add, saxpy, reduction, matmul
- README: add "When to Use KAIO", comparison table, examples table,
  3 copy-paste patterns, limitations, gotchas, feedback CTA
- README: quickstart now shows printed output + expected terminal result
- kaio/README.md: expanded crates.io landing page (<150 lines)
- reduction example uses real array data, not synthetic constants

Assets 2

11 Apr 01:07

dmriding

v0.0.3

cda6031

v0.0.3-beta — Phase 3 Complete Pre-release

Pre-release

KAIO v0.0.3-beta — Loops, Reductions & Softmax

Early development release. Name reservation + Phase 3 milestone.

What's in this release

#[gpu_kernel] proc macro — write GPU kernels in Rust syntax, compile to PTX at build time
Arithmetic, comparisons, if/else, for/while loops
Shared memory (shared_mem![f32; 256]), barrier sync, warp shuffle
Block-level reductions (block_reduce_sum, block_reduce_max)
Softmax kernel validated on RTX 4090 (< 1e-5 error vs CPU reference)
19 built-in functions: thread indices, math (sqrt, exp, log, tanh, sin, cos)
Cross-platform: Windows + Linux
200 host tests + 24 GPU tests, zero clippy warnings

Not ready for production use

This is a work-in-progress. API will change. Missing features include tiled matmul (Phase 4) and fused attention (Phase 5). See roadmap.

Crates

kaio v0.0.3
kaio-core v0.0.3
kaio-macros v0.0.3
kaio-runtime v0.0.3

Assets 2

Uh oh!

Releases: dmriding/kaio

v0.5.0

v0.5.0 — FlashAttention Backward, bf16 Tensor Cores, ldmatrix

Highlights

Also in this release

Install

Contributors

Uh oh!

v0.4.1

What's Changed

Contributors

Uh oh!

v0.4.0

What's Changed

Contributors

Uh oh!

v0.3.0

Uh oh!

v0.2.2

What's Changed

Contributors

Uh oh!

v0.2.1

What's Changed

Contributors

Uh oh!

v0.2.0

What's Changed

Contributors

Uh oh!

v0.1.0

Uh oh!

v0.0.4

Uh oh!

v0.0.3-beta — Phase 3 Complete

KAIO v0.0.3-beta — Loops, Reductions & Softmax

What's in this release

Not ready for production use

Crates

Uh oh!