These are empirically backed heuristics, not hard requirements. A violation is not necessarily wrong, but it should be deliberate and documented. Flag unchecked items as "needs justification", not "needs fix". See `elementwise-evidence.md` for reasoning and data.
- Unary-like kernel → `register_copy`
- Binary broadcast → `explicit_parallel`
- Binary same-shape → auto-selects `register_copy`
- `register_copy` path uses `T.alloc_fragment` + `T.copy`
- Bool inputs packed as `uint8` in Op layer
- `explicit_parallel` npt: fp16/bf16=4, fp32=4, fp8=16
- `register_copy` npt: fp16/bf16=8, fp32=4, fp8=16
- `autotune_configs` defined for all template kernels (Unary, Binary, FusedGated): threads∈{128,256,512} × npt∈{2,4,8}; fp8: npt∈{16,32}
- All kernel classes (template and custom) cache `_compiled_fn` in `init_config()`
- Serialization-fallback `autotune()` override for template kernels with closure-based `op_func`
- `op_func` uses TileLang intrinsics over manual comparison chains
- Results written in-place to input register fragment
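
As a rough illustration of the strategy-selection, default-npt, and autotune-grid items above, the following plain-Python sketch encodes the listed values as data. The helper names (`select_strategy`, `default_npt`, `autotune_configs` as a free function), the dtype keys, and the dict layout of each config are assumptions made for this example; the actual kernel classes may structure this differently.

```python
from itertools import product

# Default npt per strategy and dtype, copied from the checklist.
# The dtype keys ("fp16", "bf16", ...) are illustrative labels, not real dtype objects.
DEFAULT_NPT = {
    "explicit_parallel": {"fp16": 4, "bf16": 4, "fp32": 4, "fp8": 16},
    "register_copy": {"fp16": 8, "bf16": 8, "fp32": 4, "fp8": 16},
}


def select_strategy(kind: str, same_shape: bool = True) -> str:
    """Hypothetical helper: map a kernel kind to the strategy the checklist expects."""
    if kind == "unary":
        return "register_copy"
    if kind == "binary":
        # Same-shape binary ops take the register_copy path; broadcast ones do not.
        return "register_copy" if same_shape else "explicit_parallel"
    raise ValueError(f"unknown kernel kind: {kind!r}")


def default_npt(strategy: str, dtype: str) -> int:
    """Hypothetical helper: look up the default npt for a strategy/dtype pair."""
    return DEFAULT_NPT[strategy][dtype]


def autotune_configs(dtype: str) -> list[dict]:
    """Build the threads x npt grid from the checklist (fp8 uses a wider npt range)."""
    threads = (128, 256, 512)
    npts = (16, 32) if dtype == "fp8" else (2, 4, 8)
    return [{"threads": t, "npt": n} for t, n in product(threads, npts)]


if __name__ == "__main__":
    strategy = select_strategy("binary", same_shape=False)
    print(strategy, default_npt(strategy, "fp16"))
    print(len(autotune_configs("fp16")), len(autotune_configs("fp8")))
```

A reviewer could compare a kernel's declared strategy, npt, and tuning grid against these tables and mark any mismatch as "needs justification" rather than as a defect.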