Problem
coreai-torch relies on PyTorch's run_decompositions() to break down operations such as softplus, mish, logsumexp, and logcumsumexp into primitive ops before conversion. However, PyTorch's default decompositions produce naïve forms that overflow in fp16 on the Apple Neural Engine (ANE):
| Operation | Naïve Decomposition | Failure Threshold | Failure Mode |
|---|---|---|---|
| softplus | log(1 + exp(x)) | x ≈ 10.4 | Output → 0 |
| mish | x * tanh(log(1 + exp(x))) | x ≈ 10.4 | Output → 0 |
| logsumexp | log(sum(exp(x_i))) | x ≈ 7.63 | Output → 0 |
| logcumsumexp | log(cumsum(exp(x_i))) | x ≈ 11.09 | Output → ∞/NaN |
These operations are not in the _COMPOSITE_OPS preserve list in _decomp.py, so they get decomposed into overflow-prone primitives.
Note: log_softmax is already correctly handled with a stable max-shift implementation (replace_log_softmax in _aten_to_core.py).
Root Cause
In _decomp.py, the decomposition table preserves only six ops (hardsigmoid, hardswish, instance_norm, pixel_shuffle, scaled_dot_product_attention, silu). Because softplus is not in this list, PyTorch decomposes it to log(1 + exp(x)), where exp(x) overflows fp16 (max 65,504) for x > ~11.09, since ln(65,504) ≈ 11.09. On the ANE specifically, the overflow occurs even earlier, at x ≈ 10.4 (ln(2^15) ≈ 10.40), due to an internal 2^15-bounded representation.
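The two thresholds quoted above follow directly from taking the natural log of each bound. A minimal sketch of that arithmetic, using only the standard library (the variable names are illustrative and do not come from coreai-torch):

```python
import math

FP16_MAX = 65504.0      # largest finite IEEE 754 half-precision value
ANE_BOUND = 2.0 ** 15   # the 2^15 bound this issue attributes to the ANE

# exp(x) exceeds a bound B exactly when x > ln(B).
fp16_threshold = math.log(FP16_MAX)
ane_threshold = math.log(ANE_BOUND)

print(f"fp16 overflow threshold: x > {fp16_threshold:.2f}")
print(f"ANE overflow threshold:  x > {ane_threshold:.2f}")
```

These values match the x ≈ 11.09 and x ≈ 10.4 figures reported for logcumsumexp and softplus/mish respectively.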
Proposed Fix
Apply algebraically equivalent, numerically stable decompositions at the converter layer:
Softplus:
softplus(x) = max(x, 0) + log(1 + exp(-|x|))
Since -|x| <= 0, exp(-|x|) ∈ (0, 1], so the exponential cannot overflow and the log term stays within (0, log 2].
Mish:
mish(x) = x * tanh(softplus_stable(x))
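The softplus and mish rewrites above can be checked with a small pure-Python emulation of fp16 rounding. This is only a sketch: `to_fp16`, `softplus_naive`, `softplus_stable`, and `mish_stable` are hypothetical helper names, not coreai-torch functions, and the emulation follows IEEE 754 half precision, where overflow yields inf. It does not model the ANE, whose observed failure mode (output collapsing to 0) differs.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round to IEEE 754 half precision; values too large to represent become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def softplus_naive(x: float) -> float:
    # exp(x) is rounded to fp16 first, which is where the overflow happens.
    e = to_fp16(math.exp(x))
    return to_fp16(math.log(to_fp16(1.0 + e)))


def softplus_stable(x: float) -> float:
    # exp(-|x|) lies in (0, 1], so no intermediate value can overflow.
    e = to_fp16(math.exp(-abs(x)))
    return to_fp16(max(x, 0.0) + to_fp16(math.log(to_fp16(1.0 + e))))


def mish_stable(x: float) -> float:
    # tanh is bounded in (-1, 1), so the only risky step is the softplus.
    return to_fp16(x * to_fp16(math.tanh(softplus_stable(x))))


for x in (1.0, 10.0, 12.0, 100.0):
    print(x, softplus_naive(x), softplus_stable(x), mish_stable(x))
```

For small inputs the two softplus forms agree. Past the fp16 threshold the naïve form returns inf in this emulation, while the stable form returns approximately x, which is the correct asymptote.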
Logsumexp (max-shift):
logsumexp(x) = max(x) + log(sum(exp(x - max(x))))
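The max-shift form can be sketched the same way. Again, this is an illustrative emulation of IEEE 754 fp16 rounding in plain Python, with hypothetical helper names (`to_fp16`, `logsumexp_naive`, `logsumexp_stable`). It is not the converter's implementation and does not reproduce ANE-specific behavior.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round to IEEE 754 half precision; values too large to represent become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def logsumexp_naive(xs: list[float]) -> float:
    total = 0.0
    for x in xs:
        # Each exp(x_i) and the running sum are rounded to fp16.
        total = to_fp16(total + to_fp16(math.exp(x)))
    return to_fp16(math.log(total))


def logsumexp_stable(xs: list[float]) -> float:
    m = max(xs)
    total = 0.0
    for x in xs:
        # x - m <= 0, so every term is in (0, 1]; the max element contributes
        # exactly 1, which also keeps the log argument away from zero.
        total = to_fp16(total + to_fp16(math.exp(x - m)))
    return to_fp16(m + to_fp16(math.log(total)))


xs = [12.0, 11.0, 10.0]
reference = math.log(sum(math.exp(x) for x in xs))  # float64 reference
print(logsumexp_naive(xs), logsumexp_stable(xs), reference)
```

After the shift, the sum is bounded by the number of elements being reduced rather than by the magnitude of the inputs, so overflow would only become a concern for reductions over tens of thousands of elements.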
Prior Art
These exact fixes have been implemented and validated in apple/coremltools (the predecessor framework):
- PR #2725 — softplus, mish
- PR #2726 — logsumexp
- PR #2727 — log_softmax, logcumsumexp
- Issue #2687 — original fp16 overflow report
The fixes were validated on M3 Max and M5 silicon across 128+ test configurations, with zero regressions.
Environment
- coreai-torch: latest (cloned June 21, 2026)
- macOS 26 / Apple Silicon
- PyTorch 2.7+