A Four-Level Testing Framework for Numerical Software, or: How to Find Where Things Went Wrong
This testing framework is designed not merely to detect failures, but to locate them. When a deviation occurs, the level at which it manifests tells us where to look.
| Level | Name | Question Answered | What We Change |
|---|---|---|---|
| 1 | Regression | Does the code work? Does it still work after changes? | ✓ Unilateral changes permitted |
| 2 | Mathematical | Does the algorithm match the mathematics? | ✗ Read-only |
| 3 | Historical | Does output match what IBM 360/370 users saw? | ✗ Read-only |
| 4 | Hostile | Do compilers, processors, or OSes change anything? | ✗ Read-only |
- Level 1 fails → Code bug. Fix the code.
- Level 2 fails → Algorithm doesn't match mathematics. Either the original was wrong, or the modernisation broke something. Investigate.
- Level 3 fails → Deviation from historical baseline. Document in DEVIATIONS.md with explanation.
- Level 4 fails → Portability issue. Compiler/platform-specific behaviour. Document and potentially ifdef.
The beauty of this structure is that a failure at Level N but success at Level N-1 immediately narrows the search space. If Level 2 passes but Level 3 fails, we know the mathematics is correct but something about the platform differs from the original IBM hardware.
"Does the code work? If we change things, does it still work?"
These are the tests we run continuously. They verify basic functionality and catch regressions when code is modified. This is the only level where we make unilateral changes.
| Module | Status | Location |
|---|---|---|
| service | ⏳ Pending | test/level1_regression/ |
| special_functions | Partial (Bessel) | test/test_bessel_regression.f90 |
| linear (BLAS) | ✓ 18/18 PASS | test/level1_regression/test_l1_linear_blas.f90 |
| linear (LINPACK) | ⏳ In Progress | test/level1_regression/test_l1_linear_linpack.f90 |
| diff_integ | ✓ 12/12 PASS | test/level1_regression/test_l1_diff_integ.f90 |
| diff_integ_eq | Partial | test/test_diff_integ_eq.f90 |
| interpolation | ✓ 9/9 PASS | test/level1_regression/test_l1_interpolation.f90 |
| integ_trans | ⏳ Pending | — |
| approximation (MINPACK) | ✓ 9/9 PASS | test/level1_regression/test_l1_minpack.f90 |
| nonlin_eq | ⏳ Pending | — |
| optimisation | ⏳ Pending | — |
| data_handling | ⏳ Pending | — |
"Does the algorithm match the original mathematics?"
These tests compare computed values against authoritative mathematical references. If Level 2 fails, either:
- The original SLATEC algorithm was incorrect (rare but documented historically)
- The modernisation introduced a mathematical error
- The reference value is wrong (check multiple sources)
References:
- Abramowitz & Stegun (1964) — Handbook of Mathematical Functions
- NIST Digital Library of Mathematical Functions — https://dlmf.nist.gov/
- Closed-form solutions where available (e.g. the integral of x squared from 0 to 1 equals one-third)
| Module | Status | Reference Source |
|---|---|---|
| special_functions | Partial | A&S Tables 9.1, 9.8 (Bessel) |
| linear (BLAS) | ✓ 16/16 PASS | Lawson et al., ACM TOMS 5(3), 1979 |
| diff_integ | ✓ 17/17 PASS | Closed-form integrals, A&S, NIST DLMF |
| interpolation | ⚠ 14/17 PASS | de Boor (1978), Fritsch & Carlson (1980) |
| approximation (MINPACK) | ✓ 17/17 PASS | Moré, Garbow, Hillstrom (1981) |
Test Files:
test/test_bessel_golden.f90— Bessel J, I, K vs A&S tablestest/test_bessel_reference.f90— Extended validationtest/test_specfun_bessel.f90— Edge casestest/level2_mathematical/test_l2_linear_blas.f90— BLAS linearity, orthogonality, Pythagorean propertiestest/level2_mathematical/test_l2_minpack_mgh.f90— MGH test functions (Rosenbrock, Powell, Helical Valley, Freudenstein-Roth)test/level2_mathematical/test_l2_diff_integ.f90— Gauss-Kronrod exactness, classical integrals, infinite intervalstest/level2_mathematical/test_l2_interpolation.f90— B-spline, PCHIP, polynomial interpolation properties
"Does the output match what users would have seen on the original hardware?"
This is the "last known good" reference. SLATEC was developed and validated on IBM mainframes. If our modernised code produces different output than the original would have on IBM hardware, we need to understand why.
| Component | Version | Notes |
|---|---|---|
| Compiler | IBM FORTRAN G Level 21 (IEYFORT) | 1966 - predates SLATEC by 16 years |
| System | MVS 3.8j (TK4-) | The mainframe that refuses to die |
| Emulator | Hercules 3.07 | Making virtual iron from actual silicon |
| Library | SLATEC 4.1 | Sandia/Los Alamos/Air Force origin |
IBM hexadecimal floating-point differs from IEEE 754:
| Property | IBM Hex FP | IEEE 754 |
|---|---|---|
| Base | 16 | 2 |
| Single mantissa | 24 bits (6 hex) | 23 bits |
| Double mantissa | 56 bits (14 hex) | 52 bits |
| Gradual underflow | No | Yes |
| NaN / Infinity | No | Yes |
| Precision | "Wobbling" (0-3 bits lost) | Consistent |
A deviation at Level 3 is not necessarily a bug - it may be an unavoidable consequence of the floating-point representation change. But it must be documented and explained.
IBM System/360 uses hexadecimal floating-point because of course it does.
I1MACH (Integer Machine Constants)
I1MACH(1) = 5 Standard input unit
I1MACH(2) = 6 Standard output unit
I1MACH(10) = 16 Base for floating-point (hex!)
I1MACH(11) = 6 Hex digits in single precision mantissa
I1MACH(14) = 14 Hex digits in double precision mantissa
R1MACH (Single Precision): epsilon = 9.54e-7, range 1e-78 to 1e76
D1MACH (Double Precision): epsilon = 2.22e-16, range 1e-79 to 1e75
| Routine | Category | Precision | Status | Max Error |
|---|---|---|---|---|
| I1MACH | Machine Constants | Integer | PASS | exact |
| R1MACH | Machine Constants | Single | PASS | exact |
| D1MACH | Machine Constants | Double | PASS | exact |
| GAMLN | Special Functions | Single | PASS | ~1e-6 |
| DGAMLN | Special Functions | Double | PASS | ~1e-15 |
| PYTHAG | Numerical Utilities | Single | PASS | ~1e-6 |
| ENORM | Vector Operations | Single | PASS | 0 |
| DENORM | Vector Operations | Double | PASS | ~1e-15 |
| CDIV | Complex Arithmetic | Single | PASS | ~1e-7 |
| CSROOT | Complex Arithmetic | Single | PASS | 0 |
| RC | Elliptic Integrals | Single | PASS | ~1e-6 |
| DRC | Elliptic Integrals | Double | PASS | ~1e-16 |
| VNWRMS | Vector Operations | Single | PASS | ~1e-6 |
| HVNRM | Vector Operations | Single | PASS | 0 |
14 routines tested. 14 routines passed. 0 surprises.
Carlson Elliptic Integrals - Computing pi via the duplication theorem:
- RC(0, 1/4) = 3.1415930 (single) - pi to 7 digits on 1966 hardware
- DRC(0, 1/4) = 3.14159265358979 (double) - pi to 15 digits, error 2.4e-15
Overflow Protection - PYTHAG and ENORM handle extreme values correctly:
- PYTHAG(1e30, 1e30) = sqrt(2)*1e30 without overflow
- DENORM([1e30, 1e30]) = sqrt(2)*1e30 with error 0.0
Single Precision (6 hex digits = 7.2 decimal digits)
- Observed errors: 1e-5 to 1e-7
- Within expected tolerance for IBM 360 hexadecimal FP
Double Precision (14 hex digits = 16.8 decimal digits)
- Observed errors: 1e-15 to 1e-16
- Approaching machine epsilon, as expected
Code modifications required to compile on IBM FORTRAN G (1966):
| Issue | FORTRAN 77 | FORTRAN IV Fix |
|---|---|---|
| Deck markers | *DECK NAME |
C DECK NAME |
| Generic intrinsics | MAX, MIN, ABS |
AMAX1, AMIN1, ABS (SP) |
| Generic intrinsics (DP) | MAX, MIN, LOG |
DMAX1, DMIN1, DLOG |
| Assumed-size arrays | X(*) |
X(N) |
| SAVE statement | SAVE VAR |
(removed) |
| Hex constants | Z'00100000' |
Decimal via EQUIVALENCE |
| Block IF | IF (...) THEN |
IF (...) GO TO label |
/c/dev/fortran360/tests/slatec/
| Module | Routines Tested | Status |
|---|---|---|
| service | D1MACH, R1MACH, I1MACH | PASS |
| special_functions | GAMLN, DGAMLN, CDIV, CSROOT | PASS |
| linear (BLAS) | DAXPY, DROTG | 9/9 PASS |
| linear | ENORM, PYTHAG | PASS |
| elliptic | RC, DRC | PASS |
| approximation (MINPACK) | DENORM | 7/7 PASS |
| approximation (MINPACK) | DQRFAC | Pending |
| approximation (MINPACK) | DNLS1 | Pending |
| approximation (MINPACK) | DNSQ | Pending |
| interpolation | DPLINT, DPOLVL (polynomial) | 4/4 PASS |
| interpolation | DPCHIM, DPCHFE (PCHIP) | N/A (post-1980) |
| interpolation | DBINT4, DBVALU (B-spline) | N/A (post-1978) |
| diff_integ | Trapezoidal rule | 2/2 PASS |
| diff_integ | QUADPACK routines | N/A (post-1983) |
- SLATEC: 1982 joint effort by Sandia, Los Alamos, and Air Force labs
- FORTRAN G: 1966 IBM compiler, predating SLATEC by 16 years
- IBM 360: 1964 architecture, still running (virtually) today
The fact that 1982 library code runs perfectly on a 1966 compiler is a testament to FORTRAN's legendary backward compatibility.
BLAS tested with Pythagorean triple golden values. FORTRAN IV source in /c/dev/fortran360/tests/slatec/blas/.
DENORM tested on IBM System/360 (Hercules/TK4-) using FORTRAN G, 1 January 2026.
"Do compilers, processors, or operating systems change anything?"
These are the portability stress tests. The same code compiled with different compilers, on different processors, under different operating systems, should produce the same results (within floating-point tolerance). When it doesn't, we need to know.
| Compiler | Platform | Status |
|---|---|---|
| gfortran 13.x | Linux (x86_64) | CI tested |
| gfortran 13.x | macOS (ARM64) | CI tested |
| gfortran 13.x | Windows (MSYS2) | CI tested |
| ifort/ifx | Linux | ⏳ Untested |
| flang | Linux | ⏳ Untested |
| nvfortran | Linux | ⏳ Untested |
| Test Category | Tests | Default | -ffast-math |
|---|---|---|---|
| Subnormals | 3 | ✓ PASS | May FAIL (DAZ/FTZ) |
| Signed Zero | 3 | ✓ PASS | ✓ PASS |
| Inf/NaN Propagation | 3 | ✓ PASS | ✓ PASS |
| Extreme Values | 4 | ✓ PASS | ✓ PASS |
| Accumulation Precision | 3 | ✓ PASS | ✓ PASS |
| SIMD Edge Cases | 4 | ✓ PASS | ✓ PASS |
| Rounding Modes | 4 | ✓ PASS | ✓ PASS |
| FMA Detection | 3 | ✓ PASS | ✓ PASS |
| Catastrophic Cancellation | 3 | ✓ PASS | May vary |
| Associativity | 4 | ✓ PASS | May FAIL |
| Extended Precision (x87) | 4 | ✓ PASS | ✓ PASS |
| Reproducibility | 3 | ✓ PASS | ✓ PASS |
| Compiler Flag Detection | 4 | ✓ PASS | FAIL (expected) |
| NaN Variants | 4 | ✓ PASS | May FAIL |
| ULP Accuracy | 4 | ✓ PASS | ✓ PASS |
| Memory Alignment | 12 | ✓ PASS | ✓ PASS |
| Total | 65 | 65/65 PASS | May vary |
| Test Category | Tests | Default | -ffast-math |
|---|---|---|---|
| ULP Precision | 4 | ✓ PASS | ✓ PASS |
| Edge Cases | 5 | ✓ PASS | May FAIL |
| Associativity | 4 | ✓ PASS | May FAIL |
| Rounding Modes | 4 | ✓ PASS | ✓ PASS |
| FMA Effects | 3 | ✓ PASS | ✓ PASS |
| Catastrophic Cancellation | 3 | ✓ PASS | May vary |
| Condition Number | 3 | ✓ PASS | ✓ PASS |
| Convergence Reproducibility | 3 | ✓ PASS | ✓ PASS |
| Compiler Flag Detection | 4 | ✓ PASS | FAIL (expected) |
| Extended Precision (x87) | 3 | ✓ PASS | ✓ PASS |
| Scaling Sensitivity | 3 | ✓ PASS | ✓ PASS |
| Jacobian Computation | 3 | ✓ PASS | May vary |
| LM Parameter Sensitivity | 3 | ✓ PASS | ✓ PASS |
| Total | 45 | 45/45 PASS | May FAIL |
| Issue | Affected | Mitigation |
|---|---|---|
-ffast-math FTZ |
DENORM subnormals | Subnormals flushed to zero. DENORM returns 0 instead of correct norm. Do not use. |
-ffast-math |
Bessel K functions | 2-4 ULP deviation. Do not use. |
| Aggressive optimisation | Some QUADPACK routines | May reorder operations. Test carefully. |
| ARM vs x86 FPU | Potentially all | Under investigation |
See DEVIATIONS.md for full analysis.
test/test_bessel_portability.f90— Cross-platform ULP measurementtest/level4_hostile/test_l4_linear_blas.f90— BLAS hostile tests (subnormals, Inf/NaN, SIMD edges)test/level4_hostile/test_l4_minpack.f90— MINPACK portability (detects FTZ)
Last updated: 18 January 2026
| Module | Routines | L1 | L2 | L3 | L4 | Overall |
|---|---|---|---|---|---|---|
| service | 3 | — | — | ✓ | — | ~33% |
| special_functions | 270 | ~20 | ~20 | 4 | ~20 | ~9% |
| linear (BLAS) | 217 | 18 | 16 | 9 | 65 | ~50% |
| diff_integ | 81 | 12 | 17 | 2 | 12 | ~53% |
| diff_integ_eq | 225 | ~5 | — | — | — | ~2% |
| interpolation | 80 | 9 | 14 | 4 | 19 | ~58% |
| integ_trans | 48 | — | — | — | — | 0% |
| approximation | 78 | 9 | 17 | 7 | 45 | ~100% |
| nonlin_eq | 15 | — | — | — | — | 0% |
| optimisation | 46 | — | — | — | — | 0% |
| data_handling | 16 | — | — | — | — | 0% |
| TOTAL | 1,079 | ~83 | ~98 | ~28 | ~161 | ~25% |
Testing the routines we just modernised (eliminated 27 GOTOs)
| Level | Tests | Result | Notes |
|---|---|---|---|
| L1 Regression | 9 | ✓ 9/9 PASS | DENORM (7) + ENORM (2) |
| L2 Mathematical | 17 | ✓ 17/17 PASS | Rosenbrock, Powell, Helical, F-R, DENORM |
| L3 Historical | 7 | ✓ 7/7 PASS | IBM 360 golden values |
| L4 Hostile | 45 | ✓ 45/45 PASS | 13 categories, comprehensive portability |
| L4 with -ffast-math | 45 | May FAIL | Subnormal flush, associativity detected |
| TOTAL | 78 | 78/78 PASS | (without hostile flags) |
| Level | File | Description |
|---|---|---|
| L1 | test/level1_regression/test_l1_minpack.f90 |
Modern Fortran regression |
| L2 | test/level2_mathematical/test_l2_minpack_mgh.f90 |
MGH test functions |
| L3 | test/level3_historical/test_l3_minpack.f90 |
IBM 360 golden comparisons |
| L3 | /c/dev/fortran360/tests/slatec/minpack/test_enorm.f |
FORTRAN IV for Hercules |
| L4 | test/level4_hostile/test_l4_minpack.f90 |
Portability stress tests |
| Test File | Routines | Problems | Status |
|---|---|---|---|
| test_enorm.f | DENORM | 7 norm tests | ✓ 7/7 PASS |
| test_qrfac.f | DQRFAC | 4 QR tests | ⏳ Pending |
| test_dnls1.f | DNLS1 | Rosenbrock, Powell, Freudenstein-Roth | ⏳ Pending |
| test_dnsq.f | DNSQ | Helical, Trigonometric, Broyden | ⏳ Pending |
Testing core linear algebra routines (DAXPY, DSCAL, DCOPY, DSWAP, DROT, DROTG)
| Level | Tests | Result | Notes |
|---|---|---|---|
| L1 Regression | 18 | ✓ 18/18 PASS | DAXPY, DSCAL, DCOPY, DSWAP, DROT, DROTG |
| L2 Mathematical | 16 | ✓ 16/16 PASS | Linearity, orthogonality, Pythagorean |
| L3 Historical | 9 | ✓ 9/9 PASS | IBM 360 golden values (Pythagorean triples) |
| L4 Hostile | 65 | ✓ 65/65 PASS | 16 categories, comprehensive portability |
| TOTAL | 108 | 108/108 PASS |
| Level | File | Description |
|---|---|---|
| L1 | test/level1_regression/test_l1_linear_blas.f90 |
Modern Fortran regression |
| L2 | test/level2_mathematical/test_l2_linear_blas.f90 |
Mathematical property verification |
| L3 | test/level3_historical/test_l3_linear_blas.f90 |
IBM 360 golden comparisons |
| L3 | /c/dev/fortran360/tests/slatec/blas/test_daxpy.f |
FORTRAN IV for Hercules |
| L3 | /c/dev/fortran360/tests/slatec/blas/test_drotg.f |
FORTRAN IV for Hercules |
| L4 | test/level4_hostile/test_l4_linear_blas.f90 |
Hostile environment tests |
| Property | Test |
|---|---|
| Distributivity | (a+b)X = aX + b*X |
| Distributivity | a*(X+Y) = aX + aY |
| Orthogonality | c² + s² = 1 for Givens rotations |
| Pythagorean | r = √(a² + b²) for Givens |
| Norm preservation | ||Rx|| = ||x|| for rotations |
| Triangle inequality | ||x+y|| ≤ ||x|| + ||y|| |
| Category | Tests | What It Catches |
|---|---|---|
| Subnormals | 3 | DAZ/FTZ from -ffast-math |
| Signed Zero | 3 | IEEE 754 ±0 handling |
| Inf/NaN | 3 | Special value propagation |
| Extreme Values | 4 | Near overflow/underflow |
| Accumulation | 3 | Precision loss in summations |
| SIMD Edges | 4 | Vectorization boundary bugs |
| Rounding Modes | 4 | All four IEEE rounding modes |
| FMA Detection | 3 | Fused multiply-add vs separate |
| Catastrophic Cancellation | 3 | Nearly-equal value subtraction |
| Associativity | 4 | Summation reordering, -ffast-math |
| Extended Precision (x87) | 4 | 80-bit register leakage |
| Reproducibility | 3 | Identical results across runs |
| Compiler Flag Detection | 4 | Automatic dangerous flag detection |
| NaN Variants | 4 | qNaN vs sNaN, payload preservation |
| ULP Accuracy | 4 | Mathematical precision deviation |
| Memory Alignment | 12 | Non-aligned, strided, reverse access |
Testing QUADPACK and GAUS8 numerical integration routines
| Level | Tests | Result | Notes |
|---|---|---|---|
| L1 Regression | 12 | ✓ 12/12 PASS | DGAUS8, DQAGS, DQAGI, DQNG |
| L2 Mathematical | 17 | ✓ 17/17 PASS | Gauss-Kronrod exactness, classical integrals |
| L3 Historical | 2 | ✓ 2/2 PASS | Trapezoidal rule (QUADPACK N/A - post-1983) |
| L4 Hostile | 12 | ✓ 12/12 PASS | Extreme values, oscillatory, reproducibility |
| TOTAL | 43 | 43/43 PASS |
| Level | File | Description |
|---|---|---|
| L1 | test/level1_regression/test_l1_diff_integ.f90 |
Basic integration regression tests |
| L2 | test/level2_mathematical/test_l2_diff_integ.f90 |
Gauss-Kronrod exactness, special functions |
| L3 | test/level3_historical/test_l3_diff_integ.f90 |
IBM 360 trapezoidal golden values |
| L4 | test/level4_hostile/test_l4_diff_integ.f90 |
Hostile environment tests |
| Property | Test |
|---|---|
| Gauss-Kronrod exactness | QK15 exact for degree ≤ 29 |
| Gauss-Kronrod exactness | QK21 exact for degree ≤ 41 |
| Gauss-Kronrod exactness | QK31 exact for degree ≤ 61 |
| Classical integrals | ∫sin(x)dx, ∫exp(-x)dx, ∫1/√x dx |
| Infinite intervals | ∫₀^∞ exp(-x)dx = 1 |
| Gaussian integral | ∫_{-∞}^{∞} exp(-x²)dx = √π |
| Category | Tests | What It Catches |
|---|---|---|
| Tiny Intervals | 2 | Very small [a,b], A≈B handling |
| Extreme Values | 3 | 1e-100 to 1e100 function scaling |
| Narrow Peaks | 2 | Sharp Gaussian, narrow Lorentzian |
| Discontinuities | 2 | Step functions, kinks |
| Oscillatory | 2 | sin(10x), cos(10x) moderate oscillation |
| Reproducibility | 1 | Identical results across runs |
QUADPACK was developed by Piessens, de Doncker, et al. and published in 1983. The algorithms post-date the IBM 360 era, so no authentic L3 golden values exist. Level 3 tests use trapezoidal rule integration (1960s compatible) for IBM 360 validation.
Testing B-spline, PCHIP, and polynomial interpolation routines
| Level | Tests | Result | Notes |
|---|---|---|---|
| L1 Regression | 9 | ✓ 9/9 PASS | DBINT4, DBVALU, DPCHIM, DPCHFE, DPLINT, DPOLVL |
| L2 Mathematical | 17 | ⚠ 14/17 PASS | B-spline issues identified |
| L3 Historical | 4 | ✓ 4/4 PASS | IBM 360 polynomial golden values |
| L4 Hostile | 19 | ✓ 19/19 PASS | Runge phenomenon, subnormals, cancellation |
| TOTAL | 49 | 46/49 PASS | 3 B-spline mathematical issues |
| Level | File | Description |
|---|---|---|
| L1 | test/level1_regression/test_l1_interpolation.f90 |
Basic interpolation regression |
| L2 | test/level2_mathematical/test_l2_interpolation.f90 |
Mathematical property verification |
| L3 | test/level3_historical/test_l3_interpolation.f90 |
IBM 360 golden comparisons |
| L3 | /c/dev/vintage/fortran360/tests/slatec/interpolation/test_poly.f |
FORTRAN IV for Hercules |
| L4 | test/level4_hostile/test_l4_interpolation.f90 |
Hostile environment tests |
Captured via Hercules/TK4- on 18 January 2026:
| Test | y = x³ at midpoints | IBM 360 Value |
|---|---|---|
| p(0.5) | 0.125 | 0.125000000000000D 00 |
| p(1.5) | 3.375 | 0.337500000000000D 01 |
| p(2.5) | 15.625 | 0.156250000000000D 02 |
| p(3.5) | 42.875 | 0.428750000000000D 02 |
- DBINT4 natural spline: Does not interpolate exactly (max error: 3.56e-03)
- DBINT4 clamped spline: Does not interpolate exactly (max error: 0.36)
- Natural spline boundary: S''(1) = 60.1 instead of 0
PCHIP and polynomial interpolation work correctly.
When adding tests, consider which level they belong to:
- Level 1: Basic functionality. Does
DGEMM(A, B)return something reasonable? - Level 2: Mathematical correctness. Does
DGEMM(I, A)returnA? - Level 3: Historical baseline. Does our
DGEMMmatch IBM 360 output? - Level 4: Portability. Does
DGEMMgive the same answer on ARM vs x86?
Only Level 1 tests should be modified when code changes. Levels 2-4 are reference tests — if they fail, investigate before changing them.
All deviations from expected values are documented in DEVIATIONS.md, including:
- Which level detected the deviation
- Root cause analysis
- Whether it's a bug, a floating-point artefact, or expected behaviour
- Remediation (if any)
"The purpose of testing is not to prove the code works. It's to find where it doesn't."
— Someone who has clearly spent too long debugging numerical software