Hi Gemmini team,
I am debugging a possible issue in the BF16 mixed-precision PE path, specifically in output-stationary (OS) dataflow. I would like to ask whether the behavior below is expected, or whether it may indicate a type-width issue in PE.scala.
Setup
I am using the official BF16 high-performance / mixed-precision Gemmini configuration in my local Chipyard/Gemmini setup.
The test uses the official high-level matmul API:
tiled_matmul_auto(...)
This is not a manually assembled preload / compute / mvout primitive sequence.
The test case is small:
dataflow: OS
matrix size: DIM x DIM, currently DIM = 4
no bias, D = NULL
A is filled with BF16 0x3f99
B is filled with BF16 0x3f99
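0x3f99 is exactly 1.1953125 in BF16 (153/128), so with DIM = 4 each output element is the sum of four identical products:
4 * 1.1953125 * 1.1953125 ~= 5.715
In FP32 this sum is 0x40b6e200, so the expected BF16 result is:
0x40b6 (or 0x40b7 if the final conversion rounds to nearest even)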
However, the actual output is close to zero, often 0x0001.
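The expected value can be cross-checked in software. The following is a minimal sketch in plain Python, independent of Gemmini; the helper names (bf16_to_float, float_to_fp32_bits, fp32_bits_to_bf16) are hypothetical and exist only for this illustration, and NaN/Inf handling is ignored:

```python
import struct

def bf16_to_float(bits):
    """Decode a 16-bit BF16 pattern by placing it in the top half of an FP32 word."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]

def float_to_fp32_bits(x):
    """Round a Python float to FP32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp32_bits_to_bf16(bits, round_nearest_even=True):
    """Convert an FP32 bit pattern to BF16, by RNE or by plain truncation."""
    if round_nearest_even:
        bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

a = bf16_to_float(0x3F99)
expected = 4 * a * a                      # DIM = 4 identical products per output element
expected_fp32 = float_to_fp32_bits(expected)
expected_bf16_trunc = fp32_bits_to_bf16(expected_fp32, round_nearest_even=False)
expected_bf16_rne = fp32_bits_to_bf16(expected_fp32, round_nearest_even=True)

print(f"a = {a}, expected = {expected}")
print(f"FP32 = 0x{expected_fp32:08x}")
print(f"BF16 (truncate) = 0x{expected_bf16_trunc:04x}, BF16 (RNE) = 0x{expected_bf16_rne:04x}")
```

Which of the two BF16 values the hardware should produce depends on the rounding mode of the final conversion, which I have not confirmed.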
Debug instrumentation
I added debug-only printfs in ExecuteController.scala and PE.scala.
In PE.scala, I only split the original assignment:
io.out_d := io.in_c.mac(io.in_a, io.in_b)
into:
val mac_raw = io.in_c.mac(io.in_a, io.in_b)
io.out_d := mac_raw
and printed mac_raw, out_d, PE inputs, and PE internal registers. This was only for observation and was not intended to change functionality.
Key observations
At the ExecuteController / Mesh boundary, the inputs appear to be correct. For example, I can see A/B containing 0x3f99:
dataA=0x3f993f993f993f99
meshA=0x3f993f993f993f99
dataB=0x3f993f993f993f99
meshB=0x3f993f993f993f99
Inside the PE, the MAC input also appears correct:
[PEDBG] ... a=0x3f99 b=0x3f99 ...
mac_in_a=0x3f99 mac_in_b=0x3f99 mac_in_c=0x00000000 ...
The raw MAC result appears reasonable:
[MACDBG] in_a=0x3f99 in_b=0x3f99 in_c=0x00000000
mac_raw=0x3fb6e200
mac_as_out=0xe200
out_d=0xe200
0x3fb6e200 is the exact FP32 encoding of 1.1953125 * 1.1953125 = 1.42877197265625, so the result of this single multiplication of 0x3f99 * 0x3f99 looks correct.
However, out_d becomes 0xe200, which looks like the low 16 bits of the FP32 result:
FP32 mac_raw = 0x3fb6e200
high 16 bits = 0x3fb6
low 16 bits = 0xe200
observed out_d = 0xe200
If this were a correct FP32-to-BF16 conversion, I would expect something close to 0x3fb6, not 0xe200.
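The difference between a real FP32-to-BF16 conversion and keeping only the low 16 bits can be seen with a short Python sketch. This is only a software illustration of the bit patterns in the log, not a model of the actual hardware; the helper names are hypothetical:

```python
import struct

def fp32_bits_to_float(bits):
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def bf16_to_float(bits):
    """Interpret a 16-bit pattern as BF16 (the top half of an FP32 word)."""
    return fp32_bits_to_float((bits & 0xFFFF) << 16)

mac_raw = 0x3FB6E200                 # value printed by [MACDBG]
high16 = mac_raw >> 16               # what a truncating FP32-to-BF16 conversion keeps
low16 = mac_raw & 0xFFFF             # what a plain 32-bit to 16-bit width truncation keeps

print(f"mac_raw as FP32 = {fp32_bits_to_float(mac_raw)}")
print(f"high16 = 0x{high16:04x} as BF16 = {bf16_to_float(high16)}")
print(f"low16  = 0x{low16:04x} as BF16 = {bf16_to_float(low16)}")
```

The high half decodes to a value close to the product, while the low half decodes to an unrelated value, which is why out_d=0xe200 does not look like a numeric conversion.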
After that, the value written back into the PE internal accumulator register appears to become:
c2=0x0000e200
mac_in_c=0x0000e200
This looks like the truncated 16-bit value 0xe200 being zero-extended or otherwise placed back into the FP32 accumulator register. Interpreted as FP32, 0x0000e200 is a subnormal (about 8.1e-41), so the partial sum is effectively lost and the OS accumulation chain seems to be corrupted.
Later, I also see values like:
c2=0x00010000
out_c=0x0001
and the Mesh response eventually contains:
resp_data=0x0001000100010001
So it looks like the final output is already wrong when it leaves the Mesh, before it is written back to the accumulator.
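To show why these values read as "close to zero", here is a small Python sketch that interprets the logged c2 patterns as FP32 and splits resp_data into 16-bit lanes. The four-lane layout is my assumption based on DIM = 4 and 16-bit outputs; the helper name is hypothetical:

```python
import struct

def fp32_bits_to_float(bits):
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

FP32_MIN_NORMAL = 2.0 ** -126        # smallest positive normal FP32 value

c2_values = [0x0000E200, 0x00010000]  # c2 patterns taken from the debug log
for bits in c2_values:
    value = fp32_bits_to_float(bits)
    is_subnormal = 0.0 < value < FP32_MIN_NORMAL
    print(f"c2=0x{bits:08x} -> {value:.3e} subnormal={is_subnormal} high16=0x{bits >> 16:04x}")

resp_data = 0x0001000100010001
# Assumption: four 16-bit output lanes packed into one 64-bit word.
lanes = [(resp_data >> (16 * i)) & 0xFFFF for i in range(4)]
print([f"0x{lane:04x}" for lane in lanes])
```

Both c2 patterns are FP32 subnormals, and the high 16 bits of 0x00010000 match the observed out_c=0x0001, which is consistent with out_c taking the upper half of c2 at that point.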
Suspected cause
In the original PE.scala, the MAC unit is instantiated as:
val mac_unit = Module(new MacUnit(inputType, weightType,
if (df == Dataflow.WS) outputType else accType, outputType))
In OS mode, this seems to mean:
in_a = inputType // BF16
in_b = weightType // BF16
in_c = accType // FP32 accumulator
out_d = outputType // BF16
So the MAC raw result is computed with an FP32 accumulator input, but the MacUnit output port is still outputType, which is BF16 in this configuration.
My concern is that in OS mode, the FP32 MAC result may be connected to a 16-bit BF16 out_d, causing a bit-level truncation before being written back into c1 / c2. This would explain why the raw MAC result looks reasonable, but the accumulated result becomes corrupted.
Question
Is this behavior expected for the BF16 mixed-precision OS dataflow?
Should the OS MAC feedback path keep the MAC result as accType and only convert to outputType at the final out_c output stage?
In other words, should the OS path be closer to:
BF16 A * BF16 B + FP32 C -> FP32 result -> write back to c1/c2
and only later:
FP32 c1/c2 -> BF16 out_c
rather than routing the FP32 MAC result through a BF16 out_d before writing it back to c1/c2?
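To make the two alternatives concrete, below is a rough software model in Python of the feedback path I am describing. It is a sketch of my hypothesis only, not of the actual PE.scala logic: os_accumulate and its truncate_feedback flag are hypothetical names, the MAC is approximated by a double-precision multiply-add rounded to FP32 (not a bit-exact model of the hardware FMA), and the truncating branch assumes the feedback keeps the low 16 bits and zero-extends them.

```python
import struct

def fp32_bits_to_float(bits):
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def float_to_fp32_bits(x):
    """Round a Python float to FP32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bf16_to_float(bits):
    """Interpret a 16-bit pattern as BF16 (the top half of an FP32 word)."""
    return fp32_bits_to_float((bits & 0xFFFF) << 16)

def os_accumulate(a_bits, b_bits, k, truncate_feedback):
    """Accumulate k products of two BF16 operands into an FP32 register c.

    truncate_feedback=False: the FP32 MAC result is written back unchanged.
    truncate_feedback=True:  hypothesized bug, only the low 16 bits of the
                             MAC result are written back, zero-extended.
    """
    a = bf16_to_float(a_bits)
    b = bf16_to_float(b_bits)
    c_bits = 0
    for _ in range(k):
        mac_raw = float_to_fp32_bits(a * b + fp32_bits_to_float(c_bits))
        c_bits = (mac_raw & 0xFFFF) if truncate_feedback else mac_raw
    return c_bits

intended = os_accumulate(0x3F99, 0x3F99, 4, truncate_feedback=False)
hypothesized = os_accumulate(0x3F99, 0x3F99, 4, truncate_feedback=True)
print(f"FP32 feedback:      c = 0x{intended:08x}")
print(f"truncated feedback: c = 0x{hypothesized:08x}")
```

Under these assumptions the truncated path reproduces the c2=0x0000e200 value from my log, while the FP32 feedback path reaches the expected sum. The model does not try to reproduce the later c2=0x00010000 value, which I cannot explain yet.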
Any guidance would be very helpful. I can provide the small test case and the debug log if needed.
Thanks!