Please refer to the discussion at #20 (comment).
The main issue is that it uses two `builder.get()` calls to read each spec $j$'s evaluations at layer $i$, which takes about 15 cycles. If we change the memory layout of `spec_evals: Vec<Vec<Vec<E>>>` so that it is indexed by layer $i$ first and then by spec $j$, the first memory load can be shared among the specs.
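A rough sketch of the proposed re-indexing, in plain Rust rather than the actual `builder.get()` API. Everything here is an assumption for illustration: `E` is replaced by `u64`, the helper names (`to_layer_major`, `count_loads_spec_major`, `count_loads_layer_major`) are hypothetical, and each `Vec` indexing step is counted as one "load" to model one `builder.get()`. It does not model the 15-cycle cost itself, only how many outer lookups are needed.

```rust
/// Stand-in for the field element type `E` (assumption for this sketch).
type E = u64;

/// Re-index `spec_evals[j][i]` (spec first) as `by_layer[i][j]` (layer first).
/// Assumes every spec has the same number of layers.
fn to_layer_major(spec_evals: &[Vec<Vec<E>>]) -> Vec<Vec<Vec<E>>> {
    let num_layers = spec_evals.first().map_or(0, |spec| spec.len());
    (0..num_layers)
        .map(|i| spec_evals.iter().map(|spec| spec[i].clone()).collect())
        .collect()
}

/// Current layout: for every (layer i, spec j) pair we do two lookups,
/// one to reach spec j and one to reach its layer-i evaluations.
fn count_loads_spec_major(spec_evals: &[Vec<Vec<E>>], num_layers: usize) -> usize {
    let mut loads = 0;
    for i in 0..num_layers {
        for j in 0..spec_evals.len() {
            let per_spec = &spec_evals[j]; // 1st lookup, repeated per spec
            loads += 1;
            let _evals = &per_spec[i]; // 2nd lookup
            loads += 1;
        }
    }
    loads
}

/// Proposed layout: the lookup of layer i happens once and is shared,
/// then each spec j needs only one further lookup.
fn count_loads_layer_major(by_layer: &[Vec<Vec<E>>]) -> usize {
    let mut loads = 0;
    for per_layer in by_layer {
        loads += 1; // 1st lookup, shared among all specs at this layer
        for _evals in per_layer {
            loads += 1; // one lookup per spec
        }
    }
    loads
}
```

Under this model, with $L$ layers and $S$ specs, the lookup count goes from $2LS$ to $L(1 + S)$.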