Feature Request: 16-bit FFT and Matmul

There is currently an 8-bit matrix multiplication and a 32-bit FFT. Our data fits neatly into 16 bits but not into 8, so at the moment we convert to and from 32-bit for the FFT and implement our own 16-bit matrix multiplication.
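
To make the cost of the workaround concrete, here is a minimal sketch in plain Python. It is not the library's API: every function name is hypothetical, and it assumes the data are signed 16-bit integers stored little-endian. It only shows the shape of the round trip (widen to 32-bit, run the 32-bit routine, narrow back) and a naive 16-bit matrix multiplication with a wide accumulator.

```python
import struct

I16_MIN, I16_MAX = -32768, 32767


def pack_i16(samples):
    """Pack a list of ints as little-endian signed 16-bit values."""
    return struct.pack("<%dh" % len(samples), *samples)


def widen_to_i32(buf16):
    """Copy a 16-bit buffer into a 32-bit buffer (twice the bytes)."""
    n = len(buf16) // 2
    values = struct.unpack("<%dh" % n, buf16)
    return struct.pack("<%di" % n, *values)


def narrow_to_i16(buf32):
    """Copy a 32-bit buffer back to 16-bit, saturating out-of-range values."""
    n = len(buf32) // 4
    values = struct.unpack("<%di" % n, buf32)
    clipped = [max(I16_MIN, min(I16_MAX, v)) for v in values]
    return struct.pack("<%dh" % n, *clipped)


def matmul_i16(a, b):
    """Naive matrix product of nested lists holding 16-bit values.

    Python ints do not overflow, so the accumulator here is effectively
    wide; a native version would need at least a 32-bit accumulator.
    """
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


if __name__ == "__main__":
    samples = [-32768, -1, 0, 1234, 32767]
    buf16 = pack_i16(samples)
    buf32 = widen_to_i32(buf16)   # input to the existing 32-bit FFT
    print(len(buf16), len(buf32))
    print(matmul_i16([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The widened buffer takes twice the memory of the original, and each conversion is an extra pass over the data. Those are the two costs a native 16-bit path would avoid.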

Native 16-bit implementations could potentially be faster and use less memory, since they would avoid the conversion passes and the wider intermediate buffers. Any speedup for matrix multiplication in particular would be very welcome.