Why Even Bother With FPGAs?
###########################

FPGAs, being alternative processors, attract a fair bit of skepticism,
especially from people higher up the pyramid of computing abstractions
(software engineers and the like). This post is my attempt to persuade the
skeptics by way of an instance where FPGAs blow every other kind of processor
out of the water.

**TL;DR**: FPGAs allow full DNN inference at nanosecond latency, limited
chiefly by how fast signals propagate through the circuit. A CPU or GPU, in
comparison, can execute only a handful of instructions in a nanosecond, while
an entire inference requires millions or billions of them.

FPGAs for the Unenlightened
---------------------------

FPGAs are circuit emulators. A digital circuit consists of logic gates and
connections between them; an FPGA emulates both the gates and the
connections.

Logic gates can be represented by their `Truth Table
<https://en.wikipedia.org/wiki/Truth_table>`_. A truth table is a form of
hash table whose key is a tuple of binary values, one per input, and whose
value is the single bit the gate outputs. One kind of FPGA (SRAM-based)
emulates logic gates by storing truth tables in memory; the stored tables are
called lookup tables (LUTs).
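
To make this concrete, here is a minimal sketch of a 4-input LUT (not any
vendor's actual primitive): the inputs form the key, and the stored 16-bit
``INIT`` vector is the truth table.

.. code:: verilog

    // A 4-input LUT: 16 stored bits, indexed by the 4 input bits.
    // INIT = 16'h8000 makes it a 4-input AND (only key 1111 reads a 1);
    // INIT = 16'hFFFE makes it a 4-input OR (only key 0000 reads a 0).
    module lut4 #(parameter [15:0] INIT = 16'h8000) (
        input  [3:0] key,  // the tuple of input bits
        output       out   // the bit stored at that key
    );
        wire [15:0] truth_table = INIT;
        assign out = truth_table[key];
    endmodule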

Connections are emulated via programmable interconnect. Think of a network
switch; programmable interconnect is much the same thing, except at a very
low level. `This document
<https://cse.usf.edu/~haozheng/teach/cda4253/doc/fpga-arch-overview.pdf>`_
explains in detail the different VLSI architectures present in modern FPGAs.

A programmer usually does not describe circuits in the form of logic gates;
they use abstractions in the form of HDLs (hardware description languages) to
describe, behaviorally, the operations a circuit must perform. A compiler
then maps the HDL program onto FPGA primitives.
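
For instance, the behavioral description below states *what* to compute, not
which gates to use; the synthesizer decides how to realize it in LUTs and
carry chains. (The module is illustrative, not taken from any real design.)

.. code:: verilog

    // Behavioral Verilog: '+' and '>' name operations, not gates.
    module add_cmp (
        input  [7:0] a,
        input  [7:0] b,
        output [8:0] sum,     // 9 bits: an 8-bit add can carry out
        output       a_gt_b
    );
        assign sum    = a + b;    // mapped onto LUTs and a carry chain
        assign a_gt_b = (a > b);  // mapped onto a LUT-based comparator
    endmodule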

As should be obvious by now, FPGAs are unlike processors: they do not have an
"Instruction Set Architecture". If one is needed, the programmer must design
and implement an ISA themselves [#fpga_arch]_. FPGAs require thinking of
problems as circuits with inputs and outputs.

The Central Argument for FPGAs
------------------------------

Now, let's build the argument.

Deep Neural Network (DNN) inference demands a lot of compute and is a pretty
challenging problem. Solutions to this problem manifest in the form of ASIC
accelerators and GPUs. More performance can always be had by scaling said
processors, but of course there is a limit to how far one can scale. For
example, on the `NVIDIA Jetson Nano
<https://developer.nvidia.com/embedded/jetson-nano>`_, inferring a single
image with the CNN model ResNet-50 takes ~72 ms. What if we needed something
much faster, say the same inference within a few nanoseconds? A GPU or ASIC
would only manage a couple of instructions in that timeframe, let alone
complete the inference. They certainly won't suffice.

This requirement is not made up: nanosecond-scale DNN inference is a real
problem faced by a team at CERN working on the Large Hadron Collider.

Here's a short description of the problem from their `paper
<https://arxiv.org/pdf/2006.10159>`_:

    *The hardware triggering system in a particle detector at the CERN LHC is
    one of the most extreme environments one can imagine deploying DNNs.
    Latency is restricted to O(1)µs, governed by the frequency of particle
    collisions and the amount of on-detector buffers. The system consists of a
    limited amount of FPGA resources, all of which are located in underground
    caverns 50-100 meters below the ground surface, working on thousands of
    different tasks in parallel. Due to the high number of tasks being
    performed, limited cooling capabilities, limited space in the cavern, and
    the limited number of processors, algorithms must be kept as
    resource-economic as possible. In order to minimize the latency and
    maximize the precision of tasks that can be performed in the hardware
    trigger, ML solutions are being explored as fast approximations of the
    algorithms currently in use.*

Solutions
---------

There are, broadly speaking, two ways of solving this problem:

1. The ASIC Way
===============

This includes CPUs, GPUs, TPUs, or any other ASIC. The idea is to have a
large grid of multipliers and adders carrying out as many multiply-accumulate
operations in parallel as possible. To gain more performance, research goes
into raising the frequency of the chip. Compilers and specialized frameworks
help abstract the computation. And if we need still more performance,
specialized engineers (who have mastered assembly language) are called upon
to write performant kernels, using clever tricks to get the fastest possible
dot product.
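
The workhorse of this approach is the multiply-accumulate (MAC) unit, tiled
by the thousands. Here is a minimal sketch of one such unit (illustrative,
not any real accelerator's design):

.. code:: verilog

    // One clocked multiply-accumulate unit. An accelerator tiles a grid
    // of these; performance scales with tile count and clock frequency.
    module mac #(parameter W = 8) (
        input                       clk,
        input                       clear,
        input  signed [W-1:0]       x,    // activation
        input  signed [W-1:0]       w,    // weight
        output reg signed [4*W-1:0] acc   // running dot product
    );
        always @(posedge clk) begin
            if (clear) acc <= 0;
            else       acc <= acc + x * w;  // one MAC per clock cycle
        end
    endmodule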

2. The FPGA Way
===============

Here, the idea is to exploit the FPGA's programming model. Instead of writing
a program for our problem, we design a circuit for it. Each layer of the
neural network is represented by a circuit; inside a layer, every dot product
is itself a circuit. If the neural network is not prohibitively large, we can
even fit the entire thing as one combinational circuit.

As you might have learnt in a digital circuits course, combinational circuits
do not contain any clocks, i.e. there is no notion of frequency: inputs come
in, outputs go out. The speed of computation is bottlenecked only by the
propagation delay of signals through the chip. How cool is that?!
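
As a sketch of what "a layer as a circuit" means, here is a single tiny
neuron with weights hard-wired at synthesis time (the weights ``{2, -1, 3}``
are made up for illustration):

.. code:: verilog

    // A neuron as pure combinational logic: no clock, no instructions.
    // The output settles as soon as the inputs finish propagating.
    module neuron (
        input  signed [3:0] x0, x1, x2,  // 4-bit activations
        output signed [7:0] y
    );
        // dot product with constant weights; becomes adders and LUTs
        wire signed [7:0] dot = 2*x0 - x1 + 3*x2;
        // ReLU, also combinational
        assign y = (dot > 0) ? dot : 8'sd0;
    endmodule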

Flaws with the FPGA Way
-----------------------

One of the biggest flaws with fitting entire problems on the FPGA is the
`combinatorial explosion
<https://en.wikipedia.org/wiki/Combinatorial_explosion>`_ in complexity. For
example, there are `well-known algorithms
<https://en.wikipedia.org/wiki/Booth's_multiplication_algorithm>`_ that yield
very efficient multiplier circuits. One can avoid going that route by
encoding the multiplier directly as truth tables: instead of calculating the
output of a multiplication, we remember it and look it up. Here's Verilog for
a 2-bit multiplier:

.. code:: verilog

    // Unsigned 2-bit multiplier written directly as truth tables:
    // each output bit is a sum-of-products over the input bits.
    module mul (
        input  [1:0] a,
        input  [1:0] b,
        output [3:0] out
    );
        assign out[3] = (a[0] & a[1] & b[0] & b[1]);
        assign out[2] = (~a[0] & a[1] & b[1]) | (a[1] & ~b[0] & b[1]);
        assign out[1] = (~a[0] & a[1] & b[0]) | (a[0] & ~b[0] & b[1])
                      | (a[0] & ~a[1] & b[1]) | (a[1] & ~b[1] & b[0]);
        assign out[0] = (a[0] & b[0]);
    endmodule

Each output bit is just a combination of the input bits.

Here's the problem: this method of designing multipliers does not scale! The
2-bit multiplier takes 4 LUTs (pretty reasonable), but the same approach for
an 8-bit multiplier takes ~18,000 LUTs and 3+ hours to synthesize (awful).
The cost grows exponentially with input width: a truth table over two n-bit
inputs has 2^(2n) rows. Many large neural networks would have a hard time
fitting on an FPGA this way.

This doesn't signal the end for FPGAs, however. There's still a strong case
to be made for them, just as the team at CERN has demonstrated; in fact, they
are actively exploiting it. They discovered that neural network layers can be
*heterogeneously quantized*, meaning each layer can have a different
precision depending on its significance in the computation pipeline, as
outlined in their work `here <https://fastmachinelearning.org/hls4ml/>`_.

If an entire network cannot fit on an FPGA, fast reconfiguration can provide
a solution. This involves configuring the hardware for one layer, processing
its outputs, then reconfiguring the hardware for the next layer, and so on.
The approach can be further refined to enable reconfiguration at a
per-channel level, allowing smaller FPGAs with limited resources to
participate. A 'compiler' would orchestrate the computation offline,
determining the sequence and timing of reconfigurations before the actual
computation begins.

Recent interest in hyper-quantization, i.e. `1-bit
<https://github.com/kyegomez/BitNet>`_, 2-bit, 3-bit ... networks, is a big
win for the FPGA way. The lower the precision, the more efficient and
practical the truth-table approach becomes, making FPGAs a great fit for it.
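
To see why extreme quantization suits LUTs so well, consider the 1-bit case.
With activations and weights in {-1, +1}, each encoded as a single bit, a dot
product needs no multipliers at all. A minimal sketch, assuming a
hypothetical 8-element layer, reduces it to an XNOR and a popcount:

.. code:: verilog

    // 1-bit dot product: XNOR marks positions where activation and
    // weight agree (a +1 contribution); the popcount tallies them.
    module bnn_dot #(parameter N = 8) (
        input  [N-1:0]      x,    // N 1-bit activations
        input  [N-1:0]      w,    // N 1-bit weights
        output signed [7:0] dot   // result in [-N, +N]
    );
        wire [N-1:0] agree = ~(x ^ w);  // XNOR
        integer i;
        reg signed [7:0] ones;
        always @* begin
            ones = 0;
            for (i = 0; i < N; i = i + 1)
                ones = ones + agree[i];
        end
        // (#agreements) - (#disagreements) = 2*ones - N
        assign dot = 2 * ones - N;
    endmodule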

Conclusion
----------

With the FPGA way, many problems spanning different domains can be solved in
interesting and (sometimes) superior ways. At my workplace, we've started
researching the FPGA way, trying to bring it out of the depths of complexity
and apply it to practical problems.

The intention of this post is not to compare ASICs and FPGAs (comparisons are
futile), but to highlight how FPGAs ought to be seen and used. Over the next
few months, I'll write more on this research as I uncover it myself. I'll
leave you with some links advocating for the FPGA way [#fpga_way]_:

- `Learning and Memorization - Satrajit Chatterjee
  <https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18a.pdf>`_
- `LUTNet <https://arxiv.org/abs/1904.00938>`_
- `George Constantinides and his team
  <https://scholar.google.com/citations?user=NTn1NJAAAAAJ&hl=en>`_
- `The hls4ml team <https://fastmachinelearning.org/hls4ml/>`_

.. rubric:: Footnotes

.. [#fpga_arch] The term "architecture" is a bit overloaded. The first
   meaning is the VLSI sense, i.e. how LUTs and interconnect are organized to
   make up the FPGA. Another usage describes the higher-level components
   designed **on top** of the FPGA: think matmul engines, caches, etc.
   "Architecture" has meaning at different levels of circuit design.

.. [#fpga_way] This is a term I've coined myself; I've not seen anyone else
   use it in their work.