• FINN: A Framework for Fast, Scalable Binarized Neural Network Inference (2016, cs.CV)


    Abstract

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values.

    In this paper, we present Finn, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture.

    By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements.

    On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy.

    1. Introduction

    Modern CNNs may contain millions of floating-point parameters and require billions of floating-point operations to recognize a single image. Furthermore, these requirements tend to increase as researchers explore deeper networks. For instance, AlexNet [14] (the winning entry for the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [22] in 2012) required 244 MB of parameters and 1.4 billion floating-point operations (GFLOP) per image, while VGG-16 [24] from ILSVRC 2014 required 552 MB of parameters and 30.8 GFLOP per image.

    However, a growing body of research demonstrates that this floating-point approach incorporates significant redundancy.
    Recently, it has been shown [5, 26, 21, 12, 31] that neural networks can classify accurately using one- or two-bit quantization for weights and activations.

    Such a combination of low-precision arithmetic and small memory footprint presents a unique opportunity for fast and energy-efficient image classification using Field-Programmable Gate Arrays (FPGAs).

    Binarized Neural Networks (BNNs), proposed by Courbariaux et al. [5], are particularly appealing since they can be implemented almost entirely with binary operations, with the potential to attain performance in the teraoperations per second (TOPS) range on FPGAs.

    In this work, we propose Finn, a framework for building scalable and fast BNN inference accelerators on FPGAs.
    Finn-generated accelerators can perform millions of classifications per second with sub-microsecond latency, thereby
    making them ideal for supporting real-time embedded applications such as augmented reality, autonomous driving and
    robotics.

    The novel contributions are:

    • Quantification of peak performance for BNNs on FPGAs using a roofline model.
    • A set of novel optimizations for mapping BNNs onto FPGAs more efficiently.
    • A BNN architecture and accelerator construction tool, permitting customization of throughput.
    • A range of prototypes that demonstrate the potential of BNNs on an off-the-shelf FPGA platform.

    Paper organization:

    • Section 2: background on CNNs and BNNs and their hardware implementation
    • Section 3: BNN accuracy and peak performance on FPGAs
    • Section 4: the FINN architecture and its optimizations
    • Section 5: experimental evaluation
    • Section 6: conclusions

    2. Background

    2.2 Binary Neural Networks

    Although floating point numbers are a natural choice for handling the small updates that occur during neural network training, the resulting parameters can contain a lot of redundant information [8]. One of several possible dimensions possessing redundancy is precision [26].

    An extreme case is BNNs, in which some or all of the arithmetic involved in computing the outputs is constrained to single-bit values.

    We consider three aspects of binarization for neural network layers: binary input activations, binary synapse weights and binary output activations. If all three components are binary, we refer to this as full binarization, and the cases with one or two components as partial binarization.

    Kim and Smaragdis [12] report 98.7% accuracy with fully-connected networks on the MNIST dataset, and observe that only XNOR and bitcount operations are necessary for computing with such neural networks.

    XNOR-Net by Rastegari et al. [21] applies convolutional BNNs on the ImageNet dataset with topologies inspired by AlexNet, ResNet and GoogLeNet, reporting top-1 accuracies of up to 51.2% for full binarization and 65.5% for partial binarization.

    DoReFa-Net by Zhou et al. [31] explores reduced precision during the forward pass as well as the backward pass, and notes that this opens interesting possibilities for training neural networks on FPGAs.

    Finally, the work by Courbariaux et al. [5] describes how to train fully-connected and convolutional networks with full binarization and batch normalization layers, reporting competitive accuracy on the MNIST, SVHN and CIFAR-10 datasets.
    Training for this work was performed using their open source implementation.

    2.3 Neural Networks in Hardware

    A great deal of prior work on mapping neural networks to hardware exists, both for FPGAs and as ASICs. We refer the reader to the work by Misra and Saha [16] for a comprehensive survey.

    We cover a recent and representative set of works here, roughly dividing them into four categories based on their basic architecture:

      1. a single processing engine [19, 30, 4, 2], usually in the form of a systolic array, which processes each layer sequentially;
      2. a streaming architecture [27, 1], consisting of one processing engine per network layer;
      3. a vector processor [7] with instructions specific to accelerating the primitive operations of convolutions;
      4. a neurosynaptic processor [6], which implements many digital neurons and their interconnecting weights.


    Systolic arrays: Zhang et al. [30] describe a single-processing-engine-style architecture, using a roofline model to design accelerators optimized for the execution of each layer. Ovtcharov et al. [19] implement a similar style of architecture, but achieve a 3× speedup over Zhang et al. [30]. Eyeriss by Chen et al. [4] uses 16-bit fixed point rather than floating point, and combines several different data reuse strategies. Each 2D convolution is mapped to 1D convolutions across multiple processing engines, allowing for completely regular access patterns for each processing element. The authors report that their data reuse provides 2.5× better energy efficiency over other methods.

    Streaming architectures: Venieris and Bouganis [27] proposed a synchronous dataflow (SDF) model for mapping CNNs to FPGAs, which is a similar approach to ours. The main difference is that our design is optimized for BNNs while their design targets conventional CNNs.
    Alemdar et al. [1] implement fully-connected ternary-weight neural networks with streaming and report up to 255K frames per second on the MNIST dataset, but concentrate on the training aspect for those networks.

    Vector processors: Farabet et al. [7] describe a programmable ConvNet Processor (CNP), which is a RISC vector processor with specific macro-instructions for CNNs including 2D convolutions, 2D spatial pooling, dot product and an elementwise non-linear mapping function. The authors also created a tool to compile a high-level network description into host code, which is used to call the CNP.

    Neurosynaptic processors: TrueNorth [6] is a low power, parallel ASIC with 4096 neurosynaptic cores, each implementing 256 binary inputs, 256 neurons and a 256 × 256 array of synapses. An internal spiking router can connect any input on any core to any neuron on any core, allowing many network topologies to be implemented on fixed hardware.

    The authors are not aware of any publication that investigates mapping of fully binarized neural networks onto FPGAs.
    In comparison to prior art, the binary network inference engine can significantly increase classification rates, while reducing power consumption and minimizing latency.

    This currently comes at the cost of a small drop in accuracy for larger networks; however, we believe a) there are use cases that do not require the highest level of accuracy, or that can be solved with smaller networks (such as classification of playing cards or handwritten digits [15]), and b) that the accuracy can be improved by increasing network sizes [26]. This last point is an ongoing topic in machine learning research.

    3. Binary performance and accuracy

    3.1 Estimate Performance Using Rooflines

    To estimate and compare BNN performance with fixed-point CNN performance, we use a roofline model [29], which considers memory bandwidth, peak computational performance and arithmetic intensity (the number of mathematical operations performed for each byte of off-chip memory read or written).

    The intersection of the roofline curve with a vertical line for a particular arithmetic intensity, gives the theoretical peak performance point, which is either compute-bound or memory-bound.
    (The paper does not explain the details very thoroughly; the roofline curves are generated for a given FPGA device using a method taken from another publication, which this paper does not describe.)
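    As a minimal sketch of the roofline idea (the device numbers below are illustrative placeholders, not the ZC706 figures used in the paper), attainable performance is simply the minimum of the compute roof and the memory roof at a given arithmetic intensity:

```cpp
#include <algorithm>
#include <cstdio>

// Attainable performance (ops/s) under a simple roofline model:
// the minimum of the device's compute roof and its memory roof.
double roofline(double peak_ops_per_s,       // compute roof of the device
                double mem_bw_bytes_per_s,   // off-chip memory bandwidth
                double arithmetic_intensity) // ops per byte moved off-chip
{
    return std::min(peak_ops_per_s, mem_bw_bytes_per_s * arithmetic_intensity);
}

int main() {
    const double peak = 60e12;   // hypothetical 60 TOPS binary compute roof
    const double bw   = 12.8e9;  // hypothetical 12.8 GB/s DRAM bandwidth
    const double ais[] = {100.0, 1000.0, 10000.0};
    for (double ai : ais)
        std::printf("AI = %6.0f ops/byte -> %.2f TOPS attainable\n",
                    ai, roofline(peak, bw, ai) / 1e12);
}
```

    At low arithmetic intensity the result tracks the memory roof (memory-bound); past the crossover point it saturates at the compute roof (compute-bound).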

    3.2 Accuracy-Computation Tradeoffs

    We conducted a set of experiments on the MNIST dataset that compare the accuracy of floating-point and binary precision for the same topology.
    The binary networks are obtained by replacing regular layers with their binary equivalents, as described by Courbariaux et al. [5].
    We also binarize the input images for the BNN as our experiments show that input binarization works well for MNIST.

    Since the space of possible network topologies that can be trained is infinite, we adopted the approach of [26] to simplify the problem. We fix the network topology to a fully connected network with 3 hidden layers while scaling the number of neurons in each layer, and report the resulting accuracy in Table 1 along with the number of parameters and operations per frame.

    A few trends are apparent for this problem and network configuration space: 1) similar to what was found by Sung et al. [26], as the network size increases, the difference in accuracy between low-precision networks and floating-point networks decreases; and 2) in order to achieve the same level of accuracy as floating-point networks, BNNs require 2-11× more parameters and operations.

    Our BNN performance estimates from Section 3.1 suggest a 16× speedup for BNNs over 8-bit fixed point, which is greater than the 2-11× increase in parameters and operations. Thus, we expect that BNNs with comparable accuracy will be faster than fixed-point networks, even though they may require more parameters and operations; even in the worst case of 11× more work, the expected net speedup is roughly 16/11 ≈ 1.5×.

    4. BNNs on Reconfigurable Logic

    4.1 Architecture

    We build a custom architecture for a given topology rather than scheduling operations on top of a fixed architecture. Separate compute engines are dedicated to each layer and communicate via on-chip data streams. Each engine starts to compute as soon as the previous engine starts to produce output. Additionally, owing to the compact model size of BNNs, all neural network parameters are kept in on-chip memory. This avoids most accesses to off-chip memory, minimizes the latency (the time to finish classifying one image) by overlapping computation and communication, and minimizes the initiation interval: a new image can enter the accelerator as soon as the first compute array is finished with the previous image.

    The separate mapping of layers to compute arrays also enables heterogeneity. By tailoring compute arrays separately for each layer's requirements, we can avoid the "one-size-fits-all" inefficiencies and reap more of the benefits of reconfigurable computing.
    This requires a different bitfile when the neural network topology is changed, but we consider this an acceptable cost for the performance gains obtained.

    A BNN accelerator may have various constraints imposed upon it depending on the use case. User-imposed constraints include the choice of FPGA and platform, the desired classification throughput in frames per second (FPS), and the clock frequency.
    Simultaneously, the BNN topology constrains how the compute resources must be allocated to obtain an efficient heterogeneous streaming architecture.

    Finn offers parameterizable building blocks and a way of controlling the classification throughput, as described in Sections 4.3 and 4.4.

    To achieve portability, we chose a commercial high level synthesis tool, Vivado High-Level Synthesis (HLS), for the implementation. The tool enables faster development cycles via high-level abstractions, and provides automated pipelining to meet the clock frequency target.

    4.2 BNN-specific Operator Optimizations

    BNNs have several properties that enable a more efficient mapping to FPGAs without affecting the network accuracy, which we describe in the following subsections.

    4.2.1 popcount for accumulation

    The regular and value-constrained nature of BNN computations enable computing binary dot products with fewer hardware resources.

    The practical consequence for hardware is that the summation of a binary dot product can be implemented by a popcount operation that counts the number of set bits instead of accumulation with signed arithmetic.

    For instance, with a target Fclk = 200 MHz, a 128-bit popcount-accumulate requires 376 LUTs and 29 FFs, while a 128-bit bipolar-accumulate requires 759 LUTs and 84 FFs.
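    A minimal software sketch of this optimization (the bit-packing convention below, 1 for +1 and 0 for -1, is an assumption for illustration, not FINN's internal encoding): XNOR marks the positions where activation and weight agree, and a popcount of those agreements replaces signed accumulation.

```cpp
#include <bitset>
#include <cstdint>

// Binary dot product over 64 bipolar (+1/-1) values packed one per bit
// (bit = 1 encodes +1, bit = 0 encodes -1). If p of the 64 positions agree,
// the bipolar dot product equals p - (64 - p) = 2*p - 64, so only an XNOR
// and a popcount are needed -- no signed multiply-accumulate.
int binary_dot64(uint64_t activations, uint64_t weights) {
    uint64_t agree = ~(activations ^ weights);                 // XNOR
    int p = static_cast<int>(std::bitset<64>(agree).count());  // popcount
    return 2 * p - 64;
}
```

    In hardware even the final 2·p − 64 rescaling can in principle be folded into the threshold described in the next subsection, leaving only the popcount in the datapath.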

    4.2.2 batchnorm-activation as threshold

    All BNN layers use batch normalization [11] on convolutional or fully connected layer outputs, then apply the sign function to determine the output activation. We show how the same output can be computed via thresholding.

    Using these techniques, we can compute the output activation using an unsigned comparison and avoid computing the batch normalized value altogether during inference.

    Synthesis reports from Vivado HLS for 16-bit dot product output values indicate that regular batchnorm-and-sign activation requires 2 DSPs, 55 FFs and 40 LUTs, whereas the threshold activation we describe here only requires 6 LUTs.
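    A minimal sketch of the algebra behind this folding (generic parameter names, not FINN's internal representation; it assumes gamma != 0): solving BN(acc) = 0 for the accumulator gives a single threshold, and the sign of gamma decides the direction of the comparison.

```cpp
#include <cmath>

struct BNParams { double gamma, beta, mean, var, eps; };

struct ThresholdActivation {
    double tau;   // compare the dot-product accumulator against this value
    bool   geq;   // true: output is +1 when acc >= tau; false: when acc <= tau
};

// Fold batchnorm + sign activation into a single comparison, so that
// sign(gamma * (acc - mean) / sqrt(var + eps) + beta) can be evaluated at
// inference time without ever computing the batch-normalized value.
ThresholdActivation fold_bn_into_threshold(const BNParams& p) {
    double s   = std::sqrt(p.var + p.eps);
    double tau = p.mean - p.beta * s / p.gamma;  // the accumulator value where BN crosses zero
    return { tau, p.gamma > 0 };                 // negative gamma flips the comparison
}
```

    Because the thresholds are computed offline from the trained batchnorm parameters, inference only needs the unsigned comparison mentioned above.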

    4.2.3 Boolean OR for Max-pooling

    The networks described in [5] perform pooling prior to activations, i.e. pooling is performed on non-binarized numbers, which are then batch normalized and fed into the activation function. We show that the same layer outputs can be derived by max pooling after the activations without having to re-train the network.
    As the threshold comparisons are already computed for the activations, max-pooling can be effectively implemented with the Boolean OR-operator. We note that similar principles apply for min-pooling (as Boolean AND) and average-pooling (as Boolean majority function) as well.
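    A one-line sketch of the idea with activations packed bitwise (the packing itself is an assumption for illustration): the max of a window of binary activations is 1 exactly when any of them is 1, so 2×2 max-pooling reduces to a 4-input Boolean OR.

```cpp
#include <cstdint>

// 2x2 max-pooling over binary activations packed one per bit: the pooled
// activation is set iff any activation in the window is set, i.e. Boolean OR.
// (Min-pooling would be AND; average-pooling a bitwise majority vote.)
uint64_t maxpool_2x2_binary(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    return a | b | c | d;
}
```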

    4.3 FINN Design Flow and Hardware Library

    Figure 4 illustrates the design flow for converting a trained BNN into an FPGA accelerator.

    The user supplies a FPS target alongside a Theano-trained BNN to the Finn synthesizer. The synthesizer first determines the folding parameters (Section 4.4) to meet the FPS target and applies the optimizations from Section 4.2, then produces a synthesizable C++ description of a heterogeneous streaming architecture. The architecture is composed of building blocks from the Finn hardware library described in the following subsections.

    4.3.1 Matrix-Vector-Threshold Unit (MVTU)

    The Matrix-Vector-Threshold Unit (MVTU) forms the computational core for our accelerator designs. The vast majority of compute operations in a BNN can be expressed as matrix-vector operations followed by thresholding. For instance...
    Convolutions can also be implemented as matrix-vector products, as will be described in Section 4.3.2. As such, the MVTU implements fully-connected layers as a standalone component, and is also used as part of the convolutional layers.

    Internally, the MVTU consists of an input and output buffer, and an array of Processing Elements (PEs) each with a number of SIMD lanes. The number of PEs (P) and SIMD lanes (S) are configurable to control the throughput as discussed in Section 4.4.1.

    The synapse weight matrix to be used is kept in On-Chip Memory (OCM) distributed between PEs, and the input images stream through the MVTU as each one is multiplied with the matrix.

    In terms of the taxonomy described in [4], this architecture is both weight stationary (since each weight remains local to the PE) and output stationary (since each popcount computation remains local to the PE).

    Figure 6 shows the datapath of an MVTU PE.

    Following this, the number of set bits in the result is counted (see Section 4.2.1) and added to the accumulator register. Once the entire dot product is accumulated, it is thresholded. The accumulator, adder and threshold memory are T bits wide, where T can be scaled down to T = 1 + log2(Y) for additional resource savings (e.g., a row of Y = 1024 synapses needs only an 11-bit accumulator).

    4.3.2 Convolution: The Sliding Window Unit

    The convolutional layer consists of a Sliding Window Unit (SWU), which generates the image matrix from incoming feature maps, and an MVTU that actually computes the matrix-matrix product, using a different column vector from the image matrix each time.

    In order to better cater for the SIMD parallelism of the MVTU and minimize buffering requirements, we interleave the feature maps such that each pixel contains all the Input Feature Map (IFM) channel data for that position ...
    ...
    Note that interleaving the filter matrix has no additional cost since it is done offline, and interleaving the input image can be done on-the-fly in the FPGA.
    ...
    Storing the pixels in this fashion allows us to implement the SWU with a single wide OCM instead of multiple narrow OCMs, and also enables the output of the MVTU to be directly fed to the next layer without any transposition.
    This part describes how convolution is implemented: the filter matrix and the image matrix are interleaved in a particular way, the goals being higher parallelism, lower buffering requirements, and a data layout that matches the MVTU. (I did not fully understand the details; the paper is not entirely clear here.)
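    To make the lowering concrete, here is a generic im2col-style sketch (plain floats, channel-major layout, stride 1, no padding; this is the textbook lowering, not FINN's exact interleaved pixel format): each output position becomes one column of length C·K·K, turning the convolution into a matrix-matrix product that the MVTU can consume one column vector at a time.

```cpp
#include <vector>

// Generic im2col lowering: for a C x H x W input feature map and K x K kernels,
// build one column of length C*K*K per output position, so convolution becomes
// a (filters x C*K*K) by (C*K*K x output_positions) matrix-matrix product.
std::vector<std::vector<float>> im2col(const std::vector<float>& ifm, // size C*H*W, channel-major
                                       int C, int H, int W, int K) {
    int OH = H - K + 1, OW = W - K + 1;
    std::vector<std::vector<float>> cols(OH * OW, std::vector<float>(C * K * K));
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox) {
            auto& col = cols[oy * OW + ox];
            int idx = 0;
            for (int c = 0; c < C; ++c)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        col[idx++] = ifm[(c * H + (oy + ky)) * W + (ox + kx)];
        }
    return cols;
}
```

    FINN's SWU performs this expansion on the fly, with the channel data of each pixel interleaved so that a single wide OCM suffices, as described above.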

    4.3.3 The Pooling Unit

    4.4 Folding

    In terms of the MVTU description given in Section 4.3.1, each PE corresponds to a hardware neuron, while each SIMD lane acts as a hardware synapse.

    The amount of hardware resources on an FPGA is limited, and it is necessary to time-multiplex (or fold) the BNN onto fewer hardware synapses and neurons.

    We now describe how the folding is performed subject to user constraints.

    The work by Venieris et al. [27] describes a method for folding neural networks expressed as streaming dataflow graphs, with focus on formalizing the folding and design space exploration.

    In this work, we consider a simpler variant that only controls the folding of matrix-vector products to achieve a given FPS requirement set by the user, and focus on how the folding is implemented in terms of the workload mapping.

    Folding directly affects the resource and power consumption of the final system as well.

    4.4.1 Folding Matrix-Vector Products

    Folding matrix-vector products is achieved by controlling two parameters of the MVTU: P, the number of PEs, and S, the number of SIMD lanes per PE.

    A P-high, S-wide tile of the matrix is processed at a time, with each row in the tile mapped to a different PE, and each column to a different SIMD lane. For an X × Y matrix, we refer to F^n = X/P as the neuron fold and F^s = Y/S as the synapse fold. The total fold is then F = F^n · F^s.
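    The schedule can be sketched in plain software as follows (plain doubles and a contiguous row tiling are assumptions for illustration; the real MVTU operates on packed binary values with XNOR/popcount, and X and Y are assumed divisible by P and S):

```cpp
#include <vector>

// Software model of the folded matrix-vector schedule: an X x Y matrix is
// processed as F_n * F_s tiles of size P x S. One tile is handled per "cycle";
// within a tile, each row goes to its own PE and each column to its own SIMD
// lane, so the hardware performs P * S multiply-accumulates in parallel.
void folded_matvec(const std::vector<std::vector<double>>& W,  // X x Y matrix
                   const std::vector<double>& v,               // length Y
                   std::vector<double>& out,                   // length X
                   int P, int S) {
    const int X = static_cast<int>(W.size());
    const int Y = static_cast<int>(v.size());
    const int Fn = X / P;   // neuron fold
    const int Fs = Y / S;   // synapse fold
    out.assign(X, 0.0);
    for (int n = 0; n < Fn; ++n)                  // row tiles
        for (int s = 0; s < Fs; ++s)              // column tiles: Fn * Fs "cycles" in total
            for (int pe = 0; pe < P; ++pe)            // parallel PEs in hardware
                for (int lane = 0; lane < S; ++lane)  // parallel SIMD lanes
                    out[n * P + pe] += W[n * P + pe][s * S + lane] * v[s * S + lane];
}
```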

    The same principle applies for convolutional layers, but these always have an inherent amount of folding due to our current implementation of the matrix-matrix product as multiple matrix-vector products.

    4.4.2 Determining F^n and F^s

    Avoiding the "one-size-fits-all" inefficiencies requires tailoring each MVTU's compute resources to the layer's requirements.
    The guiding principle here is rate-balancing the heterogeneous streaming architecture: the slowest layer (the one with the largest initiation interval, IImax) determines the overall throughput, so each layer should use a roughly equal number of cycles to process one image.

    As this is a streaming system, the classification throughput FPS will be approximately Fclk/IImax , where Fclk is the clock frequency.

    Therefore, balancing a fully-connected BNN can be achieved by choosing F^n and F^s such that F^n · F^s = Fclk / FPS for each layer.
    My understanding: the total fold is the number of clock cycles available to process one image, Fclk / FPS; for example, if each image must be processed within 4 cycles, the total fold per layer should be 4.
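    A small sketch of that sizing rule (the layer dimensions and candidate P/S values are hypothetical): given the clock and the target FPS, each layer has a cycle budget of Fclk / FPS, and P and S are chosen so that the layer's total fold (X/P)·(Y/S) fits within it.

```cpp
#include <cstdio>

// Check which (P, S) choices let a hypothetical X x Y fully-connected layer
// meet the per-image cycle budget Fclk / FPS.
int main() {
    const double fclk   = 200e6;        // 200 MHz clock
    const double fps    = 9000;         // target classification rate
    const double budget = fclk / fps;   // ~22222 cycles available per image
    const int X = 4096, Y = 4096;       // hypothetical layer dimensions
    const int choices[] = {16, 32, 64};
    for (int P : choices)
        for (int S : choices) {
            double total_fold = double(X / P) * double(Y / S);  // F_n * F_s cycles per image
            std::printf("P=%2d S=%2d fold=%7.0f -> %s\n", P, S, total_fold,
                        total_fold <= budget ? "meets budget" : "too slow");
        }
}
```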

    Depending on the BNN and the FPS requirements, the number of memory channels or sliding window generation may constitute bottlenecks. For such cases, we match the throughput of all other layers to the bottleneck in order not to waste resources.

    5. Evaluation

    5.1 Experimental Setup

    To evaluate Finn, we created a number of prototypes that accelerate BNN inference on the MNIST [15] (28 × 28 handwritten digits), CIFAR-10 [13] (32 × 32 color images in 10 categories) and cropped SVHN [18] (32 × 32 images of Street View House Numbers) datasets.
    We consider three different BNN topologies for classifying the datasets as follows:

    • SFC and LFC are three-layer fully connected network topologies for classifying the MNIST dataset, with different numbers of neurons to demonstrate accuracy-computation tradeoffs (Section 3.2).
    • CNV is a convolutional network topology inspired by BinaryNet [5] and VGG-16 [24]. It contains a succession of (3x3 convolution, 3x3 convolution, 2x2 maxpool) layers repeated three times with 64-128-256 channels, followed by two fully connected layers of 512 neurons each.

    To further demonstrate the flexibility of the framework, we consider two usage scenarios for each BNN topology to guide the choice of parametrization:

    • max is the maximum performance scenario where it is desirable to reach the peak FPS permitted by the platform, topology and Finn’s architecture.
    • fix represents a scenario with a fixed FPS requirement, which is often determined by an I/O device in real-life applications. For instance, consider a 640 × 480 video stream at 30 FPS, which is to be chopped up into 32 × 32 tiles for neural network inference; that is (640/32) × (480/32) = 300 tiles per frame, so handling this task in real time requires a BNN inference rate of 300 × 30 = 9000 FPS, which we set as the requirement for this usage scenario.
      Question: so the images are tiled and then processed in batches? Is that tile rate what the paper generally means by FPS?

    For each prototype, the folding factors (Section 4.4) were determined to meet the requirements of its usage scenario, and the Finn design flow (Section 4.3) was followed to generate the hardware accelerator. Vivado HLS and Vivado version 2016.3 were used for the bitfile synthesis. A target clock frequency of 200 MHz was used for both Vivado HLS and Vivado, and to run the resulting accelerator unless otherwise stated.

    The host code runs on the CortexA9 cores of the Zynq. It initializes 10000 images with test data in the Zynq’s shared DRAM, launches and times the accelerator execution to measure classification throughput, then measures accuracy by comparing against the correct classifications.

    Two power measurements, Pchip and Pwall, are provided for each experiment: Pchip uses the PMBus interface to monitor the FPGA power supply rails, and Pwall uses a wall power meter to measure total board power consumption.

    5.2 Results

    We focus on particular aspects of the results in the following subsections.

    5.2.1 Maximum Throughput and Bottlenecks

    To assess the quality of results for the max scenarios, we compare the achieved performance (XNOR-popcount operations per second) with the peak throughput in TOPS indicated by the roofline model.
    (I did not fully understand Figure 9, mainly its horizontal axis.)
    CNV-max achieves 2.5 TOPS and is architecture-bound. The current SWU design does not scale as well as the MVTU and constitutes a bottleneck... Despite its higher complexity, observe that CNV-max actually requires ∼2× fewer LUTs than SFC-max since the folding parameters for CNV-max are chosen in accordance with the maximum performance dictated by the bottleneck.
    SFC-max achieves 8.2 TOPS and is memory-bound. Observe that the SFC arithmetic intensity line intersects the memory-bound (sloped) part of the roofline, thus the performance cannot be scaled up without adding more DRAM memory bandwidth.
    LFC-max achieves 9.1 TOPS, which is 46% of the roofline, and is resource-bound. As folding factors are integers, the smallest increment is 2× which roughly doubles the resource cost.

    5.2.2 Energy Efficiency

    It is desirable to minimize the energy spent per image classification, which corresponds to maximizing FPS per Watt when many images are to be classified.

    In general, we see that the higher FPS prototypes have better energy efficiency.

    It is also worth noting that the board’s idle power consumption is about 7 W, which forms a lower bound on all wall power measurements, and could be improved by e.g. using LPDDR memory.

    To maximize energy efficiency with a fixed target FPS, is it better to use a highly parallel design at low clock frequency, or a less parallel design at high clock frequency? We ran an additional experiment to investigate this question by slowing down the SFC-max prototype to meet the fix FPS requirement of 9000 FPS....
    This suggests that a high degree of parallelism benefits energy efficiency as long as the FPGA resources are available.

    5.2.3 Resource Efficiency

    We consider two aspects of resource efficiency for Finn:
    how efficiently the compute units are used during runtime (runtime efficiency), and how efficiently FPGA resources are turned into compute units (mapping efficiency).

    To assess runtime efficiency, we divide the actual number of synaptic operations per cycle, FPS · Ops / Fclk, by the peak number of synaptic operations per cycle instantiated in the design, Σ 2 · P · S (summed over all layers).
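    As a worked sketch with made-up numbers (none of these are measurements from the paper), the ratio can be computed as follows:

```cpp
#include <cstdio>

// Runtime efficiency: actual synaptic ops per clock cycle (derived from the
// measured FPS) divided by the peak ops per cycle instantiated in the design.
int main() {
    const double fps            = 12000;   // measured classifications per second
    const double ops_per_frame  = 8.0e7;   // total synaptic operations per classification
    const double fclk           = 200e6;   // clock frequency in Hz
    const double peak_per_cycle = 2.0 * 32 * 32 * 3;  // sum of 2*P*S over 3 hypothetical layers
    const double actual_per_cycle = fps * ops_per_frame / fclk;
    std::printf("runtime efficiency = %.1f%%\n",
                100.0 * actual_per_cycle / peak_per_cycle);  // ~78% for these numbers
}
```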

    The prototypes exhibit good runtime efficiency, with ∼70% for CNV, ∼80% for SFC and ∼90% for LFC. The efficiency can be increased further by fine-tuning the folding factors between different layers.

    Evaluating the mapping efficiency directly on the prototypes loses some insight, since CNV spends LUTs on the SWU and PU, while the fully-connected topologies do not.

    Instead, for a single 256 × 256 fully-connected layer, we fix S = 64 and vary P, and plot the LUTs per synaptic operation in Figure 11, which should be minimized to maximize efficiency.
    (The figure shows LUT usage per operation per cycle as P increases, and BRAM usage is also reported to illustrate resource utilization; I did not study this part in detail.)

    Thus, the depth and number of BRAMs, and the LUT-to-BRAM ratio of the FPGA plays a key role in determining how well the resources will be utilized by a BNN. For instance, on another FPGA with the same amount of LUTs but twice the number of half-depth BRAMs, LFC-max could achieve 2× throughput.

    6. Conclusion

    BNNs are particularly well-suited for FPGA implementations, as the parameters can fit entirely in OCM and the arithmetic is simplified, enabling high computational performance. The novel parameterizable dataflow architecture and optimizations presented enable unprecedented classification rates with minimal power consumption and latency, while offering the flexibility and scalability required for accelerating larger and more complex networks. We hence believe that this technology is eminently suitable for embedded applications requiring real-time response, including surveillance, robotics and augmented reality.

    Future work will focus on providing support for non-binary low precision, implementing larger networks like AlexNet, higher performance convolutions, and a more thorough design space exploration.

    Reading summary

    The architecture targets relatively small networks, roughly three convolution-plus-pooling stages.
    Its main contributions are:
    1. A method for estimating the theoretical peak performance (the roofline model).
    2. A per-layer customized heterogeneous streaming architecture.
    3. Binary implementations of multiplication, accumulation, activation and pooling.
    4. Convolution, activation and pooling implemented on the FPGA via the matrix-vector-threshold unit.
    5. A folding method driven by user requirements and resource constraints.
    6. Analysis of the results: bottlenecks, energy efficiency and resource-usage efficiency.
