13.06.2022
Lester Kalms: Methods and Algorithms for Efficient Programming of FPGA-based Heterogeneous Systems for Object Detection (Verteidigung der Promotion)
28.06.2022, 14:00 Uhr
Einladung zur Verteidigung der Promotion von Herrn Lester Kalms
Thema: Methods and Algorithms for Efficient Programming of FPGA-based Heterogeneous Systems for Object Detection
Betreuer: Prof. Dr.-Ing. Diana Göhringer
Zweitgutachter: Prof. Dr. Marco D. Santambrogio
Abstract: Nowadays, there is a high demand for computer vision applications in numerous application areas, such as autonomous driving or unmanned aerial vehicles. However, the application areas and scenarios are becoming increasingly complex, and their data requirements are growing. To meet these requirements, it needs increasingly powerful computing systems. FPGA-based heterogeneous systems offer an excellent solution in terms of energy efficiency, flexibility, and performance, especially in the field of computer vision. Due to complex applications and the use of FPGAs in combination with other architectures, efficient programming is becoming increasingly difficult. Thus, developers need a comprehensive framework with efficient automation, good usability, reasonable abstraction, and seamless integration of tools. It should provide an easy entry point, and reduce the effort to learn new concepts, programming languages and tools. Additionally, it needs optimized libraries for the user to focus on developing applications without getting involved with the underlying details. These should be well integrated, easy to use, and cover a wide range of possible use cases. The framework needs efficient algorithms to execute applications on heterogeneous architectures with maximum performance. These algorithms should distribute applications across various nodes with low fragmentation and communication overhead and find a near-optimal solution in a reasonable amount of time. This thesis addresses the research problem of an efficient implementation of object detection applications, their distribution across FPGA-based heterogeneous systems, and methods for automation and integration using toolchains. Within this, the three contributions are the HiFlipVX object detection library, the DECISION framework, and the APARMAP application distribution algorithm.
HiFlipVX is an open-source HLS-based FPGA library optimized for performance and re-
source efficiency. It contains 66 highly parameterizable computer vision functions including neural networks, ideally for design space exploration. It extends the OpenVX standard for feature extraction, which is challenging due to unknown element size at design time. All functions are streaming capable to achieve maximum performance by increasing parallelism and reducing off-chip memory access. It does not require external or vendor libraries, which eases project integration, device coverage, and vendor portability, as shown for Intel. The library consumed on average 0.39 % FFs and 0.32 % LUTs for a set of image processing functions compared to a vendor library. A HiFlipVX implementation of the AKAZE feature detector computes between 3.56 and 4.13 times more pixels per second than the related work, while its resource consumption is comparable to optimized VHDL designs. Its neural network extension achieved a speedup of 3.23 for an AlexNet layer in comparison to a related work, while consuming 73 % less on-chip memory. Furthermore, this thesis proposes an improved feature extraction implementation that achieves a repeatability of 72.57 % when weighting complex cases, while the next best algorithm only achieves 62.99 %.
DECISION is a framework consisting of two toolchains for the efficient programming of FPGA-based heterogeneous systems. Both integrate HiFlipVX and use a joint OpenVX-based frontend to implement computer vision applications. It abstracts the underlying hardware and algorithm details while covering a wide range of architectures and applications. The first toolchain targets x86-based systems consisting of CPUs, GPUs, and FPGAs using OpenCL. To create a heterogeneous schedule, it considers device profiles, kernel profiles and estimates, and FPGA dataflow characteristics. It manages synchronization, memory transfers and data coherence at design time. It creates a runtime optimized program which excels by its high parallelism and a low overhead. Additionally, this thesis looks at the integration of OpenCL-based libraries, automatic OpenCL kernel generation, and OpenCL kernel optimization and comparison for different architectures. The second toolchain creates an application specific and adaptive NoC-based architecture. The streaming-optimized architecture enables the reusability of vision functions by multiple applications to improve the resource efficiency while maintaining high performance. For a set of example applications, the resource consumption was more than halved, while its overhead was only 0.015 % in terms of performance.
APARMAP is an application distribution algorithm for partition-based and mesh-like FPGA
topologies. It uses a NoC (Network-on-Chip) as communication infrastructure to connect
reconfigurable regions and generate an application-specific hardware architecture. The al-
gorithm uses load balancing techniques to find reasonable solutions within a predictable
and scalable amount of time. It optimizes solutions using various heuristics, such as Simu-
lated Annealing and Tabu Search. It uses a multithreaded grid-based approach to prevent
threads from calculating the same solution and getting stuck in local minimums. Its con-
straints and objectives are the FPGA resource utilization, NoC bandwidth consumption, NoC hop count, and execution time of the proposed algorithm. The evaluation showed that the algorithm can deal with heterogeneous and irregular host graph topologies. The algorithm showed a good scalability in terms of computation time for an increasing number of nodes and partitions. It was able to achieve an optimal placement for a set of example graphs up to a size of 196 nodes on host graphs of up to 49 partitions. For a real application with 271 nodes and 441 edges, it was able to achieve a distribution with low resource fragmentation in an average time of 149 ms.