Project deliverables (details)
COMP4601 projects aim to provide experience in accelerating compute-intensive software using programmable logic. Project groups choose a typical algorithm to investigate, profile its performance in software, and apply techniques acquired during their studies to reduce execution time by partitioning the computation between a processor and tightly-coupled programmable logic. Challenges include: maintaining focus on a well-defined compute-bound process, efficiently exchanging data between processor and logic, and managing the complexity of unfamiliar hardware platforms and software tools under tight deadlines.
With our focus shifting to using HLS to accelerate individual C/C++ functions, and with limited access to (and experience with) implementing those accelerated cores on hardware, the scope of the projects needs to be concentrated on accelerating a small number of functions and measuring the benefit of doing so. Moreover, because of time constraints and the difficulty of providing fair access to hardware, we do not expect you to demonstrate your solution working on a target hardware platform: it will be perfectly fine to simulate the acceleration of an application using the techniques developed during the labs and detailed in the suggested project development flow we describe below. If, in addition, you also implement your solution on the ZedBoard (or similar) platform, it will earn your team much kudos and commensurate consideration during assessment.
In 2021, you will be expected to form teams of 2 people to work on one of the projects listed below. Project teams that include an overseas-based student will be permitted to have 3 members, and it is preferred that no team includes more than one overseas-based member. Please choose a team rep to email Oliver with the names of the team members working together and the project you are planning to work on.
In 2021, teams are permitted to work on any of the projects listed below or to propose their own project idea if it has the potential to exhibit speedup from FPGA acceleration (see the "Roll your own project" description below).
As a guide to how to proceed and complete your project, we recommend you follow the set of steps outlined in the suggested project development flow below.
The following flow is divided into two streams (a & b) depending upon whether you only wish to simulate the acceleration of an application (stream a), or whether you are also implementing and testing your solution on the ZedBoard (additional steps listed in stream b).
The project deliverables are listed in detail on a separate page.
In outline, the following reports and presentations are expected:
CNNs provide state-of-the-art performance on image recognition tasks, at the cost of a very high computational load. The CNN algorithm involves a large number of convolutions and matrix multiplications that can be executed in parallel. On desktop platforms, GPUs and ASICs like Google's TPU accelerate CNN algorithms by executing operations in parallel. On the Zynq-7000 device, similar acceleration is possible by configuring the acceleration circuit on the programmable logic.
In this project, you'll develop FPGA cores to accelerate the execution of CNNs on a Zynq-7000 device. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with a software-only implementation on the processing system (PS).
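To make the computation concrete, the heart of a convolutional layer is the loop nest below. This is a minimal, hypothetical C++ sketch (function and parameter names are ours, not part of any project template); in HLS, the inner loops are the natural targets for PIPELINE/UNROLL pragmas on the programmable logic:

```cpp
#include <vector>
#include <cassert>

// Valid-mode 2D convolution over a single channel. The innermost
// multiply-accumulate loops are what an HLS core would pipeline and
// unroll to exploit the parallelism described above.
std::vector<std::vector<float>> conv2d(const std::vector<std::vector<float>>& in,
                                       const std::vector<std::vector<float>>& k) {
    int H = in.size(), W = in[0].size();
    int KH = k.size(), KW = k[0].size();
    int OH = H - KH + 1, OW = W - KW + 1;
    std::vector<std::vector<float>> out(OH, std::vector<float>(OW, 0.0f));
    for (int r = 0; r < OH; ++r)
        for (int c = 0; c < OW; ++c) {
            float acc = 0.0f;                  // accumulate one output pixel
            for (int i = 0; i < KH; ++i)
                for (int j = 0; j < KW; ++j)
                    acc += in[r + i][c + j] * k[i][j];
            out[r][c] = acc;
        }
    return out;
}
```

Profiling a software version of this loop nest is a good way to identify the fraction of total CNN runtime that your core can actually attack.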
CNN background:
Opportunities of acceleration in a CNN:
Steps to approach this task:
In traditional (non-deep-learning) machine learning techniques, image features extracted from input images replace the raw images for more efficient learning and inference. As this example demonstrates, Histogram of Oriented Gradients (HoG) descriptors coupled with an SVM can be very powerful in image classification tasks. In this project, you'll focus on accelerating the computation of HoG descriptors. For the sake of simplicity, only consider 8-bin HoG in this project. (https://medium.com/@basu369victor/handwritten-digits-recognition-d3d383431845)
You'll use your HoG implementation to compute the HoG descriptors for a batch of images. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with a software-only implementation on the PS.
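One way to see where the cycles go: for every pixel, the gradient magnitude must be accumulated into one of the 8 orientation bins. A hypothetical scalar C++ sketch of that binning step (names are illustrative, and a real design would process many pixels per cycle):

```cpp
#include <cmath>
#include <cassert>

// Map a pixel gradient (gx, gy) to one of 8 unsigned-orientation bins
// covering [0, 180) degrees, i.e. 22.5 degrees per bin.
int hog_bin(float gx, float gy) {
    float angle = std::atan2(gy, gx) * 180.0f / 3.14159265f;
    if (angle < 0.0f) angle += 180.0f;        // fold to unsigned orientation
    if (angle >= 180.0f) angle -= 180.0f;     // guard the boundary case
    return static_cast<int>(angle / 22.5f);   // bin index 0..7
}

// Accumulate one pixel's contribution into a cell histogram.
void hog_accumulate(float gx, float gy, float hist[8]) {
    hist[hog_bin(gx, gy)] += std::sqrt(gx * gx + gy * gy);
}
```

In hardware, the atan2 and sqrt calls are usually replaced by CORDIC cores or small lookup tables, which is one of the design trade-offs worth measuring.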
HoG algorithm background:
Steps to approach this task:
For this project you will implement a basic hardware accelerated Bitcoin miner. The Bitcoin mining algorithm is defined as:
```python
def mine_bitcoin(header):
    # Loop until we find a good nonce, run out of time, or are killed.
    while not timeout:
        # Increment the nonce (a garbage field we are allowed to set
        # to influence the output hash).
        header.nonce += 1
        # Return our golden nonce if we find it!
        if sha256d(header) < header.target:
            break
    return header

def sha256d(header):
    return sha256(sha256(header))
```
NOTES:
Alternative hash cores:
Three extra hash functions are suitable substitutes for SHA256D. When integrated into the miner, they will mine a hypothetical coin based on these hash functions.
We chose hash functions that produce 256-bit digests instead of 512-bit, as that allows for more exploration of unroll factors on the relatively resource-constrained ZedBoard.
We will implement these with the assumption that the input size is fixed at 80 bytes (640 bits), the size of the Bitcoin block header.
This function is quite new, proposed on 10 January 2020 by the team behind BLAKE2. Its predecessor BLAKE was a finalist for the SHA-3 standard (Keccak won out), and the BLAKE2 family is still widely used today as an alternative to SHA-3. See the attached specification.
Tony Wu: "Due to it being so new, very few public implementations exist for FPGAs, if any. I believe this is doable within the COMP4601 project timeframe, especially with application of HLS."
Students can read more about it and find a reference implementation written in Rust: https://github.com/BLAKE3-team/BLAKE3/
This is an important application for FPGAs because the data width can be matched and the computation unrolled and pipelined to gain a good speedup over a CPU. Try designing and implementing a general solution that can process text as it is streamed through the FPGA part, potentially as keys change. Choose from AES, DES, RSA, ...
Tony, who is a crypto expert on FPGAs, suggests the following:
AES Accelerator
Design an AES accelerator that operates in ECB mode.
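ECB mode applies the block cipher to each 16-byte block independently, which is exactly what makes it easy to unroll across parallel cipher cores in hardware. The sketch below shows only that ECB structure; the XOR "cipher" is a toy placeholder where a real design would drop in the 10-round AES-128 datapath:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cassert>

// Toy single-block "cipher": XOR with the key. This is NOT AES; it is a
// stand-in so the ECB structure around it is runnable.
void toy_encrypt_block(uint8_t block[16], const uint8_t key[16]) {
    for (int i = 0; i < 16; ++i) block[i] ^= key[i];
}

// ECB mode: each 16-byte block is processed independently, so this outer
// loop can be unrolled across parallel cipher cores on the FPGA.
void ecb_encrypt(std::vector<uint8_t>& data, const uint8_t key[16]) {
    for (size_t off = 0; off + 16 <= data.size(); off += 16)
        toy_encrypt_block(&data[off], key);
}
```

Because blocks are independent, throughput scales almost linearly with the number of cipher cores until you saturate the PS-PL data link, which is a useful result to measure and report.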
SHA3 Accelerator
Design a SHA3 accelerator that can take an arbitrary input size.
Milestones will be similar to AES Accelerator.
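Supporting an arbitrary input size largely comes down to implementing the SHA-3 padding rule (FIPS 202 pad10*1 plus the 0x06 domain-separation suffix) correctly for byte-aligned messages. A small sketch, with the rate defaulting to SHA3-256's 136 bytes:

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Pad a byte-aligned message for SHA-3: append 0x06 (the 01 suffix plus
// the first pad bit), zero-fill to the rate boundary, then set the top
// bit of the final byte. rate = 136 bytes for SHA3-256.
std::vector<uint8_t> sha3_pad(std::vector<uint8_t> msg, size_t rate = 136) {
    size_t rem = msg.size() % rate;
    size_t pad = rate - rem;                 // at least one byte is added
    msg.push_back(0x06);
    msg.insert(msg.end(), pad - 1, 0x00);    // zero fill
    msg.back() |= 0x80;                      // final pad bit
    return msg;
}
```

Note the corner case where only one padding byte fits: 0x06 and 0x80 land in the same byte, giving 0x86. Getting this wrong is a classic source of mismatches against reference test vectors.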
Decide on medium and algorithm; focus on either compression or decompression and perform the opposite operation in software for verification purposes. Given the limitations on time, stick to problems that use simple data structures!
One possibility is to read Chapter 11 of PP4FPGAs on Huffman Encoding, work through the exercises and write a report on what you have learnt from the experience.
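If you go the Huffman route, the software baseline is small enough to prototype quickly. A sketch of computing Huffman code lengths from symbol frequencies (the greedy tree build is the part PP4FPGAs Chapter 11 restructures for hardware):

```cpp
#include <queue>
#include <vector>
#include <utility>
#include <functional>
#include <cassert>

// Compute Huffman code lengths (leaf depths) for the given frequencies.
std::vector<int> huffman_lengths(const std::vector<long>& freq) {
    int n = freq.size();
    struct Node { long f; int l, r; };        // l/r < 0 marks a leaf
    std::vector<Node> nodes;
    using QE = std::pair<long, int>;          // (frequency, node index)
    std::priority_queue<QE, std::vector<QE>, std::greater<QE>> pq;
    for (int i = 0; i < n; ++i) {
        nodes.push_back({freq[i], -1, -1});
        pq.push({freq[i], i});
    }
    // Repeatedly merge the two lowest-frequency subtrees.
    while (pq.size() > 1) {
        auto a = pq.top(); pq.pop();
        auto b = pq.top(); pq.pop();
        int idx = (int)nodes.size();
        nodes.push_back({a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, idx});
    }
    // Depth-first walk from the root to record each leaf's depth.
    std::vector<int> len(n, 0);
    std::vector<std::pair<int, int>> stack{{(int)nodes.size() - 1, 0}};
    while (!stack.empty()) {
        auto [i, d] = stack.back(); stack.pop_back();
        if (nodes[i].l < 0) len[i] = d;       // leaf: record code length
        else {
            stack.push_back({nodes[i].l, d + 1});
            stack.push_back({nodes[i].r, d + 1});
        }
    }
    return len;
}
```

The priority-queue merge loop is inherently sequential, which is precisely why the hardware restructuring in the chapter is interesting, and software decompression of the result gives you the verification path suggested above.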
Opportunities abound in this HOT area. Consider accelerating "k-means clustering", or "k-nearest neighbours", or "support vector machine" or "Naive Bayes classifier" or "neural net".
At the heart of many machine learning, inference, big data analytics and scientific applications are matrix-matrix and matrix-vector multiplications. In this project I would suggest parameterizing the datatypes and matrix/vector sizes. Compared with a baseline algorithm using float type data, how do speeds, utilization and accuracy compare as the problem size scales?
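As a starting point, the float baseline is just the triply nested loop below (names and the use of `std::vector` are ours; the HLS experiments would then swap the datatype, e.g. to a fixed-point type, and vary the matrix sizes):

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

using Mat = std::vector<std::vector<float>>;

// Baseline matrix-matrix multiply: C = A * B, with A n x p and B p x m.
// In an HLS version, the j/k loops are where PIPELINE/UNROLL and array
// partitioning get applied as the problem size scales.
Mat matmul(const Mat& A, const Mat& B) {
    size_t n = A.size(), p = B.size(), m = B[0].size();
    Mat C(n, std::vector<float>(m, 0.0f));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j) {
            float acc = 0.0f;                 // one dot product per output
            for (size_t k = 0; k < p; ++k)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
    return C;
}
```

Keeping this software version around gives you both the accuracy reference (against reduced-precision datatypes) and the timing baseline for the speedup comparison.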
String and regular expression matching has frequently been targeted for FPGA acceleration.
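For exact matching, the bit-parallel shift-or (bitap) algorithm is a classic fit for FPGAs: the entire match state is one machine word updated with a shift and an OR per input character, so hardware can sustain one character per clock. A software sketch for patterns up to 63 characters (one possible starting point, not a prescribed design):

```cpp
#include <cstdint>
#include <string>
#include <cstddef>
#include <cassert>

// Shift-or (bitap) exact string matching. Returns the index of the
// first occurrence of pat in text, or -1 if absent. In hardware the
// mask table becomes a small ROM and the state update a single-cycle
// shift/OR datapath.
long bitap_find(const std::string& text, const std::string& pat) {
    int m = (int)pat.size();
    if (m == 0 || m > 63) return -1;          // word-width limit of this sketch
    uint64_t mask[256];
    for (int i = 0; i < 256; ++i) mask[i] = ~0ULL;
    for (int i = 0; i < m; ++i)
        mask[(unsigned char)pat[i]] &= ~(1ULL << i);
    uint64_t D = ~0ULL;                       // all match states inactive
    for (size_t j = 0; j < text.size(); ++j) {
        D = (D << 1) | mask[(unsigned char)text[j]];
        if (!(D & (1ULL << (m - 1))))         // bit m-1 clear => full match
            return (long)j - m + 1;
    }
    return -1;
}
```

Regular expression matching generalises this idea to bit-parallel NFA simulation, which is where the FPGA's width and pipelining really pay off.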
Propose a project to Oliver... the problem should have the potential to be sped up using FPGA hardware.
Potential applications for acceleration include: