Project deliverables (details)

2021 COMP4601 Projects (Release: 11/06/2021)

Historical Overview

COMP4601 projects aim to provide experience at accelerating compute-intensive software using programmable logic. Project groups choose a typical algorithm to investigate, profile its performance in software, and employ techniques acquired during their studies to reduce the execution time by partitioning the computation between a processor and tightly-coupled programmable logic. Challenges include: maintaining focus on a well-defined compute-bound process, efficiently exchanging data between processor and logic, and managing the complexity of unfamiliar hardware platforms and software tools with tight deadlines.

2021 Update

With our focus shifting to the use of HLS to accelerate individual C/C++ functions and limited access/experience with implementing those accelerated cores on hardware, the scope of the projects needs to be concentrated on accelerating and measuring the benefit of accelerating a small number of functions. Moreover, because of time constraints and difficulties providing fair access to hardware, we are not expecting that you demonstrate your solution working on a target hardware platform - it will be perfectly fine to simulate the acceleration of an application using the techniques developed during the labs and as detailed in the suggested project development flow we describe below. If, in addition, you also implement your solution on the ZedBoard (or similar) platform, it will earn your team much kudos and commensurate consideration during assessment.

Outline

In 2021, you will be expected to form teams of 2 people to work on one of the projects listed below. Those project teams that include an overseas based student will be permitted to have 3 members, and it is preferred that no team includes more than one overseas based member. Please choose a team rep to email Oliver with the names of the team members working together and the project you are planning to work on.

In 2021, teams are permitted to work on any of the projects listed below or to propose their own project idea if it has the potential to exhibit speedup from FPGA acceleration (see the "Roll your own project" description below).

As a guide to how to proceed and complete your project, we recommend you follow the set of steps outlined in the suggested project development flow below.

Suggested project development flow

The following flow is divided into two streams (a & b) depending upon whether you only wish to simulate the acceleration of an application (stream a), or whether you are also implementing and testing your solution on the ZedBoard (additional steps listed in stream b).

    1a. write or obtain C/C++ code that encapsulates the compute-intensive task you are aiming to accelerate; verify its correctness
    1b. get the code running on the ZedBoard
    2a. profile this code on the development platform you have used to write the code in order to identify the "hotspots" or compute-intensive kernels/loop nests you will accelerate
    2b. profile the code on the ZedBoard
    3a. partition the code so that the kernels are contained within functions that will lend themselves directly to acceleration using HLS
    3b. do this for the ZedBoard
    4a. obtain baseline performance and utilization data for the kernels
    4b. do this for the ZedBoard and then implement the baseline (unaccelerated) kernels as IP cores in programmable logic; take care to use a sensible communications approach (AXI-Lite for control, AXI-Stream for FIFO data coming from the code executing on the ARM processors, or DMA for large data transfers from off-chip DRAM)
    5a. determine and implement a strategy for transforming the kernel code into a high-performing core using HLS; report on changes in performance and utilization as you progress your strategy
    5b. do so on the ZedBoard, updating your comms strategy as required
    6a. report on the techniques used to improve the performance of the core using HLS and compute the resulting overall improvement in performance for the code obtained in step 1a.
    6b. do so for the ZedBoard and compare the calculated improvement with the actual improvement in performance
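As an illustration of steps 2a-3a, the kernel must first be isolated and its share of the runtime measured. Your project code will normally be C/C++ (profiled with gprof or perf); the sketch below uses Python's built-in cProfile purely to show the workflow, with `hotspot` standing in for a hypothetical compute-intensive kernel:

```python
import cProfile
import io
import pstats

def hotspot(n):
    # Hypothetical compute-intensive kernel: sum of squares.
    total = 0
    for i in range(n):
        total += i * i
    return total

def application():
    # The surrounding application calls the kernel many times.
    return sum(hotspot(10_000) for _ in range(100))

profiler = cProfile.Profile()
profiler.enable()
result = application()
profiler.disable()

# Report cumulative time per function; the kernel should dominate,
# identifying it as the function to partition out for HLS.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Once the profile confirms where the time goes, step 3a is simply a matter of ensuring the hotspot lives in its own function with clean inputs and outputs, so it can be handed to the HLS tool unchanged.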

Deliverables/reporting requirements

The project deliverables are listed in detail on a separate page.

In outline, the following reports and presentations are expected:

Project list

  1. Accelerate Convolutional Neural Network

    CNNs provide state-of-the-art performance on image recognition tasks, at the cost of a very high computational load. The CNN algorithm involves a large number of convolutions and matrix multiplications that can be executed in parallel. On desktop platforms, GPUs and ASICs such as Google's TPU accelerate CNN algorithms by executing these operations in parallel; on the Zynq-7000 device, similar acceleration is possible by configuring an acceleration circuit in the programmable logic.

    In this project, you'll develop FPGA cores to accelerate the execution of CNNs on a Zynq-7000 device. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with that of a software-only implementation on the processing system (PS).

    CNN background:

    Opportunities for acceleration in a CNN:

    Steps to approach this task:

    1. Study the CNN algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design using C code for the PS implementation.
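    To make the parallelism in step 1 concrete, here is a minimal 2D convolution in plain Python (single channel, valid padding, no stride, which is an assumed simplification): every output pixel's multiply-accumulate is independent of the others, which is exactly what an HLS design can unroll or pipeline.

    ```python
    def conv2d(image, kernel):
        """Valid-mode 2D convolution (really cross-correlation, as in most CNNs)."""
        ih, iw = len(image), len(image[0])
        kh, kw = len(kernel), len(kernel[0])
        oh, ow = ih - kh + 1, iw - kw + 1
        out = [[0.0] * ow for _ in range(oh)]
        for r in range(oh):          # each output row ...
            for c in range(ow):      # ... and column is independent
                acc = 0.0
                for i in range(kh):  # the MAC loop nest an HLS tool can unroll
                    for j in range(kw):
                        acc += image[r + i][c + j] * kernel[i][j]
                out[r][c] = acc
        return out

    image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    result = conv2d(image, [[1, 1], [1, 1]])
    print(result)  # [[4.0, 4.0], [4.0, 4.0]]
    ```

    A C/C++ version of the same loop nest, with the inner two loops unrolled and the outer loops pipelined, is the natural starting point for the HLS core.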

  2. Accelerated Histogram of Oriented Gradients

    In traditional (non-deep-learning) machine learning techniques, image features extracted from the input images replace the raw images for more efficient learning and inference. As this example demonstrates, Histogram of Oriented Gradients (HoG) descriptors coupled with an SVM can be very powerful in image classification tasks (https://medium.com/@basu369victor/handwritten-digits-recognition-d3d383431845). In this project, you'll focus on accelerating the computation of HoG descriptors. For simplicity, only consider 8-bin HoG in this project.

    You'll use your HoG implementation to compute the HoG descriptors for a batch of images. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with that of a software-only implementation on the PS.

    HoG algorithm background:

    Steps to approach this task:

    1. Study the HoG algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design using C code for the PS implementation.

    Here's a test set of 10,000 images from the MNIST dataset. Junning has converted them into bitmap format, so they should be ready to use with no need for further conversion.
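    To illustrate the 8-bin histogram at the core of HoG, here is a minimal sketch for one cell (an assumed simplification: unsigned gradients over [0°, 180°), central differences, and no cell/block normalisation, all of which a full implementation would add):

    ```python
    import math

    def hog_cell_histogram(cell):
        """8-bin histogram of unsigned gradient orientations for one cell.

        `cell` is a 2D list of pixel intensities; the outermost pixels are
        skipped so central differences stay in bounds.
        """
        bins = [0.0] * 8
        h, w = len(cell), len(cell[0])
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                gx = cell[y][x + 1] - cell[y][x - 1]   # central differences
                gy = cell[y + 1][x] - cell[y - 1][x]
                magnitude = math.hypot(gx, gy)
                angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
                bins[int(angle // 22.5) % 8] += magnitude         # 180/8 = 22.5 deg bins
        return bins

    # A pure horizontal ramp: every gradient points along 0 degrees (bin 0).
    cell = [[x for x in range(4)] for _ in range(4)]
    hist = hog_cell_histogram(cell)
    print(hist)
    ```

    The per-pixel gradient, magnitude and binning operations are all independent, so this loop nest is a good candidate for pipelining in HLS; the `atan2` can be replaced by comparisons against precomputed bin-boundary slopes to avoid trigonometry in hardware.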

  3. Bitcoin Miner

    For this project you will implement a basic hardware accelerated Bitcoin miner. The Bitcoin mining algorithm is defined as:

    ```python
    def mine_bitcoin(header):
      # Loop until we find a good nonce, run out of time, or are killed.
      while not timeout:
        # Increment the nonce (a garbage field we are allowed to set to influence the output hash).
        header.nonce += 1
        # Return our golden nonce if we find it!
        if sha256d(header) < header.target:
          break
      return header

    def sha256d(header):
      return sha256(sha256(header))
    ```
    

    NOTES:

    Alternative hash cores:
    Three extra hash functions are suitable substitutes for SHA256D; when integrated into the miner, they will mine a hypothetical coin based on these hash functions.

    We choose hash functions that produce 256-bit digests instead of 512-bit ones, as that allows for more exploration of unroll factors on the relatively resource-constrained ZedBoard.

    We will implement these with the assumption that the input size is fixed at 80 bytes (640 bits), the size of the Bitcoin block header.
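    Before building the HLS core, the double-SHA256 step can be checked against a software reference model using Python's standard `hashlib` (the all-zero 80-byte header below is a stand-in; a real header packs version, previous hash, merkle root, time, bits and nonce):

    ```python
    import hashlib

    def sha256d(data: bytes) -> bytes:
        """Bitcoin's double SHA-256: hash the 32-byte digest of the first pass."""
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    # An 80-byte all-zero stand-in for a Bitcoin block header.
    header = bytes(80)
    digest = sha256d(header)
    print(digest.hex())
    ```

    Mining compares this 256-bit digest, interpreted as an integer, against the target; in hardware the two SHA-256 passes form a natural two-stage pipeline.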

  4. Blake3 implementation and evaluation

    This hash function is quite new: it was proposed on 10 Jan 2020 by the people behind Blake2, which was a finalist for the SHA3 standard (KECCAK won out) but is still widely used today as an alternative to SHA3. See the attached specification.

    Tony Wu: "Due to it being so new, very few public implementations exist for FPGAs, if any. I believe this is doable within the COMP4601 project timeframe, especially with application of HLS."

    Students can read more about it and find a reference implementation written in Rust: https://github.com/BLAKE3-team/BLAKE3/

  5. Encryption/decryption

    This is an important application for FPGAs because the data width can be matched and the computation unrolled and pipelined to gain a good speedup over a CPU. Try designing and implementing a general solution that can process text as it is streamed through the FPGA part, potentially as keys change. Choose from AES, DES, RSA, ...

    Tony, who is a crypto expert on FPGAs, suggests the following:

    AES Accelerator

    Design an AES accelerator that operates in ECB mode.

    NOTE: Please read up on the difference between ECB and CBC mode.
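    A minimal sketch of the difference, using a toy XOR "cipher" purely as a stand-in for AES (an assumption for brevity; never use XOR as a real cipher): in ECB every block is encrypted independently, so identical plaintext blocks leak as identical ciphertext blocks, while CBC chains each block through the previous ciphertext.

    ```python
    BLOCK = 16  # AES block size in bytes

    def toy_encrypt_block(block, key):
        # Stand-in for the AES block cipher: XOR with a 16-byte key.
        return bytes(b ^ k for b, k in zip(block, key))

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def ecb_encrypt(plaintext, key):
        # Blocks are independent -- trivially parallel in hardware, but
        # identical plaintext blocks produce identical ciphertext blocks.
        return b"".join(
            toy_encrypt_block(plaintext[i:i + BLOCK], key)
            for i in range(0, len(plaintext), BLOCK))

    def cbc_encrypt(plaintext, key, iv):
        # Each block is XORed with the previous ciphertext block before
        # encryption, so the chain is inherently sequential.
        out, prev = [], iv
        for i in range(0, len(plaintext), BLOCK):
            block = toy_encrypt_block(xor_bytes(plaintext[i:i + BLOCK], prev), key)
            out.append(block)
            prev = block
        return b"".join(out)

    repeated = b"SIXTEEN BYTE MSG" * 2           # two identical blocks
    key, iv = bytes(range(16)), bytes(16)
    ecb = ecb_encrypt(repeated, key)
    cbc = cbc_encrypt(repeated, key, iv)
    print(ecb[:16] == ecb[16:32])   # True: ECB leaks the repetition
    print(cbc[:16] == cbc[16:32])   # False: CBC hides it
    ```

    The independence of ECB blocks is also why ECB is the easier mode to accelerate: blocks can be dispatched to parallel AES cores, whereas CBC encryption must proceed block by block.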

    SHA3 Accelerator

    Design a SHA3 accelerator that can take an arbitrary input size.

    Milestones will be similar to those for the AES accelerator.
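    Handling an arbitrary input size is mostly a question of padding and streaming; Python's standard `hashlib` exposes SHA3-256 and can serve as the software reference model. Feeding the input in arbitrary-sized chunks, as sketched below, models how a streaming AXI interface would deliver data to the accelerator:

    ```python
    import hashlib

    def sha3_256_stream(chunks):
        """Hash an arbitrary-length input fed in arbitrary-sized chunks.

        The incremental digest must match the one-shot computation, which
        is a useful invariant when verifying the hardware core.
        """
        h = hashlib.sha3_256()
        for chunk in chunks:
            h.update(chunk)
        return h.hexdigest()

    message = b"an input of any length, fed in pieces"
    streamed = sha3_256_stream([message[:10], message[10:]])
    one_shot = hashlib.sha3_256(message).hexdigest()
    print(streamed == one_shot)  # True
    ```

    In hardware, the same structure appears as the Keccak sponge absorbing fixed-rate blocks, with the final partial block padded according to the SHA3 padding rule.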

  6. Text/audio/image compression/decompression

    Decide on a medium and algorithm; focus on either compression or decompression, and perform the opposite operation in software for verification purposes. Given the limitations on time, stick to problems that use simple data structures!

    One possibility is to read Chapter 11 of PP4FPGAs on Huffman Encoding, work through the exercises and write a report on what you have learnt from the experience.
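    As a compact software reference to check a hardware version against, Huffman code construction fits in a few lines (a sketch using a heap of partial trees; the tie-break counter is only there to keep the heap comparisons well-defined):

    ```python
    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Map each symbol in `text` to its prefix-free Huffman code string."""
        heap = [(freq, i, {sym: ""}) for i, (sym, freq) in
                enumerate(Counter(text).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                       # degenerate single-symbol input
            return {sym: "0" for sym in heap[0][2]}
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            # Prefix the two subtrees' codes with 0 and 1 and merge them.
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    codes = huffman_codes("abracadabra")
    encoded = "".join(codes[ch] for ch in "abracadabra")
    print(codes)
    print(len(encoded))  # 23 bits for "abracadabra"
    ```

    The PP4FPGAs chapter covers how the sort/merge structure above is recast for hardware; comparing your core's code table against this reference is a simple correctness check.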

  7. Machine learning

    Opportunities abound in this HOT area. Consider accelerating "k-means clustering", or "k-nearest neighbours", or "support vector machine" or "Naive Bayes classifier" or "neural net".
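    Taking k-means as one example, the kernel is a distance computation repeated over every point/centroid pair, which maps naturally onto parallel hardware. A minimal 1-D sketch (fixed iteration count assumed, rather than a convergence test):

    ```python
    def kmeans_1d(points, centroids, iterations=10):
        """Minimal 1-D k-means; the nested distance loop is the HLS target."""
        centroids = list(centroids)
        for _ in range(iterations):
            clusters = [[] for _ in centroids]
            for p in points:  # every point/centroid distance is independent
                nearest = min(range(len(centroids)),
                              key=lambda i: abs(p - centroids[i]))
                clusters[nearest].append(p)
            # Recompute each centroid as the mean of its cluster.
            centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        return centroids

    print(kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0]))  # [2.0, 11.0]
    ```

    The assignment step dominates the runtime (points × centroids distance evaluations per iteration), so it is the part to unroll and pipeline; the centroid update is a cheap reduction.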

  8. Real-valued Matrix operations at scale

    At the heart of many machine learning, inference, big data analytics and scientific applications are matrix-matrix and matrix-vector multiplications. In this project I would suggest parameterizing the datatypes and matrix/vector sizes. Compared with a baseline algorithm using float-type data, how do speed, utilization and accuracy compare as the problem size scales?
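    A sketch of the parameterisation idea: a plain float matmul as the baseline, plus a quantised variant whose fractional bit-width is a parameter, so accuracy can be measured as precision shrinks (the fixed-point scheme here is a hypothetical illustration; in HLS you would use `ap_fixed` types):

    ```python
    def matmul(a, b):
        """Baseline float matrix-matrix multiply (lists of rows)."""
        n, k, m = len(a), len(b), len(b[0])
        return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
                for i in range(n)]

    def matmul_fixed(a, b, frac_bits):
        """Same product with inputs quantised to `frac_bits` fractional bits."""
        scale = 1 << frac_bits
        q = lambda x: round(x * scale)  # quantise to an integer representation
        n, k, m = len(a), len(b), len(b[0])
        return [[sum(q(a[i][p]) * q(b[p][j]) for p in range(k)) / (scale * scale)
                 for j in range(m)]
                for i in range(n)]

    a = [[0.5, 1.25], [2.0, -0.75]]
    b = [[1.0, 0.5], [0.25, 2.0]]
    exact = matmul(a, b)
    approx = matmul_fixed(a, b, frac_bits=8)
    err = max(abs(x - y) for xr, yr in zip(exact, approx) for x, y in zip(xr, yr))
    print(err)  # 0.0 here: these inputs are exactly representable in 8 fractional bits
    ```

    Sweeping `frac_bits` (and the matrix sizes) against inputs that are not exactly representable gives the accuracy-versus-precision curve, which in hardware trades directly against DSP and LUT utilization.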

  9. Pattern matching

    String and regular-expression matching have frequently been targeted for FPGA acceleration.
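    For a software reference point, the Knuth-Morris-Pratt matcher below consumes exactly one input character per step by precomputing a failure table; that one-character-per-cycle property is what a streaming FPGA matcher exploits:

    ```python
    def kmp_search(text, pattern):
        """Return the start index of every occurrence of `pattern` in `text`."""
        if not pattern:
            return []
        # Failure table: length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan: each character advances or falls back via the table,
        # so the matcher is a finite automaton over the input stream.
        matches, k = [], 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = fail[k - 1]
        return matches

    print(kmp_search("abababa", "aba"))  # [0, 2, 4]
    ```

    In hardware the failure table becomes a small ROM and the current state a register, giving a fixed-rate matcher regardless of pattern; regular-expression engines generalise the same automaton idea.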

  10. Roll your own project

    Propose a project to Oliver...the problem should have the potential to be sped up using FPGA hardware.

    Potential applications for acceleration include: