Project deliverables (details)

2021 COMP4601 Projects (Release: 11/06/2021)

Historical Overview

COMP4601 projects aim to provide experience at accelerating compute-intensive software using programmable logic. Project groups choose a typical algorithm to investigate, profile its performance in software, and employ techniques acquired during their studies to reduce the execution time by partitioning the computation between a processor and tightly-coupled programmable logic. Challenges include: maintaining focus on a well-defined compute-bound process, efficiently exchanging data between processor and logic, and managing the complexity of unfamiliar hardware platforms and software tools with tight deadlines.

2021 Update

With our focus shifting to the use of HLS to accelerate individual C/C++ functions and limited access/experience with implementing those accelerated cores on hardware, the scope of the projects needs to be concentrated on accelerating and measuring the benefit of accelerating a small number of functions. Moreover, because of time constraints and difficulties providing fair access to hardware, we are not expecting that you demonstrate your solution working on a target hardware platform - it will be perfectly fine to simulate the acceleration of an application using the techniques developed during the labs and as detailed in the suggested project development flow we describe below. If, in addition, you also implement your solution on the ZedBoard (or similar) platform, it will earn your team much kudos and commensurate consideration during assessment.

Outline

In 2021, you will be expected to form teams of 2 people to work on one of the projects listed below. Those project teams that include an overseas based student will be permitted to have 3 members, and it is preferred that no team includes more than one overseas based member. Please choose a team rep to email Oliver with the names of the team members working together and the project you are planning to work on.

In 2021, teams are permitted to work on any of the projects listed below or to propose their own project idea if it has the potential to exhibit speedup from FPGA acceleration (see the "Roll your own project" description below).

As a guide to how to proceed and complete your project, we recommend you follow the set of steps outlined in the suggested project development flow below.

Suggested project development flow

The following flow is divided into two streams (a & b) depending upon whether you only wish to simulate the acceleration of an application (stream a), or whether you are also implementing and testing your solution on the ZedBoard (additional steps listed in stream b).

    1a. write or obtain C/C++ code that encapsulates the compute-intensive task you are aiming to accelerate; verify its correctness
    1b. get the code running on the ZedBoard
    2a. profile this code on the development platform you have used to write the code in order to identify the "hotspots" or compute-intensive kernels/loop nests you will accelerate
    2b. profile the code on the ZedBoard
    3a. partition the code so that the kernels are contained within functions that will lend themselves directly to acceleration using HLS
    3b. do this for the ZedBoard
    4a. obtain baseline performance and utilization data for the kernels
    4b. do this for the ZedBoard and then implement the baseline (unaccelerated) kernels as IP cores in programmable logic; take care to use a sensible communications approach (AXI-Lite for control, AXI-Stream for FIFO data coming from the code executing on the ARM processors, or DMA for large data transfers from off-chip DRAM)
    5a. determine and implement a strategy for transforming the kernel code into a high-performing core using HLS; report on changes in performance and utilization as you progress your strategy
    5b. do so on the ZedBoard, updating your comms strategy as required
    6a. report on the techniques used to improve the performance of the core using HLS and compute the resulting overall improvement in performance for the code obtained in step 1a.
    6b. do so for the ZedBoard and compare the calculated improvement with the actual improvement in performance
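As an illustration of steps 2a-3a, the kernel must first be isolated and its share of the runtime measured. Your project code will normally be C/C++ (profiled with gprof or perf); the sketch below uses Python's built-in cProfile purely to show the workflow, with `hotspot` standing in for a hypothetical compute-intensive kernel:

```python
import cProfile
import io
import pstats

def hotspot(n):
    # Hypothetical compute-intensive kernel: sum of squares.
    total = 0
    for i in range(n):
        total += i * i
    return total

def application():
    # The surrounding application calls the kernel many times.
    return sum(hotspot(10_000) for _ in range(100))

profiler = cProfile.Profile()
profiler.enable()
result = application()
profiler.disable()

# Report cumulative time per function; the kernel should dominate,
# identifying it as the function to partition out for HLS.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Once the profile confirms where the time goes, step 3a is simply a matter of ensuring the hotspot lives in its own function with clean inputs and outputs, so it can be handed to the HLS tool unchanged.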

Deliverables/reporting requirements

The project deliverables are listed in detail on a separate page.

In outline, the following reports and presentations are expected:

Project list

  1. Accelerate Convolutional Neural Network

    CNNs provide state-of-the-art performance on image recognition tasks, at the cost of a very high computational load. The CNN algorithm involves a large number of convolutions and matrix multiplications that can be executed in parallel. On desktop platforms, GPUs and ASICs such as Google's TPU accelerate CNN algorithms by executing these operations in parallel; on the Zynq-7000 device, similar acceleration is possible by configuring an acceleration circuit in the programmable logic.

    In this project, you'll develop FPGA cores to accelerate the execution of CNNs on a Zynq-7000 device. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with that of a software-only implementation on the processing system (PS).

    CNN background:

    Opportunities for acceleration in a CNN:

    Steps to approach this task:

    1. Study the CNN algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design using C code for the PS implementation.
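    To make the parallelism in step 1 concrete, here is a minimal 2D convolution in plain Python (single channel, valid padding, no stride, which is an assumed simplification): every output pixel's multiply-accumulate is independent of the others, which is exactly what an HLS design can unroll or pipeline.

    ```python
    def conv2d(image, kernel):
        """Valid-mode 2D convolution (really cross-correlation, as in most CNNs)."""
        ih, iw = len(image), len(image[0])
        kh, kw = len(kernel), len(kernel[0])
        oh, ow = ih - kh + 1, iw - kw + 1
        out = [[0.0] * ow for _ in range(oh)]
        for r in range(oh):          # each output row ...
            for c in range(ow):      # ... and column is independent
                acc = 0.0
                for i in range(kh):  # the MAC loop nest an HLS tool can unroll
                    for j in range(kw):
                        acc += image[r + i][c + j] * kernel[i][j]
                out[r][c] = acc
        return out

    image = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    result = conv2d(image, [[1, 1], [1, 1]])
    print(result)  # [[4.0, 4.0], [4.0, 4.0]]
    ```

    A C/C++ version of the same loop nest, with the inner two loops unrolled and the outer loops pipelined, is the natural starting point for the HLS core.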

  2. Accelerated Histogram of Oriented Gradients

    In traditional (non-deep-learning) machine learning techniques, image features extracted from the input images replace the raw images for more efficient learning and inference. As this example demonstrates, Histogram of Oriented Gradients (HoG) descriptors coupled with an SVM can be very powerful in image classification tasks (https://medium.com/@basu369victor/handwritten-digits-recognition-d3d383431845). In this project, you'll focus on accelerating the computation of HoG descriptors. For simplicity, only consider 8-bin HoG in this project.

    You'll use your HoG implementation to compute the HoG descriptors for a batch of images. The speedup can be measured by comparing the hardware-accelerated design's latency and/or throughput with that of a software-only implementation on the PS.

    HoG algorithm background:

    Steps to approach this task:

    1. Study the HoG algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design using C code for the PS implementation.

    Here's a test set of 10,000 images from the MNIST dataset. Junning has converted them into bitmap format, so they should be ready to use with no need for further conversion.
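    To illustrate the 8-bin histogram at the core of HoG, here is a minimal sketch for one cell (an assumed simplification: unsigned gradients over [0°, 180°), central differences, and no cell/block normalisation, all of which a full implementation would add):

    ```python
    import math

    def hog_cell_histogram(cell):
        """8-bin histogram of unsigned gradient orientations for one cell.

        `cell` is a 2D list of pixel intensities; the outermost pixels are
        skipped so central differences stay in bounds.
        """
        bins = [0.0] * 8
        h, w = len(cell), len(cell[0])
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                gx = cell[y][x + 1] - cell[y][x - 1]   # central differences
                gy = cell[y + 1][x] - cell[y - 1][x]
                magnitude = math.hypot(gx, gy)
                angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
                bins[int(angle // 22.5) % 8] += magnitude         # 180/8 = 22.5 deg bins
        return bins

    # A pure horizontal ramp: every gradient points along 0 degrees (bin 0).
    cell = [[x for x in range(4)] for _ in range(4)]
    hist = hog_cell_histogram(cell)
    print(hist)
    ```

    The per-pixel gradient, magnitude and binning operations are all independent, so this loop nest is a good candidate for pipelining in HLS; the `atan2` can be replaced by comparisons against precomputed bin-boundary slopes to avoid trigonometry in hardware.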

  3. Bitcoin Miner

    For this project you will implement a basic hardware accelerated Bitcoin miner. The Bitcoin mining algorithm is defined as:

    ```python
    def mine_bitcoin(header):
      # Loop until we find a good nonce, run out of time, or are killed.
      while not timeout:
        # Increment the nonce (a garbage field we are allowed to set to influence the output hash).
        header.nonce += 1
        # Return our golden nonce if we find it!
        if sha256d(header) < header.target:
          break
      return header

    def sha256d(header):
      return sha256(sha256(header))
    ```
    

    NOTES:

    Alternative hash cores:
    Three extra hash functions are suitable substitutes for SHA256D; when integrated into the miner, they will mine a hypothetical coin based on these hash functions.

    We choose hash functions that produce 256-bit digests instead of 512-bit ones, as that allows for more exploration of unroll factors on the relatively resource-constrained ZedBoard.

    We will implement these with the assumption that the input size is fixed at 80 bytes (640 bits), the size of the Bitcoin block header.
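    Before building the HLS core, the double-SHA256 step can be checked against a software reference model using Python's standard `hashlib` (the all-zero 80-byte header below is a stand-in; a real header packs version, previous hash, merkle root, time, bits and nonce):

    ```python
    import hashlib

    def sha256d(data: bytes) -> bytes:
        """Bitcoin's double SHA-256: hash the 32-byte digest of the first pass."""
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    # An 80-byte all-zero stand-in for a Bitcoin block header.
    header = bytes(80)
    digest = sha256d(header)
    print(digest.hex())
    ```

    Mining compares this 256-bit digest, interpreted as an integer, against the target; in hardware the two SHA-256 passes form a natural two-stage pipeline.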

  4. Blake3 implementation and evaluation

    This hash function is quite new: it was proposed on 10 Jan 2020 by the people behind Blake2, which was a finalist for the SHA3 standard (KECCAK won out) but is still widely used today as an alternative to SHA3. See the attached specification.

    Tony Wu: "Due to it being so new, very few public implementations exist for FPGAs, if any. I believe this is doable within the COMP4601 project timeframe, especially with application of HLS."

    Students can read more about it and find a reference implementation written in Rust: https://github.com/BLAKE3-team/BLAKE3/

  5. Encryption/decryption

    This is an important application for FPGAs because the data width can be matched and the computation unrolled and pipelined to gain a good speedup over a CPU. Try designing and implementing a general solution that can process text as it is streamed through the FPGA part, potentially as keys change. Choose from AES, DES, RSA, ...

    Tony, who is a crypto expert on FPGAs, suggests the following:

    AES Accelerator

    Design an AES accelerator that operates in ECB mode.

    NOTE: Please read up on the difference between ECB and CBC mode.
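    A minimal sketch of the difference, using a toy XOR "cipher" purely as a stand-in for AES (an assumption for brevity; never use XOR as a real cipher): in ECB every block is encrypted independently, so identical plaintext blocks leak as identical ciphertext blocks, while CBC chains each block through the previous ciphertext.

    ```python
    BLOCK = 16  # AES block size in bytes

    def toy_encrypt_block(block, key):
        # Stand-in for the AES block cipher: XOR with a 16-byte key.
        return bytes(b ^ k for b, k in zip(block, key))

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def ecb_encrypt(plaintext, key):
        # Blocks are independent -- trivially parallel in hardware, but
        # identical plaintext blocks produce identical ciphertext blocks.
        return b"".join(
            toy_encrypt_block(plaintext[i:i + BLOCK], key)
            for i in range(0, len(plaintext), BLOCK))

    def cbc_encrypt(plaintext, key, iv):
        # Each block is XORed with the previous ciphertext block before
        # encryption, so the chain is inherently sequential.
        out, prev = [], iv
        for i in range(0, len(plaintext), BLOCK):
            block = toy_encrypt_block(xor_bytes(plaintext[i:i + BLOCK], prev), key)
            out.append(block)
            prev = block
        return b"".join(out)

    repeated = b"SIXTEEN BYTE MSG" * 2           # two identical blocks
    key, iv = bytes(range(16)), bytes(16)
    ecb = ecb_encrypt(repeated, key)
    cbc = cbc_encrypt(repeated, key, iv)
    print(ecb[:16] == ecb[16:32])   # True: ECB leaks the repetition
    print(cbc[:16] == cbc[16:32])   # False: CBC hides it
    ```

    The independence of ECB blocks is also why ECB is the easier mode to accelerate: blocks can be dispatched to parallel AES cores, whereas CBC encryption must proceed block by block.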

    SHA3 Accelerator

    Design a SHA3 accelerator that can take an arbitrary input size.

    Milestones will be similar to those for the AES accelerator.
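    Handling an arbitrary input size is mostly a question of padding and streaming; Python's standard `hashlib` exposes SHA3-256 and can serve as the software reference model. Feeding the input in arbitrary-sized chunks, as sketched below, models how a streaming AXI interface would deliver data to the accelerator:

    ```python
    import hashlib

    def sha3_256_stream(chunks):
        """Hash an arbitrary-length input fed in arbitrary-sized chunks.

        The incremental digest must match the one-shot computation, which
        is a useful invariant when verifying the hardware core.
        """
        h = hashlib.sha3_256()
        for chunk in chunks:
            h.update(chunk)
        return h.hexdigest()

    message = b"an input of any length, fed in pieces"
    streamed = sha3_256_stream([message[:10], message[10:]])
    one_shot = hashlib.sha3_256(message).hexdigest()
    print(streamed == one_shot)  # True
    ```

    In hardware, the same structure appears as the Keccak sponge absorbing fixed-rate blocks, with the final partial block padded according to the SHA3 padding rule.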

  6. Text/audio/image compression/decompression

    Decide on a medium and algorithm; focus on either compression or decompression, and perform the opposite operation in software for verification purposes. Given the limitations on time, stick to problems that use simple data structures!

    One possibility is to read Chapter 11 of PP4FPGAs on Huffman Encoding, work through the exercises and write a report on what you have learnt from the experience.
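    As a compact software reference to check a hardware version against, Huffman code construction fits in a few lines (a sketch using a heap of partial trees; the tie-break counter is only there to keep the heap comparisons well-defined):

    ```python
    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Map each symbol in `text` to its prefix-free Huffman code string."""
        heap = [(freq, i, {sym: ""}) for i, (sym, freq) in
                enumerate(Counter(text).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                       # degenerate single-symbol input
            return {sym: "0" for sym in heap[0][2]}
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            # Prefix the two subtrees' codes with 0 and 1 and merge them.
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    codes = huffman_codes("abracadabra")
    encoded = "".join(codes[ch] for ch in "abracadabra")
    print(codes)
    print(len(encoded))  # 23 bits for "abracadabra"
    ```

    The PP4FPGAs chapter covers how the sort/merge structure above is recast for hardware; comparing your core's code table against this reference is a simple correctness check.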

  7. Machine learning

    Opportunities abound in this HOT area. Consider accelerating "k-means clustering", or "k-nearest neighbours", or "support vector machine" or "Naive Bayes classifier" or "neural net".
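    Taking k-means as one example, the kernel is a distance computation repeated over every point/centroid pair, which maps naturally onto parallel hardware. A minimal 1-D sketch (fixed iteration count assumed, rather than a convergence test):

    ```python
    def kmeans_1d(points, centroids, iterations=10):
        """Minimal 1-D k-means; the nested distance loop is the HLS target."""
        centroids = list(centroids)
        for _ in range(iterations):
            clusters = [[] for _ in centroids]
            for p in points:  # every point/centroid distance is independent
                nearest = min(range(len(centroids)),
                              key=lambda i: abs(p - centroids[i]))
                clusters[nearest].append(p)
            # Recompute each centroid as the mean of its cluster.
            centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        return centroids

    print(kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0]))  # [2.0, 11.0]
    ```

    The assignment step dominates the runtime (points × centroids distance evaluations per iteration), so it is the part to unroll and pipeline; the centroid update is a cheap reduction.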

  8. Real-valued Matrix operations at scale

    At the heart of many machine learning, inference, big data analytics and scientific applications are matrix-matrix and matrix-vector multiplications. In this project I would suggest parameterizing the datatypes and matrix/vector sizes. Compared with a baseline algorithm using float-type data, how do speed, utilization and accuracy compare as the problem size scales?
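    A sketch of the parameterisation idea: a plain float matmul as the baseline, plus a quantised variant whose fractional bit-width is a parameter, so accuracy can be measured as precision shrinks (the fixed-point scheme here is a hypothetical illustration; in HLS you would use `ap_fixed` types):

    ```python
    def matmul(a, b):
        """Baseline float matrix-matrix multiply (lists of rows)."""
        n, k, m = len(a), len(b), len(b[0])
        return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
                for i in range(n)]

    def matmul_fixed(a, b, frac_bits):
        """Same product with inputs quantised to `frac_bits` fractional bits."""
        scale = 1 << frac_bits
        q = lambda x: round(x * scale)  # quantise to an integer representation
        n, k, m = len(a), len(b), len(b[0])
        return [[sum(q(a[i][p]) * q(b[p][j]) for p in range(k)) / (scale * scale)
                 for j in range(m)]
                for i in range(n)]

    a = [[0.5, 1.25], [2.0, -0.75]]
    b = [[1.0, 0.5], [0.25, 2.0]]
    exact = matmul(a, b)
    approx = matmul_fixed(a, b, frac_bits=8)
    err = max(abs(x - y) for xr, yr in zip(exact, approx) for x, y in zip(xr, yr))
    print(err)  # 0.0 here: these inputs are exactly representable in 8 fractional bits
    ```

    Sweeping `frac_bits` (and the matrix sizes) against inputs that are not exactly representable gives the accuracy-versus-precision curve, which in hardware trades directly against DSP and LUT utilization.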

  9. Pattern matching

    String and regular-expression matching have frequently been targeted for FPGA acceleration.
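    For a software reference point, the Knuth-Morris-Pratt matcher below consumes exactly one input character per step by precomputing a failure table; that one-character-per-cycle property is what a streaming FPGA matcher exploits:

    ```python
    def kmp_search(text, pattern):
        """Return the start index of every occurrence of `pattern` in `text`."""
        if not pattern:
            return []
        # Failure table: length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan: each character advances or falls back via the table,
        # so the matcher is a finite automaton over the input stream.
        matches, k = [], 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                matches.append(i - k + 1)
                k = fail[k - 1]
        return matches

    print(kmp_search("abababa", "aba"))  # [0, 2, 4]
    ```

    In hardware the failure table becomes a small ROM and the current state a register, giving a fixed-rate matcher regardless of pattern; regular-expression engines generalise the same automaton idea.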

  10. Roll your own project

    Propose a project to Oliver...the problem should have the potential to be sped up using FPGA hardware.

    Potential applications for acceleration include: