Harnessing the Multicores with Data Parallel Haskell

Data Parallel Haskell is a set of program transformations and a self-optimising, parallel array library integrated into the Glasgow Haskell Compiler (GHC), the most widely used Haskell compiler. It extends the broad-spectrum, purely functional programming language Haskell with nested data parallelism, targeting multi-core CPUs and GPUs.

The Problem

Computers are no longer getting faster; instead, we will be offered computers containing more and more CPUs, each of which is no faster than those of the previous generation. As the number of CPUs increases, it becomes increasingly difficult for a programmer to deal with the interactions of large numbers of threads. Moreover, the physical limitations of bus bandwidth mean that memory access times will be increasingly non-uniform (even if the address space is shared) and that locality of reference will be increasingly important.

State of the Art

In the world of massively-parallel computing with strong locality requirements there is already a well-established, demonstrably successful brand leader, namely data parallelism. In a data-parallel computation one performs the same computation on a large collection of differing data values. Well-known examples of data-parallel programming environments are parallel Fortran dialects, the collective operations of the Message Passing Interface (MPI), OpenMP, NVIDIA's Compute Unified Device Architecture (CUDA) API for graphics processors, and Google's map/reduce framework.

Unfortunately, flat data parallelism is rather restrictive for the programmer: subcomputations performed in parallel must all process substructures of the same shape and size. The resulting parallel structure is restricted to rectangular, multi-dimensional arrays, and it lacks composability and modularity.
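
To make the restriction concrete, here is a minimal illustrative sketch of a flat data-parallel operation, written with the parallel array notation [: :] introduced below (the function scale is our example, not library code): a single scalar computation is applied uniformly across a regular array, and that computation may not itself contain parallel work.

-- Flat data parallelism (illustrative sketch): one scalar
-- computation, (* k), applied to every element of a regular array.
-- Nested data parallelism lifts exactly this restriction.
scale :: Double -> [: Double :] -> [: Double :]
scale k xs = mapP (* k) xs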

Nested Data Parallelism

Nested data parallelism is more expressive: it supports irregular and sparse data structures and algorithms by enabling parallel subcomputations of widely varying shape and size, and it is dynamic, composable, and modular. The basic idea has been around for a considerable while (cf. Blelloch's programming language NESL), but it has never been integrated into a general-purpose language. We are adding nested data parallelism to the broad-spectrum, purely functional language Haskell. Haskell is pure by default, restricting side effects to explicitly marked computations, and this purity is essential to implementing nested data parallelism, as computations are radically re-ordered.

Data Parallel Haskell

We express data parallelism in the form of collective operations over nested parallel arrays. Here we define dense vectors as arrays of double-precision floating point numbers, sparse vectors as arrays of index-value pairs, and sparse matrices as arrays of sparse vectors.

type Vector       = [: Double :]
type SparseVector = [: (Int, Double) :]
type SparseMatrix = [: SparseVector :]
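
For illustration (these example values are ours, not part of the library), the dense vector <5, 0, 0, 3> and its sparse representation, which stores only the non-zero elements together with their indices, can be written as follows.

-- Illustrative example values: a small dense vector and the sparse
-- representation of the same vector, keeping only the non-zero
-- elements paired with their indices.
denseExample :: Vector
denseExample = [: 5, 0, 0, 3 :]

sparseExample :: SparseVector
sparseExample = [: (0, 5), (3, 3) :]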

Given these data types, we can implement the dot product of a sparse and a dense vector as a parallel sum (sumP) over an array comprehension: the comprehension draws each index-value pair from the sparse vector and multiplies the value by the element of the dense vector at that index, using the parallel indexing operator !:.

sdotp :: SparseVector -> Vector -> Double
sdotp sv v = sumP [: x * (v !: i) | (i, x) <- sv :]
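
With the illustrative values above, sdotp sparseExample denseExample evaluates to 5 * 5 + 3 * 3 = 34: the comprehension produces the two products in parallel, and sumP reduces them to a single Double.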

Finally, we multiply an entire sparse matrix with a dense vector by computing a sparse dot product for every row of the matrix.

smvm :: SparseMatrix -> Vector -> Vector
smvm m v = [: sdotp row v | row <- m :]

This algorithm nests parallelism in an essential way: we may perform all of the sparse dot products in parallel, even though those dot products are themselves parallel operations and the sizes of different sparse rows may be very different.
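
As a point of reference (our sketch, not part of the library), the same algorithm written over ordinary Haskell lists, with the hypothetical names sdotpSeq and smvmSeq, computes the same result but evaluates purely sequentially; the parallel-array version above differs only in that the comprehensions and the sum are evaluated in parallel.

-- A sequential reference version using ordinary Haskell lists; it
-- computes the same result as smvm above, but one element at a time.
sdotpSeq :: [(Int, Double)] -> [Double] -> Double
sdotpSeq sv v = sum [ x * (v !! i) | (i, x) <- sv ]

smvmSeq :: [[(Int, Double)]] -> [Double] -> [Double]
smvmSeq m v = [ sdotpSeq row v | row <- m ]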

Implementing Data Parallel Haskell

Our implementation of Data Parallel Haskell is based on the mature, highly-optimising Glasgow Haskell Compiler, developed at Microsoft Research, Cambridge, U.K., together with international collaborators. At the heart of our approach are a vectorising program transformation and a high-performance, self-optimising array library, which are described in more detail in the papers "Harnessing the Multicores: Nested Data Parallelism in Haskell" and "Data Parallel Haskell: a status report", respectively.
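
The key idea behind vectorisation, in a much-simplified sketch of our own (GHC's actual transformation is considerably more involved, and square and squareL are hypothetical names), is that every function is paired with a lifted variant that processes a whole array of arguments at once, so that nested parallel computations can be flattened into flat, bulk operations.

-- A much-simplified illustration of lifting, not GHC's actual code:
-- the scalar function square is paired with a lifted variant squareL
-- that consumes an entire array of arguments in one bulk step.
square :: Double -> Double
square x = x * x

squareL :: [: Double :] -> [: Double :]
squareL xs = mapP square xs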

Currently, the array library has only two backends: one is purely sequential, and the other uses GHC's multi-threading support to parallelise array computations on shared-memory (SMP) multi-core CPUs. However, the architecture is generic, and we are planning further array backends, such as one targeting GPUs via CUDA and/or OpenCL.
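
As a rough sketch of how a program selects this machinery (the exact pragma and flag names varied between GHC releases, so treat the details below as assumptions rather than a recipe), a Data Parallel Haskell module of this era enabled the parallel array syntax with a LANGUAGE pragma and was compiled with vectorisation switched on.

{-# LANGUAGE PArr #-}            -- enable the [: ... :] syntax
{-# OPTIONS_GHC -fvectorise #-}  -- apply the vectorising transformation
module SMVM (smvm) where

import Data.Array.Parallel.Prelude
import Data.Array.Parallel.Prelude.Double

-- the type synonyms and the definitions of sdotp and smvm shown
-- above go here

The resulting program is linked against GHC's threaded runtime (the -threaded flag) and told at startup how many OS threads to use, e.g., ./smvm +RTS -N8.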

Performance

We gathered performance figures for an early implementation of Data Parallel Haskell on three simple benchmarks: (1) sumsq computes the parallel sum of the squares of the numbers from 1 to 10 million; (2) dotp computes the dot product of two dense vectors of 100 million double-precision floating-point numbers; and (3) smvm multiplies a sparse matrix containing 10 million non-zero double-precision floating-point elements by a dense vector, using the code displayed above. In contrast to earlier benchmarks, the numbers presented here were gathered from vectorised code using nested parallel arrays and array comprehensions as described above, not from programs that directly target the self-optimising array library.
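
Of these, smvm is exactly the code displayed above. To give an impression of how compact the other benchmarks are, sumsq can be reconstructed along the following lines (our reconstruction, not necessarily the benchmarked source, and it assumes the enumeration form [: 1 .. n :]).

-- A reconstruction of the sumsq benchmark: the parallel sum of the
-- squares of the numbers from 1 to n (n = 10 million when benchmarked).
sumsq :: Int -> Int
sumsq n = sumP [: k * k | k <- [: 1 .. n :] :]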

The graphs illustrate performance relative to a sequential C program; i.e., as soon as a curve climbs above 1 on the y-axis, its absolute performance is better than that of the sequential C program. The benchmarks were executed on two different hardware architectures, both equipped with 8 cores. The first has 8 cores with one hardware thread each, in the form of two quad-core 3GHz Xeon processors. The second has 8 cores with 8 hardware threads each, in the form of a single 1.4GHz UltraSPARC T2.

The scalability of these three programs in Data Parallel Haskell is very good, although dotp is limited by memory bandwidth on the first architecture. A parallel C implementation of dotp (the grey curve) suffers the same limitation, and on 8 cores the absolute runtime of the Haskell program is almost identical to that of the C code. Given the very high level of abstraction of purely functional Haskell programs, these results are very encouraging and demonstrate the potential of functional programming for harnessing the multicores.
