## Writing GPU Programs with Nikola

10 Oct 2012

Repa is a fantastic library for regular array computation in Haskell. Version 31 is particularly nice—it leverages Haskell’s type system so that an array’s type tells you about its representation. There are plenty of other great features in Repa, like easy parallelization. In rewriting Nikola, an embedded Haskell DSL for GPU computation, I’ve used Repa as a model for what regular array programming should look like—even if it’s a GPU that is being programmed.

In this post I will explain how to use Nikola to program GPUs through the development of a Mandelbrot set viewer. This example is adapted from the Accelerate mandelbrot example and from the mandelbrot example that is part of Simon Marlow’s par tutorial. All code is available from the nikola github repository. You can install Nikola and the Mandelbrot example by following the instructions in the README. Note that the Nikola that I describe here is a completely different beast from the “Nikola” that I described in the Haskell Symposium 2010 paper Nikola: Embedding Compiled GPU Functions in Haskell—only the name and the problem domain (programming GPUs from Haskell) are the same.

## A Repa Mandelbrot viewer

Since I’m using Repa as a model, let’s start with a Repa Mandelbrot viewer. If you’re not familiar with the Mandelbrot set, you can read the Wikipedia entry. Briefly, for each point $c$ in the complex plain we iteratively compute $z_i$ as follows

\begin{align} z_0 &= 0 \\ z_{i+1} &= z_i^2 + c \end{align}

The point $c$ is a member of the Mandelbrot set iff the $z_i$’s are bounded. It turns out that if $|z_i| > 2$ for some $i$, then the $z_i$’s are not bounded, so $c$ is not a member of the Mandelbrot set. We will visualize the Mandelbrot set using the escape time algorithm—we iterate the computation of $z_i$ until it escapes, i.e., $|z_i| > 2$, or until we reach a fixed “depth” limit, in which case we declare that the point is a potential member of the Mandelbrot set. The iteration count is then used to color the point $c$ in the complex plane and make a pretty picture.

Let’s write this using Repa. You can find the full version of the code in this section on github as examples/mandelbrot/Mandelbrot/RepaV1.hs.

When we move to the Nikola version, many of the types will have the same name, but they will be Nikola types rather than Repa types. Therefore I want to be explicit about what we are importing, so I’ll start with the imports.

Next we have a few types aliases. We use the R type alias to abstract over the representation we use for real numbers. We will represent complex numbers as pairs of real numbers. I omit the Num instance for Complex.

Next are a bunch of type aliases for the Repa arrays we will use. Note that they are representation polymorphic—that’s why they are parametrized by the type variable r.

Each pixel in our image corresponds to a point in the complex plane, and the set of complex points corresponding to the pixels in the image that we are generating is represented by a ComplexPlane. In other words, the ComplexPlane array holds the values of $c$ that correspond to the pixels in our image. A StepPlane holds the current $z_i$ and the current iteration, $i$, for each point in our image.

The step function, which computes $z_{i+1}$ according to the above equation, takes a ComplexPlane and a StepPlane and returns a new StepPlane. The U type index tells us that we are using unboxed vectors to hold the data in our arrays—in other words, we’re using an efficient unboxed data representation. Note that zipWith produces a delayed array—an array represented as a function from arrays indices to values. A delayed array tells us how to compute the elements in the array, but it doesn’t actually perform the computation. We force the computation of all the elements in the array using computeP, which computes the individual elements of the delayed array in parallel and places the result in a new array, which in our case is an unboxed array (type tag U).

Before we can call step, we need to compute the ComplexPlane and the initial StepPlane. The mandelbrot function takes the boundaries of the portion of the complex plane that we want to visualize, the size of the image that should result from the visualization, a depth limit, and computes the visualization.

The ComplexPlane is generated b genPlane, and our initial StepPlane by mkinit. These are both straightforward, so I’ll show the code but won’t walk through them.

Note that we are always iterating the step computation depth times, even if we could bail out early. We also happen to be computing—and allocating!—a new StepPlane every time we call step. This is rather inefficient, but the inefficient version will serve as the stepping-off point for a Nikola implementation. A more efficient version is available on github as examples/mandelbrot/Mandelbrot/RepaV3.hs. The full implementation also generates an image from the final StepPlane and displays it on the screen. If you configure and build Nikola using the instructions in the README, you can execute the “optimized” Repa version of the Mandelbrot viewer as follows. Use whatever value for the -N options makes sense for you machine.

> ./dist/build/mandelbrot/mandelbrot --backend repa +RTS -N4


The mouse wheel and page down/page up keys will zoom in and out, and the arrow keys will move you around. You can also try dragging the image with the mouse, but note that it’s not very responsive! Hopefully we can fix that with our Nikola implementation.

The mandelbrot program provides no fewer than 8 back ends—3 versions of the Repa back end, and 5 versions of the Nikola back end. You can try out the different versions using the --backend flag with an argument repavN or nikolavN where N is 1–3 in the Repa case and 1–5 in the Nikola case. The default (when =–backend– isn’t specified) is the Nikola version 5 back end.

I have also benchmarked all versions of the Mandelbrot back ends by measuring the time they take to generate a frame on two different machines. The first is my laptop, which has a i7-2620M CPU running at 2.70GHz and an NVS 4200M. The second is my desktop, which has a i7-2600K CPU running at 3.40GHz and a GeForce GTX 560 Ti. Both sets of results were run with +RTS -N4, and you can see the exact set of optimization options passed to GHC in the cabal file file in the github repo.

## An aside: GPU details

Before I get to the Nikola version of the Mandelbrot viewer, I need to talk about some of the details of GPU computation. The GPU is a strange beast with an unusual programming model, a different instruction set, and memory that is distinct from CPU memory. Nikola hides some of these details from the programmer—in particular, it tries to expose the programming model in a nice, high-level way—but not all of them. In fact, I think the programmer must be aware of some of these details. For example, a GPU computation can only use data that is in GPU memory. If we have an array in CPU memory, we have to copy it to GPU memory to use it in a GPU computation. These copies are expensive, so the programmer should have control over how and when they happen.

Repa’s type tags provide the perfect vehicle for programmer control of GPU-CPU memory copies, so Nikola introduces a new Repa type tag that signifies that the array’s data lives in GPU memory. All the “standard” Repa functions work with these arrays, but this is extremely slow because these operations all run on the CPU and therefore require GPU-to-CPU memory copies to do any work. So why bother with GPU Repa arrays? Because Nikola lets you compute with them efficiently!

We still need a nice framework for copying Repa arrays between the CPU and GPU. Nikola uses the CUDA runtime API to do this. Because this requires calling C functions, we can only efficiently work with Repa arrays that hold data in a ForeignPtr, for example, arrays with the type tag F. However, we would also like to be able to efficiently work with arrays that have more structure, e.g., Repa unboxed arrays that contain pairs. Repa unboxed arrays, which have a type tag U, use Data.Vector.Unboxed to hold their contents. In the unboxed representation, the content of an unboxed array of pairs is stored in a pair of vectors: one holds the first element of each pair, and the other holds the second element. Despite the details of the underlying representation, we still access the elements of the Repa arrays as if they were pairs. Unfortunately for us, Data.Vector.Unboxed vectors, and therefore Repa unboxed arrays, store their contents in GHC byte arrays instead of ForeignPtr’s, so we can’t (safely) efficiently copy them to the GPU!

What we need are arrays that use the efficient unboxed representation of Data.Vector.Unboxed but store data in ForeignPtr’s instead of byte arrays. Nikola provides such data types, which I will refer to as “unboxed foreign” in contrast to just “unboxed,” in the Data.Vector.UnboxedForeign and Data.Array.Repa.Repr.UnboxedForeign modules. There are corresponding Data.Array.Repa.Repr.CUDA.UnboxedForeign and Data.Array.Repa.Repr.CUDA.UnboxedForeign modules that provide GPU versions of the unboxed foreign representations but store their data in ForeignDevicePtr’s, which as you can imagine point to data on the GPU. Using the unboxed foreign representation allows us to efficiently copy structured array data between the GPU and CPU. Without this extra support, we would only be able to use Nikola with arrays that contained simple base types, like Int32 and Double, instead of arrays that contain, e.g., tuples of Double’s.

In summary, to support Nikola we have introduced the following new Repa type tags:

Repa type tag Meaning
CF The GPU equivalent of the CPU F tag. Stored in a ForeignDevPtr.
UF Unboxed foreign data.
CUF CUDA unboxed foreign data.

New Repa type tags

## A word about Nikola and DSLs

Nikola is an embedded DSL. In fact, it is a deeply embedded DSL: a Nikola program internally produces an abstract syntax tree that represents a computation. There are then various methods for actually running the program that is produced. It is important to remember this “phase distinction” when writing Nikola programs. In particular, instead of functions that return, e.g., a Double, a Nikola function will return an Exp Double. The Exp tells us that we are getting back an abstract representation of a computation, and the Double type index tells us that if we run the computation, we will get a Double out. Because Exp is a computation, we can’t hope to do things like test an Exp for equality with another Exp—such a test is not even decidable! However, we can build a new Exp that represents a computation that tests two Exp’s for equality—at run time it will evaluate the two sub-computations and perform the test. Staging a computation this way—writing a program that generates a program, and then running the generated program—can be hard to wrap your head around. You should maintain a mental model where an Exp represents a computation and a Nikola program is building up a big computation.

The method we will use to run Nikola programs is compilation via Template Haskell. This roughly works as follows. We first write a bunch of Nikola functions and stick them in a module. Recall that a Nikola function produces an abstract representation of a computation. We leverage this by providing a function compile that takes a Nikola function, calls it appropriately to produce a concrete representation of the computation the function embodies, and then transforms this representation into GPU code and compiles it. The compile function actually translates the internal Nikola program representation to CUDA, and then calls NVIDIA’s CUDA compiler, nvcc, to compile the CUDA code to GPU binary code. Then compile builds up a Template Haskell terms that contains the compiled GPU binary code and the necessary glue to call it from Haskell. By calling compile from a Template Haskell splice, we end up with a Haskell function that invokes a GPU computation. All compilation to GPU binary code is done when the Template Haskell splice runs, which is at Haskell compile time. This means there is no run time GPU code generation in the Nikola code I will show. Although tune time code generation is also possible with Nikola, I won’t discuss it further here.

Template Haskell has something called the stage restriction. Basically this means that when we use a Template Haskell splice in a module, everything used inside the splice has to come from a different module. Because we have to use Nikola’s compile function in a splice, this means that we need to put our Nikola functions in a separate module. Don’t worry about the particulars of using compile right now—we’ll see how to use it below. The main point here is that we have to put our Nikola functions in a separate module.

## A Nikola Mandelbrot viewer: A first cut

I will first describe the Nikola part of the implementation—that is, the functions that will actually run on the GPU. The code is available on github in mandelbrot/Mandelbrot/NikolaV1/Implementation.hs.

I’ll start again with our imports

We are using the CUDA back end for Nikola. Eventually Nikola will have multiple back ends (there is currently partial support for OpenMP), but we only need to worry about the CUDA back end. Note that we are not importing any Repa modules.

Next are our type aliases.

The Array type here is the Nikola Array type, not the Repa array type. Similarly, the DIM2 we see here is the Nikola DIM2. Like Repa arrays, Nikola arrays use type tags to describe their representation, so we once again see the r type variable show up in our array type aliases.

There is one glaring difference here: the presence of the Exp data type. When we wrote our Repa implementation, we represented complex numbers as pairs of values of type R, i.e., as pairs of Double’s. Now we are representing complex numbers as pairs of computations that, when run, produce values of type Double. Analogously, bitmaps, complex planes, and step planes all contain Exp computations. Note that Nikola lets us use nested tuples—a StepPlane contains pairs consisting of a complex number, which is itself a tuple, and a step.

Here is the step function:

If you go back and look at the Repa version, you’ll see that it’s almost exactly the same module a few INLINE directives and bang patterns. The one significant difference is the use of >* instead of > in the if expression. Remember that I said we can’t perform tests on Nikola expressions, but we can construct Nikola expressions that perform tests? The >* combinator constructs a Nikola expression that performs a test, and through the magic of the RebindableSyntax GHC language extension, the if expression constructs a Nikola term that uses this test to decide between the two branches in the if.

The genPlane and mkInit functions are also much like their Repa counterparts:

We see in their type signatures that, roughly speaking, what was previously a foo in the Repa implementation becomes an Exp foo in the Nikola version. One exception is the Complex type: in the Repa version a Complex was a (R, R), but in Nikola it is an (Exp R, Exp R), not an Exp (R, R)! The reason is that Exp (a, a) is isomorphic to (Exp a, Exp a) and it is easier to work with a pair of expressions instead of an expression representing a pair, so Nikola sticks with the “friendlier” form. If you doubt the superiority of (Exp a, Exp a) over Exp (a, a), have a look at the Num instance in the full Nikola implementation.

You will also notice that we have pushed the call to computeP into the functions genPlane and mkinit. This is because we want the elements to be computed on the GPU! However, instead of seeing the IO monad as we would if we used Repa’s computeP, we see the P monad. You can think of P as the “parallel” or “program” monad. It is the monad in which all Nikola monadic computations are performed.

Finally, we see an unfamiliar array type tag, G. This signifies that the contents of a Nikola array is stored in GPU global memory. Nikola also has delayed arrays, with type tag D (just like Repa), and a few other representations, one of which we will see a bit later.

### Calling a Nikola function

To call a Nikola function, we first have to compile it using our funny Template Haskell trick. Because of the stage restriction, the compiled Nikola functions live in a different module from the Nikola implementation we just saw. You can find this file in the github repo as mandelbrot/Mandelbrot/NikolaV1.hs.

On to the imports again!

OK, there’s a bunch of junk here. Note that we are once again importing various Repa modules. We are importing only one Nikola module, and then only so we can use the compile function. The Nikola functions we implemented above are imported (qualified) from the Mandelbrot.NikolaV1.Implementation module.

Type aliases again:

These are the same Repa type aliases from before—the Array and DIM2 we see are Repa’s Array and DIM2.

Now let’s compile those Nikola functions:

I’ve placed signatures on these functions as a form of documentation, but will infer them so they are not actually necessary. You can see that Nikola types are mapped by compile to Repa types. This mapping is somewhat complicated, but is (very) roughly approximated by the following table:

Repa type tag Meaning
Exp a a
P (Exp a) IO a
Array G sh (Exp a) Array CUF sh a
P (Array r sh (Exp a)) IO (Array CUF sh a)

Translation from Nikola type to Repa type

The Repa type tag CUF tells us that these functions deal with data stored on the GPU. Compiling a Nikola function that works with Nikola arrays produces a Haskell function that works with Repa arrays.

Finally, here is the mandelbrot function. It’s much the same as the Repa version except for the CUF type tag (recall that we pushed the computeP into the Nikola functions so it is missing here).

There are a few lines of additional glue in the full implementation to copy the data back from the GPU to the CPU and display it.

You can run this version as follows

> ./dist/build/mandelbrot/mandelbrot --backend nikolav1 +RTS -N4


If you do so, you’ll notice it’s not very fast. There’s a good reason for that—we’re allocating a new array in GPU memory for every step, just as we did in the Repa version above. On my laptop, the first-cut Nikola version is still about 3x faster than the first-cut Repa solution. The optimized Repa version, which I will not show here, is more than 21x faster than the Repa version above and more than 7x faster (!) than the first-cut Nikola version we just wrote.

This is quite a testament to Repa. We still have quite a few Nikola tricks up our sleeve, and the final Nikola implementation we write will be 10x (my laptop) to 20x (my desktop) faster than the first version.

## Mutable arrays

Although it won’t immediately get us a big performance boost, one way to improve our performance is to pre-allocate buffers when the dimensions of our image (i.e. display window) change and then re-use them as the user scrolls and zooms instead of allocating a bunch of new buffers every time we have to draw anything.

The point here in this section is to show that Nikola is a monadic embedded DSL, and that it allow us to work with mutable arrays using familiar Haskell idioms, namely monads. I won’t show the code that pre-allocates buffers, but will focus on manipulating mutable arrays in Nikola. There is a fair amount of unsafe-fu below: you have been warned.

### An aside: mutable arrays in Repa

Repa doesn’t provide mutable arrays, but it does provide the MVec associated type as part of the Target type class. MVec is a vector—it doesn’t have a Repa shape. For various reasons, it is convenient to have the shape around when working with mutable arrays in Nikola, so I have defined a Mutable type class and associated type MArray in the Data.Array.Repa.Mutable module in the Nikola repo. There is an analogous MArray type in the Nikola module Data.Array.Nikola.Eval.Target.

### Mutable arrays in Nikola

The full version of the Nikola implementation, modified to use mutable arrays, is available on github as mandelbrot/Mandelbrot/NikolaV2/Implementation.hs.

We begin by adding a few more type aliases for the mutable versions of our various arrays types.

Our step function now uses loadP instead of computeP to load the result of a step into an existing mutable array instead of computing it and producing a newly-allocated array. Note that Repa also provides loadP for Repa arrays. Please forgive the unsafeFreezeMArray—I did warn you about the unsafe-fu.

The new versions of genPlane and mkinit also use loadP instead of computeP—the changes aren’t very interesting, so I won’t show them here.

The highlight here is the fact that we are writing monadic code in a DSL. P is a real monad! We get to use all the nice Haskell support for programming with monads, like do syntax. Furthermore, we can used monads to distinguish pure Nikola code from impure Nikola code just as we do in Repa (and in Haskell in general).

# Iteration on the GPU

Although slightly faster, the Nikola implementation using mutable arrays still suffers from a major flaw—it iterates the step function on the CPU. Each iteration requires calling out to a GPU kernel. Although we have avoided allocating a new chunk of GPU memory for each step (thus the slight time savings), to really speed things up, we need to perform this iteration directly on the GPU. If we can do this, then we will replace depth calls to the step GPU function with a single call to a GPU function.

The natural thing to try is to rewrite the step function so that it performs all the $z_i$ calculations is a single loop. Here’s what that might look like:

If we try to compile this version, we get an unpleasant error message:

examples/mandelbrot/Mandelbrot/NikolaV3.hs:54:16:
Exception when trying to run compile-time code:
(==) Exp: incomparable
Code: compile I.mandelbrot
In the expression: $(compile I.mandelbrot) In an equation for mandelbrot': mandelbrot =$(compile I.mandelbrot)
make: *** [mandelbrot] Error 1


The fact that we’re working with a DSL is biting us. Remember when I said that the Exp type is abstract and we can’t directly inspect it, only use combinators that construct tests? Our implementation of stepPoint pattern matches on its first argument which is of type Exp Int32. That is desugared by GHC into an equality test which is causing the error we just saw. Think of it this way: the Exp Int32 that is passed to stepPoint could represent some gnarly, nasty computation, so stepPoint can’t in general decide whether or not it is equal to zero, thus the error.

The solution is to use a combinator to perform the iteration. Nikola provides a combinator, iterateWhile, which is just what we need. Using it, we can write stepN as follows:

The iterateWhile combinator is actually fairly powerful. In other words, we haven’t cheated by writing a one-off combinator suitable just for this example. I won’t get into the details of how it works, but you can think of it as having a type like Exp Int32 -> (a -> (b, a)) -> a -> a with suitable constraints on the type variables a and b to ensure that b is “Exp Bool-like” and that a is “Exp-like”.

With this change, on my desktop the mandelbrot program is fast enough for live dragging of the image.

When calculating the Mandelbrot set, different points in the complex plan require different amounts of work because some points escape ($z_i$ goes above 2) faster than others. That is, the work being doing in our Mandelbrot viewer is unbalanced across the different array elements. Repa provides a special array type, I, that hints that computing the array is an unbalanced workload. The final version of our Repa implementation, which is in the github repo but not shown here, uses this array type to improve runtime significantly. All the programmer has to do to mark an array as producing an irregular workload is to use the hintInterleave whose type is Array r sh e -> Array (I r) sh e. This one-line change gives us a very significant speed-up in the final Repa back end.

Unbalanced workloads are a particular problem for GPUs due to their SIMD (or SIMT if you’re from NVIDIA) architecture. The Mandelbrot example in the CUDA SDK actually uses some tricks to handle an unbalanced workload on GPUs. I’ve implemented these tricks in Nikola, but it didn’t result it a performance gain—perhaps I’m just doing it wrong. However, similarly to Repa, all a programmer has to do to mark a Nikola computation as being irregular is to use the hintIrregular combinator, which has type Array r sh e -> Array (I r) sh e. We can use the hintIrregular in the stepN function as follows:

As a result, the CUDA code that Nikola generates to perform the loadP will use the CUDA SDK trick to partition the load operations across the GPU’s compute elements instead of using the standard compilation scheme.

I’m disappointed that there was no performance improvement, but I’ve included this version to highlight one of the key advantages of DSLs in general and Nikola in particular—a single combinator can express a sophisticated code transformation that would take a great deal of effort for a programmer to perform by hand. This makes it easy to try out different methods of computing arrays—trying out a new parallel computation strategy means changing one line of Haskell code instead of hundreds of lines (or more) of CUDA.

## Avoiding CPU-GPU memory transfers

It seems a bit silly to be doing all this computation on the GPU only to copy the result back to the CPU so that we can display it—by copying it back to the GPU! CUDA provides various graphics interoperability APIs that allow a CUDA kernel to directly manipulate, e.g., a OpenGL pixel buffer objects (PBO). The final version of the Nikola Mandelbrot viewer, on github at mandelbrot/Mandelbrot/NikolaV5.hs, uses the interop API to avoid all CPU-GPU copies—the Mandelbrot image is generated on the GPU and stays there!

Like the experiment with the irregular work scheme, this one didn’t result in a performance gain—in fact it slightly decreased performance. I think this is due to the unintelligent way that I registered the OpenGL PBO for interop with CUDA, but I’m not sure. Nevertheless, it shows the flexibility of the framework as a whole.

## Conclusions

Nikola lets programmers write code that is similar to Repa code they might write for the same problem, but the resulting program runs on the GPU instead of the CPU. Being able to write GPU programs that “look like” Repa programs is not, in my opinion, such a great thing in and of itself. The real benefit is that programmers can reason about these GPU programs using many of the same techniques they use to reason about Repa programs. These reasoning techniques, like using type tags to tell how an array is represented or knowing that monadic code has certain memory allocation and performance implications, are what are important. The fact that the reasoning techniques can transfer from Repa to a DSL is pretty cool and, I think, a real win.

I would like to point out that Nikola borrows many ideas from Accelerate as well as Repa—for example, I’ve deliberately chosen the Accelerate name for combinators like >* when possible. There are some programs that can be written in Accelerate but not Nikola, and Nikola has some features that Accelerate currently lacks. Accelerate also uses a very different method for generating GPU code. There is plenty of room for two Haskell GPU DSLs, especially since Accelerate and Nikola are really quite different under the hood. I would be very interested to hear from anyone who has tried to use both Accelerate and Nikola for the same problem. In fact, I would love to hear from anyone who uses—or wants to use—Nikola for any problem.

Although the mandelbrot` program is small, it demonstrates that programming GPUs doesn’t have to be painful. I hope I’ve convinced you that Nikola is a powerful and flexible approach to programming GPUs in a high-level language, and perhaps even persuaded you to use it!

1. Lippmeier, Ben et al. “Guiding Parallel Array Fusion with Indexed Types.” Haskell Symposium 2012.