Repa is a fantastic library for regular array computation in Haskell. Version 3^{1} is particularly nice—it leverages Haskell’s type system so that an array’s type tells you about its representation. There are plenty of other great features in Repa, like easy parallelization. In rewriting Nikola, an embedded Haskell DSL for GPU computation, I’ve used Repa as a model for what regular array programming should look like—even if it’s a GPU that is being programmed.
In this post I will explain how to use Nikola to program GPUs through the
development of a Mandelbrot set viewer. This example is adapted from the
Accelerate mandelbrot
example
and from the mandelbrot
example that is
part of Simon Marlow’s par
tutorial. All code is
available from the nikola github
repository. You can install Nikola and the
Mandelbrot example by following the instructions in the
README
. Note that
the Nikola that I describe here is a completely different beast from the
“Nikola” that I described in the Haskell Symposium 2010 paper Nikola: Embedding
Compiled GPU Functions in
Haskell—only
the name and the problem domain (programming GPUs from Haskell) are the same.
A Repa Mandelbrot viewer
Since I’m using Repa as a model, let’s start with a Repa Mandelbrot viewer. If you’re not familiar with the Mandelbrot set, you can read the Wikipedia entry. Briefly, for each point $c$ in the complex plain we iteratively compute $z_i$ as follows
The point $c$ is a member of the Mandelbrot set iff the $z_i$’s are bounded. It turns out that if $z_i > 2$ for some $i$, then the $z_i$’s are not bounded, so $c$ is not a member of the Mandelbrot set. We will visualize the Mandelbrot set using the escape time algorithm—we iterate the computation of $z_i$ until it escapes, i.e., $z_i > 2$, or until we reach a fixed “depth” limit, in which case we declare that the point is a potential member of the Mandelbrot set. The iteration count is then used to color the point $c$ in the complex plane and make a pretty picture.
Let’s write this using Repa. You can find the full version of the code in this section on github as examples/mandelbrot/Mandelbrot/RepaV1.hs.
When we move to the Nikola version, many of the types will have the same name, but they will be Nikola types rather than Repa types. Therefore I want to be explicit about what we are importing, so I’ll start with the imports.
Next we have a few types aliases. We use the R
type alias to abstract over the
representation we use for real numbers. We will represent complex numbers as
pairs of real numbers. I omit the Num
instance for Complex
.
Next are a bunch of type aliases for the Repa arrays we will use. Note that they
are representation polymorphic—that’s why they are parametrized by the type
variable r
.
Each pixel in our image corresponds to a point in the complex plane, and the set
of complex points corresponding to the pixels in the image that we are
generating is represented by a ComplexPlane
. In other words, the
ComplexPlane
array holds the values of $c$ that correspond to the pixels in
our image. A StepPlane
holds the current $z_i$ and the current iteration, $i$,
for each point in our image.
The step
function, which computes $z_{i+1}$ according to the above equation,
takes a ComplexPlane
and a StepPlane
and returns a new StepPlane
. The U
type index tells us that we are using unboxed vectors to hold the data in our
arrays—in other words, we’re using an efficient unboxed data
representation. Note that zipWith
produces a delayed array—an array
represented as a function from arrays indices to values. A delayed array tells
us how to compute the elements in the array, but it doesn’t actually perform
the computation. We force the computation of all the elements in the array using
computeP
, which computes the individual elements of the delayed array in
parallel and places the result in a new array, which in our case is an unboxed
array (type tag U
).
Before we can call step
, we need to compute the ComplexPlane
and the initial
StepPlane
. The mandelbrot
function takes the boundaries of the portion of
the complex plane that we want to visualize, the size of the image that should
result from the visualization, a depth limit, and computes the visualization.
The ComplexPlane
is generated b genPlane
, and our initial StepPlane
by
mkinit
. These are both straightforward, so I’ll show the code but won’t walk
through them.
Note that we are always iterating the step
computation depth
times, even if
we could bail out early. We also happen to be computing—and allocating!—a
new StepPlane
every time we call step
. This is rather inefficient, but the
inefficient version will serve as the steppingoff point for a Nikola
implementation. A more efficient version is available on github as
examples/mandelbrot/Mandelbrot/RepaV3.hs. The
full implementation also generates an image from the final StepPlane
and
displays it on the screen. If you configure and build Nikola using the
instructions in the
README
, you can
execute the “optimized” Repa version of the Mandelbrot viewer as follows. Use
whatever value for the N
options makes sense for you machine.
> ./dist/build/mandelbrot/mandelbrot backend repa +RTS N4
The mouse wheel and page down/page up keys will zoom in and out, and the arrow keys will move you around. You can also try dragging the image with the mouse, but note that it’s not very responsive! Hopefully we can fix that with our Nikola implementation.
The mandelbrot
program provides no fewer than 8 back ends—3 versions of the
Repa back end, and 5 versions of the Nikola back end. You can try out the
different versions using the backend
flag with an argument repavN
or
nikolavN
where N
is 1–3 in the Repa case and 1–5 in the Nikola case. The
default (when =–backend– isn’t specified) is the Nikola version 5 back end.
I have also benchmarked all versions of the Mandelbrot back ends by measuring
the time they take to generate a frame on two different machines. The first is
my laptop, which has a i72620M CPU
running at 2.70GHz and an NVS 4200M. The second is my
desktop, which has a i72600K CPU
running at 3.40GHz and a GeForce GTX 560 Ti. Both sets of results were run with
+RTS N4
, and you can see the exact set of optimization options passed to GHC
in the cabal file
file in the github repo.
An aside: GPU details
Before I get to the Nikola version of the Mandelbrot viewer, I need to talk about some of the details of GPU computation. The GPU is a strange beast with an unusual programming model, a different instruction set, and memory that is distinct from CPU memory. Nikola hides some of these details from the programmer—in particular, it tries to expose the programming model in a nice, highlevel way—but not all of them. In fact, I think the programmer must be aware of some of these details. For example, a GPU computation can only use data that is in GPU memory. If we have an array in CPU memory, we have to copy it to GPU memory to use it in a GPU computation. These copies are expensive, so the programmer should have control over how and when they happen.
Repa’s type tags provide the perfect vehicle for programmer control of GPUCPU memory copies, so Nikola introduces a new Repa type tag that signifies that the array’s data lives in GPU memory. All the “standard” Repa functions work with these arrays, but this is extremely slow because these operations all run on the CPU and therefore require GPUtoCPU memory copies to do any work. So why bother with GPU Repa arrays? Because Nikola lets you compute with them efficiently!
We still need a nice framework for copying Repa arrays between the CPU and
GPU. Nikola uses the CUDA runtime API to do this. Because this requires calling
C functions, we can only efficiently work with Repa arrays that hold data in a
ForeignPtr
, for example, arrays with the type tag F
. However, we would also
like to be able to efficiently work with arrays that have more structure, e.g.,
Repa unboxed arrays that contain pairs. Repa unboxed arrays, which have a type
tag U
, use Data.Vector.Unboxed
to hold their contents. In the unboxed
representation, the content of an unboxed array of pairs is stored in a pair of
vectors: one holds the first element of each pair, and the other holds the
second element. Despite the details of the underlying representation, we still
access the elements of the Repa arrays as if they were pairs. Unfortunately for
us, Data.Vector.Unboxed
vectors, and therefore Repa unboxed arrays, store
their contents in GHC byte arrays instead of ForeignPtr
’s, so we can’t
(safely) efficiently copy them to the GPU!
What we need are arrays that use the efficient unboxed representation of
Data.Vector.Unboxed
but store data in ForeignPtr
’s instead of byte
arrays. Nikola provides such data types, which I will refer to as “unboxed
foreign” in contrast to just “unboxed,” in the Data.Vector.UnboxedForeign
and
Data.Array.Repa.Repr.UnboxedForeign
modules. There are corresponding
Data.Array.Repa.Repr.CUDA.UnboxedForeign
and
Data.Array.Repa.Repr.CUDA.UnboxedForeign
modules that provide GPU versions
of the unboxed foreign representations but store their data in
ForeignDevicePtr
’s, which as you can imagine point to data on the GPU. Using
the unboxed foreign representation allows us to efficiently copy structured
array data between the GPU and CPU. Without this extra support, we would only be
able to use Nikola with arrays that contained simple base types, like Int32
and Double
, instead of arrays that contain, e.g., tuples of Double
’s.
In summary, to support Nikola we have introduced the following new Repa type tags:
Repa type tag  Meaning 

CF 
The GPU equivalent of the CPU F tag. Stored in a ForeignDevPtr . 
UF 
Unboxed foreign data. 
CUF 
CUDA unboxed foreign data. 
A word about Nikola and DSLs
Nikola is an embedded DSL. In fact, it is a deeply embedded DSL: a Nikola
program internally produces an abstract syntax tree that represents a
computation. There are then various methods for actually running the program
that is produced. It is important to remember this “phase distinction” when
writing Nikola programs. In particular, instead of functions that return, e.g.,
a Double
, a Nikola function will return an Exp Double
. The Exp
tells us
that we are getting back an abstract representation of a computation, and the
Double
type index tells us that if we run the computation, we will get a
Double
out. Because Exp
is a computation, we can’t hope to do things like
test an Exp
for equality with another Exp
—such a test is not even
decidable! However, we can build a new Exp
that represents a computation
that tests two Exp
’s for equality—at run time it will evaluate the two
subcomputations and perform the test. Staging a computation this way—writing
a program that generates a program, and then running the generated program—can
be hard to wrap your head around. You should maintain a mental model where an
Exp
represents a computation and a Nikola program is building up a big
computation.
The method we will use to run Nikola programs is compilation via Template
Haskell. This
roughly works as follows. We first write a bunch of Nikola functions and stick
them in a module. Recall that a Nikola function produces an abstract
representation of a computation. We leverage this by providing a function
compile
that takes a Nikola function, calls it appropriately to produce a
concrete representation of the computation the function embodies, and then
transforms this representation into GPU code and compiles it. The compile
function actually translates the internal Nikola program representation to CUDA,
and then calls NVIDIA’s CUDA compiler, nvcc
, to compile the CUDA code to GPU
binary code. Then compile
builds up a Template Haskell terms that contains the
compiled GPU binary code and the necessary glue to call it from Haskell. By
calling compile
from a Template Haskell splice, we end up with a Haskell
function that invokes a GPU computation. All compilation to GPU binary code is
done when the Template Haskell splice runs, which is at Haskell compile
time. This means there is no run time GPU code generation in the Nikola code I
will show. Although tune time code generation is also possible with Nikola, I
won’t discuss it further here.
Template Haskell has something called the stage restriction. Basically this
means that when we use a Template Haskell splice in a module, everything used
inside the splice has to come from a different module. Because we have to use
Nikola’s compile
function in a splice, this means that we need to put our
Nikola functions in a separate module. Don’t worry about the particulars of
using compile
right now—we’ll see how to use it below. The main point here
is that we have to put our Nikola functions in a separate module.
A Nikola Mandelbrot viewer: A first cut
I will first describe the Nikola part of the implementation—that is, the functions that will actually run on the GPU. The code is available on github in mandelbrot/Mandelbrot/NikolaV1/Implementation.hs.
I’ll start again with our imports
We are using the CUDA back end for Nikola. Eventually Nikola will have multiple back ends (there is currently partial support for OpenMP), but we only need to worry about the CUDA back end. Note that we are not importing any Repa modules.
Next are our type aliases.
The Array
type here is the Nikola Array
type, not the Repa
array
type. Similarly, the DIM2
we see here is the Nikola DIM2
. Like Repa arrays,
Nikola arrays use type tags to describe their representation, so we once again
see the r
type variable show up in our array type aliases.
There is one glaring difference here: the presence of the Exp
data type. When
we wrote our Repa implementation, we represented complex numbers as pairs of
values of type R
, i.e., as pairs of Double
’s. Now we are representing
complex numbers as pairs of computations that, when run, produce values of
type Double
. Analogously, bitmaps, complex planes, and step planes all contain
Exp
computations. Note that Nikola lets us use nested tuples—a StepPlane
contains pairs consisting of a complex number, which is itself a tuple, and a
step.
Here is the step
function:
If you go back and look at the Repa version, you’ll see that it’s almost exactly
the same module a few INLINE
directives and bang patterns. The one significant
difference is the use of >*
instead of >
in the if
expression. Remember
that I said we can’t perform tests on Nikola expressions, but we can construct
Nikola expressions that perform tests? The >*
combinator constructs a Nikola
expression that performs a test, and through the magic of the
RebindableSyntax
GHC language extension, the if
expression constructs a Nikola term that uses
this test to decide between the two branches in the if.
The genPlane
and mkInit
functions are also much like their Repa
counterparts:
We see in their type signatures that, roughly speaking, what was previously a
foo
in the Repa implementation becomes an Exp foo
in the Nikola version. One
exception is the Complex
type: in the Repa version a Complex
was a (R, R)
,
but in Nikola it is an (Exp R, Exp R)
, not an Exp (R, R)
! The reason is that
Exp (a, a)
is isomorphic to (Exp a, Exp a)
and it is easier to work with a
pair of expressions instead of an expression representing a pair, so Nikola
sticks with the “friendlier” form. If you doubt the superiority of (Exp a, Exp
a)
over Exp (a, a)
, have a look at the Num
instance in the full Nikola
implementation.
You will also notice that we have pushed the call to computeP
into the
functions genPlane
and mkinit
. This is because we want the elements to be
computed on the GPU! However, instead of seeing the IO
monad as we would if we
used Repa’s computeP
, we see the P
monad. You can think of P
as the
“parallel” or “program” monad. It is the monad in which all Nikola monadic
computations are performed.
Finally, we see an unfamiliar array type tag, G
. This signifies that the
contents of a Nikola array is stored in GPU global memory. Nikola also has
delayed arrays, with type tag D
(just like Repa), and a few other
representations, one of which we will see a bit later.
Calling a Nikola function
To call a Nikola function, we first have to compile it using our funny Template Haskell trick. Because of the stage restriction, the compiled Nikola functions live in a different module from the Nikola implementation we just saw. You can find this file in the github repo as mandelbrot/Mandelbrot/NikolaV1.hs.
On to the imports again!
OK, there’s a bunch of junk here. Note that we are once again importing various
Repa modules. We are importing only one Nikola module, and then only so we can
use the compile
function. The Nikola functions we implemented above are
imported (qualified) from the Mandelbrot.NikolaV1.Implementation
module.
Type aliases again:
These are the same Repa type aliases from before—the Array
and DIM2
we see
are Repa’s Array
and DIM2
.
Now let’s compile those Nikola functions:
I’ve placed signatures on these functions as a form of documentation, but will
infer them so they are not actually necessary. You can see that Nikola types are
mapped by compile
to Repa types. This mapping is somewhat complicated, but is
(very) roughly approximated by the following table:
Repa type tag  Meaning 

Exp a 
a 
P (Exp a) 
IO a 
Array G sh (Exp a) 
Array CUF sh a 
P (Array r sh (Exp a)) 
IO (Array CUF sh a) 
The Repa type tag CUF
tells us that these functions deal with data stored on
the GPU. Compiling a Nikola function that works with Nikola arrays produces a
Haskell function that works with Repa arrays.
Finally, here is the mandelbrot
function. It’s much the same as the Repa
version except for the CUF
type tag (recall that we pushed the computeP
into
the Nikola functions so it is missing here).
There are a few lines of additional glue in the full implementation to copy the data back from the GPU to the CPU and display it.
You can run this version as follows
> ./dist/build/mandelbrot/mandelbrot backend nikolav1 +RTS N4
If you do so, you’ll notice it’s not very fast. There’s a good reason for that—we’re allocating a new array in GPU memory for every step, just as we did in the Repa version above. On my laptop, the firstcut Nikola version is still about 3x faster than the firstcut Repa solution. The optimized Repa version, which I will not show here, is more than 21x faster than the Repa version above and more than 7x faster (!) than the firstcut Nikola version we just wrote.
This is quite a testament to Repa. We still have quite a few Nikola tricks up our sleeve, and the final Nikola implementation we write will be 10x (my laptop) to 20x (my desktop) faster than the first version.
Mutable arrays
Although it won’t immediately get us a big performance boost, one way to improve our performance is to preallocate buffers when the dimensions of our image (i.e. display window) change and then reuse them as the user scrolls and zooms instead of allocating a bunch of new buffers every time we have to draw anything.
The point here in this section is to show that Nikola is a monadic embedded DSL, and that it allow us to work with mutable arrays using familiar Haskell idioms, namely monads. I won’t show the code that preallocates buffers, but will focus on manipulating mutable arrays in Nikola. There is a fair amount of unsafefu below: you have been warned.
An aside: mutable arrays in Repa
Repa doesn’t provide mutable arrays, but it does provide the MVec
associated
type as part of the Target
type class. MVec
is a vector—it doesn’t have a
Repa shape. For various reasons, it is convenient to have the shape around when
working with mutable arrays in Nikola, so I have defined a Mutable
type class
and associated type MArray
in the Data.Array.Repa.Mutable
module in the
Nikola repo. There is an analogous MArray
type in the Nikola module
Data.Array.Nikola.Eval.Target
.
Mutable arrays in Nikola
The full version of the Nikola implementation, modified to use mutable arrays, is available on github as mandelbrot/Mandelbrot/NikolaV2/Implementation.hs.
We begin by adding a few more type aliases for the mutable versions of our various arrays types.
Our step
function now uses loadP
instead of computeP
to load the result
of a step
into an existing mutable array instead of computing it and
producing a newlyallocated array. Note that Repa also provides loadP
for Repa
arrays. Please forgive the unsafeFreezeMArray
—I did warn you about the
unsafefu.
The new versions of genPlane
and mkinit
also use loadP
instead of
computeP
—the changes aren’t very interesting, so I won’t show them here.
The highlight here is the fact that we are writing monadic code in a DSL. P
is a real monad! We get to use all the nice Haskell support for programming with
monads, like do syntax. Furthermore, we can used monads to distinguish pure
Nikola code from impure Nikola code just as we do in Repa (and in Haskell in
general).
Iteration on the GPU
Although slightly faster, the Nikola implementation using mutable arrays still
suffers from a major flaw—it iterates the step
function on the CPU. Each
iteration requires calling out to a GPU kernel. Although we have avoided
allocating a new chunk of GPU memory for each step (thus the slight time
savings), to really speed things up, we need to perform this iteration
directly on the GPU. If we can do this, then we will replace depth
calls to
the step
GPU function with a single call to a GPU function.
The natural thing to try is to rewrite the step
function so that it performs
all the $z_i$ calculations is a single loop. Here’s what that might look like:
If we try to compile this version, we get an unpleasant error message:
examples/mandelbrot/Mandelbrot/NikolaV3.hs:54:16:
Exception when trying to run compiletime code:
(==) Exp: incomparable
Code: compile I.mandelbrot
In the expression: $(compile I.mandelbrot)
In an equation for `mandelbrot':
mandelbrot = $(compile I.mandelbrot)
make: *** [mandelbrot] Error 1
The fact that we’re working with a DSL is biting us. Remember when I said that
the Exp
type is abstract and we can’t directly inspect it, only use
combinators that construct tests? Our implementation of stepPoint
pattern
matches on its first argument which is of type Exp Int32
. That is desugared by
GHC into an equality test which is causing the error we just saw. Think of it
this way: the Exp Int32
that is passed to stepPoint
could represent some
gnarly, nasty computation, so stepPoint
can’t in general decide whether or not
it is equal to zero, thus the error.
The solution is to use a combinator to perform the iteration. Nikola provides a
combinator, iterateWhile
, which is just what we need. Using it, we can write
stepN
as follows:
The iterateWhile
combinator is actually fairly powerful. In other words, we
haven’t cheated by writing a oneoff combinator suitable just for this
example. I won’t get into the details of how it works, but you can think of it
as having a type like Exp Int32 > (a > (b, a)) > a > a
with suitable
constraints on the type variables a
and b
to ensure that b
is “Exp
Bool
like” and that a
is “Exp
like”.
With this change, on my desktop the mandelbrot program is fast enough for live dragging of the image.
Dealing with unbalanced workloads
When calculating the Mandelbrot set, different points in the complex plan
require different amounts of work because some points escape ($z_i$ goes above
2) faster than others. That is, the work being doing in our Mandelbrot viewer is
unbalanced across the different array elements. Repa provides a special array
type, I
, that hints that computing the array is an unbalanced workload. The
final version of our Repa implementation, which is in the github repo but not
shown here, uses this array type to improve runtime significantly. All the
programmer has to do to mark an array as producing an irregular workload is to
use the hintInterleave
whose type is Array r sh e > Array (I r) sh e
. This
oneline change gives us a very significant speedup in the final Repa back end.
Unbalanced workloads are a particular problem for GPUs due to their SIMD (or
SIMT if you’re from NVIDIA) architecture. The Mandelbrot example in the CUDA SDK
actually uses some tricks to handle an unbalanced workload on GPUs. I’ve
implemented these tricks in Nikola, but it didn’t result it a performance
gain—perhaps I’m just doing it wrong. However, similarly to Repa, all a
programmer has to do to mark a Nikola computation as being irregular is to use
the hintIrregular
combinator, which has type Array r sh e > Array (I r) sh
e
. We can use the hintIrregular
in the stepN
function as follows:
As a result, the CUDA code that Nikola generates to perform the loadP
will use
the CUDA SDK trick to partition the load operations across the GPU’s compute
elements instead of using the standard compilation scheme.
I’m disappointed that there was no performance improvement, but I’ve included this version to highlight one of the key advantages of DSLs in general and Nikola in particular—a single combinator can express a sophisticated code transformation that would take a great deal of effort for a programmer to perform by hand. This makes it easy to try out different methods of computing arrays—trying out a new parallel computation strategy means changing one line of Haskell code instead of hundreds of lines (or more) of CUDA.
Avoiding CPUGPU memory transfers
It seems a bit silly to be doing all this computation on the GPU only to copy the result back to the CPU so that we can display it—by copying it back to the GPU! CUDA provides various graphics interoperability APIs that allow a CUDA kernel to directly manipulate, e.g., a OpenGL pixel buffer objects (PBO). The final version of the Nikola Mandelbrot viewer, on github at mandelbrot/Mandelbrot/NikolaV5.hs, uses the interop API to avoid all CPUGPU copies—the Mandelbrot image is generated on the GPU and stays there!
Like the experiment with the irregular work scheme, this one didn’t result in a performance gain—in fact it slightly decreased performance. I think this is due to the unintelligent way that I registered the OpenGL PBO for interop with CUDA, but I’m not sure. Nevertheless, it shows the flexibility of the framework as a whole.
Conclusions
Nikola lets programmers write code that is similar to Repa code they might write for the same problem, but the resulting program runs on the GPU instead of the CPU. Being able to write GPU programs that “look like” Repa programs is not, in my opinion, such a great thing in and of itself. The real benefit is that programmers can reason about these GPU programs using many of the same techniques they use to reason about Repa programs. These reasoning techniques, like using type tags to tell how an array is represented or knowing that monadic code has certain memory allocation and performance implications, are what are important. The fact that the reasoning techniques can transfer from Repa to a DSL is pretty cool and, I think, a real win.
I would like to point out that Nikola borrows many ideas from Accelerate as well
as Repa—for example, I’ve deliberately chosen the Accelerate name for
combinators like >*
when possible. There are some programs that can be written
in Accelerate but not Nikola, and Nikola has some features that Accelerate
currently lacks. Accelerate also uses a very different method for generating GPU
code. There is plenty of room for two Haskell GPU DSLs, especially since
Accelerate and Nikola are really quite different under the hood. I would be very
interested to hear from anyone who has tried to use both Accelerate and Nikola
for the same problem. In fact, I would love to hear from anyone who uses—or
wants to use—Nikola for any problem.
Although the mandelbrot
program is small, it demonstrates that programming
GPUs doesn’t have to be painful. I hope I’ve convinced you that Nikola is a
powerful and flexible approach to programming GPUs in a highlevel language, and
perhaps even persuaded you to use it!

Lippmeier, Ben et al. “Guiding Parallel Array Fusion with Indexed Types.” Haskell Symposium 2012. ↩