Make it fast

Posted on 2025-02-05 by Ernesto Hernández-Novich

Found a motivating question on r/haskell a couple of days ago. OP alleges to have a «decent amount of experience writing numerical code», mentions a few non-functional languages, says they are considering using Haskell, and «as a test» they are trying to solve a trivial differential equation using iterative methods with literally this

dt = 0.01

x :: Integer -> Double
x 0 = 1
x n = (x (n-1)) + 5 * (x (n-1)) * dt

and wonders why time and space double as n grows, asks if there are better ways of doing it, and (of course) rants about how Haskell might just be too slow.

The example’s naïveté, using Integer, superfluous parentheses, entry-level floating-point operation grouping, question formulation, and all-around gall, spell «homework» to the trained eye. I normally don’t engage with this kind of question at all, but being in a good mood I felt compelled to give the hints one would give an industrious student:

  • Your function has two non-tail recursive calls. It’s like a naïve Fibonacci implementation. There’s your exponential time and space complexity.

  • Alternatives:

    • Rewrite it into tail-recursive form with accumulators to push down intermediate values,

    • Use Dynamic Programming techniques to save computed values into an immutable array, or

    • Generate values as an unfoldr

Of course OP did not answer or ask for clarification. If you are familiar with numerical computation, you understand that it is impossible to give performance hints that work for all cases, but the above techniques should be familiar to someone who has a decent amount of experience writing numerical code, amirite?

Anyway, I decided to write these notes for people who might be interested in understanding the techniques, for numerical and non-numerical computation alike. As a bonus, I will show how to compare them effectively using proper tooling available in the Haskell ecosystem.

There’s no Computer Science without Fibonacci

The Fibonacci Sequence is everywhere, it is naturally defined as a recursive function, and can be expressed as succinctly as naïvely in Haskell like so

fib0 :: Int -> Int
fib0 0 = 1
fib0 1 = 1
fib0 n = fib0 (n-1) + fib0 (n-2)

First of all, I’m using Int (64-bit machine integers) instead of Integer (arbitrary precision simulated integers) because I feel the need for speed.

The above function is not tail-recursive because both recursive calls have to return before their values can be added up and returned. Basic algorithm analysis, and an understanding of how programming languages work, will lead you to exponential complexity: each call generates two new calls that must complete, hence it will need O(2^n) calls in total, with the necessary stack space to boot.
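A quick way to see the blow-up: the number of calls satisfies the same kind of recurrence as Fibonacci itself. A small sketch (calls is a hypothetical helper, not part of the exercise):

calls :: Int -> Int
calls 0 = 1
calls 1 = 1
calls n = 1 + calls (n-1) + calls (n-2)  -- this call plus both recursive subtrees

Since calls n is always at least fib0 n, the call count grows at least as fast as the Fibonacci numbers themselves, i.e. exponentially.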

It is extremely expensive in time and space, so it slows down quickly as you try larger values of n.

λ> fib0 42
433494437
it :: Int
(148.13 secs, 159,526,065,008 bytes)

Rewriting in tail-recursive form

Contrary to popular belief, not every function in Haskell must be written in tail-recursive form to be efficient. But when a function is total and produces a single result, it is usually a good idea to write it in tail-recursive form. Most number-crunching functions are, so this is a skill that must be learned and practiced.

In general, the trick to rewriting a function in tail-recursive form is to create an auxiliary function with additional arguments meant to carry intermediate values. Each recursive call computes on the arguments before actually recursing, until the base case simply returns an expression based on said arguments. GHC will notice this pattern and turn your function into a tight, efficient assembly loop. Better than the for loop you were thinking about.

The tail-recursive version would be

fib1 :: Int -> Int
fib1 n = go 0 1 0
  where
    go i f1 f2 = if i == n then f1 else go (i+1) (f1+f2) f1

The auxiliary function go has type signature

go :: Int -> Int -> Int -> Int

Its first argument, i, represents the index of the Fibonacci number we intend to compute. Its second argument, f1, will always be the actual i-th Fibonacci number. The third argument, f2, will always be the (i-1)-th Fibonacci number. Now fib1 n becomes go 0 1 0, which reads «the 0-th Fibonacci number is 1, and the (-1)-th Fibonacci number is 0».

Looking at how go works, you’ll see that:

  • If the index we intend to compute is precisely the n we are looking for, then we can immediately return f1, otherwise,

  • go is called recursively to compute the (i+1)-th Fibonacci number, by having the second argument be the sum of the current two (precisely what the next Fibonacci number is supposed to be), and the third argument be the current i-th Fibonacci number. This maintains the invariant for go’s computation.

Now, all operations on arguments can be done before actually making the call, thanks to a GHC optimization technique called Strictness Analysis. This means that even though Haskell is lazy, the compiler can identify opportunities to be strict, particularly over arithmetic expressions. This turns out to be way more efficient

λ> fib1 42
433494437
it :: Int
(0.00 secs, 123,832 bytes)
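Strictness Analysis is part of GHC’s optimizer. If you would rather not rely on it, bang patterns let you demand the strictness explicitly; a minimal sketch, with fib1' as a hypothetical strict variant that is not part of the benchmarks below:

{-# LANGUAGE BangPatterns #-}

fib1' :: Int -> Int
fib1' n = go 0 1 0
  where
    -- the bangs force i, f1 and f2 to be evaluated on every call
    go !i !f1 !f2 = if i == n then f1 else go (i+1) (f1+f2) f1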

This is probably the fastest way to implement this. It requires constant space (one stack frame), and time proportional to the number of additions. But we don’t have access to the intermediate values, just the final one. What if we need to compute Fibonacci many times on account of «reasons»?

Dynamic Programming technique

There are many numerical and non-numerical problems having solutions that can be written as recurrences, because the problem can be expressed as a combination of two or more smaller versions of itself. If we are interested in computing solutions for different values of the input over and over again, a standard technique known as Memoization is used. Saving intermediate results in an array, and re-using them instead of computing them again, saves time at the expense of space.

If we need to repeatedly compute Fibonacci numbers up to, but never over, an upper limit N, we can create an immutable array as we go. Haskell’s arrays are immutable, extremely compact, and have constant-time access to any position once they are built. And they can be built in a self-referential fashion, leveraging Haskell’s laziness to compute every single cell exactly once.

import Data.Array ( (!), array )

fib2 :: Int -> Int
fib2 n = a ! n
  where
    a = array (0,n) 
      $ [(0,1),(1,1)] ++ 
        [(i, a ! (i-2) + a ! (i-1)) | i <- [2..n]]

In this implementation we compute the n-th Fibonacci number by simply looking at the n-th position of the array a. Of course we can always have the array as a top-level definition and use it as many times as we want. But we’re here for the construction trick.
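As an aside, the top-level variant just mentioned could look like this sketch, where maxN, fibTable and fib2' are names I am making up, assuming we fix an upper bound in advance:

import Data.Array ( Array, (!), array )

maxN :: Int
maxN = 1000

-- built once at top level; every caller within the bound shares it
fibTable :: Array Int Int
fibTable = array (0, maxN)
  $ [(0,1),(1,1)] ++
    [(i, fibTable ! (i-2) + fibTable ! (i-1)) | i <- [2..maxN]]

fib2' :: Int -> Int
fib2' = (fibTable !)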

Focusing on how the array is constructed, there are a few key elements:

  • The tuple (0,n) defines the range of the array. This will be a one-dimensional array with indices ranging from 0 to the value of n. The array is created dynamically and stored in a compact, efficient way – you cannot change the size of the array once it is built.

  • The «body» of the array is created using a list of pairs: the first element being the index, the second being the value.

  • The first two elements of the array are explicitly set as (0,1), i.e. «the 0-th Fibonacci is 1», and (1,1), i.e. «the 1-st Fibonacci is 1».

  • The array function takes an arbitrarily long list of these pairs and processes them in the order they appear in the list. This allows later pairs to replace earlier ones (strictly speaking, the Haskell Report leaves duplicated indices undefined, but GHC keeps the last). More importantly, elements can refer to elements of the very array being built. And that’s precisely the trick in this list comprehension

          [(i, a ! (i-2) + a ! (i-1)) | i <- [2..n]]

    when setting up the i-th element of the array, its value should be whatever is already in positions i-2 and i-1 of the array. So, each position is computed using previously computed values, from 2 to n, filling the table.

Once array has had a chance to process the whole list, the array is compacted, and can be queried very fast.

λ> fib2 42
433494437
it :: Int
(0.00 secs, 137,032 bytes)

So fast that it is apparently indistinguishable from our tail-recursive solution. It is indeed linear in the number of computations, and also linear in the space used to compute.

There’s no Functional Programming without morphisms

The impulse to think in while or for constructs for data processing is a very hard vice to overcome. We functional programmers (most of the time) do not care or need to know how many elements there are to process. We think in morphisms over data structures. A morphism that takes a given data structure and processes all of its elements to produce a result is called a Catamorphism. Catamorphisms are embodied as the fold family in Haskell; more formally, the Foldable typeclass. We don’t think in while or for, we think in folds. Yet, we don’t have a data structure with things we can combine to produce a Fibonacci number, so folds don’t seem to fit the bill.
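For the record, this is what the archetypal catamorphism looks like, collapsing a whole structure into a single value (total is just an illustrative name):

-- a fold consumes every element of the structure to produce one result
total :: [Int] -> Int
total = foldr (+) 0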

The dual of a Catamorphism is called an Anamorphism. It produces a data structure out of a seed value, each step using the current seed to generate the next value and a new seed. This is exactly what we need

import Data.List  ( genericDrop, unfoldr )

fib3 :: Int -> Int
fib3 n = head $ genericDrop n $ unfoldr next (0,1)
  where
    next (f1,f2) = Just (f2,(f2,f1+f2))

Haskell lists have a unique anamorphism, called unfoldr. It needs a function that, given a seed, will produce the next element and seed, or signal to stop collecting. Given the initial seed, unfoldr will start collecting all elements, by repeatedly applying the «next step» function, forever, or until no new value is produced.
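For instance, here is a tiny unfold that does stop, counting down and signalling the end with Nothing:

λ> unfoldr (\n -> if n == 0 then Nothing else Just (n, n-1)) (5 :: Int)
[5,4,3,2,1]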

For Fibonacci numbers, it is convenient to generate them all: see how next takes the current seed, produces the next (f2) Fibonacci to collect, and then does the math to build the next seed? It’s an abstracted implementation of our tail-recursive version.

Now, unfoldr is going to produce a list: an infinite one given how next is written. That’s why we apply genericDrop to ignore the first n elements, and then just take the first one of the remaining ones. Thanks to Haskell’s laziness, evaluation will stop there, and we’ll get our result

λ> fib3 42
433494437
it :: Int
(0.00 secs, 128,976 bytes)

Read. That. Again.

It’s as fast as our carefully hand-written tail-recursive version, almost as cheap in terms of space, and definitely cheaper than the array-based one. What kind of sorcery is this?

You see, unfoldr in and of itself is an anamorphism, and if used alone will create a full Haskell list. It so happens that genericDrop is a form of catamorphism. When a catamorphism is composed with an anamorphism, certain algebraic rules allow extremely clever optimizations (known as fusion, or deforestation) that are only safe in pure languages. The result of these optimizations is that there will be no list: at every stage of the computation, the element produced by unfoldr is immediately fed to genericDrop (to be dropped), but unfoldr continues with the next seed until the desired value is produced, kept by genericDrop and returned by head.
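Conceptually, the fused pipeline behaves like this hand-written loop (a sketch of what the optimizer arrives at, not GHC’s actual output):

fib3Fused :: Int -> Int
fib3Fused k = loop k (0,1)
  where
    -- dropping is just counting down; «head» is the f2 in hand
    loop 0 (_,f2)  = f2
    loop i (f1,f2) = loop (i-1) (f2, f1+f2)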

There will be no list constructor, no garbage-collection cost, no intermediate list decomposition, but simple shuffling of values between machine registers. And we wrote it in a very high-level composable and abstract way. Beat that.

Which one is better, then?

We can’t really compare them in terms of flexibility. I hope you realize that for different problems you might want just the final result, some of the intermediate results, or all of them. I intentionally wrote the last one to illustrate that in really advanced languages, abstractions can result in performant code, without having to think much about implementation details.

But if we restrict our comparison to «what is the fastest way to compute the n-th Fibonacci», we can use one of Haskell’s ecosystem’s best tools: Criterion.

In a nutshell, the Criterion library provides an extremely accurate microbenchmarking toolkit for comparing the performance of pure functions. The analysis is extremely high-resolution, and uses statistical models to ensure garbage collection and other environmental factors do not skew the results. You end up comparing raw performance, and are informed of how much «noise» your testing environment caused.

In our scenario, it suffices to write this

import Criterion.Main

main :: IO ()
main = defaultMain [ bgroup "fib0"  [ bench "42" $ whnf fib0 42 ]
                   , bgroup "fib1"  [ bench "42" $ whnf fib1 42 ]
                   , bgroup "fib2"  [ bench "42" $ whnf fib2 42 ]
                   , bgroup "fib3"  [ bench "42" $ whnf fib3 42 ]
                   , bgroup "fib4"  [ bench "42" $ whnf fib4 42 ]
                   , bgroup "fib5"  [ bench "42" $ whnf fib5 42 ]
                   ]

to compare how fast the different implementations are. After building and running the tests

$ stack ghc -- -O2 fib.hs -o fibbench
$ ./fibbench --output fib-comparison.html
...

we end up with this report. You want to look at the estimated mean execution time for each implementation, and then continue reading…

Wait… Where is fib4?

It’s not the fastest, but it’s a cute Ouroboros once you figure it out.

fib4 :: Int -> Int
fib4 n = wat !! n 
  where wat = 1 : 1 : zipWith (+) wat (tail wat)
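To see why it works, unfold the definition a few steps (an illustration of the evaluation, not code to compile):

wat                        = 1 : 1 : zipWith (+) wat (tail wat)
zipWith (+) wat (tail wat) = 1+1 : 1+2 : 2+3 : 3+5 : ...
                           = 2   : 3   : 5   : 8   : ...

Each new element is the sum of the two elements just before it in the very list being built; the list consumes its own tail, hence the Ouroboros.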

I’m afraid to ask about fib5

Well

λ> fib5 42
433494437
it :: Int
(0.00 secs, 121,528 bytes)

it is the fastest in execution time, and is the cheapest in terms of space

import Data.List ( foldl' )

fib5 :: Int -> Int
fib5 n = snd $ foldl' wtf (0,1) (replicate n undefined)
  where
    wtf (a,b) _ = (b,a+b)

I apologize for misleading you, dear reader: it is possible to write an extremely performant, space-efficient Fibonacci function using the simplest of Catamorphisms. Just collapse a nonexistent list of nothingness into the result you need.

Just my two nanoseconds faster.