Sliding windows are hard. They become annoying when you have to carefully synchronize them with I/O operations, and their buffering. For those, and other reasons relating to cardinality, I suppose Day 6 must've been very annoying for those not using lazy functional programming.

The Problem

We are given a lone long line of characters. If you cheat and look at the input, is 4096 bytes. I rather not cheat and write fast, elegant code, that doesn't care how long the input is.

The example given is a much shorter input

    mjqjpqmgbljsphdztnvjfqwrcgsmlb

and the objective is to find the first position in the String such that four different characters appear for the first time. That is, imagine sliding a window of size 4 over the string

    mjqjpqmgbljsphdztnvjfqwrcgsmlb
    \__/ 'j' is repeated
    mjqjpqmgbljsphdztnvjfqwrcgsmlb
     \__/ 'j' is still repeated
    mjqjpqmgbljsphdztnvjfqwrcgsmlb
      \__/ 'q' is repeated
    mjqjpqmgbljsphdztnvjfqwrcgsmlb
       \__/ all characters are different

Once the first window with four different characters is found, we must report the rightmost position of the window, counting from 1. For the example

    mjqjpqmgbljsphdztnvjfqwrcgsmlb
       \__/
    1234567

we should answer 7.

Main Program

The main program is pretty much the same as Day 1.

There's no parsing

As usual, we are going to read our input lazily using readFile. It will take care of buffering, and feeding data to downstream functions as soon as it is needed and not before.

Since readFile provides a String, which is nothing more than [Char], we just need to figure out:

  1. How to check if all Char within a window are different?

  2. How to move the window over the lazy long String while keeping track of our position, until we get to the first valid window.

Verifying the window

To check if all elements of a list are different, there's no other way than to compare each pair. For a window of four elements, and an imperative mentality, this seems pretty straightforward. For the sake of argument, let's say the window is a [Char] (it's going to be), but I want to treat it «as if it were an array» (because I suddenly have a relapse of the imperative programming malaise).

Then you'd be tempted to write something

     fugly :: [Char] -> Bool
     fugly [a,b,c,d] = a /= b && a /= c && a /= d
                              && b /= c && b /= d
                                        && c /= d

or something nastier up to isomorphism. It only works for four, and you explicitly enumerate all cases. «A» for effort in noticing there's a minimum number of comparisons, though.

Let's try a more general approach: process windows of any size, and any kind of content (as long as they are «comparable»). This suggests a signature such as

    allDifferent :: Eq a => [a] -> Bool

Now, consider an arbitrarily sized list of things that can be compared,

    [a0,a1,a2,...,aN]

Now, for each element of the list, let's check if it is a member of the rest of the list. There's a standard Haskell function elem that does precisely this, i.e.

    a0 `elem` [a1,a2,...,aN]

will return True or False if a0 is contained in [a1,a2,...,aN]. And elem is «short circuited» so it returns True as soon as possible, and False if it needs to check all the rest (and there's no other way around that).

After the first element has been compared to the rest, I save whatever result elem produced, and now focus on the second element and repeat, i.e.

    a0 `elem` [a1,a2,...,aN]  -> True or False
    a1 `elem` [a2,...,aN]     -> True or False
    a2 `elem` [a3,...,aN]     -> True or False
    ...
    aN `elem` []              -> False

We are processing each element of the list once, from left to right. At each step we need to compute using the current rest of the list. And we need to save the result of the partial computations. That's a foldl' with a structured accumulator: we collect the Bool results in a list, and we carry the current rest of the list.

So we start with

    allDifferent as = foldl' go ([],tail as) as
      where
        go (bs,es) e = ((e `elem` es) : bs, tail es)

so we can go over every element of as one by one. We start with an empty [Bool] and the initial tail of the list. At each step, we save the result of testing if the element being processed (e) is contained in the current rest of the list (es), and continue with the tail of the current rest.

The first element of the resulting tuple will be a [Bool] holding the results of all elem tests, and the last tail which is empty and we don't care about. So we can improve our code

    allDifferent as = fst $ foldl' go ([],tail as) as
      where
        go (bs,es) e = ((e `elem` es) : bs, tail es)

to get the [Bool].

Now, if there's at least one True in that list, we know there's a repeated element. So we need the boolean disjunction («or») for all elements in the list. If at least one is True, the disjunction will quickly short circuit to True, and we need to negate it to signal they are not allDifferent.

If all elements are different, the list will be full of False, the disjunction will be False and we need to negate it to signal «all different». This is the most expensive case, but it only happens precisely when detecting what we are looking for.

We can finally write

    allDifferent :: Eq a => [a] -> Bool
    allDifferent as = not $ or $ fst $ foldl' go ([],tail as) as
      where
        go (bs,es) e = ((e `elem` es) : bs, tail es)

It doesn't matter how many elements there are. It works for any data type that supports comparisons. It uses short-circuit boolean operations and will do the minimum number of comparisons to figure out the answer. You can't get much better than this.

Moving the window

We can check windows of any size, so it's only natural to try and write code that moves a window of any size over input of any content as long as the contents can be compared.

This suggests a signature

    findMarker :: Eq a => Int -> [a] -> Int

having as first argument the size of the window we want to move, and as second argument the «we don't know nor care how big» list of things that can be compared. It's supposed to return the rightmost position of the window, the moment all its elements are different.

Haskell lists positions start at zero. Starting at zero is the right way to count in Computer Science. It is, it always has been. Shut up. Learn to count.

If we start with a list such as

    [e0,e1,e2,e3,e4,e5,e6,...]

we can use the standard functions take and drop. I'll use 4 to drive the point home

    ghci> take 4 [e0,e1,e2,e3,e4,e5,e6,...]
    [e0,e1,e2,e3]
    ghci> drop 4 [e0,e1,e2,e3,e4,e5,e6,...]
    [e4,e5,e6,...]

This way we can «prime» the initial window, and focus on the remaining input. If we are working with windows of size 4... did you notice that that would be the index to return if this window is the one satisfying the conditions?

From now own, we only need to count from 4 onward, while grabbing one character at a time to shift the window.

    findMarker :: Eq a => Int -> [a] -> Int
    findMarker c s = go c (take c s) (drop c s)
      where
        go count buffer rest = if allDifferent buffer
                                  then count
                                  else go (succ count)
                                          (tail buffer ++ [ head rest ])
                                          (tail rest)

In order for the result to be provided immediately, I've written the logic as the tail recursive function go.

Notice that the moment allDifferent buffer succeeds, the function immediately returns the current count, ignoring the rest of the input. But if the check fails, the count is incremented by one, we «shift out» the leftmost element of the buffer and add the next element in the input.

Tail-recursive functions are as efficient as your carefully written while loop, both in terms of stack space, as well as local variable use, as long as the compiler for the language does the right thing. GHC does. Good C/C++ compilers can as long as there's no aliasing in the body. Chances are your favorite dynamic language doesn't.

The notion of «loop invariants» that many of you learned about if you took a course in algorithm design, can be applied here:

  • count always points to the rightmost position of the buffer, and
  • buffer is of constant size as it loses (tail) exactly one element while gaining ((++)) exactly one more, and
  • rest is always losing (tail) one element.

This allows me to reason and prove that the window is always of arbitrary size c and it will slide from left to right until all of its elements are different.

Generic, efficient, and formal.

Solving Part A

Part A wants windows of size 4. So be it

    partA :: String -> Int
    partA = findMarker 4

Solving Part B

Part B wants windows of size 14. You don't say.

    partB :: String -> Int
    partB = findMarker 14

I pity the fool that used arrays and had to come up with all the comparisons or the convoluted array index relations...

Conclusion

The more general your code, the better the chances for reusing. The less you are bound by the shackles of the length of things, the while using counters, and embrace laziness, fusion, fold and tail-recursive encoded loops, the more your code will compose better.

I must point out that

    someList ++ [e]

is a terrible code smell in the general case, because concatenation (++) is O(n) on the length of it first argument. Since we are working with short lists, and performing a single concatenation, is not much of an issue.

If this was production code, with sliding windows of considerable size, I would've used ByteString for binary data or Seq for structured data. Both data types are carefully crafter for functional programming to afford adding to either end in O(1) time.