Sliding windows are hard. They become annoying when you have to carefully synchronize them with I/O operations, and their buffering. For those, and other reasons relating to cardinality, I suppose Day 6 must've been very annoying for those not using lazy functional programming.
The Problem
We are given a lone long line of characters. If you cheat and look at the input, is 4096 bytes. I rather not cheat and write fast, elegant code, that doesn't care how long the input is.
The example given is a much shorter input
mjqjpqmgbljsphdztnvjfqwrcgsmlb
and the objective is to find the first position in the String
such that four different characters appear for the first time.
That is, imagine sliding a window of size 4 over the string
mjqjpqmgbljsphdztnvjfqwrcgsmlb
\__/ 'j' is repeated
mjqjpqmgbljsphdztnvjfqwrcgsmlb
\__/ 'j' is still repeated
mjqjpqmgbljsphdztnvjfqwrcgsmlb
\__/ 'q' is repeated
mjqjpqmgbljsphdztnvjfqwrcgsmlb
\__/ all characters are different
Once the first window with four different characters is found, we must report the rightmost position of the window, counting from 1. For the example
mjqjpqmgbljsphdztnvjfqwrcgsmlb
\__/
1234567
we should answer 7.
Main Program
The main program is pretty much the same as Day 1.
There's no parsing
As usual, we are going to read our input lazily using
readFile
. It will take care of buffering, and feeding data
to downstream functions as soon as it is needed and not
before.
Since readFile
provides a String
, which is nothing more
than [Char]
, we just need to figure out:
How to check if all
Char
within a window are different?How to move the window over the lazy long
String
while keeping track of our position, until we get to the first valid window.
Verifying the window
To check if all elements of a list are different, there's
no other way than to compare each pair. For a window of
four elements, and an imperative mentality, this seems
pretty straightforward. For the sake of argument, let's
say the window is a [Char]
(it's going to be), but I want
to treat it «as if it were an array» (because I suddenly
have a relapse of the imperative programming malaise).
Then you'd be tempted to write something
fugly :: [Char] -> Bool
fugly [a,b,c,d] = a /= b && a /= c && a /= d
&& b /= c && b /= d
&& c /= d
or something nastier up to isomorphism. It only works for four, and you explicitly enumerate all cases. «A» for effort in noticing there's a minimum number of comparisons, though.
Let's try a more general approach: process windows of any size, and any kind of content (as long as they are «comparable»). This suggests a signature such as
allDifferent :: Eq a => [a] -> Bool
Now, consider an arbitrarily sized list of things that can be compared,
[a0,a1,a2,...,aN]
Now, for each element of the list, let's check if it is
a member of the rest of the list. There's a standard
Haskell function elem
that does precisely this, i.e.
a0 `elem` [a1,a2,...,aN]
will return True
or False
if a0
is contained in [a1,a2,...,aN]
.
And elem
is «short circuited» so it returns True
as soon as
possible, and False
if it needs to check all the rest (and there's
no other way around that).
After the first element has been compared to the rest, I save
whatever result elem
produced, and now focus on the second
element and repeat, i.e.
a0 `elem` [a1,a2,...,aN] -> True or False
a1 `elem` [a2,...,aN] -> True or False
a2 `elem` [a3,...,aN] -> True or False
...
aN `elem` [] -> False
We are processing each element of the list once, from left
to right. At each step we need to compute using the current
rest of the list. And we need to save the result of the
partial computations. That's a foldl'
with a structured
accumulator: we collect the Bool
results in a list, and we
carry the current rest of the list.
So we start with
allDifferent as = foldl' go ([],tail as) as
where
go (bs,es) e = ((e `elem` es) : bs, tail es)
so we can go over every element of as
one by one. We start
with an empty [Bool]
and the initial tail
of the list.
At each step, we save the result of testing if the element
being processed (e
) is contained in the current rest of
the list (es
), and continue with the tail
of the current rest.
The first element of the resulting tuple will be a [Bool]
holding
the results of all elem
tests, and the last tail which is empty
and we don't care about. So we can improve our code
allDifferent as = fst $ foldl' go ([],tail as) as
where
go (bs,es) e = ((e `elem` es) : bs, tail es)
to get the [Bool]
.
Now, if there's at least one True
in that list, we know there's
a repeated element. So we need the boolean disjunction («or»)
for all elements in the list. If at least one is True
, the
disjunction will quickly short circuit to True
, and we need to
negate it to signal they are not allDifferent
.
If all elements are different, the list will be full of False
,
the disjunction will be False
and we need to negate it to
signal «all different». This is the most expensive case, but
it only happens precisely when detecting what we are looking
for.
We can finally write
allDifferent :: Eq a => [a] -> Bool
allDifferent as = not $ or $ fst $ foldl' go ([],tail as) as
where
go (bs,es) e = ((e `elem` es) : bs, tail es)
It doesn't matter how many elements there are. It works for any data type that supports comparisons. It uses short-circuit boolean operations and will do the minimum number of comparisons to figure out the answer. You can't get much better than this.
Moving the window
We can check windows of any size, so it's only natural to try and write code that moves a window of any size over input of any content as long as the contents can be compared.
This suggests a signature
findMarker :: Eq a => Int -> [a] -> Int
having as first argument the size of the window we want to move, and as second argument the «we don't know nor care how big» list of things that can be compared. It's supposed to return the rightmost position of the window, the moment all its elements are different.
Haskell lists positions start at zero. Starting at zero is the right way to count in Computer Science. It is, it always has been. Shut up. Learn to count.
If we start with a list such as
[e0,e1,e2,e3,e4,e5,e6,...]
we can use the standard functions take
and drop
. I'll use 4
to drive the point home
ghci> take 4 [e0,e1,e2,e3,e4,e5,e6,...]
[e0,e1,e2,e3]
ghci> drop 4 [e0,e1,e2,e3,e4,e5,e6,...]
[e4,e5,e6,...]
This way we can «prime» the initial window, and focus on the remaining input. If we are working with windows of size 4... did you notice that that would be the index to return if this window is the one satisfying the conditions?
From now own, we only need to count from 4 onward, while grabbing one character at a time to shift the window.
findMarker :: Eq a => Int -> [a] -> Int
findMarker c s = go c (take c s) (drop c s)
where
go count buffer rest = if allDifferent buffer
then count
else go (succ count)
(tail buffer ++ [ head rest ])
(tail rest)
In order for the result to be provided immediately, I've
written the logic as the tail recursive function go
.
Notice that the moment allDifferent buffer
succeeds, the
function immediately returns the current count
, ignoring
the rest of the input. But if the check fails, the
count
is incremented by one, we «shift out» the leftmost
element of the buffer
and add the next element in the
input.
Tail-recursive functions are as efficient as your carefully
written while
loop, both in terms of stack space, as well
as local variable use, as long as the compiler for the
language does the right thing. GHC does. Good C/C++ compilers
can as long as there's no aliasing in the body. Chances are your
favorite dynamic language doesn't.
The notion of «loop invariants» that many of you learned about if you took a course in algorithm design, can be applied here:
count
always points to the rightmost position of thebuffer
, andbuffer
is of constant size as it loses (tail
) exactly one element while gaining ((++)
) exactly one more, andrest
is always losing (tail
) one element.
This allows me to reason and prove that the window is always
of arbitrary size c
and it will slide from left to right until
all of its elements are different.
Generic, efficient, and formal.
Solving Part A
Part A wants windows of size 4. So be it
partA :: String -> Int
partA = findMarker 4
Solving Part B
Part B wants windows of size 14. You don't say.
partB :: String -> Int
partB = findMarker 14
I pity the fool that used arrays and had to come up with all the comparisons or the convoluted array index relations...
Conclusion
The more general your code, the better the chances for
reusing. The less you are bound by the shackles of
the length
of things, the while
using counters, and
embrace laziness, fusion, fold
and tail-recursive
encoded loops, the more your code will compose better.
I must point out that
someList ++ [e]
is a terrible code smell in the general case, because
concatenation (++)
is O(n)
on the length of it first argument.
Since we are working with short lists, and performing a single
concatenation, is not much of an issue.
If this was production code, with sliding windows of
considerable size, I would've used ByteString
for binary
data or Seq
for structured data. Both data types are carefully
crafter for functional programming to afford adding to either
end in O(1)
time.