I enjoyed this challenge a lot. Instead of the usual «top to bottom» approach, I decided to apply the compiler writer approach: pick the intermediate representation first, write both ends after.
The intermediate representation is actually provided by Haskell's standard library. It provides Functor, Foldable, and a rich API that makes the actual computation clearer to implement. It also provides a «zipper» that allowed writing a front-end that builds the whole structure in a single pass, as long as one writes a stateful parser...
The Problem
We are given the implicit description of a file system by showing the list of commands used to navigate and explore its contents. That is, the layout and contents of said filesystem have to be deduced from something like
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
which is consistent with the standard Unix way of navigating a filesystem: change directory (cd), list directory (ls), and, for each listed entry, either its size (a number) or the fact that it is a nested directory...
We are asked to compute the total size for each directory, considering their nesting, i.e. the current directory's size is equal to the sum of the sizes of the files it holds, plus the sum of the sizes of its nested directories.
Parsing the input is straightforward. Figuring out the layout and contents of the filesystem requires «interpreting» the commands. A filesystem is a tree-like structure, so it's only natural to try and model it as a recursive tree-like data structure.
Main Program
The main program is pretty much the same as Day 1.
The intermediate data structure
The key to solving this problem in a compact, flexible way is to focus on the tree-like structure. The pun will hit you later. You'll enjoy it.
A file system has a single root node (/ in the description). Every node of the file system can hold zero or more nodes: actual files, or nested directories. This means each node has:
- A descriptor -- is it a file or a directory? what's the size?
- Zero or more children -- a file has no children, a directory can have an arbitrary number of children.
- Each child is itself a node.
Haskell's standard library provides the Data.Tree module. It includes the type
data Tree a = Node a [Tree a]
This is the simplest recursive data type that fulfills our needs. It's polymorphic, so we can use our own custom descriptor:
data I = D { name  :: String
           , bytes :: Maybe Int
           }
       | F { name  :: String
           , bytes :: Maybe Int
           }
       deriving (Show, Eq)
The I is for i-node: that's how we Unix-folks name the nodes of our file systems. The constructors D and F stand for directory and file, respectively. Notice how I chose to use Maybe Int for the sizes: we know the exact size for files, as they are given in the input, but we will need to figure out the total sizes later. I'd rather protect myself from mistakes by wrapping my «knowns» in a Maybe than use magic numbers or default initializers.
The solution will need to traverse the tree and decide if a node corresponds to a directory or not. A simple predicate will do the trick:
isDir :: I -> Bool
isDir (F _ _) = False
isDir (D _ _) = True
The combination of the descriptor type I and the polymorphic nature of Tree a allowed me to «manually translate» the sample input into the following structure:
test :: Tree I
test = Node (D "/" Nothing)
         [ Node (D "a" Nothing)
             [ Node (D "e" Nothing)
                 [ Node (F "i" (Just 584)) [] ]
             , Node (F "f" (Just 29116)) []
             , Node (F "g" (Just 2557)) []
             , Node (F "h.lst" (Just 62596)) []
             ]
         , Node (F "b.txt" (Just 14848514)) []
         , Node (F "c.dat" (Just 8504156)) []
         , Node (D "d" Nothing)
             [ Node (F "j" (Just 4060174)) []
             , Node (F "d.log" (Just 8033020)) []
             , Node (F "d.ext" (Just 5626152)) []
             , Node (F "k" (Just 7214296)) []
             ]
         ]
Indentation is mine just to make the structure clear.
Computing the result
We are asked to find all directories having total size at most 100000, and sum their sizes.
I chose Data.Tree because it is Functor and Foldable. It's also Traversable, but using it for this problem would've required lots of additional work just for show. So, back to the «top-down» data-flow approach to process a Tree I.
The first step is «filling in the blanks». Every directory in our intermediate structure has Nothing as total size. We want to transform our existing Tree I into a new Tree I where those Nothing are replaced with Just n, where n is the size of the directory, including nested subdirectories' sizes. This can be written recursively as
isize :: Tree I -> Tree I
isize (Node (F n (Just b)) []) = Node (F n (Just b)) []
isize (Node (D n _) cs)        = Node (D n (Just t)) cs'
  where
    cs' = map isize cs
    t   = sum $ mapMaybe (bytes . rootLabel) cs'
Every F node is left as is -- it already has its size computed. When we find a D node, we replace it with an equivalent D node, replacing the Nothing with a Just t. First, we recursively apply isize to all the children of the original node: some will be files, so they will be unmodified; some will be directories, and recursion will go one level deeper, as needed.
Say we hit a D node where there are only files, or the recursive calls have completed their work. In that case, we need to do two things with the resulting children list cs': we have to save it as the new «updated» list of children of the new node being returned, and we need to add all their sizes to complete this directory. Sizes are stored as Just v (where v is an Int): here's where mapMaybe comes in handy, because it maps the function to actually get to the node field, and then removes the Just, resulting in the list of numbers.
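To see the fill-in step in isolation, here is a self-contained sketch that repeats the definitions above and runs isize on a tiny made-up tree (the names and sizes here are invented for illustration, not part of the puzzle input):

```haskell
import Data.Tree (Tree (..), rootLabel)
import Data.Maybe (mapMaybe)

data I = D { name :: String, bytes :: Maybe Int }
       | F { name :: String, bytes :: Maybe Int }
       deriving (Show, Eq)

-- Files keep their known size; directories get the sum of the
-- recursively computed sizes of their children.
isize :: Tree I -> Tree I
isize (Node (F n (Just b)) []) = Node (F n (Just b)) []
isize (Node (D n _) cs)        = Node (D n (Just t)) cs'
  where
    cs' = map isize cs
    t   = sum $ mapMaybe (bytes . rootLabel) cs'
isize t = t  -- any other node is left untouched

-- A toy layout: "top" holds two files and a subdirectory with one file.
toy :: Tree I
toy = Node (D "top" Nothing)
        [ Node (F "x" (Just 10)) []
        , Node (F "y" (Just 20)) []
        , Node (D "sub" Nothing) [ Node (F "z" (Just 5)) [] ]
        ]

-- bytes (rootLabel (isize toy)) evaluates to Just 35 (10 + 20 + 5).
```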
Now, if we start from a freshly parsed Tree I that has the directory sizes missing, such as test above, we can write
ghci> :type test
test :: Tree I
ghci> isize test
(... computes cumulative directory sizes ...)
ghci> :type flatten
flatten :: Tree a -> [a]
ghci> :type flatten . isize
flatten . isize :: Tree I -> [I]
ghci> :type filter isDir . flatten . isize
filter isDir . flatten . isize :: Tree I -> [I]
ghci> :type mapMaybe bytes . filter isDir . flatten . isize
mapMaybe bytes . filter isDir . flatten . isize :: Tree I -> [Int]
In case you're wondering, flatten is part of the Data.Tree API and produces the list of node descriptors. We are only interested in those corresponding to directories, hence the filter. There's the mapMaybe trick again to extract their sizes and collect them as a list. I ended up writing
dirSizes :: Tree I -> [Int]
dirSizes = mapMaybe bytes
         . filter isDir
         . flatten
         . isize
to decouple the directory size extraction from the actual solution. I completed the «back end» of the problem as
answerA :: Tree I -> Int
answerA = sum
        . filter (<= 100000)
        . dirSizes
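Working the sample by hand gives directory totals e = 584, a = 94853, d = 24933642, and / = 48381165; only e and a fall under the 100000 cutoff, so the expected answer is 95437. A tiny self-contained check of the back-end arithmetic over those hand-computed numbers:

```haskell
-- Directory totals for the sample listing, computed by hand
-- (root first, matching the preorder that flatten produces).
sampleDirSizes :: [Int]
sampleDirSizes = [48381165, 94853, 584, 24933642]

-- The same computation answerA performs after dirSizes.
sampleAnswerA :: Int
sampleAnswerA = sum (filter (<= 100000) sampleDirSizes)  -- 95437
```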
So, given a Tree I, we can compute Part A.
Parsing the input into a tree
The input for this problem is different in that it is not the actual data to work with, but a description of how to build the data. We need to parse the input lines and interpret them as if we were navigating the file system.
The problem description, the sample input, and the belief that we will receive «well formed data» allow us to make three assumptions:
- The root (/) will always be there -- we can use it to seed things to start building our Tree I.
- Every cd foo happens when we are in the node that actually holds said directory -- we can search the children of the current node and find it.
- Every cd .. will always have a parent to return to.
This means we need to write a parser that processes the input String and returns useful intermediate values, such as directory and file names, sizes, etc., and at the same time builds a Tree I. That is, a parser that also holds state that is modified along the way, and possibly returned as its final result.
The thing is, we need to build a Tree I as we get new information. This means moving inside the tree structure to extend it by adding nodes and their children, and also going back to find a particular child to extend. We can certainly do this by copying trees around, but that is very expensive in terms of time and space. There's a better solution.
For every recursive algebraic data type, a Zipper can be derived.
A «zipper» is a way to hold the whole structure, while focusing on a particular part inside of it. The «zipper» provides an API to change the focus (moving) or modifying (tacking, pruning, updating) the focus. Once navigating and manipulating is completed, the API affords a way to go back to the «root» of the structure and return it.
Zippers are extremely efficient in a pure functional setting because they maximize structure sharing, thus minimizing memory copying, while providing efficient access to arbitrary parts of the structure. That is, they are way better than using maps and patching things via lookup and update. Really better. As in, it has been profiled and shown your intuition is wrong.
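If zippers are new to you, the idea is easiest to see on lists. Here is a minimal list-zipper sketch (not part of the solution) showing how «focus plus context» gives O(1) local edits while sharing the untouched parts of the structure:

```haskell
-- A list zipper: the elements before the focus (kept reversed, so the
-- nearest one is first) and the focus together with the elements after it.
data ListZipper a = LZ [a] [a] deriving (Show, Eq)

fromList :: [a] -> ListZipper a
fromList = LZ []

-- Move the focus one step to the right, if possible.
right :: ListZipper a -> Maybe (ListZipper a)
right (LZ ls (x:rs)) = Just (LZ (x:ls) rs)
right _              = Nothing

-- Replace the focused element in O(1); the rest of the list is shared.
modify :: (a -> a) -> ListZipper a -> ListZipper a
modify f (LZ ls (x:rs)) = LZ ls (f x : rs)
modify _ z              = z

-- Rebuild the whole list from the zipper ("go back to the root").
toList :: ListZipper a -> [a]
toList (LZ ls rs) = reverse ls ++ rs
```

The tree zipper works the same way, except the «context» remembers the parent node and the siblings on each side of the focus.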
I chose Data.Tree because it comes with Data.Tree.Zipper: a fully functional zipper that allows navigating and manipulating a Data.Tree via higher-order functions. I will not explain the full API; suffice it to say that it will let us:
- Start with a minimal Tree I for the root.
- Turn it into a «full» TreePos zipper.
- Carry it as the parser's state, so we can navigate and manipulate it.
- Return it as the final parsing value.
Since the zipper API provides many names that can clash with standard library functions, it is wise to use a fully qualified import
import qualified Data.Tree.Zipper as Z
A stateful parser
Instead of a stateless Parser, we'll declare a type alias
type FSParser = Parsec String (Z.TreePos Z.Full I)
indicating that we will use plain String as our input, and carry a «full» (no «holes», so to speak) TreePos zipper for Tree I as implicit state. The Parsec API provides helper actions getState and putState to access its state.
Now it's a matter of writing the parser parts using the
proper signature.
Parsers for common things
Inspecting the input data is enough to figure out we'll need to parse:
Directory names -- one or more lowercase letters
parseDName :: FSParser String
parseDName = many1 lower
File names -- one or more lowercase letters or periods
parseFName :: FSParser String
parseFName = many1 (lower <|> char '.')
File sizes -- one or more digits, but taken as an Int
parseFSize :: FSParser Int
parseFSize = read <$> many1 digit
Focusing on parsing the commands
$ cd /
$ cd somestring
$ cd ..
$ ls
we notice that they all share a common prefix: the dollar sign followed by a single space.
Parsec parsers belong to the recursive descent family, also known as predictive parsers. For them to be deterministic (efficient) they must be able to predict which rule to use by looking at no more than one character in advance. It should be clear that when a single $ is seen, it is impossible to predict which of the four cases we're dealing with.
The solution is to make the common prefix a separate rule, and handle the remaining cases separately. Therefore we write
parsePrompt :: FSParser String
parsePrompt = string "$ "
that parses exactly the common prefix. Now we can deal with the first case with a separate parser
parseSlash :: FSParser String
parseSlash = parsePrompt >> string "cd /"
Now, there are two types of movements: either we go to a particular directory, or we go back up to the parent directory. It's important to distinguish those when we navigate the tree, so we write a discerning parser
data Move = Up | To String
deriving (Show,Eq)
parseCD :: FSParser Move
parseCD = do string "cd "
             (   (string ".." >> pure Up)
             <|> (To <$> parseDName)
             ) <* newline
Note this parser only works after processing the prompt, and will also consume the \n at the end of the line. Operator (<*) comes from Applicative; it processes the sequence of parsers, returning the value of the one to its left.
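If (<*) is unfamiliar, its behavior is easy to see with Maybe, another Applicative -- a side sketch, unrelated to the parser itself:

```haskell
-- (<*) keeps the left value; (*>) keeps the right one.
keepLeft, keepRight, bothNeeded :: Maybe Int
keepLeft   = Just 1 <* Just 2   -- Just 1
keepRight  = Just 1 *> Just 2   -- Just 2
bothNeeded = Just 1 <* Nothing  -- Nothing: both effects must succeed
```

In the parser, both sides run (so the newline is consumed), but only the Move value is returned.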
Lastly, we focus on parsing the directory listing. Besides processing the actual ls and end of line, we need to process one or more «items», one per line, following it. Each item is either a directory name, or a filename and its size. We might as well build them as Tree I, since they will become the children of whatever node we have in context!
parseLS :: FSParser [Tree I]
parseLS = string "ls" >> newline >> item `sepEndBy1` newline
  where
    item :: FSParser (Tree I)
    item = (do string "dir "
               n <- parseDName
               pure $ Node (D n Nothing) [])
       <|> (do b <- parseFSize
               space
               n <- parseFName
               pure $ Node (F n (Just b)) [])
Notice how we use Nothing for directories, and Just b for files, when saving the sizes. This matches the structure we need for the «back end». Also notice that we use [] as children: for files, because they have no children, and for directories, because we still don't know what their children will be until we navigate there.
Having parsers for each of the commands allows us to write the parser that will capture and simulate the list of commands, manipulating the zipper as we go.
A zipping parser
The input data states that we are at the top of the file system. Therefore, our top level parser should start with a zipper for the empty root tree -- this would be the «priming» of the state. The top level parser should deal with the single cd / at the top, simulate the movements, and then turn the zipper into a complete tree to return it.
An empty Tree I for the root looks like
Node (D "/" Nothing) []
and the zipper API provides Z.fromTree to create a zipper from a Tree a. Let's start with a top level function that primes and runs the top level parser that's supposed to return a Tree I
parseFS :: String -> Tree I
parseFS input = case runParser fsTOP
                     (Z.fromTree $ Node (D "/" Nothing) [])
                     "parseFS"
                     input of
                  Right t -> t
                  Left _e -> undefined
and then write the top level parser. Note that FSParser implies the state is a zipper, while (Tree I) denotes the type of the parser's return value.
fsTOP :: FSParser (Tree I)
fsTOP = do parseSlash >> newline >> many1 movement >> eof
           Z.tree . Z.root <$> getState
As promised, the top level parser deals with (and ignores) the cd / line, and then relies on the movement parser to deal with whatever simulating steps we need to perform until the end of input. The movement parser returns nothing: it simulates everything using the implicit state. Once simulating is done, we make sure the zipper focuses back on the Z.root of the tree, and return it as a full Z.tree. Refocusing on the root of the tree is key -- we don't know where in the file system the simulation will end up...
The simulation
It all comes down to simulating the movements, so we can tack the children coming out of parseLS under the proper parent. Therefore,
movement :: FSParser ()
movement = parsePrompt >> (reFocus <|> fillDir)
After parsing the prompt, each movement is either focusing somewhere else (handling a cd), or tacking children onto the place we just focused on (handling an ls).
Since the zipper is always focused on the right parent node, attaching children is quite compact to write
fillDir :: FSParser ()
fillDir = do cs <- parseLS
             t  <- getState
             putState $ Z.modifyTree (\(Node l _) -> Node l cs) t
After parsing all children, already as [Tree I], we take advantage of the zipper's Z.modifyTree. We have focus on the right node, so we keep the label as is, and simply put the list of children in place.
Now, to make sure we keep our focus in the right place
reFocus :: FSParser ()
reFocus = do w <- parseCD
             t <- getState
             putState $ goThere w t
we start by figuring out what Move to make using parseCD, and let the auxiliary goThere perform it over the zipper.
goThere :: Move -> Z.TreePos Z.Full I -> Z.TreePos Z.Full I
goThere Up     t = fromJust $ Z.parent t
goThere (To n) t = findIt (fromJust $ Z.firstChild t)
  where
    findIt :: Z.TreePos Z.Full I -> Z.TreePos Z.Full I
    findIt here = if name (Z.label here) == n
                    then here
                    else findIt (fromJust $ Z.next here)
Moving Up is easy: there will always be a Z.parent because the data input is always valid, so there's no risk in unwrapping the Maybe. The zipper API takes care of focusing on whatever Node was the parent of the one the zipper was sitting in.
Moving to a particular directory is a bit more involved. Again, we work under the assumption that the data input is valid, so this movement will always happen while we are focused on a directory that actually contains that name as a child. This means:
- It is safe to use Z.firstChild to focus on the first child of this node. It will be there, because it was loaded as part of the parseLS when we were there in the past. Thus fromJust is completely safe.
- Once focusing on the first child, we start searching for the first node whose label has the name n we are looking for, and return the zipper focusing on it. If the current child has a different label, we use tail recursion (efficient!) to move focus to the Z.next child. There will always be a valid next child, because the node we're looking for will be there: the input is always valid.
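For contrast, here is what the same child search looks like written directly over Tree I, without a zipper -- a self-contained sketch (repeating the I type for completeness) that returns the matching subtree. Unlike the zipper version, it forgets the surrounding context, so there is no cheap way back up to the parent:

```haskell
import Data.Tree (Tree (..), rootLabel)
import Data.List (find)

data I = D { name :: String, bytes :: Maybe Int }
       | F { name :: String, bytes :: Maybe Int }
       deriving (Show, Eq)

-- Search the immediate children of a node for the entry named n.
findChild :: String -> Tree I -> Maybe (Tree I)
findChild n (Node _ cs) = find ((== n) . name . rootLabel) cs

-- A small demonstration over a two-entry root.
demo :: Maybe String
demo = name . rootLabel <$> findChild "a" t
  where
    t = Node (D "/" Nothing)
          [ Node (D "a" Nothing) []
          , Node (F "b.txt" (Just 14848514)) []
          ]
-- demo evaluates to Just "a"
```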
Having implemented all possible movements as changes of focus and structure over the zipper carried as implicit state, we've completed the parser.
Solving Part A
As usual, we read our input String and need an Int for an answer. We have followed the «translator» strategy, so we combine our «front-end»
parseFS :: String -> Tree I
with our «back-end»
answerA :: Tree I -> Int
to implement
partA :: String -> Int
partA = answerA . parseFS
Solving Part B
The implementation for partA came up with the right answer, uncovering Part B, as usual. Part B also needs to work with directory sizes, so we get to reuse almost everything -- decoupling is great, and functional programming makes it oh so easy to compose.
Part B states we have 70000000 of total disk space, and we need at least 30000000 free for whatever. So we need to find the smallest directory whose cumulative size will free enough space when deleted.
When we use dirSizes we get the sizes of the directories in the order they occur in the tree, by virtue of using flatten. I know this because the documentation says so. This means the first element of the list is actually the total space occupied by the whole hierarchy. As for the rest, they can come in any order and size.
(total:rest) = dirSizes t
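The root-first claim is easy to check on a plain Tree Int: Data.Tree's flatten is a preorder traversal, so the root label leads the list. A quick self-contained check:

```haskell
import Data.Tree (Tree (..), flatten)

smallTree :: Tree Int
smallTree = Node 1 [Node 2 [], Node 3 [Node 4 []]]

-- Preorder flattening: the root label (1) comes first.
flatSmall :: [Int]
flatSmall = flatten smallTree  -- [1,2,3,4]
```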
For a given directory of size c, we know deleting it does not free enough disk space if
notEnough c = (70000000 - total) + c < 30000000
Now, if we sort the rest from smaller to larger, and ignore those whose deletion would be notEnough, the first one remaining is precisely the one we're looking for
answerB :: Tree I -> Int
answerB t = head $ dropWhile notEnough (sort rest)
  where
    notEnough c = (70000000 - total) + c < 30000000
    (total:rest) = dirSizes t
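On the sample, total = 48381165, so 70000000 - 48381165 = 21618835 bytes are free and we still need 8381165 more. Of the remaining directories, only d (24933642) is big enough, so it is the answer. A self-contained check of that arithmetic over the hand-computed sample sizes:

```haskell
import Data.List (sort)

-- Hand-computed totals for the sample: the root, then the other directories.
sampleTotal :: Int
sampleTotal = 48381165

sampleRest :: [Int]
sampleRest = [94853, 584, 24933642]

-- The same computation answerB performs, over the sample numbers.
sampleAnswerB :: Int
sampleAnswerB = head (dropWhile notEnough (sort sampleRest))  -- 24933642
  where
    notEnough c = (70000000 - sampleTotal) + c < 30000000
```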
The solution to Part B becomes
partB :: String -> Int
partB = answerB . parseFS
Conclusion
This «what data structure should be in the middle» approach is part of the training in writing compilers and translators.
Once I realized the parsing had to be stateful, the real issue was choosing a data structure suitable for easy partial manipulation (having a zipper), and straightforward final consolidation (being Functor, Foldable, Traversable, and having a rich API). Choosing Data.Tree was easier than having to come up with my own data type.
To be honest, I wanted to use Traversable to write isize. The Traversable instance for Data.Tree performs an «in-order» traversal that is not suitable for this problem. It is possible to write a newtype wrapper and provide an alternate Traversable performing a «post-order» traversal suitable for this problem. That's left as a tricky exercise for the reader.
Haskell is particularly good for writing compilers, translators, and parsers in general, partly because existing data types come with zippers, and because in the worst case, deriving a zipper for your custom data type is a manual chore akin to computing polynomial derivatives. I'm not kidding: go read the wiki, the papers, or my lecture slides (in Spanish).
Add that on top of Parsec and streaming I/O to make Haskell arguably superior to imperative languages for writing parsers, translators, interpreters, and compilers. It doesn't matter if it's a casual parser for a one-off task, or high-performance production-grade ones. If a Haskell library for parsing doesn't already exist, you can come up with one on your own.
There are high-performance production-grade pure Haskell parsing libraries for CSV, JSON, HTML, and XML, with rich, composable, streaming APIs. Use them.
Either way is definitely better than half-assing it with regexes and poorly composable code.