I enjoyed this challenge a lot. Instead of the usual «top to bottom» approach, I decided to apply the compiler writer approach: pick the intermediate representation first, write both ends after.
The intermediate representation is actually provided by Haskell's standard library. It provides Functor, Foldable, and a rich API that makes the actual computation clearer to implement. It also provides a «zipper» that allowed writing a front-end that builds the whole structure in a single pass, as long as one writes a stateful parser...
The Problem
We are given the implicit description of a file system by showing the list of commands used to navigate and explore its contents. That is, the layout and contents of said filesystem have to be deduced from something like
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
which is consistent with the standard Unix way of navigating a filesystem: change directory (cd), list directory (ls), and, for each listed entry, either its size (a number) or the fact that it is a nested directory...
We are asked to compute the total size for each directory, considering their nesting, i.e. the current directory's size is equal to the sum of the sizes of the files it holds, plus the sum of the sizes of its nested directories.
Parsing the input is straightforward. Figuring out the layout and contents of the filesystem requires «interpreting» the commands. A filesystem is a tree-like structure, so it's only natural to try and model it as a recursive tree-like data structure.
Main Program
The main program is pretty much the same as Day 1.
The intermediate data structure
The key to solving this problem in a compact, flexible way is to focus on the tree-like structure. The pun will hit you later. You'll enjoy it.
A file system has a single root node (/ in the description). Every node of the file system can hold zero or more nodes: actual files, or nested directories. This means each node has:
- A descriptor -- is it a file or a directory? what's the size?
- Zero or more children -- a file has no children, a directory can have an arbitrary number of children.
- Each child is itself a node.
Haskell's standard library provides the Data.Tree module. It includes the type
data Tree a = Node a [Tree a]
This is the simplest recursive data type that fulfills our needs. It's polymorphic, so we can use our own custom descriptor:
data I = D { name  :: String
           , bytes :: Maybe Int
           }
       | F { name  :: String
           , bytes :: Maybe Int
           }
       deriving (Show, Eq)
The I is for i-node: that's how we Unix-folks name the nodes of our file systems. The constructors D and F stand for directory and file, respectively. Notice how I chose to use Maybe Int for the sizes: we know the exact size for files, as they are given in the input, but we will need to figure out the total sizes later. I'd rather protect myself from mistakes by wrapping my «knowns» in a Maybe than use magic numbers or default initializers.
The solution will need to traverse the tree and decide if a node corresponds to a directory or not. A simple predicate will do the trick:
isDir :: I -> Bool
isDir (F _ _) = False
isDir (D _ _) = True
The combination of the descriptor type I and the polymorphic nature of Tree a allowed me to «manually translate» the sample input into the following structure:
test :: Tree I
test = Node (D "/" Nothing)
         [ Node (D "a" Nothing)
             [ Node (D "e" Nothing)
                 [ Node (F "i" (Just 584)) [] ]
             , Node (F "f" (Just 29116)) []
             , Node (F "g" (Just 2557)) []
             , Node (F "h.lst" (Just 62596)) []
             ]
         , Node (F "b.txt" (Just 14848514)) []
         , Node (F "c.dat" (Just 8504156)) []
         , Node (D "d" Nothing)
             [ Node (F "j" (Just 4060174)) []
             , Node (F "d.log" (Just 8033020)) []
             , Node (F "d.ext" (Just 5626152)) []
             , Node (F "k" (Just 7214296)) []
             ]
         ]
Indentation is mine just to make the structure clear.
Computing the result
We are asked to find all directories having total size at most 100000, and sum their sizes.
I chose Data.Tree because it is Functor and Foldable. It's also Traversable, but using it for this problem would've required lots of additional work just for show. So, back to the «top-down» data-flow approach to process a Tree I.
The first step is «filling in the blanks». Every directory in our intermediate structure has Nothing as total size. We want to transform our existing Tree I into a new Tree I where those Nothing are replaced with Just n, where n is the size of the directory, including nested subdirectories' sizes. This can be written recursively as
isize :: Tree I -> Tree I
isize (Node (F n (Just b)) []) = Node (F n (Just b)) []
isize (Node (D n _) cs)        = Node (D n (Just t)) cs'
  where
    cs' = map isize cs
    t   = sum $ mapMaybe (bytes . rootLabel) cs'
Every F node is left as is -- it already has its size computed. When we find a D node, we replace it with an equivalent D node, replacing the Nothing with a Just t. First, we recursively apply isize to all the children of the original node: some will be files, so they will be unmodified; some will be directories, and recursion will go one level deeper, as needed.
Say we hit a D node where there are only files, or the recursive calls have completed their work. In that case, we need to do two things with the resulting children list cs': we have to save it as the new «updated» list of children of the new node being returned, and we need to add all their sizes to complete this directory. Sizes are stored as Just v (where v is an Int): here's where mapMaybe comes in handy, because it maps the function to actually get to the node field, and then removes the Just, resulting in the list of numbers.
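To see the fill-in step in isolation, here is a self-contained sketch that repeats the definitions above and runs isize on a tiny made-up tree (the names and sizes here are invented for illustration, not part of the puzzle input):

```haskell
import Data.Tree (Tree (..), rootLabel)
import Data.Maybe (mapMaybe)

data I = D { name :: String, bytes :: Maybe Int }
       | F { name :: String, bytes :: Maybe Int }
       deriving (Show, Eq)

-- Files keep their known size; directories get the sum of the
-- recursively computed sizes of their children.
isize :: Tree I -> Tree I
isize (Node (F n (Just b)) []) = Node (F n (Just b)) []
isize (Node (D n _) cs)        = Node (D n (Just t)) cs'
  where
    cs' = map isize cs
    t   = sum $ mapMaybe (bytes . rootLabel) cs'
isize t = t  -- any other node is left untouched

-- A toy layout: "top" holds two files and a subdirectory with one file.
toy :: Tree I
toy = Node (D "top" Nothing)
        [ Node (F "x" (Just 10)) []
        , Node (F "y" (Just 20)) []
        , Node (D "sub" Nothing) [ Node (F "z" (Just 5)) [] ]
        ]

-- bytes (rootLabel (isize toy)) evaluates to Just 35 (10 + 20 + 5).
```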
Now, if we start from a freshly parsed Tree I that has the directory sizes missing, such as test above, we can write
ghci> :type test
test :: Tree I
ghci> isize test
(... computes cumulative directory sizes ...)
ghci> :type flatten
flatten :: Tree a -> [a]
ghci> :type flatten . isize
flatten . isize :: Tree I -> [I]
ghci> :type filter isDir . flatten . isize
filter isDir . flatten . isize :: Tree I -> [I]
ghci> :type mapMaybe bytes . filter isDir . flatten . isize
mapMaybe bytes . filter isDir . flatten . isize :: Tree I -> [Int]
In case you're wondering, flatten is part of the Data.Tree API and produces the list of node descriptors. We are only interested in those corresponding to directories, hence the filter. There's the mapMaybe trick again to extract their sizes and collect them as a list. I ended up writing
dirSizes :: Tree I -> [Int]
dirSizes = mapMaybe bytes
         . filter isDir
         . flatten
         . isize
to decouple the directory size extraction from the actual solution. I completed the «back end» of the problem as
answerA :: Tree I -> Int
answerA = sum
        . filter (<= 100000)
        . dirSizes
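Working the sample by hand gives directory totals e = 584, a = 94853, d = 24933642, and / = 48381165; only e and a fall under the 100000 cutoff, so the expected answer is 95437. A tiny self-contained check of the back-end arithmetic over those hand-computed numbers:

```haskell
-- Directory totals for the sample listing, computed by hand
-- (root first, matching the preorder that flatten produces).
sampleDirSizes :: [Int]
sampleDirSizes = [48381165, 94853, 584, 24933642]

-- The same computation answerA performs after dirSizes.
sampleAnswerA :: Int
sampleAnswerA = sum (filter (<= 100000) sampleDirSizes)  -- 95437
```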
So, given a Tree I, we can compute Part A.
Parsing the input into a tree
The input for this problem is different in that it is not the actual data to work with, but a description of how to build the data. We need to parse the input lines and interpret them as if we were navigating the file system.
The problem description, the sample input, and the belief that we will receive «well formed data» allow us to make three assumptions:
- The root (/) will always be there -- we can use it to seed things to start building our Tree I.
- Every cd foo happens when we are in the node that actually holds said directory -- we can search the children of the current node and find it.
- Every cd .. will always have a parent to return to.
This means we need to write a parser that processes the input String and returns useful intermediate values, such as directory and file names, sizes, etc., and at the same time builds a Tree I. That is, a parser that also holds state that is modified along the way, and possibly returned as its final result.
The thing is, we need to build a Tree I as we get new information. This means moving inside the tree structure to extend it by adding nodes and their children, and also going back to find a particular child to extend. We can certainly do this by copying trees around, but that is very expensive in terms of time and space. There's a better solution.
For every recursive algebraic data type, a Zipper can be derived.
A «zipper» is a way to hold the whole structure, while focusing on a particular part inside of it. The «zipper» provides an API to change the focus (moving) or modifying (tacking, pruning, updating) the focus. Once navigating and manipulating is completed, the API affords a way to go back to the «root» of the structure and return it.
Zippers are extremely efficient in a pure functional setting because they maximize structure sharing, thus minimizing memory copying, while providing efficient access to arbitrary parts of the structure. That is, they are way better than using maps and patching things via lookup and update. Really better. As in, it has been profiled and shown your intuition is wrong.
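If zippers are new to you, the idea is easiest to see on lists. Here is a minimal list-zipper sketch (not part of the solution) showing how «focus plus context» gives O(1) local edits while sharing the untouched parts of the structure:

```haskell
-- A list zipper: the elements before the focus (kept reversed, so the
-- nearest one is first) and the focus together with the elements after it.
data ListZipper a = LZ [a] [a] deriving (Show, Eq)

fromList :: [a] -> ListZipper a
fromList = LZ []

-- Move the focus one step to the right, if possible.
right :: ListZipper a -> Maybe (ListZipper a)
right (LZ ls (x:rs)) = Just (LZ (x:ls) rs)
right _              = Nothing

-- Replace the focused element in O(1); the rest of the list is shared.
modify :: (a -> a) -> ListZipper a -> ListZipper a
modify f (LZ ls (x:rs)) = LZ ls (f x : rs)
modify _ z              = z

-- Rebuild the whole list from the zipper ("go back to the root").
toList :: ListZipper a -> [a]
toList (LZ ls rs) = reverse ls ++ rs
```

The tree zipper works the same way, except the «context» remembers the parent node and the siblings on each side of the focus.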
I chose Data.Tree because it comes with Data.Tree.Zipper: a fully functional zipper that allows navigating and manipulating a Data.Tree via higher-order functions. I will not explain the full API; suffice it to say that it will let us:
- Start with a minimal Tree I for the root.
- Turn it into a «full» TreePos zipper.
- Carry it as the parser's state, so we can navigate and manipulate it.
- Return it as the final parsing value.
Since the zipper API provides many names that can clash with standard library functions, it is wise to use a fully qualified import
import qualified Data.Tree.Zipper as Z
A stateful parser
Instead of a stateless Parser, we'll declare a type alias
type FSParser = Parsec String (Z.TreePos Z.Full I)
indicating that we will use plain String as our input, and carry a «full» (no «holes», so to speak) TreePos zipper for Tree I as implicit state. The Parsec API provides helper actions getState and putState to access its state.
Now it's a matter of writing the parser parts using the
proper signature.
Parsers for common things
Inspecting the input data is enough to figure out we'll need to parse:
Directory names -- one or more lowercase letters
parseDName :: FSParser String
parseDName = many1 lower
File names -- one or more lowercase letters or periods
parseFName :: FSParser String
parseFName = many1 (lower <|> char '.')
File sizes -- one or more digits, but taken as an Int
parseFSize :: FSParser Int
parseFSize = read <$> many1 digit
Focusing on parsing the commands
$ cd /
$ cd somestring
$ cd ..
$ ls
we notice that they all share a common prefix: the dollar sign followed by a single space.
Parsec parsers belong to the recursive descent family, also known as predictive parsers. For them to be deterministic (efficient) they must be able to predict which rule to use by looking at no more than one character in advance. It should be clear that when a single $ is seen, it is impossible to predict which of the four cases we're dealing with.
The solution is to make the common prefix a separate rule, and handle the remaining cases separately. Therefore we write
parsePrompt :: FSParser String
parsePrompt = string "$ "
that parses exactly the common prefix. Now we can deal with the first case with a separate parser
parseSlash :: FSParser String
parseSlash = parsePrompt >> string "cd /"
Now, there are two types of movements: either we go to a particular directory, or we go back up to the parent directory. It's important to distinguish those when we navigate the tree, so we write a discerning parser
data Move = Up | To String
deriving (Show,Eq)
parseCD :: FSParser Move
parseCD = do string "cd "
             (   (string ".." >> pure Up)
             <|> (To <$> parseDName)
             ) <* newline
Note this parser only works after processing the prompt, and will also consume the \n at the end of the line. Operator (<*) comes from Applicative; it processes the sequence of parsers, returning the value of the one to its left.
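If (<*) is unfamiliar, its behavior is easy to see with Maybe, another Applicative -- a side sketch, unrelated to the parser itself:

```haskell
-- (<*) keeps the left value; (*>) keeps the right one.
keepLeft, keepRight, bothNeeded :: Maybe Int
keepLeft   = Just 1 <* Just 2   -- Just 1
keepRight  = Just 1 *> Just 2   -- Just 2
bothNeeded = Just 1 <* Nothing  -- Nothing: both effects must succeed
```

In the parser, both sides run (so the newline is consumed), but only the Move value is returned.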
Lastly, we focus on parsing the directory listing. Besides processing the actual ls and end of line, we need to process one or more «items», one per line, following it. Each item is either a directory name, or a filename and its size. We might as well build them as Tree I, since they will become the children of whatever node we have in context!
parseLS :: FSParser [Tree I]
parseLS = string "ls" >> newline >> item `sepEndBy1` newline
  where
    item :: FSParser (Tree I)
    item = (do string "dir "
               n <- parseDName
               pure $ Node (D n Nothing) [])
       <|> (do b <- parseFSize
               space
               n <- parseFName
               pure $ Node (F n (Just b)) [])
Notice how we use Nothing for directories, and Just b for files, when saving the sizes. This matches the structure we need for the «back end». Also notice that we use [] as children: for files, because they have no children, and for directories, because we still don't know what their children will be until we navigate there.
Having parsers for each of the commands allows us to write the parser that will capture and simulate the list of commands, manipulating the zipper as we go.
A zipping parser
The input data states that we are at the top of the file system. Therefore, our top level parser should start with a zipper for the empty root tree -- this would be the «priming» of the state. The top level parser should deal with the single cd / at the top, simulate the movements, and then turn the zipper into a complete tree to return it.
An empty Tree I for the root looks like
Node (D "/" Nothing) []
and the zipper API provides Z.fromTree to create a zipper from a Tree a. Let's start with a top level function that primes and runs the top level parser that's supposed to return a Tree I
parseFS :: String -> Tree I
parseFS input = case runParser fsTOP
                     (Z.fromTree $ Node (D "/" Nothing) [])
                     "parseFS"
                     input of
                  Right t -> t
                  Left _e -> undefined
and then write the top level parser. Note that FSParser implies the state is a zipper, while (Tree I) denotes the type of the parser's return value.
fsTOP :: FSParser (Tree I)
fsTOP = do parseSlash >> newline >> many1 movement >> eof
           Z.tree . Z.root <$> getState
As promised, the top level parser deals with (and ignores) the cd / line, and then relies on the movement parser to deal with whatever simulating steps we need to perform until the end of input. The movement parser returns nothing: it simulates everything using the implicit state. Once simulating is done, we make sure the zipper focuses back on the Z.root of the tree, and return it as a full Z.tree. Refocusing on the root of the tree is key -- we don't know where in the file system the simulation will end up...
The simulation
It all comes down to simulating the movements, so we can tack the children coming out of parseLS under the proper parent. Therefore,
movement :: FSParser ()
movement = parsePrompt >> (reFocus <|> fillDir)
After parsing the prompt, each movement is either focusing somewhere else (handling a cd), or tacking children onto the place we just focused on (handling an ls).
Since the zipper is always focused on the right parent node, attaching children is quite compact to write
fillDir :: FSParser ()
fillDir = do cs <- parseLS
             t  <- getState
             putState $ Z.modifyTree (\(Node l _) -> Node l cs) t
After parsing all children, already as [Tree I], we take advantage of the zipper's Z.modifyTree. We have focus on the right node, so we keep the label as is, and simply put the list of children in place.
Now, to make sure we keep our focus in the right place
reFocus :: FSParser ()
reFocus = do w <- parseCD
             t <- getState
             putState $ goThere w t
we start by figuring out what Move to make using parseCD, and let the auxiliary goThere perform it over the zipper.
goThere :: Move -> Z.TreePos Z.Full I -> Z.TreePos Z.Full I
goThere Up     t = fromJust $ Z.parent t
goThere (To n) t = findIt (fromJust $ Z.firstChild t)
  where
    findIt :: Z.TreePos Z.Full I -> Z.TreePos Z.Full I
    findIt here = if name (Z.label here) == n
                    then here
                    else findIt (fromJust $ Z.next here)
Moving Up is easy: there will always be a Z.parent because the data input is always valid, so there's no risk in unwrapping the Maybe. The zipper API takes care of focusing on whatever Node was the parent of the one the zipper was sitting in.
Moving to a particular directory is a bit more involved. Again, we work under the assumption that the data input is valid, so this movement will always happen while we are focused on a directory that actually contains that name as a child. This means:
- It is safe to use Z.firstChild to focus on the first child of this node. It will be there, because it was loaded as part of the parseLS when we were there in the past. Thus fromJust is completely safe.
- Once focusing on the first child, we start searching for the first node whose label has the name n we are looking for, and return the zipper focusing on it. If the current child has a different label, we use tail recursion (efficient!) to move focus to the Z.next child. There will always be a valid next child, because the node we're looking for will be there: the input is always valid.
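For contrast, here is what the same child search looks like written directly over Tree I, without a zipper -- a self-contained sketch (repeating the I type for completeness) that returns the matching subtree. Unlike the zipper version, it forgets the surrounding context, so there is no cheap way back up to the parent:

```haskell
import Data.Tree (Tree (..), rootLabel)
import Data.List (find)

data I = D { name :: String, bytes :: Maybe Int }
       | F { name :: String, bytes :: Maybe Int }
       deriving (Show, Eq)

-- Search the immediate children of a node for the entry named n.
findChild :: String -> Tree I -> Maybe (Tree I)
findChild n (Node _ cs) = find ((== n) . name . rootLabel) cs

-- A small demonstration over a two-entry root.
demo :: Maybe String
demo = name . rootLabel <$> findChild "a" t
  where
    t = Node (D "/" Nothing)
          [ Node (D "a" Nothing) []
          , Node (F "b.txt" (Just 14848514)) []
          ]
-- demo evaluates to Just "a"
```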
Having implemented all possible movements as changes of focus and structure over the zipper carried as implicit state, we've completed the parser.
Solving Part A
As usual, we read our input String and need an Int for an answer. We have followed the «translator» strategy, so we combine our «front-end»
parseFS :: String -> Tree I
with our «back-end»
answerA :: Tree I -> Int
to implement
partA :: String -> Int
partA = answerA . parseFS
Solving Part B
The implementation for partA came up with the right answer, uncovering Part B, as usual. Part B also needs to work with directory sizes, so we get to reuse almost everything -- decoupling is great, and functional programming makes it oh so easy to compose.
Part B states we have 70000000 of total disk space, and we need at least 30000000 free for whatever. So we need to find the smallest directory whose cumulative size will free enough space when deleted.
When we use dirSizes we get the sizes of the directories in the order they occur in the tree, by virtue of using flatten. I know this because the documentation says so. This means the first element of the list is actually the total space occupied by the whole hierarchy. As for the rest, they can come in any order and size.
(total:rest) = dirSizes t
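The root-first claim is easy to check on a plain Tree Int: Data.Tree's flatten is a preorder traversal, so the root label leads the list. A quick self-contained check:

```haskell
import Data.Tree (Tree (..), flatten)

smallTree :: Tree Int
smallTree = Node 1 [Node 2 [], Node 3 [Node 4 []]]

-- Preorder flattening: the root label (1) comes first.
flatSmall :: [Int]
flatSmall = flatten smallTree  -- [1,2,3,4]
```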
For a given directory of size c, we know deleting it does not free enough disk space if
notEnough c = (70000000 - total) + c < 30000000
Now, if we sort the rest from smaller to larger, and ignore those whose deletion would be notEnough, the first one remaining is precisely the one we're looking for
answerB :: Tree I -> Int
answerB t = head $ dropWhile notEnough (sort rest)
  where
    notEnough c = (70000000 - total) + c < 30000000
    (total:rest) = dirSizes t
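On the sample, total = 48381165, so 70000000 - 48381165 = 21618835 bytes are free and we still need 8381165 more. Of the remaining directories, only d (24933642) is big enough, so it is the answer. A self-contained check of that arithmetic over the hand-computed sample sizes:

```haskell
import Data.List (sort)

-- Hand-computed totals for the sample: the root, then the other directories.
sampleTotal :: Int
sampleTotal = 48381165

sampleRest :: [Int]
sampleRest = [94853, 584, 24933642]

-- The same computation answerB performs, over the sample numbers.
sampleAnswerB :: Int
sampleAnswerB = head (dropWhile notEnough (sort sampleRest))  -- 24933642
  where
    notEnough c = (70000000 - sampleTotal) + c < 30000000
```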
The solution to Part B becomes
partB :: String -> Int
partB = answerB . parseFS
Conclusion
This «what data structure should be in the middle» approach is part of the training in writing compilers and translators.
Once I realized the parsing had to be stateful, the real issue was choosing a data structure suitable for easy partial manipulation (having a zipper), and straightforward final consolidation (being Functor, Foldable, Traversable, and having a rich API). Choosing Data.Tree was easier than having to come up with my own data type.
To be honest, I wanted to use Traversable to write isize. The Traversable instance for Data.Tree performs an «in-order» traversal that is not suitable for this problem. It is possible to write a newtype wrapper and provide an alternate Traversable performing a «post-order» traversal suitable for this problem. That's left as a tricky exercise for the reader.
Haskell is particularly good for writing compilers, translators, and parsers in general, partly because existing data types come with zippers, and because in the worst case, deriving a zipper for your custom data type is a manual chore akin to computing polynomial derivatives. I'm not kidding: go read the wiki, the papers, or my lecture slides (in Spanish).
Add that on top of Parsec and streaming I/O to make Haskell arguably superior to imperative languages for writing parsers, translators, interpreters, and compilers. It doesn't matter if it's a casual parser for a one-off task, or high-performance production-grade ones. If a Haskell library for parsing doesn't already exist, you can come up with one on your own.
There are high-performance production-grade pure Haskell parsing libraries for CSV, JSON, HTML, and XML, with rich, composable, streaming APIs. Use them.
Either way is definitely better than half-assing it with regexes and poorly composable code.