The practical part of CI3725 covers a minimal set of techniques and tools one needs to know about in order to write adequate parsers and interpreters. Those who are not interested in structured language processing should at least be less ignorant about what’s reasonable and what’s not. Those who are, will know where to find better tools.
Transformation mindset
The key concept we hope students understand and apply is that a language processor is a transformation (abusing Haskell notation because it’s the best notation)
String -> Behavior
where the input is some «program description» and the output is a practical result (for some value of «practical»): it can be another high/low level description, a simulation, the execution of instructions… «things happening as described».
The main technique to embrace is breaking down this transformation into a sequence of many analyze-synthesize phases. Each phase will take the result of the previous one, process it to produce an intermediate representation that is better for the next stage. That is, something along the lines of
phase0 :: String -> R1
phase1 :: R1 -> R2
...
phaseN :: Rn -> Behavior
This naturally breaks the problem down into simpler parts, a notion rooted in good engineering. It also allows each transformation to employ a particular set of tools and algorithms specifically designed for abstract language processing. Tools and algorithms that are extremely hard to come up with by accident.
For run-of-the-mill declarative processing, or the implementation of toy programming language interpreters, we teach students to split the problem into at least three parts
(1) String -> [Tokens]
    (Lexical Analysis) – group input characters into meaningful «word» elements in the context of the problem.
(2) [Tokens] -> (AST,Context)
    (Syntax-Directed Translation) – process groups of tokens to build a hierarchy of meaningful syntactical elements, creating an Abstract Syntax Tree (AST) reasonable for the problem, while capturing as much context about the description as possible (variables, subroutines).
(3) (AST,Context) -> Behavior
    (the Really Hard part) – traverse the AST while using the Context to «make things happen»: each node of the AST corresponds to some basic instruction (if, while, expression) that is executed according to what’s in context. This is the actual interpretation stage.
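Glued together, the whole interpreter is just the composition of those three phases. Here is a sketch of the shape with hypothetical names (Token, AST, Context and the phase functions are placeholders of mine, not course material):

-- A sketch of the three-phase pipeline; every type and phase here is a
-- placeholder (undefined) just to show the shape of the composition.
module Pipeline where

data Token   = Token     -- stand-in for the lexer's output
data AST     = AST       -- stand-in for the syntax tree
data Context = Context   -- stand-in for symbols, scopes, etc.

tokenize :: String -> [Token]               -- phase (1)
tokenize = undefined

buildAST :: [Token] -> (AST, Context)       -- phase (2)
buildAST = undefined

makeItHappen :: (AST, Context) -> IO ()     -- phase (3): «Behavior»
makeItHappen = undefined

interpret :: String -> IO ()
interpret = makeItHappen . buildAST . tokenize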
We teach tools that make step (1) extremely easy to write, and step (2) straightforward to write as long as students have a good understanding of what’s needed for step (3). That is, even though the problem can be broken into these three parts, you must understand the whole processing flow in order to properly connect the boundaries. We can’t teach experience, so we try to give them guidelines (both technical and in terms of estimated work), but since undergraduate students already Know Everything, there will be anguish and suffering midway through step (2).
This post is not going to be a fully functional example, i.e. it will contain just the minimal amount of code for the reader to understand what it takes. That is, this post is about the engineering of a solution and how to approach the back and forth. I’ve also chosen an esoteric language as subject matter because efficiently implementing its behavior is more interesting than the first two phases. It will also allow me to fuse phases (1) and (2), as they are quite… boring. At least for this language.
Brainfuck and its behavior
Brainfuck is a minimalistic esoteric programming language. Yes, it can be used to program anything that can be programmed, but no, you wouldn’t use it to program anything complex. It’s just an illustration of the Church-Turing Thesis, a concept more people in IT should be familiar with. Especially those who think «computers can do anything», even more so with deranged hallucinatory probably probabilistic angsty teenager vector adders. But I digress.
Phases (1) and (2) for Brainfuck are boring for two reasons:
programs are linear and each instruction is but a single character. That means we can process the input «program» from left to right, one character at a time, in deterministic fashion, to figure out each instruction. The only challenge is handling nested blocks: finding an opening bracket ([) forces us to wait until we find the closing bracket (]) because the open needs to jump to the instruction after the close, and the close needs to jump to the instruction after the open. Confused already?
The key observation here is that as we process the input string one character at a time, we will transform it into labeled instructions. The labels will be natural numbers, and the instructions will be a Haskell data type, some with values that refer to labels. Thus, we start with a Haskell module HBF.Program providing data types to represent instructions and programs, as well as a parser. Each instruction (Cmd) is represented by
data Cmd = Next
         | Prev
         | Incr
         | Decr
         | PutC
         | GetC
         | JFwd Int
         | JBak Int
         deriving (Show,Eq)
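For reference, this is the usual correspondence with Brainfuck’s eight characters, which the parser below makes concrete (my summary, not a table from the original):

-- '>' -> Next      '<' -> Prev
-- '+' -> Incr      '-' -> Decr
-- '.' -> PutC      ',' -> GetC
-- '[' -> JFwd n    ']' -> JBak n   (the labels n are computed while parsing)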
Once we successfully parse the input, the program is immutable. Given that programs have to jump back and forth, we need an efficient data type to store the program, such that access to each position takes constant time O(1): Haskell’s immutable arrays (Data.Array)
type Program = Array Int Cmd
We teach students how to use a LALR Parser Generator, and I might explore that in a future post. But for this language, it’s simpler to use a Recursive Descent Parser. Even more so thanks to Haskell’s Text.Parsec parser combinators, or any of the alternative derivatives.
The parser can be 100% deterministic, because each instruction is a single character, and they are all different. Also, characters that are not part of Brainfuck’s syntax can be ignored, making for a very easy basic instruction parser.
The parseSingle combinator is a parser having String as input, carrying a single Int as mutable state, and producing a list of instructions.
parseSingle :: Parsec String Int [Cmd]
The parsing process is the same for each one of the simpler instructions: make sure to match the particular character we’re interested in, modify the state to keep the counter going, and produce a singleton list with the matching instruction constructor as per Cmd. For instance, to parse a > and produce the corresponding Next, we write
parseSingle = (char '>' >> modifyState succ >> pure [Next])
          <|> ...
The reader should be able to come up with the five other rules for the simple instructions.
Now, the open-close bracket blocks are interesting. Syntactically speaking, for every opening one there must be a closing one, and nesting them makes this a recursive problem. In terms of the program structure, we’ll need to use the proper labels for the forward and backward jumps. Also, there might be more than one instruction (including nested blocks!) within the block: that’s why I chose to have parseSingle produce a [Cmd] (note the singleton [Next] above), making concatenation easy.
Once we match the opening bracket and update the state, we save the current state as begin. It’s the position just after the opening bracket. One or more (many1) commands are parsed using parseSingle recursively. This produces a [[Cmd]] flattened by mconcat into a [Cmd]. Since these are recursive calls to parseSingle, the state has been updated accordingly. We match the closing bracket and update the state, or immediately fail with a syntax error stating there’s a missing closing bracket. If the balancing bracket is there, we save the current state as end, to produce the final value: a jump forward to the end position, all the instructions we found in between, and a jump backward to begin.
          <|> (do char '[' >> modifyState succ
                  begin <- getState
                  cmds  <- mconcat <$> many1 parseSingle
                  (char ']' >> modifyState succ) <?> "missing ]"
                  end   <- getState
                  pure $ [JFwd end] <> cmds <> [JBak begin])
Yes, I’m making absolutely no effort to recover from syntax errors: that’s what the Brainfuck specification states, and you have no idea how hard it is to do syntax error recovery…
The rest of the instruction parsing needs to ignore any character that is not part of the Brainfuck specification. Note that ignoring characters does not update the state, because they don’t count as program instructions. Finally, the parser needs an extra syntax-error check for when there are more closing brackets than opening brackets. Think about it.
          <|> (many1 (noneOf "><+-,.[]") >> pure [])
          <?> "unexpected ]"
If the above parser succeeds in processing the whole input string, it will return a full list of instructions using the Cmd type, where the internal jumps target the proper destination address. Given that I want to build a static immutable array, it would be helpful to know how many instructions to store. But that’s exactly what the parser state carries! We can write a top-level parser that uses parseSingle to get all the instructions in a single list cs, and the value of the internal state once finished
parseProgram :: Parsec String Int (Int,[Cmd])
parseProgram = do cs <- mconcat <$> many1 parseSingle
                  l  <- getState
                  pure (l,cs)
and then write a program loading function that uses that information to build the immutable array
loadProgram :: String -> Program
loadProgram input = case runParser parseProgram 0 "loadProgram" input of
                      Left  _e     -> error $ show _e
                      Right (l,cs) -> array (0,pred l) $ zip [0..] cs
This wraps up our fused (1) and (2) stages: going from the input String into a Program. It’s nicely encapsulated into a separate Haskell module exporting both the Cmd and Program data types, as well as the loadProgram function.
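For concreteness, the module header could look something like this (a sketch: the export list is what the text describes, constructors included because the interpreter will pattern-match on them; the imports are my guess):

module HBF.Program ( Cmd(..)
                   , Program
                   , loadProgram
                   ) where

import Data.Array  (Array, array)
import Text.Parsec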
In terms of efficiency, the parser is deterministic as it does not need to backtrack, so it takes O(n) time in the number of instructions. I chose to write the parser this way because I understood the language’s syntax. It requires as much stack space as there are nested blocks, but that would be necessary regardless of the chosen parser type. It can be argued that using [Cmd] is not the most efficient data structure. But I used it as a Monoid (noticed <> and mconcat?), which means it can be replaced with a better performing data structure as long as it is a Monoid (e.g. Data.Sequence). I probably won’t. I make it correct first, fast only after measuring.
Focus Lambda-san
I was able to fuse phases (1) and (2) into one, on account of Brainfuck’s syntax being intentionally easy to parse, and the lack of symbols (variables or subroutines). That means we have «the AST» we need to simulate, and we don’t need any other context from the program source. So far, so good.
So we focus on phase (3), the actual behavior. Brainfuck specifies its mutable state as a single tape. There is a pointer to the «current cell», that can move left or right to change focus. The tape has to store at least 30000 elements, but this is Haskell so we will try and make it infinite (bound by the amount of available RAM). The left end of the tape must be easily identifiable: trying to move the pointer to the left of the left end, should result in no movement. And it has to efficiently move left or right, one position at a time.
We are obviously not going to use an Array, because both size and contents are immutable. Using [a] and functions like length, (!!), drop or take would be brutally inefficient, as they would traverse the list from the left every single time. If you have taken a proper course in data structures, or even better, one in performant declarative data structures (like the ones discussed in CI4251), you should’ve already realized «’tis but a list zipper».
We continue with a Haskell module named HBF.Tape providing an abstract data type Tape, built like this
data Tape = Tape { front :: [Int]
                 , rest  :: NE.NonEmpty Int
                 } deriving (Show)
Here’s the trick: the tape
[1,2,3,4,5,6,7]
       ^
with 4 being in focus, will be represented as
[3,2,1] [4,5,6,7]
\front/ \ rest  /
Being a properly engineered abstract data type, there are some invariants that hold true at any given time:
- ∀ tape : tape = reverse front <> rest. We only observe that representation for debugging, being implicit during execution. It’s the mathsy way to describe the trick.
- The rest of the list will never be empty. That’s why it will be represented using Haskell’s NonEmpty type.
- The focus will always be on the first element of the rest.
- Identifying we are at the left end of the tape should take O(1).
The module exports the type Tape without the internal implementation details, alongside the API needed to use it, while keeping the invariants
module HBF.Tape ( Tape
                , initialTape
                , shiftR
                , shiftL
                , peek
                , poke
                , change ) where
The only way to create a new Tape is by calling
initialTape :: Tape
initialTape = Tape { front = []
                   , rest  = NE.singleton 0
                   }
which, unsurprisingly, creates a tape with a single 0 in focus. This data type is extremely efficient because accessing the first element of a list is always O(1).
For instance, to read the element currently in focus
peek :: Tape -> Int
peek t = NE.head $ rest t
where NE.head is O(1) and NonEmpty guarantees rest t contains at least one element.
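The export list above also mentions poke and change, which are never shown. One plausible reading (a sketch; the real signatures may differ) is that poke overwrites the cell in focus and change applies a function to it:

-- Hypothetical implementations meant to live inside HBF.Tape; both are O(1)
-- because they only touch the head of the NonEmpty rest.
poke :: Int -> Tape -> Tape
poke v t = t { rest = v NE.:| NE.tail (rest t) }

change :: (Int -> Int) -> Tape -> Tape
change f t = poke (f (peek t)) t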
How about moving the focus to the left? If the front of the tape is empty (null is O(1)) there’s nowhere to move, so we return the tape as is. If there’s at least one element in the front we drop it (drop is O(1)), placing it as the new first element of rest (<| is O(1)).
shiftL :: Tape -> Tape
shiftL t = if null (front t)
             then t
             else Tape { front = drop 1 $ front t
                       , rest  = head (front t) NE.<| rest t
                       }
we are effectively transforming
[3,2,1] [4,5,6]
         ^
into
[2,1] [3,4,5,6]
       ^
The rest of the functions are left as an exercise to the reader. The most interesting one would be shiftR: after pushing the previously focused value into the front (using (:) which is O(1)), a check is needed to see if we reached the far end of rest to create a new singleton 0 in focus (another O(1) operation).
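If you want to compare notes, here is one possible shiftR along those lines (a sketch, not necessarily the author’s version):

-- Push the focused value onto front, and if we ran past the right end of
-- rest, bring a fresh 0 into focus to grow the tape on demand.
shiftR :: Tape -> Tape
shiftR t = Tape { front = NE.head (rest t) : front t          -- (:) is O(1)
                , rest  = case NE.nonEmpty (NE.tail (rest t)) of
                            Just r  -> r                      -- keep the remaining cells
                            Nothing -> NE.singleton 0         -- grow to the right
                }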
Modern GHC compilers are gradually forcing programmers to refactor their code to use NonEmpty a instead of plain [a], because the former allows for even better optimizations over list-like code, since the non-emptiness can be enforced at compilation time. It also forces the programmer to properly check their list manipulation, so they don’t have to deal with head or tail throwing exceptions on empty lists.
Plain [a] has an immense didactic value, but once you understand its shortcomings, you should use better data types, forgo partial functions, or include additional checks that would also be needed when… using better data types. So, use better data types.
Run to the hills
Phase (3) usually brings pain and misery to students that did not pay attention to the plethora of warnings brought up by the lecturer and teaching assistants. It’s the part where you combine your AST (Program in our case) and your Context (Tape in our case) and write a recursive function that goes over the AST effecting changes over the Context.
In our particular case:
- Program will always be read-only, as most executable code should be.
- Tape will naturally be read-write. There’s also the need to know what the next instruction to execute is, as Brainfuck programs go one instruction after another and follow jumps.
- Brainfuck programs read and write, so there’s I/O involved.
This leads to the obvious solution of using a carefully set up monad transformer stack
data BF = BF { tape  :: Tape
             , pc    :: !Int
             , power :: !Power
             }

type BrainFuck = ReaderT Program (StateT BF IO)
where the ReaderT ensures read-only access to our Program, StateT allows read-write access to the execution state BF, all sitting atop IO in order to interact with the world.
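The Power field never gets a definition in the post; a minimal one (my assumption) is a two-state flag:

-- Assumed definition: the machine is either running or halted.
data Power = On | Off
  deriving (Show, Eq)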
Every Cmd in our Program has a particular effect on the Tape (reading, writing, moving) and will always change the program counter pc (next step or arbitrary jump). We need to write one function for each Cmd value, implementing that particular set of effects. It’s worth separating the effects on the program counter from the effects over the tape.
Consider the abstract effect «jump to the i-th instruction». According to the Brainfuck implementation, jumping to a nonexistent instruction stops the program immediately. That means we need to get the current state, check if the target instruction is within bounds in terms of Program, and act accordingly
goto :: Int -> BrainFuck ()
goto i = do s0 <- get
            r0 <- asks bounds
            if inRange r0 i
              then put s0 { pc = i }
              else put s0 { power = Off }
Function get from StateT reads the current state as a whole. Function asks from ReaderT is used to apply bounds to the Program array, getting the array’s valid index range used by inRange. The check ends up either putting a new state that jumps to the i-th instruction, or shutting the power off.
This leads to a trivial implementation for step, i.e. continue to the next instruction when the current one is not a goto: get the current state, read the pc, add one, goto there. It’s one line using desugared Monad notation.
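That one-liner is not shown in the post; assuming the goto above, it could be:

-- Read the whole state, take the successor of the program counter, and
-- jump there; goto takes care of falling off the end of the program.
step :: BrainFuck ()
step = get >>= goto . succ . pc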
Once you have goto and step to manipulate the program counter, we can implement all the actual instructions. For instance, the implementation for Next (which shifts the tape right with shiftR) would be
next :: BrainFuck ()
next = do s0 <- get
          put $ s0 { tape = shiftR $ tape s0 }
          step
Start by getting the current state, apply shiftR to the current tape in s0, put the new tape back as the updated state, and step to the next instruction.
Implementing the conditional jump is straightforward too.
Consider the implementation for JFwd i
jfwd :: Int -> BrainFuck ()
jfwd to = do v <- gets (peek . tape)
             if v == 0
               then goto to
               else step
Function gets from StateT lets us apply a function to the state, instead of manipulating the state as a whole. In this case, peeking at the current value the tape has in focus. If said value is 0, then goto the destination address, otherwise step to the next instruction.
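The backward jump for JBak is symmetric and not shown in the post; a sketch would be:

-- ']' semantics: jump back only while the focused cell is nonzero,
-- otherwise fall through to the next instruction.
jbak :: Int -> BrainFuck ()
jbak to = do v <- gets (peek . tape)
             if v /= 0
               then goto to
               else step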
Finally, functions that need to interact with the real world will take advantage of the underlying IO monad. Brainfuck’s specification states that PutC reads the value currently in focus, and outputs its ASCII equivalent, therefore
putc :: BrainFuck ()
putc = do c <- gets (chr . peek . tape)
          liftIO $ putChar c
          step
Exactly the same flow as for a jump in terms of peeking, but adding chr to get the corresponding ASCII character. Then we produce output by «lifting» the IO monad into the stack, and step to the next instruction.
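Its input counterpart for GetC is also left out; a sketch under the same assumptions (including the hypothetical poke from the Tape section) could be:

-- Read one character from stdin, store its code point (Data.Char.ord) in the
-- cell currently in focus, then move on. Uses modify from Control.Monad.State
-- and the hypothetical poke sketched earlier; the real HBF.Tape API may differ.
getc :: BrainFuck ()
getc = do c <- liftIO getChar
          modify $ \s0 -> s0 { tape = poke (ord c) (tape s0) }
          step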
With all the functions needed to implement each instruction’s behavior in place, an abstract «Brainfuck microcode» can be implemented as
go :: BrainFuck ()
go = do pc0 <- gets pc
        cmd <- asks (! pc0)
        case cmd of
          Next   -> next
          Prev   -> prev
          Incr   -> incr
          Decr   -> decr
          PutC   -> putc
          GetC   -> getc
          JFwd i -> jfwd i
          JBak i -> jbak i
        pwr <- gets power
        when (pwr == On) go
where we read the current program counter pc0, fetch (!) the instruction from the array, perform the particular instruction using its associated monadic implementation, check if the machine is still on, and if so tail-recur. Tail recursion will be compiled to a closed loop, so this Brainfuck interpreter will be very efficient in terms of memory. If the Brainfuck execution tries to jump out of its program’s bounds, it will terminate. Or it can hang forever, as any Turing-complete computation device must be able to…
Some assembly required…
Brainfuck source code will come from some text file, named via a command-line argument. We then use
readFile :: FilePath -> IO String
to read it. We would need a function
brainfuck :: String -> IO ()
brainfuck input = {- left as an exercise -}
that perfectly aligns with our notion of an interpreter. This function can be built with the parts we’ve worked out so far. I hope you can figure out how to fill in the blanks around:
- Parsing the program from the input

  program :: Program
  program = {- left as an exercise -}

- Creating the initial state for the Brainfuck machine

  state0 :: BF
  state0 = BF {- left as an exercise -}

- Unraveling the monad stack over go to run it. I’ll give it in point-free style Because I Can®.

  emulate :: BrainFuck () -> IO ((),BF)
  emulate = flip runStateT state0 . flip runReaderT program

  Try and rewrite it explicitly without the flips. Figure out where the ((),BF) comes from and either take advantage of it to have a Debug mode for your interpreter, or change what you need to make it a mere ().
Mine can run things in normal mode by default
$ stack exec hbf-exe -- ~/hello.b
Hello World!
but also in Debug mode with pretty-printing
$ stack exec hbf-exe -- --debug ~/42.b
++++ +++
+[>++++ ++[>+<-][
<]< -]> >++ +++
+.- --- --- ---
--.+++++++ +++
+++ .++
+++ +.-
--- -----.--.
Final State
( ()
, BF
    { tape =
        Tape
          { front = [ 45 , 43 , 32 , 10 , 0 , 0 ]
          , rest = 46 :| [ 60 , 62 , 91 , 93 ]
          }
    , pc = 459
    , power = Off
    }
)
Here’s your hello world and the answer.
All in all, it took me about an hour to write the basic Brainfuck interpreter in about 160 lines of code, imports, type-signatures and all. The command line processing, debugging mode, and pretty-printing added about 20 more lines. A breakpoint instruction required half a dozen: # in source becomes Bkpt :: Cmd; when reaching it, the interpreter stops, pretty-prints the state, and waits for you to press any key to continue (if you have it).
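The post only describes that breakpoint behavior; a sketch of what the handler could look like (plain print standing in for the pretty-printer, «any key» approximated by waiting for Enter, and assuming BF derives Show as the Debug output above suggests):

-- Hypothetical handler for a Bkpt instruction: dump the state, wait for the
-- user, then continue with the next instruction.
bkpt :: BrainFuck ()
bkpt = do s0 <- get
          liftIO $ do print s0
                      putStrLn "-- breakpoint: press Enter to continue --"
                      _ <- getLine
                      pure ()
          step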
Sometimes you don’t need a fully-fledged lexer, a generated shift-reduce parser, a syntax tree, and a symbol table. But the analyze-synthesize mentality should always be there. So should choosing languages with superior capabilities for implementing other languages.