An Overview of Parsing Algorithms
August 26, 2020
These notes are supplemented by Dmitry Soshnikov’s videos on parsing.
Introduction
Compiler parsers serve as the second stage of the “front end”, after scanning. A scanner takes raw source text and annotates it into tokens: small structs with a classification. For example, if (true) {print("hello")} might yield
[{type: IF}, {type: LEFT_PAREN}, {type: TRUE}, {type: RIGHT_PAREN}, {type: LEFT_BRACE}, {type: IDENTIFIER, value: "print"}, {type: LEFT_PAREN}, {type: STRING, value: "hello"}, {type: RIGHT_PAREN}, {type: RIGHT_BRACE}, {type: EOF}]
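A minimal scanner sketch in Python might look like the following; the token names mirror the example above, while the regexes and the scan helper are my own illustrative choices:

import re

# Illustrative token spec: (type, regex) pairs. Order matters, so
# keywords are tried before the general identifier rule.
TOKEN_SPEC = [
    ("IF",          r"if\b"),
    ("TRUE",        r"true\b"),
    ("LEFT_PAREN",  r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("LEFT_BRACE",  r"\{"),
    ("RIGHT_BRACE", r"\}"),
    ("STRING",      r'"[^"]*"'),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("WHITESPACE",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def scan(source):
    """Yield {type, value} dicts for each token, ending with EOF."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "WHITESPACE":
            yield {"type": match.lastgroup, "value": match.group()}
    yield {"type": "EOF"}

print(list(scan('if (true) {print("hello")}')))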
A parser receives these tokens as a stream and tries to determine syntactic structure based on some grammar: the rules of a valid language. Parsers typically use a context-free grammar (CFG), which Cooper/Torczon define as “a set of rules that describe how to form sentences.”
Formally, a CFG is defined as the tuple (T, NT, S, P), where T is the set of terminal symbols (symbols that cannot be rewritten any further), NT is the set of non-terminal symbols (variables, a set of abstractions), S is the goal symbol or start symbol, and P is the set of productions, the rules for rewrites / substitutions. That is, any non-terminal should be rewritable as a string of other symbols in the grammar.
We traditionally use Backus-Naur Form to describe our grammar. For example, an incomplete grammar for math might look like
1) Expr -> Expr + Expr
2) | Expr * Expr
3) | number
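One convenient way to hold such a grammar in code is a map from each non-terminal to its list of alternatives. A small Python sketch (the encoding is my own, not from the book):

# Productions for the toy math grammar above. Each alternative is a
# list of symbols; names that appear as keys are non-terminals.
GRAMMAR = {
    "Expr": [
        ["Expr", "+", "Expr"],   # rule 1
        ["Expr", "*", "Expr"],   # rule 2
        ["number"],              # rule 3
    ],
}
START_SYMBOL = "Expr"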
When we create a derivation, a sequence of rewriting steps following the rules of the grammar, we generate sentential forms: the strings of symbols that occur at each step of the derivation. As we do so, we build a parse tree. For example, if we used our above rules to parse 2 + 5 * 10, we might get:
Rule | Sentential Form |
---|---|
(start) | Expr |
1 | Expr + Expr |
3 | 2 + Expr |
2 | 2 + Expr * Expr |
3 | 2 + 5 * Expr |
3 | 2 + 5 * 10 |
In doing this, we followed a leftmost derivation: we rewrote the leftmost non-terminal at each step. There is also rightmost derivation, where we rewrite the rightmost non-terminal instead.
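We can replay that leftmost derivation mechanically. In this sketch the helper and the hard-coded replacement sequence are illustrative, and the concrete numbers stand in for rule 3's number terminal:

NONTERMINALS = {"Expr"}

def rewrite_leftmost(form, replacement):
    """Rewrite the leftmost non-terminal in `form` with `replacement`."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            return form[:i] + replacement + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

form = ["Expr"]
# Replacements mirroring the table's rules 1, 3, 2, 3, 3.
for replacement in (["Expr", "+", "Expr"], ["2"],
                    ["Expr", "*", "Expr"], ["5"], ["10"]):
    form = rewrite_leftmost(form, replacement)
    print(" ".join(form))   # prints each sentential form in turn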
If following a leftmost (or rightmost) derivation can produce more than one parse tree for the same sentence, we call the grammar ambiguous. One way to remove ambiguity is to add a new nonterminal that embodies the repeated logic. The book uses the example of
Statement → if Expr then Statement else Statement
| if Expr then Statement
| Assignment
which is ambiguous for “if expr1 then if expr2 then assign1 else assign2” (which arm does the else belong to?) and corrects it to
Statement → if Expr then Statement
| if Expr then WithElse else Statement
| Assignment
WithElse → if Expr then WithElse else WithElse
| Assignment
This forces each else clause to always be associated with the innermost unmatched if. We also assert that the closer a rule is to the start symbol, the lower its precedence. That is, elements with more binding power will usually appear lower in the parse tree.
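For example, one common way to encode precedence directly is to stratify the grammar so that * is derived below + and therefore binds tighter; this variant of our math grammar is a standard construction, not one from the notes above:

Expr   -> Expr + Term
        | Term
Term   -> Term * Factor
        | Factor
Factor -> number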
Top-Down Parsing
Top-down parsers start with the root of a parse tree and build downwards, continuing until all leaves of the tree are terminal values and the input stream of tokens is exhausted (or an error occurs). In the case of an error, a parser assumes it applied some wrong production, backtracks by unwinding the changes it made back to its last branching choice, tries the next alternative, and continues. Backtracking isn’t mandatory for every grammar, however. A rough sketch of a backtracking algorithm in pseudocode:
let node = root
let word = scanner.nextWord()
let testingStack = []
while true
    if node is a variable (non-terminal)
        // queue up every alternative for node, last first, so that
        // backtracking can try them in order; start with the first
        for idx = node.rules.length - 1 down to 0
            testingStack.push(node.rules[idx])
        node = testingStack.pop()
    else if word == node
        // the leaf matches the input token; consume and move on
        word = scanner.nextWord()
        node = testingStack.pop()
    else if word == EOF and testingStack is empty
        return root
    else
        backtrack // unwind to the last branching choice and try the next rule
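To make the backtracking concrete, here is a small recursive sketch in Python; the toy grammar and helper names are mine. Each non-terminal tries its alternatives in order, and a failed alternative simply rewinds the input position:

# Toy grammar: S -> "a" S "b" | "ab".
GRAMMAR = {"S": [["a", "S", "b"], ["ab"]]}

def parse(symbol, tokens, pos):
    """Return the position after matching `symbol`, or None on failure."""
    if symbol not in GRAMMAR:                 # terminal: must match the input
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for alternative in GRAMMAR[symbol]:       # non-terminal: try each rule
        cursor = pos
        for sub in alternative:
            cursor = parse(sub, tokens, cursor)
            if cursor is None:
                break                         # backtrack: rewind to `pos`
        else:
            return cursor                     # whole alternative matched
    return None

tokens = ["a", "a", "ab", "b", "b"]
print(parse("S", tokens, 0) == len(tokens))   # True: input accepted

Note that parse would recurse forever on a left-recursive rule, which is exactly the problem discussed next.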
What happens in the case of a left-recursive grammar with a top-down parser? Think of our math grammar from above:
1) Expr -> Expr + Expr
2) | Expr * Expr
3) | number
We would end up in an infinite loop, repeatedly expanding the leftmost Expr with the first rule. So, in order to fix this, we rewrite the productions to be right-recursive:
Expr  -> Term Expr'
Expr' -> + Term Expr'
      |  * Term Expr'
      |  ε
Term  -> number
We need to watch out for direct left recursion as well as indirect left recursion.
Once we’ve done this, we can introduce a lookahead symbol, which allows us to peek at the first symbol of each rule and match it against the upcoming token in order to predict the next applicable rule. We call the set of valid lookahead symbols the FIRST set, which is defined as “the set of terminals that can appear at the start of a sentence derived from some rule α”.
That is, for Expr’, the FIRST set is { +, *, ε } (ε because Expr’ can derive the empty string).
We can also define a FOLLOW set as the set of words that can come immediately after a nonterminal. For Expr’, that is { eof }.
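FIRST sets can be computed with a simple fixed-point iteration. A Python sketch over the right-recursive grammar above (the encoding, and spelling ε as "eps", are my choices):

GRAMMAR = {
    "Expr":  [["Term", "Expr'"]],
    "Expr'": [["+", "Term", "Expr'"], ["*", "Term", "Expr'"], ["eps"]],
    "Term":  [["number"]],
}

def first_sets(grammar):
    """Iterate to a fixed point: each FIRST set grows until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                for symbol in alt:
                    if symbol in grammar:
                        adds = first[symbol] - {"eps"}
                    else:
                        adds = {symbol}  # a terminal (or "eps") starts itself
                    if not adds <= first[nt]:
                        first[nt] |= adds
                        changed = True
                    # only keep scanning if this symbol can derive epsilon
                    if not (symbol in grammar and "eps" in first[symbol]):
                        break
                else:
                    # every symbol in the rule was nullable
                    if "eps" not in first[nt]:
                        first[nt].add("eps")
                        changed = True
    return first

print(first_sets(GRAMMAR))
# FIRST(Expr') == {'+', '*', 'eps'}; FIRST(Expr) == FIRST(Term) == {'number'}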
We also introduce left-factoring, which rewrites productions that share a common prefix by isolating that prefix into its own set of productions. That is, given
Expr -> T + F
| T - F
| T
we see a common prefix of T, so we isolate it and move the remainder into a new production:
Expr -> T Expr'
Expr' -> + F
| - F
| ε
The pseudocode to use Expr’ looks like
function exprPrime() {
    if word == + or word == -
        word = scanner.nextWord()
        if isF(word)
            return true
        else
            return false
    else if word == EOF // ε case: eof is in FOLLOW(Expr')
        return true
    else
        reportSyntaxError()
        return false
}
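Putting the pieces together, a predictive parser for the factored grammar never needs to backtrack: one token of lookahead selects the production. In this Python sketch I assume, for simplicity, that T and F are both single number tokens:

# Predictive parser for: Expr -> T Expr' ; Expr' -> + F | - F | ε
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"

    def advance(self):
        self.pos += 1

    def expr(self):
        return self.t() and self.expr_prime()

    def expr_prime(self):
        if self.peek() in ("+", "-"):   # FIRST of the +F / -F alternatives
            self.advance()
            return self.f()
        if self.peek() == "EOF":        # ε case: EOF is in FOLLOW(Expr')
            return True
        return False                    # syntax error

    def t(self):
        return self.number()            # assumption: T is a number token

    def f(self):
        return self.number()            # assumption: F is a number token

    def number(self):
        if self.peek().isdigit():
            self.advance()
            return True
        return False

print(Parser(["2", "+", "5"]).expr())   # True
print(Parser(["2", "+"]).expr())        # False: dangling operator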
A top-down parser that uses FIRST, FOLLOW, and FIRST+ sets to avoid backtracking is called LL(1). That is, we scan the input left to right, constructing a leftmost derivation, with a lookahead of 1 token. As long as the grammar is right-recursive and backtrack-free, LL(1) is an option.
Bottom-Up Parsing
Like the name suggests, a bottom-up parser finds the leaves (terminals) of the parse tree and builds non-terminals on top. The parser “extends the frontier upwards” by looking for some production a' -> b'; that is, if b' occurs in the frontier ending at some position k, we can replace b' with a'. If that replacement yields a step in a valid derivation, we call the pair (a' -> b', k) a handle. The process of replacing b' with a' in the frontier is called a reduction.
The scanner still consumes input left to right, but our parser constructs a rightmost derivation in reverse, so that the leftmost leaf is handled first. As a result, these parsers are LR parsers: left-to-right input, reverse rightmost derivation.
The LR(1) parsing algorithm uses a single lookahead symbol. It uses two tables, an Action table and a Goto table, to find the next handle for the derivation. It does this by holding a stack of the current upper frontier (the layer of the parse tree it’s building), interleaved with states from the handle-finding steps, to figure out when to make a reduction.
A shift is just an advancement of the scanner stream, the equivalent of consuming a token; the shifted symbol joins the frontier at the top of the stack.
Pseudocode
push $
push starting state s0
word = scanner.nextWord()
while true
    state = stack.top() // peek; the state stays put until we shift or reduce
    if Action[state, word] == "reduce a' -> b'"
        pop 2 * b'.length times // each frontier symbol is interleaved with a state
        state = stack.top()
        push a'
        push Goto[state, a']
    else if Action[state, word] == "shift si"
        push word
        push si
        word = scanner.nextWord()
    else if Action[state, word] == "accept"
        break
    else
        failure()
return success
Action tables map (state, word) pairs to steps: shift, reduce, or accept. Goto tables map a state and a non-terminal to the state the parser enters after reducing to that non-terminal.
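As a concrete illustration, here are hand-built tables and the driver above in Python for a tiny grammar (Goal -> Expr, Expr -> Expr + num | num); in practice a generator derives these tables from the canonical collection:

# Hand-built LR tables for: Goal -> Expr ; Expr -> Expr + num | num
ACTION = {
    (0, "num"): ("shift", 1),
    (1, "+"):   ("reduce", "Expr", 1),   # Expr -> num
    (1, "eof"): ("reduce", "Expr", 1),
    (2, "+"):   ("shift", 3),
    (2, "eof"): ("accept",),
    (3, "num"): ("shift", 4),
    (4, "+"):   ("reduce", "Expr", 3),   # Expr -> Expr + num
    (4, "eof"): ("reduce", "Expr", 3),
}
GOTO = {(0, "Expr"): 2}

def parse(words):
    stack = ["$", 0]                     # symbols interleaved with states
    words = words + ["eof"]
    i = 0
    while True:
        state = stack[-1]
        step = ACTION.get((state, words[i]))
        if step is None:
            return False                 # syntax error
        if step[0] == "shift":
            stack += [words[i], step[1]]
            i += 1
        elif step[0] == "reduce":
            _, nonterminal, length = step
            del stack[-2 * length:]      # pop symbol/state pairs
            stack += [nonterminal, GOTO[(stack[-1], nonterminal)]]
        else:                            # accept
            return True

print(parse(["num", "+", "num"]))        # True
print(parse(["num", "+"]))               # False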
In order to build these tables, we need to be able to represent handles and potential handles alongside the lookahead symbols. Each potential handle is represented by an LR(1) item, which looks like [a' -> b' • y, a], where a' -> b' y is a grammar production, • represents the position of the stacktop, and a is a terminal symbol (the lookahead). The set of all such items is called the canonical collection. Where we place • means different things.
- [a' -> • b'y, a] means a' would be valid if we find a b'y next. This item is called a possibility: it represents a possible completion for the input already seen.
- [a' -> b' • y, a] means the parser has progressed from the state before and has recognized b'. This item is partially complete.
- [a' -> b'y •, a] means the parser has found some b'y such that a' followed by a would be valid. If the lookahead symbol is a, then the item is a handle and the parser can reduce b'y to a'. This is a complete item.
To build the canonical collection, a parser starts with an initial state [Goal -> • SomeExpr, eof]; that is, the first handle to be reduced is one whose lookahead symbol is the end of the file. We then find every potential state change as a set of items. The two operations that occur are taking a closure and computing a transition.
- A closure completes a state. It adds to the set of items any related implications. That is, anywhere Goal -> List is valid, we can also add the productions for the ways to derive List. So if we see [Goal -> • List, eof], we can say that both [List -> • List Pair, eof] and [List -> • Pair, eof] are valid.
- To compute a transition on some symbol x, the algorithm finds the items that have • before x and builds the set of items with • moved after x. This is the goto procedure.
Pseudocode for closure
function closure(s)
    while s is still changing
        for each item [a' -> b' • nonTerminalC rest, a] in s
            for each production nonTerminalC -> y in ProductionList
                for each lookAhead in FIRST(rest followed by a)
                    s = union(s, {[nonTerminalC -> • y, lookAhead]})
    return s
So if we had some grammar for math, say Goal -> Expr with Expr -> Expr + Expr | Num, we might start with [Goal -> • Expr, eof],
and the closure operator would add to the set
[Expr -> • Expr + Expr, eof], [Expr -> • Expr + Expr, +], [Expr -> • Num, eof], [Expr -> • Num, +]
The goto procedure takes some set CCi in the canonical collection and a grammar symbol x, and computes the state that would be reached if x were recognized in state i.
goto(set, x)
    moved = {} // empty set
    for item in set
        if item matches some [a' -> b' • x rest, a]
            moved = union(moved, {[a' -> b' x • rest, a]})
    return closure(moved)
It tests whether it can move • past the symbol x; for every item where it can, it adds the moved item to a new set, then returns the closure of that set.
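Both procedures are small set computations. A Python sketch over the Goal -> List example above; representing items as tuples and precomputing FIRST by hand are my simplifications:

# Grammar: Goal -> List ; List -> List Pair | Pair ; Pair -> ( )
# Items are (lhs, rhs, dot, lookahead) tuples.
PRODUCTIONS = {
    "Goal": [("List",)],
    "List": [("List", "Pair"), ("Pair",)],
    "Pair": [("(", ")")],
}
# FIRST sets precomputed by hand for this grammar.
FIRST = {"Goal": {"("}, "List": {"("}, "Pair": {"("},
         "(": {"("}, ")": {")"}, "eof": {"eof"}}

def first_of_string(symbols, lookahead):
    """FIRST of `symbols` followed by `lookahead`; nothing is nullable here."""
    return FIRST[symbols[0]] if symbols else {lookahead}

def closure(items):
    items = set(items)
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for lhs, rhs, dot, la in list(items):
            if dot < len(rhs) and rhs[dot] in PRODUCTIONS:
                rest = rhs[dot + 1:]
                for production in PRODUCTIONS[rhs[dot]]:
                    for b in first_of_string(rest, la):
                        item = (rhs[dot], production, 0, b)
                        if item not in items:
                            items.add(item)
                            changed = True
    return items

def goto(items, x):
    """Move the dot past x in every item where it sits before x."""
    moved = {(lhs, rhs, dot + 1, la)
             for lhs, rhs, dot, la in items
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved)

cc0 = closure({("Goal", ("List",), 0, "eof")})
for item in sorted(cc0, key=str):
    print(item)
print("goto on '(' has", len(goto(cc0, "(")), "items")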
Conclusion
This is a very high-level overview, ignoring a lot of error management. There are many other parsing algorithms, such as SLR and LALR. I handwaved over a couple of details in the parsing pseudocode for brevity, but I have found some notes / books that can help.