Honing my craft

An Overview of Parsing Algorithms

August 26, 2020


These notes are supplemented by Dmitry Soshnikov’s videos on parsing


Introduction

Compiler parsers serve as the second stage of the “front end”, after scanning. A scanner takes source text and annotates it into tokens: small structs with a classification. For example, if (true) {print("hello")} might yield [{type: IF}, {type: LEFT_PAREN}, {type: TRUE}, {type: RIGHT_PAREN}, {type: LEFT_BRACE}, {type: IDENTIFIER, value: "print"}, {type: LEFT_PAREN}, {type: STRING, value: "hello"}, {type: RIGHT_PAREN}, {type: RIGHT_BRACE}, {type: EOF}]. A parser receives these tokens as a stream and tries to determine syntactic structure based on some grammar, the rules of a valid language. Parsers typically use a context-free grammar, which Cooper/Torczon define as “a set of rules that describe how to form sentences.”
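To make that concrete, here's a sketch of what those token structs might look like in TypeScript (the TokenType names are my own; real scanners usually also track line and column for error reporting):

enum TokenType { IF, LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE, TRUE, IDENTIFIER, STRING, EOF }

interface Token {
  type: TokenType;
  value?: string; // only identifiers and literals carry a value
}

// The stream a parser would receive for: if (true) {print("hello")}
const tokens: Token[] = [
  { type: TokenType.IF },
  { type: TokenType.LEFT_PAREN },
  { type: TokenType.TRUE },
  { type: TokenType.RIGHT_PAREN },
  { type: TokenType.LEFT_BRACE },
  { type: TokenType.IDENTIFIER, value: "print" },
  { type: TokenType.LEFT_PAREN },
  { type: TokenType.STRING, value: "hello" },
  { type: TokenType.RIGHT_PAREN },
  { type: TokenType.RIGHT_BRACE },
  { type: TokenType.EOF },
];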

Formally, a CFG is defined as the tuple (T, NT, S, P) where T is the set of terminal symbols (those that cannot be substituted any further as you follow the rules), NT is the set of non-terminal symbols (variables, a set of abstractions), S is the goal symbol or start symbol, and P is the set of productions, the rules for rewrites / substitutions. That is, any non-terminal should be rewritable as a string of other symbols in the grammar.

We traditionally use Backus-Naur Form (BNF) to describe our grammar. For example, an incomplete grammar for math might look like

1) Expr -> Expr + Expr
2)       | Expr * Expr
3)       | number
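As a sketch, the (T, NT, S, P) tuple maps naturally onto a data structure. Here's one possible TypeScript encoding (the names are mine): the nonterminals are the keys of productions, the terminals are every other symbol appearing on a right-hand side, and each key maps to its list of alternatives.

type Sym = string;

interface Grammar {
  start: Sym;                        // S, the goal symbol
  productions: Record<Sym, Sym[][]>; // P: each nonterminal maps to its alternatives
}

// The math grammar above: Expr -> Expr + Expr | Expr * Expr | number
const math: Grammar = {
  start: "Expr",
  productions: {
    Expr: [["Expr", "+", "Expr"], ["Expr", "*", "Expr"], ["number"]],
  },
};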

When we create a derivation, a sequence of rewriting steps following the rules of the grammar, we generate sentential forms: the strings of symbols that occur at each step of the derivation. As we do so, we generate a parse tree. For example, if we used our above rules and tried to parse 2 + 5 * 10 we might get:

Rule  Sentential Form
      Expr
1     Expr + Expr
3     2 + Expr
2     2 + Expr * Expr
3     2 + 5 * Expr
3     2 + 5 * 10

In doing this, we followed a leftmost derivation: we rewrote the leftmost non-terminal at each step. There is also rightmost derivation, which rewrites the rightmost non-terminal instead.
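The rewriting itself is mechanical, so here's a small sketch that replays the derivation table above by rewriting the leftmost nonterminal at each step (rule indices are 0-based here, and I print the terminal class number rather than the literal values):

const rules: Record<string, string[][]> = {
  Expr: [["Expr", "+", "Expr"], ["Expr", "*", "Expr"], ["number"]],
};

function rewriteLeftmost(form: string[], alt: number): string[] {
  const i = form.findIndex((s) => s in rules); // leftmost nonterminal
  return [...form.slice(0, i), ...rules[form[i]][alt], ...form.slice(i + 1)];
}

let form = ["Expr"];
for (const alt of [0, 2, 1, 2, 2]) { // rules 1, 3, 2, 3, 3 from the table
  form = rewriteLeftmost(form, alt);
  console.log(form.join(" "));
}
// Expr + Expr
// number + Expr
// number + Expr * Expr
// number + number * Expr
// number + number * number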

If following a leftmost (or rightmost) derivation can result in different parse trees for the same sentence, we call the grammar ambiguous. One way to remove ambiguity is to add a new nonterminal that embodies the repeated logic. The book uses the example of

Statement → if Expr then Statement else Statement
          | if Expr then Statement
          | Assignment

This is ambiguous for “if expr1 then if expr2 then assign1 else assign2” (which arm does the else belong to?), and the book corrects it to

Statement → if Expr then Statement
          | if Expr then WithElse else Statement
          | Assignment

WithElse → if Expr then WithElse else WithElse
         | Assignment

This forces the else clause to always be associated with the correct arm. We also assert that the closer a rule is to the start symbol, the lower its precedence. That is, elements with more binding power usually appear lower in the parse tree.

Top Down Parsing

Top-down parsers start with the root of a parse tree and build downwards, continuing until all leaves of the tree are terminal values and the input stream of tokens is exhausted (or until an error occurs). In the case of an error, the parser assumes it applied some wrong production, backtracks by unwinding the changes it made since its last branching choice, tries the next alternative, and continues. Not every grammar requires backtracking, however. Pseudocode for a backtracking algorithm looks like:

let root = node for the start symbol
let focus = root
let word = scanner.nextWord()
let stack = []
while true
	if focus is a nonterminal
		// pick the next untried rule focus -> b1 b2 ... bn,
		// backtracking if none remain
		build child nodes for b1 ... bn
		push bn ... b2 onto stack
		focus = b1
	else if word matches focus
		word = scanner.nextWord()
		focus = stack.pop()
	else if word == EOF and focus is null
		return root
	else
		// a wrong production was applied somewhere: unwind to the
		// last branching choice and try its next alternative
		backtrack
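Here's a runnable toy version of the same idea (my own sketch, not the book's algorithm): parse() returns every input position reachable after matching a symbol, and trying each alternative in turn is the backtracking. Note the toy grammar is deliberately not left-recursive, for reasons covered next.

const grammar: Record<string, string[][]> = {
  S: [["(", "S", ")"], ["x"]],
};

function parse(sym: string, input: string[], pos: number): number[] {
  if (!(sym in grammar)) {
    // terminal: match exactly one token or fail
    return input[pos] === sym ? [pos + 1] : [];
  }
  const results: number[] = [];
  for (const rhs of grammar[sym]) { // try each alternative...
    let positions = [pos];
    for (const part of rhs) {       // ...matching its symbols in sequence
      positions = positions.flatMap((p) => parse(part, input, p));
    }
    results.push(...positions);
  }
  return results;
}

// Accept if some parse of S consumes the whole input.
console.log(parse("S", ["(", "x", ")"], 0).includes(3)); // true
console.log(parse("S", ["(", "x"], 0).includes(2));      // false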

What happens in the case of a left-recursive grammar with a top-down parser? Think of our math grammar above:

1) Expr -> Expr + Expr
2)       | Expr * Expr
3)       | number

We would end up in an infinite cycle of expanding the leftmost Expr with the first substitution rule. So we need to rewrite the productions into a right-recursive grammar:

Expr -> Term Expr'

Expr' -> + Term Expr'
       | * Term Expr'
       | ε

Term -> number

We need to watch out for direct left recursion as well as indirect left recursion (where, for example, a -> b ... and b -> a ... cycle back without an immediate self-reference).

Once we’ve done this, we can introduce a lookahead symbol, which allows us to peek at the upcoming token and match it against the first symbol of each rule in order to predict the next applicable rule. We call the set of valid lookahead symbols the FIRST set, defined as “the set of terminals that can appear at the start of a sentence derived from some rule.”

That is, for Expr’, the FIRST set is {+, *, ε} (ε because Expr’ can derive the empty string).

We can also define a FOLLOW set as all the words that can come after a nonterminal. For Expr’, FOLLOW is {eof}.
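To make FIRST concrete, here's a sketch of the classic fixed-point computation over the rewritten grammar (the ε marker and the names are my own convention; anything not in productions is treated as a terminal):

const productions: Record<string, string[][]> = {
  Expr: [["Term", "ExprPrime"]],
  ExprPrime: [["+", "Term", "ExprPrime"], ["*", "Term", "ExprPrime"], ["ε"]],
  Term: [["number"]],
};

function firstSets(): Record<string, Set<string>> {
  const first: Record<string, Set<string>> = {};
  for (const nt of Object.keys(productions)) first[nt] = new Set();

  let changed = true;
  while (changed) { // repeat until no set grows
    changed = false;
    for (const [nt, alts] of Object.entries(productions)) {
      for (const rhs of alts) {
        for (let i = 0; i < rhs.length; i++) {
          const sym = rhs[i];
          // a terminal's FIRST set is just itself
          const symFirst = sym in productions ? first[sym] : new Set([sym]);
          for (const t of symFirst) {
            if (t !== "ε" && !first[nt].has(t)) { first[nt].add(t); changed = true; }
          }
          if (!symFirst.has("ε")) break; // keep scanning only past symbols that can vanish
          if (i === rhs.length - 1 && !first[nt].has("ε")) { first[nt].add("ε"); changed = true; }
        }
      }
    }
  }
  return first;
}

console.log(firstSets());
// Expr: { number }, ExprPrime: { +, *, ε }, Term: { number }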

We also introduce left factoring, which rewrites productions that share a common left prefix by isolating that prefix into its own production. That is, given

Expr -> T + F
      | T - F
      | T

We see a common prefix of T, so we isolate it and move the rest into a new production:

Expr -> T Expr'

Expr' -> + F
       | - F
       | ε

The pseudocode to use Expr’ looks like

function exprPrime() {
	if word == + or word == -
		word = scanner.nextWord()
		return parseF() // parse the F that must follow the operator
	else if word == EOF // epsilon case: eof is in FOLLOW(Expr')
		return true
	else
		reportSyntaxError()
		return false
}

A top-down parser that uses FIRST, FOLLOW, and FIRST+ sets to avoid backtracking is called LL(1). That is, it scans the input left to right, builds a leftmost derivation, and uses a lookahead of 1 token. As long as the grammar avoids left recursion and is backtrack free, LL(1) is an option.
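For the rewritten math grammar, the whole LL(1) parser is just one function per nonterminal, with the single lookahead deciding which production to take. A minimal recognizer sketch, with token handling simplified to bare strings:

const toks = ["number", "+", "number", "*", "number"]; // pretend scanner output
let pos = 0;
const peek = () => (pos < toks.length ? toks[pos] : "eof");
const eat = (t: string) => {
  if (peek() !== t) throw new Error(`expected ${t}, got ${peek()}`);
  pos++;
};

function expr(): void { // Expr -> Term Expr'
  term();
  exprPrime();
}

function exprPrime(): void { // Expr' -> + Term Expr' | * Term Expr' | ε
  if (peek() === "+" || peek() === "*") { // FIRST(Expr') drives the choice
    eat(peek());
    term();
    exprPrime();
  }
  // otherwise take ε: peek() should be in FOLLOW(Expr') = { eof }
}

function term(): void { // Term -> number
  eat("number");
}

expr();
eat("eof"); // succeed only if all input was consumed
console.log("parsed OK");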

Bottom-Up Parsing

Like the name suggests, a bottom-up parser finds the leaves (terminals) of the parse tree and builds non-terminals on top of them. The parser “extends the frontier upwards” by looking for some production a' -> b': if b' appears in the frontier with its right end at some position k, we can replace b' with a'. If that replacement yields a valid step in the derivation, we call the pair (a' -> b', k) the handle. The process of replacing b' with a' in the frontier is called a reduction.

The scanner still consumes input left to right, but our parser works on a rightmost derivation in reverse, so that the leftmost leaf is handled first. As a result, these parsers are LR parsers - left-to-right input, reverse rightmost derivation.

The LR(1) parsing algorithm uses a single lookahead symbol. It uses two tables, an Action table and a Goto table, to find the next handle for the derivation. It does this by holding a stack of the current upper frontier (the layer of the parse tree it’s building), interleaved with states from the handle-finding process, to figure out when to make a reduction.

A shift is just an advancement of the scanner stream - the equivalent of consuming a token. The shifted token is pushed onto the stack and becomes part of the current frontier.

Pseudocode

push $
push starting state s0
word = scanner.nextWord()

while true
	let state = stack.peek()
	if Action[state, word] == "reduce a' -> b'"
		pop 2 * b'.length times // each symbol is interleaved with a state
		state = stack.peek()
		push a'
		push Goto[state, a']
	else if Action[state, word] == "shift si"
		push word
		push si
		word = scanner.nextWord()
	else if Action[state, word] == "accept"
		break
	else
		failure()

return success
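Here's a runnable version of that driver in TypeScript, with Action and Goto tables I built by hand for the toy grammar S -> ( S ) | x (so the tables are illustrative, not generated):

type Shift = { shift: number };
type Reduce = { lhs: string; len: number };
type ActionEntry = Shift | Reduce | "accept";

const action: Record<number, Record<string, ActionEntry>> = {
  0: { "(": { shift: 2 }, x: { shift: 3 } },
  1: { $: "accept" },
  2: { "(": { shift: 5 }, x: { shift: 6 } },
  3: { $: { lhs: "S", len: 1 } },   // reduce S -> x
  4: { ")": { shift: 7 } },
  5: { "(": { shift: 5 }, x: { shift: 6 } },
  6: { ")": { lhs: "S", len: 1 } }, // reduce S -> x
  7: { $: { lhs: "S", len: 3 } },   // reduce S -> ( S )
  8: { ")": { shift: 9 } },
  9: { ")": { lhs: "S", len: 3 } }, // reduce S -> ( S )
};
const gotoTable: Record<number, Record<string, number>> = {
  0: { S: 1 }, 2: { S: 4 }, 5: { S: 8 },
};

function parseLR(input: string[]): boolean {
  const words = [...input, "$"];
  let w = 0;
  const stack: (string | number)[] = ["$", 0]; // symbols interleaved with states
  while (true) {
    const state = stack[stack.length - 1] as number;
    const act = action[state]?.[words[w]];
    if (act === undefined) return false; // error entry
    if (act === "accept") return true;
    if ("shift" in act) {
      stack.push(words[w], act.shift); // shift: consume a token
      w++;
    } else {
      stack.splice(stack.length - 2 * act.len); // reduce: pop the handle
      const top = stack[stack.length - 1] as number;
      stack.push(act.lhs, gotoTable[top][act.lhs]); // then take the Goto
    }
  }
}

console.log(parseLR(["(", "x", ")"])); // true
console.log(parseLR(["(", "x"]));      // false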

Action tables are a mapping of states and words to steps - a shift, a reduce, an accept, or an error. Goto tables are a mapping of states and nonterminals to the state the parser enters after a reduction.

In order to build these tables, we need to be able to represent handles and potential handles alongside the lookahead symbols. Each potential handle is represented by an LR(1) item, which looks like [a' -> b' • y, a] where a' -> b'y is a grammar production, • represents the position of the stacktop, and a is a terminal symbol. The set of all such items is called the canonical collection. Where we place • means different things.

  1. [a’ -> • b’y, a] = a’ would be valid; if we find b’y next, then we can discover a’. This item is called a possibility - it represents a possible completion for the input already seen.
  2. [a’ -> b’ • y, a] = the parser has progressed from the state before and has recognized b’. This item is partially complete.
  3. [a’ -> b’y •, a] = the parser has found some b’y such that a’ followed by a would be valid. If the lookahead symbol is a, then the item is a handle and the parser can reduce b’y to a’. This is a complete item.

To build the canonical collection, a parser starts with an initial state [Goal -> • SomeExpr, eof], that is, the first handle that can be reduced such that the lookahead symbol is the end of the file. We then find every potential state change as a set. The two operations that can occur are taking a closure and computing a transition.

  1. A closure completes a state. It adds to the set of items any related implications. That is, anywhere Goal -> List is valid, we can also add the productions that derive List. So if we see [Goal -> • List, eof], we can say that both [List -> • List Pair, eof] and [List -> • Pair, eof] are valid.
  2. To compute a transition on some symbol x, the algorithm finds each item that has • before x and builds the set of items with • moved after x. This is the goto procedure.

Pseudocode for closure

function closure(s)
	while s keeps changing
		for each item [a' -> b' • C d, a] in s // C is a nonterminal
			for each production C -> y in ProductionList
				for each lookAhead b in FIRST(da)
					s = union(s, [C -> • y, b])

	return s

So if our grammar had the productions Expr -> Expr + Num | Num, we might start with [Goal -> • Expr, eof], and the closure operation would add

[Expr -> • Expr + Num, eof] and [Expr -> • Num, eof], and then, since Expr can appear before +, the same items with lookahead +: [Expr -> • Expr + Num, +] and [Expr -> • Num, +]

The goto procedure takes some set CCi in the canonical collection and a grammar symbol x, and computes the state the parser would reach if it recognized x while in state i.

goto(set, x)
	moved = {}
	for item in set
		if item matches some [a' -> b' • x rest, a]
			moved = union(moved, [a' -> b'x • rest, a])
	return closure(moved)
		

It tests whether it can move • past the symbol x; each item where it can is added, with • advanced, to a new set, and the procedure returns the closure of that set.
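Here's a sketch of closure and goto over explicit item objects, reusing the toy grammar S -> ( S ) | x. All the names are mine, and firstOf is deliberately simplified: it assumes the grammar has no ε-productions and no left recursion, which holds for this grammar.

interface Item { lhs: string; rhs: string[]; dot: number; la: string }

const prods: Record<string, string[][]> = {
  Goal: [["S"]],
  S: [["(", "S", ")"], ["x"]],
};

// FIRST(da) for the sequence after the nonterminal, simplified as noted above
function firstOf(seq: string[], la: string): Set<string> {
  const head = seq[0] ?? la; // empty sequence: FIRST(da) is just {a}
  if (!(head in prods)) return new Set([head]); // terminal
  return new Set(prods[head].flatMap((rhs) => [...firstOf(rhs, la)]));
}

const key = (it: Item) => `${it.lhs}->${it.rhs.join(" ")}@${it.dot},${it.la}`;

function closure(items: Item[]): Item[] {
  const set = new Map(items.map((it): [string, Item] => [key(it), it]));
  let changed = true;
  while (changed) {
    changed = false;
    for (const it of [...set.values()]) {
      const next = it.rhs[it.dot]; // symbol just after •
      if (next === undefined || !(next in prods)) continue;
      const rest = it.rhs.slice(it.dot + 1); // d in [a' -> b' • C d, a]
      for (const rhs of prods[next]) {
        for (const la of firstOf(rest, it.la)) {
          const item: Item = { lhs: next, rhs, dot: 0, la };
          if (!set.has(key(item))) { set.set(key(item), item); changed = true; }
        }
      }
    }
  }
  return [...set.values()];
}

function gotoSet(items: Item[], x: string): Item[] {
  const moved = items
    .filter((it) => it.rhs[it.dot] === x)       // • sits immediately before x
    .map((it) => ({ ...it, dot: it.dot + 1 })); // slide • past x
  return closure(moved);
}

const cc0 = closure([{ lhs: "Goal", rhs: ["S"], dot: 0, la: "eof" }]);
// cc0 holds [Goal -> • S, eof], [S -> • ( S ), eof], [S -> • x, eof]
console.log(cc0.length);               // 3
console.log(gotoSet(cc0, "(").length); // 3 items in the state reached on "("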

Conclusion

This is a very high-level overview, ignoring a lot of error management. There are many other parsing algorithms, such as SLR and LALR. I handwaved over a couple of details in the parsing pseudocode for brevity, but I have found some notes / books that can help.

  1. https://tomassetti.me/guide-parsing-algorithms-terminology/#parsingAlgorithms
  2. https://www.amazon.com/Parsing-Techniques-Practical-Monographs-Computer/dp/1441919015

Nikhil Thomas

I work and live in Brooklyn, NY building software.