Crafting Interpreters in OCaml - Finishing the Lexer
May 22, 2019
To follow along with the source material, check out the Scanning chapter of Crafting Interpreters.
Moving on to section 4.5.2 of the book, we have multi-character operators. Cool! Specifically, we have the !=, ==, <= and >= operators. Bob defines a match method that takes an expected character and checks whether the current character is that character. If we're at the end of the source, it returns false. If the current character isn't the expected one, it returns false. Otherwise, we advance and return true. Okay, that seems straightforward.
Let's define a match function. Remember that we don't have any global variables, so we're going to have to pass in the context as well as the character. Since we want to conditionally advance current, it makes sense for our function to take and return a scanner_context. We also want to keep that boolean result, so let's return a tuple.
We also need our match function to look at the character in the next position, so we’ll need a helper function.
In our scanner.mli file we can define the following:
val next_char: scanner_context -> char option
val match_character: char -> scanner_context -> bool * scanner_context
This defines a function that takes a character and a context and returns a 2-tuple of boolean and context types.
Now we can write some code
let next_char context =
try Some (String.get context.source (context.current)) with Invalid_argument _ -> None
let match_character expected_char context =
  if is_at_end context then (false, context)
  else
    match next_char context with
    | Some c when c = expected_char -> (true, advance context)
    | _ -> (false, context)
You may note that next_char and current_char look really similar. I don't really have a good solution for that but that's okay for now.
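If the tuple return feels odd, here's a minimal standalone sketch of the same idea over a plain string and an integer cursor (match_char_at is a hypothetical name, not our scanner_context API), just to show the peek-then-conditionally-advance shape:
(* A self-contained sketch: peek at the character under the cursor and only
   move the cursor forward when it matches the expected character. *)
let match_char_at (expected : char) (source : string) (cursor : int) : bool * int =
  if cursor >= String.length source then (false, cursor)
  else if source.[cursor] = expected then (true, cursor + 1)
  else (false, cursor)

(* match_char_at '=' "!=" 1 evaluates to (true, 2);
   match_char_at '=' "!a" 1 evaluates to (false, 1). *)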
Now we can add some new cases to our match statement. We’re going to have to write a lot of cases that look something like
let x = match match_character '=' advanced_context with
  | (true, ctx) -> add_non_literal_token BANG_EQUAL ctx
  | (false, ctx) -> add_non_literal_token BANG ctx
But we'd need that for all four of the two-character operators above. Since I'm finally being lazy, let's write a helper. Our new function needs to take an expected character to match, a token for the true case, a token for the false case, and a context. It will return a new context with the token added. That signature will look like:
val add_conditional_non_literal_token: char -> token_type -> token_type -> scanner_context -> scanner_context
Neat. Now we can write that function:
let add_conditional_non_literal_token expected_char token_if_true token_if_false context =
match (match_character expected_char context) with
| (true, new_context) -> add_non_literal_token token_if_true new_context
| (false, new_context) -> add_non_literal_token token_if_false new_context
And our scan_token function now has the following new clauses:
let scan_token context =
let advanced_context = advance context in
match (current_char advanced_context) with
| Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
| Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
| Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
| Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
| Some ',' -> add_non_literal_token COMMA advanced_context
| Some '.' -> add_non_literal_token DOT advanced_context
| Some '-' -> add_non_literal_token MINUS advanced_context
| Some '+' -> add_non_literal_token PLUS advanced_context
| Some ';' -> add_non_literal_token SEMICOLON advanced_context
| Some '*' -> add_non_literal_token STAR advanced_context
| Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
| Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
| Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
| Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
| _ -> failwith "Disallowed character"So far we’re doing great! Section 4.6 adds some more lexeme handling. Specifically, comments. Let’s describe in words what Bob’s code does.
If we see a slash, check whether the next character is also a slash. If it is, just move forward until we either see a newline or reach the end of the file. If the next character ISN'T a slash, then just add a slash token. In our world we don't have multiline comments (whew).
Looking a little ahead, we also see Bob's peek function: if we're at the end it returns a null character, otherwise the character at the current position. We can use another char option here, since we don't need nulls in this magical ML world.
It turns out we already wrote this function! next_char does the exact same thing. So we’ll just use that. If you want, you can alias it as let peek = next_char but I won’t.
We can use our match_character helper here to create another helper function. We want a function that either adds a slash token, or just moves on. Let’s call that function add_conditional_slash for now. It’ll take context and return context.
val add_conditional_slash: scanner_context -> scanner_context

let add_conditional_slash context =
  let rec consume_line context =
    match (not (next_char context = Some '\n'), not (is_at_end context)) with
    | (true, true) -> consume_line (advance context)
    | _ -> context
  in
  match match_character '/' context with
  | (true, new_context) -> consume_line new_context
  | (false, new_context) -> add_non_literal_token SLASH new_context
Let's explain this code:
We nest a function to consume the entire line that says "if the next character isn't a newline AND we're not at the end of the file, then move forward a character and check again." Once we either see a newline or hit the end of the file, we return the context object.
Then we simply say if the next character is a slash, consume the line. Else, add a slash token. Maybe later we'll pull out this nested function, but for now I think it's okay since we have no other use cases for it.
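As a quick sanity check (this is just my own hedged trace, assuming the context sits right after the first slash), the two branches behave like this:
(* "// a comment\nvar x;" -> the second '/' matches, consume_line advances up to
   the '\n', and no token is added for the comment.
   "/ 2"                  -> there is no second '/', so a SLASH token is added. *)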
Right - so we have some default cases that Bob lists. Let's translate those by either returning the context or, in the case of a newline, incrementing the line in our context.
let scan_token context =
let advanced_context = advance context in
match (current_char advanced_context) with
| Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
| Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
| Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
| Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
| Some ',' -> add_non_literal_token COMMA advanced_context
| Some '.' -> add_non_literal_token DOT advanced_context
| Some '-' -> add_non_literal_token MINUS advanced_context
| Some '+' -> add_non_literal_token PLUS advanced_context
| Some ';' -> add_non_literal_token SEMICOLON advanced_context
| Some '*' -> add_non_literal_token STAR advanced_context
| Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
| Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
| Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
| Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
| Some ' ' -> advanced_context
| Some '\r' -> advanced_context
| Some '\t' -> advanced_context
| Some '\n' -> {advanced_context with line = advanced_context.line + 1}
| _ -> failwith "Disallowed character"Not so bad! This is looking more and more readable. How rad is this? Let’s handle literals and then write some tests.
Bob defers all string handling to a function called string. In words, it does the following.
As long as the next character isn't a quote and we're not at the end, just consume. If we see a newline, increment line and keep going. If we reach the end of the file along the way, throw an error. Once we see the closing quote, advance one more time, then grab the string and add a STRING token. Okay! Let's make our own function. We'll call it add_string_literal, and it will take and return a scanner_context.
There's some conditional line-incrementing logic that I'm going to make its own function, just for ease as well.
val increment_line_if_newline: scanner_context -> scanner_context
val add_string_literal: scanner_context -> scanner_context

let increment_line_if_newline context =
match next_char context with
| Some '\n' -> {context with line = context.line+1}
| _ -> context
let rec add_string_literal context =
  if (not (next_char context = Some '"')) && (not (is_at_end context)) then
    context |> increment_line_if_newline |> advance |> add_string_literal
  else if is_at_end context then failwith "Unterminated string"
  else
    (* consume the closing quote, then slice out what sits between the two quotes *)
    let new_context = advance context in
    let str = String.sub new_context.source (new_context.start + 1) (new_context.current - new_context.start - 2) in
    add_literal_token STRING (Some (STRING_LITERAL str)) new_context
It's a bit gnarly and imperative. I'd like to find a way to pattern match out of it. BUT now we can add the following to our scan_token:
| Some '"' -> add_string_literal advanced_contextStarting to get exciting now. Let’s also handle number literals in section 4.6.2. The rules of numbers are simple - if there is a . character, it must have 1 or more digits on both sides. So let’s try and implement this.
We need a function is_digit that takes a char option and returns true or false, and then an add_number_literal function that takes and returns a scanner_context. We'll also create a peek_next along the way that takes a scanner_context and returns a char option.
To make things easy, we should write a function that simply captures digits. Let's call it capture_digits. It'll just abstract out the Java code of while (isDigit(peek())) advance();. We can also write a capture_decimal function that handles the fractional part and defers to capture_digits.
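Before wiring anything into the scanner, here's a small self-contained sketch of that rule over a plain string and an index (lex_number_end is a hypothetical helper, not part of our scanner), just to pin down where a number literal ends:
(* Consume digits, then a '.' only when another digit follows it. *)
let lex_number_end (source : string) (start : int) : int =
  let is_digit_at i =
    i < String.length source && source.[i] >= '0' && source.[i] <= '9'
  in
  let rec digits i = if is_digit_at i then digits (i + 1) else i in
  let i = digits start in
  if i < String.length source && source.[i] = '.' && is_digit_at (i + 1)
  then digits (i + 1)
  else i

(* lex_number_end "12.5;" 0 is 4, so "12.5" is one number literal;
   lex_number_end "12." 0 is 2, so "12" is the number and the '.' is left for a DOT token. *)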
In our interface:
val is_digit: char option -> bool
val capture_digits: scanner_context -> scanner_context
val capture_decimal: scanner_context -> scanner_context
val peek_next: scanner_context -> char option
val add_number_literal: scanner_context -> scanner_context
And back in scanner.ml:
let peek_next context =
try Some (String.get context.source (context.current + 1)) with Invalid_argument _ -> None
let is_digit character = match character with
| Some c when c >= '0' && c <= '9' -> true
| _ -> false
let rec capture_digits context =
if is_digit (next_char context) then context |> advance |> capture_digits else context
let capture_decimal context = match (next_char context) with
| Some '.' when (is_digit (peek_next context)) -> context |> advance |> capture_digits
| _ -> context
let add_number_literal context =
  let new_context = context |> capture_digits |> capture_decimal in
  let stringified_number = String.sub new_context.source new_context.start (new_context.current - new_context.start) in
  let parsed_number = float_of_string stringified_number in
  add_literal_token NUMBER (Some (NUMBER_LITERAL parsed_number)) new_context
Then just add the following clause to scan_token:
| digit when (is_digit digit) -> add_number_literal advanced_context
Spicy. If you go ahead and dune runtest you should get some compiler errors about unused characters but no syntax or interface errors. Woohoo!
Let's finish up identifiers real quick. Once again, skimming ahead in Bob's chapter, we see the isAlpha check, or in English: "if the next lexeme begins with a letter or underscore, capture everything alphanumeric after it and assume it's an identifier."
Our version of is_alpha will take a char option and return a boolean. So will is_alpha_numeric. Our add_identifier_literal will need to take a scanner_context and return one. Let's create a new literal_type:
type literal_type = STRING_LITERAL of string | NUMBER_LITERAL of float | IDENTIFIER_LITERAL of string
This new type will just wrap the identifiers that users create. If a user defines a top-level function, its name becomes an IDENTIFIER token carrying an IDENTIFIER_LITERAL. If the lexeme is a reserved keyword like and, it gets its own token type (AND) with no literal.
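For example, these are the three shapes of values the scanner will end up carrying (illustrative values only):
(* One value of each literal shape. *)
let a_string = STRING_LITERAL "hello"
let a_number = NUMBER_LITERAL 12.5
let an_identifier = IDENTIFIER_LITERAL "average"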
Over to our interface to declare these:
val is_alpha: char option -> bool
val is_alpha_numeric: char option -> bool
val add_identifier_literal: scanner_context -> scanner_context
And our implementation:
let is_alpha character = match character with
| Some c when c >= 'a' && c <='z' -> true
| Some c when c >= 'A' && c <='Z' -> true
| Some c when c = '_' -> true
| _ -> false
let is_alpha_numeric character = (is_alpha character) || (is_digit character)
let rec add_identifier_literal context =
if (is_alpha_numeric (next_char context)) then context |> advance |> add_identifier_literal
else
let substring = (String.sub context.source context.start (context.current - context.start)) in
match substring with
| "and" -> add_non_literal_token AND context
| "class" -> add_non_literal_token CLASS context
| "else" -> add_non_literal_token ELSE context
| "false" -> add_non_literal_token FALSE context
| "for" -> add_non_literal_token FOR context
| "fun" -> add_non_literal_token FUN context
| "if" -> add_non_literal_token IF context
| "nil" -> add_non_literal_token NIL context
| "or" -> add_non_literal_token OR context
| "print" -> add_non_literal_token PRINT context
| "return" -> add_non_literal_token RETURN context
| "super" -> add_non_literal_token SUPER context
| "this" -> add_non_literal_token THIS context
| "true" -> add_non_literal_token TRUE context
| "var" -> add_non_literal_token VAR context
| "while" -> add_non_literal_token WHILE context
| _ -> add_literal_token IDENTIFIER (Some (IDENTIFIER_LITERAL substring)) context
And then we can add the last clause to scan_token:
| alpha when (is_alpha alpha) -> add_identifier_literal advanced_context
And now if you dune runtest one more time, the only issue you should get is the call to Scanner.scan_tokens in olox.ml. Just go ahead and open Scanner at the top to fix that.
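As an aside, that long keyword match could also be written as a lookup in an association list. Here's a sketch of that alternative, assuming the add_non_literal_token and add_literal_token signatures we've been using; the actual code above keeps the explicit match:
(* Map keyword lexemes to their token types. *)
let keywords =
  [ ("and", AND); ("class", CLASS); ("else", ELSE); ("false", FALSE);
    ("for", FOR); ("fun", FUN); ("if", IF); ("nil", NIL); ("or", OR);
    ("print", PRINT); ("return", RETURN); ("super", SUPER); ("this", THIS);
    ("true", TRUE); ("var", VAR); ("while", WHILE) ]

(* Turn a captured lexeme into either a keyword token or an identifier token. *)
let keyword_or_identifier substring context =
  match List.assoc_opt substring keywords with
  | Some token_type -> add_non_literal_token token_type context
  | None -> add_literal_token IDENTIFIER (Some (IDENTIFIER_LITERAL substring)) context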
We now have a working scanner. In the next chapter we move on to the code representation step.