Crafting Interpreters in OCaml - Finishing the Lexer

May 22, 2019

To follow along the source material, check out the Scanning chapter of Crafting Interpreters

Moving on to section 4.5.2 of the book, we have multi-character operators. Cool! Specifically, we have the !=, ==, <= and >= operators. Bob defines a match method that takes some expected character and checks to see if the current character is some character. If we’re at the end, it’s not. If the current character isn’t the expected character, it’s false. Else, we advance and return true. Okay that seems straightforward.

Let’s define a match function. Remember that we don’t have any global variables, so we’re going to have to pass in context as well as the character. Since we want to conditionally mutate current, it makes sense that our function takes and returns a scanner_context object. We also want to maintain that boolean return, so let’s return a tuple.

We also need our match function to look at the character in the next position, so we’ll need a helper function.

In our scanner.mli file we can define the following:

val next_char: scanner_context -> char option

val match_character: char -> scanner_context -> bool * scanner_context

This defines a function that takes a character and a context and returns a 2-tuple of boolean and context types.

Now we can write some code

let next_char context =
  try Some (String.get context.source (context.current)) with Invalid_argument _ -> None

let match_character expected_char context = if (is_at_end context) then (false, context) else
  match next_char context with
    | Some c when c = expected_char -> (true, advance context)
    | _ -> (false, context)

You may note that next_char and current_char look really similar. I don’t really have a good solution for that but that’s okay for now.

Now we can add some new cases to our match statement. We’re going to have to write a lot of cases that look something like

let x = match match_character '=' advanced_context with
  | (true, ctx) -> add_non_literal_token BANG_EQUAL ctx
  | (false, ctx) -> add_non_literal_token BANG ctx

But for all four above two-character operators. Since I’m finally being lazy, let’s write a helper. Our new function needs to take an expected character to match, a token for the true case, a token for the false case, and a context. It will return a new context with the token added. That signature will look like

val add_conditional_non_literal_token: char -> token_type -> token_type -> scanner_context -> scanner_context

Neat. Now we can write that function

let add_conditional_non_literal_token expected_char token_if_true token_if_false context =
    match (match_character expected_char context) with
    | (true, new_context) -> add_non_literal_token token_if_true new_context
    | (false, new_context) -> add_non_literal_token token_if_false new_context

And our scan_token function now has the following new clauses:

let scan_token context =
  let advanced_context = advance context in
  match (current_char advanced_context) with
  | Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
  | Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
  | Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
  | Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
  | Some ',' -> add_non_literal_token COMMA advanced_context
  | Some '.' -> add_non_literal_token DOT advanced_context
  | Some '-' -> add_non_literal_token MINUS advanced_context
  | Some '+' -> add_non_literal_token PLUS advanced_context
  | Some ';' -> add_non_literal_token SEMICOLON advanced_context
  | Some '*' -> add_non_literal_token STAR advanced_context
  | Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
  | Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
  | Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
  | Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
  | _ -> failwith "Disallowed character"

So far we’re doing great! Section 4.6 adds some more lexeme handling. Specifically, comments. Let’s describe in words what Bob’s code does.

If we see a slash, check to see if the next character is also a slash. If it is, just more forward until we either see a newline or reach the end of the file. If the next character ISN’T a slash, then just add a slash token. In our world we don’t have multiline comments (whew).

So if we look a little ahead, we also have this peek function that, if we’re at the end, returns a null character, else the character at the current position. We can use another char option here since we don’t need nulls in this magical ML world.

It turns out we already wrote this function! next_char does the exact same thing. So we’ll just use that. If you want, you can alias it as let peek = next_char but I won’t.

We can use our match_character helper here to create another helper function. We want a function that either adds a slash token, or just moves on. Let’s call that function add_conditional_slash for now. It’ll take context and return context.

val add_conditional_slash: scanner_context -> scanner_context

let add_conditional_slash context =
    let rec consume_line context =
      match (not (next_char context = Some '\n')), not (is_at_end context)) with
      | (true, true) -> consume_line (advance context)
      | _ -> context
    in
    if match_character '/' context then consume_line context else add_non_literal_token SLASH context

Let’s explain this code:

We nest a function to consume the entire line that says “if the next character isn’t a newline AND we’re not at the end of the file, then move forward a character and check again. Once either we see a newline or end of file, return the context object.

Then we simply say if the next character is a slash, consume the line. Else, add a slash token. Maybe later we’ll pull out this nested function but for now I think it’s okay, we have no other use cases for it.

Right - so we have some default cases that Bob lists. Let’s translate those by either returning context, or in the case of a newline, increment the line in our context.

let scan_token context =
    let advanced_context = advance context in
    match (current_char advanced_context) with
    | Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
    | Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
    | Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
    | Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
    | Some ',' -> add_non_literal_token COMMA advanced_context
    | Some '.' -> add_non_literal_token DOT advanced_context
    | Some '-' -> add_non_literal_token MINUS advanced_context
    | Some '+' -> add_non_literal_token PLUS advanced_context
    | Some ';' -> add_non_literal_token SEMICOLON advanced_context
    | Some '*' -> add_non_literal_token STAR advanced_context
    | Some '!' -> add_conditional_non_literal_token '=' BANG_EQUAL BANG advanced_context
    | Some '=' -> add_conditional_non_literal_token '=' EQUAL_EQUAL EQUAL advanced_context
    | Some '>' -> add_conditional_non_literal_token '=' GREATER_EQUAL GREATER advanced_context
    | Some '<' -> add_conditional_non_literal_token '=' LESS_EQUAL LESS advanced_context
    | Some ' ' -> advanced_context
    | Some '\r' -> advanced_context
    | Some '\t' -> advanced_context
    | Some '\n' -> {advanced_context with line = advanced_context.line + 1}
    | _ -> failwith "Disallowed character"

Not so bad! This is looking more and more readable. How rad is this? Let’s handle literals and then write some tests.

Bob defers all string handling to a function called string. In words, it does the following.

As long as the next character isn’t a quote, and as long as we’re not at the end just consume. If we see a newline, increment line and keep going. If we see the end of the file along the way, throw an error. Once we see the closing quote, advance one more, then grab the string and add a String token. Okay! Let’s make our own function. We’ll call it add_string_literal which will take and output a scanner_context.

There’s some conditonal line incrementing logic that I’m going to make it’s own function just for ease as well.

val increment_line_if_newline: scanner_context -> scanner_context

val add_string_literal: scanner_context -> scanner_context

 let increment_line_if_newline context =
    match next_char context with
    | Some '\n' -> {context with line = context.line+1}
    | _ -> context

  let rec add_string_literal context =
     if( not ( next_char context = Some('"')) && (not (is_at_end context)) ) then
      context |> increment_line_if_newline |> advance |> add_string_literal
    else if (is_at_end context) then failwith "Unterminated string" else
    let str = (String.sub context.source (context.start + 1) (context.current - 1)) in
    add_literal_token STRING (Some (STRING_LITERAL str)) context

It’s a bit gnarly and imperative. I’d like to find a way to pattern match out. BUT now we can add the following to our scan_tokens

| Some '"' -> add_string_literal advanced_context

Starting to get exciting now. Let’s also handle number literals in section 4.6.2. The rules of numbers are simple - if there is a . character, it must have 1 or more digits on both sides. So let’s try and implement this.

We need a function is_digit that takes a char option and returns true or false, and then a add_number_literal function that takes and returns a scanner_context. We’ll also create a peek_next along the way that takes a scanner_context and returns char option.

To make things easy, we should write a function that simply captures digits. Let’s call it capture_digits. It’ll just abstract out the Java code of while (isDigit(peek())) advance();. We can also write a function to defer to above with capture_decimal

In our interface:

val is_digit: char option -> bool

val capture_digits: scanner_context -> scanner_context

val capture_decimal: scanner_context -> scanner_context

val peek_next: scanner_context -> char option

val add_number_literal: scanner_context -> scanner_context

And back in scanner.ml

let peek_next context =
    try Some (String.get context.source (context.current + 1)) with Invalid_argument _ -> None

  let is_digit character = match character with
    | Some c when c >= '0' && c <= '9' -> true
    | _ -> false

  let rec capture_digits context =
    if is_digit (next_char context) then context |> advance |> capture_digits else context

  let capture_decimal context = match (next_char context) with
    | Some '.' when (is_digit peek_next context) -> context |> advance |> capture_digits
    | _ -> context

  let add_number_literal context =
    let new_context = context |> capture_digits |> capture_decimal in
    let stringified_number = (String.sub new_context.source new_context.start (new_context.current - new_context.start)
    let parsed_number = int_of_string stringified_number in
    add_literal_token NUMBER NUMBER_LITERAL(stringified_number) new_context

Then just add in scan_tokens the following clause

| digit when (is_digit digit) -> add_number_literal advanced_context

Spicy. If you go ahead and dune runtest you should get some compiler errors about unused characters but no syntax or interface errors. Woohoo!

Let’s finish up identifiers real quick. Once again, skimming ahead in Bob’s chapter we see the claim if isAlpha or in English, “if the next lexeme begins with a letter or underscore, capture everything that’s alphanumeric after and assume it’s an identifier.

Our version of is_alpha will take a char option and return boolean. So will is_alpha_numeric. Our add_identifier_literal will need to take a scanner context and return one. Let’s create a new literal_type

type literal_type = STRING_LITERAL of string | NUMBER_LITERAL of float | IDENTIFIER_LITERAL of string

This new type will just wrap our identifiers that users create. If a user defines a top levek function, it’s an IDENTIFIER_LITERAL. If it’s a usual keyword like AND, it’s just an IDENTIFIER.

To our interface to define implementations.

 val is_alpha: char option -> bool

val is_alpha_numeric: char option -> bool

val add_identifier_literal: scanner_context -> scanner_context

And our implementation

let is_alpha character = match character with
  | Some c when c >= 'a' && c <='z' -> true
  | Some c when c >= 'A' && c <='Z' -> true
  | Some c when c = '_' -> true
  | _ -> false

  let is_alpha_numeric character = (is_alpha character) || (is_digit character)

  let rec add_identifier_literal context =
    if (is_alpha_numeric (next_char context)) then context |> advance |> add_identifier_literal
    else
    let substring = (String.sub context.source context.start (context.start - context.current)) in
    match substring with
    | "and" -> add_non_literal_token AND context
    | "class" -> add_non_literal_token CLASS context
    | "else" -> add_non_literal_token ELSE context
    | "false" -> add_non_literal_token FALSE context
    | "for" -> add_non_literal_token FOR context
    | "fun" -> add_non_literal_token FUN context
    | "if" -> add_non_literal_token IF context
    | "nil" -> add_non_literal_token NIL context
    | "or" -> add_non_literal_token OR context
    | "print" -> add_non_literal_token PRINT context
    | "return" -> add_non_literal_token RETURN context
    | "super" -> add_non_literal_token SUPER context
    | "this" -> add_non_literal_token THIS context
    | "true" -> add_non_literal_token TRUE context
    | "var" -> add_non_literal_token VAR context
    | "while" -> add_non_literal_token WHILE context
    | _ -> add_literal_token IDENTIFIER (Some (IDENTIFIER_LITERAL substring)) context

and then we can add in our scan_tokens clause the last piece.

| alpha when (is_alpha alpha) -> add_identifier_literal advanced_context

And now if you dune runtest one more time the only issue you should get is the call to Scanner.scan_tokens in olox.ml. Just go ahead and open Scanner at the top to fix that.

We now have a working scanner. In the next chapter move onto the code representation step.

I work and live in Brooklyn, NY building software.