Crafting Interpreters in OCaml - Building a Lexer

May 16, 2019

*To follow along the source material, check out the Scanning chapter of Crafting Interpreters

For ease, I’ll recommend you clone down this starter repo. We’ll be building a scanner for a Lox implementation in OCaml, which we’ll call Olox. Let’s create a file called olox.ml. This is going to be our entry point for the application, similar to Lox.java in the source material.

Bob’s Java code handles the following three cases:

If more than one argument is passed in, output a line telling the user the proper usage and exit.
If one argument is passed in, run a file with that name.
Else, just open a prompt.

Okay, let’s start by writing out the following code in our olox.ml

  let main = fun _ -> match Array.length Sys.argv with
    | 0 -> run_prompt ()
    | 1 -> run_file (Array.get Sys.argv 0)
    | _ -> print_endline "Usage: olox [script]"; exit 64;

We’re pattern matching on the length of the arguments vector passed in. Sys.argv takes an array of strings that we can use. If no arguments are passed in, we run a prompt. If one is passed in, we run a file with that first element. Else, we’ll print an error and exit. We can’t really run this code, since we have two undefined functions in run_prompt and run_file but we’re just following along.

So now need to define those two functions. Bob’s implementation defers run_file to a function called run by reading the bytes out of a file. run_prompt also just takes a line from input and calls run. So we now need to define those three functions.

Let’s add the following in our module above our main function

  let run source =
    let tokens = Scanner.scan_tokens in
    List.iter (fun token -> print_endline token) tokens

  let run_prompt = fun _ ->
    while true do
      print_string "> ";
      let input = read_line ()
      in run input;
    done

  let run_file filename =
    let channel = open_in filename in
    try
      (* read entire file *)
      let line = really_input_string channel (in_channel_length channel) in
      run line;
      flush stdout;
      close_in channel
    with e ->
      close_in_noerr channel;
      raise e

Like Bob’s Java code, we have a run function that just gets tokens from a scanner - in his case, an instance of a Scanner class. We’ll just use a to-be-defined module. Then we’ll just iterate over the returned tokens and print them out.

Our run_prompt function will print a string as a prompt, then read in the next line and pass it to our run function.

Our run_file function takes some filename and creates an input channel, positioned at the beginning of the file. It then gets the entirety of the file by reading in_channel_length channel number of characters, then passes into run. For clarity, see the following type signatures.

val really_input_string : in_channel -> int -> string val in_channel_length : in_channel -> int

really_input_string takes a channel and an int, and returns a string. in_channel_length returns the length of a channel.

So far so good! If you were to copy and paste this into utop, you’d error on the Scanner module. That’s fine - feel free to replace run with something like let run = print_endline and test all those functions. In fact, let’s try it right now. Create a file called test.md and add the following code.

Hello, world!
This is the second line!
Now on line 3 with spec!@l characters

In your terminal, run utop and copy and paste the following code which replaces run with a simple print.

let run = print_endline

let run_prompt = fun _ ->
  while true do
    print_string "> ";
    let input = read_line ()
    in run input;
  done

let run_file filename =
  let channel = open_in filename in
  try
    (* read entire file *)
    let line = really_input_string channel (in_channel_length channel) in
    run line;
    flush stdout;
    close_in channel
  with e ->
    close_in_noerr channel;
    raise e

Try running run_file "test.md";; and then run_prompt ();; Anything in your prompt should be returned back. Neat! Now ctrl+D to exit.

Next up is error handling! Bob makes the following point and I think it’s very valid

The other reason I pulled the error reporting out here instead of stuffing it into the scanner and other phases where the error occurs is to remind you that it’s a good engineering practice to separate the code that generates the errors from the code that reports them.

However, in the interest of not having a global mutable variable, and keeping some of my logic contained in the scanner (for now), I’m going to go ahead and handle all error management in the scanner. This way we can maintain all our scanner logic (line number, character positions) safely isolated and still report on it.

Let’s jump right into it. Create a file called scanner.ml. Bob’s examples creates classes for Token and TokenType. In our FP world, those can just be types that our scanner knows about so we’ll wrap it all up inside one file. If we need to refactor later we can.

Looking at Token.java in the source, we see a giant enum. When I see enum, I think variant types - read up here if you need. Let’s take the enum and create a variant type in our scanner.

module Scanner = struct
  type token_type =
    (* single-character tokens *)
    | LEFT_PAREN
    | RIGHT_PAREN
    | LEFT_BRACE
    | RIGHT_BRACE
    | COMMA
    | DOT
    | MINUS
    | PLUS
    | SEMICOLON
    | SLASH
    | STAR

    (* One or two character tokens *)
    | BANG
    | BANG_EQUAL
    | EQUAL
    | EQUAL_EQUAL
    | GREATER
    | GREATER_EQUAL
    | LESS
    | LESS_EQUAL

    (* Literals *)
    | IDENTIFIER
    | STRING
    | NUMBER
    (* Keywords *)
    | AND
    | CLASS
    | ELSE
    | FALSE
    | FUN
    | FOR
    | IF
    | NIL
    | OR
    | PRINT
    | RETURN
    | SUPER
    | THIS
    | TRUE
    | VAR
    | WHILE
    | EOF
end

Just to quote section 4.2.2,

There are lexemes for literal values—numbers and strings and the like. Since the scanner has to walk each character in the literal to correctly identify it, it can also convert it to the real runtime value that will be used by the interpreter later.

Before we move on, we’ll need to have some way of wrapping literals. Our token is going to want to have a reference of token type AND optional literals - e.g. a token could be of type TRUE with no literal, or of type NUMBER with a literal 2. So, let’s start by just making two types for literals - string and number

  type literal_type = STRING_LITERAL of string | NUMBER_LITERAL of float

This doesn’t solve all our problems, but at least we now have some idea of what a literal can be. Moving on to make our token, we should define a type that represents a token. Bob’s Token class has a type of TokenType, a lexeme of String, a literal of some Object, and a line of Int. We can do something similar with

type token = {
  lexeme: string;
  literal: literal_type option;
  line: int;
  token_type: token_type;
}

Nikhil why is that literal an option? Well, as we had mentioned before - a token does not need to contain a literal. There is no literal for the token_type VAR, so we may not need a literal in our token. Options represent that. Real World OCaml’s chapter on options is a great primer for more info.

Lets take a moment to look at the initial Scanner code in section 4.4. The file defined a class Scanner that has a field source, a string that represents the code we’re scanning. scanTokens takes a list of tokens, checks to see if we’re at the end of the source and if not, scans tokens. Once at the end, we add a token representing the end of the file and return the list of tokens. We also have the use of a helper function that determines if we’re at the end of the source code.

We see that the original Java code has some fields for use. Let’s embrace our FP-world and rather than create a class, we’re going to create a type that represents some record of fields. We can then ensure all our functions take and emit that type.

type scanner_context = {
  source: string;
  start: int;
  current: int;
  line: int;
}

Now that we have a type that contains all the fields we know about, let’s define our methods. To start, let’s create an interface file. Create a file called scanner.mli with the following

module Scanner : sig
  type token_type

  type literal_type

  type token

  type scanner_context

  val scan_tokens: string -> token list

  val is_at_end: scanner_context -> bool
end

In our interface, I’ll be using abstract types. type token and type scanner_context now hide the implementation of the type from any module that implements this signature. You can read more about them on the OCaml website and this Cornell class’s notes.

Our signatures implement what Bob’s Java code does - is_at_end returns true or false for the state of the application. scan_tokens will go through the source code and return all the tokens.

Let’s prove that our interface actually does something! Open dune and replace with the following code`

(executable
  (name olox)
  (libraries oUnit)
)

(alias
  (name    runtest)
  (deps    (:x olox.exe))
  (action  (run %{x})))

Now just run dune build and dune runtest. You should see the following errors.

File "scanner.ml", line 1:
Error: The implementation scanner.ml
       does not match the interface .olox.eobjs/byte/scanner.cmi:
       ...
       In module Scanner:
       The value `is_at_end' is required but not provided
       File "scanner.mli", line 56, characters 2-40: Expected declaration
       In module Scanner:
       The value `scan_tokens' is required but not provided
       File "scanner.mli", line 54, characters 2-48: Expected declaration

There’s a bit more above it but let’s come back to that later. Our interface works! We stated that the module must have two specific functions but did not implement them. Now to do that.

We’ll first write is_at_end. The function takes our wrapping context and returns a boolean. I wrote

let is_at_end ctx = ctx.current >= (String.length ctx.source)

Then we’ll have our scan_tokens function. This function takes just the source code string as an input, and will return a list of tokens. Somewhere in the middle we’ll need to wrap up the rest of our context. OCaml has local let bindings we can use for this. We also now realize we need to maintain a running list of tokens that we’re adding. Let’s add tokens to our scanner_context. In the definition of type scanner_context just add tokens: token list; inside.

Looking at Bob’s implementation, we have a few more pieces of information. We have a scanToken function that does…something. We can at least assume it changes the value of current globally due to the reset inside the loop. We also have some addition of an EOF token. Let’s sketch out what a scan_token function might look like. Given that it’ll change two values, current and our token list, let’s have it receive and output a scanner_context. In our .mli file add

  val scan_token: scanner_context -> scanner_context

Bob’s token addition basically just instantiates a token object and adds it to the global list. Let’s do the same, by writing a function add_token that takes all the fields of a token as well as context and returns the new context.

  val add_token: token_type -> string -> literal_type option -> int -> scanner_context -> scanner_context

Okay! We have some loosely defined code interfaces. Let’s implement. In scanner.ml we’ll do the eaier of the two first.

let add_token token_type lexeme literal line ctx =
  {ctx with tokens = ctx.tokens @ [{
      token_type;
      lexeme;
      literal;
      line
    }]
  }

There’s a performance issue here where OCaml’s list appending operator (either @ or List.append) runs in O(n) time by walking through the length of the first list. It’s not ideal but that’s what we get for using a linked list. For now, we’ll ignore this and come back to it when it’s a performance issue.

Our next function will be our scanning. Look at Bob’s implementation. In words - he gets the next character, checks to see what it is, then adds a token. I’m going to break this into three functions - one function to just increment the current counter, one to get the next character, and our add token above.

We can add the following two interfaces

  val advance: scanner_context -> scanner_context

  val current_char: scanner_context -> char option

And then define our functions

  let advance context = {context with current = context.current + 1}

  let current_char context =
    try Some (String.get context.source (context.current -1)) with Invalid_argument _ -> None

I’m using an option for the current_char because it’s possible to get an out of bounds exception on string access, and it’s better to just be safe than sorry.

We also have sone unary function addToken that Bob defines, that just defers to a binary function. I’m going to define these two functions as either adding a literal, or a nonliteral. That means we’ll need a function add_literal_token and add_non_literal_token - the former takes both a token_type and Some literal and the latter can defer to the former with a None. We’re going to again have context passed along as the last argument in order to make pipelining our functions easy.

val add_literal_token: token_type -> literal_type option -> scanner_context -> scanner_context

val add_non_literal_token: token_type -> scanner_context -> scanner_context

Great. Now to just write those little helpers…

let add_literal_token token_type literal_type context =
  let text = (String.sub context.source context.start (context.current - context.start)) in
  add_token token_type text literal_type context.line context

let add_non_literal_token token_type context = add_literal_token token_type None context

We create a substring from the start position and capture the difference between start and current, then add the token. String.sub in OCaml takes a start and length, rather than a start and end so we need to do a little math.

Now that we have these, we can write out our scan_token function and finally write scan_tokens. Again, since we see that scan_token in Bob’s example cares about two fields - the current character, and the token list, let’s pass in our context and get out context.

val scan_token: scanner_context -> scanner_context

let scan_token context =
  let advanced_context = advance context in
  match (current_char advanced_context) with
  | Some '(' -> add_non_literal_token LEFT_PAREN advanced_context
  | Some ')' -> add_non_literal_token RIGHT_PAREN advanced_context
  | Some '{' -> add_non_literal_token LEFT_BRACE advanced_context
  | Some '}' -> add_non_literal_token RIGHT_BRACE advanced_context
  | Some ',' -> add_non_literal_token COMMA advanced_context
  | Some '.' -> add_non_literal_token DOT advanced_context
  | Some '-' -> add_non_literal_token MINUS advanced_context
  | Some '+' -> add_non_literal_token PLUS advanced_context
  | Some ';' -> add_non_literal_token SEMICOLON advanced_context
  | Some '*' -> add_non_literal_token STAR advanced_context
  | _ -> failwith "Disallowed character"

WHEW. That’s a whole lot of work but we did it. We now have a way to scan tokens from a string. Now for the fun part.

In our next post let’s try to write more matching cases. And if we’re lucky, some tests.

I work and live in Brooklyn, NY building software.