Honing my craft

Writing an Interpreter in OCaml - Lexical Analysis

August 19, 2019


Code is on my GitHub


Background

I’ve decided to take another stab at this interpreter. I’ll be using Thorsten Ball’s book Writing an Interpreter in Go as my primary source, with extra reading material from Bob Nystrom’s Crafting Interpreters. I find that Thorsten’s book gets me productive faster, while Bob’s material is deeper and helps clarify and illuminate what I don’t understand.

Thorsten’s book defines a language called Monkey that’s similar in syntax to JavaScript. Curly braces, integer math, no whitespace sensitivity, recursion, the whole nine yards.

For this project I’ll be using the OCaml language. OCaml sits at an intersection of academia and large production-scale software; it’s used at Jane Street, Docker, Bloomberg, Facebook, and other companies. I plan on redoing this project in Haskell eventually - while writing this chapter I missed some of Haskell’s syntax and conventions. I find that inline function composition and application read better with Haskell’s dot notation and $ application than with pipelines:

-- Haskell
next_char . read_char $ lexer
(* OCaml *)
lexer |> read_char |> next_char

Requirements

  • Dune 1.11.x
  • opam 2.0.3
  • OCaml 4.07.1
  • Alcotest 0.8.5 (testing framework)
  • Fmt 0.8.8 (pretty printer for testing)

Use opam to set up OCaml and install those dependencies. I chose a newer version of Dune, and I went with Alcotest as it seemed relatively popular on the OCaml Discord. It’s also quite readable and pleasant to configure.
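If you’re starting from scratch, the setup looks roughly like this (versions per the list above; yours may vary):

opam switch create 4.07.1
eval $(opam env)
opam install dune alcotest fmt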

Project structure

Take a look at this commit for some files you can copy.

ROOT
  - dune-project
  - monkey.install
  - monkey.opam
  - src/
    - dune
  - test/
    - dune

The dune files will let you link and build your code and run the tests. We’ll update these as we move along, but for now this will get you started.
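For reference, here’s a minimal sketch of what the two dune files might contain - the commit linked above has the real versions:

; src/dune
(library
 (name monkey)
 (libraries fmt))

; test/dune
(test
 (name test)
 (libraries monkey alcotest))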

Defining our first tokens

The first step for Thorsten’s interpreter is defining a subset of our tokens. In lexical analysis, a token is a simple data structure that represents a small piece of syntax in our programming language. A token could be a SEMICOLON, a PLUS_SIGN, an IDENTIFIER(“x”), an INTEGER(5), or some other representation.
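For example, a Monkey statement like let x = 5 + 5; would break down into a stream of tokens roughly like this:

LET, IDENT("x"), ASSIGN, INT(5), PLUS, INT(5), SEMICOLON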

We’re going to define our tokens in src/token.ml. Create that file and let’s add some code. We’ll define a module and add our basic tokens.

module Token = struct
  type token_type =
    | ILLEGAL
    | EOF
    (* Identifiers and literals *)
    | IDENT of string
    | INT of int
    (* Operators *)
    | ASSIGN
    | PLUS
    (* Delimiters *)
    | COMMA
    | SEMICOLON
    | LPAREN
    | RPAREN
    | LBRACE
    | RBRACE
    (* Keywords *)
    | FUNCTION
    | LET
end

We’re also going to add a token_to_string function so we can print out some tokens for our tests, as well as keep our compiler happy when it inevitably barks at us for unused code.

let token_to_string = function
  | ILLEGAL -> "ILLEGAL"
  | EOF -> "EOF"
  | IDENT a -> "IDENT " ^ a
  | INT a -> "INT " ^ string_of_int a
  | ASSIGN -> "ASSIGN"
  | PLUS -> "PLUS"
  | COMMA -> "COMMA"
  | SEMICOLON -> "SEMICOLON"
  | LPAREN -> "LPAREN"
  | RPAREN -> "RPAREN"
  | LBRACE -> "LBRACE"
  | RBRACE -> "RBRACE"
  | FUNCTION -> "FUNCTION"
  | LET -> "LET"

Unlike Thorsten’s Go implementation, we don’t need to keep a list of constants. Rather, we can use OCaml’s algebraic data types to represent tokens. In fact, OCaml is commonly used in compilers and similar dev tools for exactly this reason.

Thorsten then moves on to creating some tests in order to build his NextToken function. We’re going to add similar tests.

We’ll need to set up a small module export. Create src/monkey.ml and add the following code.

module Token = Token
module Lexer = Lexer

Create a file test/test.ml and let’s add a bit of boilerplate.

open Monkey
include Lexer
include Token

let token_testable = Alcotest.testable Token.pretty_print (=)

let test_lexer_delimiters () =
  Alcotest.(check (list token_testable))
    "same token types" [
        Token.ASSIGN
      ; Token.PLUS
      ; Token.LPAREN
      ; Token.RPAREN
      ; Token.LBRACE
      ; Token.RBRACE
      ; Token.COMMA
      ; Token.SEMICOLON
      ; Token.EOF
      ]
      (Lexer.generate_tokens "=+(){},;")

let () =
  Alcotest.run "Lexer"
    [
      ( "list-delimiters",
        [ Alcotest.test_case "first case" `Slow test_lexer_delimiters ] );
    ]

See that Token.pretty_print function? We’ll need to define a formatter that prints a representation of our Token. Let’s go back into src/token.ml.

let pretty_print ppf tok = Fmt.pf ppf "Token %s" (token_to_string tok)
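
For reference, here’s roughly the signature we’re satisfying, as declared in Alcotest’s interface:

(* from Alcotest's interface, approximately *)
val testable : 'a Fmt.t -> ('a -> 'a -> bool) -> 'a testable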

We can look at Alcotest’s source code to dig into what this code does. Effectively, for some inferred type a, testable takes a function that formats an a and an equality-checking function that takes two a values and returns a boolean. You can look at the implementations of int32 or string to get a hint of how that works. Our token_testable is a testable value for Token.token_type, our defined set of token tags. If we were to run dune runtest we’d get an error, because we haven’t yet implemented our lexer. So let’s do that.

Setting up the Lexer

Let’s create src/lexer.ml.

(* lexer *)

module Lexer = struct
  include Token

  type lexer = {
    input : string;
    position : int;
    read_position : int;
    ch : char;
  }

  let null_byte = '\x00'

  let new_lexer input_string =
    {
      input = input_string;
      position = 0;
      read_position = 0;
      ch = null_byte;
    }
end

I chose to use a null_byte instead of an OCaml option type because, in my opinion, an option represents a potential gap, whereas we know we won’t have gaps in our source code. We’ll simply be reading the string until we run out of string. The end of our source code isn’t undefined, but a known value.
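For comparison, here’s a sketch of what the record might look like with an option type instead - every consumer would then have to match on Some/None rather than on a character:

(* hypothetical alternative: model end-of-input with an option *)
type lexer = {
  input : string;
  position : int;
  read_position : int;
  ch : char option; (* None once we read past the end of the input *)
}

Matching directly on a sentinel character keeps the pattern matches in the lexer flat.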

We need to define our read_char function. In a lexer, reading a character is simply incrementing our pointer in our lexer and looking at the next character in the input string.

let read_char lexer =
  let read_to_end = lexer.read_position >= String.length lexer.input in
  let new_ch =
    if read_to_end then null_byte
    else String.get lexer.input lexer.read_position
  in
  { lexer with
    position = lexer.read_position;
    read_position = lexer.read_position + 1;
    ch = new_ch }

OCaml is immutable by default, so we can’t just adjust our lexer in place. Or rather, we can, but why do that when we can avoid mutable state? We look to see if we’re at the end of the string: if we are, the character is a null byte, else we get the next character in the string and update our lexer. We can also use read_char to initialize our lexer. We’ll update new_lexer to look like this:

let new_lexer input_string =
  let lexer = {
    input = input_string;
    position = 0;
    read_position = 0;
    ch = null_byte;
  } in
  read_char lexer
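
As a quick sanity check (say, in utop), initializing a lexer now primes the first character:

let lexer = new_lexer "=+"
(* => { input = "=+"; position = 0; read_position = 1; ch = '=' } *)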

Awesome! Our test code is still calling Lexer.generate_tokens, so we’ll keep going: first a next_char function that emits a token for the current character.

let next_char lexer = match lexer.ch with
  | '=' -> (read_char lexer, Token.ASSIGN)
  | ';' -> (read_char lexer, Token.SEMICOLON)
  | '(' -> (read_char lexer, Token.LPAREN)
  | ')' -> (read_char lexer, Token.RPAREN)
  | ',' -> (read_char lexer, Token.COMMA)
  | '+' -> (read_char lexer, Token.PLUS)
  | '{' -> (read_char lexer, Token.LBRACE)
  | '}' -> (read_char lexer, Token.RBRACE)
  | '\x00' -> (lexer, Token.EOF)
  | _ -> failwith "unmatched character"

let generate_tokens input_string =
  let lexer = new_lexer input_string in
  let rec gen lxr tokens =
    match next_char lxr with
    | (_, Token.EOF) -> List.rev_append tokens [Token.EOF]
    | (l, tok) -> gen l (tok :: tokens)
  in gen lexer []

This code matches on our known inputs and returns a tuple of an updated lexer and the token that was just read. This tuple lets generate_tokens collect tokens and keep calling next_char until the input ends.

generate_tokens defines a recursive function gen that calls next_char on a lexer. For any ordinary token, it adds the token to the list of seen tokens and calls gen again on the updated lexer. Once it finds an EOF token, it rev_appends the collected tokens onto the EOF.

Side note: rev_append reverses the first list and concatenates it onto the second. Because we’re building the list with the cons operator in the non-EOF case, we’re actually constructing it in reverse order, so we need to flip it at the end. The reason is that cons is constant time, while append is linear in the length of the first list (it has to traverse the entire first list before adding the elements of the second). So it’s cheaper to reverse once at the end (linear time, once) than to append for every token (linear time per token).
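Concretely, here’s a rough trace for the input "=;" (lexer values abbreviated as l, l', and so on):

gen l []                      (* l.ch = '=' *)
gen l' [ASSIGN]               (* l'.ch = ';' *)
gen l'' [SEMICOLON; ASSIGN]   (* l''.ch = '\x00', so next_char returns EOF *)
List.rev_append [SEMICOLON; ASSIGN] [EOF]  (* = [ASSIGN; SEMICOLON; EOF] *)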

If we run dune runtest we get passing tests!

Up next

The next chapter will take care of adding tokens for real source code.


Nikhil Thomas

I work and live in Brooklyn, NY building software.