Make a Compiler in Swift with LLVM | Part 1

This is the first part in a series of articles on creating a simple compiler in Swift using LLVM, the same backend Swift itself is built on. We will be using it to compile our simple language via the SwiftLLVM package, which provides Swift bindings to LLVM.

The Language

The language we will be building is called "Slang", or "Stupid Language". Here's how the language looks:

// This is a comment
/* This is a multiline
   comment */
func main() {
    let message = 5
    print(message)
}

It's a very stupid language, but it has familiar syntax. Its features include functions declared with func, variables declared with let, numbers, a built-in print function, and both line and multiline comments.

The code above would compile to the following LLVM IR:

; ModuleID = 'main'
source_filename = "main"

@.floatFormat = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1

define void @main() {
entry:
    %0 = alloca double
    store double 5.000000e+00, double* %0
    %1 = load double, double* %0
    call void @print(double %1)
    ret void
}

declare i32 @printf(i8*, ...)

define void @print(double) {
entry:
    %1 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.floatFormat, i32 0, i32 0), double %0)
    ret void
}

If that doesn't make any sense yet, don't worry! We'll cover IR in more depth in a later part.

Making the Lexer

The purpose of the lexer is to take the source code and convert it into a stream of tokens. Then we can try to assemble these tokens in a meaningful way.

We'll start by creating a Token struct. Every token will have a way to identify what kind of token it is, so we'll create an enum to represent those different kinds.

struct Token {
    /// The actual text of the token.
    let text: String
    /// The line the token appeared on. Used for error logging.
    let line: Int
    /// The kind of token this is.
    let kind: Kind
    
    // We can define the entire language with 10 kinds of `Token`
    enum Kind {
        case keyword,
             identifier,
             number,
             lParen,
             rParen,
             lBrace,
             rBrace,
             comma,
             equals,
             `operator`

        // Oh, and EOF
        case eof
    }
}

When we're parsing the content of the source file, we'll likely hit whitespace and comments. We'll call these Trivia and attach them to the token, so we can always find out what whitespace and comments came before it:

struct Token {
    ...
    /// Trivia like whitespace and comments.
    let trivia: [Trivia]
    enum Trivia {
        case space
        case newline
        case tab
        case comment(String)
        case multilineComment(String)
    }
}

I'll also add an initializer to Token so we don't have to pass any Trivia when there is none:

struct Token {
    ...
    init(kind: Kind, text: String, line: Int, trivia: [Trivia] = []) {
        self.kind = kind
        self.text = text
        self.trivia = trivia
        self.line = line
    }
}
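
To make this concrete, here's roughly what we want the lexer to eventually produce for the line let message = 5 in the sample program (it sits on line 5 there; the exact values are just an illustration):

// A sketch of the tokens we want for `let message = 5`.
// Trivia is omitted here; in the real file the leading whitespace
// would be attached to the `let` token as trivia.
let tokens = [
    Token(kind: .keyword,    text: "let",     line: 5),
    Token(kind: .identifier, text: "message", line: 5),
    Token(kind: .equals,     text: "=",       line: 5),
    Token(kind: .number,     text: "5",       line: 5),
]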

Now that we have a way to represent the tokens, we can start to figure out how to identify them.

When going through the source code, we should start by consuming trivia, then identify the token that comes after it. Both trivia and tokens are easy to describe with regular expressions, and that's a handy way to see what each kind should match, as the sketch below shows, but we won't actually be using regex in our code; everything will be matched by hand.
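
Here's a rough sketch of those patterns. The exact expressions are my own illustration; they simply mirror what the hand-written lexer below accepts:

// Trivia (matched separately from tokens):
//     line comment:      //[^\n]*
//     multiline comment: /\*.*?\*/
//
// Token kinds, described as illustrative regular expressions:
let patterns: [(Token.Kind, String)] = [
    (.keyword,    #"let|func|repeat|return"#),
    (.identifier, #"[A-Za-z][A-Za-z0-9]*"#),
    (.number,     #"[0-9][0-9.]*"#),
    (.equals,     #"="#),
    (.operator,   #"[+\-]"#),
    (.comma,      #","#),
    (.lParen,     #"\("#),
    (.rParen,     #"\)"#),
    (.lBrace,     #"\{"#),
    (.rBrace,     #"\}"#),
]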

When reading the source code, we need a way to go character by character through the String. For this we can use String.Index.
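
If you haven't used String.Index before, here's a minimal sketch of stepping through a String one Character at a time; our lexer does the same thing, just through a small offset helper:

// Step through a String one Character at a time using String.Index.
let text = "let x = 5"
var cursor = text.startIndex
while cursor < text.endIndex {
    print(text[cursor])
    cursor = text.index(after: cursor)
}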

Let's start the lexer:

class Lexer {
    let source: String
    var index: String.Index
    var nextToken: Token?
    
    /// Whether we've reached the end of the source (the last character or beyond).
    var atEOF: Bool {
        index >= source.index(before: source.endIndex)
    }
    
    init(_ source: String) {
        self.source = source
        self.index = source.startIndex
        self.nextToken = try? lex()
    }
}

The Lexer takes in the source code as a String and keeps track of the String.Index it's currently at. We will implement a lex function that builds the Tokens. But first, we need a way to handle errors. For that we can use Swift's built-in Error protocol and throwing functions. Let's define the LexError:

enum LexError: Error, LocalizedError {
    case unexpectedCharacter(Character, line: Int)
    
    var errorDescription: String? {
        switch self {
        case let .unexpectedCharacter(char, line: line):
            return "Unexpected character '\(char)', line \(line)"
        }
    }
}
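
Note that LocalizedError lives in Foundation, so the file defining LexError needs an import Foundation at the top. Here's a quick illustration of how one of these errors surfaces:

do {
    throw LexError.unexpectedCharacter("$", line: 3)
} catch {
    print(error.localizedDescription)
    // Prints: Unexpected character '$', line 3
}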

We really only have one kind of error we could throw. However, if you extend the language later, you can now add any number of possible errors to this enum.

Before we can start lexing, we need a way to move the index around. We'll add a few helper methods to the Lexer for this:

/// Get the index offset by a specified amount
func offsetIndex(by offset: Int) -> String.Index {
    source.index(index, offsetBy: offset)
}

/// Increment the index
func advance(_ offset: Int = 1) {
    index = offsetIndex(by: offset)
}

/// Get the next `String.Index` without incrementing
func next() -> String.Index {
    offsetIndex(by: 1)
}

/// Advance until we hit the specified `stop` string, returning everything read.
/// The `stop` string itself is not consumed; the index is left at its start.
func readTo(_ stop: String) -> String {
    var buffer = ""
    while source[index...offsetIndex(by: stop.count - 1)] != stop {
        buffer += String(source[index])
        advance()
    }
    return buffer
}

Now we can start lexing. Let's start with Trivia:

func lexTrivia() -> [Token.Trivia] {
    var trivia = [Token.Trivia]()
    while !atEOF {
        switch source[index] {
        // Basic trivia:
        case " ":
            trivia.append(.space)
        case "\t":
            trivia.append(.tab)
        case "\n":
            trivia.append(.newline)
        // Possibly a comment:
        case "/":
            switch source[next()] {
            // This is definitely a line comment
            case "/":
                advance(2)
                // Read to the end of the line:
                trivia.append(.comment(readTo("\n")))
            // This is definitely a multiline comment
            case "*":
                advance(2)
                // Read to the closing `*/`
                trivia.append(.multilineComment(readTo("*/")))
                advance()
            // It wasn't a comment after all...
            default:
                return trivia
            }
        default:
            return trivia
        }
        advance()
    }
    return trivia
}
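
To make the behaviour concrete, here's a rough trace of what lexTrivia() produces for one line of input (my own illustration, following the code above):

// Suppose the lexer's index is at the start of this text:
//
//     "  // add five\n    let message = 5"
//
// Then lexTrivia() returns:
//
//     [.space, .space, .comment(" add five"), .space, .space, .space, .space]
//
// and leaves the index on the "l" of "let". Note that the newline ending
// the line comment is skipped over but not recorded as trivia.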

Here we're able to identify all of the Trivia we may encounter. Now we can start implementing our lex function:

@discardableResult
func lex() throws -> Token {
    // Check if we reached the end of the source code
    if atEOF {
        nextToken = .init(kind: .eof, text: "", line: line(for: index))
        return nextToken!
    }
    
    // Otherwise, lex the character:
    // First find the trivia:
    let trivia = lexTrivia()
    // Now lex the token itself:
    let tokenStart = index
    var kind: Token.Kind
    let char = source[index]
    // The rest of lex() is built up piece by piece below.

Let's start with lexing identifiers and keywords. They're similar, so we'll lex them the same way and tell them apart afterwards. Swift's Character type provides convenient ways to check whether a character is a letter, a number, and so on.

if char.isLetter {
    // We have either an identifier or a keyword. Let's read more letters to get the full word.
    while index < source.endIndex && (source[index].isLetter || source[index].isNumber) {
        advance()
    }
    // Now identify it.
    switch source[tokenStart...source.index(before: index)] {
    case "let", "func", "repeat", "return":
        kind = .keyword
    default:
        kind = .identifier
    }
}

Next we can lex numbers. They may or may not be floating point, so we need to allow a . to be in the number:

else if char.isNumber {
    // Keep reading numbers
    repeat {
        advance()
    } while index < source.endIndex && (source[index].isNumber || source[index] == ".")
    kind = .number
}

Every other token in our language is one character, so we lex those last, and throw an error if the character is invalid:

else {
    // Single character tokens:
    switch char {
    case "=":
        kind = .equals
    case ",":
        kind = .comma
    case "+", "-":
        kind = .operator
    case "{":
        kind = .lBrace
    case "}":
        kind = .rBrace
    case "(":
        kind = .lParen
    case ")":
        kind = .rParen
    default:
        throw LexError.unexpectedCharacter(char, line: line(for: index))
    }
    if index != source.index(before: source.endIndex) {
        advance()
    } else {
        // We're on the last character of the file, so build the token here
        // and return without advancing.
        nextToken = .init(kind: kind, text: String(source[tokenStart...index]), line: line(for: index), trivia: trivia)
        return nextToken!
    }
}

Lastly, at the end of lex, we can build the Token and return it:

nextToken = .init(kind: kind, text: String(source[tokenStart...source.index(before: index)]), line: line(for: index), trivia: trivia)
return nextToken!

We'll also make a function to identify which line of the source code a Token appeared on. This will help with error logging later on:

func line(for index: String.Index) -> Int {
    // Count the newlines that come before `index`;
    // the character at `index` is on the line after the last one.
    var line = 1
    var currentIndex = source.startIndex
    while currentIndex < index {
        if source[currentIndex] == "\n" {
            line += 1
        }
        currentIndex = source.index(after: currentIndex)
    }
    return line
}
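
Finally, here's a quick, rough way to exercise the lexer on a Slang snippet, assuming everything above lives as methods inside the Lexer class: read nextToken, print it, and call lex() until we reach EOF.

// Keep reading nextToken and calling lex() until we hit EOF.
let lexer = Lexer("""
func main() {
    let message = 5
    print(message)
}
""")

do {
    while let token = lexer.nextToken, token.kind != .eof {
        print("\(token.kind): '\(token.text)' (line \(token.line))")
        try lexer.lex()
    }
} catch {
    print(error.localizedDescription)
}

Each token prints with its kind, text, and line number, which is also a handy way to debug the lexer as you extend the language.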