Note: I would suggest reading (or skimming) through the pages linked on this page of my website before trying to make sense of this. This is a highly technical paper.

Introduction

        My research was in three main areas: programming language design, compiler design, and LLVM’s inner workings. This paper is divided accordingly, albeit not into equal portions.

Language Design

        Language design breaks down into several “categories” of languages, including (but not limited to) imperative languages, functional languages, and logic languages. For this project, I have chosen to follow the “imperative” style, as it’s what I know best and it works well with the tools I am using. This style of language also often comes with static typing, i.e. every entity is given its own “type,” each type has a list of operations that are valid to perform on it, and all other operations are considered invalid and cause the program to be rejected by the compiler. This is the style of language being implemented by my senior project: a C-like, statically-typed imperative language. In my case, the type system will be a tad more complex than those of most languages: it will deal with pointers, references, and other such types directly, rather than making them the responsibility of a runtime or language virtual machine.
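        To make that concrete, C++ enforces the same rule, so it serves as a stand-in here while my own language’s syntax is still in flux. The commented-out line below is exactly the kind of operation a static type checker rejects:

```cpp
#include <string>

int main() {
    int count = 42;               // count is of type int; arithmetic is valid on it
    std::string name = "point";   // name is of type std::string

    int doubled = count * 2;         // accepted: * is a valid operation on two ints
    std::string longer = name + "!"; // accepted: + is a valid operation on strings

    // int bad = name * 2;           // rejected at compile time: std::string does
                                     // not support multiplication
    (void)doubled;
    (void)longer;
}
```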

        Languages in the imperative category can still borrow ideas from other categories, and in this project I will be borrowing a key idea from functional languages: almost everything in my language will be an expression, i.e. it will evaluate to a value. Instead of basing the language on statements, as C (and most modern C-style languages) does, core functionality will be built out of expressions. Nearly everything in the language will be usable to initialize a variable, as a function argument, and so on. Blocks ({}s) evaluate to the last expression inside them, ifs evaluate to the result of whichever branch was executed, etc. This is a less orthodox approach than most languages today take, although it is becoming more common; languages like Rust have brought the idea to the forefront, and I decided to try my hand at it.
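        C++ has no true block expressions, but an immediately-invoked lambda is a rough approximation of the style I’m describing (a sketch only; my language’s actual syntax is undecided):

```cpp
#include <cstdio>

int main() {
    int temperature = 80;

    // C++ is statement-based, so an immediately-invoked lambda has to stand
    // in for a block expression: the block runs, and its final expression
    // becomes the value of the whole thing.
    int fee = [&] {
        int base = 10;
        return temperature > 75 ? base * 2 : base;  // the block's "last expression"
    }();

    // In my language the lambda wrapper would be unnecessary: the block (and
    // the if inside it) would itself be an expression, usable directly as an
    // initializer or a function argument.
    std::printf("fee = %d\n", fee);
}
```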

Compiler Design

        I decided from the start that I wanted to make a compiled language, and thus I researched the design of compilers. I started by looking into type systems, as my language is statically typed and will require a type checker (along with type-based analysis for generating LLVM IR later). Type checking effectively works by having the user write code with type information (i.e. a type Point is made up of two Decimals, x and y, or the function make_point returns a Point). This information is then used in a tree-walking phase where the compiler walks through the entire abstract syntax tree, annotating each node with additional type information that can be inferred (i.e. 42 is of type Integer, or a call to the function make_point evaluates to the type Point) and checking that against both what the language requires (i.e. you can only use + on two number types) and what the user declared. If a discrepancy is found, the program is declared “invalid,” an error is reported to the user, and the compiler stops. This is what my compiler will do: it will have rudimentary local type inference and use that to type-check programs. More advanced type inference could simplify the code users have to write (i.e. make them write out types less often), but it would also introduce significant complexity and make error messages much worse, so it was simply not reasonable for this project.
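        A minimal sketch of that tree-walking phase, with illustrative names rather than my compiler’s actual types:

```cpp
#include <memory>
#include <stdexcept>

// Illustrative AST and type enum, not my compiler's real representation.
enum class Type { Integer, Decimal, Invalid };

struct Expr {
    virtual ~Expr() = default;
    Type type = Type::Invalid;  // annotation filled in by the checker
};

struct IntLiteral : Expr { long value = 0; };

struct BinaryOp : Expr {
    char op = '+';  // '+', '-', ...
    std::unique_ptr<Expr> lhs, rhs;
};

// Walk the tree bottom-up, annotating each node with its inferred type and
// checking the language's rules (here: arithmetic only works on two values
// of the same numeric type).
Type check(Expr& e) {
    if (auto* lit = dynamic_cast<IntLiteral*>(&e)) {
        lit->type = Type::Integer;  // e.g. 42 is of type Integer
    } else if (auto* bin = dynamic_cast<BinaryOp*>(&e)) {
        Type l = check(*bin->lhs);
        Type r = check(*bin->rhs);
        if (l != r || l == Type::Invalid)
            throw std::runtime_error("type error: operands of mismatched types");
        bin->type = l;  // the annotation later phases (like IR gen) will read
    }
    return e.type;
}
```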

        I then looked into “name resolution,” i.e. how a compiler figures out what the programmer is talking about when they refer to “Point” or “some_variable” or “sin.” The first part of this is a symbol table, which is exactly what it sounds like: a huge table that keeps track of every “symbol” (name) in the program and maps it back to an actual entity in the compiler’s representation. Resolving each name once and recording the result is much more efficient than re-resolving names every time they are encountered, and it simplifies every part of the compiler that deals with symbols after the table has been built; the first sketch below shows the general shape of such a table.

        Finally, I looked into IRs and how translation into IRs works. I focused mostly on the “SSA” and “CFG” styles of IR, as these are what LLVM uses. “SSA” (“static single assignment” form) is a style of IR where programs are made up of a list of bindings, each of which is simply a name given to a computation. Each name is bound exactly once, which drastically simplifies many optimizations: it becomes trivial to find unused or duplicate computations, and being able to assume that no operation changes the value of any previous operation makes a lot of optimization code simpler and less error-prone. IRs designed this way require care to properly model mutation, as bindings cannot actually be updated; different SSA-based IRs model this in different ways. The other form I investigated is the “CFG,” aka the “control-flow graph.” These are directed graphs that model the flow of a program, where each node is a stretch of straight-line code that doesn’t branch (a “basic block” in CFG terminology) and each edge models a control transfer (a branch, jump, return, etc). A CFG models every possible runtime control-flow path in graph form, and optimizations can be performed on that graph. LLVM uses an IR that is an SSA/CFG hybrid, so understanding how both of them work will aid me later when I start generating LLVM IR from my compiler; the second sketch below shows what such a hybrid might look like in memory.
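        Here is a minimal sketch of a scoped symbol table (the names are illustrative, not my compiler’s real structures):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Whatever the compiler uses to represent a declared entity: a pointer back
// to the declaration, its type, etc.
struct Entity {};

class SymbolTable {
    // One map per active scope; the back of the vector is the innermost scope.
    std::vector<std::unordered_map<std::string, Entity*>> scopes_;

public:
    void enter_scope() { scopes_.emplace_back(); }
    void exit_scope()  { scopes_.pop_back(); }

    // Declarations always go into the innermost scope.
    void declare(const std::string& name, Entity* entity) {
        scopes_.back()[name] = entity;
    }

    // Resolution walks outward, so inner names shadow outer ones. Done once,
    // the result can be cached on the AST node so no later phase ever has to
    // repeat the lookup.
    Entity* resolve(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        return nullptr;  // undeclared: report an error to the user
    }
};
```

        A stack of maps keeps shadowing trivial: inner scopes are searched first, so the nearest declaration wins.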
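        And a sketch of what an SSA/CFG hybrid IR’s data structures might look like in memory (again illustrative; these are not LLVM’s actual classes):

```cpp
#include <memory>
#include <string>
#include <vector>

// SSA: each instruction is a binding, a single name assigned exactly once
// to one computation.
struct Instruction {
    std::string name;                    // e.g. "%x1" -- bound exactly once
    std::string op;                      // "add", "mul", "load", ...
    std::vector<Instruction*> operands;  // earlier bindings this one uses
};

// CFG: a basic block is straight-line code with no internal branches.
// Branches, jumps, and returns live only at the end of a block, and the
// successor edges are the control-flow graph itself.
struct BasicBlock {
    std::vector<std::unique_ptr<Instruction>> instructions;
    std::vector<BasicBlock*> successors;  // control-flow edges out
};

// Mutation like `x = x + 1` can't update a binding, so an SSA IR instead
// introduces a *new* binding (%x2 = add %x1, 1) and arranges for later code
// to refer to the newest one. LLVM does this with phi nodes at the points
// where control flow merges.
```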

LLVM

        LLVM is the framework I will be using to simplify optimization and machine-code generation for my compiler. The main things I researched in this department were what I could actually do with LLVM, and how I could make it do what I wanted. I first researched exactly how I could model the various abstractions that my language provides (or will provide): namely structures, classes, functions, FFI functions, methods, interfaces, constants, etc. LLVM can model all of them to some degree, and it actually provides most of them directly: constants, structure types, functions, “external” functions, and ways of specifying calling conventions to interact with FFI functions that have a different ABI. Classes and methods need to be turned into structures and functions that accept a this pointer (sketched below), while interfaces need to be translated into vtables of some sort. Being able to actually interact with the rest of the system is more complicated, but it is still doable: LLVM provides a calling convention that maps to the target platform’s native C ABI, meaning that it will automatically generate code to properly call an external C function with the Linux ABI on Linux, the Windows ABI on Windows, the iOS ABI on iOS, etc. While it seems that making LLVM generate code that can be turned into a DLL on Windows is more complicated, that is not necessary for a functional compiler.
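        As an example of that lowering, here is roughly what a class with one method could conceptually turn into, written out by hand in C++ (Point and Point_length are hypothetical names; the exact scheme my compiler ends up using may differ):

```cpp
#include <cmath>
#include <cstdio>

// A class becomes a plain structure holding only its fields...
struct Point {
    double x;
    double y;
};

// ...and a method becomes a free function whose first parameter is the
// "this" pointer, passed explicitly.
double Point_length(Point* self) {
    return std::sqrt(self->x * self->x + self->y * self->y);
}

int main() {
    Point p{3.0, 4.0};
    // A call like `p.length()` in the source language lowers to:
    std::printf("%f\n", Point_length(&p));
}
```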

        Based on my research, I believe the simplest way to achieve things like printing or file i/o is through the platform’s C runtime library, or perhaps through a lightweight runtime library written in C++ (or similar) that itself relies on the C runtime library, with my language interfacing with that custom runtime library.
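        A sketch of what such a runtime library could look like (my_print and my_read_line are hypothetical names, not a settled API): a C++ file compiled separately and linked with my language’s output, where the extern "C" wrappers give the functions stable, unmangled names that generated code can call through LLVM’s C calling convention.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

extern "C" void my_print(const char* text) {
    std::fputs(text, stdout);  // lean on the platform C runtime for i/o
}

extern "C" char* my_read_line(void) {
    char buffer[1024];
    if (!std::fgets(buffer, sizeof buffer, stdin)) return nullptr;
    // Hand back a heap copy; the language's runtime owns freeing it.
    char* copy = static_cast<char*>(std::malloc(std::strlen(buffer) + 1));
    if (copy) std::strcpy(copy, buffer);
    return copy;
}
```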
        As for interfacing with LLVM itself, the project provides a massive C++ API that I can use directly in my compiler (as it’s written in C++ as well). That API may not be particularly well-documented, but documentation does exist, and the parts I care about are the common parts, which are documented much better than the more obscure corners of the framework. Optimizations are done through a “pass manager” API, generating LLVM IR is done through an “instruction builder” API, etc. Everything I need is relatively simple to access where I need it.
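        As a sketch of what driving the instruction-builder API looks like (based on my reading of the LLVM docs; exact signatures vary a bit between LLVM versions), here is a small program that builds a max function out of basic blocks and a phi node:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
    LLVMContext context;
    Module module("demo", context);
    IRBuilder<> builder(context);

    // i32 max(i32 a, i32 b)
    auto* i32 = builder.getInt32Ty();
    auto* fnType = FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
    auto* fn = Function::Create(fnType, Function::ExternalLinkage, "max", module);
    Value* a = fn->getArg(0);
    Value* b = fn->getArg(1);

    // The CFG: one entry block, two branch targets, one merge point.
    auto* entry   = BasicBlock::Create(context, "entry", fn);
    auto* thenBB  = BasicBlock::Create(context, "then", fn);
    auto* elseBB  = BasicBlock::Create(context, "else", fn);
    auto* mergeBB = BasicBlock::Create(context, "merge", fn);

    builder.SetInsertPoint(entry);
    Value* cond = builder.CreateICmpSGT(a, b, "cmp");
    builder.CreateCondBr(cond, thenBB, elseBB);

    builder.SetInsertPoint(thenBB);
    builder.CreateBr(mergeBB);
    builder.SetInsertPoint(elseBB);
    builder.CreateBr(mergeBB);

    // The phi is where SSA meets the CFG: it picks a value based on which
    // predecessor block control arrived from.
    builder.SetInsertPoint(mergeBB);
    PHINode* result = builder.CreatePHI(i32, 2, "result");
    result->addIncoming(a, thenBB);
    result->addIncoming(b, elseBB);
    builder.CreateRet(result);

    verifyFunction(*fn);
    module.print(outs(), nullptr);  // dump the textual IR
}
```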

Annotated Bibliography

  • LLVM Project Developers. LLVM Language Reference Manual, LLVM Foundation, https://llvm.org/docs/LangRef.html.
    • This source documents LLVM’s IR language, including its syntax, its semantics, and the “overall context” of the project. It contains a huge amount of useful information about what needs to be done to make effective use of LLVM, and how to make generated code properly interact with the operating system it targets.
  • Cooper, Keith D., and Linda Torczon. Engineering A Compiler. 2nd ed., Morgan Kaufmann, 2011.
    • This source details much more of the internal implementation work that actually has to happen in a compiler’s middle-end (i.e. analysis) and back-end (i.e. code-gen, translation to lower-level IRs, etc). It leans toward the theoretical side of these topics, but it does still touch on the practical implementations. It almost entirely skips over language design itself and simply talks about how to implement a compiler once a language has already been designed.
  • Friedman, Daniel P., and Mitchell Wand. Essentials of Programming Languages. 3rd ed., MIT Press, 2008.
    • This source is an academic introduction to the design of programming languages; it focuses on features of languages themselves, with occasional notes on how the front-end of a compiler would deal with them (albeit at a very high level). It also covers basic ideas in parsing, scanning, and the representation of syntax (i.e. ASTs) in slightly more detail than any of the other sources.
  • Nystrom, Robert. Crafting Interpreters. 1st ed., Genever Benning, 2021.
    • This source is effectively an introduction to the field of programming languages, albeit from the perspective of making an interpreter instead of a compiler. It does an excellent job of introducing the theory of parsing (without trying to approach it from a purely mathematical standpoint), and it documents various less “theoretical” (i.e. not talked about in books/papers) parts of language design: things like syntactic “sugar,” usability, and history vs. innovation are explored here in a way they are not in more academic books. It tries to introduce the reader to the field rather than to the theory behind it, and gives the reader enough knowledge to start seeking out other resources. It also touches on the more “human” side of language tools, i.e. thinking about how people will use the tool rather than just how people will use the language that the tool implements. This leads into discussions of how to effectively report diagnostics and warnings to users, and other general usability topics.
  • Rodler, Michael. Mapping High Level Constructs to LLVM IR, https://mapping-high-level-constructs-to-llvm-ir.readthedocs.io/en/latest/README.html.
    • This is an exhibition of how to translate higher-level constructs into LLVM IR, including control flow, OO concepts, functional ideas, and how to talk to the OS (i.e. how to actually do anything useful). It’s not as expansive as the other resources, but it’s an excellent sanity check for my ideas.