Writing A Compiler

Posted in Code

In my final semester of undergraduate studies (meaning I’ll hopefully be doing graduate shenanigans after this year), I decided to broaden my scope of computer science knowledge and drop myself into a graduate-level course on compilers. The professor for the course is Daniel Cooke, a pretty big name in the world of computer languages, NASA intelligent systems, and AI and logic. His best-known brainchild is probably SequenceL, a high-performance declarative language that rips traditional languages a new one in terms of speed.

Anyway, over the course of the semester, we students were to build a complete LL(1)-parser-based compiler, from lexical analysis through syntax, semantics, and optimization to code generation, for an admittedly simplistic context-free grammar. The name LL(1) breaks down into three parts:

  • The first L means that the input is scanned from left to right,
  • The second L means that the parser produces a leftmost derivation (as opposed to the rightmost derivation of an LR parser),
  • The (1) means that the parser can look ahead at most one token

I won’t bother going in depth about the intricacies of what a context-free grammar is, as that’s probably better suited to a full-blown automata theory course.
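To make the one-token lookahead concrete, here’s a minimal sketch (not from the course; the token names are my own invention) of how an LL(1) parser picks which S-production to expand by peeking at a single token:

```cpp
#include <string>

// Hypothetical token names -- the real compiler's token set differs.
enum class Tok { Id, Read, Write, Case, While, If, Foreach, Other };

// An LL(1) parser decides which S-production to expand using only
// the single lookahead token -- no backtracking required, because
// each production for S starts with a distinct terminal.
std::string predictS(Tok lookahead) {
    switch (lookahead) {
        case Tok::Id:      return "S ::= ID := E S1";
        case Tok::Read:    return "S ::= read(IDL) S1";
        case Tok::Write:   return "S ::= write(IDL) S1";
        case Tok::Case:    return "S ::= case C do S M S1";
        case Tok::While:   return "S ::= while C do S od S1";
        case Tok::If:      return "S ::= if C then S S2";
        case Tok::Foreach: return "S ::= foreach id in id do S od S1";
        default:           return "syntax error";
    }
}
```

The key property is that no two alternatives for S begin with the same terminal, which is exactly what makes the grammar LL(1)-parsable without backtracking.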

The Grammar

Back to business, though: our grammar goes a little something like this. I won’t bore you with the entire thing, but here’s the gist of it:

  • S ::= ID := E S1 | read(IDL) S1 | write(IDL) S1 | case C do S M S1 | while C do S od S1 | if C then S S2 | foreach id in id do S od S1
  • S1 ::= ; S | ε
  • S2 ::= fi S1 | else S fi S1
  • M ::= : C do S M | esac

E’s are expressions of positive integers and identifiers (variables, which may or may not be arrays of any dimension) with operations for +, -, *, and ÷, plus exponentiation via exp(E, E). C’s are conditions built from E’s joined by <, >, or =, combined with “AND” or “OR” operations, along with a negating NOT(C) operation. There’s also some variable declaration stuff before all that, wrapped up in “program,” “begin,” and “end” tokens. Pretty simple, right?
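To make that concrete, here’s what a tiny program in this language might look like. The exact keyword spellings are my guesses from the grammar above, and I’ve left out the variable declarations since I haven’t shown their syntax:

```
program example
begin
    x := 1;
    while x < 10 AND NOT(x = 5) do
        x := x + exp(2, x)
    od;
    write(x)
end
```

Note how every statement form starts with a distinct keyword (or an identifier for assignment), which is what lets the parser get by with a single token of lookahead.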

The Code

Our first step is to write up a tokenizer (in C++) that spits out tokens as necessary. This makes up the entire lexical-analysis portion of the compiler, since its only job is to look for the words of the program (hence “lexical”) and report errors if there are any. In our case, we have a limited token length, a list of known and unknown characters, and the rule that identifiers can’t start with numbers.

bool token() {                              // Tokenizer
    resetctkn();                            // Reset the current token
    while(1) {
        if(linepos >= linesize) {                   // Finished with the current line
            if(checktkn(ctkn)) return true;
            resetctkn();
            if(!newline()) return false;            // Out of lines
            continue;
        }
        if(cmnt == 1 && line.at(linepos) == '*' && line.at(linepos+1) == '/') {
            cmnt = 0;                       // Clear comment flag
            linepos += 2;
            continue;
        }
        if(cmnt == 0 && line.at(linepos) == '/' && line.at(linepos+1) == '*') {
            cmnt = 1;                       // Set comment flag
            linepos += 2;
            continue;
        }
        if(cmnt == 1) {                     // Skip until comment ends
            linepos++;
            continue;
        }
        if(tsize >= TOKENSIZE) {                // Lexical error in token size
            log("Lexical Error: token size exceeds maximum length");
            if(checktkn(ctkn)) return true;
            resetctkn();
        }
        // Skip spaces
        while(line.at(linepos) == ' ' && linepos < (linesize - 1)) linepos++;
        ctkn[tsize] = tolower(line.at(linepos));
        // Check token if a delimiter is found
        if(delimiter(line.at(linepos)) || delimiter(line.at(linepos+1))) {
            linepos++;
            if(checktkn(ctkn)) return true;
            resetctkn();
            continue;
        }
        tsize++;
        linepos++;
    }
}

I’ve kept things pretty simple in that I won’t explicitly go over the helper functions’ code, though I will go over what they do.

  • delimiter() returns true if a character is a delimiter.
  • checktkn() checks whether a string is a valid token and sets the appropriate values in the symbol table for syntax and semantic analysis.
  • resetctkn() resets the variables used for building up the current token.
  • log() just logs errors to the console and the listing file.
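Since I’m not showing the helpers’ code, here’s a rough sketch of what delimiter() could look like. The delimiter set below is my guess from the grammar’s punctuation and operators, not the course’s actual list:

```cpp
#include <cstring>

// Hypothetical delimiter set -- the real list is dictated by the
// grammar's punctuation and operator tokens.
bool delimiter(char c) {
    static const char* delims = " ;:=<>+-*/(),";
    return std::strchr(delims, c) != nullptr;
}
```

Checking both the current and the next character against this set (as token() does) is what lets a delimiter both terminate the token being built and stand as a token on its own.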

This could almost definitely be optimized, but you get the idea. Next time, syntax!