I'm about to graduate from my university (January 2011 if there are no more problems), and as a final assignment (it's optional in my faculty, but it would be a great experience and an honor to have one) I chose to implement a compiler. One of our labs, Formal Method in Software Engineering (or simply FMSE), is the lab my supervisor is involved in. Therefore, the compiler I'm writing will probably be based on a project they're (or have been) working on. Yep, I was given the LinguSQL language to implement.
Previously, the language had a compiler that generated Java code (bad choice IMO), which was then compiled by a Java compiler. The problem with this approach, as with any normal Java application, is the HUGE runtime environment that must be distributed if someone wants to use the application. Furthermore, since Java runs bytecode on a virtual machine (don't count GCJ, I believe fewer than 10 people on my campus even know that thing exists), the performance is at most only 1/3 of native binaries (or so someone on the OSDev forum once said). Last but not least, execution isn't trivial: one must type "java xxx" in order to run the program.
As a native application (deve)lov(|p)er (read: developer and lover), I decided to implement a compiler for this language that generates native binaries. However, I don't have enough experience in generating native binaries (or even native assembly). So, remembering an option, I asked my supervisor: what if the compiler generates LLVM assembly language (also called LLVM Intermediate Representation, or IR)? Do you know what he said? "What is LLVM?" (doh). OK, so after bla bla bla, he accepted my choice. The advantages of generating LLVM assembly instead of native assembly are:
- A LOT of optimizations for free
- It can be compiled to native assembly for MANY platforms
- Easy integration with existing libraries
- It uses SSA format
The disadvantages:
- It's written in C++; more specifically, it officially supports only G++! (there are some hacks to use MSVC but... still, it's not official)
- Most important one: it doesn't have an Object Pascal frontend
So the work begins. I'm continuing the previous research: the previous Java-based compiler that generates Java code used JavaCC, followed by JavaCUP, and finally UUAG. The first two are parser generators with some differences; mainly, JavaCC generates LL parsers, while JavaCUP generates LALR ones. Both are BAD. I always find parser generators bad, since we have no idea whether the generated parser is correct or not, and the grammar can't be deduced from the generated code (except for recursive descent parser generators like Coco/R). The last one is a Haskell-based product, which actually works like a recursive descent parser, only in functional languages they're called parser combinators. It is quite good, with one important problem: when a parsing error happens, the parser tries to find all possible corrections, which slows down the parsing and eats resources. This behavior can't be customized easily, and that's what made me write the whole thing from scratch using the classic approach: a true recursive descent parser. This is the best kind of parser I've ever learned, since it's the most flexible one (there are tons of ways to handle parsing errors and that's totally up to you, with many methods possibly combined or used specifically for certain productions) and it still shows the grammar in its code.
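To make the "grammar shows in the code" point concrete, here's a minimal sketch of a hand-written recursive descent parser. It is NOT the LinguSQL parser: the toy arithmetic grammar and every name in it (Peek, SyntaxError, Parse, and so on) are made up for illustration. Each function mirrors one production, and the error handler uses simple panic-mode recovery (skip ahead to a character that can legally follow a factor), which is just one of the many strategies you could pick.

```pascal
program rdsketch;
{$mode objfpc}{$H+}

var
  Src: string;      // text being parsed
  Cur: Integer;     // index of the current character in Src
  Errors: Integer;  // syntax errors reported so far

// Skip blanks and return the current character (#0 at end of input).
function Peek: Char;
begin
  while (Cur <= Length(Src)) and (Src[Cur] = ' ') do
    Inc(Cur);
  if Cur <= Length(Src) then
    Result := Src[Cur]
  else
    Result := #0;
end;

// Report an error, then recover in panic mode: skip input until a
// character that may follow a factor (or the end of input) is reached.
procedure SyntaxError(const Msg: string);
begin
  Inc(Errors);
  WriteLn('error at column ', Cur, ': ', Msg);
  while not (Peek in ['+', '-', '*', ')', #0]) do
    Inc(Cur);
end;

function Expr: Integer; forward;

// Factor = Number | "(" Expr ")"
function Factor: Integer;
begin
  Result := 0;
  if Peek in ['0'..'9'] then
  begin
    while (Cur <= Length(Src)) and (Src[Cur] in ['0'..'9']) do
    begin
      Result := Result * 10 + Ord(Src[Cur]) - Ord('0');
      Inc(Cur);
    end;
  end
  else if Peek = '(' then
  begin
    Inc(Cur);
    Result := Expr;
    if Peek = ')' then
      Inc(Cur)
    else
      SyntaxError('")" expected');
  end
  else
    SyntaxError('number or "(" expected');
end;

// Term = Factor { "*" Factor }
function Term: Integer;
begin
  Result := Factor;
  while Peek = '*' do
  begin
    Inc(Cur);
    Result := Result * Factor;
  end;
end;

// Expr = Term { ("+" | "-") Term }
function Expr: Integer;
var
  Op: Char;
  Rhs: Integer;
begin
  Result := Term;
  while Peek in ['+', '-'] do
  begin
    Op := Peek;
    Inc(Cur);
    Rhs := Term;
    if Op = '+' then
      Result := Result + Rhs
    else
      Result := Result - Rhs;
  end;
end;

// Parse and evaluate S; ErrCount receives the number of syntax errors.
function Parse(const S: string; out ErrCount: Integer): Integer;
begin
  Src := S;
  Cur := 1;
  Errors := 0;
  Result := Expr;
  if Peek <> #0 then
    SyntaxError('end of input expected');
  ErrCount := Errors;
end;

var
  V, N: Integer;
begin
  V := Parse('2 + 3 * (4 - 1)', N);
  WriteLn(V, ' (', N, ' errors)');
  V := Parse('2 + * 3', N);   // the missing operand is reported, parsing goes on
  WriteLn(V, ' (', N, ' errors)');
end.
```

Because the recovery lives in one ordinary procedure, swapping it for something smarter (or giving a single production its own strategy) is a local change, which is exactly the flexibility I'm after.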
Coming to the code generation part, the problem I stated above must be covered. I created my own LLVM IR builder to generate LLVM assembly language. Due to the SSA structure, it's a bit difficult, but I managed to build it quite successfully with a beautiful modular architecture. It can now generate modules consisting of functions and global variables, where each function can have local variables, labels (for branches and loops), arithmetic instructions, memory instructions, etc. It's not yet complete, but it's already capable of generating simple programs. I'll put it in my Bitbucket account when I think it's production ready enough.
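For those wondering why SSA makes things "a bit difficult": every value may be assigned only once, so the generator has to invent a fresh name for each intermediate result. Here's a tiny sketch of that single idea. It is not my llvmirbuilder unit; TEmitter, Fresh and BinOp are hypothetical names made up for this illustration, and the operands %a and %b are assumed to be already existing i32 values.

```pascal
program ssasketch;
{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

type
  // Hypothetical emitter: collects instruction text and hands out SSA names.
  TEmitter = class
  private
    FCounter: Integer;
    FLines: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    function Fresh: string;
    function BinOp(const Op, A, B: string): string;
    property Lines: TStringList read FLines;
  end;

constructor TEmitter.Create;
begin
  inherited Create;
  FLines := TStringList.Create;
end;

destructor TEmitter.Destroy;
begin
  FLines.Free;
  inherited Destroy;
end;

// Return a name that has never been used before: %t1, %t2, ...
function TEmitter.Fresh: string;
begin
  Inc(FCounter);
  Result := '%t' + IntToStr(FCounter);
end;

// Emit "<fresh> = <Op> i32 <A>, <B>" and return the fresh name so the
// caller can use it as an operand of a later instruction.
function TEmitter.BinOp(const Op, A, B: string): string;
begin
  Result := Fresh;
  FLines.Add(Format('%s = %s i32 %s, %s', [Result, Op, A, B]));
end;

var
  E: TEmitter;
  Sum: string;
begin
  E := TEmitter.Create;
  try
    // (%a + 255) * %b
    Sum := E.BinOp('add', '%a', '255');
    E.BinOp('mul', Sum, '%b');
    Write(E.Lines.Text);
  finally
    E.Free;
  end;
end.
```

The point is only that each instruction returns the name of its result, and that name is what the next instruction consumes; a real builder also has to deal with types, memory, branches and so on.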
Want some code? OK:
program llvmirbuildertest;

{$mode objfpc}{$H+}

uses
  llvmirbuilder;

var
  x,y,l,s,a,b: TLLVMSymbol;
  c: TLLVMConstant;
  cl: TLLVMCallInstruction;

begin
  x := TLLVMSymbol.Create('x',lltInteger,true);
  y := TLLVMSymbol.Create('y',lltInteger);
  c := TLLVMConstant.Create('255',lltInteger);
  l := TLLVMLoadInstruction.Create('tmp',lltInteger,x);
  s := TLLVMStoreInstruction.Create('tmp',lltInteger,y,x);
  cl:= TLLVMCallInstruction.Create('func',lltInteger);
  a := TLLVMAddInstruction.Create('a',lltInteger,x,c);
  b := TLLVMSubInstruction.Create('b',lltInteger,c,y);
  WriteLn(l.GenerateCode);
  WriteLn(s.GenerateCode);
  WriteLn(cl.GenerateCode);
  WriteLn(a.GenerateCode);
  WriteLn(b.GenerateCode);
  a.Free;
  b.Free;
  cl.Free;
  s.Free;
  l.Free;
  c.Free;
  y.Free;
  x.Free;
end.

And the generated LLVM IR:
%tmp = load i32 * @x
store i32 %y, i32 * @x
call i32 @func()
%a = add i32 @x, 255
%b = sub i32 255, %y

Note that this is only partial code, so compiling it with llvm-as would definitely produce an error.