TechBlast - Tech News for Builders and Operators

Four sprints. One afternoon. 62 tests, all green.

ForgeIL is the compiler layer of Forge - an open source UI framework built around
a radically simple idea: one language, one binary, no runtime installation.

The Idea

There is a test every language eventually faces. Not a benchmark. Not a syntax
comparison. A question:

Can the language compile itself?

A self-hosted compiler is the closest thing software has to a proof of maturity.
If the language can express a lexer, a parser, and a code generator - and the
result of running that code is a working binary - then the language has crossed
a threshold. It is no longer a demo. It is a tool.

SMS is the scripting language at the heart of Forge. It is interpreted at
development time and compiled to LLVM IR at release time. We decided to test
whether SMS could implement its own compiler front-end: a lexer, a parser, and
an LLVM IR emitter - all written in SMS itself.

Four sprints. One afternoon. It works.

What We Built

The self-hosted compiler lives in the forgeil/ directory of sms-cpp - three SMS
source files that together form a complete compilation pipeline:

SMS Source
    |
    v
tokenize(src)        sprint1-lexer.sms     ->  array of Token
    |
    v
parse(tokens)        sprint2-parser.sms    ->  AST of Node
    |
    v
codegen(ast)         sprint3-codegen.sms   ->  LLVM IR text
    |
    v
clang                                      ->  native binary

A single convenience function wraps it all:

fun compile(src) {
    return codegen(parse(tokenize(src)))
}

That is the entire public API. Give it SMS source text. Get back LLVM IR.

Sprint 1: Lexer

The lexer is where SMS's integer-based character API ended up shaping the design
instead of limiting it. str.charAt(i) returns the integer character code of the
character at position i, so the entire lexer is built on integer comparisons -
no regex, no character class library.

fun isAlpha(code) {
    return (code >= 65 && code <= 90) || (code >= 97 && code <= 122) || code == 95
}

fun isDigit(code) {
    return code >= 48 && code <= 57
}
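
To show how predicates like these can drive a scanner, here is a minimal sketch in Python rather than SMS. ord() stands in for charAt, and scan_word with its (kind, text, next_index) return shape is a hypothetical helper for illustration, not something taken from the ForgeIL source.

```python
def is_alpha(code):
    # 'A'..'Z', 'a'..'z', or '_' (95), mirroring isAlpha above
    return 65 <= code <= 90 or 97 <= code <= 122 or code == 95


def is_digit(code):
    # '0'..'9'
    return 48 <= code <= 57


def scan_word(src, i):
    """Scan one identifier or integer literal starting at index i.

    Returns (kind, text, next_index). Hypothetical helper, for illustration.
    """
    start = i
    first = ord(src[i])
    if is_alpha(first):
        # Identifiers: a letter or underscore, then letters, digits, underscores
        while i < len(src) and (is_alpha(ord(src[i])) or is_digit(ord(src[i]))):
            i += 1
        return ("IDENT", src[start:i], i)
    if is_digit(first):
        # Integer literals: a run of digits
        while i < len(src) and is_digit(ord(src[i])):
            i += 1
        return ("NUMBER", src[start:i], i)
    raise ValueError(f"unexpected character code {first} at index {i}")
```

Every decision is a range check on an integer, which is why no character class library is needed.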

Two-character operators (==, !=, <=, >=, &&, ||) are handled with
a single lookahead:

if (ch == 61) {  // '='
    if (i + 1 < len && src.charAt(i + 1) == 61) {
        tokens.add(Token("OP", "==", line))
        i = i + 2
    } else {
        tokens.add(Token("ASSIGN", "=", line))
        i = i + 1
    }
}
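
The same lookahead generalizes to all six two-character operators. A table-driven sketch in Python (the TWO_CHAR table and scan_operator are assumptions made for this illustration; the SMS lexer may well use a chain of if statements like the one above):

```python
# Two-character operators keyed by the code of their first character.
# Each inner dict maps the second character code to the operator text.
TWO_CHAR = {
    61: {61: "=="},    # '=' then '='
    33: {61: "!="},    # '!' then '='
    60: {61: "<="},    # '<' then '='
    62: {61: ">="},    # '>' then '='
    38: {38: "&&"},    # '&' then '&'
    124: {124: "||"},  # '|' then '|'
}


def scan_operator(src, i):
    """Return (text, next_index), preferring a two-character operator."""
    ch = ord(src[i])
    if i + 1 < len(src):  # bounds check before peeking, as in the SMS code
        nxt = ord(src[i + 1])
        op = TWO_CHAR.get(ch, {}).get(nxt)
        if op is not None:
            return (op, i + 2)
    # Fall back to the single character at position i
    return (src[i], i + 1)
```

The bounds check before the peek matters: without it, a source file ending in '=' would read past the end of the input.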

The result: tokenize(src) returns an array of Token(type, value, line) data
class instances. 12 tests, all green on the first run.

Sprint 2: Parser

A recursive descent parser, written in SMS, for SMS.

The interesting constraint here: SMS does not allow forward declarations. A
function must be defined before it is called, with one exception: top-level
definitions are all registered before main() runs. Mutual recursion between
parseExpr and parseStmt therefore works, because both are registered at the
module level before any call is executed.

The precedence chain follows the standard pattern:

parsePrimary -> parsePostfix -> parseUnary -> parseMul ->
parseAdd -> parseCompare -> parseEquality -> parseAnd -> parseOr -> parseExpr
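
The chain reads from tightest binding to loosest: each function parses its operands by calling the next tighter level, so precedence falls out of the call structure. A reduced Python sketch with three levels only (tokens are plain strings and nodes are tuples here, which is an assumption for brevity and not the ForgeIL AST):

```python
def parse_primary(tokens, pos):
    """Numbers and parenthesized expressions. Returns (node, next_pos)."""
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_add(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    return ("num", int(tok)), pos + 1


def parse_mul(tokens, pos):
    """'*' and '/' bind tighter than '+' and '-', so operands are primaries."""
    left, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_primary(tokens, pos + 1)
        left = (op, left, right)  # left-associative
    return left, pos


def parse_add(tokens, pos):
    """Loosest level in this sketch; its operands are whole products."""
    left, pos = parse_mul(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_mul(tokens, pos + 1)
        left = (op, left, right)
    return left, pos
```

Parsing 1 + 2 * 3 with parse_add yields a '+' node whose right child is the '*' node, because parse_add hands both operands to parse_mul first.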

One non-obvious detail: the parser cursor is a single-element array used as a
mutable integer box. SMS arrays are passed by shared reference - element
assignments inside helper functions are visible to the caller. This is how the
cursor advances across the recursive descent without needing a global variable.

fun curTok(cur, tokens) { return tokens[cur[0]] }
fun advance(cur)         { cur[0] = cur[0] + 1 }
fun consume(cur, tokens, expected) {
    var tok = curTok(cur, tokens)
    if (tok.type != expected) { ... }
    advance(cur)
    return tok
}
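
Python lists share this reference behavior, so the box trick can be tried outside SMS. A small sketch (tokens are (type, value) tuples here, and raising SyntaxError on a mismatch is an assumption, since the article elides what the SMS version does in that branch):

```python
def cur_tok(cur, tokens):
    return tokens[cur[0]]


def advance(cur):
    cur[0] += 1  # mutates the shared list, so the caller sees the new index


def consume(cur, tokens, expected):
    tok = cur_tok(cur, tokens)
    if tok[0] != expected:
        # Assumed error handling, for illustration only
        raise SyntaxError(f"expected {expected}, got {tok[0]}")
    advance(cur)
    return tok


tokens = [("IDENT", "x"), ("ASSIGN", "="), ("NUMBER", "1")]
cur = [0]                        # the one-element box holding the cursor
consume(cur, tokens, "IDENT")
consume(cur, tokens, "ASSIGN")   # cur[0] has now advanced twice
```

Passing a plain integer instead would not work: the helper would increment its own copy and the caller's cursor would never move.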

24 tests. All green.

Sprint 3: LLVM IR Emitter

The code generator takes an AST and produces a complete LLVM IR text string.
Integer-only subset: every SMS value is i64, and variables use alloca, so the
emitter never has to build phi nodes itself. LLVM's optimizer can promote those
stack slots to registers (mem2reg/SROA) when clang runs with optimization enabled.

The context object accumulates emitted instructions:

data class Ctx(tempCnt, labelCnt, code, terminated, loops)

Because SMS data class instances use shared references (shared_ptr under the
hood), every helper function that mutates ctx.code or ctx.tempCnt has its
changes visible to every other function holding the same ctx. This is the
foundation of the entire emitter design.
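
A rough Python analogue of that design, using a dataclass as the shared context. The field names follow Ctx above; the field types and the helpers new_temp, emit, and emit_add are assumptions for this sketch:

```python
from dataclasses import dataclass, field


@dataclass
class Ctx:
    temp_cnt: int = 0          # next temporary register number
    label_cnt: int = 0         # next basic-block label number
    code: str = ""             # IR text accumulated so far
    terminated: bool = False   # assumed: current block already has a terminator
    loops: list = field(default_factory=list)  # assumed: enclosing loop targets


def new_temp(ctx):
    name = f"%t{ctx.temp_cnt}"
    ctx.temp_cnt += 1          # visible to every holder of this ctx
    return name


def emit(ctx, line):
    ctx.code += "    " + line + "\n"


def emit_add(ctx, left, right):
    reg = new_temp(ctx)
    emit(ctx, f"{reg} = add i64 {left}, {right}")
    return reg
```

Because every helper receives the same object, nothing has to be returned and threaded back through the call stack except the name of the register holding the result.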

A comparison compiles to an icmp followed by a zext - because LLVM
comparisons produce i1 and SMS uses i64 everywhere:

if (op == "<") { emit(ctx, cmpReg $ " = icmp slt i64 " $ left $ ", " $ right) }
emit(ctx, reg $ " = zext i1 " $ cmpReg $ " to i64")
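
Only the "<" case is shown above. One plausible way to cover all six comparison operators is a lookup table of icmp predicates, sketched here in Python. The signed predicates are an assumption based on the slt in the article, and emit_compare with its counter argument n is hypothetical:

```python
# SMS comparison operator -> LLVM icmp predicate (signed variants assumed)
ICMP_PREDICATES = {
    "<": "slt",
    "<=": "sle",
    ">": "sgt",
    ">=": "sge",
    "==": "eq",
    "!=": "ne",
}


def emit_compare(op, left, right, n):
    """Return the two IR lines for `left op right`.

    n is a hypothetical counter used to name the two temporaries.
    """
    pred = ICMP_PREDICATES[op]
    cmp_reg = f"%t{n}"       # holds the i1 result of icmp
    reg = f"%t{n + 1}"       # holds the widened i64 value (0 or 1)
    return [
        f"{cmp_reg} = icmp {pred} i64 {left}, {right}",
        f"{reg} = zext i1 {cmp_reg} to i64",
    ]
```

zext rather than sext is what keeps "true" equal to 1; sign-extending an i1 would turn it into -1.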

For a function like fun add(a, b) { return a + b }, the emitter produces:

define i64 @sms_add(i64 %_p_a, i64 %_p_b) {
entry:
    %a = alloca i64
    store i64 %_p_a, i64* %a
    %b = alloca i64
    store i64 %_p_b, i64* %b
    %t0 = load i64, i64* %a
    %t1 = load i64, i64* %b
    %t2 = add  i64 %t0, %t1
    ret i64 %t2
}
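
The first half of that output is mechanical: one alloca and one store per parameter. A Python sketch that reproduces the prologue lines shown above (emit_prologue is a hypothetical name, and the typed-pointer syntax i64* is copied from the article's output; recent LLVM releases print pointers as the opaque ptr type instead):

```python
def emit_prologue(name, params):
    """Return the IR lines that open a function and spill its parameters."""
    sig = ", ".join(f"i64 %_p_{p}" for p in params)
    lines = [f"define i64 @sms_{name}({sig}) {{", "entry:"]
    for p in params:
        # Each parameter gets a stack slot so it can be reassigned like a local
        lines.append(f"    %{p} = alloca i64")
        lines.append(f"    store i64 %_p_{p}, i64* %{p}")
    return lines
```

Spilling every parameter to the stack means later reads and writes of a and b all go through load and store, so the emitter needs no special case for reassigned parameters.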

If the source contains a fun main(), a C-compatible entry point is appended:

define i32 @main() {
entry:
    %ret64 = call i64 @sms_main()
    %ret32 = trunc i64 %ret64 to i32
    ret i32 %ret32
}
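
One consequence of this wrapper is worth spelling out: the i64 result is narrowed twice before a shell sees it. trunc keeps the low 32 bits, and on POSIX systems the parent process observes only the low 8 bits of the exit status. A worked Python sketch of that arithmetic (both function names are made up for this example):

```python
def trunc_i64_to_i32(value):
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


def posix_exit_status(sms_result):
    """Exit status a POSIX parent observes: the low 8 bits of main's return."""
    return trunc_i64_to_i32(sms_result) & 0xFF
```

So a program returning 42 exits with 42, but one returning 256 exits with 0 and one returning -1 exits with 255. Tests that check results through the exit code need to stay within 0..255.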

19 tests. All green.

Sprint 4: End-to-End

The final sprint is the proof. A new C API function -
sms_native_execute_string_result - captures the string value produced by the
SMS interpreter, where the C API previously returned only an integer. This lets
the host application receive the generated IR text directly.

The test then does exactly what you would do on the command line:

// 1. Run the SMS compiler in the interpreter, capture the IR string
std::string ir = get_ir(load_all(), "fun main() { return 42 }");

// 2. Write to a temp .ll file
// 3. Run: clang -O0 -o /tmp/test_bin /tmp/test.ll
// 4. Run the binary, check exit code == 42
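
Steps 2 to 4 can be reproduced without the C++ test harness. The Python sketch below is not the project's test code: SAMPLE_IR is a hand-written stand-in for what the SMS compiler would emit for fun main() { return 42 }, and compile_and_run is a hypothetical helper that returns None when clang is not on the PATH.

```python
import os
import shutil
import subprocess
import tempfile

# Hand-written stand-in for the generated IR (assumption, not compiler output)
SAMPLE_IR = """\
define i64 @sms_main() {
entry:
    ret i64 42
}

define i32 @main() {
entry:
    %ret64 = call i64 @sms_main()
    %ret32 = trunc i64 %ret64 to i32
    ret i32 %ret32
}
"""


def compile_and_run(ir_text):
    """Write IR to a temp .ll file, build it with clang, return the exit code.

    Returns None if clang is not installed, so callers can skip the check.
    """
    clang = shutil.which("clang")
    if clang is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        ll_path = os.path.join(tmp, "test.ll")
        bin_path = os.path.join(tmp, "test_bin")
        with open(ll_path, "w") as f:
            f.write(ir_text)
        # clang accepts .ll files directly and links a native executable
        subprocess.run([clang, "-O0", "-o", bin_path, ll_path], check=True)
        return subprocess.run([bin_path]).returncode
```

Calling compile_and_run(SAMPLE_IR) on a machine with a working clang toolchain should report the exit code 42 described below.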

Exit code 42. Not chosen at random. The compiler's first words are a nod to
the only question that ever mattered.

forgeil_sprint4_tests: all tests passed (7)

The pipeline holds for arithmetic, if/else branches, and while loops. The tests
skip gracefully when clang is not available, so they still run cleanly in CI
environments that do not have it installed.

What the Numbers Look Like

Sprint 1 - Lexer        12 tests  ✓
Sprint 2 - Parser       24 tests  ✓
Sprint 3 - Code gen     19 tests  ✓
Sprint 4 - Self-host     7 tests  ✓
---------------------------------
Total                   62 tests  all green

What This Is Not

SMS is not trying to replace LLVM's front-end infrastructure. The self-hosted
compiler covers the integer-only subset of the language: functions, variables,
if/else, while, break/continue, arithmetic, and comparisons. Strings, arrays,
data classes, and the standard library are outside its current scope.

The point is not feature completeness. The point is that the language has
enough expressive power to reason about itself. That is a different claim - and
a meaningful one.

The Code

Everything is open source under GPL-3.0 (with a commercial option):

sms-cpp: codeberg.org/CrowdWare/sms-cpp
Lexer: forgeil/sprint1-lexer.sms
Parser: forgeil/sprint2-parser.sms
Code gen: forgeil/sprint3-codegen.sms
Tests: tests/forgeil_sprint{1..4}_tests.cpp

If you want to try it:

git clone https://codeberg.org/CrowdWare/sms-cpp.git
git clone https://codeberg.org/CrowdWare/sml-cpp.git
cd sms-cpp
cmake -B build -DBUILD_TESTING=ON -DSML_CPP_DIR=../sml-cpp
cmake --build build
cd build && ctest -R forgeil --output-on-failure

Forge is being built in public at crowdware.info.
SMS, ForgeIL, and the self-hosted compiler are part of a longer project:
a UI framework that runs anywhere without asking anything of the user's machine.