How to Write an Assembler in Racket, Part 1: Introduction

10 Mar 2022

Introduction, or: Why I Don’t Like Your Assembler

As avid readers may have noticed, I like assembly programming. I’ve tried several times to branch out to other consoles and systems like the Atari ST, Game Boy, Sega Genesis/Mega Drive, Amiga, etc.

These endeavors all ended in frustration, most of the time for the same reason: I found the available tools inadequate for my needs. This is not to say that they’re bad. They’re well designed and, for the most part, do what they advertise: assemble code for cross-platform development.

But I find myself longing for the simple assembler-linker combination cc65 offers. I think .include and .org directives (and their cousins) are horrible; they make modularity harder to achieve and prevent you from taking advantage of concepts like relocatable code, re-entrant subroutine design, etc.

I think assembler-linker combinations are preferable to assemblers that clutter the source with directives that have no effect on what the code actually does. You don’t write your makefile into your C/C++ headers, so why is it okay to do so with assembly code?

The second problem is that I like to automate my builds. Be it with make, CMake, or (best of all) tup, I want to set up my project once and know it will build (as long as my code is correct, that is). There’s no black sorcery involved in turning source files into linkable objects (allegedly). Which is to say, I want a linker. It is not my job as a programmer to correctly resolve labels, variable addresses, subroutine calls, return addresses, etc. That’s exactly why we have linkers.

The cc65 toolchain mostly solves this problem - alas, it only supports 6502 and 65816. But I need Z80 (and hopefully 68k) support. Fellow assembly programmers may not feel as strongly about the issues mentioned above. But I do. I will no longer suffer the tyranny of single namespaces and fixed addresses.

Now, if only there was a way for people to make their own assembler and linker (with Black Jack and hookers)…

Wyvern: The Modular Assembler

So, I’m going to write my own assembler and linker. Since the market for new assemblers is kind of small, I’ll be arrogant enough to declare this to be a new type of assembler: the modular assembler!

What is a modular assembler? A modular assembler does not allow you to use .include and .org (I really dislike those). But kidding aside, the idea is pretty simple: Each source file represents a module (or translation unit, if you like). A module, in turn, creates its own environment.

An environment is a sequence of frames. A frame is a set of bindings between names and values. Translated to assembly, we’d call the names labels (or sometimes symbols), and the values they’re bound to are either constants or addresses (which are essentially the same thing: in the final step of assembling/linking, each name/label/symbol is replaced by the value it represents).
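To make the idea of an environment concrete, here is a minimal sketch of a chain of frames. It is written in Python purely for illustration and is not Wyvern’s actual implementation; the class name `Frame`, the labels, and all addresses are assumptions made up for this example.

```python
from typing import Dict, Optional


class Frame:
    """One scope: a table binding names (labels) to values (constants or addresses)."""

    def __init__(self, name: str, parent: Optional["Frame"] = None):
        self.name = name
        self.parent = parent
        self.bindings: Dict[str, int] = {}

    def define(self, label: str, value: int) -> None:
        # A name may be bound only once per frame...
        if label in self.bindings:
            raise ValueError(f"{label} is already defined in frame {self.name}")
        self.bindings[label] = value

    def lookup(self, label: str) -> int:
        # ...but lookup walks outward, so inner frames shadow outer ones.
        frame: Optional[Frame] = self
        while frame is not None:
            if label in frame.bindings:
                return frame.bindings[label]
            frame = frame.parent
        raise KeyError(f"unresolved label: {label}")


# Hypothetical module with two subroutines; the addresses are invented.
module = Frame("Example")
module.define("COUNTER", 0x0080)

init = Frame("Init", parent=module)
init.define("loop", 0x8004)

update = Frame("Update", parent=module)
update.define("loop", 0x8020)  # same name, different frame: no clash

init_loop = init.lookup("loop")
update_loop = update.lookup("loop")
shared_counter = update.lookup("COUNTER")  # found in the enclosing frame
```

Each subroutine gets its own frame, so both can bind `loop`, while a name that is not bound locally is found by walking out to the enclosing module frame.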

Now, what is the practical implication of this? All that talk of environments and frames is just a fancy way of saying that labels/symbols are unique to their respective frame; in other words, you can reuse names within your assembly code. Let me illustrate that with an example.

Suppose you want to write a small math library like this (code adapted from Neil Parker):

.proc Multiply
        LDA #0       ; Initialize RESULT to 0
        STA RESULT+2
        LDX #16      ; There are 16 bits in NUM2
L1:     LSR NUM2+1   ; Get low bit of NUM2
        ROR NUM2
        BCC L2       ; 0 or 1?
        TAY          ; If 1, add NUM1 (hi byte of RESULT is in A)
        CLC
        LDA NUM1
        ADC RESULT+2
        STA RESULT+2
        TYA
        ADC NUM1+1
L2:     ROR A        ; "Stairstep" shift
        ROR RESULT+2
        ROR RESULT+1
        ROR RESULT
        DEX
        BNE L1
        STA RESULT+3
        RTS
.endproc

.proc Divide
        LDA #0      ; Initialize REM to 0
        STA REM
        STA REM+1
        LDX #16     ; There are 16 bits in NUM1
L1:     ASL NUM1    ; Shift hi bit of NUM1 into REM
        ROL NUM1+1  ; (vacating the lo bit, which will be used for the quotient)
        ROL REM
        ROL REM+1
        LDA REM
        SEC         ; Trial subtraction
        SBC NUM2
        TAY
        LDA REM+1
        SBC NUM2+1
        BCC L2      ; Did subtraction succeed?
        STA REM+1   ; If yes, save it
        STY REM
        INC NUM1    ; and record a 1 in the quotient
L2:     DEX
        BNE L1
        RTS
.endproc

This code will assemble without problems using ca65, the assembler of the cc65 suite (provided NUM1, NUM2, RESULT, and REM are defined elsewhere). Why do the duplicate labels L1 and L2 not cause any trouble? Because each procedure forms its own frame. The directive pair .proc and .endproc causes the assembler to create a new scope for the subroutine, so the labels defined within it do not clash with labels of the same name elsewhere (and do not cause redefinition errors).

So, in Wyvern, the same code may look like this:

;;; File: math.asm
module Math ( Multiply, Divide ) where


.proc Multiply:
        LDA #0       ; Initialize RESULT to 0
        STA RESULT+2
        LDX #16      ; There are 16 bits in NUM2
L1:     LSR NUM2+1   ; Get low bit of NUM2
        ROR NUM2
        BCC L2       ; 0 or 1?
        TAY          ; If 1, add NUM1 (hi byte of RESULT is in A)
        CLC
        LDA NUM1
        ADC RESULT+2
        STA RESULT+2
        TYA
        ADC NUM1+1
L2:     ROR A        ; "Stairstep" shift
        ROR RESULT+2
        ROR RESULT+1
        ROR RESULT
        DEX
        BNE L1
        STA RESULT+3
        RTS
.endproc

.proc Divide:
        LDA #0      ; Initialize REM to 0
        STA REM
        STA REM+1
        LDX #16     ; There are 16 bits in NUM1
L1:     ASL NUM1    ; Shift hi bit of NUM1 into REM
        ROL NUM1+1  ; (vacating the lo bit, which will be used for the quotient)
        ROL REM
        ROL REM+1
        LDA REM
        SEC         ; Trial subtraction
        SBC NUM2
        TAY
        LDA REM+1
        SBC NUM2+1
        BCC L2      ; Did subtraction succeed?
        STA REM+1   ; If yes, save it
        STY REM
        INC NUM1    ; and record a 1 in the quotient
L2:     DEX
        BNE L1
        RTS
.endproc

And to use this code in another file/module, you’d start that module like this:

;;; File: Graphics.asm
module Graphics where

import Math
import Memory

; some code...

; now use Multiply from Math
    JSR Multiply
    ; or use the fully qualified name
    JSR Math.Multiply

; more code...

Hence, the modular assembler.
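How such imports might be resolved can be sketched in a few lines. The following Python is illustrative only and not Wyvern’s implementation: the `Module` class, the `resolve` function, the unexported `Scratch` label, the `Copy` export of `Memory`, all addresses, and the rule that an ambiguous unqualified name is an error are assumptions made for this sketch.

```python
from typing import Dict, Iterable, List


class Module:
    """A module: a symbol table plus the subset of names it exports."""

    def __init__(self, name: str, exports: Iterable[str], symbols: Dict[str, int]):
        self.name = name
        self.exports = set(exports)
        self.symbols = dict(symbols)
        missing = self.exports - set(self.symbols)
        if missing:
            raise ValueError(f"{name} exports undefined symbols: {sorted(missing)}")


def resolve(name: str, imports: List[Module]) -> int:
    """Resolve `Multiply` or `Math.Multiply` against the imported modules."""
    if "." in name:
        module_name, symbol = name.split(".", 1)
        candidates = [m for m in imports
                      if m.name == module_name and symbol in m.exports]
    else:
        symbol = name
        candidates = [m for m in imports if symbol in m.exports]
    if not candidates:
        raise KeyError(f"unresolved symbol: {name}")
    if len(candidates) > 1:
        # Assumed rule: the programmer must qualify the name to disambiguate.
        raise LookupError(f"ambiguous symbol: {name}")
    return candidates[0].symbols[symbol]


# Hypothetical symbol tables; every address here is invented.
math_mod = Module("Math", ["Multiply", "Divide"],
                  {"Multiply": 0x8000, "Divide": 0x802A, "Scratch": 0x0090})
memory_mod = Module("Memory", ["Copy"], {"Copy": 0x8100})
imports = [math_mod, memory_mod]

short_form = resolve("Multiply", imports)
qualified_form = resolve("Math.Multiply", imports)
```

Both spellings resolve to the same address, while `Scratch` stays private to `Math` because it is not in the export list.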

I believe this is preferable to simply designing code with unique labels. The programmer can convey more clearly what the code at hand is supposed to achieve, and it becomes easier to extend and integrate existing code without fear of a “polluted namespace”.

The discerning Haskeller will notice a striking resemblance to Haskell modules. Coincidence? We’ll never know.

Why Racket (and Not Haskell)

A functional language like Haskell is the best fit for a project like this. Parsing and interpreting are a major part of Haskell development: more than 200 libraries are tagged parsing. I will most definitely not rewrite it in Rust (Rust is not a real programming language).

I was going to use Haskell, but being a Schemer, I thought trying out a Lisp with a type system would be fun. Racket fits the bill here. It comes with expansive documentation. I want to depend as little as possible on existing code, so I think a Lisp dialect will be easier to follow. Despite what detractors say, Lisp is very readable and the syntax allows zero ambiguity.

Join me next time, when we discuss the three main parts of the assembler.
