Writing a VM Part Four - The Assembler

Filed under vms on January 11, 2023

I’m not 100% proud of this assembler, but here we go. I’ve hacked one together that works well enough, but I don’t think I’ll show much code here. Code is here for the curious.

Source Material

Originally I was completely lost and thought I might need to write a full tokeniser and lexer, but I found an easier way after skimming through Assemblers and Loaders by David Saloman.

My Assembly Language

My assembly language is as simple as I could make it. I’d like to add macros at some point, but it’s not really up the top of my priority list right now, so can wait.

Each line could be one of three things:

An instruction, which will compile directly to machine code, eg ADD 0x123 R0
A directive, which the assembler will dynamically figure out and generate binary for
A comment, which is ignored

When a line is looked at by the assembler, the line is split on spaces. The assembler will then look at the first ‘word’ and do a lookup against a pregenerated instruction and directive table to see if there’s a match.

Most instructions will have the form

{LABEL} {MNEMONIC} {I1} {I2}

As I mentioned in my last post, I wish I had made the instructions a little more flexible, as it stands the destination can only really be another register, and only one of these input values can be an immediate value. As such this is a little clunky, as everything needs to be loaded into a register before being used and the ASM falling out of that design is a bit verbose and unwieldy.

The Basics

The assembler I’ve written is a two pass assembler, where the first pass goes over the provided file and any imports to build a symbol table. It then just translates these symbols into a single loadable file in its second pass.

Loadable files

The first thing I attacked was my ideal goal state. I came up with a rudimentary file format that was easy to convert to and from binary:

// LoadableFile represents a file we can load into memory
type LoadableFile struct {
	BlockCount uint32
	Flags      uint32
	Blocks     []*MemoryBlock
}

type MemoryBlock struct {
	Address   uint32
	BlockSize uint32
	Words     []uint32
}

func (l *LoadableFile) Save(w io.ByteWriter) error {
	err := writeWords(w, l.BlockCount, l.Flags)
	if err != nil {
		return err
	}
	for bi, b := range l.Blocks {
		err = writeWords(w, b.Address, b.BlockSize)
		if err != nil {
			return fmt.Errorf("error writing block %d: %v", bi, err)
		}
		err = writeWords(w, b.Words...)
		if err != nil {
			return fmt.Errorf("error writing block %d: %v", bi, err)
		}
	}
	return nil
}

// Load loads a loadable file from a binary stream
func Load(bs io.ByteReader) (*LoadableFile, error) {
	blockCount, err := nextWord(bs)
	if err != nil {
		return nil, fmt.Errorf("error reading block count: %v", err)
	}

	flags, err := nextWord(bs)
	if err != nil {
		return nil, fmt.Errorf("error reading flags: %v", err)
	}

	blocks, err := loadBlocks(blockCount, bs)
	if err != nil {
		return nil, fmt.Errorf("error loading blocks: %v", err)
	}

	return &LoadableFile{
		BlockCount: blockCount,
		Flags:      flags,
		Blocks:     blocks,
	}, nil
}
// loadBlocks from a stream
func loadBlocks(blockCount uint32, bs io.ByteReader) ([]*MemoryBlock, error) {
	if blockCount == 0 {
		return nil, nil
	}
	blocks := make([]*MemoryBlock, blockCount)
	var err error
	for i := 0; i < int(blockCount); i++ {
		blocks[i], err = loadBlock(bs)
		if err != nil {
			return nil, fmt.Errorf("error loading block %d: %v", i, err)
		}
	}
	return blocks, nil
}

// loadBlock from a stream
func loadBlock(bs io.ByteReader) (*MemoryBlock, error) {
	address, err := nextWord(bs)
	if err != nil {
		return nil, fmt.Errorf("error reading address: %v", err)
	}

	blockSize, err := nextWord(bs)
	if err != nil {
		return nil, fmt.Errorf("error reading block size: %v", err)
	}

	words := make([]uint32, blockSize)
	for i := 0; i < int(blockSize); i++ {
		words[i], err = nextWord(bs)
		if err != nil {
			return nil, fmt.Errorf("error loading word %d: %v", i, err)
		}
	}

	return &MemoryBlock{
		Address:   address,
		BlockSize: blockSize,
		Words:     words,
	}, nil
}

// nextWord gets the next word in a binary stream
func nextWord(bs io.ByteReader) (uint32, error) {
	b0, err := bs.ReadByte()
	if err != nil {
		return 0, fmt.Errorf("error reading first byte: %v", err)
	}
	b1, err := bs.ReadByte()
	if err != nil {
		return 0, fmt.Errorf("error reading second byte: %v", err)
	}
	b2, err := bs.ReadByte()
	if err != nil {
		return 0, fmt.Errorf("error reading third byte: %v", err)
	}
	b3, err := bs.ReadByte()
	if err != nil {
		return 0, fmt.Errorf("error reading fourth byte: %v", err)
	}
	ret := uint32(b0) << 24
	ret |= uint32(b1) << 16
	ret |= uint32(b2) << 8
	ret |= uint32(b3)

	return ret, nil
}

func wordToBytes(w uint32) []byte {
	return []byte{byte((w >> 24) & 0xFF), byte((w >> 16) & 0xFF), byte((w >> 8) & 0xFF), byte(w & 0xFF)}
}

func writeWords(w io.ByteWriter, ws ...uint32) error {
	for _, word := range ws {
		for _, b := range wordToBytes(word) {
			err := w.WriteByte(b)
			if err != nil {
				return fmt.Errorf("could not write byte: %v", err)
			}
		}
	}
	return nil
}

Each memory block has an address associated with it, so theoretically when I add interrupts I’ve already got this part covered.

When we load this into our memory, we simply loop through the memory blocks, see where they start, and set the appropriate memory addresses.

Opcode and Directive tables

There are two static tables, one describing opcodes and another describing assembler directives. This makes it easy to build our list of symbols and do rudimentary error checking after our first pass.

// eg:
var opcodeTable = opcodeTableType{
	"HALT": {
		mnemonic: "HALT",
		opcode:   0x00,
		hasI1:    false,
		hasI2:    false,
	},
	//...
}

var directiveTable = directiveTableType{
	"WORD": {
		mnemonic: "WORD",
		sizeCalc: func(_ string) uint32 {
			// Always one line long
			return 1
		},
		assembleFunc: func(sourceLine string, _ symbols) ([]uint32, error) {
			spl := strings.Split(sourceLine, " ")
			col := 1
			if len(spl) > 2 {
				col = 2
			}
			if len(spl) <= col {
				return nil, fmt.Errorf("not enough arguments to WORD directive")
			}
			arg := spl[col]
			p, err := parseLiteral(arg)
			if err != nil {
				return nil, err
			}
			return []uint32{p}, nil
		},
	},
	//...
}

These hold objects that conform to the following interface:

type assemblable interface {
	calculateSize(sourceLine string) uint32
	assemble(sourceLine string, symbolTable symbols) ([]uint32, error)
}

I noticed pretty early on that while directives will always require some special assembly logic, operations don’t require that. This is reflected by the different assemble implementations between the two types.

Intermediate File Format

The intermediate file format contains a symbol table and a list of symbols:

type firstPassFile struct {
	symbolTable symbols
	records     []*symbol
}

Where a symbol is defined as such

type symbol struct {
	symbolType         symbolType
	label              string
	relativeLineNumber uint32
	sourceLine         string
	assemblyLink       assemblable
}

I pulled this source pretty much straight out of the assembler book, where there’s a symbol type, label, relative line number, source field, and a pointer to the opcode or directive table.

Assembler First Pass

The first pass’ responsibility is to build up a coherent symbol table before the second pass put everything together. This could be done in a more flexible way, but it’s not the main focus of this little project, so I won’t be going too far into it.

func firstPass(sourceFile io.Reader, lineNum uint32) (*firstPassFile, error) {
	ln := lineNum
	src := bufio.NewReader(sourceFile)
	reloc := newFirstPassFile()
	for {
		line, _, err := src.ReadLine()
		if err != nil {
			if err == io.EOF {
				break
			}
			return nil, err
		}
		rec, err := firstPassLine(ln, string(line))
		if rec.label != "" {
			_, ok := reloc.symbolTable[rec.label]
			if ok {
				reloc.symbolTable[rec.label] = &symbol{
					label:              rec.label,
					symbolType:         MTDF,
					relativeLineNumber: ln,
					sourceLine:         string(line),
				}
			} else {
				reloc.symbolTable[rec.label] = rec
			}
		}
		reloc.records = append(reloc.records, rec)
		if rec.assemblyLink != nil {
			ln += rec.assemblyLink.calculateSize(string(line))
		}
	}
	return reloc, nil
}

As you can see, we call the calculateSize function on each opcode/directive link to calculate the address of each symbol for insertion into the table.

Assembler Second Pass

The second pass simply builds up a list of 32 bit words and packs it into our executable format, and as such is pretty simple:

func secondPass(firstPass *firstPassFile) (*executable.LoadableFile, error) {
	ret := &executable.LoadableFile{
		BlockCount: 0x1,
		Flags:      0,
		Blocks:     nil,
	}
	b := &executable.MemoryBlock{
		Address:   0x100,
		BlockSize: 0,
		Words:     nil,
	}
	for _, rec := range firstPass.records {
		if rec.assemblyLink == nil {
			continue
		}
		words, err := rec.assemble(firstPass.symbolTable)
		if err != nil {
			return nil, err
		}
		b.Words = append(b.Words, words...)
	}
	ret.Blocks = append(ret.Blocks, b)
	b.BlockSize = uint32(len(b.Words))
	return ret, nil
}

A Budget Terminal Device

I decided to created a small, memory mapped output device. This attaches itself to a memory range, and will just print chars if something is put into the memory address 0xFFE1 or 0xFFE2.

import (
	"bufio"
	"fmt"
	"os"
)

// TerminalDevice is a bus device that backs directly onto a real terminal
type TerminalDevice struct {
	consoleReader *bufio.Reader
}

const (
	TERMINAL = uint32(0xFFE1) + iota
	TERMINAL_INT
	TERMINAL_X
	TERMINAL_Y
	__terminal_reserved1
)

func (t *TerminalDevice) MemoryRange() *MemoryRange {
	// Addresses:
	// * 0xFFE1 - Write a character to terminal or read a character
	// * 0xFFE2 - Write a number to the terminal
	// * 0xFFE3 - Cursor X position
	// * 0xFFE4 - Cursor Y position
	// * 0xFFE5 - reserved
	return &MemoryRange{
		Start: 0xFFE1,
		End:   0xFFE5,
	}
}

func (t *TerminalDevice) Read(address uint32) (uint32, error) {
	// By default, Go doesn't provide a way to get unbuffered input from the console.
	// Will leave this to the UI when I get to that
	return 0, nil
}

func (t *TerminalDevice) Write(address, value uint32) error {
	switch address {
	case TERMINAL:
		fmt.Printf("%c", rune(value))
	case TERMINAL_INT:
		fmt.Printf("%d", value)
	}
	return nil
}

func NewTerminal() *TerminalDevice {
	return &TerminalDevice{
		consoleReader: bufio.NewReader(os.Stdin),
	}
}

Go is a little annoying, because ideally I think I would have wanted a getch type functionality but all I have is buffered input. Without a platform independent way of switching that of, I’ve just left it be for now and will build something more full-featured when I add a GUI

Seeing the Whole thing in Action

I also altered the main file to take in a .bs assembly file to assemble and run. I have the following assembly file

IMPORT term
ADDRESS HELLO R0
CALL PRINTSTRING
HALT
HELLO STRING Hello, world!

With an imported file that looks like the following on my search path:

; PRINTSTRING will print the string starting at the address in R0
PRINTSTRING READ R0 R1
EQ R1 0x00
RETURN
WRITE R1 0xFFE1
ADD 0x01 R0
JMP PRINTSTRING

Running it will produce the following:

$ go run ./cmd/main.go run -file ./examples/print_string.bs
Begin execution
-------
Hello, world!
-------
Machine has halted

Things to fix

My machine can assemble and run code, though I’d like to fix the following

Import search paths are a bit clunky
When I add interrupts, I’m going to need a way to make sure interrupt directives have a way to better alter the memory addresses lower than 0x100
The assembler itself is very simple, I would like to add some macros but that’s probably going to be a bit more complex than I really want to go right now
As mentioned before, Go’s buffered console I/O makes it a pain to do a VGA terminal-like interaction. This will need to be addressed when I add a GUI for the machine

Conclusion

The assembler was a fun little holiday project, not something that I’ve done in the past but it was good learning how they worked. I originally thought it would be similar to a compiler, but was pleasantly surprised to find it was much less unwieldy. Code is available here