Toolbox: a Quick Rising Assembler

Brought to you by ‘Long time ago in a galaxy far away’.

The other day I discovered that to be able to experiment with instruction set trimming for my rustic 8-bit CPU I need a tunable assembler. I thought a little about my options which appeared to be the following: to search for existing tool, to modify an existing tool or to build my own tool from scratch. Since I was not in a big hurry I picked the third option and started building rudimentary assembler. Here I’ll document this endeavour.

The idea of assembler is quite simple on its face: take a stream of mnemonics corresponding to processor’s instructions more or less directly, return a chunk of binary data which consists of processor’s instructions and is ready to run. In other words take something like

        lda $4020
        inc
	add 10

and return something like

34 40 20 2a 43 10

So to build an assembler one would at least need some mapping from mnemonics to instructions. Also it would be nice to know if an instruction takes an argument. And from the snippet above it could be deduced that some instructions require more data than others: it is highly likely that instructions that try to address memory will require arguments wide enough to cover memory bus width while instructions writing to or reading from registers will be limited to register width. Sounds trivial so far. In the simplest case¹ it is easy to imagine a dictionary with mnemonics as keys and dictionaries or tuples holding all necessary data:

instructions = {"lda": {"value": "42",  # hex value represented as a string for
                                        # conveniece.
			"size": 2,      # instruction size in bytes.
                       }
                ...
               }

Note that this example implicitly uses Python – it is much easier to do text processing in high-level languages. The general principle stays the same no matter which tool one uses, but some tools are easier to use than others.

The next thing that comes to mind is addresses. Not the ones supplied directly, but rather the ones which are defined by labels:

        jmp start
start:  inc

start is a label and it must be handled properly i.e. its position respective to some point in memory like the beginning of the program must be determined and the label must be substituted with it whenever it appears as an argument. The code above immediately shows a problem: the label is used before it appears in the program. Thus to be able to produce runnable binary an assembler must first produce intermediate result to compute addresses of labels and then use both these intermediate and labels-to-addresses mapping to build the binary. Depending on how intermediates look it could be either two-pass assembler when assembler first collects all labels and computes all addresses and then walks along the program once again and uses precomputed addresses or one pass when a separate table is kept for storing all locations at which label’s value must be inserted. When there are no more instructions to be processed this second table is used to fill gaps in the code. Both approaches are pretty straightforward.

To make it a charm one needs one more component to a minimal assembler. The question I have been omitting so far is how exactly input program should be parsed. Now it is time to deal with it. There is as usual more than one way to do it. First of all I must mention the venerable ad-hoc approach which was used now and then in the olden days. Just mash together some string splitting with substring search, add regular expressions to taste and here it is: fast to hack in, brittle, hard to maintain and expand. I definitely won’t recommend it unless you want to experiment with some artificial obstacles like mandatory field sizes a-la FORTRAN-77.

The other option is to do it right: write a parser. It will take care of input and return a nice list of objects corresponding to each line’s content. A nice and thorough example could be found here. The only problem with such approach is that it feels a bit too fundamental, especially given the fact that this problem has been solved multiple times. So if you are not looking for an excuse for diving into a comprehensive guide to parsing (like this) and reemerging half a year later with a lovingly-crafted industrial-grade parse-em-all parser you would be better off if you search for existing solutions and spend a few hours assessing your finds. Actually I did exactly this: I have played a little bit with my homemade parser and then decided that I would rather settle for some intermediate, not necessarily industrial-strength solution. And after some consideration I decided to continue my experiments with pyparsing.

The strength of the library (and, alas, its weakness, but that is a subject for a different discussion) lies in how seamlessly one could embed parser definition in her code. No code discontinuity, no messy EBNFs with ‘E’s understood slightly differently than in the other nice library. Actually the initial results impressed me so much that I will not resist the urge and publish the entire parser definition for simple assembler right here, right now. Lo and behold:

from pyparsing import alphas, alphanums, delimitedList, Combine, Group, LineEnd
from pyparsing import Literal, Optional, OneOrMore, SkipTo, Word, ZeroOrMore

COMMENT_SYMBOL = Literal(';').suppress()
LABEL_END = Literal(":").suppress()
comment = Combine(COMMENT_SYMBOL + SkipTo(LineEnd()))("comment")
identifier = Word(alphas + "_", alphanums + "_")
mnemonic = Word(alphas, alphanums)("mnemonic")
args = delimitedList(Word(alphanums+'_'), ',')("args")
label = Combine(identifier + LABEL_END)("label")
codeline = (Optional(label)
            + mnemonic
            + Optional(~LineEnd() + args)
            + Optional(comment))("codeline")
line = codeline | comment

Nice thing about parsing with pyparsing is that even if you don’t have previous exposure to it you will nevertheless be able to quickly get an idea of what is going on. Yet what I believe is even more important is the fact that these definitions stay reader-friendly. When I turned back to my own parser after a couple of months of other activities I had to spend quite some time refamiliarizing myself with my implementation. With pyparsing it took me about a minute to read and understand what this piece is about. The only thing left now is to apply line parser to some assembly code and deal with result. line is a parser which is capable of parsing program lines consisting of parts described above. To get parse results now is trivial:

parse_results = [line.parseString(x) for x in program]

parse_results for parser described above can have attributes label, mnemonic, args and comment. Actually an attempt to get any named field will result in parse_results returning a string which will be empty in case nothing was successfully matched.

Having a list of parsed elements it is trivial to take a list of instructions, compute proper addresses for all labels and then substitute mnemonics for operation codes, addresses for labels and keep constants intact:

def assemble(program):
    return second_pass(*first_pass(parse(program)))

def parse(t):
    return [line.parseString(x) for x in t.split('\n')]

def first_pass(code):
    lt = LabelTable()
    for line in code:
        if line.codeline and line.codeline.label:
            lt.add_entry(line.codeline.label)
        if line.codeline and line.codeline.mnemonic:
            lt.advance_address(line.codeline.mnemonic)
    return code, lt.table

def second_pass(code, address_table):
    out = []
    for line in code:
        cmd = mnemotable.get(line.mnemonic)
        if cmd is not None:
            out.append(cmd['value'])
            for arg in line.args:
                arg = address_table.get(arg, arg)
                out.append(_shape(str(arg).zfill((cmd['size']-1)*2)))
    return " ".join(out)

def _shape(x):
    return x[:2] + (' ' if len(x) == 4 else '') + x[2:]

You can see that the code is crude and straightforward, the most complex part of it is code for formatting results. LabelTable is an auxiliary class which stores label locations and advances memory location. It is trivial to come up with some implementation for it, the one I used I placed in Appendix in order to save space.

Calling assemble() with a program now will result in a string of space-delimited pairs of characters representing hexadecimal values, so the job is done, right? Not yet. Assembler as described is as useless as it is simple. The other most important part that was omitted from the previous discussion is the question how one can abstract some common patterns in the code? Consider, for example assemblers for MIPS that have pseudoinstructions which look like regular ones to the user, but consist of several CPU instructions under the hood. Arguably the most important part of a DIY assembler built for research purposes is an ability to handle such pseudinstructions and preferably some other abstractions. For instance it would be nice to have say an ability to use libraries which get included at assembly time. Such libraries could provide fixed-point arithmetics routines an could depend on libraries actually implementing basic arithmetics (in case it is needed). Thus I found myself in desperate need of a macroprocessor. Once again I could have used existing tools like m4, but decided to roll my own instead.

At first macroprocessors might seem like scary magic. That’s not true, macroprocessors in essence are rather unsophisticated devices, especially ones that I was interested in. In fact all I needed was to be able to include some pieces of code and to substitute some higher-level entries with actual code. Both this activities are quite similar. Let me start with includes first as they seem the lowest hanging fruit.

Macroprocessor should run before assembler, thus it is its job to expand entries like INCLUDE foo.asm when it sees one. This could be achieved by following the next algorithm: push all the lines in a program to a stack, pop a line, if it does not start with ‘include’ then push it to output queue, if it does then push lines from file to include on top of program stack, repeat while there is nothing left on program stack. Of course errors must be handled properly and it also helps to keep a list of already included files at hand since even several such files are capable of breeding include loops. It could be tempting to use this mechanism for directly substituting code, but it is better not to do so since it would require overcomplicated and error-prone loop detection mechanism. It is much better to rely on macro definitions for this job.

Macro definition could look like this:

MACRO foo bar
      lda bar
ENDMACRO

First of all, it starts with word MACRO which is reserved specifically for this purpose. One would have to strike it out of her lexicon when working with this macroprocessor. Then goes macro name followed by named arguments to macro. Then, starting on the next line goes macro body – the code which would get substituted later when macro names appears in code. Finally ENDMACRO finishes macro definition.

When processing input macroprocessor maintains a table of known macro names. When MACRO appears in input stream macro processor processes everything till the end of line to create an entry for this table consisting of macro name and macro arguments, then reads everything till ENDMACRO and stores obtained code in the table. When reading in other tokens macroprocessor checks if token matches to any known macro name and if yes expands this macro and pushes expansion result on top of input. This allows one to use macros inside macros which is quite handy when building on top of a minimalistic architecture.

Macro substitution is slightly more involved than includes: macro body could contain labels, so each time a macro gets expanded all known labels must be mangled to avoid surprise jumps during run time. To achieve this labels must be known to the macroprocessor which, in turn, means, that macro code must be partially parsed to obtain all label names in use. When label names are known mangling could be done as trivially as substituting label name substrings with label names extended with line number on which a macro appeared. Arguments could be expanded this way as well.

There is more to macro processing than these simple tricks including, but not limited to expression evaluation and conditional expansion which can help with combating recursive expansions (I don’t deal with it in my simple demonstration in any other way than by issuing this stern warning to be careful when working with your experimental setups). I am stopping here since any other augmentations above the most necessary feature set would mean a transition to a more complex system than I intend to work with.

To build a macroprocessor one first needs to be able to parse two special cases of INCLUDE and MACRO. This could be easily achieved by expanding parser definition like this²:

endmacro = Literal('ENDMACRO')('macroend')
filename = Word(alphanums + '_.')("filename")
include = (Literal('INCLUDE').suppress() + filename)('include')
macrostart = (Literal('MACRO').suppress() + identifier('macroname')
            + Optional(args))('macrostart')
mline = endmacro | include | macrostart | codeline | comment

Now the parser from above could be used in a preprocessor:

def preprocess(program):
    def process_include(res, included, linelist):
        fname = res.include.filename
        if fname in included:
            return linelist
        if os.path.exists(fname) and os.path.isfile(fname):
            with open(fname, 'r') as f:
                to_include = f.readlines()
            to_include.extend(linelist)
            included.append(fname)
        else:
            raise Exception("%s: file not found" % fname)
        return to_include

    def process_macro(res, macrotable, linelist):
        margs = res.macrostart.args.asList() if res.macrostart.args else []
        mbody, mlabels = [], []
        while True:
            nextline = linelist.pop(0)
            tmp = mline.parseString(nextline)
            if tmp.macroend:
                break
            if tmp.label:
                mlabels.append(tmp.label)
            mbody.append(nextline)
        macrotable[res.macrostart.macroname] = {'body': "\n".join(mbody),
            "args": margs, "labels": mlabels}
        reurn macrotable, linelist

    def process_codeline(res, macrotable, linelist, out, ctr):
        if res.codeline.mnemonic not in macrotable:
            out.append(line)
            return linelist
        if res.codeline.label:
            out.append(res.codeline.label + ': ' +'nop')
        macro = macrotable[res.codeline.mnemonic]
        body = mangle_labels(macro['body'], macro['labels'], ctr)
        body = expand_args(body, macro['args'],
            list(res.codeline.args)).split('\n')
        body.extend(linelist)
        return body

    def mangle_labels(body, labels, modifier):
        newlabels = dict((x, x+str(modifier)) for x in labels)
        for label, newlabel in newlabels.items():
            body = body.replace(label, newlabel)
        return body

    def expand_args(body, abstract_args, actual_args):
        newargs = {}
        if abstract_args:
            newargs = dict(zip(abstract_args, actual_args))
        for abstract_arg, actual_arg in newargs.items():
            body = body.replace(abstract_arg, actual_arg)
        return body

    linelist = program.split('\n')
    included, out, macrotable = [], [], {}
    ctr = 0
    while linelist:
        line = linelist.pop(0).rstrip('\n')
        ctr += 1
        res = mline.parseString(line)
        if res.include:
            linelist = process_include(res, included, linelist)
        elif res.macrostart:
            macrotable, linelist = process_macro(res, macrotable, linelist)
        elif res.comment:
            out.append(line)
        elif res.codeline:
            linelist = process_codeline(res, macrotable, linelist, out, ctr)
        else:
            raise Exception("Impossible!")
    return "\n".join(out)

Assembling with the code above now looks like assemble(preprocess(program)).

To sum it all up, the resulting tool is straightforward, simplistic, does not handle many corner cases and turns a blind eye to most of possible errors. It is just enough to go on exploring instruction sets. The only means of abstraction available is rudimentary preprocessing, but my observations indicate that this is enough for experiments. More complex tools either won’t be used in fullness or indicate a rather complex and feature rich system which is out of scope now. In case basic features described in this post are not enough to reach ones goals a much deeper intro into assemblers should be consulted, for example this one.

Appendix

LabelTable implementation:

class LabelTable(object):

    def __init__(self, mnemotable=mnemotable):
        self.mt = mnemotable
        self.table = {}
        self.current_address = 0

    def add_entry(self, entry):
        if entry in self.table:
            raise Exception("Duplicate label definition")
        address = _shape(hex(self.current_address)[2:].zfill(4))
        self.table[entry] = address

    def advance_address(self, l):
        self.current_address += self.mt[l]['size']

There is a special case of mnemonics that look like they have arguments, but in fact they don’t (consider register exchange commands). I am intentionally omitting them now. Also I am omitting the case of complex memory addressing schemes, since I believe that if a CPU supports such features then it is not a simple one anymore and by induction its instruction set is not simple as well. So quite naturally it gets excluded from the discussion. ↩
MACRO and ENDMACRO could be treated as open and close brackets for a piece of code, but in my opinion that would overcomplicate the basic tool. ↩