Building Computing Machines from Ground Up: BrainF–k Interpreter Optimization

revision 0.9.0

One of the things that strikes when looking at BF interpreter from the previous post churning data is how slow it is. It is so slow not only because the underlying language is painfully sequential, but also because so many steps are wasted on performing each instruction. Moreover, some subcircuits make good sense in Logisim, but are non-trivial when implemented in hardware. Good news is that the original design is very naive and thus could be optimized. In this post I will show how this could be done.

First approximation

All optimizations start similarly: one stares critically at a system under consideration for some time. After a while patterns become visible, those which could be amended, updated or even removed. Let us do the same with the BF interpreter from the previous post.

Let us look at schematics. A big DMX capable to route signals to eight subelements is used to route to just four. The original logic was to have one route to nowhere to play it extra safe, but is it really needed? Apparently no, since logical zeroes on control inputs (which are cared about by the control unit) mean nothing happens to the counters. So we can easily eliminate one of address lines for counter. To keep changes minimal let us just remap 010 from ‘no-counter’ to ‘bracket counter’. The table looses one column: CTRSEL:3 becomes CTRSEL:2. Not much, but still slightly better than before.

While we are here let us do one more minor update – let us get rid of counters which act as latches. I used them in such way in the first design on purpose: to avoid bringing too many entities into play. Now it is time to replace them with better fitting components – registers. The schematic now looks like this:

Computer cleanup. DMX is made smaller, counters acting as registers are replaced with actual registers.

Note that additional next step latch turns out to be redundant as well.

One phase is good, two are better

Let us continue looking closer at the control table. One pattern that keeps popping up everywhere is the use of guardian entries: in order to make sure proper control signals are present on component inputs each step which contains any of counters clock in high state is prepended by exactly the same step but with clock in low state. This is both very visual and very inefficient: if somehow removed it could provide two-fold increase in operation speed.

In fact a hint to a solution to this problem is already present on the schematic. Observe that next instruction propagation is delayed with a separate register. Clock signal for this register is generated by inverting basic clock source which is used to clock next state latch. Thus both rising edge and falling edge of clock signal are used to trigger different substeps within one step. But clock signal used in these circuit is a very low-frequency one, and quite a long time passes between edges. This time is much longer than it is needed for control signals to stabilize¹. It would have been nice if a rising edge could be used several times with some delay first to update next step latch and then, when control signals have propagated to counters and stabilized used once again to clock counters. In other words, if clock signal could be split in, say, two with first signal distributed unchanged and the second distributed after been stored somewhere for some period of time, delayed somehow. And that is exactly where propagation delay comes to help us. Recall that it takes time for a signal to traverse any gate. Now consider two inverters placed one after another – they don’t change the signal itself, but introduce a delay two times bigger than delay of a single inverter. Thus in the simplest form clock source could be fed directly to next step latch and simultaneously to a delay line built out of inverters. The output of the delay line would be used as second phase of clock signal.

Unfortunately as of version 2.7.1 Logisim does not provide tools to fully explore the concept of gate delay: apparently it is not available to the end user as a parameter of a gate and it is not quite clear from the documentation to which extent it is accounted for in simulation. Yet it looks like that for the very basic experiments it still works well enough so I will keep using it for now. For less visual, but more in-depth simulation I suggest using other tools, like Qucs which while less visual allow for much more precise control of what is happening in the simulation.

From the perspective of control table the change looks tremendous: almost half of rows disappears! Also CLK data line is not needed any more and can be dropped. The last but not least, now the maximum number of steps in any particular command is not more than four which allows me to remove one address line and one data line for next step. Also I have lost the ability to latch flags only after updating PC because previously I relied on CLK routed to PC for this and now there is no more CLK line. This is not a big deal since all I have to do now is to add an OR gate to detect when PC is getting updated and use its output to enable flags latch update. Note that now output control enables display to react on clock input and auxiliary latch in HLT line goes away.

The resulting circuit is presented below:

Second phase of clock signal is introduced. CLK line is removed from the circuit and control words in control ROM, auxiliary latches are removed, PC update detector is added.

And here is corresponding control table.

To sum up as a result of optimization many components go away (which is pretty good) and control table shrinks in size in two which is very good. But what else could be done?

Further optimization

New control table requires a ROM with 512 words of 10 bits. While the address bus size is important only if one decides to build her own ROM unit – contemporary commercial ones don’t seem to be made for less than 2^13 addresses, data bus width could matter: eight bit are easy to come by, everything above that should be built from two or more ROM units. This looks especially wasteful when one needs ten bits. This makes one wonder if it is possible to shrink control word even more.

Control table entry consists of next step which takes two bits and can’t be shrunk any more – some commands need three steps, two control lines can’t be shrunk as well – all four states are used. Since we have four different counters, CTRSel can’t be shrunk too. Thus six bits are already occupied. Four bits are left for toggling flags and controlling I/O. The good news four bits mean sixteen possible combinations and only five of these possible combinations are actually used, the bad news five states need three bits to represent them:

Num    MDHT 
-------------------------------------------------------------------------------
000    0000 
001    0001 
010    0010 
011    1000 
100    1100
-------------------------------------------------------------------------------
M - toggle motion flag       D - toggle direction flag
H - halt                     T - enable terminal

Unfortunately every combination must be present and can’t be reduced further, but at least one could easily reduce four last bits to three and use a decoder to produce actual states:

A decoder is used to convert state number to state value.

Unfortunately this still leaves us with nine bits which is one bit more than convenient for a commercially available unit.

Actually now I am standing at bifurcation point. On one hand I can try harder and shrink control word even more and eventually fit it in eight bits, on the other I can stop tuning control word length and consider instead even bigger speedup. I will act like this: I will start exploring the second option and later will come back to the first one.

Consider this: since counters are independent and all are triggered by phase 2 of clock pulse why update them sequentially when they could be updated simultaneously? This would result in abolition of a demultiplexer and two control lines for counter selection, and introduction of six new control lines which go directly to counters. This way most of the commands will get executed in one step with exception of input command which on one hand introduces huge delays no matter how implemented, on the other is rare and thus does not affect an average program. Now this is a very good speedup for this architecture: it is hard to go below one clock pulse per instruction so I think I’ll settle with it for now, especially given the fact that I started with about four clock cycles per instruction. The resulting schematics looks like this:

A circuit for expanded control word. DMX was removed, next step latch had to stay because input instruction still requires two steps. FLatch has to be clocked with falling edge of phase 2 to avoid thrashing.

Now the control tables becomes compact once again:

              ADDRESS                       STORED VALUE
          +--------------+          +---------------------------+
Command   CODE  ZMBD  STEP             STATE VECTOR     NEXT STEP
-------------------------------------------------------------------------------
 +         000  X0XX   0            0000  01 00 00 01       0                 
           000  X1X0   0            0000  01 00 00 00       0                 
           000  X1X1   0            0000  11 00 00 00       0                 

 -         001  X0XX   0            0000  01 00 00 11       0                 
           001  X1X0   0            0000  01 00 00 00       0                 
           001  X1X1   0            0000  11 00 00 00       0                 

 >         010  X0XX   0            0000  01 00 01 00       0                 
           010  X1X0   0            0000  01 00 00 00       0                 
           010  X1X1   0            0000  11 00 00 00       0                 

 <         011  X0XX   0            0000  01 00 11 00       0                 
           011  X1X0   0            0000  01 00 00 00       0                 
           011  X1X1   0            0000  11 00 00 00       0                 

 .         100  X0XX   0            0001  01 00 00 00       0                 
           100  X1X0   0            0000  01 00 00 00       0                 
           100  X1X1   0            0000  11 00 00 00       0                 

 ,         101  X0XX   0            0010  00 00 00 00       1                  
           101  X0XX   1            0000  01 00 00 10       0             
           101  X1X0   0            0000  01 00 00 00       0                 
           101  X1X1   0            0000  11 00 00 00       0                 

 [         110  00XX   0            0000  01 00 00 00       0                  
           110  10XX   0            1000  01 00 00 00       0                  
           110  X1X0   0            0000  01 01 00 00       0                  
           110  X101   0            0000  11 11 00 00       0                  
           110  X111   0            1100  01 00 00 00       0                  

 ]         111  00XX   0            1100  11 00 00 00       0                  
           111  10XX   0            0000  01 00 00 00       0                 
           111  X1X1   0            0000  11 01 00 00       0                  
           111  X100   0            0000  01 11 00 00       0                  
           111  X110   0            1000  01 00 00 00       0                  
-------------------------------------------------------------------------------
FLAGS = [Zero, Motion, Bracket match, Direction]
STATE VECTOR = [(TOGGLE_MOTION_FLAG, TOGGLE_COUNT_DIRECTION, HLT, TERMCLK),
                (PCL, PCC), (BRAL, BRAC), (ADDRL, ADDRC), (ML, MC)]

Now the computer runs at a decent speed and it is good time to come back to the crossroad and once again try and shrink control word size. And the shrinking would be achieved by utilizing an observation that there are sixteen distinctive control words in the entire table. Think of it. Each word is large, but there are so few of them that they don’t occupy the entire range of which the control bus is capable. Thus we can sacrifice clarity a little and introduce yet another mapping from a four bit number (just enough for the number of states at hand) to control lines state:

Num    State word
-------------------------------------------------------------------------------
0000   0000 01 00 00 00  0
0001   0000 01 00 00 01  0
0010   0000 01 00 00 10  0
0011   0000 01 00 00 11  0
0100   0000 01 00 01 00  0
0101   0000 01 00 11 00  0
0110   0000 01 01 00 00  0
0111   0000 01 11 00 00  0
1000   0000 11 00 00 00  0
1001   0000 11 01 00 00  0
1010   0000 11 11 00 00  0
1011   0001 00 00 00 00  1
1100   0010 00 00 00 00  1
1101   1000 01 00 00 00  0
1110   1100 01 00 00 00  0
1111   1100 11 00 00 00  0
-------------------------------------------------------------------------------
STATE VECTOR = [(TOGGLE_MOTION_FLAG, TOGGLE_COUNT_DIRECTION, HLT, TERMCLK),
                (PCL, PCC), (BRAL, BRAC), (ADDRL, ADDRC), (ML, MC), NEXT_STEP]

Then this mapping could be used to populate control table, output of which is fed to a device which contains the mapping above. It could be either conventional ROM as I used in the circuit for the demonstration purpose, or a device similar to the decoder described above. This allows us to both avoid the use of extra components in control units and still have all the benefits of wide control words.

Final circuit looks like this:

A final circuit which utilizes both wide control words and compact storage. Wide control words are stored in separate device with just four address bits.

Control table for it could be easily derived from expanded control table and substitution table. Circuits for this post could be found in a usual place

Afterword

This post has finalized the basics of development of BrainF–k computer: the design is relatively performant and at the same time still easy to understand. This design is ready to be translated to be implemented in hardware yet to do that I have to deal with some debt first, namely I have to discuss various ways to implement memory units and displays. After that everything will be ready to build a real BrainF–k computer.

Signals don’t need to be stabilized in Logisim, but here I am referring to what happens in a real circuit. ↩