revision 1.0.0

Rule 90 for a one-dimensional cellular automaton produces a lovely Sierpinski triangle when it starts with a single non-empty cell. It defines the next state of a cell as the XOR of its two current neighbors. It is trivial to implement in many styles and languages, and while being extremely simple it provides a nice model for experimenting with implementations of distributed systems in which nodes talk to their neighbors through message passing. Here I'll describe my small experiments with implementing it in Erlang and share what could be gleaned from them.
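
For reference, here is the full rule table in Wolfram's encoding (left neighbor, cell, right neighbor); reading the bottom row as a binary number is what gives the rule its name:

neighborhood  111  110  101  100  011  010  001  000
next state     0    1    0    1    1    0    1    0     = 01011010 in binary = 90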

Reference Implementation

To make sure any experimental implementations produce correct results, let's cobble together a quick and dirty reference generator first:

-module(linlife).
-import(lists, [foreach/2, split/2, zip/2, filter/2, map/2]).
-export([go/0]).

-define(WIDTH, 25).
-define(NUMGEN, 15).
-define(DUP, lists:duplicate).
-define(PRINT, io:format).

go() ->
  Field = ?DUP(?WIDTH div 2, false) ++ [true] ++ ?DUP(?WIDTH div 2, false),
  make_step(Field, 1).

make_step(_, ?NUMGEN + 1) -> ok;
make_step(Field, Step) ->
  show(Field, Step),
  {B1, L1} = split(length(Field) - 1, Field),
  {B2, L2} = split(1, Field),
  Neighbors = zip(L1 ++ B1, L2 ++ B2),
  make_step(map(fun({NL, NR}) -> NL xor NR end, Neighbors), Step + 1).

show(Field, Iteration) ->
  ?PRINT("~-4w ", [Iteration]),
  foreach(fun(X) -> if X == true -> ?PRINT("H");
                            true -> ?PRINT(".") end end, Field),
  ?PRINT(" ~2w~n", [length(filter(fun(X) -> X end, Field))]).

The code is rather straightforward: the only public function, go(), prepares a field of WIDTH cells with the cell in the middle set to true and the others set to false, and passes the ball to make_step(). The latter prints out the field, shifts the field left and right by one element, and then loops after computing the new field by applying XOR to the shifted copies. As soon as the predefined number of steps is reached, the pattern (_, ?NUMGEN + 1) matches and the loop finishes.
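
For completeness, a typical shell session for compiling and running it could look like this:

1> c(linlife).
{ok,linlife}
2> linlife:go().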

When run it will produce the following:

1    ............H............  1
2    ...........H.H...........  2
3    ..........H...H..........  2
4    .........H.H.H.H.........  4
5    ........H.......H........  2
6    .......H.H.....H.H.......  4
7    ......H...H...H...H......  4
8    .....H.H.H.H.H.H.H.H.....  8
9    ....H...............H....  2
10   ...H.H.............H.H...  4
11   ..H...H...........H...H..  4
12   .H.H.H.H.........H.H.H.H.  8
13   H.......H.......H.......H  4
14   HH.....H.H.....H.H.....HH  8
15   .HH...H...H...H...H...HH.  8

A piece of the Sierpinski triangle, with the generation number and the number of alive cells added to each line. It takes about 8 ms to produce the image above, most of which is spent on formatting and printing, with the actual computation taking less than 250 us. It takes ~60.5 ms to compute 15k generations with this approach. I like to measure things, so below is a table with computation time as a function of world size and number of generations.

Cells \ Gens    10        100       1000   (time in seconds)
     17     4.2e-5    2.81e-4    0.00271
     33     5.3e-5    4.73e-4   0.004614
     65     9.8e-5    8.75e-4   0.008727
    129    1.71e-4   0.001609   0.016249
    257    3.52e-4   0.003294    0.03296
    513    7.84e-4    0.00668   0.065937
   1025   0.001502   0.012434   0.124597
   2049   0.002826   0.025358   0.255067
   4097   0.005658    0.05185   0.542044
   8193   0.012276   0.107905   1.067854

The table basically tells us that the more cells are processed and the more generations are computed, the longer it takes, with time growing pretty linearly in both dimensions.
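
The timings above (and below) can be collected with timer:tc/3; the exact harness is not shown here, but a minimal sketch would be something like:

%% timer:tc/3 runs linlife:go/0 once and returns the elapsed time in
%% microseconds together with the function's return value.
{TimeUs, ok} = timer:tc(linlife, go, []),
io:format("took ~p us~n", [TimeUs]).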

Apart from showing samples of slightly unusual syntax the code is very straightforward and does not allow Erlang to shine in all its glory. (A note for the erudites: yes, it could be done using bitstrings, and yes, that would be much, much faster than lists, but no, I am not using it as a reference point since the looks of the resulting code would be too much of a distraction for no gain.) Let us consider a different approach to see what else Erlang could offer.

Cellular Automata Revisited

A one-dimensional cellular automaton is extremely simple, so it is very tempting to treat it as an array and to perform array operations on it. Let's consider instead what served as the inspiration for cellular automata: cells. Cells are autonomous objects which occupy some space, exchange chemicals with the outside world through the border of this space and change their state based on what has been received. In the reduced world of a one-dimensional cellular automaton, cells emit not signalling molecules but their state to their neighbors, and receive states from those neighbors. Based on the current state and the neighbors' states the next state is computed. In the case of rule 90 the cell's own state is irrelevant and the next state is just the XOR of the neighbor states.

In terms of Erlang each cell could be represented by a process which is responsible for receiving update messages from neighbors, computing the next state and emitting updates to all interested parties – neighbor cells and possibly some monitoring process used for displaying the global state. Obviously such a process must remember who its neighbors and its supervising process are. While running it must also keep track of who has already sent an update. Thus the internal state processing could look like this:

cell_process(PID, State, Info) ->
  NewNeighborState = Info#cellinfo.neighborstates ++ [State],
  NewWaitList = delete(PID, Info#cellinfo.waitonlist),
  [NextNeighbor] = delete(PID, Info#cellinfo.neighborpids),
  if NewWaitList == [] ->
       NewState = compute_next_state(NewNeighborState),
       foreach(fun(X) -> X ! {self(), stateupdate, NewState} end,
               Info#cellinfo.neighborpids),
       Info#cellinfo.superpid ! {self(), NewState},
       NewInfo = Info#cellinfo{selfstate=NewState,
                               neighborstates=[],
                               waitonlist=Info#cellinfo.neighborpids};
     true -> 
       NewInfo = Info#cellinfo{neighborstates=NewNeighborState,
                               waitonlist=NewWaitList}
  end,
  {NextNeighbor, NewInfo}. 

compute_next_state(NeighborStates) ->
  foldl(fun(X, A) -> X xor A end, false, NeighborStates).

The internal processor takes the PID of a neighbor and that neighbor's state as input. In order to preserve purity, the internal state is carried around as another argument, which could be a record defined as follows:

-record(cellinfo, {superpid :: identifier(),  % PID of a supervisor
                   selfstate :: boolean(),
                   neighborpids :: [identifier()], 
                   neighborstates = [] :: [boolean()], 
                   % These cells have not sent an update yet:
                   waitonlist :: [identifier()]}).

After the neighbor state list is updated and the current neighbor's PID is removed from the list of members that have not reported yet, the next neighbor to wait on is determined. This is an important step, but more on it later; for now let's just remember that it has been done. The following code is rather clear as well: if every neighbor has reported, compute the new self state, send messages with the new state to the supervisor and to each neighbor, clean up the neighbor states list and repopulate the wait list with all neighbors' PIDs. If, however, not everyone has reported yet, then just update the state record. The last statement returns a tuple of the next PID to wait on and the new inner state. It is worth noting that the decoupling of rule logic from bookkeeping happens rather naturally here: to switch to a two-dimensional cellular automaton it would be enough to slightly modify the cellinfo record and supply a proper compute_next_state().
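
To illustrate that last point, a hypothetical compute_next_state/2 for a Conway-style two-dimensional rule could look like this (purely a sketch, not part of the code measured below; it takes the cell's own state plus the list of eight neighbor states):

%% Conway's rules: a live cell survives with two or three live neighbors,
%% a dead cell becomes alive with exactly three.
compute_next_state(SelfState, NeighborStates) ->
  Alive = length([S || S <- NeighborStates, S]),
  case {SelfState, Alive} of
    {true, 2} -> true;
    {_, 3}    -> true;
    _         -> false
  end.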

All is well, but who initiates processing, i.e. who calls cell_process()? It turns out that one must define several helper functions to set everything up. The general logic is this: first an initial process gets spawned with an initialization function, which passes the ball to the main code, which loops as long as it receives messages it can understand, parses them and, in turn, calls the processor. The code implementing this is presented below, followed by a discussion:

cell(SupervisorPID) when is_pid(SupervisorPID) ->
  SupervisorPID ! {self(), online},
  receive
    {SupervisorPID, init, InData} ->
      foreach(fun(X) ->
                X ! {self(), stateupdate, InData#initializer.state} end,
              InData#initializer.neighbors),
      Info = #cellinfo{superpid=SupervisorPID,
                       selfstate=InData#initializer.state,
                       neighborpids=InData#initializer.neighbors,
                       neighborstates=[],
                       waitonlist=InData#initializer.neighbors}
  end,
  cell(Info);

cell(Info) ->
  receive
    {PID, stateupdate, State} ->
       {NextNeighbor, NewInfo} = cell_process(PID, State, Info);
    stop -> NextNeighbor = 0, NewInfo = 0, exit(normal)
  end,
  cell(NextNeighbor, NewInfo).

cell(PIDToWait, Info) ->
  receive
    {PIDToWait, stateupdate, State} ->
       {NextNeighbor, NewInfo} = cell_process(PIDToWait, State, Info);
    stop -> NextNeighbor = 0, NewInfo = 0, exit("Bye!")
  end,
  cell(NextNeighbor, NewInfo).
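
The cell code above is spawned by module name (the supervisor below does spawn(cell, cell, [self()])), so it lives in a module of its own whose header is not shown; a plausible minimal header for it (my assumption) would be:

-module(cell).
-export([cell/1]).
-import(lists, [foreach/2, delete/2, foldl/3]).
%% The same cellinfo and initializer records as in the supervisor module.
-record(cellinfo, {superpid, selfstate, neighborpids,
                   neighborstates = [], waitonlist}).
-record(initializer, {state, neighbors}).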

For each cell the supervising process (which will get a bit of attention later) spawns cell() and passes it its own PID. The first thing the fresh cell does is ping the supervisor back and tell it that it has successfully spawned. Then it waits until the supervisor provides it with its desired state and the PIDs of its neighbors. As soon as the supervisor provides the initialization data, the cell messages its neighbors. This initialization scheme might look complicated, but in fact it is not – it actually reduces the amount of work the supervisor must do (it does not need any fancy preallocation to init cells immediately). After a cell has performed this initialization stage it moves to the second one, which turned out to be important. Now a fresh cell is waiting for its first update from a neighbor. Since all processes are running asynchronously we generally can't be sure who sends this update first, so the cell makes no assumptions and just waits for an update. However, as soon as the first update message is received the cell morphs one last time, and now it is waiting for an update from a very specific neighbor, i.e. the one that was not first.

And this is very important: since we can't predict which process finishes computing its next state first, we can't guarantee that a cell won't get two updates from the same neighbor in a row, which, in turn, could trigger a false state update! That's why cell_process() returns NextNeighbor – we should explicitly wait on it or the results might get funny. What is even funnier, one would occasionally get correct results, just because the processes happened to be scheduled in a favorable order. I tried running the code without this last morph and managed to get the expected results in a whopping forty-four runs out of two thousand. If nothing else, this should convince anyone that adding a little local synchronization can be rather helpful. (Now one might wonder what happens if some process still manages to send several messages before its peers. Good news – nothing disastrous! The messages won't be processed, but as long as the cell is alive they will stay in the process' mailbox, so the next time the process actually waits for updates from this particular speedy peer it will immediately get them in exactly the same order they were received.) After that the cell loops endlessly, as long as nobody sends it the atom 'stop'.
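
The property that parenthetical note relies on is Erlang's selective receive: a receive with a specific pattern skips non-matching messages but leaves them queued, in arrival order, for a later receive to pick up. A tiny shell illustration:

self() ! {a, 1},
self() ! {b, 2},
receive {b, N} -> N end,  %% returns 2; {a, 1} stays in the mailbox
receive {a, M} -> M end.  %% returns 1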

One piece is still left out – the supervisor process which orchestrates the cells. Based on the experience with the cell process it is easy to guess that it could be constructed as a succession of functions responsible for different initialization steps. Recalling the stages of a single cell, it is easy to come up with what has to be done:

  • spawn all cells;
  • wait for all cells to report their online status;
  • initialize running cells;
  • collect cell states;
  • when enough generations have reported back shut everything down.

The corresponding code might look like the following:

-module(life).
-export([run/2]).

-import(lists, [foreach/2, member/2, delete/2, foldl/3, keyreplace/4,
                seq/2, map/2, keyfind/3, sublist/2, filter/2]).
-record(cellinfo, {superpid :: identifier(),
                   selfstate :: boolean(),
                   neighborpids :: [identifier()], 
                   neighborstates = [] :: [boolean()], 
                   waitonlist :: [identifier()]}).
-record(initializer, {state :: boolean(),
                      neighbors :: [identifier()]}).

run(NumCell, NumGen) ->
   run(NumCell, [], NumCell, NumGen).

run(Num, CellMap, NumCells, NGen) when Num >= 0 ->
  NewCell = spawn(cell, cell, [self()]),
  run(Num-1, [{Num, NewCell}] ++ CellMap, NumCells, NGen);
run(_, CellMap, NumCells, NGen) -> waitonline(CellMap, 0, NumCells, NGen).

waitonline(WhomToWait, Ctr, NCells, NGen) when Ctr < NCells ->
  receive {_, online} -> waitonline(WhomToWait, Ctr + 1, NCells, NGen) end;
waitonline(CellList, Ctr, NCells, NGen) ->
  UpdatePairs = map(fun(X) -> get_neighb_pids(X, CellList, NCells) end,
                    CellList),
  init_cells(UpdatePairs, [], NCells, NGen).

init_cells([], StateMap, NCell, NGen) ->
  cell_monitor(StateMap, 0, NCell, NGen);
init_cells([NextCell | RemainingCells], StateMap, NCell, NGen) ->
  {Num, PID, Neighbors} = NextCell,
  if Num == NCell div 2 -> State = true;
     true -> State = false
  end,
  PID ! {self(), init, #initializer{state=State, neighbors=Neighbors}},
  init_cells(RemainingCells, [{Num, PID, [State]}] ++ StateMap, NCell, NGen).

cell_monitor(FieldState, Num, NCell, NGen) when Num < NCell*NGen ->
  receive _ -> ok end,
  cell_monitor(FieldState, Num + 1, NCell, NGen);
cell_monitor(FieldState, _, _NCell, _NGen) ->
  foreach(fun(X) -> {_, PID, _} = X, PID ! stop end, FieldState).

Here the code ended up being rather trivial as well: run() spawns the requested number of cells, then calls waitonline(). waitonline() just counts how many times it has been told that a cell is online, and as soon as it has heard from as many cells as there are it invokes init_cells() which, unsurprisingly, tells each cell who its neighbors are and what its initial state is. When done it passes the ball to the cell monitor.
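
One helper is not shown above: get_neighb_pids/3, which turns a {Num, PID} entry of the cell map into a {Num, PID, NeighborPIDs} triple. A minimal sketch, assuming the cells are indexed from 0 and closed into a ring, could be:

%% Hypothetical helper (not shown in the original): look up the PIDs of the
%% left and right neighbors of cell Num, wrapping around at the ends.
get_neighb_pids({Num, PID}, CellList, _NCells) ->
  Count = length(CellList),
  {_, Left}  = lists:keyfind((Num - 1 + Count) rem Count, 1, CellList),
  {_, Right} = lists:keyfind((Num + 1) rem Count, 1, CellList),
  {Num, PID, [Left, Right]}.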

A careful reader should have asked by now: "Why does the cell monitor do essentially nothing with the state updates it receives? And why is there a wildcard in the receive pattern?". Well, to answer the second one – the system under consideration is rather simple and I decided that it is ok to cut a few corners this particular time. As to why state updates are ignored, it turned out that writing decent real-time state displaying code was taking me much more time than implementing everything else, so I took a shortcut and cobbled together a Python script listening on a socket and doing all the formatting and printing. It is trivial, does not add any value to the discussion, and I decided not to include it.

Now it is time to fire up the code and see what happens. The run results can be seen in the table below:

 Cells \ Gens    10        100       1000   (time in seconds)
      17    3.97e-4    0.00587    0.03288
      33   0.001712   0.010163   0.058069
      65   0.001415   0.013891   0.099306
     129    0.00516   0.026559   0.197915
     257   0.009532   0.048242   0.407558
     513   0.016565   0.091794   0.936569
    1025   0.022423   0.202998   2.054239
    2049   0.058971   0.512124   4.535901
    4097   0.207364   1.794758  10.411121
    8193   0.675303   2.871532  27.610455

It can already be clearly seen that the computation now takes significantly longer for any population size than it did when ordinary lists were used. The reason is easy to guess: no matter how lightweight, cell processes have to be scheduled and switched, messages have to be sent, all the necessary maintenance must be done, and all these small costs sum up to significantly slow the automaton down. "But if it slows everything down so much, what is the point of this implementation?" one could justly ask. Bear with me a little more and it will eventually become clear.

Going Multinode

Consider what happens if you have to compute a fixed number of generations in a certain amount of time. If you are using the first implementation you are fine as long as your population size is below some threshold value determined by the hardware you use. But what are your options when you hit this threshold and absolutely must increase the population size? You could try running your load on better hardware, you could try tweaking your operating system settings so that it disturbs your program as little as possible, but all these options are rather limited. There is not much left to do when they are depleted and you still have to add some more cells. Somewhat surprisingly, the slow implementation can show a path to salvation.

Recall that each cell is a process now. A process communicates with other processes through their IDs. It turns out that it does not really matter whether a peer process is running in the same VM, in a different VM on another core or somewhere on the network, as long as the VM it is running on is connected to the other VMs and its ID is known. So the cell process will stay the same no matter where on Earth its peer is, and that is not an exaggeration, since a peer process could indeed physically exist on the other side of the planet as long as network latency does not matter.

The only place where a change must be introduced is the supervisor. Now it must take into consideration the distribution of cells across nodes. In this study I'll assume static node names, a static naming scheme and static connections between nodes, which may not be the case in real-world scenarios. These assumptions, as usual, keep the resulting code from being overloaded and keep the important parts from being obscured. I will have to introduce a bunch of hardcoded values, which I believe is totally fine for a synthetic study.

Before diving deep into the changes to the supervisor let me first describe the test setup. I will now run the code on two Erlang VMs: one running the supervisor and started as erl -sname thranduil on a physical host named 'mirkwood', the other named 'elrond' and running on a host named 'rivendell'. I'll hardcode these names for the sake of simplicity. Both hosts must be able to reach each other, be able to resolve each other's names and share the code. For a quick test all these prerequisites can easily be satisfied manually, but please don't forget that any real-world application would require proper automation. The nodes are real hardware ones because I've decided against running multiple virtual nodes in order to reduce possible measurement noise. In this setup 'elrond' does nothing by itself, while life:run() is called on 'thranduil'. Now to the code changes.
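
For reference, the two VMs can be started along these lines (the cookie value is my own placeholder; any shared secret works as long as both nodes use the same one):

mirkwood$  erl -sname thranduil -setcookie rule90
rivendell$ erl -sname elrond -setcookie rule90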

First of all, the two VMs must find out about each other's existence. A cheap way to help them is to make one ping the other. It is convenient to do this first thing when running the code:

run() ->
  pong = net_adm:ping('elrond@rivendell'),
  ...

Now if ‘elrond’ is reachable on ‘rivendell’ and the ping returns ‘pong’, the code will continue to execute in exactly the same manner as before. The other issue to resolve is how to start cells remotely. It turns out that this can be done easily by using spawn/4, which takes the node on which to run the function as an argument. Since I have only two nodes at hand I'll just split the cell array in two and run one half on one node and the other half on the other. The code for doing so turned out to be trivial as well:

run(CellCtr, CellMap, NumCells, NGen) when CellCtr >= 0 ->
  if CellCtr > (NumCells div 2) -> Node = node();
       true -> Node = 'elrond@rivendell'
  end,
  NewCell = spawn(Node, cell, cell, [self()]),
  ...

And that’s it. Apart from the additional work of setting up two servers, all that has to be done to distribute the load over them boils down to those five lines above. While this is really not the best way to approach load distribution, it still gives a feel for the amount of work needed. As a mental exercise, consider how the same task could be done in your language of choice.

With code written and nodes ready it is time to run measurements! Here are the run results:

 Cells \ Gens    10        100       1000   (time in seconds)
      17   0.029516   0.312973   6.528962
      33   0.035309   0.293366   4.599328
      65   0.019603   0.266519   2.888129
     129   0.016427   0.300707  22.978286
     257   0.011035   1.815778  59.821181 
     ....................... wait, what?!

Which unambiguously tells us that connecting the two nodes via a wireless network was a very bad idea. Anyway, I had to try it to see what would happen. Now let's connect the nodes with good old wires and see if the times improve:

 Cells \ Gens    10        100       1000   (time in seconds)
      17   0.006196   0.045763    0.40451
      33    0.01086   0.066265   0.455357
      65   0.003999   0.064106   0.646815
     129   0.005138   0.071591   0.612606
     257   0.006841   0.090289   0.873023
     513   0.010659   0.112242   1.319318
    1025   0.018277    0.17078   2.197606
    2049   0.038378   0.366629    4.20114
    4097   0.073783   0.914429   8.553961
    8193   0.185747   3.045522  16.233432

Now it looks more like something one would expect to see: it takes less time to do the job on two nodes than on one. For the convenience of analysis I am including a plot of the distributed run compared to the fully local run and to the naive expectation of a two-fold speedup:

[Plot: run times of the distributed setup compared to the fully local run and to a naive two-fold speedup]

These results meet expectations: there is obviously a network overhead that is especially visible on small populations, plays a smaller and smaller role on bigger ones, and eventually gives way to a speedup. The speedup is not as big as one could hope for – around 1.5 instead of the maximum of 2, but it is still clearly visible. It would have been nice to try spreading the load over more nodes, but it turned out that all I could procure in a reasonable amount of time were just two machines with a LAN port, so larger clusters will have to wait till I lay my hands on more computers with a socket for twisted pair. The code turned out to be rather trivial and compact. Of course it is not robust enough and lots and lots of possible corner cases related to sudden deaths of cells are not covered, but it is not the purpose of a POC to deal with corner cases. It should serve an illustrative purpose, which I hope it does well. It also hints at another path for improving performance: one could combine the best of both implementations, the speed of the list-based one and the scalability of the process-based one, by introducing processes that each deal with a group of cells, as sketched below.
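
A minimal sketch of that hybrid idea (my own illustration, not code that was measured here): each process would own a chunk of cells as a plain list, and only its two edge cells would cross process boundaries each generation.

%% Compute one generation for a chunk, given the edge cells received from
%% the neighboring processes; each new state is the XOR of the cells two
%% positions apart in the padded list.
chunk_step(Chunk, LeftEdge, RightEdge) ->
  Padded = [LeftEdge | Chunk] ++ [RightEdge],
  [L xor R || {L, R} <- lists:zip(lists:sublist(Padded, length(Chunk)),
                                  lists:nthtail(2, Padded))].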

Conclusion

This small study had a simple goal – to showcase a way of reasoning about distributed systems in terms of Erlang, a tool designed specifically to tackle essentially concurrent systems and streams of events. Note, however, that the demonstrated way is by no means the best use of Erlang, but a mere primer, aimed mostly at whetting the appetite and hinting at what could be done. Many aspects crucial to real-world systems were left out: proper supervision, OTP and others. I decided that those were not really important for a simple study like this and would obscure the benefits of Erlang rather than highlight them. All in all it is a complex tool which requires a certain effort to master. I also hope that I have not created any false expectations, since being a good tool for a specific set of tasks does not make it a silver bullet (which, unfortunately, does not seem to exist). While almost any task could be solved with Erlang, it is better not to forget that the same holds true for a really fast Brainfuck interpreter, so understanding when not to use something is no less important.

Afterparty

As a postscriptum I decided to share a bitstring implementation without further discussion:

-module(bitlife).
-export([run/2]).

%% Note: the field width of 33 bits is hardcoded here and matches
%% NCell = 32; it would have to change together with the cell count.
-define(OUTPUT(Y),
        io:format("~s~n", [[case X of $0->$.; $1->$H end ||
                            X <- lists:flatten(io_lib:format("~33.2.0B",
                                                             [Y]))]])).

run(NCell, NGen) ->
  HS = NCell div 2,
  BS = NCell + 1,
  << InitField:BS >> = << 0:HS, 1:1, 0:HS >>,
  ?OUTPUT(InitField),
  step(InitField, NCell, NGen).

step(_, _, 0) -> ok;
step(Field, NCell, Ctr) ->
  BS = NCell + 1,
  << First:1, Rest:NCell >> = << Field:BS >>,
  << Begin:NCell, End:1 >> = << Field:BS >>,
  << LS:BS >> = << Rest:NCell, First: 1 >>,
  << RS:BS >> = << End:1, Begin: NCell >>,
  << NewField:BS >> = << (LS bxor RS):BS >>,
  ?OUTPUT(NewField),
  step(NewField, NCell, Ctr - 1).

It is blindingly fast compared to any of the presented implementations and also looks rather alien at first glance. Despite the looks, it might be worth keeping this option in mind.

   Cells \ Gens    10        100       1000   (time in seconds)
        16     1.5e-5     9.2e-5    8.96e-4
        32     1.2e-5     9.0e-5    8.87e-4
        64     1.5e-5    1.32e-4   0.001243
       128     2.2e-5    1.68e-4   0.001627
       256     2.5e-5    2.07e-4   0.002041
       512     5.0e-5    4.26e-4   0.004243
      1024     6.9e-5    5.97e-4   0.006045
      2048    1.12e-4    9.49e-4   0.009607
      4096    1.63e-4   0.001566   0.015906
      8192    2.86e-4   0.002798    0.02803

Another interesting observation came when I tried to increase the array size on the two physical nodes some more. Instead of a speedup I got a significant slowdown and a nasty memory leak, but that is the core of a different story.