Conway's Game Of Life with 128x64 graphic LCD. UPDATE: also with 128x64 SPI OLED

It's worth experimenting with the length of the 'bunch of cells' variables.

Using the 'do multiple neighbour counts at once' approach with another alive/dead CA on the Maple works out to take

.. 30 ms/gen with byte variables for the bunches
.. 24 ms/gen with uint16
.. 20 ms/gen with unsigned long (32 bit)
.. 101 ms/gen with unsigned long long (64 bit)

for updating a 256x256 cell world. Clearly, 32 bit variables are the fastest on the Maple's 32 bit ARM, even though you need to deal with twice as many of them as you would with 64 bit ones.
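
For anyone who hasn't seen the 'multiple neighbour counts at once' trick, here is a minimal sketch of one way to do it for Life with 32 bit bunches. It is not the code behind the timings above; the name life_word is made up, and it assumes bit i of each word is the cell in column i, with everything off the ends of the word treated as dead. The eight neighbour bitplanes are added up with bitwise half adders, so all 32 counts are worked out together.

```c
#include <stdint.h>

/* Next state of 32 cells at once, given the words for the row above,
   the current row and the row below. Hypothetical helper, 32 columns only. */
static uint32_t life_word(uint32_t above, uint32_t cur, uint32_t below)
{
    /* The eight neighbour bitplanes, lined up so that bit i of each one
       is a neighbour of the cell stored in bit i of cur. */
    const uint32_t n[8] = {
        above << 1, above, above >> 1,
        cur << 1,          cur >> 1,
        below << 1, below, below >> 1
    };
    uint32_t s0 = 0, s1 = 0, s2 = 0;   /* per-cell count bits; s2 sticks at "4 or more" */

    for (int i = 0; i < 8; i++) {
        uint32_t c0 = s0 & n[i];       /* carry out of bit 0 */
        s0 ^= n[i];
        uint32_t c1 = s1 & c0;         /* carry out of bit 1 */
        s1 ^= c0;
        s2 |= c1;                      /* saturate: Life never needs counts above 3 */
    }

    /* Alive next generation if count == 3, or count == 2 and already alive. */
    return s1 & ~s2 & (s0 | cur);
}
```

A horizontal blinker in bits 1 to 3 (0x0E) turns into a vertical one: life_word gives 0x04 for the row holding the blinker and for the rows just above and below it.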

(Ooh, adding a test for whether the neighbourhood bunches are all zero - or, for this CA, all ones, which would be pointless for Life - and taking a shortcut when they are brings the byte / uint16 / unsigned long times down to 23 ms / 19 ms / 9 ms. That 9 ms is over 100 gens/second for 256x256. Although the longer sizes trigger the test less frequently, they save more time when they do.)
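
A rough sketch of what that all-zero test could look like for Life, again only as an illustration: next_word and step_world are made-up names, the world is assumed to be just 32 columns wide (one word per row) and the edges are treated as dead. If the bunch and the bunches above and below it are all zero, nothing can be born there, so the result is written straight out without doing the counting.

```c
#include <stdint.h>

/* Hypothetical bit-parallel Life kernel for 32 cells (saturating adder). */
static uint32_t next_word(uint32_t above, uint32_t cur, uint32_t below)
{
    const uint32_t n[8] = { above << 1, above, above >> 1, cur << 1,
                            cur >> 1, below << 1, below, below >> 1 };
    uint32_t s0 = 0, s1 = 0, s2 = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t c0 = s0 & n[i];
        s0 ^= n[i];
        s2 |= s1 & c0;                 /* s2 sticks once the count reaches 4 */
        s1 ^= c0;
    }
    return s1 & ~s2 & (s0 | cur);      /* count == 3, or count == 2 and alive */
}

/* Advance a 32-column world by one generation.
   Returns how many rows took the all-zero shortcut. */
static int step_world(const uint32_t *in, uint32_t *out, int rows)
{
    int skipped = 0;
    for (int y = 0; y < rows; y++) {
        uint32_t above = (y > 0) ? in[y - 1] : 0;
        uint32_t cur   = in[y];
        uint32_t below = (y < rows - 1) ? in[y + 1] : 0;

        if ((above | cur | below) == 0) {   /* empty neighbourhood: stays empty */
            out[y] = 0;
            skipped++;
            continue;
        }
        out[y] = next_word(above, cur, below);
    }
    return skipped;
}
```

With a lone blinker in row 2 of an 8 row world, rows 1 to 3 get the full calculation and the other five rows take the shortcut.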

Ooh2: If you test for being all zero, it looks like the bunch length is almost irrelevant for Life. 'Obviously' 64 bits stays by far the worst, but starting from a random 50% alive state, bytes are a small fraction quicker than uint16s, which are a tiny fraction quicker than unsigned longs. There's enough stable/blinking debris that the longer bunches are all zero much less of the time, so the overhead of the test wipes out more of the savings.
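
To get a feel for how quickly the 'all zero' chance drops off with bunch length, something like the following could be run over a world stored as packed bytes. It is only a rough stand-in for the real test: count_zero_bunches is a made-up name, and it looks at each bunch on its own rather than at the bunch plus its neighbours above and below.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many groups of `width` consecutive bytes are entirely zero.
   Any bytes left over at the end that don't fill a whole group are ignored. */
static size_t count_zero_bunches(const uint8_t *cells, size_t nbytes, size_t width)
{
    size_t zeros = 0;
    for (size_t i = 0; i + width <= nbytes; i += width) {
        uint8_t any = 0;
        for (size_t j = 0; j < width; j++)
            any |= cells[i + j];       /* non-zero if any cell in the group is alive */
        if (any == 0)
            zeros++;
    }
    return zeros;
}
```

For example, a single live cell somewhere in 8 bytes leaves 7 of the 8 one-byte bunches empty, but only 3 of the 4 two-byte bunches and 1 of the 2 four-byte bunches.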