Assembly Language tricks for M0+ processor

About 8 months ago I began writing a 32kb/s ACELP decoder & 64kb/s MP3 decoder for Audible audiobooks and I have JUST got it working on a Verilog emulator. It’s a cut-down SAMD21G18. Getting the fixed-point decoders working was hard work. A CLZ & MULA (32-bit x 32-bit → 64-bit multiply and accumulate) have been added.
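For anyone unsure what those two added instructions compute, here is a rough C model. It is only a sketch: it assumes signed operands and a 64-bit accumulator, and the names `mula` and `clz32` are made up for illustration, not taken from the actual design.

```c
#include <stdint.h>

/* Hypothetical model of the added MULA: 32-bit x 32-bit -> 64-bit product,
 * added to a 64-bit accumulator. Signed operands are an assumption. */
static int64_t mula(int64_t acc, int32_t a, int32_t b)
{
    /* Widen before multiplying so the full 64-bit product is kept. */
    return acc + (int64_t)a * (int64_t)b;
}

/* Hypothetical model of CLZ: number of leading zero bits in a 32-bit word.
 * Returns 32 for an input of zero, as the ARM CLZ instruction does. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;

    if (x == 0u) {
        return 32u;
    }
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}
```

In a fixed-point decoder the first is the inner step of a filter or dot product, and the second is typically used to normalise a value before scaling.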

I have a few tricks to speed up assembly language code that I hope might be of value to at least 1 person. If others have tricks, please share them. It would be wonderful if I could manage without the CLZ and/or MULA, but hey ho.

  1. The bottom 255 addresses can be used as ‘zero page’ since the address can be set up by an immediate.

  2. In bottom-level subroutines, stack & use LR.

  3. Store & use SP. SP has extra addressing modes.

  4. Hi registers support ADD instructions so use loop-counters that increase to 0.

  5. Use SP for ‘switch’ type instructions. All 8 lo registers and the new PC can be unstacked FAST.

  6. The tiny pipeline means that branching around 1 instruction means constant time of 2 cycles.

  7. Shift/Rotate Rd = Rd <</>> Rs has interesting properties: bits 0-7 of Rs are used, not bits 0-4.

  8. Don’t forget that SBCS subtracts with NOT of C flag.

  9. ARMv6 MULS doesn’t touch Carry or oVerflow (previous versions did).

  10. Pay attention to cache (if present), code in RAM & self-modifying code for fastest execution.
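Tricks 7 and 8 are the easiest to get wrong, so here is a small C model of what the instructions do. It is a sketch of the ARMv6-M behaviour as I understand it, not code from the decoders, and the function names `lsls_reg`, `lsrs_reg`, `sbcs` and `sub64` are invented for the example. Flag updates other than carry-out of the subtract are left out.

```c
#include <stdint.h>

/* LSLS Rd, Rs: the shift amount is the bottom BYTE of Rs (bits 0-7).
 * Amounts of 32 or more therefore give 0 rather than wrapping modulo 32. */
static uint32_t lsls_reg(uint32_t rd, uint32_t rs)
{
    uint32_t amount = rs & 0xFFu;

    return (amount >= 32u) ? 0u : (rd << amount);
}

/* LSRS Rd, Rs: same rule for logical shift right. */
static uint32_t lsrs_reg(uint32_t rd, uint32_t rs)
{
    uint32_t amount = rs & 0xFFu;

    return (amount >= 32u) ? 0u : (rd >> amount);
}

/* SBCS Rd, Rm: Rd = Rd - Rm - NOT(C), computed as Rd + ~Rm + C.
 * On exit *carry is 1 when there was NO borrow, 0 when there was one. */
static uint32_t sbcs(uint32_t rn, uint32_t rm, unsigned *carry)
{
    uint64_t wide = (uint64_t)rn + (uint64_t)(uint32_t)~rm
                  + (uint64_t)(*carry & 1u);

    *carry = (unsigned)(wide >> 32);
    return (uint32_t)wide;
}

/* 64-bit subtract as a SUBS / SBCS pair. SUBS behaves like SBCS
 * entered with C = 1, so the low half starts with carry set. */
static uint64_t sub64(uint64_t a, uint64_t b)
{
    unsigned c = 1u;
    uint32_t lo = sbcs((uint32_t)a, (uint32_t)b, &c);
    uint32_t hi = sbcs((uint32_t)(a >> 32), (uint32_t)(b >> 32), &c);

    return ((uint64_t)hi << 32) | lo;
}
```

The practical upshot: a register shift by 32 or more clears the result without needing a compare first, and an SBCS entered with C clear subtracts one extra, which is exactly what a multi-word subtract needs after a borrow.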

I’m sorry if this is all common knowledge, but when I first embarked on Thumb, I was shocked that the people who brought us the ARM 32-bit instruction set, which was brimming with clever tricks, had left me with only 8 registers that were REALLY general and the absolute minimum of functionality. I can only imagine it is off-putting to someone for whom this is the first asm they have coded. I found that ‘Hacker’s Delight’ taught me some useful stuff and that the 68000 has many similarities, so check out sites where people are still writing Amiga demos.

Lastly, it is the bus bandwidth that is the limiting factor. ARM claim the M0+ manages 0.87 MIPS/MHz but I have got up to about 0.94, given that running DMA does not stall code executing from cache. It’s VERY tricky, but it is possible to write code that looks beautiful. Just don’t look at compiled C; it isn’t pretty.

What cache?
I have also found the M0 processor to be disappointing. Thanks for your tips.
Even the ones that sound really dangerous.
See also: Qfplib: a family of floating-point libraries for ARM Cortex-M cores - a bunch of floating point functions for M0, in 1k of flash!