Hello all,
I am coding an MP3 / ACELP decoder on the Arduino M0 Pro. I now have my 32-bit x 32-bit -> 64-bit signed multiply routine, MULL (17 cycles), but there are two other code fragments that are cornerstones of the work (around 35% of CPU time is taken up by these 3 fragments). I am looking for any tips people have for:
MULSHIFT32 - multiply together two 32-bit numbers and return the top 32 bits of the result.
CLZ - count leading zeros.
MULSHIFT32 - the issue is that all the methods I know are just the MULL with the low 32 bits discarded.
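To make the MULSHIFT32 problem concrete, here is a rough C sketch of the usual 16-bit-halves approach, which is the kind of thing a 32x32->32 MULS forces on you. The function names are made up for illustration, and it assumes right-shifting a negative signed value is an arithmetic shift (true for GCC/Clang on ARM, but implementation-defined in C). It only needs the carry out of the low partial product, not the low word itself, which is where a few cycles might be saved over a full MULL.

```c
#include <stdint.h>

/* Reference: what MULSHIFT32 should return, using a 64-bit product. */
static int32_t mulshift32_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Sketch: signed high word from four 16x16 partial products.
 * Assumes arithmetic right shift of negative values. */
static int32_t mulshift32_halves(int32_t x, int32_t y)
{
    uint32_t x0 = (uint32_t)x & 0xFFFFu;   /* low halves, unsigned */
    uint32_t y0 = (uint32_t)y & 0xFFFFu;
    int32_t  x1 = x >> 16;                 /* high halves, signed  */
    int32_t  y1 = y >> 16;

    uint32_t w0 = x0 * y0;                 /* low x low: only its top 16 bits matter */
    int32_t  t  = x1 * (int32_t)y0 + (int32_t)(w0 >> 16);
    int32_t  w1 = t & 0xFFFF;              /* middle 16 bits carried forward */
    int32_t  w2 = t >> 16;

    w1 = (int32_t)x0 * y1 + w1;            /* second cross product */
    return x1 * y1 + w2 + (w1 >> 16);      /* high x high plus carries */
}
```

None of the intermediate sums can overflow 32 bits, so each line maps onto a MULS plus a couple of shifts and adds.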
CLZ - all of the methods I know of use a divide-and-conquer strategy, branching around ADD instructions. I think the examples are all Thumb v4 compatible, but Thumb v6 offers the extend and reverse instructions, which haven't been well examined. I also note that the MULS instruction no longer leaves C & V in unpredictable states, so it may offer alternatives.
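For comparison with the branching divide-and-conquer versions, here is a C sketch of a branch-free CLZ that leans on the multiplier instead: smear the top set bit downwards, multiply by a de Bruijn constant, and use the top 5 bits of the product as a table index. This is the well-known "bit twiddling" log2 trick rather than anything specific to the M0+, and the function and table names are my own. Whether it beats the branching code depends on the part having the single-cycle multiplier and on the 32-byte table being cheap to reach.

```c
#include <stdint.h>

/* floor(log2(v)) for v of the form 2^(k+1) - 1, indexed by the
 * top 5 bits of v * 0x07C4ACDD. */
static const uint8_t log2_debruijn_table[32] = {
     0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
};

/* Sketch: count leading zeros without data-dependent branches.
 * Returns 32 for an input of 0. */
static unsigned clz32_debruijn(uint32_t x)
{
    uint32_t v = x;

    /* Smear the highest set bit into every lower position. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    /* One multiply and a shift pick a unique table slot. */
    unsigned log2 = log2_debruijn_table[(uint32_t)(v * 0x07C4ACDDu) >> 27];

    /* x == 0 lands on slot 0 (log2 = 0), so add 1 to get 32. */
    return 31u - log2 + (unsigned)(x == 0u);
}
```

The smear step is ten instructions on its own, so the extend and reverse instructions might still be the better way to narrow down the byte first.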
Although the M0+ only has a 2-cycle branch shadow, I would still rather have constant time. I keep thinking that extend and reverse could quickly work out which byte (8-bit lump) contains the first 1 bit. One of the areas yet to be explored (according to Joseph Yiu) is the use of the instruction format:
RORS Rd,Rd,Rd
i.e. Rd is rotated by the value in Rd[7:0].
WHY it uses the bottom 8 bits rather than the bottom 5 bits, I really don't know. Nobody at ARM seems to know, BUT in another context it allows certain 32-bit numbers to be placed into a register in 32 bits rather than 48 bits (LDR Rd,[PC,#imm] + 32-bit value in literal pool).
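As a way to experiment with the self-rotate on a PC, here is a small C model of what I understand RORS Rd,Rd,Rd to compute: the rotate amount is read from the bottom 8 bits, and since it is a 32-bit rotate only that amount modulo 32 affects the result. The function name is hypothetical and the model ignores the flags entirely, so treat it as a sketch for exploring which constants are reachable, not as a statement of the architecture.

```c
#include <stdint.h>

/* Model of the result of RORS Rd,Rd,Rd (flags not modelled). */
static uint32_t rors_self(uint32_t rd)
{
    uint32_t n = rd & 0xFFu;   /* rotate amount comes from Rd[7:0]      */
    uint32_t m = n & 31u;      /* a 32-bit rotate repeats every 32 bits */

    if (m == 0u)
        return rd;             /* avoid the undefined 32-bit shift in C */
    return (rd >> m) | (rd << (32u - m));
}
```

For example, a register holding 0x00000110 rotates itself by 16 and ends up as 0x01100000, which is the kind of constant that cannot be loaded with a single MOVS.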