Fast maths for Cortex M0+ CPU for MP3 decoder

Hi,
Writing an MP3 decoder on the Uno M0 is JUST possible but you need to get just a few bits of code optimized to prove that you have the cycles to get the job done. For that reason I'm sharing these:

/***************************************

  • Unsigned 32 x 32 -> 64 bit multiply. *
    ***************************************/
    // Factor0 = r0
    // Factor1 = r1
    // Result = r0:r1

UMUL:
uxth r2,r0 //Factor0 lo [0:15]
lsrs r0,r0,#16 //Factor0 hi [16:31]
lsrs r3,r1,#16 //Factor1 hi [16:31]
uxth r1,r1 //Factor1 lo [0:15]

mov r4,r1 //Copy Factor1 lo [0:15]

muls r1,r2 //Factor1 lo * Factor0 lo
muls r4,r0 //Factor0 lo * Factor0 hi
muls r0,r3 //Factor0 hi * Factor1 hi
muls r3,r2 //Factor1 hi * Factor0 lo

lsls r2,r4,#16 //(Factor1 lo * Factor0 hi)<<16
lsrs r4,r4,#16 //(Factor1 lo * Factor0 hi)>>16
adds r1,r2 //(Factor1 lo * Factor0 lo) + (Factor1 lo * Factor0 hi)<<16
adcs r0,r4 //(Factor0 hi * Factor1 hi) + (Factor0 lo * Factor0 hi)>>16

lsls r2,r3,#16 //(Factor1 hi * Factor0 lo)<<16
lsrs r3,r3,#16 //(Factor1 hi * Factor0 lo)>>16
adds r1,r2 //(Factor1 lo * Factor0 lo) + (Factor0 lo * Factor0 hi)<<16 + (Factor1 hi * Factor0 lo)<<16
adcs r0,r3 //(Factor0 hi * Factor1 hi) + (Factor0 lo * Factor0 hi)>>16 + (Factor1 hi * Factor0 lo)>>16

bx lr

/************************************
*Signed 32 x 32 -> 64 bit multiply. *
************************************/
// Factor0 = r0
// Factor1 = r1
// Result = r0:r1

SMUL:
uxth r2,r0 //Factor0 lo [0:15]
asrs r0,r0,#16 //Factor0 hi [16:31]
asrs r3,r1,#16 //Factor1 hi [16:31]
uxth r1,r1 //Factor1 lo [0:15]

mov r4,r1 //Copy Favtor1 lo [0:15]

muls r1,r2 //Factor1 lo * Factor0 lo
muls r4,r0 //Factor0 lo * Factor0 hi
muls r0,r3 //Factor0 hi * Factor1 hi
muls r3,r2 //Factor1 hi * Factor0 lo

lsls r2,r4,#16 //(Factor1 lo * Factor0 hi)<<16
asrs r4,r4,#16 //(Factor1 lo * Factor0 hi)>>16
adds r1,r2 //(Factor1 lo * Factor0 lo) + (Factor1 lo * Factor0 hi)<<16
adcs r0,r4 //(Factor0 hi * Factor1 hi) + (Factor0 lo * Factor0 hi)>>16

lsls r2,r3,#16 //(Factor1 hi * Factor0 lo)<<16
asrs r3,r3,#16 //(Factor1 hi * Factor0 lo)>>16
adds r1,r2 //(Factor1 lo * Factor0 lo) + (Factor0 lo * Factor0 hi)<<16 + (Factor1 hi * Factor0 lo)<<16
adcs r0,r3 //(Factor0 hi * Factor1 hi) + (Factor0 lo * Factor0 hi)>>16 + (Factor1 hi * Factor0 lo)>>16

bx lr

/********************************

  • Unsigned square 32² -> 64 bit *
    ********************************/
    //Input = r0
    //Result = r0:r1

USQR:
lsrs r1,r0,#16 //input [16:31]>>16
uxth r0,r0 //input [0:15]

mov r2,r1 //copy [16:31]>>16

muls r2,r0 //input [0:15] * input[16:31]
muls r1,r1 //input [16:31]>>16²
muls r0,r0 //input [0:15]²

lsrs r3,r2,#15 //(input [0:15] * input[16:31])>>15
lsls r2,r2,#17 //(input [0:15] * input[16:31])<<17
adds r0,r2 //input [0:15]² + (input [0:15] * input[16:31])<<17
adcs r1,r3 //(input [0:15] * input[16:31])<<17)

bx lr

/******************************

  • Signed square 32² -> 64 bit *
    ******************************/

SSQR
asrs r1,r0,#16 //input [16:31]>>16
uxth r0,r0 //input [0:15]

mov r2,r1 //copy [16:31]>>16

muls r2,r0 //input [0:15] * input[16:31]
muls r1,r1 //input [16:31]>>16²
muls r0,r0 //input [0:15]²

lsrs r3,r2,#15 //(input [0:15] * input[16:31])>>15
lsls r2,r2,#17 //(input [0:15] * input[16:31])<<17
adds r0,r2 //input [0:15]² + (input [0:15] * input[16:31])<<17
adcs r1,r3 //(input [0:15] * input[16:31])<<17)

bx lr

/*******************************

  • Unsigned 32-bit square-root. *
    *******************************/

USQRT:
movs r1,#1 //set initial adder to $80000000
lsls r1,#30 //

movs r2,#0 //Set initial result to 0.
.2b:
adds r3,r2,r1
lsrs r2,#1
cmp r0,r3
bcc .1f
subs r0,r3
adds r2,r1
.1f:
lsrs r1,#2
bne .2b
movs r0,r2

bx lr

/*************************************

  • Absolute value of a 32-bit number. *
    *************************************/

FASTABS:
asrs r1, r0, #31 //[1]extend sign-bit down to bits 0-30
eors r0, r1 //[1]negate register if negative
subs r0, r1 //[1]value += 1 if negated to return absolute

bx lr

/**********************

  • Count Leading zeros *
    **********************/
    //r0 = register to count lead zeros

CLZ:
mov r1,#32 //[1]set all bits full (32 leading zeros)

lsrs r2,r0,#16 //[1] test high 16 bits
beq .f1 //[1/2] if zero, jump forward
subs r1,r1,#16 //[1] otherwise decrement our counter (result)
mov r0,r2 //[1] and keep new value */
.1f
lsrs r2,r0,#8 //[1] test bits 31..23 or 15..8
beq .2f //[1/2] if zero, jump forward
subs r1,r1,#8 //[1] otherwise decrement our counter (result)
mov r0,r2 //[1] and keep new value
.2f
ld r2,=.table //[2/0] point to our look-up table (if table is at $00000004)
ldrb r0,[r2,r0] //[2] convert byte to a count between 0 and 8 (ldrb R0[PC,#4] means base inclusinve)
adds r0,r0,r1 //[1] add the number of bits we've already counted (8, 16 or 24)

bx .lr

//** Macro to define the multiplies

.macro rep_byte byte,count
.rept \count

.byte \byte

.endr
.endm

.macro .1f

//Natural ->LogP table generation

.1:
rep_byte 8,1 /* table entry 0 /
rep_byte 7,1 /
table entry 1 /
rep_byte 6,2 /
table entry 2 /
rep_byte 5,4 /
table entry 3 /
rep_byte 4,8 /
table entry 4 /
rep_byte 3,16 /
table entry 5 /
rep_byte 2,32 /
table entry 6 /
rep_byte 1,64 /
table entry 7 /
rep_byte 0,128 /
table entry 8 */

.endm

I guess. Please use code tags when posting code.
Aren't you missing some colons after labels (like SSQR)? (is this ARM assembler, or gas?)
It would be good to make the routines compatible with the C EABI calling convention (I think that only means that you need to save r4 in a couple of places. And a bunch of directives to get proper "thumb" symbol defintions.)

".1f" is defined multiple times. Is that what you meant to write? I thought "local labels" (in gas) were like:

USQRT:
      movs   r1,#1         //set initial adder to $80000000
      lsls   r1,#30         //
      movs   r2,#0         //Set initial result to 0.
2:
      adds   r3,r2,r1
      lsrs   r2,#1
      cmp      r0,r3
      bcc      1f
      subs   r0,r3
      adds   r2,r1
1:
      lsrs   r1,#2
      bne      .2b
      movs   r0,r2

You might like: GitHub - WestfW/structured_gas: "Structured programming" macros for Gnu Assembler (gas) - I think the ARM version works...

Sorry for being unclear. I'm using Atmel Studio & the GCC tool chain... and your setup for an assembly language project, westfw.

I've read quite a number of papers in which each facet of the MP3 decoder is analysed to develop DSP solutions. Without a doubt it's the IDCT that is the important metric. Be it fixed or floating-point, the IDCT takes almost 50% of the processors time. This is useful because unless one can get that running sufficiently fast, there is no point in carrying on. Conveniently, the code in a a single .cpp file (dct32.cpp) of the Helix implementation. Thus far, my 4th attempt has converted 200 lines of C++ into >1800 lines of assembly language. I should explain that the the first pass is unrolled, the second pass juggled (you can JUST manage when you can use LR) and the third pass reordered. It doesn't need to use memory as temporary storage which significantly the compiler does - a LOT.

My days of SH2 coding have proved useful but I'm glad I didn't end up needing to put literal pools in the middle of the code & brach around it. It just looks ugly. The few loops involved require me to conditionally branch around 32-bit long branches. I've never programmed Thumb 'en colère' before now but I'm interested to note that the 32-bit BL instruction is ±4MB so one of the ideas I've had is to make the entire code/data relocatable. I haven't worried about the ROM & RAM footprints too much until now but I'm now considering setting myself an arbitrarily low figure of 24/8Kbytes just to see how few resources can be consumed by a software MP3 decoder.

What I have done is bought an Aurduino Due and once I've got that set up I propose to convert the compiled C++ Thumb-2 code into Thumb-1 assembly language routine by routine. Of course it's a wrench to lose the 64-bit multiply & count leading-zero functionality but I note that even in the most dense routines, 2/3 of the time is still spend performing logic operations rather than multiply & multiply-and-accumulate and so as long as the variables always fit into r0-r15 (with the 32 samples being worked on being placed on the stack) and no temporary storage, it's hand-optimization to shave off cycles.

There are a few aspects of the C++ I may need to ask questions about but I notice that simple tricks like triple-buffering can be used to reliably provide the DACs with data even though the whole thing depends on the risky just-in-time strategies. IF there proves to be occasional frames in which there are simply not enough cycles, I have been considering ways to avoid stuttering. Given that up to 32 frequency sub-bands are being recombined, it may be possible to quickly work out which are the most important to providing the audible output and gracefully losing fidelity.

From the work I've done so far, I'm confident that the full range of Audible codecs are possible. I spent a lot of time investigating the various Audible formats from various ACELP bit-rates used in the early days to the enhanced .aax (64 kbit/second MP3) used today. My main reason for exploring this is that Cortex M0+ based FLASH memory controllers are beginning to turn up (for about 25¢ each!). While in the developed world a computer or smart phone capable of playing Audible isn't a significant price barrier, in nations like India & China, it excludes the majority of people. If the PoP processor that's already built into the FLASH is capable of reading the data, outputting an MP3 stream and has JUST a but of CPU time spare; that's enough for a basic front end. I am led to believe (from personal communications with people like Bunnie (BUnnie Designs) & Michael Yiu (ARM)) that these controllers have at least 1 SIO that is used to test & set up the FLASH so the hardware to support a device could potentially be very low cost.

In an ideal world, such a device would also record & decode DAB & DAB+ (both are MP2 derived) because both of those technologies would integrate at a tiny cost and because theoretically they can send data at a much lower cost even than mobile phones. I'm thinking of kids 0-8 YO, the early stages of language development.

Interesting...

What I have done is bought an Aurduino Due and once I've got that set up I propose to convert the compiled C++ Thumb-2 code into Thumb-1 assembly language routine by routine.

Isn't CM0 a proper subset of CM3? You should just be able to compile your working C++ code with a lying "-mcpu=" and get a first approximation useful for measurement (although, it's not clear that the compiler will do as good a job generating M0 code as a careful hand-translation, I guess. I've been pretty unimpressed with M0 code generation.)

as long as the variables always fit into r0-r15

M0 only has "quick" access to some of those registers :frowning:
You have a race against time, before M3/M4 chips and/or RISC-V become "as meaninglessly cheap" as the M0... (perhaps. Are you sure that flash size isn't a limiting factor? Some of the cheap M0 chips have ... remarkably little flash, given the libraries commonly in use.)