HACK M0/M0+/M1 - multiply and accumulate bits 32-63

I have optimized (with help) my 32-bit x 32-bit signed multiply built on MULS (64-bit result) to take 17 cycles. Likewise my CLZ takes 13 cycles and my ABS takes 3 cycles, so these people are gifted but quiet.

I just wondered how many more professional tricks are known to ARM but not officially 'out there', because these guys were amazing!

My goal is an efficient (<20 cycles) MAC that writes the top 32 bits of the 64-bit result to a register, and of course I am happy to share whatever has been OKed to share.

Well - CLZ is now down to 10-12 cycles (as long as the lookup table is in the first page of ROM).
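
To show the general idea of a table-assisted CLZ, here is a minimal sketch in C rather than Thumb assembly. It is not the tuned 10-12 cycle routine: the name `clz32`, the 16-entry `clz4_table`, and the three narrowing steps are my own assumptions for illustration, and the ROM placement of the table is not modelled.

```c
#include <stdint.h>

/* Hypothetical table: number of leading zeros in a 4-bit value. */
static const uint8_t clz4_table[16] = {
    4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
};

/* Count leading zeros of a 32-bit value; returns 32 for x == 0. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;

    /* Narrow down to the top nibble by shifting out empty upper parts. */
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }

    /* Finish with a single table lookup on the top nibble. */
    return n + clz4_table[x >> 28];
}
```

A larger table (say 256 entries) would remove one narrowing step at the cost of ROM, which is presumably where the trade-off lies.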

I only need bits 32-63 of my multiply and I wonder if anyone knows a shortcut (the full 32-bit x 32-bit --> 64-bit multiply takes 17 cycles). Karatsuba multiplication looks good until you realize most instructions only have 2 operand fields, so you cannot write MULS R0,R1,R2 (R1 x R2 with the result in R0).
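
To make the problem concrete, here is a rough sketch in C (not the 17-cycle assembly routine) of getting bits 32-63 from four 16 x 16 partial products, each of which fits in the 32-bit result that MULS gives. The names `umulhi32` and `smulhi32` are just for illustration. The signed version assumes the usual two's complement conversion when casting back to `int32_t`.

```c
#include <stdint.h>

/* Bits 32-63 of the unsigned product a * b, using only 32-bit arithmetic. */
uint32_t umulhi32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   /* contributes to bits 0-31  */
    uint32_t lh = a_lo * b_hi;   /* contributes to bits 16-47 */
    uint32_t hl = a_hi * b_lo;   /* contributes to bits 16-47 */
    uint32_t hh = a_hi * b_hi;   /* contributes to bits 32-63 */

    /* Sum everything landing in bits 16-31 to find the carry into bit 32. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    return hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* Signed variant: correct the unsigned high word for negative operands. */
int32_t smulhi32(int32_t a, int32_t b)
{
    uint32_t hi = umulhi32((uint32_t)a, (uint32_t)b);

    if (a < 0) hi -= (uint32_t)b;
    if (b < 0) hi -= (uint32_t)a;
    return (int32_t)hi;
}
```

The low product `ll` only matters through the carry. Dropping it changes `mid` by less than 2^16, so the result would be off by at most one, which might be an acceptable shortcut for some fixed-point MAC uses but is not exact.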

If anyone knows Karatsuba, speak up!
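
For reference, this is my understanding of a single Karatsuba step for 32 x 32, sketched in C and using the subtractive form so that the middle product fits in 32 bits. The function name `umulhi32_karatsuba` and the carry handling are my own assumptions, not a known-good ARM routine. It uses three multiplies instead of four, paid for with extra compares, adds and subtracts.

```c
#include <stdint.h>

/* Bits 32-63 of the unsigned product a * b via one Karatsuba step. */
uint32_t umulhi32_karatsuba(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint32_t z0 = a0 * b0;                 /* low  x low  */
    uint32_t z2 = a1 * b1;                 /* high x high */

    /* |a1 - a0| and |b1 - b0| are below 2^16, so p fits in 32 bits. */
    uint32_t da = (a1 >= a0) ? a1 - a0 : a0 - a1;
    uint32_t db = (b1 >= b0) ? b1 - b0 : b0 - b1;
    uint32_t p  = da * db;
    int same_sign = ((a1 >= a0) == (b1 >= b0));

    /* Middle term a1*b0 + a0*b1 = z2 + z0 - (a1-a0)*(b1-b0).
       It needs 33 bits: the low 32 live in z1, bit 32 lives in c. */
    uint32_t z1 = z2 + z0;
    uint32_t c  = (z1 < z2);               /* carry out of the add */
    if (same_sign) {
        c  -= (z1 < p);                    /* borrow from the subtract */
        z1 -= p;
    } else {
        z1 += p;
        c  += (z1 < p);                    /* carry out of the add */
    }

    /* The product is z2*2^32 + z1*2^16 + z0; collect bits 32-63. */
    uint32_t lo    = z0 + (z1 << 16);
    uint32_t carry = (lo < z0);

    return z2 + (c << 16) + (z1 >> 16) + carry;
}
```

Whether this beats four plain multiplies depends on how expensive MULS is on the core in question. With a single-cycle multiplier the extra bookkeeping probably costs more than the multiply it saves.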