I have optimized (with help) my 32-bit x 32-bit MULS signed multiply (64-bit result) to take 17 cycles. Likewise my CLZ takes 13 cycles and my ABS takes 3 cycles, so these people are gifted but quiet.
I just wondered how many more professional tricks are known to ARM but are not officially ‘out there’, because these guys were amazing!
My goal is an efficient (<20 cycles) MAC (with the top 64 bits of the result written to a register), and of course I am happy to share whatever has been OKed to share.
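For anyone unsure what the three routines mentioned above have to compute, here is a rough portable C sketch of their semantics. This is not the optimized code and says nothing about cycle counts. It assumes a core whose multiply only returns the low 32 bits, so the 64-bit product is built from 16-bit halves. The names `abs32`, `clz32` and `smull32` are made up for illustration.

```c
#include <stdint.h>

/* Absolute value via a sign mask (0 for x >= 0, all-ones for x < 0).
   (ux + mask) ^ mask negates ux when the mask is all-ones.
   INT32_MIN maps to 0x80000000 because the result is unsigned. */
uint32_t abs32(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t mask = (x < 0) ? 0xFFFFFFFFu : 0u;
    return (ux + mask) ^ mask;
}

/* Count leading zeros by binary search; returns 32 for x == 0. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

/* Signed 32 x 32 -> 64 multiply using only 32-bit multiplies
   (each partial product of two 16-bit halves fits in 32 bits). */
int64_t smull32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t al = ua & 0xFFFFu, ah = ua >> 16;
    uint32_t bl = ub & 0xFFFFu, bh = ub >> 16;

    uint32_t ll = al * bl;
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* Sum the middle 16-bit column; the carry ends up in mid >> 16. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    uint32_t lo  = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi  = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    /* Turn the unsigned product into the signed one: for each negative
       operand, subtract the other operand from the high word. */
    if (a < 0) hi -= ub;
    if (b < 0) hi -= ua;

    /* Assumes the usual two's-complement conversion to int64_t. */
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

The four partial products plus the sign correction are what a hand-written version has to schedule, so the sketch is mainly useful as a reference to check a faster routine against.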