I have been trying for a while to optimize an expensive sensor-querying and processing loop, and I am now quite close to a perfectly working solution. Replacing some small inner loops with inline assembler was the breakthrough.
My main problem now is that all assembler instructions with an immediate operand (ldi, cpi, subi, sbci, andi, ori) only work with the 16 upper registers (r16 to r31), which run out pretty fast when using 16-bit variables.
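As a minimal sketch of how the register pressure can at least be kept to the operands that really need it: in avr-gcc inline assembler, the constraint "d" restricts an operand to r16..r31, while "r" allows any register. The helper names sub_bias and halve are hypothetical, and the non-AVR branches are only there so the snippet can be compiled and checked on a host machine.

```c
#include <stdint.h>

/* Hypothetical helper: subi takes an immediate, so the operand must live
   in r16..r31. The "d" constraint tells avr-gcc exactly that. */
static inline uint8_t sub_bias(uint8_t e)
{
#ifdef __AVR__
    __asm__ ("subi %0, 32" : "+d" (e));
#else
    e = (uint8_t)(e - 32);   /* host fallback with the same result */
#endif
    return e;
}

/* Hypothetical helper: lsr has no immediate operand, so any register
   ("r") is acceptable and the upper registers stay free. */
static inline uint8_t halve(uint8_t e)
{
#ifdef __AVR__
    __asm__ ("lsr %0" : "+r" (e));
#else
    e = (uint8_t)(e >> 1);   /* host fallback with the same result */
#endif
    return e;
}
```

With this split, only operands that appear in immediate instructions compete for r16..r31; everything else may be placed in the lower half by the register allocator.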
I also observed that r28:r29 (the Y register) are reserved for the so-called "frame pointer", which points to variables on the stack. But since I avoided all such variables, the resulting .s file contains no access to the frame pointer anywhere in my function. Nonetheless, the compiler refuses to use these two registers, so they are wasted.
Is there any flag, pseudo comment, or other trick to persuade the compiler to release the frame pointer registers for general use?
Actually, I guess this will be difficult to answer without an exact explanation of what is going on. So here is the content of one small inner loop:
void lin_to_inv_quasi_log( unsigned short v )
{
    byte e = highByte( v );
    if ( e == 0 ) {
        e = lowByte( v );
        if ( e >= 128 ) {           // Most frequently used
            e >>= 1;
            e += (64-32);
        } else if ( e >= 32 ) {     // 2nd most
            e -= 32;
        } else {
            e = 0;                  // Least frequently used
        }
    } else if ( e == 1 ) {
        e = (byte)(v>>2);
        e += (128-32);              // 3rd most
    } else if ( e == 2 ) {
        e = (byte)(v>>3);
        e += (192-32);
    } else {
        e = 255;
    }
    while ( !( UCSR0A & (1<<UDRE0) ) );   // Wait until the USART data register is empty
    UDR0 = e;                             // Send the result
}
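To make the mapping easier to check off-target, here is a sketch of the same branches as a pure function that returns the byte instead of writing it to UDR0. The name quasi_log_map is hypothetical; only standard C types are used, so it compiles on a host machine.

```c
#include <stdint.h>

/* Hypothetical host-side twin of lin_to_inv_quasi_log(): same branches,
   but the result is returned instead of being sent over the USART. */
uint8_t quasi_log_map(uint16_t v)
{
    uint8_t hi = (uint8_t)(v >> 8);     /* highByte(v) */
    uint8_t lo = (uint8_t)(v & 0xFF);   /* lowByte(v)  */

    if (hi == 0) {
        if (lo >= 128) return (uint8_t)((lo >> 1) + (64 - 32));  /* 128..255 -> 96..159 */
        if (lo >= 32)  return (uint8_t)(lo - 32);                /* 32..127  -> 0..95   */
        return 0;                                                /* 0..31    -> 0       */
    }
    if (hi == 1) return (uint8_t)((uint8_t)(v >> 2) + (128 - 32)); /* 256..511 -> 160..223 */
    if (hi == 2) return (uint8_t)((uint8_t)(v >> 3) + (192 - 32)); /* 512..767 -> 224..255 */
    return 255;                                                    /* saturate */
}
```

The segments join without gaps (0..95, 96..159, 160..223, 224..255), and the output already reaches 255 at v = 760, so every input above 759 yields 255.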
A really fast lookup table would be 64 kBytes in size. If I added "if ( v > 759 ) e = 255; else e = lookup_table[v];", this would reduce the size to 760 bytes (indices 0 to 759), which would be OK.
But then the question of which assembler code is faster comes up.
For the most frequently used branch, the assembler version of my solution needs only 2 cpi, 2 br.., 1 lsr, 1 subi = 6 CPU cycles (as long as neither branch is taken; a taken branch costs 2 cycles instead of 1).
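The small table-lookup solution above takes ... uh ... 1 cpi (high byte only), 1 br.., a 16-bit addition of the table base address to v (AVR has no register-to-register word add, so this is an add/adc pair, with the sum ending up in an index register such as Z, assuming I don't need v afterwards), and 1 ld r,Z.
That's also 6 CPU cycles (1 + 1 + 2 + 2, since ld from SRAM takes 2 cycles).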
Benefits: 1) all values are computed equally fast, 2) very simple code, 3) any function can be implemented easily
Shortcomings: 1) much more flash memory is occupied, 2) an initialization function for the lookup table is necessary, 3) an additional index register is occupied
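The table variant discussed above could be sketched in C roughly like this. Only lookup_table and the "v > 759" guard come from the text; TABLE_SIZE, quasi_log_ref, init_lookup_table and quasi_log_lookup are hypothetical names, and the table is assumed to live in RAM and to be filled once at startup.

```c
#include <stdint.h>

#define TABLE_SIZE 760u   /* indices 0..759; every larger v maps to 255 */

static uint8_t lookup_table[TABLE_SIZE];

/* Reference mapping with the same branches as lin_to_inv_quasi_log(),
   used here only to fill the table. */
static uint8_t quasi_log_ref(uint16_t v)
{
    uint8_t hi = (uint8_t)(v >> 8);
    uint8_t lo = (uint8_t)(v & 0xFF);

    if (hi == 0) {
        if (lo >= 128) return (uint8_t)((lo >> 1) + (64 - 32));
        if (lo >= 32)  return (uint8_t)(lo - 32);
        return 0;
    }
    if (hi == 1) return (uint8_t)((uint8_t)(v >> 2) + (128 - 32));
    if (hi == 2) return (uint8_t)((uint8_t)(v >> 3) + (192 - 32));
    return 255;
}

/* Fill the table once, e.g. from setup(). */
void init_lookup_table(void)
{
    for (uint16_t v = 0; v < TABLE_SIZE; v++)
        lookup_table[v] = quasi_log_ref(v);
}

/* Bounds check plus a single indexed load. */
uint8_t quasi_log_lookup(uint16_t v)
{
    return (v > 759) ? 255 : lookup_table[v];
}
```

Two design notes, both to be checked against the generated assembler: padding the table to 768 entries (all 255 from index 760 on) would let the guard compare the high byte against 3 only, which matches the "1 cpi (only high byte)" count; and if the table were placed in flash instead of RAM, the load would become an lpm, which takes 3 cycles rather than 2.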