I'm making an assembler routine to reverse the bits of a uint32_t. I have code to reverse the bytes and also the bits. What I'd like to do now is combine the two and call the bit reverse after the byte reverse is done. It seems economical to make the bit reverser a subroutine of its own and call it on each of the four registers holding the reversed bytes.
My idea is to pass each register in turn to the subroutine by pushing it onto the stack. The reverser routine does its thing and puts the result back at the stack (frame?) location it was on entry. Caller pops the modified value back to the register it came from.
The instruction set (ATmega328P) provides no direct access to the stack via the stack pointer. Apparently the Y register is used for this - it has auto decrement/increment capability to facilitate its use as a frame pointer (a new concept to me).
Questions I haven't found an answer to, after much reading:
Is the Y register updated automagically somehow and I can just assume it'll always point to the current frame?
If not, how might I load Y with the correct address to point to the passed argument?
Assuming I understand this correctly, do I need an offset from Y to access the passed argument or, since it's only one byte will it be at zero offset?
And in an additional IDE tab, the assembler routine. My conception of what the bit reverser call would look like is behind assembler comments ( ; )
#include "avr/io.h"
.global reverse_32
reverse_32:
; Algorithm copied from AVR __builtin_swap32
eor r22, r25 ; outer byte exchange
eor r25, r22
eor r22, r25
eor r23, r24 ; inner byte exchange
eor r24, r23
eor r23, r24
; push r22 <-- argument for reverse_bits subroutine, frame pointer updated?
; rcall reverse_bits
; pop r22 <-- value returned from subroutine
; and so on for the other registers
ret
; reverse_bits:
; ld r20, Y <------------ gets caller-pushed reg from frame pointer?
; do reversing stuff
; st Y, r20 <------------ puts modified r20 back at frame pointer address?
; ret
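For checking the assembler routine against known-good values on a PC, a plain C model of the intended result may help. This is only a sketch of the idea described above (exchange the bytes, then reverse the bits within each byte); the names reverse_bits8 and reverse_32_model are made up for illustration and are not part of the routine.

```c
#include <stdint.h>

/* Reverse the bits of one byte by shifting them out of b and into r. */
static uint8_t reverse_bits8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u)); /* lowest bit of b becomes the next bit of r */
        b >>= 1;
    }
    return r;
}

/* Hypothetical reference model: exchange the bytes, then reverse each byte. */
static uint32_t reverse_32_model(uint32_t v)
{
    uint8_t b0 = (uint8_t)(v);       /* r22 on entry under avr-gcc's convention */
    uint8_t b1 = (uint8_t)(v >> 8);  /* r23 */
    uint8_t b2 = (uint8_t)(v >> 16); /* r24 */
    uint8_t b3 = (uint8_t)(v >> 24); /* r25 */

    /* b0 <-> b3 and b1 <-> b2, each byte bit-reversed on the way */
    return ((uint32_t)reverse_bits8(b0) << 24) |
           ((uint32_t)reverse_bits8(b1) << 16) |
           ((uint32_t)reverse_bits8(b2) << 8)  |
            (uint32_t)reverse_bits8(b3);
}
```

With this model, reverse_32_model(0x12345678) gives 0x1E6A2C48, which is the value the finished assembler routine should also return.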
I am. In fact that's where I began this project. However, all the '%' signs and the order-of-operands notation befuddle me (easily done). I've downloaded Jim Eli's Arduino Inline Assembly and studied that, and a bunch of the AVR pages he extracted from, among others, but the penny hasn't dropped yet.
Would inline assembly handle the stack access better than what I've attempted?
My idea is to pass each register in turn to the subroutine by pushing it onto the stack.
avr-gcc has a standard method of passing arguments (defined at the aforementioned link.) It won't normally use the stack for passing arguments or return values UNLESS there are "very many" arguments or in the case of VARARGS. (it uses registers for the first N arguments.)
You probably shouldn't deviate from that without good reason.
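As a rough illustration of that register rule, here is a small C helper that works out where an argument lands. It follows the procedure as I understand it from the avr-gcc ABI description (start at R26, round each argument's size up to an even number, subtract, and fall back to memory below R8); the function name avr_arg_low_register is invented for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: given the sizes (in bytes) of a function's arguments,
   return the register number holding the low byte of argument `index`,
   or -1 if that argument is passed in memory (or index is out of range). */
static int avr_arg_low_register(const size_t *sizes, size_t count, size_t index)
{
    int rn = 26; /* the procedure starts at R26 */

    for (size_t i = 0; i < count; i++) {
        size_t rounded = (sizes[i] + 1u) & ~(size_t)1u; /* round up to even */
        rn -= (int)rounded;
        if (rn < 8 || sizes[i] == 0)
            return -1; /* this argument and all later ones go to memory */
        if (i == index)
            return rn; /* low byte in Rn, higher bytes in Rn+1, Rn+2, ... */
    }
    return -1;
}
```

For a single uint32_t argument this gives 22, i.e. the value arrives in r22 to r25, which matches the registers the reverse_32 routine works on.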
Is the Y register updated automagically somehow and I can just assume it'll always point to the current frame?
It must be set up by the called function, IFF there are stack arguments or local variables. (I'm not actually sure how stack-based arguments are saved. I think they are just PUSHed, and then the called function sets up Y based on the function signature.)
If not, how might I load Y with the correct address to point to the passed argument?
Push r28 and r29, then copy SPH to r29 and SPL to r28. If there are local variables, adjust Y for them, then save the interrupt state, turn off interrupts, write Y back to SP, and restore the interrupt state.
do I need an offset from Y to access the passed argument or, since it's only one byte will it be at zero offset?
Yes. Fortunately, Y is one of the index registers that allows offsets (LDD Rd, Y+q). Local variables will end up starting just above Y (the AVR stack pointer always points at the next free byte), then there will be 4 bytes (or 5, for a mega with a 3-byte program counter) of saved Y and return address from the call, and then the arguments that were pushed. The offset (Atmel called it "Displacement") can be anything from 0 to 63.
Since you want Y as a "fixed" "frame pointer", you would be more likely to use a displacement than the autoincrement or decrement addressing modes.
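To make the displacement concrete, here is a toy C model of the push-based scheme from the question (not of avr-gcc's own convention). It assumes a 2-byte return address as on the ATmega328P and that PUSH stores at SP and then decrements it; the struct and function names are invented for this sketch.

```c
#include <stdint.h>

#define RAM_SIZE 64u

/* Toy model of the AVR stack: PUSH stores at SP, then decrements SP. */
struct avr_stack {
    uint8_t  ram[RAM_SIZE];
    uint16_t sp;
};

static void push8(struct avr_stack *s, uint8_t value)
{
    s->ram[s->sp] = value;
    s->sp--;
}

/* Returns q such that the pushed argument is found at Y+q once the
   subroutine has copied SP into Y. callee_saves_y selects whether the
   subroutine pushes r28/r29 before loading Y. */
static unsigned arg_displacement(int callee_saves_y)
{
    struct avr_stack s = { {0}, RAM_SIZE - 1u };

    uint16_t arg_addr = s.sp;
    push8(&s, 0xA5);      /* caller: push r22 (placeholder value) */
    push8(&s, 0x00);      /* rcall: return address, 2 placeholder bytes */
    push8(&s, 0x00);
    if (callee_saves_y) {
        push8(&s, 0x00);  /* callee: push r28 */
        push8(&s, 0x00);  /* callee: push r29 */
    }
    uint16_t y = s.sp;    /* callee: copy SPL/SPH into r28/r29 */

    return (unsigned)(arg_addr - y);
}
```

Under these assumptions the argument is at Y+5 when the subroutine saves r28/r29 first, and at Y+3 if it loads Y without saving it. Since avr-gcc expects r28/r29 to be preserved across calls, the first variant is the safer one when the routine is called from C.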
Yes. The compiler will create / destroy a stack frame for you when it's needed leaving you to focus on the interesting stuff.
Even better. Register allocation is handled by the compiler, so when a parameter is passed in a register, which the first N are, you can focus on the interesting stuff.
Tiny - meant for ATtiny processors (but works fine on ATmega processors)
Debug - meant for debugging; it's a unidirectional protocol
Knock - the protocol starts with a knock so the receiver can be interrupt driven
Bang - the data byte is bit-banged
Typically communications would be over MISO because that's already connected between the target and the programmer and the intended data flow with MISO is from target to programmer.
It's self-clocking, so it should work with the target running from 125 kHz to 16 MHz.
I got it working but, by a different route. After looking again at all the documents mentioned I discovered I didn't need the stack, or SRAM, at all - just registers working directly on the passed value.
My AVR assembler fu is still very weak but here's the result:
// ver 1.4 2/8/23
// Inline assembler experiment:
// Pass an unsigned long, and return it with
// bits swapped. Uses looping.
uint32_t a = 0x12345678;
void setup() {
Serial.begin(115200);
Serial.println(a, HEX);
uint32_t retnVar = 0;
retnVar = revtest2(a);
Serial.println(retnVar, HEX);
}
void loop() {}
__attribute__((noinline))
uint32_t revtest2(uint32_t inVar) {
/*
Reverse the bits of a 32-bit variable end for end
Bits are reversed and bytes swapped by register rotation
*/
uint8_t cntr;
asm volatile(
"ldi %[cntr],8" "\n\t" // eight iterations per byte
"L%=doagain:" "\n\t"
"ror %A0" "\n\t" // rotate bits in LSB through carry flag into MSB
"rol %D0" "\n\t" // rotate bits in MSB through carry flag into LSB
"dec %[cntr]" "\n\t"
"brne L%=doagain" "\n\t"
"ror %A0" "\n\t" // catch leftover most significant bit
"ldi %[cntr],8" "\n\t" // eight iterations per byte
"L%=doagain2:" "\n\t"
"ror %B0" "\n\t" // rotate bits in LSB through carry flag into MSB
"rol %C0" "\n\t" // rotate bits in MSB through carry flag into LSB
"dec %[cntr]" "\n\t"
"brne L%=doagain2" "\n\t"
"ror %B0" "\n\t" // catch leftover most significant bit
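: [inVar] "+r" (inVar), [cntr] "=&d" (cntr) // inVar is read and written; ldi needs r16..r31
: // no input-only operands
: "cc" // the rotates and dec change SREG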
);
return (inVar);
}
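To see why the paired ror/rol loop both reverses the bits and exchanges the two bytes, it can be stepped through on a PC. The following is only a C model of the loop above, with the carry flag kept in a variable; the names ror8, rol8, swap_reverse and revtest2_model are made up for this sketch.

```c
#include <stdint.h>

static unsigned carry; /* models the C flag in SREG */

/* ROR: carry goes into bit 7, bit 0 goes out to carry. */
static uint8_t ror8(uint8_t r)
{
    unsigned out = r & 1u;
    r = (uint8_t)((r >> 1) | (carry << 7));
    carry = out;
    return r;
}

/* ROL: carry goes into bit 0, bit 7 goes out to carry. */
static uint8_t rol8(uint8_t r)
{
    unsigned out = (r >> 7) & 1u;
    r = (uint8_t)((r << 1) | carry);
    carry = out;
    return r;
}

/* One of the two loops: bits leaving lo enter hi and vice versa. */
static void swap_reverse(uint8_t *lo, uint8_t *hi)
{
    for (int cntr = 8; cntr != 0; cntr--) { /* dec/brne leave carry alone */
        *lo = ror8(*lo);
        *hi = rol8(*hi);
    }
    *lo = ror8(*lo); /* the last bit of hi is still waiting in carry */
}

static uint32_t revtest2_model(uint32_t v, unsigned carry_in)
{
    uint8_t a = (uint8_t)v,         b = (uint8_t)(v >> 8);
    uint8_t c = (uint8_t)(v >> 16), d = (uint8_t)(v >> 24);

    carry = carry_in & 1u; /* whatever the flag happened to be on entry */
    swap_reverse(&a, &d);  /* %A0 with %D0 */
    swap_reverse(&b, &c);  /* %B0 with %C0 */

    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}
```

The model returns 0x1E6A2C48 for 0x12345678 whether the carry starts at 0 or 1: the stray bit that enters on the first ror is pushed back out by the extra ror after each loop.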
@Coding_Badly - thanks for the link. I'm not sure how it happened but looking over that code and comparing it to the other examples made a very small click in my brain. The resulting code is a lot of monkey-see monkey-do but, hey, it's a start.
I didn't realize that but it makes sense. Lots of considerations I was unaware of.
@oldcurmudgeon - Thanks for the link. Another look at 'To find the register where a function argument is passed, initialize the register number Rn with R26...' started me on the path to my answer.
Oh, I forgot something. It's neither here nor there, but the compiler/assembler apparently doesn't like the rol instruction in this bit of code. It seems to prefer an adc. I worked this up on Wokwi and it's easy there to see the compiled code. I've noticed this before in other algorithms converted to assembly.
The effect is the same but since both take one cycle and one instruction I don't get the need for the alteration.
It turns out that ROL Rn and ADC Rn, Rn are the same binary instruction. The assembler should accept either one, but the disassembler will always show ADC.
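A quick way to convince yourself is to build both opcodes from the bit patterns in the instruction set manual (ADC Rd,Rr is 0001 11rd dddd rrrr, ROL Rd is 0001 11dd dddd dddd) and compare them. The helper names below are invented for this sketch.

```c
#include <stdint.h>

/* ADC Rd,Rr: 0001 11rd dddd rrrr */
static uint16_t encode_adc(unsigned d, unsigned r)
{
    return (uint16_t)(0x1C00u | ((r & 0x10u) << 5) |
                      ((d & 0x1Fu) << 4) | (r & 0x0Fu));
}

/* ROL Rd: 0001 11dd dddd dddd, the same layout with Rd in both fields */
static uint16_t encode_rol(unsigned d)
{
    return (uint16_t)(0x1C00u | ((d & 0x10u) << 5) |
                      ((d & 0x1Fu) << 4) | (d & 0x0Fu));
}

/* 9-bit results, with bit 8 standing in for the carry out. */
static unsigned adc9(uint8_t a, uint8_t b, unsigned carry_in)
{
    return (unsigned)a + (unsigned)b + (carry_in & 1u);
}

static unsigned rol9(uint8_t x, unsigned carry_in)
{
    return ((unsigned)x << 1) | (carry_in & 1u);
}
```

For r22 both encoders give 0x1F66, and adc9(x, x, c) equals rol9(x, c) for every byte value and carry, so there is nothing for the disassembler to tell apart.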
It is "moderately amusing" how some manufacturers will leave out these "instruction aliases" (to make the instruction set appear more minimal), some will explicitly document them ("isn't this clever! A NOP is just mov r8, r8"), and some sort-of hide the equivalences (as Atmel did for AVR).
That was included in my stuff because the optimizer really wanted to clone every referenced function every time it was referenced. I was targeting ATtiny processors so all that cloning would often exceed available Flash.
The compiler has been upgraded a few times since then. I suspect forcing the compiler's hand may no longer be necessary.
avr-gcc seems to be particularly poor at low-level optimize-for-size decisions. I think perhaps it doesn't understand that on an AVR, adding two 32-bit numbers is at least 4 times bigger than adding two 8-bit numbers.
I haven't seen any signs that it's gotten better.
I can't speak to what this may mean, but I added another call to the reverser and see no difference in code size whether __attribute__((noinline)) is commented out or not.