Assembler: passing subroutine arguments on the stack

dougp · February 5, 2023, 5:01pm

I'm making an assembler routine to reverse the bits of a uint32_t. I have code to reverse the bytes and also the bits. What I'd like to do now is combine the two and call the bit reverse after the byte reverse is done. It seems economical to make the bit reverser a subroutine of its own and call it on each of the four registers holding the reversed bytes.

My idea is to pass each register in turn to the subroutine by pushing it onto the stack. The reverser routine does its thing and puts the result back at the stack (frame?) location it was on entry. Caller pops the modified value back to the register it came from.

The instruction set (ATMega328P) provides no direct access to the stack via the stack pointer. Apparently the Y register is used for this - it has auto decrement/increment capability to facilitate its use as a frame pointer (a new concept to me).

Questions I haven't found an answer to, after much reading:

Is the Y register updated automagically somehow and I can just assume it'll always point to the current frame?
If not, how might I load Y with the correct address to point to the passed argument?
Assuming I understand this correctly, do I need an offset from Y to access the passed argument or, since it's only one byte will it be at zero offset?

Example code:

extern "C" {
  uint32_t reverse_32(uint32_t);
}

void setup() {
  Serial.begin(115200);
  delay(250);
 
  uint32_t arg1 = 0xDEADBEEF;
  Serial.println();
  Serial.print(arg1, HEX);
  Serial.print("  bytes reversed =  ");
  arg1 = reverse_32(arg1); // call assembler routine to reverse endianness
  Serial.println(arg1, HEX);
}

void loop() {
}

And in an additional IDE tab, the assembler routine. My conception of what the bit reverser call would look like is behind assembler comments ( ; )

#include "avr/io.h"

.global reverse_32

reverse_32:

; Algorithm copied from AVR __builtin_swap32

eor r22, r25 ; outer byte exchange
eor r25, r22
eor r22, r25
eor r23, r24 ; inner byte exchange
eor r24, r23
eor r23, r24

; push r22  <-- argument for reverse_bits subroutine, frame pointer updated?
; rcall reverse_bits
; pop r22   <-- value returned from subroutine
; and so on for the other registers
ret

; reverse_bits:
; ld r20, Y  <------------ gets caller-pushed reg from frame pointer?
; do reversing stuff
; st Y, r20  <------------ puts modified r20 back at frame pointer address?
; ret

Thanks for looking.
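
For checking results on a PC, the plan described above (swap the bytes, then reverse the bits inside each byte) can be sketched in plain C++. This is only a reference model, not the assembler routine itself; the names reverse_bits8, swap_bytes32 and reverse_bits32 are made up for illustration.

```cpp
#include <cstdint>

// Reverse the bits of a single byte (hypothetical helper).
uint8_t reverse_bits8(uint8_t b) {
    uint8_t out = 0;
    for (int i = 0; i < 8; ++i) {
        // Shift the result left and append the lowest remaining input bit.
        out = static_cast<uint8_t>((out << 1) | (b & 1));
        b = static_cast<uint8_t>(b >> 1);
    }
    return out;
}

// Swap the byte order of a 32-bit value (the job of the eor triples above).
uint32_t swap_bytes32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00UL) |
           ((v << 8) & 0x00FF0000UL) | (v << 24);
}

// Full 32-bit bit reversal = byte swap + per-byte bit reversal.
uint32_t reverse_bits32(uint32_t v) {
    uint32_t swapped = swap_bytes32(v);
    uint32_t out = 0;
    for (int byte = 0; byte < 4; ++byte) {
        uint8_t b = static_cast<uint8_t>(swapped >> (8 * byte));
        out |= static_cast<uint32_t>(reverse_bits8(b)) << (8 * byte);
    }
    return out;
}
```

With the test value from the sketch, swap_bytes32(0xDEADBEEF) gives 0xEFBEADDE and reverse_bits32(0xDEADBEEF) gives 0xF77DB57B, which is handy for comparing against whatever the assembler version prints.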

oldcurmudgeon · February 5, 2023, 6:32pm

You may want to read
https://gcc.gnu.org/wiki/avr-gcc

Coding_Badly · February 5, 2023, 6:33pm

Are you open to using inline assembly?

dougp · February 5, 2023, 7:03pm

I am. In fact that's where I began this project. However, all the '%' signs and the operand-order notation befuddle me (easily done). I've downloaded Jim Eli's Arduino Inline Assembly and studied that, along with a bunch of the AVR pages he extracted from, among others, but the penny hasn't dropped yet.

Would inline assembly handle the stack access better than what I've attempted?

dougp · February 5, 2023, 7:04pm

That's already among my bookmarks.

westfw · February 6, 2023, 2:27am

My idea is to pass each register in turn to the subroutine by pushing it onto the stack.

avr-gcc has a standard method of passing arguments (defined at the aforementioned link.) It won't normally use the stack for passing arguments or return values UNLESS there are "very many" arguments or in the case of VARARGS. (it uses registers for the first N arguments.)

You probably shouldn't deviate from that without good reason.
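
As a rough illustration of that register convention, here is a small C++ model of the rule as I read it on the avr-gcc wiki page: start at R26, round each argument size up to even, subtract, and fall back to the stack once the register number drops below R8. The function name avr_arg_registers is made up, and varargs functions (which pass everything on the stack) are not modelled.

```cpp
#include <cstddef>
#include <vector>

// For each argument size in bytes, return the number of the register that
// holds its low byte, or -1 if the argument would be passed on the stack.
// Assumption: this follows the fixed-argument rule described on the
// avr-gcc wiki; it is a model, not compiler code.
std::vector<int> avr_arg_registers(const std::vector<std::size_t>& sizes) {
    std::vector<int> regs;
    int rn = 26;             // allocation starts just below R26
    bool in_memory = false;  // once one argument spills, the rest do too
    for (std::size_t size : sizes) {
        if (!in_memory) {
            int rounded = static_cast<int>(size + (size & 1));  // round up to even
            rn -= rounded;
            if (rn < 8 || size == 0) {
                in_memory = true;
            }
        }
        regs.push_back(in_memory ? -1 : rn);
    }
    return regs;
}
```

For a single uint32_t argument this gives 22, i.e. the value arrives in r22..r25, which matches the registers used in reverse_32 above.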

Is the Y register updated automagically somehow and I can just assume it'll always point to the current frame?

It must be set up by the called function, IFF there are stack arguments or local variables. (I'm not actually sure how stack-based arguments are saved. I think they are just PUSHed, and then the called function sets up Y based on the function signature.)

If not, how might I load Y with the correct address to point to the passed argument?

Push R28 and R29, then copy SPH to R29 and SPL to R28. If you need local variables, subtract their size from Y, then save the interrupt state, turn off interrupts, write Y back to SP, and restore the interrupt state.

do I need an offset from Y to access the passed argument or, since it's only one byte will it be at zero offset?

Yes. Fortunately, Y is one of the index registers that allows offsets (LDD Rd, Y+q). Local variables will end up starting just above Y (SP points at the next free byte, so the first used byte is Y+1), then there will be 4 bytes of saved Y and return address from the call (5 on parts with a 3-byte PC, such as the ATmega2560), and then the arguments that were pushed. The offset (Atmel called it "Displacement") can be anything from 0 to 63.
Since you want Y as a "fixed" "frame pointer", you would be more likely to use a displacement than the autoincrement or decrement addressing modes.
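
To make the offset concrete, here is a toy C++ model of the hardware stack for the push / rcall / pop idea from the first post. It assumes a 2-byte return address (as on the ATmega328P), no local variables, and a callee that saves R28/R29 before copying SP into Y. AvrStack and arg_displacement are made-up names for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the AVR stack: SP points at the next free byte,
// PUSH stores and then decrements, POP increments and then loads.
struct AvrStack {
    uint8_t  ram[64] = {};
    uint16_t sp = 63;  // top of this toy RAM

    void push(uint8_t v) { assert(sp > 0);  ram[sp--] = v; }
    uint8_t pop()        { assert(sp < 63); return ram[++sp]; }
};

// Returns the displacement q for which "ldd Rd, Y+q" would read a one-byte
// argument pushed by the caller just before the rcall.
int arg_displacement() {
    AvrStack s;
    s.push(0xA5);                                        // caller: push r22
    uint16_t arg_addr = static_cast<uint16_t>(s.sp + 1); // where it landed
    s.push(0x34);                                        // rcall: return address,
    s.push(0x12);                                        //        2 bytes here
    s.push(0x00);                                        // callee: push r28
    s.push(0x00);                                        // callee: push r29
    uint16_t y = s.sp;                                   // in r28,SPL / in r29,SPH
    int q = arg_addr - y;
    assert(s.ram[y + q] == 0xA5);                        // Y+q really is the argument
    return q;
}
```

Under those assumptions arg_displacement() comes out as 5: Y+1 and Y+2 hold the saved R29/R28, Y+3 and Y+4 the return address, and Y+5 the pushed argument.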

Coding_Badly · February 6, 2023, 2:38am

Yes. The compiler will create / destroy a stack frame for you when it's needed leaving you to focus on the interesting stuff.

Even better. Register allocation is handled by the compiler so, when a parameter is passed in a register, which the first N are, you can focus on the interesting stuff.

Give TinyDebugKnockBang a gander. It's overwhelming only because of the size. I suspect the core method TinyDebugKnockBangClass::sendByte is going to be of the most interest.

The code uses some assembly macros that are defined simply by including a global function; which is super handy.

If you have questions just let me know.

dougp · February 6, 2023, 6:00am

Looks like it's back to the books. Thanks!

That's a doozy of a name.

Coding_Badly · February 6, 2023, 6:29am

It's fairly literal...

Tiny - meant for ATtiny processors (but works fine on ATmega processors)
Debug - meant for debugging; it's a unidirectional protocol
Knock - the protocol starts with a knock so the receiver can be interrupt driven
Bang - the data byte is bit-banged

Typically communications would be over MISO because that's already connected between the target and the programmer and the intended data flow with MISO is from target to programmer.

It's self-clocking so it should work with the target running from 125 kHz to 16 MHz.

dougp · February 10, 2023, 4:44pm

I got it working but, by a different route. After looking again at all the documents mentioned I discovered I didn't need the stack, or SRAM, at all - just registers working directly on the passed value.

My AVR assembler fu is still very weak but here's the result:

// ver 1.4 2/8/23

// Inline assembler experiment:
// Pass an unsigned long, and return it with
// bits swapped. Uses looping.

uint32_t a = 0x12345678;

void setup() {
  Serial.begin(115200);
  Serial.println(a, HEX);
  uint32_t retnVar = 0;
  retnVar = revtest2(a);
  Serial.println(retnVar, HEX);   
}

void loop() {}


__attribute__((noinline))

uint32_t revtest2(uint32_t inVar) {
  /*
    Reverse the bits of a 32-bit variable end for end
    Bits are reversed and bytes swapped by register rotation
  */
  uint8_t cntr;


  asm volatile(

    "ldi %[cntr],8"      "\n\t" //  eight iterations per byte

    "L%=doagain:"        "\n\t"
    "ror %A0"            "\n\t"  // rotate bits in LSB through carry flag into MSB
    "rol %D0"            "\n\t"  // rotate bits in MSB through carry flag into LSB
    "dec %[cntr]"        "\n\t"
    "brne L%=doagain"    "\n\t"
    "ror %A0"            "\n\t" // catch leftover most significant bit

    "ldi %[cntr],8"      "\n\t" //  eight iterations per byte

    "L%=doagain2:"       "\n\t"
    "ror %B0"            "\n\t"    // rotate bits in LSB through carry flag into MSB
    "rol %C0"            "\n\t"    // rotate bits in MSB through carry flag into LSB
    "dec %[cntr]"        "\n\t"
    "brne L%=doagain2"   "\n\t"
    "ror %B0"            "\n\t" // catch leftover most significant bit

    : [inVar] "+r" (inVar), [cntr] "=&d" (cntr)  // inVar is modified in place; ldi needs r16..r31 ("d")
    :                                            // no input-only operands
  );

  return (inVar);
}
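
The rotate-through-carry trick in the loops above can be modelled on a PC to see why it works. This is a sketch in portable C++, not the AVR code: ror and rol here are plain functions that mimic the instructions, and revtest2_model is a made-up name. The initial carry is a parameter because the asm never clears it.

```cpp
#include <cstdint>

// Model of AVR "ror": rotate right through carry.
void ror(uint8_t& reg, bool& carry) {
    bool out = (reg & 0x01) != 0;
    reg = static_cast<uint8_t>((reg >> 1) | (carry ? 0x80 : 0x00));
    carry = out;
}

// Model of AVR "rol": rotate left through carry.
void rol(uint8_t& reg, bool& carry) {
    bool out = (reg & 0x80) != 0;
    reg = static_cast<uint8_t>((reg << 1) | (carry ? 0x01 : 0x00));
    carry = out;
}

// Eight ror/rol rounds plus one final ror: each byte ends up holding the
// other byte's bits in reverse order.
void reverse_pair(uint8_t& lo, uint8_t& hi, bool& carry) {
    for (int i = 0; i < 8; ++i) {
        ror(lo, carry);
        rol(hi, carry);
    }
    ror(lo, carry);  // the "leftover" bit from the last rol
}

uint32_t revtest2_model(uint32_t v, bool initial_carry) {
    uint8_t a = static_cast<uint8_t>(v);
    uint8_t b = static_cast<uint8_t>(v >> 8);
    uint8_t c = static_cast<uint8_t>(v >> 16);
    uint8_t d = static_cast<uint8_t>(v >> 24);
    bool carry = initial_carry;
    reverse_pair(a, d, carry);  // outer bytes, like %A0 / %D0
    reverse_pair(b, c, carry);  // inner bytes, like %B0 / %C0
    return static_cast<uint32_t>(a) | (static_cast<uint32_t>(b) << 8) |
           (static_cast<uint32_t>(c) << 16) | (static_cast<uint32_t>(d) << 24);
}
```

The model gives 0x1E6A2C48 for 0x12345678 whether the carry starts set or clear: the stray carry bit that enters on the first ror is the one pushed back out by the final ror.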

@Coding_Badly - thanks for the link. I'm not sure how it happened but looking over that code and comparing it to the other examples made a very small click in my brain. The resulting code is a lot of monkey-see monkey-do but, hey, it's a start.

I didn't realize that but it makes sense. Lots of considerations I was unaware of.

@oldcurmudgeon - Thanks for the link. Another look at 'To find the register where a function argument is passed, initialize the register number Rn with R26...' started me on the path to my answer.

Thanks, all, for the help.

dougp · February 10, 2023, 6:40pm

Oh, I forgot something. It's neither here nor there but, the compiler/assembler apparently doesn't like the rol instruction in this bit of code. It seems to prefer an adc . I worked this up on WOKWI and it's easy there to see the compiled code. I've noticed this before in other algorithms converted to assembly.

The effect is the same but since both take one cycle and one instruction I don't get the need for the alteration.

000002d8 <_Z8revtest2m>:
    "brne L%=doagain2"   "\n\t"
    "ror %B0"            "\n\t" // catch leftover most significant bit

    :
    :[inVar] "r" (inVar), [cntr] "r" (cntr)
  );
 2d8:	20 e0       	ldi	r18, 0x00	; 0
 2da:	28 e0       	ldi	r18, 0x08	; 8

000002dc <L549doagain>:
 2dc:	67 95       	ror	r22
 2de:	99 1f       	adc	r25, r25
 2e0:	2a 95       	dec	r18
 2e2:	e1 f7       	brne	.-8      	; 0x2dc <L549doagain>
 2e4:	67 95       	ror	r22
 2e6:	28 e0       	ldi	r18, 0x08	; 8

000002e8 <L549doagain2>:
 2e8:	77 95       	ror	r23
 2ea:	88 1f       	adc	r24, r24
 2ec:	2a 95       	dec	r18
 2ee:	e1 f7       	brne	.-8      	; 0x2e8 <L549doagain2>
 2f0:	77 95       	ror	r23

  return (inVar);
 2f2:	08 95       	ret

westfw · February 11, 2023, 12:47am

It turns out that ROL Rn and ADC Rn, Rn are the same binary instruction. The assembler should accept either one, but the disassembler will always show ADC.
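
A quick way to convince yourself is to build the opcode by hand. The sketch below uses the ADC encoding from the AVR instruction set manual (0001 11rd dddd rrrr); encode_adc and encode_rol are made-up helper names.

```cpp
#include <cassert>
#include <cstdint>

// ADC Rd, Rr is encoded as 0001 11rd dddd rrrr.
uint16_t encode_adc(unsigned d, unsigned r) {
    assert(d < 32 && r < 32);
    return static_cast<uint16_t>(0x1C00u |
                                 ((r & 0x10u) << 5) |  // r4 -> bit 9
                                 ((d & 0x10u) << 4) |  // d4 -> bit 8
                                 ((d & 0x0Fu) << 4) |  // d3..d0 -> bits 7..4
                                 (r & 0x0Fu));         // r3..r0 -> bits 3..0
}

// ROL Rd has no encoding of its own: it is ADC Rd, Rd.
uint16_t encode_rol(unsigned d) {
    return encode_adc(d, d);
}
```

encode_rol(25) is 0x1F99 and encode_rol(24) is 0x1F88, which are the "99 1f" and "88 1f" byte pairs (low byte first) next to the adc lines in the listing above.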

It is "moderately amusing" how some manufacturers will leave out these "instruction aliases" (to make the instruction set appear more minimal), some will explicitly document them ("isn't this clever! A NOP is just mov r8, r8"), and some sort of hide the equivalences (as Atmel did for AVR).

dougp · February 11, 2023, 1:00am

Well I'll be darned!

Coding_Badly · February 11, 2023, 7:28am

Excellent!

Yup. Been there.

That was included in my stuff because the optimizer really wanted to clone every referenced function every time it was referenced. I was targeting ATtiny processors so all that cloning would often exceed available Flash.

The compiler has been upgraded a few times since then. I suspect forcing the compiler's hand may no longer be necessary.

westfw · February 11, 2023, 9:49am

avr-gcc seems to be particularly poor at low-level optimize-by-size decisions. I think perhaps it doesn't understand that on an AVR, adding two 32-bit numbers is at least 4 times bigger than adding two 8-bit numbers.
I haven't seen any signs that it's gotten better.
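
To picture where the factor of 4 comes from: an 8-bit CPU has to do a 32-bit add one byte at a time, carrying between steps (on AVR that is one add followed by three adc). A portable C++ sketch of the same carry chain, with a made-up function name:

```cpp
#include <cstdint>

// 32-bit addition done the way an 8-bit ALU has to do it:
// four byte-wide additions linked by a carry.
uint32_t add32_bytewise(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    unsigned carry = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned byte_a = static_cast<unsigned>((a >> (8 * i)) & 0xFFu);
        unsigned byte_b = static_cast<unsigned>((b >> (8 * i)) & 0xFFu);
        unsigned sum = byte_a + byte_b + carry;  // at most 0x1FF
        result |= static_cast<uint32_t>(sum & 0xFFu) << (8 * i);
        carry = sum >> 8;                        // carry into the next byte
    }
    return result;  // carry out of the top byte is dropped, as with uint32_t
}
```

Each loop iteration stands for one instruction operating on one register pair, so a wider type than needed costs both flash and cycles.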

gcjr · February 11, 2023, 10:15am

have you considered this approach?

unsigned
swap (
    unsigned x )
{
    unsigned N = 8 * sizeof(unsigned);
    unsigned y    = 0;
    unsigned msk0 = 1;
    unsigned mskN = 1 << (N-1);

    for (unsigned n = 0; n < N; n++, msk0 <<= 1, mskN >>= 1)
        y |= x & msk0 ? mskN : 0;

    return y;
}

dougp · February 11, 2023, 10:44pm

I can't speak to what this may mean but, I added another call to the reverser and see no difference in code size whether __attribute__((noinline)) is commented out or not.

dougp · February 11, 2023, 11:39pm

@gcjr I believe I understand what's supposed to happen - I just can't get the code to cooperate. I'm probably missing something in translation.

// https://forum.arduino.cc/t/assembler-passing-subroutine-arguments-on-the-stack/1086531/16


uint32_t bigVar = 0xfffffe7f;

void setup() {
  Serial.begin(115200);
  Serial.println(bigVar, HEX);
  uint32_t retnVar = bigVar;
  retnVar = swap(retnVar);
  Serial.println(retnVar, HEX);
  retnVar = swap(0xac051320);
}

void loop() {}

unsigned long swap(unsigned long x) {
  unsigned N = 8 * sizeof(unsigned long);

  unsigned long y = 0;
  unsigned long msk0 = 1;
  unsigned long mskN = 1 << (N - 1);
  // bit swap algorithm
  for (unsigned n = 0; n < N; n++, msk0 <<= 1, mskN >>= 1) {
    y |= x & msk0 ? mskN : 0;
    // Serial.print(x & msk0 ? mskN : 0);
  }
  return y;
}

/* gc algo
unsigned
swap (
    unsigned x )
{
    unsigned N = 8 * sizeof(unsigned);
    unsigned y    = 0;
    unsigned msk0 = 1;
    unsigned mskN = 1 << (N-1);

    for (unsigned n = 0; n < N; n++, msk0 <<= 1, mskN >>= 1)
        y |= x & msk0 ? mskN : 0;
   return y;
}*/

gcjr · February 12, 2023, 12:59am

mskN = 1L << (N-1);

// https://forum.arduino.cc/t/assembler-passing-subroutine-arguments-on-the-stack/1086531/16

uint32_t bigVar = 0xfffffe7f;
uint32_t retnVar;

void setup() {
    Serial.begin(115200);

    retnVar = swap(2);
}


void loop() {}

unsigned long swap(unsigned long x) {
    unsigned N = 8 * sizeof(unsigned long);

    unsigned long y = 0;
    unsigned long mskN = 1L << (N-1);
    unsigned long msk0 = 1;

    Serial.print   (x, HEX);
    Serial.print   ("  ");
    Serial.print   (N);
    Serial.print   ("  ");
    Serial.print   (msk0, HEX);
    Serial.print   ("  ");
    Serial.print   (mskN, HEX);

    for (unsigned n = 0; n < N; n++, msk0 <<= 1, mskN >>= 1) {
        y |= x & msk0 ? mskN : 0;
        // Serial.print(x & msk0 ? mskN : 0);
    }
    Serial.print   ("  ");
    Serial.println (y, HEX);
    return y;
}


/* gc algo
unsigned
swap (
unsigned x )
{
    unsigned N = 8 * sizeof(unsigned);
    unsigned y    = 0;
    unsigned msk0 = 1;
    unsigned mskN = 1 << (N-1);

    for (unsigned n = 0; n < N; n++, msk0 <<= 1, mskN >>= 1)
    y |= x & msk0 ? mskN : 0;
    return y;
}*/

dougp · February 12, 2023, 2:18am

Yes, that fixed it.

That algorithm does translate into a lot of assembler instructions and clock cycles.
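
For anyone landing here later: the 1L matters because int is only 16 bits on AVR, so a plain 1 << 31 is evaluated as a 16-bit int before it is ever widened to unsigned long. A variant of the mask loop that sidesteps the issue by using fixed-width types throughout might look like this (the function name is made up):

```cpp
#include <cstdint>

// Mask-walking bit reversal with explicit 32-bit types, so the result does
// not depend on how wide int happens to be on the target.
uint32_t reverse_bits_masks(uint32_t x) {
    uint32_t y = 0;
    uint32_t lowMask  = UINT32_C(1);        // walks up from bit 0
    uint32_t highMask = UINT32_C(1) << 31;  // walks down from bit 31
    for (uint8_t n = 0; n < 32; ++n, lowMask <<= 1, highMask >>= 1) {
        if (x & lowMask) {
            y |= highMask;
        }
    }
    return y;
}
```

With the values used earlier in the thread, 2 becomes 0x40000000 and 0xFFFFFE7F becomes 0xFE7FFFFF.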
