Bad veneer generation by linker

[[ I hope this is the right forum for this post, apologies if not ]]

I'm writing a function which I want to execute from RAM rather than flash. Provided the function is written in C all is well, but if it's written in assembler the linker seems to insert a bad veneer which uses ARM32 instructions (which aren't available on the RP2040) rather than the Thumb instructions used for the C version. As a result the code crashes.

Do I need to do something else to tell the linker that the veneer must use Thumb instructions, or is this a linker bug?

My minimal example code is split into three files:

badveneer.ino

#include <pico/platform.h>

extern "C" void __no_inline_not_in_flash_func(emma)();
extern "C" void __no_inline_not_in_flash_func(jane)();

void setup() {
  emma();
  jane();
}

void loop() {
}

emma.cpp

#include <pico/platform.h>

extern "C" void __no_inline_not_in_flash_func(emma)() {
}

jane.S

.thumb

.section .time_critical.jane

.global jane
.thumb_func

jane:
  bx  lr

The resulting generated code is

10002878 <setup>:
#include <pico/platform.h>

extern "C" void __no_inline_not_in_flash_func(emma)();
extern "C" void __no_inline_not_in_flash_func(jane)();

void setup() {
10002878:	b510      	push	{r4, lr}
  emma();
1000287a:	f000 f809 	bl	10002890 <__emma_veneer>
  jane();
1000287e:	f000 e804 	blx	10002888 <__jane_veneer>
}
10002882:	bd10      	pop	{r4, pc}
10002884:	0000      	movs	r0, r0
	...

10002888 <__jane_veneer>:
10002888:	e51ff004 	ldr	pc, [pc, #-4]	; 1000288c <__jane_veneer+0x4>
1000288c:	200000c0 	.word	0x200000c0

10002890 <__emma_veneer>:
10002890:	b401      	push	{r0}
10002892:	4802      	ldr	r0, [pc, #8]	; (1000289c <__emma_veneer+0xc>)
10002894:	4684      	mov	ip, r0
10002896:	bc01      	pop	{r0}
10002898:	4760      	bx	ip
1000289a:	bf00      	nop
1000289c:	200000c3 	.word	0x200000c3
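
As a side note for anyone reading the listing above: the two calls in setup() use different encodings, and the difference is a single bit. The sketch below is only an illustration that runs on a desktop compiler, not part of the project. The function names are made up for this example, and the halfword values are copied from the disassembly. It relies on the Thumb encoding rule that bit 12 of the second halfword is 1 for BL and 0 for BLX (immediate). The Cortex-M0+ has no ARM state to switch into, so the BLX form cannot work there.

```cpp
#include <cassert>
#include <cstdint>

// A Thumb BL / BLX (immediate) pair: the first halfword starts with 0b11110
// and the second halfword starts with 0b11.
constexpr bool is_bl_or_blx(uint16_t hw1, uint16_t hw2) {
    return (hw1 & 0xF800u) == 0xF000u && (hw2 & 0xC000u) == 0xC000u;
}

// Bit 12 of the second halfword is 1 for BL and 0 for BLX (immediate).
// BLX (immediate) always switches to ARM state at the destination.
constexpr bool is_blx_immediate(uint16_t hw1, uint16_t hw2) {
    return is_bl_or_blx(hw1, hw2) && (hw2 & 0x1000u) == 0;
}

// Hypothetical helper: checks the two halfword pairs from setup() above.
inline void check_thread_examples() {
    assert(!is_blx_immediate(0xf000, 0xf809));  // bl  __emma_veneer
    assert(is_blx_immediate(0xf000, 0xe804));   // blx __jane_veneer
}
```

So the linker treated jane as an ARM-state symbol: it emitted an ARM-encoded veneer and changed the call into a BLX to reach it.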

I do have a usable workaround for this problem, which is to code a veneer by hand, but this shouldn't be necessary, I think?

Cheers,
John

I think this post is related.

I don't think that post is related, or if it is, I've missed the point!

It definitely is possible to execute code from RAM on the RP2040. For C code you do it by declaring the function __no_inline_not_in_flash_func (and it makes quite a bit of difference under some circumstances). It's also possible to do it with assembler code, but the linker makes this harder than it ought to be.

I don't know whether it's possible on the ATmega, as discussed in that thread.

I had the same question a while ago. For my project I came to the conclusion that the architecture did not allow this.

I am not familiar with the RP2040, please disregard my comment.

What happens if you add a ".syntax unified" to your .S file?

Perhaps related: Another gcc problem ? - Raspberry Pi Forums

No change: it still crashes, and the automatically generated veneer still uses ldr pc, [pc, #-4] which isn't available on the RP2040.

Thanks for the suggestion, though!

I suggest you move the question to the RPi Pico forums, where there is significantly deeper knowledge of such obscure functionality.

Here are some suggestions for further debugging, even if they're just grasping at straws:

  1. What happens if you use the Philhower core instead of the Arduino core? It's a bit (a lot) closer to "raw sdk." Perhaps the mbed core is missing a linker switch needed to identify the cpu type.
  2. Same question, using the RPi SDK and build process, directly.
  3. I don't understand why a "veneer" is needed in the first place. gcc will happily generate "mov reg, #target; blx reg" sequences to permit calls to anywhere in the address space, were that not turned off by default to save space and registers (the switch to enable it is -mlong-calls, I think). There's a function attribute to force this: __attribute__((long_call)). I wonder why the __no_inline... macro doesn't seem to use it?

(man, sometimes the m0 instruction set looks particularly sucky!)

Heh, __attribute__((long_call)) sounded really promising, but the linker's foiled things again by setting the LSB to zero in the address, so the blx tries to switch to ARM32 mode!

My annotations in [[[ ]]]...

Disassembly of section .text.setup:

10002878 <setup>:
#include <pico/platform.h>

extern "C" void __no_inline_not_in_flash_func(emma)();
extern "C" void __attribute__((long_call)) jane();

void setup() {
10002878:	b510      	push	{r4, lr}
  emma();
1000287a:	f000 f805 	bl	10002888 <__emma_veneer>
  jane();
1000287e:	4b01      	ldr	r3, [pc, #4]	; (10002884 <setup+0xc>)
10002880:	4798      	blx	r3
}
10002882:	bd10      	pop	{r4, pc}
10002884:	200000c0 	.word	0x200000c0   [[[ &jane, LSB 0, ARM32 mode ]]]

10002888 <__emma_veneer>:
10002888:	b401      	push	{r0}
1000288a:	4802      	ldr	r0, [pc, #8]	; (10002894 <__emma_veneer+0xc>)
1000288c:	4684      	mov	ip, r0
1000288e:	bc01      	pop	{r0}
10002890:	4760      	bx	ip
10002892:	bf00      	nop
10002894:	200000c3 	.word	0x200000c3   [[[ &emma, LSB 1, Thumb mode ]]]

   :

Disassembly of section .data:

200000c0 <__data_start__>:
200000c0:	4770      	bx	lr

200000c2 <emma>:
#include <pico/platform.h>

extern "C" void __no_inline_not_in_flash_func(emma)() {
}
200000c2:	4770      	bx	lr
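
To spell out the annotations above: a BX or BLX through a register takes the instruction set from bit 0 of the target address. The sketch below is a desktop-only illustration with made-up function names. It applies that rule to the two literal words from the listing. It assumes nothing beyond the standard interworking rule that bit 0 set selects Thumb and bit 0 clear selects ARM state, which the Cortex-M0+ does not have, so the call faults.

```cpp
#include <cassert>
#include <cstdint>

// BX/BLX (register) select the instruction set from bit 0 of the target:
// 1 means Thumb, 0 means ARM state.
constexpr bool selects_thumb(uint32_t branch_target) {
    return (branch_target & 1u) != 0;
}

// The instruction is actually fetched from the target with bit 0 cleared.
constexpr uint32_t fetch_address(uint32_t branch_target) {
    return branch_target & ~uint32_t{1};
}

// Hypothetical helper: checks the literal words from the listing above.
inline void check_thread_addresses() {
    assert(!selects_thumb(0x200000c0u));                // literal for jane
    assert(selects_thumb(0x200000c3u));                 // literal for emma
    assert(fetch_address(0x200000c3u) == 0x200000c2u);  // emma's real address
}
```

The literal for jane would need to be 0x200000c1 for the blx to stay in Thumb state.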

I'll try to work out how to move the question to the Pico forum as you suggest.

Well, I got the veneer case to work by adding some additional directives to jane.S
(I used -S to see what the assembly functions produced by emma.cpp looked like, and pretty much copied all of them to jane.S)

.thumb
.section .time_critical.jane
.global jane
.thumb_func
.syntax unified
.align  1
.arch armv6-m
.code   16

.type   jane, %function


.cfi_startproc

jane:
  bx  lr
  
.cfi_endproc

I have no idea how much of that should be necessary.
EDIT: It looks like those changes cause the long_call version to produce correct code as well:

10000348 <setup>:
10000348:       b510            push    {r4, lr}
1000034a:       f00c ff95       bl      1000d278 <__emma_veneer>
1000034e:       4b01            ldr     r3, [pc, #4]    ; (10000354 <setup+0xc>)
10000350:       4798            blx     r3
10000352:       bd10            pop     {r4, pc}
10000354:       200000c1        andcs   r0, r0, r1, asr #1

I also noticed that the compile command for the .S file did not include any of -march=armv6-m -mcpu=cortex-m0plus -mthumb, but adding them did not seem to change anything.

Bingo!

It seems to be the .type jane, %function that's important. I've added

.type cic84fromBits, %function

to my (real, not simplified) code, changed the declaration in the C caller to

extern "C" void __attribute__((long_call)) cic84fromBits(int16_t* dst, const uint8_t* src, size_t n);

and got rid of the hand-written veneer. All is well, and it'll execute a couple of cycles faster too.

Thank you!
