how to prevent omission of "unused" functions

If I code

void foo() __attribute__((used)) __attribute__((naked)) __attribute__((noinline));
void foo() {
      asm volatile ("nop");
}

int main(void) {
       void (*my_function_pointer)() = &foo; 
}

and disassemble the result, I see that foo is not in the output. In order to make it appear I have to put a dummy call to foo somewhere. However, I thought that's what the “used” attribute is for. Does anyone have any hints on how to tell the compiler not to optimize foo away?

I have a hard time seeing the need for this and suspect your interest is purely academic - or ... ?

If you enable sufficient compiler warnings, the compiler will issue a warning for unused/unreferenced functions. The "used" attribute, however, allows you to suppress such warnings on a per-function basis.

The compiler will generate code for foo, but the linker will eventually remove it. So if you use avr-objdump to look at the assembly code, the source reference is there, but the code for foo is gone.
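You can see this for yourself by disassembling the intermediate object file as well as the final ELF (a rough sketch; file names and paths will differ in your setup, and it assumes --gc-sections is in the link, as in the Arduino build):

avr-objdump -d foo.o  | grep -A 3 '<foo>:'     # foo shows up in the object file
avr-objdump -d foo.elf | grep -A 3 '<foo>:'    # nothing: the linker dropped the section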

Hi Ben,

it is academic insofar as I can force the compiler to generate the desired functions by placing dummy calls. However, I aim at creating functions like

void token1() __attribute__((used)) __attribute__((naked)) __attribute__((noinline));
void token1() {
      asm volatile ("some lines of inlined asm code");
}

Then I would place the function pointers into a list, load the start address into the Z register and then just jump to the pointed token, increment the register and jump to the next and so on.

You may wonder why I want to do something like this. Well, I have some timing critical code. Either I switch to a much faster controller, which would mean dumping >2000 lines of tested C code and learning a new platform, or I push the current platform. Right now I use inline assembler but I need some more flexibility. So the idea is to create a dedicated interpreter that way. I figured that using this approach the interpreter overhead per token will be only 3 cycles (two for the jump and one for the add). That is, the "interpreter logic" is just 4 bytes that I can inline at the end of each "token".

Of course I know that this is not exactly the Arduino way, but it will give me an ultra fast state machine plus the flexibility of an interpreter. Notice that my token sequence will be in RAM, so I can set up the actual code / token sequence at runtime (I could even upload it through the serial interface) without the need to flash. The only thing that keeps bugging me is that I have to put those dummy calls into my code.
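Roughly what I have in mind (just a sketch, the names are made up):

typedef void (*token_t)(void);      /* one "token" = one naked function */

void token1() __attribute__((used)) __attribute__((naked)) __attribute__((noinline));
void token2() __attribute__((used)) __attribute__((naked)) __attribute__((noinline));

#define MAX_TOKENS 32
token_t token_sequence[MAX_TOKENS]; /* lives in RAM, so it can be rewritten at runtime */

void build_sequence() {
      token_sequence[0] = token1;
      token_sequence[1] = token2;
      /* ... filled in at runtime, e.g. from bytes received over the serial interface */
}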

So it's not purely academic. It's about avoiding ugly workarounds :wink:

P.S. For the ultra critical parts I can even increment at the start of the token code, thus temporarily decreasing the inter-token latency for the jump to the next token to two cycles. Not too bad for code that is "interpreted".

I suggest you look at the linker command line options. Specifically you need to avoid the --gc-sections (garbage collect sections) option.

Then I would place the function pointers into a list, load the start address into the Z register and then just jump to the pointed token, increment the register and jump to the next and so on.

When you “put the function pointers in a list”, that should make them “referenced” as far as the linker is concerned. In your example, the assignment of the function pointer in main is getting optimized away because the local variable is never referenced either. Just actually write the whole program, and the right things should happen! Using pointers to a function should also prevent it from being inlined out of existence.

Really, really, try to avoid using obscure compiler-dependent attributes and similar. I’m not sure about “naked”, but the following looks like it demonstrates what you want to do, and it does include foo() in the elf file, without needing to resort to incomprehensible incantations.

void foo() {
  asm volatile ("nop");
}

void setup() {
}

static void (*my_function_pointers[5])();

void bar(byte index)
{
   (*my_function_pointers[index])();
}

void loop(void) {

  my_function_pointers[0] = foo;
  bar(0);
}

Better change my example to have bar(digitalRead(1)); or the indirectness will get optimized away. Grumble grumble too smart compilers…
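That is, roughly (and setting entry 1 as well, since digitalRead can return HIGH):

void loop(void) {
  my_function_pointers[0] = foo;
  my_function_pointers[1] = foo;   // digitalRead(1) returns 0 or 1, so cover both slots
  bar(digitalRead(1));             // index now depends on runtime input, so the indirect call survives
}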

the interpreter overhead per token will be only 3 cycles (two for the jump and one for the add)

Sounds like FORTH. Except I think you underestimate the cost of this sort of indirect jump on AVR. You had in mind doing something like
(*functable[token])()
Doing the indirect call requires the function pointer to be in the Z register, which will likely also be used for the indexing. So this assembles to:

   (*my_function_pointers[index])();
 116:   e8 2f           mov     r30, r24
 118:   f0 e0           ldi     r31, 0x00     ; extend byte to int (probably removable)
 11a:   ee 0f           add     r30, r30    ; sizeof func ptr = 2, so double value
 11c:   ff 1f           adc     r31, r31
 11e:   e0 50           subi    r30, 0x00    ; Skip register address space by adding 256.
 120:   ff 4f           sbci    r31, 0xFF       ; (not very intelligently!)
 122:   01 90           ld      r0, Z+          ; z <-- (z) fetch actual function addr.
 124:   f0 81           ld      r31, Z
 126:   e0 2d           mov     r30, r0
 128:   09 95           icall          ;indirect call.

@Benf: this is helpful advice. But it looks as if this switches off the optimization for everything that gets linked, not just for my functions. If this is really the case I will stick to my kludge of dummy calls.

@Westf: I do not think that I underestimate the cost. I will not let the compiler generate this kind of code; I will hand code it in assembler. That is, I let the compiler generate the address list, but handling the Z register, jumping and doing the time critical stuff will be in assembler. Although I am a beginner with C, I have ~5 years of assembler experience from the past. Looks as if the C linker is much harder for me than some piece of assembler :slight_smile:

BenF's advice will get you all the functions in the .o modules into your end binary, but it won't stop the compiler-level optimization. (Of course, lately the Arduino environment sorta counts on having the linker optimization turned on, having somewhat given up on partitioning the source into tiny modules.) What the build does at compile time is tell the compiler to put each function in its own "section" (-ffunction-sections), and then at link time omit any sections that aren't referenced from some other section (--gc-sections)...
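In terms of the actual command lines, it looks roughly like this (a sketch; the IDE normally adds these flags for you, tokens.c stands in for the whole program, and the mcu is just an example):

avr-gcc -mmcu=atmega168 -Os -ffunction-sections -c tokens.c -o tokens.o            # each function lands in its own .text.<name> section
avr-gcc -mmcu=atmega168 -Os -Wl,--gc-sections tokens.o -o prog.elf                 # unreferenced sections get dropped here
avr-gcc -mmcu=atmega168 -Os tokens.o -o prog.elf                                   # no --gc-sections: everything is kept
avr-gcc -mmcu=atmega168 -Os -Wl,--gc-sections -Wl,-u,token1 tokens.o -o prog.elf   # should keep the GC on but pin token1 as a root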

I think I got it. This was helpful.

With regard to my superfast interpreter: I think I misunderstood the datasheet. It will take 6 cycles to load the Z register (indirect using Y with post-increment) and do the ijmp. Still fast enough for my intended use though.
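In asm terms (a sketch, with Y used as the pointer into the token list):

   ld r30, Y+    ; low byte of the token address   (2 cycles)
   ld r31, Y+    ; high byte of the token address  (2 cycles)
   ijmp          ; jump to the token               (2 cycles)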

Now I actually tried my idea and it works, but only to some extent:

If I code

pointer = (int) &token;

and then in the assembler part

            "lds r30, pointer"      "\n\t"
            "lds r31, pointer+1""\n\t"
            "ijmp"                        "\n\t"

It will call the naked function “token” and execute it as expected.
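(For reference, this assumes file-scope declarations roughly like the following, so that the symbol name pointer is visible to the lds instruction; the exact declarations are not shown above:)

void token() __attribute__((used)) __attribute__((naked)) __attribute__((noinline));
int pointer;     /* file scope, so "lds r30, pointer" can resolve the symbol */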

However if I code

pointer = (int) &token;
pointer = (int) &pointer;

and then in the assembler part

            "lds r28, pointer"      "\n\t"
            "lds r29, pointer+1""\n\t"
            "ld r30, Y+"            "\n\t"
            "ld r31, Y+"            "\n\t"
            "ijmp"                        "\n\t"

It fails to call the naked function “token” and instead crashes. However, I would expect the results to be equivalent. To be more precise, I would expect that loading Z indirectly through Y would dereference the pointer, so that Z ends up holding the address of token.

Has anyone an idea why the second approach fails?

Argh! This belongs on the list of the more stupid mistakes. Slapping myself on the back of the head. Almost as clever as coding an endless loop. Again I messed up the C part...
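For the record: the second assignment was overwriting the token address I had just stored. Presumably something like this (a sketch, with a second variable) is what the double indirection needs:

token_addr = (int) &token;        /* keep the token address in its own variable    */
pointer    = (int) &token_addr;   /* ... and let "pointer" point at that variable  */

With that, the Y-indirect loads put the address of token into Z and the ijmp lands where it should.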

heh heh.

Have you ever looked at Forth? It’s a language. Awful to read, but the execution environment looks a lot like what you’re proposing, and tiny subsets are often useful as pseudo-interpreters. It’s got good clues as to how to put together a fully functional language (including conditionals and loops) using a thread of (essentially) function addresses. You might end up with:

typedef void func_t(void);                          /* a token is just a void function */
extern func_t putx, puty, add2, write, restart;     /* the token routines, defined in asm */

func_t *mytokens[] = {
     putx,
     puty,
     add2,
     write,
     restart
};

With asm looking something like:

startinterp:
    ldi YL, lo8(mytokens)   ;; point Y at the start of the token list
    ldi YH, hi8(mytokens)
nexttoken:
    ld ZL, Y+
    ld ZH, Y+  ;; note Y now pointing at next token.
    ijmp

putx: ;; blah, blah, don't touch Y
   rjmp nexttoken

That makes the nexttoken code 6 cycles (they’re all 2-cycle instructions). You can put the “nexttoken” code at the end of each token’s code; that saves two cycles but uses some program space.
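I.e., something like this at the end of each token (just a sketch):

puty:              ;; token body; still don't touch Y
   ;; ...
   ld ZL, Y+       ;; "nexttoken" inlined right here
   ld ZH, Y+
   ijmp            ;; 6 cycles total, and no jump back needed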

Ah, that looks almost exactly like what I implemented 30 minutes ago. Except that I omitted the jump and inlined the next-step part (as you already suggested as an option). That is, I am trading memory for speed.

I think I should definitely have a look at Forth. Looks like it would be a very good source of ideas I might otherwise have to come up with on my own. I was never aware of how Forth achieved its speed; I only knew its syntax.

Thanks a lot for this great hint.

Hi westf,

Following your advice I found this interesting link:
http://www.annexia.org/_file/jonesforth.s.txt.

So right now I am implementing a kind of subset of FORTH. I do not need the full language and I will stick to C, but it is really interesting and helpful :slight_smile:

Thanks a lot, Udo