PSTR() flash string de-duplication -- a problem, a solution, and a question

My current Arduino project uses a lot of short constant strings.

I'm using the F() macro to place them in flash memory, and since C++ is generally smart enough to put only a single copy of each distinct string literal into the resulting executable, I even split them into smaller tokens to conserve flash memory by providing more opportunity to merge identical (shorter) strings.

And then I looked into the compiled ELF file and found that no flash string merging happened. At all. Duh.

Of course, there is at least one obvious workaround for that problem (which is to create a string table where each distinct string occurs only once), but I found that unsatisfactory, so I set out to find a solution that allows me to keep my F("foo") string literals in place and still get only one copy of each distinct string literal in flash memory.

...and I'm happy to say that I found one. :slight_smile: Here's a write-up of my journey (for your amusement) that ends with a drop-in replacement for the PSTR() macro:

http://michael-buschbeck.github.io/arduino/2013/10/20/string-merging-pstr/

And now, the question(s).

  • My PSTR() replacement seems to work fine (my code works, including all Arduino library code it uses). I think that the solution I'm proposing is reasonably robust, but I'd really like a second opinion. What do you think -- is this a viable solution? Does it have unwanted side effects that I'm not seeing yet?
  • Assuming that solution is viable, do you think it'd be worthwhile to approach someone (perhaps the avr-libc guys?) with a patch?

Michael,

Well done with this detective work and problem solving! Your write-up is also a nice read -- nice work all around.

After a careful read through, I can't see any problems with what you propose, and I will be interested in trying it out to see how it works in practice. I'm not an expert in this area by means, however -- but another regular forum contributor, MichaelMeissner, does have an extensive compiler background, and in particular has worked on gcc, so he would be an excellent person to ask about how to go about proposing your patch as a candidate for inclusion in the general releases. I've sent him a PM to alert him to this thread, but I wouldn't hesitate to contact him directly -- very nice (as well as knowledgeable) fellow.

Thank you :slight_smile: -- I appreciate your feedback.

One additional thing I forgot to mention is that the modified PSTR() macro also gets rid of those misleading "only initialized variables can be placed into program memory area" compiler warnings that usually accompany each and every F() occurrence. That wasn't what I was aiming for, but it's definitely a bonus.

using GCC 4.8.0 I applied your suggested fix, and I've come across a line it doesnt like to compile

Serial.println(F("%"));

of course this snippet needs other code, but the percent symbol doesn't expand out correctly.

darryl:
using GCC 4.8.0 I applied your suggested fix, and I've come across a line it doesnt like to compile

Serial.println(F("%"));

I can reproduce that: GCC tells me "error: invalid 'asm': invalid %-code". Good catch. I'll look into that.

The good news is that GCC's asm only attempts to interpret "%-codes" if any outputs are declared.

Of course, I do need outputs to get the string address out, but I can split the asm statement into two consecutive ones, the first one defining the string, the second one getting its address.

There's a problem with that, too, though, because now I can't use the "%=" construct to generate a unique label for the string. (And "%=" would resolve to different values in those two asm blocks anyway.) But fortunately, the assembler has a concept of "local labels" that seem to work even in my case.

Long story short, here's an updated version that works even in the F("%") case:

#define PSTR(str) \
  (__extension__({ \
    PGM_P ptr;  \
    asm volatile \
    ( \
      ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t" \
      "0: .string " #str                                 "\n\t" \
      ".popsection"                                      "\n\t" \
    ); \
    asm volatile \
    ( \
      "ldi %A0, lo8(0b)"                                 "\n\t" \
      "ldi %B0, hi8(0b)"                                 "\n\t" \
      : "=d" (ptr) \
    ); \
    ptr; \
  }))

Side note: I tested this with some statements like these:

Serial.print(F("%"));
Serial.print(F("%%"));
Serial.print(F("%%%"));

and it turns out the linker actually cleverly merges those three strings into each other. In the ELF file, only "%%%" remains, and "%" and "%%" end up pointing into that string at the appropriate offset.

yep, that's working for me nicely. ( tested on 4.80 & 4.7.2 and the standard windows 4.3.2 ? )

have some karma. great to see that optimization working as it should of done all along :slight_smile:

Darryl

Many thanks for the karma, and I'm happy it works better now. :slight_smile:

I've made a follow-up post with the improved version:
http://michael-buschbeck.github.io/arduino/2013/10/22/string-merging-pstr-percent-codes

The real solution is to get the compiler and binutils (linker) to do this automatically. However, given the Arduino group remains at a 4 or 5 year old release (4.3.2), which is 5 GCC releases out of date (4.8.1 is the current released version), it is problematical to get it fixed in that area.

As you discovered, using the asm interface is problematical due to the '%' characters.

You might see if the linker used in the Arduino release supports mergeable sections. Here is a fragment from the assembler's .section directive:

7.96 `.section NAME'
====================

Use the `.section' directive to assemble the following code into a
section named NAME.

   This directive is only supported for targets that actually support
arbitrarily named sections; on `a.out' targets, for example, it is not
accepted, even with a standard `a.out' section name.

ELF Version
-----------

   This is one of the ELF section stack manipulation directives.  The
others are `.subsection' (*note SubSection::), `.pushsection' (*note
PushSection::), `.popsection' (*note PopSection::), and `.previous'
(*note Previous::).

   For ELF targets, the `.section' directive is used like this:

     .section NAME [, "FLAGS"[, @TYPE[,FLAG_SPECIFIC_ARGUMENTS]]]

   The optional FLAGS argument is a quoted string which may contain any
combination of the following characters:

`a'  section is allocatable
`w' section is writable
`x'  section is executable
`M' section is mergeable
`S'  section contains zero terminated strings
`G' section is a member of a section group
`T'  section is used for thread-local-storage

 The optional TYPE argument may contain one of the following constants:

`@progbits'  section contains data
`@nobits'  section does not contain data (i.e., section only occupies space)
`@note'  section contains data which is used by things other than the program
`@init_array' section contains an array of pointers to init functions
`@fini_array' section contains an array of pointers to finish functions
`@preinit_array' section contains an array of pointers to pre-init functions

 Many targets only support the first three section types.

   If FLAGS contains the `M' symbol then the TYPE argument must be
specified as well as an extra argument--ENTSIZE--like this:

     .section NAME , "FLAGS"M, @TYPE, ENTSIZE

   Sections with the `M' flag but not `S' flag must contain fixed size
constants, each ENTSIZE octets long. Sections with both `M' and `S'
must contain zero terminated strings where each character is ENTSIZE
bytes long. The linker may remove duplicates within sections with the
same name, same entity size and same flags.  ENTSIZE must be an
absolute expression.  For sections with both `M' and `S', a string
which is a suffix of a larger string is considered a duplicate.  Thus
`"def"' will be merged with `"abcdef"';  A reference to the first
`"def"' will be changed to a reference to `"abcdef"+3'.

   If FLAGS contains the `G' symbol then the TYPE argument must be
present along with an additional field like this:

     .section NAME , "FLAGS"G, @TYPE, GROUPNAME[, LINKAGE]

   The GROUPNAME field specifies the name of the section group to which
this particular section belongs.  The optional linkage field can
contain:
`comdat'
     indicates that only one copy of this section should be retained

`.gnu.linkonce'
     an alias for comdat

   Note: if both the M and G flags are present then the fields for the
Merge flag should come first, like this:

     .section NAME , "FLAGS"MG, @TYPE, ENTSIZE, GROUPNAME[, LINKAGE]

   If no flags are specified, the default flags depend upon the section
name.  If the section name is not recognized, the default will be for
the section to have none of the above flags: it will not be allocated
in memory, nor writable, nor executable.  The section will contain data.

MichaelMeissner:
The real solution is to get the compiler and binutils (linker) to do this automatically.

I tend to agree, and in fact, avr-g++ incorporated a PROGMEM-specific fix regarding the -fmerge-constants compiler option in 4.7.0.

That seemed to be exactly what I wanted, so I tried that first (with GCC 4.7.0 and 4.8.0) -- and with the right combination of compiler switches (-fno-data-sections, as somewhat obliquely implied by that bugzilla comment I linked to above, and of course -fmerge-constants or -fmerge-all-constants), I actually managed to get avr-g++ to merge strings defined through the regular PSTR() macro -- but only for global variables and static locals in non-member functions.

I absolutely couldn't get avr-g++ to merge strings in static locals defined in member functions (even static member functions), which is unfortunate because that's precisely where I need them. And I think that's not a bug but rather an unfortunate consequence of a design choice and essential C++ semantics, so it's basically unfixable.

(I'm out of my depth where compiler and linker internals are concerned, so the following may be completely off the mark. It's just my attempt to make sense of the things I've observed.)

I think the basic underlying problem is the design choice that attribute((progmem)) needs to be attached to a variable -- everything that follows is just an unavoidable chain of consequences:

  • attribute((progmem)) must be attached to a variable, so PSTR() must use a (static local) variable;
  • static local variables in class methods have non-trivial linking semantics across translation units (because the logically same static local might conceivably be defined in several translation units when methods are inlined), so the compiler must emit weak symbols for them;
  • the linker won't merge strings referenced through weak symbols, presumably because it can never know if any weakly-referenced string will actually end up being used.

So I wouldn't even know what to suggest to the avr-g++ makers -- short of scrapping the entire attribute((progmem)) approach and adding some built-in way that comes close to my asm-based workaround, which just puts the string literal directly in the right kind of section. :-/

bad news, found a case where 4.7.2 and 4.8.0 dont work, they are missing some of the code from the newer macro.

bool test = false;

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
}

void loop() {
  // put your main code here, to run repeatedly: 
    test = !test;
    test ? Serial.print(F("true")) : Serial.print(F("false"));

  // you will want a delay, if actually uploading this code onto an AVR board
  delay(100);
}

produces

.Lscope8:
	.section	.text.loop,"ax",@progbits
	.stabs	"loop:F(0,4)",36,0,0,loop
.global	loop
	.type	loop, @function
loop:
	.stabs	"sketch_feb17c.ino",132,0,0,.Ltext6
.Ltext6:
	.stabn	68,0,10,.LM21-.LFBB9
.LM21:
.LFBB9:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
.LBB23:
	.stabn	68,0,12,.LM22-.LFBB9
.LM22:
	lds r24,test
	ldi r25,lo8(1)
	eor r24,r25
	sts test,r24
	.stabn	68,0,13,.LM23-.LFBB9
.LM23:
	tst r24
	breq .L16
.LBB24:
	.stabn	68,0,13,.LM24-.LFBB9
.LM24:
/* #APP */
 ;  13 "sketch_feb17c.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	0: .string "true"
	.popsection
	
 ;  0 "" 2
/* #NOAPP */
	rjmp .L17
.L16:
.LBE24:
.LBB25:
	.stabn	68,0,13,.LM25-.LFBB9
.LM25:
/* #APP */
 ;  13 "sketch_feb17c.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	0: .string "false"
	.popsection
	
 ;  0 "" 2
/* #NOAPP */
.L17:
/* #APP */
 ;  13 "sketch_feb17c.ino" 1
	ldi r22, lo8(0b)
	ldi r23, hi8(0b)
	
 ;  0 "" 2
/* #NOAPP */
.LBE25:
	ldi r24,lo8(Serial)
	ldi r25,hi8(Serial)
	jmp _ZN5Print5printEPK19__FlashStringHelper
.LBE23:

in the case of the true string ( defined at .L24 ) we dont have any lo8 and hi8 access to the address in .L16

the false string at .L25 and its .L17 do produce code that makes it to being assembled.

any thoughts ?

I'm guessing the compiler is trying to be too clever and just above .L16 has the relative jump to .L17 thinking that will load r22 & r23 correctly, but they wont have the right values, thus you only get one string being output :frowning:

okay, have fixed this now.

instead of having two asm lines in the PSTR define. put them into one. and the optimiser takes them verbatim ( as marked by /* APP */ to /*NOAPP */ in one hit, and the pointer loads to the string dont get omitted.

this seems to work for me, but again, keep an eye on the outputs.

#define PSTR(str)                                                  \
  (__extension__( {                                                \
    PGM_P ptr;                                                     \
    __asm__ __volatile__ (                                         \
      ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"    \
      "0: .string " #str                                 "\n\t"    \
      ".popsection"                                      "\n\t"    \
      "ldi %A0, lo8(0b)"                                 "\n\t"    \
      "ldi %B0, hi8(0b)"                                 "\n\t"    \
      : "=d" (ptr) );                                              \
    ptr;                                                           \
  } \
  ))
#endif

which now gives

.LM24:
/* #APP */
 ;  13 "sketch_feb18a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	0: .string "true"
	.popsection
	ldi r22, lo8(0b)
	ldi r23, hi8(0b)
	
 ;  0 "" 2
/* #NOAPP */
	rjmp .L17
.L16:
.LBE24:
.LBB25:
	.stabn	68,0,13,.LM25-.LFBB9
.LM25:
/* #APP */
 ;  13 "sketch_feb18a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	0: .string "false"
	.popsection
	ldi r22, lo8(0b)
	ldi r23, hi8(0b)
	
 ;  0 "" 2
/* #NOAPP */
.L17:

which of course, is back to the original code solution, which fails on the % case :frowning:

Oh my. That's a headscratcher...

Thanks for the reduced test case.

I'm still not totally clear about the actual mechanism that makes this bug come to pass -- the compiler should defer any kind of optimization to long after the local "0:" labels have been resolved to an actual unique generated symbol, which in turn should prevent any kind of conflation of the successive asm blocks that contain the "0b" reference because they'd refer to different actual symbols.

Plus, I'd expect asm volatile to prevent any kind of optimization of that kind anyway. Even if those two consecutive pairs of ldi statements do the same thing in the compiler's opinion, volatile should make it include those statements anyway; but the first asm block's ldi statements are nowhere to be seen. That's weird.

I'll play around with that myself a bit.

Here's a possible workaround: Replace the "0:" label and the "0b" backreferences with "999:" and "999b", respectively. (Any number greater than 9 should work just as well.)

When I did that, the disassembly suddenly started showing two distinct pairs of ldi statements (with different offsets) where the original (using the "0:" label) hadn't. The idea of giving this change a try was inspired by the following statement from the relevant GNU Assembler docs:

It is also worth noting that the first 10 local labels (0:...9:) are implemented in a slightly more efficient manner than the others.

It seems that "slightly more efficient" also implies "slightly buggy" in our particular case here.

Here's the full repro case:

void setup() {}

bool flag = false;
volatile PGM_P result;

void loop()
{
  flag = !flag;

  result =
    flag
      ? (__extension__({
          PGM_P ptr;
          asm volatile
          (
            ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"
            "999: .string " "\"false\""                        "\n\t"
            ".popsection"                                      "\n\t"
          );
          asm volatile
          (
            "ldi %A0, lo8(999b)"                               "\n\t"
            "ldi %B0, hi8(999b)"                               "\n\t"
            : "=d" (ptr)
          );
          ptr;
        }))
      : (__extension__({
          PGM_P ptr;
          asm volatile
          (
            ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"
            "999: .string " "\"true\""                         "\n\t"
            ".popsection"                                      "\n\t"
          );
          asm volatile
          (
            "ldi %A0, lo8(999b)"                               "\n\t"
            "ldi %B0, hi8(999b)"                               "\n\t"
            : "=d" (ptr)
          );
          ptr;
        }));
}

Here's the corresponding disassembly of the loop() method -- note the two distinct ldi pairs at offsets ae and b4 that refer to string offsets 104 and 110, respectively:

0000009e <loop>:
  9e:	80 91 02 01 	lds	r24, 0x0102
  a2:	91 e0       	ldi	r25, 0x01	; 1
  a4:	89 27       	eor	r24, r25
  a6:	80 93 02 01 	sts	0x0102, r24
  aa:	88 23       	and	r24, r24
  ac:	19 f0       	breq	.+6      	; 0xb4 <loop+0x16>
  ae:	88 e6       	ldi	r24, 0x68	; 104
  b0:	90 e0       	ldi	r25, 0x00	; 0
  b2:	02 c0       	rjmp	.+4      	; 0xb8 <loop+0x1a>
  b4:	8e e6       	ldi	r24, 0x6E	; 110
  b6:	90 e0       	ldi	r25, 0x00	; 0
  b8:	90 93 01 01 	sts	0x0101, r25
  bc:	80 93 00 01 	sts	0x0100, r24
  c0:	08 95       	ret

For comparison, the disassembly when "0:" is used as the local label instead of "999:" -- note that there's only a single ldi pair at offset b0, as you had shown already:

0000009e <loop>:
  9e:	80 91 02 01 	lds	r24, 0x0102
  a2:	91 e0       	ldi	r25, 0x01	; 1
  a4:	89 27       	eor	r24, r25
  a6:	80 93 02 01 	sts	0x0102, r24
  aa:	88 23       	and	r24, r24
  ac:	09 f0       	breq	.+2      	; 0xb0 <loop+0x12>
  ae:	00 c0       	rjmp	.+0      	; 0xb0 <loop+0x12>
  b0:	8e e6       	ldi	r24, 0x6E	; 110
  b2:	90 e0       	ldi	r25, 0x00	; 0
  b4:	90 93 01 01 	sts	0x0101, r25
  b8:	80 93 00 01 	sts	0x0100, r24
  bc:	08 95       	ret

Having stared at the broken assembly a little longer, I find this sequence of instructions most peculiar:

  ac:	09 f0       	breq	.+2      	; 0xb0 <loop+0x12>
  ae:	00 c0       	rjmp	.+0      	; 0xb0 <loop+0x12>
  b0:	8e e6       	ldi	r24, 0x6E	; 110
  b2:	90 e0       	ldi	r25, 0x00	; 0

So rjmp jumps to the very next instruction (which is entirely pointless), and the preceding breq conditionally skips that pointless rjmp instruction and branches to the following ldi instruction as well.

This doesn't make sense -- especially if you bear in mind that this is supposed to be assembly code that's been optimized for space. (The two branching instructions could simply be removed without any functional change given this isolated sequence of opcodes.)

Compare the correct assembly:

  ac:	19 f0       	breq	.+6      	; 0xb4 <loop+0x16>
  ae:	88 e6       	ldi	r24, 0x68	; 104
  b0:	90 e0       	ldi	r25, 0x00	; 0
  b2:	02 c0       	rjmp	.+4      	; 0xb8 <loop+0x1a>
  b4:	8e e6       	ldi	r24, 0x6E	; 110
  b6:	90 e0       	ldi	r25, 0x00	; 0

Here, the rjmp skips the second pair of ldi instructions and the breq skips the first pair (and the rjmp). That makes sense; it's a straightforward if-else construct.

At least two things seem to be going wrong independently in the broken case, I think:

  • The "optimized" local label resolution algorithm incorrectly maps both occurrences of the "0b" label reference to the same generated symbol (the one corresponding to the second occurrence of "0:"), leading to two identical pairs of ldi instructions that erroneously both point to the same string offset.
  • An optimization algorithm, presumably one optimizing for space, notices the identical ldi pairs, combines them into one and moves it out of the if-else construct. However, the optimizer subsequently fails to notice and remove the pointless "rjmp .+0" instruction. (If the rjmp had been removed, the breq instruction that skips over it would have been reduced to "breq .+0", which in turn should certainly have been removed as well.)

If I'm guessing right, the first issue is a veritable bug somewhere in the compiler or the assembler, specifically the part where local labels "0:" to "9:" are given special handling. The latter one, on the other hand, is just a missed optimization (which isn't relevant to the non-broken case anyway).

okay, i'm getting wrong code still, so I guess its related to some of my options.

avr-g++ -save-temps -c -gstabs -Os -ffunction-sections -fdata-sections -MMD -fno-exceptions -mmcu=atmega328p -DF_CPU=16000000L -DARDUINO=155 -DARDUINO_AVR_UNO -DARDUINO_ARCH_AVR -Wall -mrelax -fmerge-constants -frename-registers -ffreestanding -finline-functions-called-once -fpack-struct -fshort-enums -funsigned-bitfields

thats the command line ( minus the -I options from my windows based install ( of course its run from a temp folder, and the leading path I've omitted along with the input an output names.

the "-mrelax" looks a likely candicate, so will do some investigation and report back.

thanks again for peering into this issue :slight_smile:

Darryl

okay, using -O1 or -O2 or -O3 produces working code.... so its like it related to -Os :frowning:

so I did some more playing around. using -fverbose-asm on compiling

we get this info.

 ;  GNU C++ (GCC) version 4.8.0 20130306 (experimental) (avr)
 ; 	compiled by GNU C version 3.4.5 (mingw-vista special r2), GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
 ;  GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
 ;  options passed:  -fpreprocessed sketch_feb19a.ii -mmcu=atmega328p
 ;  -mrelax
 ;  -auxbase-strip D:\tmp\build2743407637356413609.tmp\sketch_feb19a.cpp.o
 ;  -gstabs -Os -Wall -fverbose-asm -fno-optimize-strlen
 ;  -fno-rtti -fno-enforce-eh-specs -fno-exceptions
 ;  options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 ;  -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 ;  -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 ;  -fdefer-pop -fdevirtualize -fdwarf2-cfi-asm -fearly-inlining
 ;  -feliminate-unused-debug-types -fexpensive-optimizations
 ;  -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime
 ;  -fguess-branch-probability -fhoist-adjacent-loads -fident
 ;  -fif-conversion -fif-conversion2 -findirect-inlining -finline
 ;  -finline-atomics -finline-functions -finline-functions-called-once
 ;  -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const
 ;  -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots
 ;  -fira-share-spill-slots -fivopts -fkeep-static-consts
 ;  -fleading-underscore -fmath-errno -fmerge-constants
 ;  -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 ;  -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
 ;  -fpeephole -fpeephole2 -fprefetch-loop-arrays -freg-struct-return
 ;  -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 ;  -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 ;  -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 ;  -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 ;  -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros
 ;  -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-aliasing
 ;  -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls
 ;  -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 ;  -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 ;  -ftree-copy-prop -ftree-copyrename -ftree-dce -ftree-dominator-opts
 ;  -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert
 ;  -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
 ;  -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta
 ;  -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize
 ;  -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge
 ;  -ftree-ter -ftree-vect-loop-version -ftree-vrp -funit-at-a-time
 ;  -fverbose-asm -fzero-initialized-in-bss -mrelax

for the above -Os i've also stopped its optimizing_strlen. and for the below -O2 option, i had to manually add in the -finline-functions. so that the compiler uses the same options..... or so we hope !

 ;  GNU C++ (GCC) version 4.8.0 20130306 (experimental) (avr)
 ; 	compiled by GNU C version 3.4.5 (mingw-vista special r2), GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
 ;  GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
 ;  options passed:  -fpreprocessed sketch_feb19a.ii -mmcu=atmega328p
 ;  -mrelax
 ;  -auxbase-strip D:\tmp\build2743407637356413609.tmp\sketch_feb19a.cpp.o
 ;  -gstabs -O2 -Wall -fverbose-asm -fno-optimize-strlen -finline-functions
 ;  -fno-rtti -fno-enforce-eh-specs -fno-exceptions
 ;  options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 ;  -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 ;  -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 ;  -fdefer-pop -fdevirtualize -fdwarf2-cfi-asm -fearly-inlining
 ;  -feliminate-unused-debug-types -fexpensive-optimizations
 ;  -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime
 ;  -fguess-branch-probability -fhoist-adjacent-loads -fident
 ;  -fif-conversion -fif-conversion2 -findirect-inlining -finline
 ;  -finline-atomics -finline-functions -finline-functions-called-once
 ;  -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const
 ;  -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots
 ;  -fira-share-spill-slots -fivopts -fkeep-static-consts
 ;  -fleading-underscore -fmath-errno -fmerge-constants
 ;  -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 ;  -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
 ;  -fpeephole -fpeephole2 -fprefetch-loop-arrays -freg-struct-return
 ;  -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 ;  -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 ;  -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 ;  -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 ;  -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros
 ;  -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-aliasing
 ;  -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls
 ;  -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 ;  -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 ;  -ftree-copy-prop -ftree-copyrename -ftree-dce -ftree-dominator-opts
 ;  -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert
 ;  -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
 ;  -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta
 ;  -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize
 ;  -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge
 ;  -ftree-ter -ftree-vect-loop-version -ftree-vrp -funit-at-a-time
 ;  -fverbose-asm -fzero-initialized-in-bss -mrelax

but its produces broken code still for the -Os use.

heres the two outputs

.LM1:
.LFBB2:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
.LBB2:
	.stabn	68,0,9,.LM2-.LFBB2
.LM2:
	lds r24,flag	 ;  flag.3, flag
	ldi r25,lo8(1)	 ;  tmp48,
	eor r24,r25	 ;  flag.3, tmp48
	sts flag,r24	 ;  flag, flag.3
	.stabn	68,0,44,.LM3-.LFBB2
.LM3:
	cpse r24,__zero_reg__	 ;  flag.3,
	rjmp .L6	 ; 
.LBB3:
	.stabn	68,0,36,.LM4-.LFBB2
.LM4:
/* #APP */
 ;  36 "sketch_feb19a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	999: .string "false"
	.popsection
 ;  0 "" 2
	.stabn	68,0,42,.LM5-.LFBB2
.LM5:
 ;  42 "sketch_feb19a.ino" 1
	ldi r24, lo8(999b)	 ;  iftmp.4
	ldi r25, hi8(999b)	 ;  iftmp.4
 ;  0 "" 2
/* #NOAPP */
.LBE3:
	.stabn	68,0,44,.LM6-.LFBB2
.LM6:
	sts result+1,r25	 ;  result, iftmp.4
	sts result,r24	 ;  result, iftmp.4
	ret
.L6:
.LBB4:
	.stabn	68,0,20,.LM7-.LFBB2
.LM7:
/* #APP */
 ;  20 "sketch_feb19a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	999: .string "true"
	.popsection
 ;  0 "" 2
	.stabn	68,0,26,.LM8-.LFBB2
.LM8:
 ;  26 "sketch_feb19a.ino" 1
	ldi r24, lo8(999b)	 ;  iftmp.4
	ldi r25, hi8(999b)	 ;  iftmp.4
 ;  0 "" 2
/* #NOAPP */
.LBE4:
	.stabn	68,0,44,.LM9-.LFBB2
.LM9:
	sts result+1,r25	 ;  result, iftmp.4
	sts result,r24	 ;  result, iftmp.4
	ret
.LBE2:
	.size	loop, .-loop

and .....

.LM1:
.LFBB2:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
.LBB2:
	.stabn	68,0,9,.LM2-.LFBB2
.LM2:
	lds r24,flag	 ;  flag.3, flag
	ldi r25,lo8(1)	 ;  tmp48,
	eor r24,r25	 ;  flag.3, tmp48
	sts flag,r24	 ;  flag, flag.3
	.stabn	68,0,44,.LM3-.LFBB2
.LM3:
	tst r24	 ;  flag.3
	breq .L3	 ; ,
.LBB3:
	.stabn	68,0,20,.LM4-.LFBB2
.LM4:
/* #APP */
 ;  20 "sketch_feb19a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	999: .string "true"
	.popsection
 ;  0 "" 2
/* #NOAPP */
	rjmp .L5	 ;                   <<<<<<<<<<<<<<<<<<<<< bad code !
.L3:
.LBE3:
.LBB4:
	.stabn	68,0,36,.LM5-.LFBB2
.LM5:
/* #APP */
 ;  36 "sketch_feb19a.ino" 1
	.pushsection .progmem.data, "SM", @progbits, 1
	999: .string "false"
	.popsection
 ;  0 "" 2
/* #NOAPP */
.L5:
	.stabn	68,0,42,.LM6-.LFBB2
.LM6:
/* #APP */
 ;  42 "sketch_feb19a.ino" 1
	ldi r24, lo8(999b)	 ;  iftmp.4
	ldi r25, hi8(999b)	 ;  iftmp.4
 ;  0 "" 2
/* #NOAPP */
.LBE4:
	.stabn	68,0,44,.LM7-.LFBB2
.LM7:
	sts result+1,r25	 ;  result, iftmp.4
	sts result,r24	 ;  result, iftmp.4
	ret
.LBE2:
	.size	loop, .-loop

note how in one case case it swapped the true and false blocks, and alose does the test of the flag at .LM3 in both case differently.

:frowning:

Thanks for all your efforts Michael and Darryl. This is a useful thing to get working properly. 8) Karma to both of you!

Minor problem:

#define m "This is a macro defined string"

PSTR(m);

doesn't work as expected, because the C++ preprocessor doesn't expand the str in the asm code. The assembler sees

.string m

instead of

.string "This is a macro defined string"

Easy fix:

#define PSTR(M) PSTRA(M) // invoke asm version indirectly, allows for macro expansion of M by preprocessor

#define PSTRA(str) \
  (__extension__({ \
    PGM_P ptr;  \
    asm volatile \
    ( \
      ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t" \
      "999: .string " #str                               "\n\t" \
      ".popsection"                                      "\n\t" \
    ); \
    asm volatile \
    ( \
      "ldi %A0, lo8(999b)"                               "\n\t" \
      "ldi %B0, hi8(999b)"                               "\n\t" \
      : "=d" (ptr) \
    ); \
    ptr; \
  }))

BTW, I'm using the 999: local label fix in the above because as I understand the recent discussion, the 0: local label optimisation was causing problems even without the Os option, and so is still needed..

Is this still the current thinking?

-Mark

yes the 999: label is better, but on my experimental version of later GCC , its still not working.

i've been heading into docs on the optimizing, and although the -O2 and -Os options I tried come back with different code ( even at the low level the individual options are set the same, which should give the same output it doesn't :frowning:

I'm looking into the tree_pass optimizing, and in particular the the option -ftree-tail-merge, which looks for identical code between different branches, and I think this is somehow used differently from -O2 & -O3 and the broken -Os in my tool chain.

Darryl