Go Down

Topic: PSTR() flash string de-duplication -- a problem, a solution, and a question (Read 9155 times) previous topic - next topic

Michael Buschbeck

My current Arduino project uses a lot of short constant strings.

I'm using the F() macro to place them in flash memory, and since C++ is generally smart enough to put only a single copy of each distinct string literal into the resulting executable, I even split them into smaller tokens to conserve flash memory by providing more opportunity to merge identical (shorter) strings.

And then I looked into the compiled ELF file and found that no flash string merging happened. At all. Duh.

Of course, there is at least one obvious workaround for that problem (which is to create a string table where each distinct string occurs only once), but I found that unsatisfactory, so I set out to find a solution that allows me to keep my F("foo") string literals in place and still get only one copy of each distinct string literal in flash memory.

...and I'm happy to say that I found one. :-) Here's a write-up of my journey (for your amusement) that ends with a drop-in replacement for the PSTR() macro:

http://michael-buschbeck.github.io/arduino/2013/10/20/string-merging-pstr/

And now, the question(s).

  • My PSTR() replacement seems to work fine (my code works, including all Arduino library code it uses). I think that the solution I'm proposing is reasonably robust, but I'd really like a second opinion. What do you think -- is this a viable solution? Does it have unwanted side effects that I'm not seeing yet?

  • Assuming that solution is viable, do you think it'd be worthwhile to approach someone (perhaps the avr-libc guys?) with a patch?


pico

Michael,

Well done with this detective work and problem solving! Your write-up is also a nice read -- nice work all around.

After a careful read through, I can't see any problems with what you propose, and I will be interested in trying it out to see how it works in practice. I'm not an expert in this area by means, however -- but another regular forum contributor, MichaelMeissner, does have an extensive compiler background, and in particular has worked on gcc, so he would be an excellent person to ask about how to go about proposing your patch as a candidate for inclusion in the general releases.  I've sent him a PM to alert him to this thread, but I wouldn't hesitate to contact him directly -- very nice (as well as knowledgeable) fellow.



WiFi shields/Yun too expensive? Embeddedcoolness.com is now selling the RFXduino nRF24L01+ <-> TCP/IP Linux gateway: Simpler, more affordable, and even more powerful wireless Internet connectivity for *all* your Arduino projects! (nRF24L01+ shield and dev board kits available too.)

Michael Buschbeck

Thank you :-) -- I appreciate your feedback.

One additional thing I forgot to mention is that the modified PSTR() macro also gets rid of those misleading "only initialized variables can be placed into program memory area" compiler warnings that usually accompany each and every F() occurrence. That wasn't what I was aiming for, but it's definitely a bonus.

darryl


using GCC 4.8.0  I applied your suggested fix,  and I've come across a line it doesnt like to compile

Code: [Select]

Serial.println(F("%"));


of course this snippet needs other code, but the percent symbol doesn't expand out correctly.
--
 Darryl

Michael Buschbeck


using GCC 4.8.0  I applied your suggested fix,  and I've come across a line it doesnt like to compile

Code: [Select]

Serial.println(F("%"));


I can reproduce that: GCC tells me "error: invalid 'asm': invalid %-code". Good catch. I'll look into that.

Michael Buschbeck

The good news is that GCC's asm only attempts to interpret "%-codes" if any outputs are declared.

Of course, I do need outputs to get the string address out, but I can split the asm statement into two consecutive ones, the first one defining the string, the second one getting its address.

There's a problem with that, too, though, because now I can't use the "%=" construct to generate a unique label for the string. (And "%=" would resolve to different values in those two asm blocks anyway.) But fortunately, the assembler has a concept of "local labels" that seem to work even in my case.

Long story short, here's an updated version that works even in the F("%") case:

Code: [Select]

#define PSTR(str) \
  (__extension__({ \
    PGM_P ptr;  \
    asm volatile \
    ( \
      ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t" \
      "0: .string " #str                                 "\n\t" \
      ".popsection"                                      "\n\t" \
    ); \
    asm volatile \
    ( \
      "ldi %A0, lo8(0b)"                                 "\n\t" \
      "ldi %B0, hi8(0b)"                                 "\n\t" \
      : "=d" (ptr) \
    ); \
    ptr; \
  }))


Side note: I tested this with some statements like these:

Code: [Select]

Serial.print(F("%"));
Serial.print(F("%%"));
Serial.print(F("%%%"));


and it turns out the linker actually cleverly merges those three strings into each other. In the ELF file, only "%%%" remains, and "%" and "%%" end up pointing into that string at the appropriate offset.

darryl

yep, that's working for me nicely.  ( tested on 4.80 & 4.7.2 and the standard windows 4.3.2 ? )

have some karma. great to see that optimization working as it should of done all along :-)

Darryl
--
 Darryl

Michael Buschbeck

Many thanks for the karma, and I'm happy it works better now. :-)

I've made a follow-up post with the improved version:
http://michael-buschbeck.github.io/arduino/2013/10/22/string-merging-pstr-percent-codes

MichaelMeissner

The real solution is to get the compiler and binutils (linker) to do this automatically.  However, given the Arduino group remains at a 4 or 5 year old release (4.3.2), which is 5 GCC releases out of date (4.8.1 is the current released version), it is problematical to get it fixed in that area.

As you discovered, using the asm interface is problematical due to the '%' characters.

You might see if the linker used in the Arduino release supports mergeable sections.  Here is a fragment from the assembler's .section directive:

Code: [Select]

7.96 `.section NAME'
====================

Use the `.section' directive to assemble the following code into a
section named NAME.

   This directive is only supported for targets that actually support
arbitrarily named sections; on `a.out' targets, for example, it is not
accepted, even with a standard `a.out' section name.

ELF Version
-----------

   This is one of the ELF section stack manipulation directives.  The
others are `.subsection' (*note SubSection::), `.pushsection' (*note
PushSection::), `.popsection' (*note PopSection::), and `.previous'
(*note Previous::).

   For ELF targets, the `.section' directive is used like this:

     .section NAME [, "FLAGS"[, @TYPE[,FLAG_SPECIFIC_ARGUMENTS]]]

   The optional FLAGS argument is a quoted string which may contain any
combination of the following characters:

`a'  section is allocatable
`w' section is writable
`x'  section is executable
`M' section is mergeable
`S'  section contains zero terminated strings
`G' section is a member of a section group
`T'  section is used for thread-local-storage

The optional TYPE argument may contain one of the following constants:

`@progbits'  section contains data
`@nobits'  section does not contain data (i.e., section only occupies space)
`@note'  section contains data which is used by things other than the program
`@init_array' section contains an array of pointers to init functions
`@fini_array' section contains an array of pointers to finish functions
`@preinit_array' section contains an array of pointers to pre-init functions

Many targets only support the first three section types.

   If FLAGS contains the `M' symbol then the TYPE argument must be
specified as well as an extra argument--ENTSIZE--like this:

     .section NAME , "FLAGS"M, @TYPE, ENTSIZE

   Sections with the `M' flag but not `S' flag must contain fixed size
constants, each ENTSIZE octets long. Sections with both `M' and `S'
must contain zero terminated strings where each character is ENTSIZE
bytes long. The linker may remove duplicates within sections with the
same name, same entity size and same flags.  ENTSIZE must be an
absolute expression.  For sections with both `M' and `S', a string
which is a suffix of a larger string is considered a duplicate.  Thus
`"def"' will be merged with `"abcdef"';  A reference to the first
`"def"' will be changed to a reference to `"abcdef"+3'.

   If FLAGS contains the `G' symbol then the TYPE argument must be
present along with an additional field like this:

     .section NAME , "FLAGS"G, @TYPE, GROUPNAME[, LINKAGE]

   The GROUPNAME field specifies the name of the section group to which
this particular section belongs.  The optional linkage field can
contain:
`comdat'
     indicates that only one copy of this section should be retained

`.gnu.linkonce'
     an alias for comdat

   Note: if both the M and G flags are present then the fields for the
Merge flag should come first, like this:

     .section NAME , "FLAGS"MG, @TYPE, ENTSIZE, GROUPNAME[, LINKAGE]

   If no flags are specified, the default flags depend upon the section
name.  If the section name is not recognized, the default will be for
the section to have none of the above flags: it will not be allocated
in memory, nor writable, nor executable.  The section will contain data.

Michael Buschbeck

#9
Oct 22, 2013, 07:19 pm Last Edit: Oct 22, 2013, 07:22 pm by Michael Buschbeck Reason: 1

The real solution is to get the compiler and binutils (linker) to do this automatically.


I tend to agree, and in fact, avr-g++ incorporated a PROGMEM-specific fix regarding the -fmerge-constants compiler option in 4.7.0.

That seemed to be exactly what I wanted, so I tried that first (with GCC 4.7.0 and 4.8.0) -- and with the right combination of compiler switches (-fno-data-sections, as somewhat obliquely implied by that bugzilla comment I linked to above, and of course -fmerge-constants or -fmerge-all-constants), I actually managed to get avr-g++ to merge strings defined through the regular PSTR() macro -- but only for global variables and static locals in non-member functions.

I absolutely couldn't get avr-g++ to merge strings in static locals defined in member functions (even static member functions), which is unfortunate because that's precisely where I need them. And I think that's not a bug but rather an unfortunate consequence of a design choice and essential C++ semantics, so it's basically unfixable.

(I'm out of my depth where compiler and linker internals are concerned, so the following may be completely off the mark. It's just my attempt to make sense of the things I've observed.)

I think the basic underlying problem is the design choice that attribute((progmem)) needs to be attached to a variable -- everything that follows is just an unavoidable chain of consequences:

  • attribute((progmem)) must be attached to a variable, so PSTR() must use a (static local) variable;

  • static local variables in class methods have non-trivial linking semantics across translation units (because the logically same static local might conceivably be defined in several translation units when methods are inlined), so the compiler must emit weak symbols for them;

  • the linker won't merge strings referenced through weak symbols, presumably because it can never know if any weakly-referenced string will actually end up being used.


So I wouldn't even know what to suggest to the avr-g++ makers -- short of scrapping the entire attribute((progmem)) approach and adding some built-in way that comes close to my asm-based workaround, which just puts the string literal directly in the right kind of section. :-/

darryl

bad news, found a case where 4.7.2 and 4.8.0 dont work,  they are missing some of the code from the newer macro.

Code: [Select]

bool test = false;

void setup() {
 // put your setup code here, to run once:
 Serial.begin(115200);
}

void loop() {
 // put your main code here, to run repeatedly:
   test = !test;
   test ? Serial.print(F("true")) : Serial.print(F("false"));

 // you will want a delay, if actually uploading this code onto an AVR board
 delay(100);
}


produces

Code: [Select]

 
   
.Lscope8:
.section .text.loop,"ax",@progbits
.stabs "loop:F(0,4)",36,0,0,loop
.global loop
.type loop, @function
loop:
.stabs "sketch_feb17c.ino",132,0,0,.Ltext6
.Ltext6:
.stabn 68,0,10,.LM21-.LFBB9
.LM21:
.LFBB9:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
.LBB23:
.stabn 68,0,12,.LM22-.LFBB9
.LM22:
lds r24,test
ldi r25,lo8(1)
eor r24,r25
sts test,r24
.stabn 68,0,13,.LM23-.LFBB9
.LM23:
tst r24
breq .L16
.LBB24:
.stabn 68,0,13,.LM24-.LFBB9
.LM24:
/* #APP */
;  13 "sketch_feb17c.ino" 1
.pushsection .progmem.data, "SM", @progbits, 1
0: .string "true"
.popsection

;  0 "" 2
/* #NOAPP */
rjmp .L17
.L16:
.LBE24:
.LBB25:
.stabn 68,0,13,.LM25-.LFBB9
.LM25:
/* #APP */
;  13 "sketch_feb17c.ino" 1
.pushsection .progmem.data, "SM", @progbits, 1
0: .string "false"
.popsection

;  0 "" 2
/* #NOAPP */
.L17:
/* #APP */
;  13 "sketch_feb17c.ino" 1
ldi r22, lo8(0b)
ldi r23, hi8(0b)

;  0 "" 2
/* #NOAPP */
.LBE25:
ldi r24,lo8(Serial)
ldi r25,hi8(Serial)
jmp _ZN5Print5printEPK19__FlashStringHelper
.LBE23:


in the case of the true string ( defined at .L24 ) we dont have any lo8 and hi8 access to the address in .L16

the false string at .L25  and its .L17  do produce code that makes it to being assembled.

any thoughts ?

I'm guessing the compiler is trying to be too clever and just above .L16 has the relative jump to .L17 thinking that will load r22 & r23 correctly, but they wont have the right values, thus you only get one string being output :-(

--
 Darryl

darryl

okay,  have fixed this now.

instead of having two __asm__ lines in the PSTR define.  put them into one. and the optimiser takes them verbatim ( as marked by /* APP */ to /*NOAPP */ in one hit, and the pointer loads to the string dont get omitted.

this seems to work for me, but again, keep an eye on the outputs.

Code: [Select]

#define PSTR(str)                                                  \
 (__extension__( {                                                \
   PGM_P ptr;                                                     \
   __asm__ __volatile__ (                                         \
     ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"    \
     "0: .string " #str                                 "\n\t"    \
     ".popsection"                                      "\n\t"    \
     "ldi %A0, lo8(0b)"                                 "\n\t"    \
     "ldi %B0, hi8(0b)"                                 "\n\t"    \
     : "=d" (ptr) );                                              \
   ptr;                                                           \
 } \
 ))
#endif



which now gives

Code: [Select]

.LM24:
/* #APP */
;  13 "sketch_feb18a.ino" 1
.pushsection .progmem.data, "SM", @progbits, 1
0: .string "true"
.popsection
ldi r22, lo8(0b)
ldi r23, hi8(0b)

;  0 "" 2
/* #NOAPP */
rjmp .L17
.L16:
.LBE24:
.LBB25:
.stabn 68,0,13,.LM25-.LFBB9
.LM25:
/* #APP */
;  13 "sketch_feb18a.ino" 1
.pushsection .progmem.data, "SM", @progbits, 1
0: .string "false"
.popsection
ldi r22, lo8(0b)
ldi r23, hi8(0b)

;  0 "" 2
/* #NOAPP */
.L17:

--
 Darryl

darryl

which of course, is back to the original code solution, which fails on the %  case  :-(
--
 Darryl

Michael Buschbeck

Oh my. That's a headscratcher...

Thanks for the reduced test case.

I'm still not totally clear about the actual mechanism that makes this bug come to pass -- the compiler should defer any kind of optimization to long after the local "0:" labels have been resolved to an actual unique generated symbol, which in turn should prevent any kind of conflation of the successive asm blocks that contain the "0b" reference because they'd refer to different actual symbols.

Plus, I'd expect asm volatile to prevent any kind of optimization of that kind anyway. Even if those two consecutive pairs of ldi statements do the same thing in the compiler's opinion, volatile should make it include those statements anyway; but the first asm block's ldi statements are nowhere to be seen. That's weird.

I'll play around with that myself a bit.

Michael Buschbeck

Here's a possible workaround: Replace the "0:" label and the "0b" backreferences with "999:" and "999b", respectively. (Any number greater than 9 should work just as well.)

When I did that, the disassembly suddenly started showing two distinct pairs of ldi statements (with different offsets) where the original (using the "0:" label) hadn't. The idea of giving this change a try was inspired by the following statement from the relevant GNU Assembler docs:

Quote

It is also worth noting that the first 10 local labels (0:...9:) are implemented in a slightly more efficient manner than the others.


It seems that "slightly more efficient" also implies "slightly buggy" in our particular case here.

Here's the full repro case:

Code: [Select]

void setup() {}

bool flag = false;
volatile PGM_P result;

void loop()
{
  flag = !flag;

  result =
    flag
      ? (__extension__({
          PGM_P ptr;
          asm volatile
          (
            ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"
            "999: .string " "\"false\""                        "\n\t"
            ".popsection"                                      "\n\t"
          );
          asm volatile
          (
            "ldi %A0, lo8(999b)"                               "\n\t"
            "ldi %B0, hi8(999b)"                               "\n\t"
            : "=d" (ptr)
          );
          ptr;
        }))
      : (__extension__({
          PGM_P ptr;
          asm volatile
          (
            ".pushsection .progmem.data, \"SM\", @progbits, 1" "\n\t"
            "999: .string " "\"true\""                         "\n\t"
            ".popsection"                                      "\n\t"
          );
          asm volatile
          (
            "ldi %A0, lo8(999b)"                               "\n\t"
            "ldi %B0, hi8(999b)"                               "\n\t"
            : "=d" (ptr)
          );
          ptr;
        }));
}


Here's the corresponding disassembly of the loop() method -- note the two distinct ldi pairs at offsets ae and b4 that refer to string offsets 104 and 110, respectively:

Code: [Select]

0000009e <loop>:
  9e: 80 91 02 01 lds r24, 0x0102
  a2: 91 e0        ldi r25, 0x01 ; 1
  a4: 89 27        eor r24, r25
  a6: 80 93 02 01 sts 0x0102, r24
  aa: 88 23        and r24, r24
  ac: 19 f0        breq .+6      ; 0xb4 <loop+0x16>
  ae: 88 e6        ldi r24, 0x68 ; 104
  b0: 90 e0        ldi r25, 0x00 ; 0
  b2: 02 c0        rjmp .+4      ; 0xb8 <loop+0x1a>
  b4: 8e e6        ldi r24, 0x6E ; 110
  b6: 90 e0        ldi r25, 0x00 ; 0
  b8: 90 93 01 01 sts 0x0101, r25
  bc: 80 93 00 01 sts 0x0100, r24
  c0: 08 95        ret


For comparison, the disassembly when "0:" is used as the local label instead of "999:" -- note that there's only a single ldi pair at offset b0, as you had shown already:

Code: [Select]

0000009e <loop>:
  9e: 80 91 02 01 lds r24, 0x0102
  a2: 91 e0        ldi r25, 0x01 ; 1
  a4: 89 27        eor r24, r25
  a6: 80 93 02 01 sts 0x0102, r24
  aa: 88 23        and r24, r24
  ac: 09 f0        breq .+2      ; 0xb0 <loop+0x12>
  ae: 00 c0        rjmp .+0      ; 0xb0 <loop+0x12>
  b0: 8e e6        ldi r24, 0x6E ; 110
  b2: 90 e0        ldi r25, 0x00 ; 0
  b4: 90 93 01 01 sts 0x0101, r25
  b8: 80 93 00 01 sts 0x0100, r24
  bc: 08 95        ret

Go Up