Why I hate __attribute__((packed))

Once again I have seen posts promoting use of __attribute__((packed)) on a struct. I can't stand it anymore, so I have written this rambling post on why I think that is a bad solution.

The packed attribute is commonly used when a poorly designed structure is shared between processors with different bus widths. Lets see how this could happen.

Say you are building a small system where you have a couple of UNOs talking to a MEGA 2560. You build a struct to hold the data something like this. You arranged the data in order you thought of it and the order it is used in your code.

struct StructData {
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
  int32_t someData;
};

Things have been working fine for a year and now you decide to upgrade the MEGA to an ESP32 to add additional functions. You immediately start getting errors on the data field. You post the problem on the internet and get the response (from ChatGPT??) to sprinkle the magic words _attribute__((packed)) on your structure and things will be fine. You do that and things are working. Problem solved.

struct __attribute__((packed)) StructData {
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
  int32_t someData;
};

Now lets see what actually happened. We write a little test program to see what the different processors do.

struct StructData {
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
  int32_t someData;
};

StructData struct1;

void changeStruct(int32_t x){
  struct1.someData += x;
}
void setup() {

  Serial.begin(115200);
  Serial.print("size of StructData ");
  Serial.println(sizeof(StructData));
  Serial.print("offset to StructData.someData ");
  Serial.println(offsetof(StructData ,someData)); 

  changeStruct(100);
  // do something with data so it doesn't get optimized out
  Serial.println( struct1.someData);
 
}

void loop() {

}

On the UNO we get
size of StructData 6
offset to StructData.someData 2
100

On the ESP32 we get
size of StructData 8
offset to StructData.someData 4
100

Hmm, it seems that with structures what you see may not be what you get.The UNO and MEGA2560 are 8 bit processors so data can be byte aligned.

The ESP32 is a 32 bit processor and requires 4 byte alignment for 32 bit data so the compiler added a hidden 2 byte pad to get the correct alignment. Something like this.

struct __attribute__((packed)) StructData {
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
  char hiddenPad[2]; // added bytes for alignment (not actually named) 
  int32_t someData;
};

Sprinkling the magic __attribute__((packed)) gives us the following. Where the pad bytes are removed from the structure.

ESP32 with __attribute__((packed))
size of StructData 6
offset to StructData.someData 2
100

Both processors now think the data is in the same place so you get the expected result.

A better solution would be to rearrange the struct so that the data has the correct alignment. Something like this where the alignment is the same on both processors.

struct StructData {
  int32_t someData;
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
};

But how did the ESP32 suddenly handle unaligned data? The magic words didn't suddenly change the processor. Well lets take a look at what actually happens. Generate another test program with 2 structs where the data is correctly aligned and add __attribute__((packed)) to one of them and see what code is generated.

struct StructData {
  int32_t someData;
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
};

struct __attribute__((packed)) StructDataPack {
  int32_t someData;
  byte NodeID;       // ID of transmitting node
  byte seq;         // sequence number
};

StructData struct1;
StructDataPack structP;

// a function to reference the data 
void changeStruct(int32_t x){
  structP.someData += x;
  struct1.someData += x;
}
void setup() {

  Serial.begin(115200);
  changeStruct(100);
  // do something with data so it doesn't get optimized out
  Serial.println( structP.someData);
  Serial.println( struct1.someData);

}

void loop() {
  
}

Some of the generated code where the fields are used.

void changeStruct(int32_t x){
   0:	004136        	entry	a1, 32
  structP.someData += x;
   3:	000091        	l32r	a9, fffc0004 <_Z12changeStructi+0xfffc0004>
   6:	0109a2        	l8ui	a10, a9, 1
   9:	000982        	l8ui	a8, a9, 0
   c:	11aa80        	slli	a10, a10, 8
   f:	20aa80        	or	a10, a10, a8
  12:	020982        	l8ui	a8, a9, 2
  15:	118800        	slli	a8, a8, 16
  18:	20a8a0        	or	a10, a8, a10
  1b:	030982        	l8ui	a8, a9, 3
  1e:	018880        	slli	a8, a8, 24
  21:	2088a0        	or	a8, a8, a10
  24:	882a      	add.n	a8, a8, a2
  26:	74a880        	extui	a10, a8, 8, 8
  29:	004982        	s8i	a8, a9, 0
  2c:	0149a2        	s8i	a10, a9, 1
  2f:	75a080        	extui	a10, a8, 16, 8
  32:	758880        	extui	a8, a8, 24, 8
  35:	0249a2        	s8i	a10, a9, 2
  38:	034982        	s8i	a8, a9, 3
  struct1.someData += x;
  3b:	000091        	l32r	a9, fffc003c <_Z12changeStructi+0xfffc003c>
  3e:	0988      	l32i.n	a8, a9, 0
  40:	882a      	add.n	a8, a8, a2
  42:	0989      	s32i.n	a8, a9, 0
}
  44:	f01d      	retw.n

The compiler writers take the __attribute__((packed)) to mean "I screwed up, the data may not be aligned please fix it".

The fix for data that may not be aligned is to generates code to do the unaligned 32 bit load and store a byte at a time. It is not so magic. The packed case take almost 5 times as many instructions as the unpacked case.

In addition the packed case has only 1 short form instruction where in the unpacked case 3 of the 4 instructions are short form.

So __attribute__((packed)) leads to poor design, code bloat and potential performance problems.

By the way if you are using an ARM M0 or M0+ (Pi pico and some NANOs for example) the same thing applies.

void changeStruct(int32_t x){
  structP.someData += x;
   0:	4a0b      	ldr	r2, [pc, #44]	@ (30 <_Z12changeStructl+0x30>)
   2:	7853      	ldrb	r3, [r2, #1]
   4:	7811      	ldrb	r1, [r2, #0]
   6:	021b      	lsls	r3, r3, #8
   8:	430b      	orrs	r3, r1
   a:	7891      	ldrb	r1, [r2, #2]
   c:	0409      	lsls	r1, r1, #16
   e:	4319      	orrs	r1, r3
  10:	78d3      	ldrb	r3, [r2, #3]
  12:	061b      	lsls	r3, r3, #24
  14:	430b      	orrs	r3, r1
  16:	181b      	adds	r3, r3, r0
  18:	0a19      	lsrs	r1, r3, #8
  1a:	7013      	strb	r3, [r2, #0]
  1c:	7051      	strb	r1, [r2, #1]
  1e:	0c19      	lsrs	r1, r3, #16
  20:	0e1b      	lsrs	r3, r3, #24
  22:	7091      	strb	r1, [r2, #2]
  24:	70d3      	strb	r3, [r2, #3]
  struct1.someData += x;
  26:	4a03      	ldr	r2, [pc, #12]	@ (34 <_Z12changeStructl+0x34>)
  28:	6813      	ldr	r3, [r2, #0]
  2a:	181b      	adds	r3, r3, r0
  2c:	6013      	str	r3, [r2, #0]
}
  2e:	4770      	bx	lr

is to use data types that ensure that they are aligned.

For instance, when passing an unsigned
integer value from a Uno to an ESP32 use uint32_t as the data type in the code on both sides

Wrong - on the UNO only byte alignment is required. That is the whole point of my rant. See the comment above.

On the UNO we get
size of StructData 6
offset to StructData.someData 2
100

Run the code to see what you get.

Have it your own way, but you are the one that hates __attribute__((packed)) because it leads to

So why use it when you don't need to ?

Of course, in the first place passing any kind of structs between processors is also poor design. Probably happens less often now, but some CPUs are big endian, so your "better solution" also fails.

Real comms protocols serialize and deserialize data into a well defined frame structure, which is defined independently of target specific representation. That's extra work, so most amateurs will pass a struct as a quick and dirty method. But if you are going to be quick and dirty, you may as well use __attribute__((packed)).

That is my point You should define your structures so you don't need to. I haven't used it in decades.

Then we agree

Network byte order is a definition of how the bytes are ordered for transmission. There are C functions that translate between host and network order which can minimize the work on the programmer.

Many processors have instructions to swap between big and little endian which minimizes the overhead.

<nostalgia alert>
I haven't had to worry about that for over 15 years. I was passing data between a POWER processor and an AMD Opteron.
</nostalgia alert>

Nice, in theory. I see "packed" used mostly when a program data structure has to map directly only some external format over which I have no control, like a set of peripheral registers, or a network packet. The compiler can handle alignment issues, or I can do it myself, but having it insert padding results in unexpected and buggy behavior that is essentially silent.
(Yeah, this frequently ends up with the structure creators explicitly padding their structures:

   uint8_t myregister;
   uint8_t _PADDING1[3];

So what? That's what SHOULD be done to explicitly describe the hardware...)

Yes that is a problem. In that case I would hold my nose and just code it. :grinning:

This I agree with. You're asking for trouble assuming anything about how the data is internally represented.

And then most pros find that the whole serialization and de- make their device slower than they want it to be, and go back to dealing with the raw data stream as it exists.

When was the last time you saw ASN.1 or SUN XDR used outside of their original contexts? And I'm pretty sure I've seen a few IETF working group announcements recently for "lightweight data formats." (of course, XML and JSON are popular, when speed is less important. And they're about as "well defined" as you can get.)

This thread bakes up the thesis of Donald Knuth, author of The Art of Computer Programming:
“People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like.
Otherwise the programs they write will be pretty weird.”