[SOLVED] Split uint-32 to bytes

Hello,
I'll be using a Mega2560 on a project. I need the 2560 because of the number of I/O pins.
I need to output the 12 MSBs of a calculated uint32 to ports. For example, bits 31-24 go to PORTA and bits 23-20 go to the lower half (LSB side) of PORTC.
So I have

unsigned long product=0;

...
product=0x1A2B3C4D; //for test purposes
...

PORTA=product>>24; // get the 8 MSBs, bits 24-31

PORTC=(product>>20)&0x0F; // shift bits 20-27 down, then zero out 24-27 so only bits 20-23 remain

There are 3 such cases. 24+20=44 shifts x 3 = 132 clock ticks = (well, if correct) 8.25 usec at 16 MHz. That's a long time for this case.
So I wonder, is there anything faster?

thank you

demkat1:
There are 3 such cases. 24+20=44 shiftsX3=132 clock ticks=(well, if correct) 8.25usec.

Are we discussing hypothetical performance?

Or, have you actually measured something?

Or, have you inspected a disassembly?

That's a long time for this case.

You did not mention what would be a "short time". Or an "acceptable time".

Have you determined that an AVR processor running at 16 MHz is capable of doing what you want to do?

... 12 Msbs ... calculated uint32 ...

That part strikes me as at least an order of magnitude more expensive than the bit shifts. Are you optimizing the wrong code?

Use a union.

union
{
	struct
	{
		uint8_t byte1;
		uint8_t byte2;
		uint8_t byte3;
		uint8_t byte4;
	};
	uint32_t longint;
} sample;

AVR-GCC packs the struct LSB to MSB so byte4 would be [31:24] down to byte1 [7:0].

If you implement this union and then assign sample.longint = 2882343476, which is 0xABCD1234,
printing sample.byte4 ... sample.byte1 gives 171 (0xAB), 205 (0xCD), 18 (0x12) and 52 (0x34).

Your 32-bit number is broken down into its 4 byte-wide constituents with one assignment.
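
For anyone who wants to check the byte order off the board, here is a minimal sketch of the same union that also compiles on a PC. The struct member name `b` and the helper `split32` are my own additions for illustration, not part of the post above. It assumes a little-endian target (true for AVR and x86) and relies on GCC's documented support for reading a union member other than the one last written.

```cpp
#include <stdint.h>

// Same idea as the union above; the struct is named "b" (hypothetical name)
// so the code does not depend on the anonymous-struct extension.
union Split32 {
    struct {
        uint8_t byte1;  // bits [7:0]   on a little-endian target
        uint8_t byte2;  // bits [15:8]
        uint8_t byte3;  // bits [23:16]
        uint8_t byte4;  // bits [31:24]
    } b;
    uint32_t longint;
};

// Hypothetical helper: one assignment, then the four bytes are readable.
static inline Split32 split32(uint32_t value) {
    Split32 s;
    s.longint = value;
    return s;
}
```

On the Mega, the equivalent of the OP's first output would then be PORTA = split32(product).b.byte4, with no 24-bit shift in the source.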


DKWatson:
Use a union.

The compiler already does that. Optimally. Which is why that part of the compiler is referred to as the "optimizer".

Moving on, if you just want the 12 MSBs, trash the first 20 bits:

union
{
	struct
	{
		uint16_t garbage1;
		uint8_t garbage2: 4;
		uint8_t nibble1: 4;
		uint8_t byte4;
	};
	uint32_t longint;
} sample;

garbage2 and nibble1 each get 4 bits, with nibble1 being [23:20].
byte4 is still [31:24].
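
A small sketch to confirm which bits land where, using the OP's test value. The names `Top12`, `f`, `port_a_value` and `port_c_value` are hypothetical, chosen only for this example. Bit-field allocation order is implementation-defined; this assumes GCC on a little-endian target, where fields are filled starting from the least significant bit.

```cpp
#include <stdint.h>

// Bit-field version of the union: only the top 12 bits are of interest.
union Top12 {
    struct {
        uint16_t garbage1;      // bits [15:0], ignored
        uint8_t  garbage2 : 4;  // bits [19:16], ignored
        uint8_t  nibble1  : 4;  // bits [23:20]
        uint8_t  byte4;         // bits [31:24]
    } f;
    uint32_t longint;
};

static_assert(sizeof(Top12) == 4, "layout assumption: no padding");

// Hypothetical helpers returning what would be written to PORTA / PORTC.
static inline uint8_t port_a_value(uint32_t product) {
    Top12 t;
    t.longint = product;
    return t.f.byte4;
}

static inline uint8_t port_c_value(uint32_t product) {
    Top12 t;
    t.longint = product;
    return t.f.nibble1;
}
```

With product = 0x1A2B3C4D, port_a_value gives 0x1A and port_c_value gives 0x2, the same as product>>24 and (product>>20)&0x0F.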

If OP's code is taking 8.25us, something is not doing what it should.

My apologies. The 1.8.3 compiler is generating amazingly bad code. It is not even using the SWAP instruction to isolate the nibble. That's unexpected and disappointing.

LTO generated this for an unconditional local jump...

 2ec:	20 97       	sbiw	r28, 0x00	; 0
 2ee:	a9 f2       	breq	.-86     	; 0x29a <main+0xfc>

...instead of a simple relative jump.

That's also disappointing.


This was a small improvement but there is still dead code and unnecessary looping (the bulk of the trouble remains)...

  uint8_t va = product>>24;
  uint8_t vc = product>>20&0x0F;
  PORTA = va;
  PORTC = vc;
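
Another variation that stays with plain shifts, offered only as a sketch: take the upper 16 bits first, then do the remaining work on 16-bit and 8-bit values. The struct `PortValues` and the function `top12_via_word` are hypothetical names. Whether the compiler actually produces shorter code for this than for the 32-bit shifts is an assumption that has to be checked by measuring or by reading the disassembly.

```cpp
#include <stdint.h>

// Hypothetical container for the two values destined for PORTA and PORTC.
struct PortValues {
    uint8_t a;  // bits [31:24] of the product
    uint8_t c;  // bits [23:20] of the product, in the low nibble
};

static inline PortValues top12_via_word(uint32_t product) {
    // A shift by 16 is a whole-byte move: keep only the upper two bytes.
    uint16_t top = (uint16_t)(product >> 16);

    PortValues v;
    v.a = (uint8_t)(top >> 8);           // upper byte of the word
    v.c = (uint8_t)((top >> 4) & 0x0F);  // high nibble of the lower byte
    return v;
}
```

The results are identical to the two original expressions, so this can be swapped in and timed without touching the rest of the calculation.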

@CodingBadly : I understand the limits of the 2560.

FYI, the calculation is a product of two trigonometric values (the sines of 2 angles). I do not use float because it is slow. First I "mapped" 0..1 to 0..255 using int(255*sin(x)) (so the max product is 255*255 = 65025, which fits in a uint16) and put the values for all angles of interest in an array. Working this way, the whole main calculation runs in about 16 usec, but there is some error because of the low resolution of the "mapping". Then I used uint16 table values, so the product becomes a uint32 and... you know what happens. I think it's a disaster to lose 8 usec in shifting (compared with the time for the rest of the calculation). That's the story (and of course "those are the limits" is accepted).

@DKWatson : I'll give your proposal a try and measure.

As good as it gets...

typedef union
{
  uint32_t v;

  struct 
  {
    unsigned v00:4;
    unsigned v01:4;
    unsigned v10:4;
    unsigned v11:4;
    unsigned v20:4;
    unsigned v21:4;
    unsigned v30:4;
    unsigned v31:4;
  }
  as_nibbles;

  uint8_t as_bytes[4];
}
uint32_split_t, *uint32_split_p;

static uint32_split_t product;

void setup( void ) 
{
  product.v = 0x1A2B3C4D; //for test purposes
}

void loop( void )
{
  ++product.v;
/**/
  // uint8_t va = product>>24; // get the 8 MSBs, bits 24-31 // (uint8_t)((product>>24) & 0xFF);
  // uint8_t vc = product>>20&0x0F; // shift bits 20-27 down, then zero out 24-27
  uint8_t va = product.as_bytes[3];
  uint8_t vc = product.as_nibbles.v21;
  PORTA = va;
  PORTC = vc;
/**/
}

I may have gotten the wrong nibble / byte. Correct as needed.
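
One way to settle the nibble / byte question is to compare the union fields against the shift-and-mask expressions for a few values. A minimal sketch, assuming GCC on a little-endian target; the function name `split_matches_shift` is hypothetical.

```cpp
#include <stdint.h>

typedef union
{
  uint32_t v;

  struct
  {
    unsigned v00:4;  // bits [3:0]
    unsigned v01:4;  // bits [7:4]
    unsigned v10:4;  // bits [11:8]
    unsigned v11:4;  // bits [15:12]
    unsigned v20:4;  // bits [19:16]
    unsigned v21:4;  // bits [23:20]
    unsigned v30:4;  // bits [27:24]
    unsigned v31:4;  // bits [31:28]
  }
  as_nibbles;

  uint8_t as_bytes[4];
}
uint32_split_t;

// Hypothetical check: true when the union view agrees with the shifts.
static bool split_matches_shift(uint32_t value)
{
  uint32_split_t s;
  s.v = value;
  return s.as_bytes[3] == (uint8_t)(value >> 24)
      && s.as_nibbles.v21 == (uint8_t)((value >> 20) & 0x0F);
}
```

Under those assumptions as_bytes[3] and v21 are the right picks for bits 31-24 and 23-20.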

There are still two redundant / unnecessary loads from memory. I cannot get the compiler to drop them. The optimizer appears to have gone insane.

Now I'm at the lab and did all the respective measurements.
The main procedure, just the calculation as it was formed yesterday, runs in exactly 19.65 usec.
Adding only ONE output:

a. 24+20 shifts: total time jumps to 29.98 usec.
b. First union approach, just split to bytes and output 2 whole ports: total 20.03 usec.
c. Second union approach, get and output 8+4 bits: total 20.16 usec.
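
To put the three totals in terms of what the output step itself costs, subtract the 19.65 usec baseline and convert to clock cycles. A small sketch of that arithmetic, assuming the 16 MHz clock mentioned earlier in the thread; the names are hypothetical.

```cpp
#include <cmath>

static const double kBaselineUs = 19.65;  // calculation only, from the measurements
static const double kCpuMhz     = 16.0;   // assumed clock: 16 cycles per usec

// Time added by the output step, in usec.
static double overhead_us(double total_us) {
    return total_us - kBaselineUs;
}

// The same overhead expressed in CPU cycles, rounded to the nearest cycle.
static long overhead_cycles(double total_us) {
    return std::lround(overhead_us(total_us) * kCpuMhz);
}
```

That works out to about 10.33 usec (roughly 165 cycles) for the shifts, 0.38 usec (about 6 cycles) for the first union and 0.51 usec (about 8 cycles) for the second.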

@CodingBadly : I had almost finished writing this reply when you last posted. I did not check its validity, but now I can handle it.

the power of "UNION"?

excellent
LOTS of thanks