People might ask: "Is memcpy() faster than a loop that copies single chars?" (especially when the data arrives in chunks of many bytes).
The answer is:
No, memcpy() can add a "penalty" (a performance decrease).
memcpy() is only faster if the buffers are aligned:
- BOTH buffers, src AND dst, must be 4-byte (32-bit word) aligned.
- If they are, memcpy() can copy one 32-bit word at a time (inside its own loop over the length).
- If just one buffer is NOT 32-bit word aligned, memcpy() has extra overhead to figure that out, and it ends up in a single-char copy loop.
- Whether the length is a multiple of 32-bit words does not really matter; the alignment does.
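The conditions above can be sketched in C. This is only a simplified model of the decision described in the list, not the code of any real C library (actual memcpy() implementations differ between libraries and CPUs). The names both_word_aligned(), copy_plan and plan_copy() are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1 if BOTH pointers sit on a 4-byte boundary, else 0. */
static int both_word_aligned(const void *dst, const void *src)
{
    return ((((uintptr_t)dst) | ((uintptr_t)src)) & 3u) == 0;
}

/* Hypothetical description of how a copy of n bytes would be split up. */
struct copy_plan {
    size_t words;  /* number of 32-bit word transfers */
    size_t bytes;  /* number of single-byte transfers */
};

/* Simplified model: word transfers are used only if both buffers are aligned. */
static struct copy_plan plan_copy(const void *dst, const void *src, size_t n)
{
    struct copy_plan plan;
    if (both_word_aligned(dst, src)) {
        plan.words = n / 4;  /* bulk of the data, 4 bytes per step */
        plan.bytes = n % 4;  /* leftover tail, at most 3 bytes */
    } else {
        plan.words = 0;      /* no word transfers at all */
        plan.bytes = n;      /* everything goes byte by byte */
    }
    return plan;
}
```

For a 127-byte chunk with both buffers aligned, this model gives 31 word transfers plus 3 single bytes. With one pointer off by even a single byte, it gives 127 single-byte transfers.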
While thinking about how to increase the speed of the Portenta H7 Web Client (using MbedClient.*), I have seen that memcpy() does not help to speed things up.
Why?
Let's assume you want to handle network receive traffic. It comes in bursts (fragments of TCP packets), so you loop until you have received everything. You want to handle the chunks via memcpy(), because you know the data comes in chunks (not single chars).
Example code:
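#include <string.h>  /* memcpy() */
/* assumes EXPECTED_LENGTH <= sizeof(rxBuffer) */
char rxBuffer[8092];
void ReceiveFromNetwork(void) {
    char chunkData[256];
    int len;
    int idx = 0;
    while (idx < EXPECTED_LENGTH) {
        /* read the next chunk; len is the number of bytes received */
        len = sock.recv(chunkData, sizeof(chunkData));
        if (len <= 0)
            break;  /* stop on error or closed connection */
        /* append the chunk to the receive buffer */
        memcpy(&rxBuffer[idx], chunkData, len);
        idx += len;
    }
    ...
}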
So sock.recv() reads N bytes, you copy the chunk to the other buffer, and you keep going until EXPECTED_LENGTH bytes have been received.
Let's assume sock.recv() always returns an available length (len) of 127 bytes. So you memcpy() 127 bytes each time.
The code is fine, it will work. But the performance is bad!
Why?
- Assume that rxBuffer[] and chunkData[] are not 32-bit word aligned: they are of type char[], so the compiler has no need to align them.
- memcpy() cannot use 32-bit word transfers if at least one of the two buffers is not 32-bit aligned.
- Even forcing the buffers to be aligned does not help, e.g. by declaring them as uint32_t or by using the GNU attribute __attribute__ ((aligned (4))):
    char rxBuffer[8092] __attribute__ ((aligned (4)));
  After the first chunk of 127 bytes, the next destination address for memcpy() is not aligned anymore (the chunk lengths are not multiples of 4 bytes).
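The alignment drift described in the last point can be checked with a few lines of C. This is a small sketch, assuming the receive buffer starts on a 4-byte boundary and every chunk has the same length (127 bytes in the example). dst_misalignment() and count_aligned_starts() are hypothetical helper names.

```c
#include <stddef.h>

/* Offset (0..3) of the destination from the previous 4-byte boundary
   when chunk number chunk_index (0-based) is appended. */
static size_t dst_misalignment(size_t chunk_len, size_t chunk_index)
{
    return (chunk_len * chunk_index) % 4;
}

/* How many of the first num_chunks chunks start at an aligned destination. */
static size_t count_aligned_starts(size_t chunk_len, size_t num_chunks)
{
    size_t count = 0;
    for (size_t i = 0; i < num_chunks; i++) {
        if (dst_misalignment(chunk_len, i) == 0)
            count++;
    }
    return count;
}
```

With 127-byte chunks the destination indices are 0, 127, 254, 381, 508, ..., which is 0, 3, 2, 1, 0, ... bytes past a word boundary. So only every fourth memcpy() call sees an aligned destination. With a chunk length that is a multiple of 4 (e.g. 128), every call would.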
memcpy() has to check on every call whether it can use 32-bit transfers (or how to get to an aligned address). If the buffers are not aligned, memcpy() ends up in a loop over single chars anyway, just like a hand-written loop.
And because memcpy() is called inside a loop here, it has to repeat this check in every iteration.
This makes it slower than a loop that handles single chars, like:
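/* assumes EXPECTED_LENGTH <= sizeof(rxBuffer) */
char rxBuffer[8092];
void ReceiveFromNetwork(void) {
    char chunkData[256];
    int len;
    int idx = 0;
    while (idx < EXPECTED_LENGTH) {
        int i;
        /* read the next chunk; len is the number of bytes received */
        len = sock.recv(chunkData, sizeof(chunkData));
        if (len <= 0)
            break;  /* stop on error or closed connection */
        /* copy byte by byte: no alignment check, no function call */
        for (i = 0; i < len; i++)
            rxBuffer[idx++] = chunkData[i];
    }
    ...
}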
In the end, the single-char copy loop will be faster than using memcpy(), even if it looks "strange" to handle everything as single chars (bytes).
The code works fine with memcpy() too, but on every loop iteration you add the (hidden) overhead of memcpy() just to let it figure out how to do the copy.
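Since the result depends on the compiler, its optimization flags, and the C library in use, it is worth measuring both variants on the actual target. Below is a minimal timing sketch in standard C, assuming fixed 127-byte chunks as in the example. The names fill_with_memcpy(), fill_with_byte_loop() and time_runs() are made up for this sketch. clock() is used only for portability; on the Portenta you would probably take timestamps with micros() instead. Note that an optimizing compiler may turn the byte loop into a memcpy() call on its own, so check the generated code if both variants take the same time.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define CHUNK_LEN  127u  /* bytes per simulated sock.recv() result */
#define NUM_CHUNKS 64u   /* chunks per simulated transfer */
#define TOTAL_LEN  (CHUNK_LEN * NUM_CHUNKS)

/* Variant 1: append every chunk with memcpy(). */
static void fill_with_memcpy(char *dst, const char *chunk)
{
    size_t idx = 0;
    for (unsigned c = 0; c < NUM_CHUNKS; c++) {
        memcpy(&dst[idx], chunk, CHUNK_LEN);
        idx += CHUNK_LEN;
    }
}

/* Variant 2: append every chunk with a single-char loop. */
static void fill_with_byte_loop(char *dst, const char *chunk)
{
    size_t idx = 0;
    for (unsigned c = 0; c < NUM_CHUNKS; c++) {
        for (size_t i = 0; i < CHUNK_LEN; i++)
            dst[idx++] = chunk[i];
    }
}

/* Run one variant `runs` times and return the elapsed CPU time in seconds. */
static double time_runs(void (*fill)(char *, const char *),
                        char *dst, const char *chunk, unsigned runs)
{
    clock_t start = clock();
    for (unsigned r = 0; r < runs; r++)
        fill(dst, chunk);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_runs() once with fill_with_memcpy and once with fill_with_byte_loop, passing a dst buffer of at least TOTAL_LEN bytes and a chunk buffer of at least CHUNK_LEN bytes, then compare the two results.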