Is memcpy() faster than a loop with single chars?

People might ask: "is using memcpy() faster than looping over single chars to copy?" (especially when the data comes in chunks of many bytes)
The answer is:
No, memcpy() can add "penalties" (a performance decrease).
memcpy() is only faster if:

  • BOTH buffers, src AND dst, are 4-byte aligned

  • if so, memcpy() can copy a 32-bit word at a time (inside its own loop over the length)

  • if even one buffer is NOT 32-bit word aligned, memcpy() creates overhead just to figure that out, and it ends up doing a single-char copy loop anyway

  • whether the length is a multiple of 32-bit words does not really matter - the alignment does (see the sketch after this list)
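
To illustrate, here is a simplified sketch of what a generic (C-implemented) memcpy() typically does internally - not the actual newlib source, just the idea:

#include <stddef.h>
#include <stdint.h>

void *my_memcpy(void *dst, const void *src, size_t n) {
    char *d = (char *)dst;
    const char *s = (const char *)src;

    // Word copy is only possible when BOTH pointers are 4-byte aligned.
    if (((uintptr_t)d & 3) == 0 && ((uintptr_t)s & 3) == 0) {
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        while (n >= 4) {            // fast path: copy 32-bit words
            *dw++ = *sw++;
            n -= 4;
        }
        d = (char *)dw;
        s = (const char *)sw;
    }
    while (n--)                     // tail bytes - or everything, if unaligned
        *d++ = *s++;
    return dst;
}

The alignment test and the branch run on every call - that is the "hidden" overhead discussed below.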

Thinking about how to increase speed on the Portenta H7 Web Client (using MbedClient.*), I have seen that memcpy() does not help to speed things up.
Why?

Let's assume you want to handle network receiver traffic. As you know, it comes in bursts (fragments of TCP packets); even if you loop until you have received everything, you want to handle the chunks via memcpy() - because you know the data arrives in chunks (not single chars).

Example code:

char rxBuffer[8092];

void ReceiveFromNetwork(void) {
    char chunkData[256];
    int len;
    int idx = 0;
    while (idx < EXPECTED_LENGTH) {             // EXPECTED_LENGTH: total bytes expected
        len = sock.recv(chunkData, 256);        // read the next chunk (error handling omitted)
        memcpy(&rxBuffer[idx], chunkData, len); // append it to the big buffer
        idx += len;
    }
    ...
}

So, as you see, sock.recv() reads N bytes. You copy the chunk to the other buffer. And you keep going until EXPECTED_LENGTH has been received.

Let's assume sock.recv() always returns an available length (len) of 127 bytes. So you try to memcpy() 127 bytes each time.

The code is fine, it will work. But the performance is bad!
Why?

  • assume that rxBuffer[] and chunkData[] are not 32-bit word aligned: after all, they are of type char[], so the compiler has no need to align them

  • memcpy() cannot make use of 32-bit word transfers: at least one of the two buffers is not 32-bit aligned

  • even if you force these buffers to be aligned - e.g. by declaring them as uint32_t or using the (GNU) keyword __attribute__ ((aligned (4))), e.g.:

char rxBuffer[8092] __attribute__ ((aligned (4)));

it does not help: after the 1st chunk of 127 bytes, the next buffer start for memcpy() is not aligned anymore (the chunks are not multiples of 4 bytes).
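
You can make this drift visible with a tiny check (a host-side sketch, not Portenta-specific code):

#include <stdint.h>
#include <stdio.h>

char rxBuffer[8092] __attribute__ ((aligned (4)));

int main(void) {
    // The start address is forced to 4-byte alignment...
    printf("%u\n", (unsigned)((uintptr_t)&rxBuffer[0] & 3));   // prints 0
    // ...but after one 127-byte chunk the next memcpy() destination is not:
    printf("%u\n", (unsigned)((uintptr_t)&rxBuffer[127] & 3)); // prints 3
    return 0;
}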

memcpy() has to check on every call whether it can use 32-bit transfers (or how to bring the pointers to alignment). If they are not aligned, memcpy() falls back to a loop over single chars anyway - the same loop you could have written yourself.

And because you call memcpy() in a loop here, this check is repeated in every single iteration.

This makes it slower than using a loop with single-char handling, like:

char rxBuffer[8092];

void ReceiveFromNetwork(void) {
    char chunkData[256];
    int len;
    int idx = 0;
    while (idx < EXPECTED_LENGTH) {
        int i;
        len = sock.recv(chunkData, 256);    // read the next chunk (error handling omitted)
        for (i = 0; i < len; i++)           // append it byte by byte
            rxBuffer[idx++] = chunkData[i];
    }
    ...
}

In the end, the single-char copy loop will be faster than using memcpy(), even though it looks "strange" to handle everything as single chars (bytes).
Even though the code would work fine with memcpy(), you add the (hidden) overhead of memcpy() on every loop iteration, just to let it figure out how to do the copy.

My summary is:
Do NOT use memcpy() if:

  • the src OR dst buffer is not aligned

  • you use memcpy() in a loop but the buffer addresses become unaligned during the iteration (because len is not a multiple of the word size)
    If you know that the buffers are not aligned to a type larger than char (e.g. not to int or uint32_t), or the chunk length is not a multiple of that type's size (e.g. the number of bytes to copy is not a multiple of 4),
    use your own char-based loop. Avoid the overhead inside memcpy() that only figures out it has to copy single bytes anyway (see the sketch below).
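
As a sketch of this rule in code form (the fast-path condition is an assumption worth verifying on your own target):

#include <stdint.h>
#include <string.h>

// Copy helper for a hot loop: fall back to a plain byte loop whenever the
// pointers or the length would defeat memcpy()'s word-copy path anyway.
static inline void copy_chunk(char *dst, const char *src, size_t n) {
    if ((((uintptr_t)dst | (uintptr_t)src | n) & 3) == 0)
        memcpy(dst, src, n);   // everything 4-byte aligned: memcpy() can fly
    else
        while (n--)            // unaligned: skip memcpy()'s internal checks
            *dst++ = *src++;
}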

Can you, by using timestamps before and after various memcpy() operations, add some more detail to your description?

I wasn't aware that AVRs had alignment issues.


I looked a bit more at what you are doing. You appear to have a loop which is effectively blocking during the receipt of a network packet. It is likely that only a small number of new characters are available on each iteration. I don't know the Portenta architecture, but maybe you can use a DMA channel instead to offload the filling of the memory buffer.

This is very CPU specific.

The SAM and SAMD processors always use a byte copy loop (because they're size-optimized), and of course AVR doesn't have any alignment issues, nor any advantage to 32-bit moves (other than loop unrolling).

Now, I THOUGHT that Cortex-M4 and Cortex-M7 actually support unaligned memory accesses in hardware (unless it's specifically turned off), and it looks like the ARM-optimized memcpy function should take that into account.
But that doesn't seem to be what the Portenta builds are using. It looks like they might have the "default" (implemented in C) memcpy, which DOES go to a lot of trouble to align source/dest. Some Portenta images I built seem to contain both a memcpy() function AND a thumb2_memcpy() function that assumes the hardware is friendlier... I don't know where it's getting those implementations (newlib? Some ST library? Sigh.)

32-bit alignment only matters on 32-bit processors. It is all about how memory is fetched. IN GENERAL, on a 16-bit processor memory is better off 16-bit aligned, on a 64-bit processor 64-bit aligned, etc. When memory is fetched across alignment boundaries, it requires more than one memory fetch to read the native word size. In some cases a cache might alleviate some of the overhead.
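
For example, the portable way to read a 32-bit value from a possibly unaligned address is a 4-byte memcpy(), which the compiler turns into a single load only where the hardware allows it (a generic C illustration):

#include <stdint.h>
#include <string.h>

uint32_t read_u32(const uint8_t *p) {
    uint32_t v;
    // *(const uint32_t *)p would fault (or cost extra bus fetches) on CPUs
    // without hardware unaligned access; memcpy() keeps the read well-defined.
    memcpy(&v, p, sizeof v);
    return v;
}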

My post is related to the Portenta H7, a CM7 (or CM4), i.e. a 32-bit ARM architecture.
Non-aligned access can be enabled on ARM, but I prefer not to do that (it slows things down anyway).
Sure, it is CPU dependent.
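
For reference, on Cortex-M the unaligned-access trap is controlled via the CCR register. A CMSIS-style sketch to inspect it (assuming the CMSIS core header is available in the build):

#include <stdio.h>
#include "core_cm7.h"   // CMSIS core header (assumed present on Portenta H7)

void showUnalignedConfig(void) {
    // UNALIGN_TRP = 1 means unaligned accesses fault (trap);
    // 0 (the usual default on CM4/CM7) means the hardware handles them.
    if (SCB->CCR & SCB_CCR_UNALIGN_TRP_Msk)
        printf("unaligned accesses trap\n");
    else
        printf("unaligned accesses are handled in hardware\n");
}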

Measuring the time: I did not measure directly around memcpy(). Instead, I measured the performance (throughput) of the entire ETH web server:
with the original code (doing char-wise operations) vs. using memcpy() (my changes) - for a 64 KB TCP request plus a 64 KB response from the server (2x 64 KB as a round-trip), it changes from 0.18 seconds to 0.45 seconds.
So, in my case, with 2x 64 KB network traffic handled in chunks of 4 KB - with memcpy() it is 2.5 times slower.
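
If one wanted to timestamp the copies directly, a sketch like this could be used (assuming the Arduino micros() timer; absolute numbers depend heavily on compiler optimization, which may even remove the loops if the buffers are never read afterwards):

char srcBuf[4096];
char dstBuf[4096] __attribute__ ((aligned (4)));

void benchCopies(void) {
    unsigned long t0 = micros();
    for (int r = 0; r < 1000; r++)
        memcpy(dstBuf + 1, srcBuf + 1, 127);   // deliberately unaligned, odd length
    unsigned long t1 = micros();
    for (int r = 0; r < 1000; r++)
        for (int i = 0; i < 127; i++)          // plain byte loop, same job
            dstBuf[i + 1] = srcBuf[i + 1];
    unsigned long t2 = micros();
    Serial.print("memcpy:    "); Serial.println(t1 - t0);
    Serial.print("byte loop: "); Serial.println(t2 - t1);
}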

I was not talking about issues (you can always use memcpy() without taking care of alignment). My concern was a real-time system and performance-optimized coding.

One reason for the slightly slower performance is: if the function called in a loop (here memcpy) does some checks inside (e.g. whether pointers are NULL, whether addresses are aligned, etc.), these checks are done in every iteration. Even if you make sure never to call the function with invalid parameters, and the buffers are aligned and the length is a multiple of 4 bytes, this checking code still runs on every loop pass (even though it is effectively never "used/needed").
I am really fighting for tiny millisecond speed improvements because I need this board in order to test other microchips and their performance.

Why are you copying the chunks at all? It seems completely unnecessary to me, and the job can be done much more simply:

#define RX_BUFSIZE 8092
char rxBuffer[RX_BUFSIZE];
int bufpos = 0;

void ReceiveFromNetwork(void)
{
  while (bufpos < RX_BUFSIZE) {
    int len = sock.recv(&rxBuffer[bufpos], RX_BUFSIZE - bufpos);
    if (len <= 0)   // error or connection closed - stop
      break;
    bufpos += len;
  }
  // Do magic with data
}


Very reasonable and valuable question: true, why copy all the time?
sock.recv() already does the copy into my final buffer.

There are anyway quite a few buffer copies involved in WebClient:

  • it copies from the sock receiver buffer to rxBuffer (a RingBuffer)

  • it copies from rxBuffer (the RingBuffer) to my receiver buffer
    Actually, I do this already:

len = client.read(&b[idx], HTTPD_RXCHUNK_LEN);
idx += len;

I tried to get rid of this RingBuffer stuff (coming from the WebClient.* wrapper). But using memcpy() there did not make it faster. And that was the result of my thinking about why memcpy() instead of this

for (int i = 0; i < ret; i++) {
    rxBuffer.store_char(Sdata[i]);
}

does not make it faster (it is slower instead).
