USB performance (again) - Slow native USB PC to Due

I have read most if not all the threads on USB performance. I saw one posted claiming 892KBps which I would kill for right now. My application shoots 32 byte packets over the bus and needs to do so reliably and bidirectionally at least 2800 times per second which should be very achievable.

Granted, it was on an older kernel and with Arduino 1.5.2, but I did have this working at one point at right around 9000 32-byte packets per second bidirectionally. Now I'm using Arduino 1.6.4 on Linux kernel 3.8.13.

When I set this up with the Arduino sending data as fast as it can pump it out and the client receiving as fast as it can (with VMIN = 32), I get what I'm looking for: around 18,300 pps. However, when I do it the other way around, I get a measly 3700 pps.

I am dying to know why this is the case. I think I've reduced my program down to the bare minimum:

Here is the sketch for the arduino:

void setup() {
  // SerialUSB.begin is a no op anyway so why bother?
}
void loop() {
  uint8_t buf[64];
  while (SerialUSB.available())
    SerialUSB.readBytes(buf, sizeof(buf));
}

Here is the client program. It can be run with "iotest <path_to_serial_device_file>".

Makefile:

iotest:	iotest.c
	gcc -m64 -o iotest iotest.c -I.

debug:	iotest.c
	gcc -g -m64 -o iotest iotest.c -I.

clean:
	rm iotest

iotest.c:

#include <stdio.h>
#include <termios.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>   /* write(), close() */

#define timersub(e, b) ((double)(e.tv_sec - b.tv_sec) + (double)(e.tv_nsec - b.tv_nsec)/(double)1000000000)

int main(int argc, char *argv[]) {
  
  struct termios oldtio, newtio;
  struct timespec begin, end;
  unsigned char buf[64];
  int fd = 0, bytes = 0, rc = 0, totalBytes = 0, errors = 0;

  if (argc < 2) {
    printf("\nUsage: iotest <path_to_serial_device>\n\n");
    return(0);
  }

  memset(buf, 0xAA, sizeof(buf));
  memset(&begin, 0, sizeof(struct timespec));
  memset(&end, 0, sizeof(struct timespec));

  errno = 0;
  fd = open(argv[1], O_RDWR | O_NOCTTY);
  if (fd < 0)  {   /* open() returns -1 on failure; 0 is a valid descriptor */
    if (errno != ENOENT)
      printf("ERROR: Unable to open port %s, err(%d): %s\n",
	     argv[1], errno, strerror(errno));
    else
      printf("ERROR(%d): %s could not be found: %s\n", errno, argv[1], strerror(errno));
    return(1);
  }

  printf("Successfully opened %s, setting raw mode...\n", argv[1]);  
  tcgetattr(fd, &oldtio);
  memcpy(&newtio, &oldtio, sizeof(struct termios));
  cfmakeraw(&newtio);
  newtio.c_cc[VMIN] = 0; //sizeof(struct DataFromDue);
  newtio.c_cc[VTIME] = 1;
  newtio.c_cflag &= ~CSTOPB;
  newtio.c_cflag &= ~CRTSCTS;
  newtio.c_cflag |= (CREAD | CLOCAL);
  tcsetattr(fd, TCSANOW, &newtio);
  tcflush(fd, TCIOFLUSH);

  printf("Sending %zu byte chunks as fast as possible for 60 seconds...\n", sizeof(buf));
  clock_gettime(CLOCK_MONOTONIC, &begin);
  int i = 0;
  do {
    i++;
    errno = 0;
    if ((rc = write(fd, buf+bytes, sizeof(buf) - bytes)) >= 0) {
      bytes += rc;
    } else {
      errors++;
    }
    if (bytes == sizeof(buf)) { totalBytes += bytes; bytes = 0; }
    clock_gettime(CLOCK_MONOTONIC, &end);
  } while (timersub(end, begin) < 60.0f);

  printf("Sent %d bytes in %f seconds at a rate of %f bytes per second with %d errors\n",
	 totalBytes, timersub(end, begin), (double)totalBytes / timersub(end, begin), errors);

  tcsetattr(fd, TCSAFLUSH, &oldtio);
  close(fd);
  return 0;
}

I'd sure love to know why writing to the Due is so trashy now compared to reading from the Due. This has put a major kink in the project.

Thanks in advance for any insight you can provide!

BTW, I've also done this:

cat /dev/random > data.bin
(then ctrl-c after a while)
stty -F /dev/ttyACM1
time cat data.bin > /dev/ttyACM1

If I take the size of the file and divide it by the time reported, I still only get about 122 KBps.

I'm not going to pretend to understand, but shouldn't

while (SerialUSB.available())

be

while (SerialUSB.available() >= sizeof(buf))

Because if it isn't, then when 1 byte is received, it tries to read 64 bytes, 63 of which are garbage or 0.

Fair point, but I too would like to chip in on this long-running theme of DUE serial performance.

It was not until recently that I discovered an INCREDIBLE difference between MEGA and DUE serial performance, and I am at a loss what to do to improve the situation!! If there are any simple ideas I can try that will not make my sketch hardware-specific, I would be interested to hear them.

To set the scene, I have a Winbond W25Q128FV SPI flash IC fitted to a DUE CTE TFT shield, so obviously I can fit it directly to the DUE, but I have to manually wire up the SPI lines to the MEGA.

I was/am new to SPI usage and learnt along the way how to use it, eventually coming up with my own PROCESSING/ARDUINO combination to program the flash IC. I got it working on the MEGA first, using SBI / CBI port manipulation for the CS, but this was not transferable to the DUE without modification. So I changed the SBI / CBI to digitalWrite commands, which gave me a sketch that can run unmodified on a DUE or a MEGA.

I don't claim the code is pretty, but I do claim it is solid and reliable!!

The interaction between Processing and Arduino is a single-byte transfer from PC to Arduino until the buffer is full. The buffer size is 256 bytes (1 flash page). When a page is successfully written to flash, a '.' char is sent back to the PC; if the page write fails, an 'X' is returned to the PC. There is no other data integrity check beyond the correct number of bytes (my file size is always a multiple of 256, which could be restrictive, but that's not the issue here).

The file I have been using as my benchmark for these tests is 0x2fd000 (3133440) bytes or 0x2fd0 pages.

Now, this surprised the hell out of me, since the DUE is typically faster than a MEGA at EVERYTHING...... I even tried different SPI_CLOCK_DIV rates to try to get to the bottom of where the bottleneck was, but clearly the bottleneck is not the SPI clock divider!!

Here is my complete set of results for all the tests I ran.

MEGA 115200baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :4:26 min:sec

Bytes per second :11779
////////////////////////////////////////////
MEGA 230400baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :2:21 min:sec

Bytes per second :22222
////////////////////////////////////////////
MEGA 460800baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :1:45 min:sec

Bytes per second :29842
////////////////////////////////////////////
MEGA 921600baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :1:44 min:sec

Bytes per second :30129
////////////////////////////////////////////
MEGA 115200baud SPI_CLOCK_DIV4

Done!

Size ok :002FD000

Time taken :4:27 min:sec

Bytes per second :11735
///////////////////////////////////////////
MEGA 230400baud SPI_CLOCK_DIV4

Done!

Size ok :002FD000

Time taken :2:21 min:sec

Bytes per second :22222
///////////////////////////////////////////
MEGA 460800baud SPI_CLOCK_DIV4

Done!

Size ok :002FD000

Time taken :1:44 min:sec

Bytes per second :30129
///////////////////////////////////////////
MEGA 921600baud SPI_CLOCK_DIV4

Done!

Size ok :002FD000

Time taken :1:39 min:sec

Bytes per second :31650
///////////////////////////////////////////
MEGA 921600baud SPI_CLOCK_DIV8

Done!

Size ok :002FD000

Time taken :1:50 min:sec

Bytes per second :28485
///////////////////////////////////////////
DUE (Programming port) 460800baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :52:13 min:sec

Bytes per second :1000
///////////////////////////////////////////
DUE (Native port) 460800baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :6:34 min:sec

Bytes per second :7952
///////////////////////////////////////////

Just to reiterate, the Arduino sketch and PROCESSING sketch remained unchanged; the only change made was the Arduino board. Does ANYBODY have anything to say about why the DUE is so crap?? I appreciate that better use of buffering could be made, blah blah, but then why does the MEGA make a half-decent job of it using the SAME code??

Regards,

Graham

Flash_Uploader.ino (5.11 KB)

HMMMMMM, now I feel stupid!!

So when you have tried everything you can think of, try using THE SAME USB port the MEGA was using; that way EVERYTHING is the same except the board.....

Funniest thing!! The MEGA was plugged into a USB 3.0 port, the DUE was plugged into a USB 2.0 port!

Having rectified that minor oversight we now get this.....

///////////////////////////////////////////
DUE (Native port) 460800baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :1:9 min:sec

Bytes per second :45412
///////////////////////////////////////////
DUE (Native port) 921600baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :1:9 min:sec

Bytes per second :45412

///////////////////////////////////////////
DUE (Programming port) 460800baud SPI_CLOCK_DIV2

Done!

Size ok :002FD000

Time taken :1:41 min:sec

Bytes per second :31024
//////////////////////////////////////////

Which also shows that the baud setting has no effect on the native port (460800 and 921600 give identical results)..... I am just happy I managed to resolve the problem!

Regards,

Graham

Great to get the feedback and confirmation that a USB3.0 port makes a difference.

Just in case anybody wants the Processing AND Arduino portions of the Flash_Uploader and Flash_Verifier I have been playing with, they are attached. I know it is a little off topic, but 2 people have so far downloaded only half of the story, so I thought there might be some interest. Works with Winbond 25Q08, 25Q16, 25Q32, 25Q64 and 25Q128.

Regards,

Graham

WinBond Flash Uploader & Verifier.zip (8.52 KB)

Ps991:
I'm not going to pretend to understand, but shouldn't

while (SerialUSB.available())

be

while (SerialUSB.available() >= sizeof(buf))

Because if it isn't, then when 1 byte is received, it tries to read 64 bytes, 63 of which are garbage or 0.

Yeah, fair point indeed. I only wish it made a significant difference. I've tried a number of different approaches to address this problem and none have produced significant results. It's just as slow even with your recommended changes. I sure wish I knew how I had this working so well and so fast previously. I was pulling well over 8000 pps, and that was in both directions. I know they say to avoid chatty request/response-style protocols that pass short messages, but in this application, unfortunately, I have little choice.

I have tried higher baud rates than 115200 on the Due Programming port and they produce errors. Are you sure those baud rates even work? Have you verified that the correct information was being received?

Yeah, of course. As implied by the code and the subject of the post, I'm using the native USB port of the Due, so I'm not sure that even matters. Regardless, I do have a version where I'm running a CRC check, which passes without issue. I'm seeing no data corruption; it's just slow as molasses in Alaska in the dead of winter.

Ps991:
I have tried higher baud rates than 115200 on the Due Programming port and they produce errors. Are you sure those baud rates even work? Have you verified that the correct information was being received?

Yes and yes, or at least on my board anything up to 460800 works fine (at 921600, Processing reports the board as not found).

After my initial disillusionment with poor PC->DUE speeds, I have been making progress by optimising my transfer method. Transferring 256 bytes at a time gives 45912 bytes per second on the 'programming port' at 460800 baud (pretty close to ideal!). Also of note: go much faster than this and the speed of the SPI flash starts to have an impact.

I tried using the native port with the same technique, but something is not happy. When I have more time to pursue this I will feed back, but as a tentative, non-error-checked transfer, I hit 110,000 bytes per second on the native port for my 3MB benchmark file (reducing to approx 80,000 when the SPI flash is actively being programmed).

In a real life scenario, has anybody else got much higher than these speeds? And I don't mean loop back tests and such like.

Regards,

Graham