Serial TX buffer not working?

Hello,
I am using an Arduino DUE and Arduino IDE 1.5.6 (have also tried nightly build).
I am using a serial link to talk to another device (Igep board) at a baud rate of 500000. It works fine.

While trying to optimize my timings I discovered that Serial.write() blocks every time I send more than 2 bytes. When I send my vector of 64 bytes, it blocks for almost 1.2ms, which is about 20nanos per byte and matches the time one byte takes on the wire at 500000 baud.
Between sending vectors there are other tasks that take more than 2ms.
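
The per-byte figure can be checked with a little arithmetic. The sketch below is only an illustration and assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit), which is the default for Serial.begin() when no config argument is given. The function name txTimeMicros is made up for this example. The frame built in the test sketch is 2 + 15 * 4 + 1 = 63 bytes long.

```cpp
#include <cstdint>

// Bits on the wire per payload byte, assuming 8N1 framing:
// 1 start bit + 8 data bits + 1 stop bit.
constexpr uint32_t kBitsPerByte8N1 = 10;

// Hypothetical helper: time in microseconds needed to shift out
// `byteCount` bytes at `baud` bits per second.
constexpr uint64_t txTimeMicros(uint32_t byteCount, uint32_t baud,
                                uint32_t bitsPerByte = kBitsPerByte8N1) {
  return (static_cast<uint64_t>(byteCount) * bitsPerByte * 1000000ULL) / baud;
}

// One byte at 500000 baud occupies the line for 20 microseconds.
static_assert(txTimeMicros(1, 500000) == 20, "one byte");
// The 63-byte frame from the sketch needs 1260 microseconds (about 1.26 ms).
static_assert(txTimeMicros(63, 500000) == 1260, "full frame");
```

So a blocking time of roughly 1.2ms is what you get if the call waits until nearly the whole frame has left the UART.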

I expected Serial.write() not to block, but to put the vector in a TX buffer and let the UART send it at the configured baud rate. Then I read that the UART buffer is only 1 byte (is that the case on the DUE?), so maybe I was wrong.

I wrote a new sketch to test this. What I found is that when I send 1 or 2 bytes, it takes only 2nanoseconds per byte, but when I send more than 2 bytes, it takes almost 20nanoseconds per byte.

My question: Is this normal behaviour? Is there a way to buffer TX data and let the program keep running while the data is sent?

My sketch:

void setup() {
  // put your setup code here, to run once:
  SerialUSB.begin(115200);
  Serial3.begin(500000);
}

void loop() {
  // put your main code here, to run repeatedly:
  const int datos_enviar = 15;
  byte tramae[datos_enviar * 4 + 1 + 2];

  float vector_enviar[datos_enviar] = {1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.10, 11.11, 12.12, 13.13, 14.14, 15.15};

  tramae[0] = '#';
  tramae[1] = 'a';

  memcpy(&tramae[2], vector_enviar, datos_enviar * 4);

  calccrc(tramae, datos_enviar);

  unsigned long a = micros(); // micros() returns an unsigned long
  //  Serial3.write(tramae, datos_enviar * 4 + 1 + 2); //This blocks for 1.2ms
  for (int i = 0; i < 2; i++) {
    Serial3.write(tramae[i]);// When sending 1 or 2 bytes it takes 2 nanoseconds per byte. Extra bytes take 20nanoseconds.
  }
  unsigned long b = micros() - a; // elapsed time spent inside Serial3.write()
  SerialUSB.println(b);
  delay(200);
}

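// Simple additive checksum over the payload (skips the 2 header bytes)
// and stores the result in the last byte of the frame.
void calccrc(byte* tramavar, int datos ) {
  byte crc = 0;
  for (int i = 2; i < (datos * 4 + 2); i++) crc = crc + tramavar[i];
  tramavar[datos * 4 + 2] = crc;
}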

Just a small correction: Every time I said nanos or nanoseconds I meant micros or microseconds.

Does anyone have any hints about my question? Thank you.

After some days of investigating and reading datasheets, I reached the following conclusions:

  1. A non-blocking Serial TX function could be implemented, but the Arduino libraries do not support it right now.

  2. So, with the current Arduino libraries, it is normal that the processor blocks when you try to send more than 2 bytes, until the data has been sent.

  3. This non-blocking function could be implemented using a TX buffer and an interrupt that fires every time the 1-byte UART buffer is empty, to load the next byte from the TX buffer.
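
The idea in point 3 can be sketched roughly as follows. This is not the Arduino core's code and it does not touch any real registers: TxRing, FakeUart, onTxEmpty and sendAsync are hypothetical names, and FakeUart is only a stand-in so the logic can be run on a PC. On real hardware, onTxEmpty would be the body of the UART's "transmitter ready" interrupt handler, and the exact register and interrupt names would have to come from the chip's datasheet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical software TX queue (ring buffer). One slot is kept free so
// that head == tail always means "empty".
struct TxRing {
  static constexpr size_t kSize = 128;  // must be a power of two
  volatile size_t head = 0;             // next slot to fill (main code)
  volatile size_t tail = 0;             // next slot to send (ISR)
  uint8_t data[kSize];

  // Queue up to `len` bytes without waiting; returns how many were accepted.
  size_t write(const uint8_t* src, size_t len) {
    size_t n = 0;
    while (n < len) {
      size_t next = (head + 1) & (kSize - 1);
      if (next == tail) break;  // full: stop instead of blocking
      data[head] = src[n++];
      head = next;
    }
    return n;
  }

  // Fetch the next pending byte; returns false when the queue is empty.
  bool pop(uint8_t& out) {
    if (head == tail) return false;
    out = data[tail];
    tail = (tail + 1) & (kSize - 1);
    return true;
  }
};

// Stand-in for the UART hardware so the sketch can run off-target.
struct FakeUart {
  uint8_t sent[256];          // bytes "shifted out" so far
  size_t count = 0;
  bool txIrqEnabled = false;  // models the TX-empty interrupt enable bit
};

// What the interrupt handler would do each time the TX register empties.
void onTxEmpty(TxRing& ring, FakeUart& uart) {
  uint8_t b;
  if (ring.pop(b)) {
    uart.sent[uart.count++] = b;  // real code: write b to the TX register
  } else {
    uart.txIrqEnabled = false;    // nothing left: stop further interrupts
  }
}

// Non-blocking send: queue the bytes, then make sure the interrupt is on.
size_t sendAsync(TxRing& ring, FakeUart& uart, const uint8_t* src, size_t len) {
  size_t n = ring.write(src, len);
  if (n > 0) uart.txIrqEnabled = true;
  return n;
}

// Usage example: queue a 63-byte frame, then let the "interrupts" drain it.
void demo() {
  TxRing ring;
  FakeUart uart;
  uint8_t frame[63];
  for (size_t i = 0; i < sizeof(frame); i++) frame[i] = static_cast<uint8_t>(i);

  size_t queued = sendAsync(ring, uart, frame, sizeof(frame));  // returns at once
  assert(queued == sizeof(frame));
  (void)queued;

  // On hardware these calls would happen one per byte time, in the background.
  while (uart.txIrqEnabled) onTxEmpty(ring, uart);
  assert(uart.count == sizeof(frame));
}
```

With this layout the main program only pays for copying the bytes into the queue. The trade-off is that write() accepts fewer bytes than requested when the queue is full, so the caller has to check the return value.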

My questions now:
A) Could somebody tell me if these conclusions are right, or am I missing something?

B) How can I set up an interrupt that fires every time a register is empty or every time a register's bit is set? Is this even possible?

Thank you very much.