How about DMA Serial?

I'm thinking about how to hook a UART up to the DMAC. Right now the Serial implementation in the core is blocking. That's right, blocking.

/* -------------------------------------------------------------------------- */
size_t UART::write(uint8_t c) {
/* -------------------------------------------------------------------------- */  
  if(init_ok) {
    tx_done = false;
    R_SCI_UART_Write(&uart_ctrl, &c, 1);
    while (!tx_done) {}     ///   <<<<<<-----  EMPTY WHILE LOOP!
    return 1;
  }
  else {
    return 0;
  }
}
/* -------------------------------------------------------------------------- */
size_t UART::write_raw(uint8_t* c, size_t len) {
/* -------------------------------------------------------------------------- */
  size_t i = 0;
  while (i < len) {
    uart_ctrl.p_reg->TDR = *(c+i);
    while (uart_ctrl.p_reg->SSR_b.TEND == 0) {}  ///<<<<<  BLOCKS FOR THE WHOLE TRANSMISSION
    i++;
  }
  return len;
}

There's an interrupt running:

/* -------------------------------------------------------------------------- */
void UART::WrapperCallback(uart_callback_args_t *p_args) {
/* -------------------------------------------------------------------------- */  

  uint32_t channel = p_args->channel;
  
  UART *uart_ptr = UART::g_uarts[channel];

  if(uart_ptr == nullptr) {
    return;
  }
  

  switch (p_args->event){
      case UART_EVENT_ERR_PARITY:
      case UART_EVENT_ERR_FRAMING:
      case UART_EVENT_ERR_OVERFLOW:
      case UART_EVENT_RX_COMPLETE: // This is called when all the "expected" data are received
      {
          break;
      }
      case UART_EVENT_TX_COMPLETE:
      case UART_EVENT_TX_DATA_EMPTY:
      {
        //uint8_t to_enqueue = uart_ptr->txBuffer.available() < uart_ptr->uart_ctrl.fifo_depth ? uart_ptr->txBuffer.available() : uart_ptr->uart_ctrl.fifo_depth;
        //while (to_enqueue) {
        uart_ptr->tx_done = true;
        break;
      }
      case UART_EVENT_RX_CHAR:
      {
        if (uart_ptr->rxBuffer.availableForStore()) {
          uart_ptr->rxBuffer.store_char(p_args->data);
        }
        break;
      }
      case UART_EVENT_BREAK_DETECT:
      {
          break;
      }
  }

}

but all it does is set the tx_done value to true. And write waits on that.

Really???

I see comments there where someone thought about using a FIFO. Serial already has a ring buffer that could just as easily be used. That's how it's done in the AVR world.

But we've also got a DMA controller. And that doesn't even require interrupts. The block mode is perfect for a ring buffer. Whenever you write to the buffer, you also set the counter register in the DMAC to send out that many bytes. When it gets to the end of the ring buffer in block mode it will automatically go back to the start.

It can be triggered off of the uart finishing the transmission. I'll have to experiment with how to start the initial transfer. But that's just code.

I know how to use the DMAC directly. I submitted a library for it. But the FSP... Should I use it? Should I even bother to figure it out? I think I want to just because it's something I don't fully understand how to use.

It will take two DMA channels out of the 4 we have available. That's a pretty high resource cost, but there's also the DTC. Since it handles more complicated chain transfers and such, I think I'll use the DMA channels.

I think I'll make copies of Serial.h and Serial.cpp and make a branch in my development core where Serial1 is turned off. Then I'll try to pick up those pins with new code. It shouldn't be that hard, but those are famous last words I know.

Any thoughts on getting started?

2 Likes

Oh it might be easier than I though if I stick with the FSP code.

The uart_cfg_t struct already takes an instance of transfer_instance_t which I can get directly from the FSP code to set up the DMAC.

typedef struct st_uart_cfg
{
    /* UART generic configuration */
    uint8_t          channel;          ///< Select a channel corresponding to the channel number of the hardware.
    uart_data_bits_t data_bits;        ///< Data bit length (8 or 7 or 9)
    uart_parity_t    parity;           ///< Parity type (none or odd or even)
    uart_stop_bits_t stop_bits;        ///< Stop bit length (1 or 2)
    uint8_t          rxi_ipl;          ///< Receive interrupt priority
    IRQn_Type        rxi_irq;          ///< Receive interrupt IRQ number
    uint8_t          txi_ipl;          ///< Transmit interrupt priority
    IRQn_Type        txi_irq;          ///< Transmit interrupt IRQ number
    uint8_t          tei_ipl;          ///< Transmit end interrupt priority
    IRQn_Type        tei_irq;          ///< Transmit end interrupt IRQ number
    uint8_t          eri_ipl;          ///< Error interrupt priority
    IRQn_Type        eri_irq;          ///< Error interrupt IRQ number

    /** Optional transfer instance used to receive multiple bytes without interrupts.  Set to NULL if unused.
     * If NULL, the number of bytes allowed in the read API is limited to one byte at a time. */
    transfer_instance_t const * p_transfer_rx;

    /** Optional transfer instance used to send multiple bytes without interrupts.  Set to NULL if unused.
     * If NULL, the number of bytes allowed in the write APIs is limited to one byte at a time. */
    transfer_instance_t const * p_transfer_tx;

    /* Configuration for UART Event processing */
    void (* p_callback)(uart_callback_args_t * p_args); ///< Pointer to callback function
    void const * p_context;                             ///< User defined context passed into callback function

    /* Pointer to UART peripheral specific configuration */
    void const * p_extend;                              ///< UART hardware dependent configuration
} uart_cfg_t;

Could this be plug and play? Or is it just using the same code to set up a buffer?

Oh I wish I could see the implementation...

I don't think that DMA really is helpful. First all chars to transmit have to be copied into the ring buffer, so that these chars can no more be modified in user code. That's blocking code again if the buffer gets full. DMA has to be enabled with every new char in the buffer, causing further overhead.

A ring buffer (Stream class) with Rx and Tx seems sufficient to me.

When you care about blocking... Serial.availableForWrite() is your friend.

Is it safe to modify when a transfer is already running?

As I have mentioned before I have implementation that uses their ring buffer, although does not use DMA, it does use the TxFIFO to reduce the number of interrupts... There is a very stale PR on it that has not been touched.

Not sure if it still works. For the most part not playing with these boards, these days.

It is if it is implemented. Current release, not implemented as part of the Serial code, so defaults back to Stream (or was it Print) class that returns 0.

If only that worked on the R4.

I don't think so. I think you have to stop the DMAC and restart it.

Even if you use the FIFO, the Serial is waiting for tx_done which is set in the interrupt which now won't be called until 15 bit times after the FIFO is emptied. So if you send 16 bytes, they all get loaded into the FIFO and you could go on with your life but instead you have to sit and wait blocking until all 16 bytes go out and the interrupt fires to set tx_done to true.

That already happens in the Serial class. When you call Serial.print your string gets copied into a ring buffer. I can point the DMAC directly at that buffer. If you were intending on modifying those characters then you should have done that before you called print.

Yeah, but the current Serial is blocking even when it's not full.

No, just once per write. There is a repeat mode and I can use the extended repeat area to make it loop over the ring buffer.

1 Like

Yes and NO.

The code is/was setup to use the buffer already built into the software code. That is the Serial object has:
arduino::SafeRingBufferN<SERIAL_BUFFER_SIZE> rxBuffer;
arduino::SafeRingBufferN<SERIAL_BUFFER_SIZE> txBuffer;

Whereby default the size is 512 bytes...

Where the FIFO comes into usage is for reducing how many interrupts have to be serviced for what you output. Without fifo (typically then double buffered), your interrupt handler is called for each character. Should mention only some of the UARTS have FIFO... If my memory is correct the Serial object (on Wifi) does not use one with FIFO so only double buffered.

The only time you need to or should wait is when your code calls something like Serial1.flush(). Not sure if that part was fully working or not, as at least before, the underlying code would return once the stuff was placed into the double buffer.

Note: I also ran into bugs in their implementation (unless it has been changed) that at times your output loses data. Why?
When the code output multiple bytes it returned as soon as it placed that last byte into the output register. If you immediately tried to start another output, it would place the first byte into the output register without checking to see if it was empty or not and if not empty, that last byte was lost.

Or if you output more stuff than can fit into the ring buffer. Note: I did have availableForWrite implemented...

Sorry, I have now sort of given up on these boards. Maybe at some later date, I might pull them out of storage...

That's not my reading of the current code.

If you call Serial.print("Something"); I don't see where it loads anything into the SafeRingBuffer.

/* -------------------------------------------------------------------------- */
size_t UART::write(uint8_t c) {
/* -------------------------------------------------------------------------- */  
  if(init_ok) {
    tx_done = false;
    R_SCI_UART_Write(&uart_ctrl, &c, 1);
    while (!tx_done) {}
    return 1;
  }
  else {
    return 0;
  }
}

It calls R_SCI_UART_Write directly. Unfortunately we can't see the implementation of that. But I don't see where it uses the ring buffer at all.

I see where the rxBUffer is used:

case UART_EVENT_RX_CHAR:
      {
        if (uart_ptr->rxBuffer.availableForStore()) {
          uart_ptr->rxBuffer.store_char(p_args->data);
        }
        break;
      }

But I can't find anywhere here that txBuffer is referenced except for a code fragment in a comment.

case UART_EVENT_TX_DATA_EMPTY:
      {
        //uint8_t to_enqueue = uart_ptr->txBuffer.available() < uart_ptr->uart_ctrl.fifo_depth ? uart_ptr->txBuffer.available() : uart_ptr->uart_ctrl.fifo_depth;
        //while (to_enqueue) {
        uart_ptr->tx_done = true;
        break;
      }

But in any case, you have to wait for tx_done to go true to get out of the empty while loop in write and that doesn't happen until the interrupt is called. And that interrupt is only called when TDR is empty or in our case 15 bit times after the FIFO is empty.

Unless they changed it recently, that interrupt is called in two cases:
the case you mentioned and the case of double buffer, when the write register is not full...

At least before it was:

      case UART_EVENT_TX_COMPLETE:
      case UART_EVENT_TX_DATA_EMPTY:
      {
        //uint8_t to_enqueue = uart_ptr->txBuffer.available() < uart_ptr->uart_ctrl.fifo_depth ? uart_ptr->txBuffer.available() : uart_ptr->uart_ctrl.fifo_depth;
        //while (to_enqueue) {
        uart_ptr->tx_done = true;
        break;
      }

Exactly they are not using it... They waste all of the memory for it, and don't use it...

I suppose I should sync up my dev branch and see if they have changed anything... But for now
I punted...

Good luck

Side note: That is one of the reasons I converted the main WIFI board I was using, to use native USB for the main Serial object. The UART that communicates between the ESP32 and it is only double buffer, and if they have not changed things, writing one byte to Serial while in a normal ISR hung the board... I tried to at least get them to up the priority of the Serial interrupt (lower the value) and in most cases it would not hang the processor (one line change...)

My problem is that when I was trying to serve a webpage the calls to client.print which goes out on Serial2 were blocking for hundreds of milliseconds. I was able to trim it some by combining print statements so there would be less AT commands, but it still hangs there for a long time.

I should be satisfied to just fix it like the AVR version where the interrupt moves things from the ring buffer to the transmit buffer. But DMA would eliminate that interrupt entirely and I want to play with it.

Crap. The FSP doesn't support extended repeat area. There is no code anywhere to set the SARA or DARA bits. And since I can't see the implementation of any of the functions, I don't know if I can work around that or not.

Everytime I touch this HAL I think it will be the last time. I'm a hardware hacker. HALs just get in my way.

This one?

I wonder what's holding it back.

If you do anything serious with this platform and have problems, and wish to understand what the underlying stuff is doing, I highly suggest that you install a second install of the platform with the developers setup... Here is a link to a thread where @ptillisch give very good instructions on how to do this.

Again it has been awhile since I updated it. But a nice thing is that you can then browse through more of the sources to find out what some of the underlying code is doing.

That is probably the right one. My guess is that it is not a high enough priority, to assign someone to fully go through this. There was a reasonable number of changes that were needed, so would take a bit of work to double check everything.

At one point there were suggestions, that my leaving a debug code in it, should not be there, so I removed that. Also earlier I replaced their Safe Ring buffer with my own slightly modified version. I did not like that their implementation requires all interrupts to be disabled on each add and remove from the queue. Why? Because they share a common count variable, instead of simply having only one function touch the head pointer and another touch the tail pointer and compute count only when needed... So I tore that back out.

1 Like

So here's some model code for DMAC on the Tx buffer. I'm not actually writing to the TDR register here, just transferring to a memory slot in an array to simulate that. I'm using a whole array because I've learned that it's really easy to get off the memory location you think you're writing and this way I'll be able to see it.

The source array represents the txBuffer. I've got the extended repeat set up so that the source address will increment until it hits the end of that buffer and then roll back over.

I hooked the DMAC to the SCI2_TXI even signal, as that's what will be used to trigger moving data to TDR in the real code. But I'm using Serial1.print() to actually trigger writes to produce the signal at the moment.

The code does a couple of transfers in setup and then will read any String from serial and perform the same number of transfers as there are characters in the String. This simulates moving data one at a time to the TDR register.

The things I wanted to test were whether the extended repeat area would work as I expected in normal mode and it does.

I also wanted to test to see if I could add more transfers while a transfer was happening. And I can.

#define PRINT_REG(BUCKET, REGISTER) \
  do { \
    uint32_t t = BUCKET->REGISTER; \
    Serial.print(#REGISTER " : 0x"); \
    Serial.println(t, HEX); \
  } while (false)

#include "r_dmac.h"
#include "IRQManager.h"

alignas(1024) uint8_t source[64];
uint8_t dest[64];

int eventLinkIndex;

dmac_instance_ctrl_t control;

transfer_info_t info;
dmac_extended_cfg_t extCfg;

transfer_cfg_t config;

void printDest(int i) {

  PRINT_REG(R_DMAC0, DMCRA);
  Serial.print("dest");
  Serial.print(i);
  Serial.print("  [");
  for (int i = 0; i < 64; i++) {
    Serial.print(dest[i]);
    Serial.print(" ");
  }
  Serial.print("]");
  Serial.println();
}
void printSource() {
  Serial.print("source  [");
  for (int i = 0; i < 64; i++) {
    Serial.print(source[i]);
    Serial.print(" ");
  }
  Serial.print("]");
  Serial.println();
}

void setup() {
  Serial.begin(115200);
  Serial1.begin(115200);
  while (!Serial && (millis() < 1000))
    ;
  Serial.println("\n\n** R4_FSPDMA_Test.ino ** \n\n");
  // some initial data in source.
  for (int i = 0; i < 64; i++) {
    source[i] = i + 1;
  }


  initDMAC();
  PRINT_REG(R_DMAC0, DMCRA);
  PRINT_REG(R_DMAC0, DMCRB);

  printSource();

  printDest(1);
  delay(500);
  printDest(2);
  Serial1.print("12");  // does it work at all
  delay(500);
  printDest(3);
  Serial1.print("34567");  // what happens if I fire more times than expected
  delay(500);
  printDest(4);
  delay(500);
}

void loop() {
  static int printCount = 1;

  // read in a String from Serial and set up for that number of transfers
  if (Serial.available()) {
    delay(50);
    String readStr = Serial.readString();
    int count = readStr.length();
    Serial.print("Printing ");
    Serial.println(count);
    printDest(printCount++);
    printSome(count);
    Serial1.print(readStr);
    delayMicroseconds(100);
    printSome(5);               // Can I add more in middle of a transfer?
    
    delay(500);
    printDest(printCount++);
    Serial1.print("12345");    
    printDest(printCount++);
  }
}

void printSome(int howMany) {
  R_DMAC0->DMCNT = 0;                         // stop DMA
  int remaining = R_DMAC0->DMCRA & 0xFFFF;    // get the number of transfers still waiting
  int total = remaining + howMany;            
  R_DMAC0->DMCRA = total;                     // put the new total in for number of transfers
  R_ICU->DELSR[0] = ELC_EVENT_SCI2_TXI;       // Set the trigger source in case it isn't already
  R_DMAC0->DMCNT = 1;                         // restart DMA
}

void initDMAC() {
  info.transfer_settings_word_b.dest_addr_mode = TRANSFER_ADDR_MODE_FIXED;
  info.transfer_settings_word_b.repeat_area = TRANSFER_REPEAT_AREA_SOURCE;
  info.transfer_settings_word_b.irq = TRANSFER_IRQ_END;
  info.transfer_settings_word_b.src_addr_mode = TRANSFER_ADDR_MODE_INCREMENTED;
  info.transfer_settings_word_b.size = TRANSFER_SIZE_1_BYTE;
  info.transfer_settings_word_b.mode = TRANSFER_MODE_NORMAL;

  info.p_src = source;
  info.p_dest = dest + 5;

  info.length = 5;   // For testing.  This should be zero in production

  extCfg.channel = 0;
  extCfg.irq = FSP_INVALID_VECTOR;
  extCfg.ipl = 12;
  extCfg.offset = 0;
  extCfg.src_buffer_size = 64;
  extCfg.activation_source = ELC_EVENT_SCI2_TXI;
  extCfg.p_callback = other;

  config.p_info = &info;
  config.p_extend = &extCfg;


  if ((IRQManager::getInstance().addDMA(extCfg, dmaCallback))) {
    eventLinkIndex = extCfg.irq;
  } else {
    Serial.println("Failed IRQ set.");
  }
  R_DMAC_Open(&control, &config);
  R_DMAC0->DMAMD |= (6ul << 8);  // Set SARA for extended repeat on source of 64 bytes. 
  R_DMAC0->DMCRA = 5;            // Set initial number of transfers for testing.  Zero in production. 
  R_DMAC_Enable(&control);       // Enable the DMAC
}

void dmaCallback() {
  R_ICU->IELSR[eventLinkIndex] &= ~(R_ICU_IELSR_IR_Msk);   // Clear interrupt in interrupt controller
  R_ICU->DELSR[0] = 0;                                     // Turn off trigger signal (will hang program if not)
}
void other(dmac_callback_args_t* args) {
  (void)args;
}

I don't think the Rx side would benefit from DMA. On the receive side you still want to be notified whenever a new byte arrives, so you still want a receive interrupt to fire and move things into the receive buffer the normal way. On the Tx side it eliminates all but the last interrupt. But on the receive side we want an interrupt every time.

I think the only place DMA makes sense on the receive side would be if you were receiving regularly sized packets using flow control and were only concerned with the case where the entire packet has arrived.

As a side note, in that case where you're receiving regularly sized packets of data, the datasheet has a neat example of using repeat areas with offsets greater than 1 unit so that you can split the incoming data into a series of arrays. The first transmission gets broken up into the first slot of each array, the second transmission goes into the second slot of each array. You just have to make sure that the arrays are placed together in memory contiguously.

I'll keep that in mind for some future projects. Receiving regular packets of data is a pretty common situation.

I have that and I guess I forgot about the Extras folder. I don't think we can recompile it easily, but at least we can see it.

r_sci_uart_c is already set up to connect to DTC but I'm not clear yet on how it works. I think it's more setup for block transfers.

Can somebody test the following code snippet to clarify the (non)blocking handling of Tx:

Serial.println(Serial.availableForWrite());
Serial.println("Some not overly long message");
Serial.println(Serial.availableForWrite());

Provided that availableForWrite() is implmenented correctly, this should show whether Tx uses a FIFO buffer.

availableForWrite just returns 0 on the R4 because it doesn't use the buffer.

There's no reason for a test like that when we can read the source code. There's no need to empirically determine that the buffer isn't used when we can clearly look at the source code and see that it isn't.

uint32_t start;
uint32_t finish;

void setup() {
  Serial.begin(9600);
  Serial1.begin(9600);

  while(!Serial);

  Serial.println("\n\n** Speed Test **\n\n");

  start = micros();
  Serial1.print("Twas brillig and the slithy toves did gyre and gimbel in the wabe. ");
  finish = micros();

  Serial.print("Total Time : ");
  Serial.println(finish - start);
}

void loop() {}

On the R4 takes almost 69ms.

uint32_t start;
uint32_t finish;

void setup() {
  Serial.begin(9600);
  // Serial1.begin(9600);

  // while(!Serial);

  Serial.println("\n\n** Speed Test **\n\n");

  start = micros();
  Serial.print("Twas brillig and the slithy toves did gyre and gimbel in the wabe. ");
  finish = micros();

  Serial.print("Total Time : ");
  Serial.println(finish - start);
}

void loop() {}

On the R3 takes 25ms. Less than half the time. Because the R3 only needs to copy the data into the buffer and then it goes out automatically driven by interrupts. But on the R4 you're stopping and blocking for the whole transmission.