How to make Reliable Serial communication

Hey there,

It seems my Arduino sometimes stops receiving Serial data after a while (hours). My guess is a buffer overflow, or a leak in my code (writing outside the array index). So, got to improve my code, and eventually build a "reset" if we didn't receive anything for X seconds.

However, in order to create a buffer overflow or my code to crash, the Arduino must receive garbage of some sort I think. Having code to catch garbage is a must, but how to prevent garbage in the first place?

  • Can a Serial connection (2 wires) be made garbage-free at all?
  • What are typical factors to generate problems?
  • Any software/hardware/wiring/general hints to prevent problems?

In other words, I want reliable communication, but I'm not sure if I'm using the right tools...

Some more info

  • The hanging connection is Serial3 on a Arduino Mega, 9600 baudrate
  • Serial(1) is used for a XBee
  • Sender device sends messages every second. Equal length, but some content may differ.
  • The serial wire is just a normal 9-pins connector & cable on one side. 2 wires from this cable are connected to a screw terminal on the Mega. No "heavy duty" cables are running nearby, although the Arduino is in a metal box, right next to a large packaging device. The Arduino itself is on a piece of plastic, on a DIN rail.

Cheers,
Rick

RickN:
I want reliable communication

You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors.

PeterH:

RickN:
I want reliable communication

You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors.

oo, theres a thing.
Personally I like to have a well planned serial system that wont screw up after time, then apply error correction as belt and braces.

To put things into perspective, I've got a mega linked with a uno using the uarts over an RS485 link, before applying any sort of error correction I had the mega requesting data from the uno every 500ms with no error in the logged data over 3 days, the link was running at 57600bps down 20 meters of cat5e.

Now, in the case of the OP, if there is any sort of distance involved using the uarts at 5v levels isn't ideal, and could very well be where the errors are coming from, so a more robust link would be helpful, such as the above mentioned RS485. Nick Gammon has a rather good intro here - Gammon Forum : Electronics : Microprocessors : RS485 communications

use start characters and stop characters and occasionally a checksum (e.g. CRC ) to detect begin and end of messages and the correctness of the communication.

if you really want to dig into it - Serial Programming/Complete Wikibook - Wikibooks, open books for an open world - is a extended reader

achieve it by detecting and recovering from errors.
I added a time-out system that closed and re-open the connection when receiving nothing, or "junk". Detecting what is junk or not is a bit more complicated, but thats something I could work on.

I'm just wondering if the combination Arduino & a shortrange 2-wired RS232 cable is reliable in general. Of course, everything can be disrupted. And we're not talking about a critical application here. Nevertheless, it would suck if the controller has to recover 1 or more times a day, as it would miss a few things.

Some are suggesting to use a PLC for this program instead, but I don't think that would change eventual serial problems. Unless I made dumb bugs myself in the Arduino code of course.

Distance
The distance is minimal, the cable is 15 cm or something.

Checksum / start / stop characters
There should be indeed. Although the formatting options are limited, as the sender (a camera system) isn't fully programmable. Inserting start/stop characters is possible, although that's still doesn't make things waterproof of course. But at least its an easy start.

Thanks,
Rick

My guess is a buffer overflow, or a leak in my code (writing outside the array index).

I agree, could be a code issue, or somebody starting to vacuum the carpet near by, or... lot of undisclosed possible issues.

RickN:
However, in order to create a buffer overflow or my code to crash, the Arduino must receive garbage of some sort I think. Having code to catch garbage is a must, but how to prevent garbage in the first place?

First, it would be nice to see this code. Second, serial cannot be guaranteed to be perfect, as the others said. You code to allow for garbage. It's hard to believe that for data sent every second, it can't quickly recover. Say, for example, each "packet" starts with an asterisk. When you receive an asterisk you "reset" to "start of packet" regardless of whether the previous one completed.

Are you using the String class?

Code down below. The "Serial" part at least. Showing the entire code would be quite big. I have to notice the Arduino only receives data, it doesn't write. Normally a message appears each second. But... since the message is triggered on the other system by sensors, it may happen it could be triggered twice. A rare occasion, but not impossible.

I haven't made the suggested start/stop character changes yet, it's just "as it is" right now. It wouldn't surprise me if there is something stupid inside hehe. One thing it does wrong already, is trying to convert the received data directly to work variables. This could cause an error when for example "atoi( )" is called. But I don't think that would disrupt the communication though...

Resetting or skipping junk should be possible. In fact I changed the code earlier today so I could pull out the cable, re-insert, and just continue. Nevertheless, being disrupted means the machine misses a few camera-checks. As said, its not critical (we're not producing Rocket components). If it would happen a few times a week, well, I can live with that. If it happens once per few hours or so, it becomes a different story. I know you guys can't just tell me the chance on faults, as it really depends on the environment, wiring, the other device, and so on. But maybe some general hints? Also, I got to convince the Arduino isn't worse than a PLC for this particular case... unless the hardware (or Arduino Serial library) actually could be a problem source...

Not using a string library btw, just char buffers.

const int SERIAL_TIMEOUT = 15; // = 3 seconds time out
      int offline        = SERIAL_TIMEOUT;
      void  pollNVisionCommunication()
      {
          /**** 1. TIME OUT CHECK ****/
          // This function is called each 200 msec. Normally we receive something each 500 to 1400 msec.
          // If nothing was received for 3 seconds, either the machine conveyor belt has been
          // paused, or the communication is disrupted
          if (progState != PROGSTATE_PAUSED)
          {
              if (--offline < 0)
              {
                  offline   = 0;
                  progState = PROGSTATE_IDLE;  // Lost connection, or still waiting
                  // If the belt is still turning, it means we're not receiving (valid) messages
                  // Ifso, reset the connection
                  if ( !machineStopped ) 
                  {
                      Serial.println( "ERROR, Serial3 Reset" );  // Send diagnostic message via XBee
                      Serial3.end();
                      delay(25);
                      Serial3.begin( 9600 );
                      Serial3.flush();
                      offline = SERIAL_TIMEOUT; // 3 seconds              
                  } // Reset Serial 3                  
              } // 
          } // belt not paused
          
          
          /**** 2. TIME OUT CHECK ****/
          // After each trigger (belt must move), the camera sends this string:
          //      0      1          2              3        4       5           6          7          8          9
          //    jobname,photoDelay,blisterDiscard,rowCount,charge,row1Result,row2Result,row3Result,row4Result,row5Result
          // for example
          //    "ProductX,250,23,2,4731ab-cx,1,1,1,0,0"
          //
          //  * The last 3 bits (row3/4/5) are optional and may not be included          
          char  txBuffer[MAXJOBNAMELEN+1];
          int   comma  = 0;
          int   indx   = 0;
          bool  jobOrChargeChanged = false;
          bool  validMessage  = false;

          for (int i=0; i<=MAXJOBNAMELEN; i++) txBuffer[i] = 0; // Clear tx buffer
          
          int avail = Serial3.available();
          while ( avail > 0 && indx < MAXJOBNAMELEN && 
                  comma < 10   // If the message has more than 10 comma's, its garbage for sure
                 ) 
          {
              char c = Serial3.read();

              if ( c == 13 || c == 10 || c == 0 ) break; /* terminator, exit receive loop */  else
              if ( c == ',') 
              {   // In case of numeric data, convert here
                  // Otherwise, just reset the "index", and increment "comma" to fill the next field
                  if (comma==1) 
                  {
                      cPhotoDelay = max( 0, min( 2000, atoi( txBuffer) )); // <-- ! goes wrong possibly
                      for (int i=0; i<=MAXJOBNAMELEN; i++) txBuffer[i] = 0; // clean
                  } else
                  if (comma==2) 
                  {   
                      cDiscardBlst = max( 1, min(  atoi( txBuffer ), MAX_ROWBLISTERS-2 )); <-- ! goes wrong possibly
                  }
                  ++comma;
                  indx = 0;
              } else
                  switch (comma)
                  {
                    case 0  : { // Read chars for "JobName"
                                  if (cJobName[indx] != c) jobOrChargeChanged = true;
                                  cJobName[indx++] = c; 
                                  break;
                              }    
                    case 1  : { txBuffer[indx++] = c; break; } // Read chars for PhotoDelay
                    case 2  : { txBuffer[indx++] = c; break; } // Read chars for Blister discard count
                    case 3  : { cRowCount = max( 1, min( 5, byte(c) - 48)); break; // single char rowcount } 
                    case 4  : { 
                                  if (cCharge[indx] != c) jobOrChargeChanged = true;
                                  cCharge[indx++] = c;
                                  offline         = SERIAL_TIMEOUT;      // Received something with at least 4 commas, thus a valid message probably
                                  validMessage    = true;
                                  break;
                              }
                    case 5  : { cRow1Fault = (c != 48); break; }
                    case 6  : { cRow2Fault = (c != 48); break; }
                    case 7  : { cRow3Fault = (c != 48); break; }
                    case 8  : { cRow4Fault = (c != 48); break; }
                    case 9  : { cRow5Fault = (c != 48); break; }
                  }
              // Next char
              avail = Serial3.available();
              if ( avail==0 ) // wait a short moment, maybe the host is still sending
              {
                  delay(150);
                  avail = Serial3.available();
              }
          }
          
          /**** 3. FINALIZE ****/
          // Change job in case the received jobName was different than the previous one
          Serial3.flush();          
          if (jobOrChargeChanged && validMessage)
          {
              resetJob( );
          }
          
      } // pollNVisionCommunication

Don't like this sort of stuff:

              avail = Serial3.available();
              if ( avail==0 ) // wait a short moment, maybe the host is still sending
              {
                  delay(150);
                  avail = Serial3.available();
              }

Delay "just in case".

Read up on reading without delays. And state machines:

I really think you should split functions appropriately. At present you are doing everything in on huge loop with all sorts of embedded tests, it's too hard to follow and debug and therefore prone to unreliability. It's clearer to read the packet, test the packet, then parse the packet. They are three separate jobs.

If you add the start/stop/check bytes as has been suggested this will be easy to implement.

do (until_the_cows_come_home) {
wait for SOP
read until EOP
check data
parse data
action data
}


Rob

                      cPhotoDelay = max( 0, min( 2000, atoi( txBuffer) )); // <-- ! goes wrong possibly

Where does txBuffer ever get NULL terminated? The atoi() function expects a NULL terminated array of characters, NOT just an array of characters.

The txBuffer gets "nulled" just before the read loop starts, and in the "comma ==1" part, so that shouldn't be a problem. A problem that could occur is that the txBuffer doesn't contain numeric data. I guess it will make the atoi function throw an exception, but that is not handled (does Arduino support try - catch?).

True that this function has become too large to read properly. Splitting it up is put on the list. Nevertheless, I don't see big bugs that could disrupt a serial connection.

Well, thanks for the tips so far!

I guess it will make the atoi function throw an exception, but that is not handled (does Arduino support try - catch?).

No, and the atoi() function doesn't throw exceptions, either. It is a C function; only C++ supports try/catch.

The function will simply scan the string until it encounters a non-digt, and return the number value that preceded the non-digit. So, "123Joe" is valid input, and 123 will be returned. "Joe" is valid input, too, and 0 will be returned.