Go Down

Topic: How to make Reliable Serial communication (Read 2424 times) previous topic - next topic

RickN

Hey there,

It seems my Arduino sometimes stops receiving Serial data after a while (hours). My guess is a buffer overflow, or a leak in my code (writing outside the array index). So, got to improve my code, and eventually build a "reset" if we didn't receive anything for X seconds.

However, in order to create a buffer overflow or my code to crash, the Arduino must receive garbage of some sort I think. Having code to catch garbage is a must, but how to prevent garbage in the first place?
- Can a Serial connection (2 wires) be made garbage-free at all?
- What are typical factors to generate problems?
- Any software/hardware/wiring/general hints to prevent problems?

In other words, I want reliable communication, but I'm not sure if I'm using the right tools...

Some more info
* The hanging connection is Serial3 on a Arduino Mega, 9600 baudrate
* Serial(1) is used for a XBee
* Sender device sends messages every second. Equal length, but some content may differ.
* The serial wire is just a normal 9-pins connector & cable on one side. 2 wires from this cable are connected to a screw terminal on the Mega. No "heavy duty" cables are running nearby, although the Arduino is in a metal box, right next to a large packaging device. The Arduino itself is on a piece of plastic, on a DIN rail.

Cheers,
Rick
Game hobby project: Tower22

PeterH


I want reliable communication


You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors.
I only provide help via the forum - please do not contact me for private consultancy.

beige



I want reliable communication


You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors.

oo, theres a thing.
Personally I like to have a well planned serial system that wont screw up after time, then apply error correction as belt and braces.


To put things into perspective, I've got a mega linked with a uno using the uarts over an RS485 link, before applying any sort of error correction I had the mega requesting data from the uno every 500ms with no error in the logged data over 3 days, the link was running at 57600bps down 20 meters of cat5e.

Now, in the case of the OP, if there is any sort of distance involved using the uarts at 5v levels isn't ideal, and could very well be where the errors are coming from, so a more robust link would be helpful, such as the above mentioned RS485. Nick Gammon has a rather good intro here - http://www.gammon.com.au/forum/?id=11428

robtillaart


use start characters and stop characters and occasionally a checksum (e.g. CRC ) to detect begin and end of messages and the correctness of the communication.

if you really want to dig into it - http://en.wikibooks.org/wiki/Serial_Programming/Complete_Wikibook - is a extended reader
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

RickN

>> achieve it by detecting and recovering from errors.
I added a time-out system that closed and re-open the connection when receiving nothing, or "junk". Detecting what is junk or not is a bit more complicated, but thats something I could work on.

I'm just wondering if the combination Arduino & a shortrange 2-wired RS232 cable is reliable in general. Of course, everything can be disrupted. And we're not talking about a critical application here. Nevertheless, it would suck if the controller has to recover 1 or more times a day, as it would miss a few things.

Some are suggesting to use a PLC for this program instead, but I don't think that would change eventual serial problems. Unless I made dumb bugs myself in the Arduino code of course.


>> Distance
The distance is minimal, the cable is 15 cm or something.

>> Checksum / start / stop characters
There should be indeed. Although the formatting options are limited, as the sender (a camera system) isn't fully programmable. Inserting start/stop characters is possible, although that's still doesn't make things waterproof of course. But at least its an easy start.

Thanks,
Rick

Game hobby project: Tower22

zoomkat

Quote
My guess is a buffer overflow, or a leak in my code (writing outside the array index).


I agree, could be a code issue, or somebody starting to vacuum the carpet near by, or... lot of undisclosed possible issues.
Google forum search: Use Google Advanced Search and use Http://forum.arduino.cc/index in the "site or domain:" box.

Nick Gammon


However, in order to create a buffer overflow or my code to crash, the Arduino must receive garbage of some sort I think. Having code to catch garbage is a must, but how to prevent garbage in the first place?


First, it would be nice to see this code. Second, serial cannot be guaranteed to be perfect, as the others said. You code to allow for garbage. It's hard to believe that for data sent every second, it can't quickly recover. Say, for example, each "packet" starts with an asterisk. When you receive an asterisk you "reset" to "start of packet" regardless of whether the previous one completed.

Are you using the String class?
Please post technical questions on the forum, not by personal message. Thanks!

More info:
http://www.gammon.com.au/electronics

RickN

#7
Nov 09, 2012, 09:41 pm Last Edit: Nov 09, 2012, 09:48 pm by RickN Reason: 1
Code down below. The "Serial" part at least. Showing the entire code would be quite big. I have to notice the Arduino only receives data, it doesn't write. Normally a message appears each second. But... since the message is triggered on the other system by sensors, it may happen it could be triggered twice. A rare occasion, but not impossible.

I haven't made the suggested start/stop character changes yet, it's just "as it is" right now. It wouldn't surprise me if there is something stupid inside hehe. One thing it does wrong already, is trying to convert the received data directly to work variables. This could cause an error when for example "atoi( <invalid number string> )" is called. But I don't think that would disrupt the communication though...

Resetting or skipping junk should be possible. In fact I changed the code earlier today so I could pull out the cable, re-insert, and just continue. Nevertheless, being disrupted means the machine misses a few camera-checks. As said, its not critical (we're not producing Rocket components). If it would happen a few times a week, well, I can live with that. If it happens once per few hours or so, it becomes a different story. I know you guys can't just tell me the chance on faults, as it really depends on the environment, wiring, the other device, and so on. But maybe some general hints? Also, I got to convince the Arduino isn't worse than a PLC for this particular case... unless the hardware (or Arduino Serial library) actually could be a problem source...


Not using a string library btw, just char buffers.
Code: [Select]

const int SERIAL_TIMEOUT = 15; // = 3 seconds time out
     int offline        = SERIAL_TIMEOUT;
     void  pollNVisionCommunication()
     {
         /**** 1. TIME OUT CHECK ****/
         // This function is called each 200 msec. Normally we receive something each 500 to 1400 msec.
         // If nothing was received for 3 seconds, either the machine conveyor belt has been
         // paused, or the communication is disrupted
         if (progState != PROGSTATE_PAUSED)
         {
             if (--offline < 0)
             {
                 offline   = 0;
                 progState = PROGSTATE_IDLE;  // Lost connection, or still waiting
                 // If the belt is still turning, it means we're not receiving (valid) messages
                 // Ifso, reset the connection
                 if ( !machineStopped )
                 {
                     Serial.println( "ERROR, Serial3 Reset" );  // Send diagnostic message via XBee
                     Serial3.end();
                     delay(25);
                     Serial3.begin( 9600 );
                     Serial3.flush();
                     offline = SERIAL_TIMEOUT; // 3 seconds              
                 } // Reset Serial 3                  
             } //
         } // belt not paused
         
         
         /**** 2. TIME OUT CHECK ****/
         // After each trigger (belt must move), the camera sends this string:
         //      0      1          2              3        4       5           6          7          8          9
         //    jobname,photoDelay,blisterDiscard,rowCount,charge,row1Result,row2Result,row3Result,row4Result,row5Result
         // for example
         //    "ProductX,250,23,2,4731ab-cx,1,1,1,0,0"
         //
         //  * The last 3 bits (row3/4/5) are optional and may not be included          
         char  txBuffer[MAXJOBNAMELEN+1];
         int   comma  = 0;
         int   indx   = 0;
         bool  jobOrChargeChanged = false;
         bool  validMessage  = false;

         for (int i=0; i<=MAXJOBNAMELEN; i++) txBuffer[i] = 0; // Clear tx buffer
         
         int avail = Serial3.available();
         while ( avail > 0 && indx < MAXJOBNAMELEN &&
                 comma < 10   // If the message has more than 10 comma's, its garbage for sure
                )
         {
             char c = Serial3.read();

             if ( c == 13 || c == 10 || c == 0 ) break; /* terminator, exit receive loop */  else
             if ( c == ',')
             {   // In case of numeric data, convert here
                 // Otherwise, just reset the "index", and increment "comma" to fill the next field
                 if (comma==1)
                 {
                     cPhotoDelay = max( 0, min( 2000, atoi( txBuffer) )); // <-- ! goes wrong possibly
                     for (int i=0; i<=MAXJOBNAMELEN; i++) txBuffer[i] = 0; // clean
                 } else
                 if (comma==2)
                 {  
                     cDiscardBlst = max( 1, min(  atoi( txBuffer ), MAX_ROWBLISTERS-2 )); <-- ! goes wrong possibly
                 }
                 ++comma;
                 indx = 0;
             } else
                 switch (comma)
                 {
                   case 0  : { // Read chars for "JobName"
                                 if (cJobName[indx] != c) jobOrChargeChanged = true;
                                 cJobName[indx++] = c;
                                 break;
                             }    
                   case 1  : { txBuffer[indx++] = c; break; } // Read chars for PhotoDelay
                   case 2  : { txBuffer[indx++] = c; break; } // Read chars for Blister discard count
                   case 3  : { cRowCount = max( 1, min( 5, byte(c) - 48)); break; // single char rowcount }
                   case 4  : {
                                 if (cCharge[indx] != c) jobOrChargeChanged = true;
                                 cCharge[indx++] = c;
                                 offline         = SERIAL_TIMEOUT;      // Received something with at least 4 commas, thus a valid message probably
                                 validMessage    = true;
                                 break;
                             }
                   case 5  : { cRow1Fault = (c != 48); break; }
                   case 6  : { cRow2Fault = (c != 48); break; }
                   case 7  : { cRow3Fault = (c != 48); break; }
                   case 8  : { cRow4Fault = (c != 48); break; }
                   case 9  : { cRow5Fault = (c != 48); break; }
                 }
             // Next char
             avail = Serial3.available();
             if ( avail==0 ) // wait a short moment, maybe the host is still sending
             {
                 delay(150);
                 avail = Serial3.available();
             }
         }
         
         /**** 3. FINALIZE ****/
         // Change job in case the received jobName was different than the previous one
         Serial3.flush();          
         if (jobOrChargeChanged && validMessage)
         {
             resetJob( );
         }
         
     } // pollNVisionCommunication
Game hobby project: Tower22

Nick Gammon

Don't like this sort of stuff:

Code: [Select]

              avail = Serial3.available();
              if ( avail==0 ) // wait a short moment, maybe the host is still sending
              {
                  delay(150);
                  avail = Serial3.available();
              }


Delay "just in case".

Read up on reading without delays. And state machines:

http://www.gammon.com.au/serial
Please post technical questions on the forum, not by personal message. Thanks!

More info:
http://www.gammon.com.au/electronics

Graynomad

#9
Nov 10, 2012, 02:48 am Last Edit: Nov 10, 2012, 03:06 am by Graynomad Reason: 1
I really think you should split functions appropriately. At present you are doing everything in on huge loop with all sorts of embedded tests, it's too hard to follow and debug and therefore prone to unreliability. It's clearer to read the packet, test the packet, then parse the packet. They are three separate jobs.

If you add the start/stop/check bytes as has been suggested this will be easy to implement.

do (until_the_cows_come_home) {
    wait for SOP
    read until EOP
    check data
    parse data
    action data
}

______
Rob
Rob Gray aka the GRAYnomad www.robgray.com

PaulS

Code: [Select]
                      cPhotoDelay = max( 0, min( 2000, atoi( txBuffer) )); // <-- ! goes wrong possibly
Where does txBuffer ever get NULL terminated? The atoi() function expects a NULL terminated array of characters, NOT just an array of characters.

RickN

The txBuffer gets "nulled" just before the read loop starts, and in the "comma ==1" part, so that shouldn't be a problem. A problem that could occur is that the txBuffer doesn't contain numeric data. I guess it will make the atoi function throw an exception, but that is not handled (does Arduino support try - catch?).

True that this function has become too large to read properly. Splitting it up is put on the list. Nevertheless, I don't see big bugs that could disrupt a serial connection.

Well, thanks for the tips so far!
Game hobby project: Tower22

PaulS

Quote
I guess it will make the atoi function throw an exception, but that is not handled (does Arduino support try - catch?).

No, and the atoi() function doesn't throw exceptions, either. It is a C function; only C++ supports try/catch.

The function will simply scan the string until it encounters a non-digt, and return the number value that preceded the non-digit. So, "123Joe" is valid input, and 123 will be returned. "Joe" is valid input, too, and 0 will be returned.


Go Up