Arduino Randomly Freezing During Long Jobs

Okay, despite this code having run for three and a half days without issue (when only connected to the USB, anyway =()...

What would be the best approach? Add a conditional to check each byte is good before adding it to the array?

Add a check that 'index' has not reached the end of the array. If it has, you'll need to decide what to do, e.g. wait until you receive the terminating character, then ignore that data and start again by resetting index to zero.

aibonewt:
Thanks for the suggestions!

I'd suspected that there wasn't enough smoothing going on, so last night I started a long test with a 470µF cap across the 5V/GND on the Arduino, which seemed to work better (i.e. it stalled after nearly 15.5hrs).

@dc42 - I have no String class in my code, it works fine without my driver board attached and it's well within my available RAM. Thanks for asking.

I understand the comments about motor transients, but I'm still a bit worried that a test I did earlier with only the driver board's 5V/GND connected to the Arduino still managed to fail after a few hours, despite the complete absence of switching loads and motor supplies. I just can't understand how that would happen, especially as there was no other power going to the board.

I will add caps to all my ICs today, and see if that helps...

Firstly, those decoupling capacitors are always needed with every logic chip. Here the ULNs aren't logic chips, they are in effect just amplifiers, so lack of decoupling can't glitch them into the wrong state, but it's important to have them to reduce the switching noise on the supplies, as they are switching large currents.

Secondly, we know nothing about the cabling between boards. Logic signals should not be routed over long wires without taking appropriate steps to prevent crosstalk, reflections, etc. So how is everything connected?

Another thing that might occasionally be needed in very noisy environments is adding an extra pull-up resistor on the reset pin (1k or so).

And also, have you checked the supply voltages are correct when it's operating? It's always worth checking, just in case there's an unexpected issue there.

Okay,

I've tweaked the code so index won't overrun, and put a 10k pull-up on the reset. Still not working for more than a few hours.

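const unsigned int maxIn = 50;
char inStr[maxIn];              // More than enough for longest command
unsigned int index = 0;         // next free slot in inStr (declared elsewhere in the sketch; type assumed)
boolean stringComplete = false; // set when a full command has arrived (declared elsewhere; type assumed)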

void serialEvent(){
  while (Serial.available())
  {
    char inByte = Serial.read ();
    switch (inByte)
    {
    case '\n':            // end of command
      inStr [index] = 0;  // terminating null
      stringComplete = true;
      index = 0;  
      break;
    case '\r':            // discard CR
      break;
    default:
      if (index < (maxIn - 1))
        inStr [index++] = inByte;
      break;
    }
  }
}

The thing that still bugs me is that the machine only ever stops AFTER carrying out a command. It gets it from VB, parses it, sends it back for logging, carries it out completely and only THEN freezes. I would've thought that if there were a problem with the wiring it'd just stall at any time, rather than neatly between commands.

I've wasted weeks blaming the serial, and the code, the OS (XPsp3) even the USB drivers themselves. Perhaps I need to put a loop of commands into the Arduino code and run it without using the serial and see if that manages to stay running?

aibonewt:
The thing that still bugs me is that the machine only ever stops AFTER carrying out a command. It gets it from VB, parses it, sends it back for logging, carries it out completely and only THEN freezes. I would've thought that if there were a problem with the wiring it'd just stall at any time, rather than neatly between commands.

Does the PC receive the "OK" after the last command that it carries out completely?

Ah, now THIS is why I went investigating the Serial port in the first place...

No it doesn't!

The Arduino receives the instruction, parses it, bounces it back to the VB's logfile, carries out the command and then just sits there. What happens internally at this point is still a mystery as the port is locked at this point, and the USB has to be physically un/replugged to restore it. On the other thread I even went to the extent of filming the Tx/Rx lights, but this proved fruitless as the Tx will not flash anyway if the port is locked by the laptop. I quit this line of investigation when I found that the code would run continuously without any attachment to the Arduino other than the USB. Here's my other thread, just for a different angle on the problem.

http://arduino.cc/forum/index.php/topic,129286.0.html

This is an interesting bug. It smells like a memory leak or corruption crash to me. From the other thread, it appears you were using Strings - no longer, right?

To detect a possible leak, it might be worth instrumenting how much RAM is free and printing that out every half hour. There is a magic function for calculating free ram if you search the forums.
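
The function usually meant here is the widely circulated AVR `freeRam()` idiom, sketched below. It relies on the avr-libc symbols `__heap_start` and `__brkval`, so it only applies to AVR-based boards. The non-AVR branch returning -1 is an addition so that the snippet still compiles elsewhere.

```cpp
// Approximate free RAM: the gap between the top of the heap and the
// current stack position. AVR-only; relies on avr-libc internals.
int freeRam() {
#if defined(__AVR__)
  extern int __heap_start, *__brkval;
  int v;  // a local variable, so its address is close to the stack pointer
  // __brkval is 0 until the heap has been used; fall back to the heap start.
  return (int)&v - (__brkval == 0 ? (int)&__heap_start : (int)__brkval);
#else
  return -1;  // not meaningful on this platform (assumption for portability)
#endif
}
```

Printing `freeRam()` at a fixed interval and watching for a steady decline would point towards a leak. A value that stays flat makes exhaustion unlikely, though it says nothing about corruption.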

Memory corruption is more likely than memory exhaustion if you have stopped using Strings. I would Serial.print() the living devil out of the code path between finishing the command successfully and printing OK. You know it falls off the rails in there somewhere. One Serial.print per line if that's what it takes. (Or binary search, if you have a lot of 8-hour test windows…) If you can figure out which line it fails on, it might help.

-br

I've just looked at the HardwareSerial code for writing characters, and I think there may be a race condition bug in it. Here is the code from HardwareSerial.cpp (Arduino 1.0.2):

size_t HardwareSerial::write(uint8_t c)
{
  int i = (_tx_buffer->head + 1) % SERIAL_BUFFER_SIZE;
	
  // If the output buffer is full, there's nothing for it other than to 
  // wait for the interrupt handler to empty it a bit
  // ???: return 0 here instead?
  while (i == _tx_buffer->tail)
    ;
	
  _tx_buffer->buffer[_tx_buffer->head] = c;
  _tx_buffer->head = i;
	
  sbi(*_ucsrb, _udrie);
  // clear the TXC bit -- "can be cleared by writing a one to its bit location"
  transmitting = true;
  sbi(*_ucsra, TXC0);
  
  return 1;
}

The first problem is that the 2-byte volatile variables _tx_buffer->head and _tx_buffer->tail are written and read without disabling interrupts. However, as the size of the transmit buffer is less than 256 bytes, the upper byte will always be zero, so the situation of inconsistent upper/lower bytes will not arise.

The second possible problem is the instruction to clear the TXC (transmit complete) bit. I can't see what this is for and I think it is harmful. Suppose this line is executed around the time that the UART has just finished sending a character. Then the TXC interrupt may never occur, and the data in the ring buffer will never be sent. So the ring buffer will become full and Serial.print calls will block.

If I'm right, then your code is blocking while trying to write "OK" to the serial port. You might like to try lighting an LED just before making the call and turning it off immediately after it returns. That will tell you whether the program is locking up during that call or somewhere else.

Are you definitely receiving "OK" at the PC for the penultimate command that gets executed?

EDIT: after looking at the code some more, I see that it is the Data Register Empty interrupt that is being used, not the Transmit Complete Interrupt. So my analysis above is not correct. However, I still don't see the point of clearing the TXC bit, and I still think it may be worth your while using an LED to see whether the lockup is inside Serial.print.
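
To make the blocking behaviour concrete, here is a simplified model of that transmit ring buffer in plain C++. It is not the library code: `TxRing`, `tryWrite` and `drainOne` are illustrative names, the 64-byte size is an assumption, and `tryWrite` returns false at the point where `HardwareSerial::write` would spin in its `while` loop.

```cpp
#include <cstdint>

const int RING_SIZE = 64;  // assumed size, standing in for SERIAL_BUFFER_SIZE

struct TxRing {
  std::uint8_t buffer[RING_SIZE];
  volatile int head = 0;  // next slot to fill (advanced by the writer)
  volatile int tail = 0;  // next slot to send (advanced by the interrupt handler)

  // Same head/tail arithmetic as write(); false means "buffer full".
  bool tryWrite(std::uint8_t c) {
    int i = (head + 1) % RING_SIZE;
    if (i == tail) return false;  // the real code busy-waits here instead
    buffer[head] = c;
    head = i;
    return true;
  }

  // Stand-in for the Data Register Empty interrupt sending one byte.
  bool drainOne(std::uint8_t &out) {
    if (head == tail) return false;  // nothing queued
    out = buffer[tail];
    tail = (tail + 1) % RING_SIZE;
    return true;
  }
};
```

The buffer holds at most `RING_SIZE - 1` bytes. If nothing ever advances `tail`, for example because the transmit interrupt stops firing, the next write can never complete, which is the kind of lockup the LED test around the print call would reveal.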

Interesting is one word! :0

Serial.print() everywhere sounds like a good plan right now, and it allows me to procrastinate further in the knowledge that I really need to rebuild my electronics from scratch. I'm a neat, careful worker, but the stripboard has been modified a few times since the original design, and I must've made a mistake somewhere. While building the machine I learned how to make custom boards anyway, hopefully something will show up soon before I get to that stage. Next time I fancy either using custom drivers (Allegro A4988) or Darlington arrays fed from a pair of 74HC595s to keep my pin-count low and allow me half-wave control instead of the klutzy two-wire config I'm using now. I'm also tempted to use opto-isolators between the Arduino and the control board, though I'm sure that's just overkill, and no-one would ever bother?

Without the servo control board connected to the Arduino the code runs fine, which suggests it's a hardware problem.

When everything's connected it works for anything up to 15 hours, then stalls between commands, which suggests it's a software problem.

:roll_eyes:

@dc42

Your analysis of the HardwareSerial code is a little beyond my comprehension I'm afraid, but I have a pin free so I'll try your suggestion in a day or so. Thank you.

Okay, here's a game-changer...

Out of sheer desperation I hard-coded all my serial commands into the Arduino sketch itself and removed all references to the serial bus. Still had the laptop connected, but only as a power source. The commands are quite repetitive for the first job I want to do, so it just took a couple of nested loops to replicate them exactly. Started it off yesterday lunchtime with everything connected as normal, and a few minutes ago it finished, parked-up and powered-down.

Wow

So, can we assume it is the serial after all? I might try another run now with just Serial.begin() in the startup, to see if that works too.

Note: Just tried to upload new code, and the Arduino IDE halted with Serial Port 'COM7' already in use. Could this be a clue?

aibonewt:
Out of sheer desperation I hard-coded all my serial commands into the Arduino sketch itself and removed all references to the serial bus. Still had the laptop connected, but only as a power source. The commands are quite repetitive for the first job I want to do, so it just took a couple of nested loops to replicate them exactly. Started it off yesterday lunchtime with everything connected as normal, and a few minutes ago it finished, parked-up and powered-down.

Try re-adding the printing of "OK" to the serial port after each command. That will tell you whether it is the serial receive or transmit that is causing the problem (or it could be an interaction between the two).

Puzzled here. Doesn't this data support another hypothesis: Maybe the arduino is working perfectly, and the bug is in the windows usb serial driver somehow, or the connection, going dead?

Didn't you say the symptom was that it finished the job just fine but the "OK" never gets to the PC?

I'm sure I'm missing something...

-br

@dc42 - Yeah, I like the sound of that.

@billroy - Earlier I was losing the 'OK' after a single command was received and executed, which stalled the job. Without serial, all 7000+ commands are executed and the job finishes.

Okay, here's what happened with all commands internalised and just 'OK' sent out from the Arduino after each command/scanline. Here's the output from Comport Toolkit, so I can't even blame VB any more :wink:

Time (GMT)   Time (Job)   Comment
15:07:12     00:00:00     Start of Job
19:51:16     04:44:04     Last 'OK' received (1089th)
11:48:55     20:41:43     End of job

After the 'OK's stopped appearing, the Tx light no longer flashed for the rest of the job. But the Serial.println("OK"); command must've been parsed by the Arduino as it was in the main loop. So, what's happening here?

Oh, and something else that's interesting is the fact that without the Serial exchange between the two machines it cut nearly 4 hours off the job!

Your persistence and tenacity are commendable.

Suppose a ram corruption issue (wild pointer or stack overflow) is smashing the 'OK' string in RAM. A well-placed zero written over the first character would make the string zero length, which would replicate the "OK stops coming" part of the bug.

You could test this by using the OK string as a canary for ram contamination:

const char *canary = "OK";

// ...check in the command loop...
if (canary[0] != 'O') {
  // the canary is dead: something overwrote the string in RAM
}

It should never change, but it would be interesting if it did, right? It would mean the last command did something to corrupt ram.

If you can pin it down to a single/certain command, that would be progress...

-br

@ billroy - Interesting idea, I'll stick it on my 'to do' list.

For now, I'm trying something I should've done years ago and running my original serial-based setup from a completely different machine. I've been using a little EeePC up to now, which is beautifully compact, but had so much added/subtracted/plugged/unplugged over the years I can't help feeling there's something up with its USB ports. I've recently restored an old Acer Aspire 1350, so we'll see how that fares.

Onward...

Well (to paraphrase myself) that worked, TWICE!

Yup, it was the laptop closing the port. I don't blame the Eee PC1000H itself, it probably just needs a fresh install of the OS. Dunno why, but I'm using a different computer now (also XP SP3) and everything's running like butter wouldn't melt.

Sincere thanks for all the suggestions which have helped to stabilise my electronics and tighten my coding.

Thank you!

[RESOLVED] Near-enough :wink:

Laptop going into sleep mode after a while, perhaps?

Absolutely no idea!

I disabled sleep mode/power-saving (even the screen saver), culled all non-essential processes, pulled the network, switched off Wifi/Bluetooth and disabled SelectiveSuspend on the USB ports. I would do a fresh install, but it's such a faff on the Eee PC without a CD-ROM drive. I'll keep the old Acer in the workshop now, the Eee PC can go back to being what it was originally bought for, being m-o-b-i-l-e!