Topic: Arduino Randomly Freezing During Long Jobs
United Kingdom
Tesla Member
***
Karma: 224
Posts: 6613
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

I've just looked at the HardwareSerial code for writing characters, and I think there may be a race condition bug in it. Here is the code from HardwareSerial.cpp (Arduino 1.0.2):

Code:
size_t HardwareSerial::write(uint8_t c)
{
  int i = (_tx_buffer->head + 1) % SERIAL_BUFFER_SIZE;

  // If the output buffer is full, there's nothing for it other than to
  // wait for the interrupt handler to empty it a bit
  // ???: return 0 here instead?
  while (i == _tx_buffer->tail)
    ;

  _tx_buffer->buffer[_tx_buffer->head] = c;
  _tx_buffer->head = i;

  sbi(*_ucsrb, _udrie);
  // clear the TXC bit -- "can be cleared by writing a one to its bit location"
  transmitting = true;
  sbi(*_ucsra, TXC0);
  
  return 1;
}

The first possible problem is that the 2-byte volatile variables _tx_buffer->head and _tx_buffer->tail are written and read without disabling interrupts. However, as the transmit buffer is smaller than 256 bytes, the upper byte of each index will always be zero, so the situation of inconsistent upper/lower bytes will not arise.
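
The upper-byte argument can be checked with a small host-side model. This is only a sketch under stated assumptions: `tornRead` and `tornReadIsBogus` are hypothetical helpers (not part of the Arduino core) that imitate an 8-bit CPU loading a 16-bit index one byte at a time, low byte first, with an interrupt changing the index between the two loads.

```cpp
#include <stdint.h>

// Hypothetical host-side model, not Arduino library code.
// An 8-bit AVR reads a 16-bit variable as two separate byte loads. If an
// interrupt updates the variable between the loads, the CPU can see the low
// byte of the old value combined with the high byte of the new value.
uint16_t tornRead(uint16_t before, uint16_t after)
{
  uint8_t low = before & 0xFF;          // low byte loaded before the interrupt
  uint8_t high = (after >> 8) & 0xFF;   // high byte loaded after the interrupt
  return static_cast<uint16_t>((high << 8) | low);
}

// True if a torn read of this transition yields a value that the index never
// actually held (neither the old value nor the new one).
bool tornReadIsBogus(uint16_t before, uint16_t after)
{
  uint16_t seen = tornRead(before, after);
  return seen != before && seen != after;
}
```

In this model, any transition between two indices below 256 gives back the old value, because the high byte is zero on both sides. A transition across 255 to 256 would produce a value the index never held, which is the case the small buffer size rules out.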

The second possible problem is the instruction to clear the TXC (transmit complete) bit. I can't see what this is for and I think it is harmful. Suppose this line is executed around the time that the UART has just finished sending a character. Then the TXC interrupt may never occur, and the data in the ring buffer will never be sent. So the ring buffer will become full and Serial.print calls will block.

If I'm right, then your code is blocking while trying to write "OK" to the serial port. You might like to try lighting an LED just before making the call and turning it off immediately after it returns. That will tell you whether the program is locking up during that call or somewhere else.
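
A minimal sketch of that LED test might look like the following. The pin number and the `sendOk` name are assumptions for illustration. The block under `#ifndef ARDUINO` only provides host-side stand-ins so the fragment can be compiled off-board; on a real board the Arduino core supplies `digitalWrite()`, `HIGH`/`LOW` and `Serial`, and `setup()` would also need `pinMode(DEBUG_LED_PIN, OUTPUT)`.

```cpp
#ifndef ARDUINO
// Host-side stand-ins so the fragment compiles off-board. On a real board the
// Arduino core supplies digitalWrite(), HIGH/LOW and Serial instead.
#include <string>
#include <vector>
const int HIGH = 1;
const int LOW = 0;
std::vector<int> ledHistory;          // every level written to the LED pin
std::string sentText;                 // everything "printed" to serial
void digitalWrite(int /*pin*/, int level) { ledHistory.push_back(level); }
struct FakeSerial {
  void println(const char *s) { sentText += s; sentText += "\r\n"; }
} Serial;
#endif

const int DEBUG_LED_PIN = 13;  // assumption: any free pin with an LED attached

// Light the LED only while the serial write is in progress. If the board
// freezes with the LED on, the lockup is inside Serial.println(); if it
// freezes with the LED off, the lockup is somewhere else.
void sendOk()
{
  digitalWrite(DEBUG_LED_PIN, HIGH);
  Serial.println("OK");
  digitalWrite(DEBUG_LED_PIN, LOW);
}
```

Calling `sendOk()` wherever the sketch currently prints "OK" keeps the change to a single place.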

Are you definitely receiving "OK" at the PC for the penultimate command that gets executed?

EDIT: after looking at the code some more, I see that it is the Data Register Empty interrupt that is being used, not the Transmit Complete Interrupt. So my analysis above is not correct. However, I still don't see the point of clearing the TXC bit, and I still think it may be worth your while using an LED to see whether the lockup is inside Serial.print.
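
To make the corrected picture concrete, here is a simplified host-side model of a transmit path driven by the Data Register Empty interrupt. It is a sketch, not the library code: `TxModel`, `tryWrite` and `onDataRegisterEmpty` are hypothetical names, the `udrieEnabled` flag stands in for the UDRIE bit, and `tryWrite` returns 0 when the buffer is full rather than blocking (the alternative the `???` comment in the code above wonders about).

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string>

// Simplified host-side model (hypothetical names, not Arduino library code).
const unsigned int BUFFER_SIZE = 8;  // small on purpose, so "full" is easy to hit

struct TxModel {
  uint8_t buffer[BUFFER_SIZE];
  unsigned int head = 0;        // next slot the writer fills
  unsigned int tail = 0;        // next slot the interrupt handler drains
  bool udrieEnabled = false;    // stands in for the UDRIE interrupt-enable bit
  std::string wire;             // bytes that reached the "UART"

  // Non-blocking variant of write(): returns 0 instead of spinning when full.
  size_t tryWrite(uint8_t c)
  {
    unsigned int next = (head + 1) % BUFFER_SIZE;
    if (next == tail) {
      return 0;                 // buffer full: one slot is always left unused
    }
    buffer[head] = c;
    head = next;
    udrieEnabled = true;        // ask for a "data register empty" interrupt
    return 1;
  }

  // What the Data Register Empty handler does in this model: send one byte,
  // or switch the interrupt off once nothing is left to send.
  void onDataRegisterEmpty()
  {
    if (head == tail) {
      udrieEnabled = false;
      return;
    }
    wire += static_cast<char>(buffer[tail]);
    tail = (tail + 1) % BUFFER_SIZE;
  }
};
```

The TXC flag plays no part in draining the buffer in this model, which fits the correction above: as long as the data-register-empty interrupt stays enabled while bytes are queued, the buffer empties regardless of what happens to TXC.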
« Last Edit: December 07, 2012, 08:27:43 am by dc42 » Logged

Formal verification of safety-critical software, software development, and electronic design and prototyping. See http://www.eschertech.com. Please do not ask for unpaid help via PM, use the forum.

United Kingdom
Newbie
*
Karma: 0
Posts: 43
Some things are difficult, nothing's impossible

Interesting is one word!

Serial.print() everywhere sounds like a good plan right now, and it allows me to procrastinate further in the knowledge that I really need to rebuild my electronics from scratch. I'm a neat, careful worker, but the stripboard has been modified a few times since the original design, and I must've made a mistake somewhere. While building the machine I learned how to make custom boards anyway; hopefully something will show up before I get to that stage. Next time I fancy using either custom drivers (Allegro A4988) or Darlington arrays fed from a pair of 74HC595s, to keep my pin-count low and allow me half-wave control instead of the klutzy two-wire config I'm using now. I'm also tempted to use opto-isolators between the Arduino and the control board, though I'm sure that's just overkill, and no-one would ever bother?

Without the servo control board connected to the Arduino the code runs fine, which suggests it's a hardware problem.

When everything's connected it works for anything up to 15 hours, then stalls between commands, which suggests it's a software problem.

@dc42

Your analysis of the HardwareSerial code is a little beyond my comprehension I'm afraid, but I have a pin free so I'll try your suggestion in a day or so. Thank you.
Logged

Okay, here's a game-changer...

Out of sheer desperation I hard-coded all my serial commands into the Arduino sketch itself and removed all references to the serial bus. Still had the laptop connected, but only as a power source. The commands are quite repetitive for the first job I want to do, so it just took a couple of nested loops to replicate them exactly. Started it off yesterday lunchtime with everything connected as normal, and a few minutes ago it finished, parked-up and powered-down.

Wow

So, can we assume it is the serial after all? I might try another run now with just Serial.begin() in the startup, to see if that works too.

Note: Just tried to upload new code, and the Arduino IDE halted with Serial Port 'COM7' already in use. Could this be a clue?
« Last Edit: December 09, 2012, 08:25:38 am by aibonewt » Logged

Try re-adding the printing of "OK" to the serial port after each command. That will tell you whether it is the serial receive or transmit that is causing the problem (or it could be an interaction between the two).
Logged

God Member
*****
Karma: 39
Posts: 988
Get Bitlash: http://bitlash.net

Puzzled here. Doesn't this data support another hypothesis: maybe the Arduino is working perfectly, and the bug is in the Windows USB serial driver somehow, or the connection is going dead?

Didn't you say the symptom was that it finished the job just fine but the "OK" never gets to the PC?

I'm sure I'm missing something...

-br
Logged

@dc42 - Yeah, I like the sound of that.

@billroy - Earlier I was losing the 'OK' after a single command was received and executed, which stalled the job. Without serial, all 7000+ commands are executed and the job finishes.
Logged

Okay, here's what happened with all commands internalised and just 'OK' sent out from the Arduino after each command/scanline. Here's the output from Comport Toolkit, so I can't even blame VB any more.

Time (GMT)   Time (Job)   Comment
15:07:12     00:00:00     Start of job
19:51:16     04:44:04     Last 'OK' received (1089th)
11:48:55     20:41:43     End of job

After the 'OK's stopped appearing, the Tx light no longer flashed for the rest of the job. But the Serial.println("OK"); command must've been executed by the Arduino, as it was in the main loop. So, what's happening here?

Oh, and something else that's interesting is the fact that without the Serial exchange between the two machines it cut nearly 4 hours off the job!
« Last Edit: December 10, 2012, 07:23:34 am by aibonewt » Logged

Your persistence and tenacity are commendable.

Suppose a RAM corruption issue (wild pointer or stack overflow) is smashing the 'OK' string in RAM. A well-placed zero written over the first character would make the string zero length, which would replicate the "OK stops coming" part of the bug.

You could test this by using the OK string as a canary for RAM contamination:

Code:
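char canary[] = "OK";   // an array in RAM, so a stray write can reach it

// ...check in the command loop...
if (canary[0] != 'O') {
  // the canary is dead: something has overwritten it since the last check
}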

It should never change, but it would be interesting if it did, right? It would mean the last command did something to corrupt RAM.

If you can pin it down to a single/certain command, that would be progress...
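
One way to pin it down, as a rough sketch: check the canary after every command and remember the index of the first command after which it was found dead. The names `canaryIntact`, `checkCanary` and `firstBadCommand` are hypothetical, and the command index is assumed to be whatever counter the sketch already keeps.

```cpp
#include <string.h>

// Hypothetical helpers, not part of any library.
char canary[] = "OK";        // kept in RAM so a stray write can reach it
long firstBadCommand = -1;   // first command after which the canary was dead

// True while the canary still holds exactly "OK" plus its terminator.
bool canaryIntact()
{
  return memcmp(canary, "OK", 3) == 0;
}

// Call once after each command; only the first failure is remembered.
void checkCanary(long commandIndex)
{
  if (firstBadCommand < 0 && !canaryIntact()) {
    firstBadCommand = commandIndex;
  }
}
```

If `firstBadCommand` ever leaves -1, it could be reported by some route that does not depend on the corrupted string, such as a spare LED.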

-br
Logged

@billroy - Interesting idea, I'll stick it on my 'to do' list.

For now, I'm trying something I should've done years ago and running my original serial-based setup from a completely different machine. I've been using a little EeePC up to now, which is beautifully compact, but it has had so much added/subtracted/plugged/unplugged over the years I can't help feeling there's something up with its USB ports. I've recently restored an old Acer Aspire 1350, so we'll see how that fares.

Onward...
Logged

Well (to paraphrase myself) that worked, TWICE!

Yup, it was the laptop closing the port. I don't blame the Eee PC 1000H itself; it probably just needs a fresh install of the OS. Dunno why, but I'm using a different computer now (also XP SP3) and everything's running like butter wouldn't melt.

Sincere thanks for all the suggestions which have helped to stabilise my electronics and tighten my coding.

Thank you!

[RESOLVED] Near enough
Logged

Laptop going into sleep mode after a while, perhaps?
Logged

Absolutely no idea!

I disabled sleep mode/power-saving (even the screen saver), culled all non-essential processes, pulled the network, switched off Wifi/Bluetooth and disabled SelectiveSuspend on the USB ports. I would do a fresh install, but it's such a faff on the Eee PC without a CD-ROM drive. I'll keep the old Acer in the workshop now, the Eee PC can go back to being what it was originally bought for, being m-o-b-i-l-e!
Logged
