Anyone have any Memory tests?

Hi all, I have a project I am deploying on the Arduino Ethernet with PoE (pretty much an Uno R3 with a built-in Ethernet board).
Everything was going fine until I programmed up the 4th unit.
On board 4 I get different responses from boards 1-3 and the prototype.
The code has not changed. The responses in question aren't dependent on external hardware (it's basically an SNMP OID database walk to generate the response).
Two other identical boards programmed with the same code/data today are working just fine when given identical tasks to the problem board.
The fault is 100% repeatable, but to find out what's going wrong I potentially have to add a huge amount of debug output, as it's bound to be in the SNMP library stack somewhere. It would save a lot of time to run a memory checker and be able to discover that some memory bit was DOA.

So, has anyone written any memory tests? They would make my life easier at this stage.
Alternative suggestions are also welcome, of course. :)

You can run something like the following sketch (for an Uno):

//
//    FILE: memoryCheckSum.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: demo / prototype
//    DATE: 2014-02-16
//     URL:
//
// Released to the public domain
//

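#include <avr/pgmspace.h>  // pgm_read_word_near()

// Sum the whole 32 KB flash address range as 16-bit words.
// The range covers the sketch, the bootloader, and whatever lies in between.
uint16_t checksum()
{
  uint16_t sum = 0;
  for (uint16_t addr = 0; addr < 32767; addr += 2)
  {
    sum += pgm_read_word_near(addr);  // more complex CRC possible
  }
  return sum;
}

void setup()
{
  Serial.begin(115200);
  uint16_t check = checksum();  // same type as the function returns
  Serial.println(check);
}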

void loop()
{
}
  • Not tested: whether leftovers from previous sketches influence the checksum.

please let us know the results!

Which memory is of concern here? I thought that AVRDUDE verified the flash after an upload. That being the case, is the concern with SRAM, or maybe EEPROM?

good point Jack!

Could it also be a runtime allocation problem?
(Apps that run near the SRAM limit can hit intermittent out-of-memory errors, depending on their runtime behaviour.)

Rob: I am currently running a RAM tester and I will push in a variant of your checksum after I have done that.
I think the previous sketches will have an effect, so I am going to tweak it to include a huge data segment and also to report checksums every 256 bytes.
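A minimal sketch of what such a per-block variant could look like. This is not Fafhrd's actual code. The names `blockChecksum`, `ByteReader` and `readFlash` are made up for illustration, and a plain byte sum stands in for whatever checksum is really used. The reader is passed in as a function pointer so the summing logic can also be exercised off-target.

```cpp
#include <stdint.h>

// Hypothetical reader type: returns the byte stored at a 16-bit address.
// On an AVR this would wrap pgm_read_byte_near().
typedef uint8_t (*ByteReader)(uint16_t addr);

// Plain 16-bit sum of the bytes in [start, start + len).
uint16_t blockChecksum(ByteReader readByte, uint32_t start, uint16_t len)
{
  uint16_t sum = 0;
  for (uint32_t a = start; a < start + len; a++) {
    sum += readByte((uint16_t)a);
  }
  return sum;
}

#if defined(ARDUINO) && defined(__AVR__)
#include <Arduino.h>
#include <avr/pgmspace.h>

static uint8_t readFlash(uint16_t addr)
{
  return pgm_read_byte_near(addr);
}

void setup()
{
  Serial.begin(115200);
  // Assumes 32 KB of flash (ATmega328P), reported in 256-byte blocks.
  for (uint32_t base = 0; base < 32768UL; base += 256) {
    Serial.print(base, HEX);
    Serial.print(F(": "));
    Serial.println(blockChecksum(readFlash, base, 256), HEX);
  }
}

void loop()
{
}
#endif
```

Comparing the printed lists from a good and a bad board would show which 256-byte block differs, rather than only that something differs.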

Jack: Part of my problem is I don't know which memory might be a problem :)
I had wondered if the FLASH was checked but I haven't seen anything to confirm it. I will have another dig, as that would remove a large set of possible faults.

The code is big (24K or so), which is why I want to go the system tester route. I wouldn't bother, except that if I follow an identical process from boot, one board fails and 5 others don't. The fail vs succeed is totally repeatable.

I do have a freemem check in the code and it's showing 490 bytes of headroom.

At what moment does the system fail?

  • directly after loading
  • in setup()
  • in loop()
  • in a specific function?

can you call freemem() in different places to see patterns?
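One way to look for such patterns is to record the lowest free-memory figure and the place it was seen. The sketch below is only an illustration. `MemWatch`, `memNote` and the tag strings are hypothetical names, and `freeMemory()` is the commonly used AVR idiom based on the avr-libc symbols `__heap_start` and `__brkval`. It is assumed here to be roughly what the existing freemem check does.

```cpp
#include <stdint.h>

// Tracks the lowest free-SRAM figure seen so far and where it was seen.
struct MemWatch {
  int minFree;
  const char *where;
};

// Record freeNow if it is a new minimum.
void memNote(MemWatch &w, int freeNow, const char *tag)
{
  if (freeNow < w.minFree) {
    w.minFree = freeNow;
    w.where = tag;
  }
}

#if defined(ARDUINO) && defined(__AVR__)
#include <Arduino.h>

extern char __heap_start;  // linker symbol: start of the heap
extern char *__brkval;     // avr-libc: current top of the heap, 0 if unused

// Gap between the top of the heap and the current stack position.
int freeMemory()
{
  char top;
  char *heapEnd = __brkval ? __brkval : &__heap_start;
  return (int)&top - (int)heapEnd;
}

MemWatch watch = { 32767, "" };

void setup()
{
  Serial.begin(115200);
  memNote(watch, freeMemory(), "setup");
}

void loop()
{
  memNote(watch, freeMemory(), "loop top");
  // ... hypothetical spot where the SNMP request would be handled ...
  memNote(watch, freeMemory(), "after request");
  Serial.print(watch.where);
  Serial.print(F(": "));
  Serial.println(watch.minFree);
  delay(1000);
}
#endif
```

The figure is only meaningful at the call site, so the most useful `memNote()` calls are those placed inside the deepest functions of the request path, where the stack is at its largest.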

"In the loop()."
To be a bit more detailed, the Arduino is receiving an SNMP getnext command, acting on that command apparently successfully, and sending a good reply. But it MAY be returning a corrupt packet containing the target for the next command. OR it may be corrupting the packet of the next command when it receives it.
Either way, when it comes to acting on the following GetNext command, the OID I receive in my Arduino code is truncated and fails its database lookup. It looks like the corruption is occurring in the Agentuino SNMP/IP stack, but I have already fixed a few things in there, and as it's not my code I would rather avoid digging into it if it's a hardware problem.
Other GetNext sequences are completing successfully on the failing device.
The SNMP client end is the same at all times and so I currently don't want to look into that as a possible problem area.
I could do some more freemem and I will do proper debugging if the memory tests don't help.
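If it does come to proper debugging, dumping the raw OID bytes on a good and a bad board gives something concrete to compare. A small helper sketch follows. `hexByte` and `dumpHex` are hypothetical names, no Agentuino calls are used, and the sample buffer is simply the BER encoding of 1.3.6.1.2.1.1.1.0 standing in for a received OID.

```cpp
#include <stdint.h>
#include <stddef.h>

// Format one byte as two upper-case hex digits plus a terminating NUL.
void hexByte(uint8_t b, char out[3])
{
  static const char digits[] = "0123456789ABCDEF";
  out[0] = digits[b >> 4];
  out[1] = digits[b & 0x0F];
  out[2] = '\0';
}

#if defined(ARDUINO)
#include <Arduino.h>

// Print a buffer as space-separated hex bytes on one line.
void dumpHex(const uint8_t *buf, size_t len)
{
  char h[3];
  for (size_t i = 0; i < len; i++) {
    hexByte(buf[i], h);
    Serial.print(h);
    Serial.print(' ');
  }
  Serial.println();
}

void setup()
{
  Serial.begin(115200);
  // Stand-in data: BER encoding of OID 1.3.6.1.2.1.1.1.0
  const uint8_t sample[] = { 0x2B, 0x06, 0x01, 0x02, 0x01, 0x01, 0x01, 0x00 };
  dumpHex(sample, sizeof(sample));
}

void loop()
{
}
#endif
```

Calling `dumpHex()` once on the outgoing reply and once on the next incoming request would show on which side the truncation first appears.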

So I modified Rob's checksum code.
It's now 32,024 bytes in size, so I can ignore flash pollution from smaller projects. Most of the size is a lot of 256-byte PROGMEM char arrays. I was going to attach it but there is a max attachment size of 4Kb.

Unfortunately the checksums from my good board and from the bad board are the same.

Next I will run the RAM test.
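For reference, a very simple RAM pattern test might look like the sketch below. It is not the tester used in this thread. `countBadBytes` is a hypothetical name, the 128-byte stack margin is a guess, and only the currently unused SRAM between heap and stack is exercised, using the avr-libc symbols `__heap_start` and `__brkval`.

```cpp
#include <stdint.h>
#include <stddef.h>

// Write and read back a few bit patterns at each byte, restoring the
// original content afterwards. Returns the number of bytes that failed.
size_t countBadBytes(volatile uint8_t *p, size_t n)
{
  static const uint8_t patterns[] = { 0x55, 0xAA, 0x00, 0xFF };
  size_t bad = 0;
  for (size_t i = 0; i < n; i++) {
    uint8_t saved = p[i];
    bool ok = true;
    for (uint8_t k = 0; k < sizeof(patterns); k++) {
      p[i] = patterns[k];
      if (p[i] != patterns[k]) {
        ok = false;
      }
    }
    p[i] = saved;
    if (!ok) {
      bad++;
    }
  }
  return bad;
}

#if defined(ARDUINO) && defined(__AVR__)
#include <Arduino.h>

extern char __heap_start;  // linker symbol: start of the heap
extern char *__brkval;     // avr-libc: current top of the heap, 0 if unused

void setup()
{
  Serial.begin(115200);
  char marker;  // its address approximates the current stack position
  uint8_t *lo = (uint8_t *)(__brkval ? __brkval : &__heap_start);
  uint8_t *hi = (uint8_t *)&marker - 128;  // assumed safety margin for the stack
  size_t bad = 0;
  if (hi > lo) {
    noInterrupts();  // keep ISRs from using the stack during the test
    bad = countBadBytes(lo, (size_t)(hi - lo));
    interrupts();
  }
  Serial.print(F("bad bytes: "));
  Serial.println(bad);
}

void loop()
{
}
#endif
```

A test like this can catch stuck bits in individual cells, but not addressing faults or problems in RAM that is already in use by globals and the stack, so a clean result narrows things down rather than proving the SRAM good.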

Fafhrd:
Unfortunately the checksums from my good board and from the bad board are the same.

Think that is a good sign as you eliminated it as the cause of the bug you're hunting !

Keep us informed of the progress.

I just realised I did not fully answer Jack's point.

  • RAM - yes - 430 bytes of headroom but I will add some asserts() to check that.
  • FLASH - about 24Kbyte, 29K with all the debug switched on
  • EEPROM - 17 x 32 bit values
    So I am using EEPROM, but only as a non-volatile store for config. During the testing that fails, only the initial config load is happening. If it loaded a bad IP address I wouldn't be getting anywhere. The other values do not affect the failing SNMP traffic parts. At some point I will load up my EEPROM reset s/w, just to remove EEPROM as a possible problem.
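A quick way to compare the stored config between boards, short of resetting it, is to dump the values and a simple sum over them. The sketch below is only a guess at the layout. It assumes the 17 values sit back to back from EEPROM address 0, which the post does not say, and `configSum` and `CONFIG_WORDS` are hypothetical names.

```cpp
#include <stdint.h>
#include <stddef.h>

const size_t CONFIG_WORDS = 17;  // from the post: 17 x 32-bit values

// Wrapping 32-bit sum, used only to compare boards at a glance.
uint32_t configSum(const uint32_t *v, size_t n)
{
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += v[i];
  }
  return sum;
}

#if defined(ARDUINO) && defined(__AVR__)
#include <Arduino.h>
#include <EEPROM.h>

void setup()
{
  Serial.begin(115200);
  uint32_t cfg[CONFIG_WORDS];
  for (size_t i = 0; i < CONFIG_WORDS; i++) {
    // Assumed layout: value i lives at byte offset i * 4.
    EEPROM.get((int)(i * sizeof(uint32_t)), cfg[i]);
    Serial.print(i);
    Serial.print(F(": 0x"));
    Serial.println(cfg[i], HEX);
  }
  Serial.print(F("sum: 0x"));
  Serial.println(configSum(cfg, CONFIG_WORDS), HEX);
}

void loop()
{
}
#endif
```

Values that legitimately differ per board, such as an IP address, will change the sum, so the per-value lines are the ones to compare.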

robtillaart:
Think that is a good sign as you eliminated it as the cause of the bug you're hunting !

Agreed... but I am always happiest when I can see the smoke coming out of the bug.

RAM check just finished. No reported errors, so it looks like I am going to have to do this the hard way :(

Many thanks to both of you for comments.

Fafhrd:
Jack: Part of my problem is I don't know which memory might be a problem :)

Fafhrd:
It's now 32,024 bytes in size, so I can ignore flash pollution from smaller projects.

I fail to see how you have arrived at the conclusion that there is a memory integrity problem. Obviously there is some sort of issue, but quite frankly, memory integrity would be very, very low on my list of suspects.

I have no idea what "flash pollution" is or how "smaller" projects would cause it.

With verbose output enabled in preferences, the last few lines:

avrdude: Send: t [74] . [00] . [18] F [46]   [20] 
avrdude: Recv: . [14] 
avrdude: Recv: . [e6] . [0a] . [ef] . [0a] . [04] . [0b] . [15] . [0b] . [00] . [00] . [00] . [00] . [d8] . [10] } [7d] . [11] n [6e] . [10] . [9f] . [10] . [7f] . [10] . [c8] . [10] 
avrdude: Recv: . [10] 
# | 100% 3.50s

avrdude: verifying ...                        <------------
avrdude: 10392 bytes of flash verified        <------------
avrdude: Send: Q [51]   [20] 
avrdude: Recv: . [14] 
avrdude: Recv: . [10] 

avrdude done.  Thank you.

When one board fails and 5 work I start to suspect hardware.
Doesn't mean the problem is hardware but if the hardware is easy to eliminate then it makes other testing easier. Testing memory validity/functionality should be relatively easy.
Nothing worse than chasing software problems only to then find out it is faulty hardware. I spent 2 days last month trying to figure out why some auto calibration s/w updates were not working, only to discover my 4.8 V reference zeners were not.

Thanks for confirming that avrdude does verify the flash.

Oh I am so NOT happy!

This morning I thought I would validate all the remaining Arduino boards I have (seemed like a good idea in case there was anything else wrong with any others, or another showed the problem from board #4)
They all worked. Good.

So I go back to my problem board, which I will point out was consistently failing multiple times yesterday, even when given long rest periods between tests while I tried other boards that all passed with the same code base.

And today... the horrid little thing is working fine.

I can even confirm the ambient temperature is within 1 Degree C of yesterday.

So clearly I now have an intermittent problem where yesterday it was nicely reproducible. That means there is a much wider range of potential causes, and it raises the specter that some of the software that has been in use for months has a potential problem. :(

Oh well, that's life. Thanks again for the comments.

Fafhrd:

Jack Christensen:
I fail to see how you have arrived at the conclusion that there is a memory integrity problem.

When one board fails and 5 work I start to suspect hardware.

Couldn't disagree with that, but memory failure is quite a leap without additional specific symptoms.

Fafhrd:
Testing memory validity/functionality should be relatively easy.

Perhaps, but also a waste of time and effort. Pretty much last on the list like I said. Well, good luck, let us know what you find.