EEPROM Strange ERROR

Hi Guys!

I have a "small" problem with my device.

Environment:

  • Arduino Mega 2560 Pro
  • EEPROM library

I am building a purpose-built device. On first startup, I enter its programming interface, where I set some variables for the operation of the machine, which I then store in EEPROM.
The machine is then put into use. It is switched on and off several times a day.
On power-up, I use EEPROM.get() to read the values of the variables and the machine operates with them.

It works perfectly so far!

Days, weeks, months go by, and then one of the machines goes crazy: it does not do what it should, or not as it should. It is almost immediately clear that there is a problem with one of the values read from the EEPROM. The stored values are normally between 0 and 3000, but sometimes I read back values around 5000 or 23000, which is clearly out of range. If you turn the device off and back on again, it reads the same erroneous value, so we can say for sure that the data stored in the EEPROM is corrupted. There is no EEPROM write on power on/off, only a read. If I then enter the settings menu (it is locked with a code) and reset the value, the program works perfectly again!

Why can this EEPROM corruption happen?

  • So far I have had the BOD turned off in the fuse bits (Brown-out Detection level: disabled). I know it's silly, but during normal use there is never any writing, yet the EEPROM gets corrupted. Can EEPROM contents be corrupted during reading with the BOD off?
  • Is redundant storage a solution, i.e. storing every value in two memory areas? On startup I compare them, and if they don't match I overwrite the bad one with the good one. I know which one is bad because its contents are outside the valid range: if it is greater than 3000 or less than 0, it is overwritten with the value from the copy.
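  EEPROM.get(805, value1);       // primary copy
  EEPROM.get(1805, permavalue);  // backup copy
  // Primary out of range: restore it from the backup.
  if ((value1 < 0) || (value1 > 3000)) { value1 = permavalue; EEPROM.put(805, permavalue); }
  // Backup differs from the primary: refresh the backup.
  if (permavalue != value1) { EEPROM.put(1805, value1); }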

Will this work well? Am I right in thinking that there is little chance of both copies of a value being damaged at the same time? So theoretically I can keep checking whether a piece of data is corrupted, and then restore it from the good copy.

Or can you imagine that just turning on the BOD would solve the problem? Is it possible for the data to get corrupted because of the missing BOD even though the sketch never writes to the EEPROM?
Or could it be that I'm using a Chinese Arduino clone and the EEPROM is of poor quality?

First of all, it would be good to have absolute evidence of a corrupted EEPROM. You can read out the EEPROM with avrdude. Run it from the command line and check the avrdude help for the options.

Yes, it can be EEPROM corruption. Why? Maybe power issues, maybe poor quality. Hard to say.
Whether to keep a redundant copy of the data is up to you. You can write it to two areas, or use external storage such as an external EEPROM chip or an SD card.

And... given that it’s fine for a while, I would suspect it’s an electrical issue: noise, spikes, etc.
It might be useful to see a wiring diagram and a picture of the layout.

How would you know which was the bad one, besides the obvious "out of range" sanity check?

You could have subtle corruptions in the main and backup parameter data blocks.

I would suggest you implement a simple checksum system on your EEPROM data to check for even the slightest "corruption".
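
For what it's worth, a minimal sketch of such a checksum could look like this. The `Settings` and `Record` layouts and the helper names are made up for illustration, and CRC-16/CCITT-FALSE is just one reasonable choice of checksum, not something the EEPROM library provides. The idea is to store the CRC next to the data and recompute it after every read.

```cpp
#include <stdint.h>
#include <stddef.h>

// Hypothetical settings block; the field names are illustrative only.
struct Settings {
  int16_t value1;
  int16_t value2;
};

// Settings plus their checksum, stored together as one record.
struct Record {
  Settings data;
  uint16_t crc;
};

// CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF), bit by bit.
uint16_t crc16(const uint8_t *bytes, size_t len) {
  uint16_t crc = 0xFFFF;
  for (size_t i = 0; i < len; ++i) {
    crc ^= (uint16_t)((uint16_t)bytes[i] << 8);
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                           : (uint16_t)(crc << 1);
    }
  }
  return crc;
}

// Build a record that is ready to be written to EEPROM.
Record makeRecord(const Settings &s) {
  Record r;
  r.data = s;
  r.crc = crc16((const uint8_t *)&r.data, sizeof r.data);
  return r;
}

// Check a record that was read back from EEPROM.
bool recordIsValid(const Record &r) {
  return r.crc == crc16((const uint8_t *)&r.data, sizeof r.data);
}
```

On the Mega this would be used roughly as `EEPROM.put(addr, makeRecord(settings));` when saving and `EEPROM.get(addr, r); if (!recordIsValid(r)) { ... }` on startup, where `addr` is whatever address you pick. Unlike the range check, this also catches a corrupted value that still happens to lie between 0 and 3000.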

I suppose the "Arduino Mega 2560 Pro" is the knock off of the MEGA which is smaller in size but still based on the ATMega2560?

Just to remove any doubt from a bad batch of boards with "crappy" ATMega2560, have you tried with just a regular official MEGA to see if you would have the same issue?

Power loss during EEPROM writing?
The most obvious suggestion.

Some players ‘write’ just before shutdown... dangerous.
EEPROM writes/puts can take milliseconds to complete.

At this point, a schematic is necessary, as I suspect something in your power arrangements may be the cause. A verbal description of how you wired it won't suffice, as English is often imprecise.
And yes, a checksum of both your data and backup data is required. And a fallback plan if both are corrupt.
Thanks.
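
To make the fallback idea concrete, here is a rough sketch of the decision logic only. All names (`Copy`, `Choice`, `chooseCopy`) are hypothetical, and `valid` is assumed to be the result of whatever checksum and range check you run on each copy after reading it. Letting the primary win when both copies are valid but differ is an arbitrary tie-break that assumes the primary is always written first.

```cpp
#include <stdint.h>

// Where the value that the machine will run with came from.
enum class Source : uint8_t { Primary, Backup, Defaults };

// One copy as read from EEPROM, plus the result of its checksum/range check.
struct Copy {
  int16_t value;
  bool valid;
};

// The value to use and which EEPROM copies should be rewritten.
struct Choice {
  int16_t value;
  Source source;
  bool rewritePrimary;
  bool rewriteBackup;
};

Choice chooseCopy(Copy primary, Copy backup, int16_t defaultValue) {
  if (primary.valid) {
    // Primary is good: repair the backup if it is bad or out of sync.
    bool fixBackup = !backup.valid || backup.value != primary.value;
    return Choice{primary.value, Source::Primary, false, fixBackup};
  }
  if (backup.valid) {
    // Only the backup is good: restore the primary from it.
    return Choice{backup.value, Source::Backup, true, false};
  }
  // Both copies are bad: fall back to a safe default and write nothing.
  // The caller decides what to do next, e.g. refuse to run until the
  // machine has been reconfigured.
  return Choice{defaultValue, Source::Defaults, false, false};
}
```

Keeping this decision separate from the EEPROM calls makes it easy to test on a PC, and the `Source::Defaults` case gives the sketch a clear place to log or signal that both copies were lost.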

I would not underestimate the chance of corrupting the EEPROM content from the sketch.

A simple buffer overflow or array subscript error can mess up the whole sketch.

OP said the sketch is only reading the EEPROM whilst in production, so that's not likely unless users have fun and go into admin mode just to mess up the values :wink:

Since a reboot leads to the same wrong data, it’s not likely that only the RAM copy of what was in the EEPROM got overwritten after working for a while.

So it doesn’t catch any persistent runtime values while it’s running… ok.

There is the very rare possibility of radiation affecting the EEPROM. That is usually only something to worry about in a space environment, but there is background radiation, cosmic rays, etc.

I would think it much more likely that something crashed the code and caused an erroneous write to the EEPROM. Possibly a brown-out, or some error in the code that requires a very specific set of circumstances to occur.

Has this problem only happened once? If it has occurred multiple times, is it a specific machine that has the problem, or is it distributed randomly among multiple machines?

Hi all!

Thanks for the replies!
Here is the schematic:

  • The error has happened rarely
  • The error has happened on different devices, at different times

Some devices have never had the error so far. Of about 30 devices, only 2-3 were reported back as defective. All 30 devices had Brown-Out Detection turned off.
At first start I calibrate everything, and this is the last time an EEPROM write happens. After the configuration there are only EEPROM reads.

After the device powers on there is a 2000 ms delay in the sketch (to wait for the power input to stabilise and for the capacitors to charge fully), and then the EEPROM is read. This works perfectly well in most cases. However, if it does go wrong, one of the variables gets the wrong value from memory, and this is not a one-time read error, because it returns the same wrong number after a restart. So in this case the wrong number is already in memory. If I reconfigure it, it is right again, so the memory itself is not physically damaged (too many writes, etc.); only the data stored in it is corrupted.

Could this be caused by the lack of brown-out detection? If the voltage drops too low, could the program do something stupid and start writing to memory for some reason?

The corrupted value is a pretty big number, around 23000. It's possible that it's actually a negative number, so it can't simply be displayed as it is.