Pages: [1] 2   Go Down
Author Topic: strange array corruption  (Read 2723 times)
0 Members and 1 Guest are viewing this topic.
UK
Offline Offline
Jr. Member
**
Karma: 2
Posts: 90
It was like it when I found it
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hi, anyone got any ideas about this one? – I’ve got a array of characters that I’m using to create and display various error messages. It’s all very basic, I’m not doing rocket science here but it’s getting corrupted straight away.
Here’s the code that populates and displays the array….
Code:
  if (ret_val != RX_FINISHED_STR_RECV)
  {
    // Strip all the control characters from the output buffer, and use this to populate clean_buff...
    cleanBuffer();
    info[0]=0;   // reset string
    sprintf(info,"WaitResp. Expected <%s>",expected_resp_string);
    if (ret_val == RX_FINISHED_STR_NOT_RECV)
    {
      strcat(info,", got <");     // this is the bit that's getting corrupted
      strcat(info,clean_buff);  // add the actual error message
      strcat(info,">");
    }
    else if (ret_val == RX_TMOUT_ERR)
    {
      strcat(info,". Timeout");
    }
    Serial.println(info);                            // display message.
    loggerMiniGSM->writeLn(info);  // write message to log file (and echo to screen)

  }

And here’s what I see on the screen…..
Quote
WaitResp. Expected <>>,çgotñ<+CMS ERROR: 310>
33:24:27 WaitResp. Expected <>>,çgotñ<+CMS ERROR: 310>

As you can see the characters immediately before and after the word ‘got’ have somehow been corrupted. The second line (with the timestamp) is also written to the logfile. (I check the logfile, and those characters are also corrupted there) So whats going on?

One of the things I don’t get how the array can get corrupted so quickly, - the error message is formed, and then it’s displayed straight away, no other functions are getting called in the meantime so what’s causing the corruption?
Also, the ‘info’ array is constantly being reset (i.e. info[0]=0;) and constantly been used by other functions without problems, and yet the same corruption is always appearing in the same place.

Other points….
Info is declared as “char info[90+1];”
The error is repeatable.
The array ‘info’ is available to all functions in the class, each of the functions use ‘info’ to display the error message in a similar manner, but none of those functions is there any similar corruption.
The arduino application is cycling through the same code, but the corruption doesn’t appear straight away. In the above example, the corruption after ‘got’ appeared after about five minutes, and the corruption before it about 5 minutes later.
The corruption doesn’t get any worse – the above example is after 33 hours of running.
The whole application involves a dozen librarys and external hardware, so I can’t really post the whole thing.
It’s running on an Mega2560

Logged

Seattle, WA USA
Offline Offline
Brattain Member
*****
Karma: 614
Posts: 49372
Seattle, WA USA
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

The string ", got <" is stored in memory at compile time. Some time during execution of your program, something is overwriting parts of that memory space. Most likely this is a buffer overrun error somewhere. Tricky to try to find. You have to look at all of your array utilizations, and make sure that you are not exceeding the limits of any array anywhere.
Logged

Left Coast, USA
Offline Offline
Sr. Member
****
Karma: 7
Posts: 499
Sometimes I just can't help myself.
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset


Code:
.
.
.
    cleanBuffer();

I can’t really post the whole thing.
How about posting the code for cleanBuffer()?

Regards,

Dave
Logged

UK
Offline Offline
Jr. Member
**
Karma: 2
Posts: 90
It was like it when I found it
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Thanks Paul, I know that buffer overflows are the traditional cause of problems like this, but buffers are linear aren't they? so how does a buffer overflow corrupt only the 2nd and 5th character, without effecting others?

My application uses lots of char arrays (which may explain why I'm having such trouble tracking this problem down) but they all contain only printable characters, if one off these buffers is overflowing into the memory allocated for ", got <" then wouldn't the corruption take the form of ascii characters?
Dave - cleanBuffer() takes the output from my SMS module and strips out all the non-printing characters. heres the code...

Code:
void MiniGSM::cleanBuffer()
{
 
  clean_buff[0]=0;  //empty the array
 
  int i=0;
  int y=0;
  while (comm_buf[i] != 0)  //check all characters until the end
  {
    if ((comm_buf[i] >= 32) && (comm_buf[i] <= 126))
    {
      clean_buff[y] = char(comm_buf[i]);
      y++;
    }
    i++;
  }

  clean_buff[y] =0;  //stick a null terminator on the end
}
Thanks for your help guys
Logged

Global Moderator
Offline Offline
Brattain Member
*****
Karma: 485
Posts: 18806
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

This code reproduces the corruption in the middle of a string:

Code:
char mybuf [90] = {0};

void setup () {
  Serial.begin(9600);
 
  // corrupt some literal strings in memory
  memcpy ((void *) 257, "\xE7", 1);
  memcpy ((void *) 261, "\xF1", 1);
 
}

void loop ()
{
 
  strcat (mybuf, ", got <");
 
  Serial.println (mybuf);


  Serial.println (", got <");
  Serial.println (", got <");
  Serial.println (", got <");

  while (true);
}

Output is:

Code:
,çgotñ<
,çgotñ<
,çgotñ<
,çgotñ<

Note that no char array has gone out of bounds. However if something else went wrong (eg. a byte array went negative, or something along those lines) then we see here that the string ", got <" has become permanently corrupted.

Obviously you won't have the egregious memcpy that I have there, but an array out of bounds, somewhere, could cause the same thing.
Logged


Global Moderator
Offline Offline
Brattain Member
*****
Karma: 485
Posts: 18806
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Thanks Paul, I know that buffer overflows are the traditional cause of problems like this, but buffers are linear aren't they? so how does a buffer overflow corrupt only the 2nd and 5th character, without effecting others?

Not necessarily. Consider:

Code:
byte foo [20];

foo [22] = 0xF1;
foo [25] = 0xE7;

There, corruption outside the bounds of foo, in a non-linear way.
Logged


London, GB
Offline Offline
Sr. Member
****
Karma: 8
Posts: 332
Nothing works.
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Is a char array a contiguous series of bytes, or is it more a series of some sort of longer word length cells, each only paying attention to a bytes worth?
Logged

Global Moderator
Offline Offline
Brattain Member
*****
Karma: 485
Posts: 18806
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

A contiguous series of bytes. To be pedantic, 8-bit bytes rather than the "byte" type.

Thus:

Code:
char foo [10];

occupies 10 bytes of memory. These are signed characters (that is, their values go from -128 to +127). For unsigned bytes you want:

Code:
byte foo [10];

The difference is relevant when you are trying to do things like compare if a particular element is greater than 127 or less than zero.

It's possible the compiler word-aligns the whole lot, I'm not sure about this one.

In other words:

Code:
char foo [9];
// possible gap here to align the next variable
char bar [7];
Logged


Left Coast, USA
Offline Offline
Sr. Member
****
Karma: 7
Posts: 499
Sometimes I just can't help myself.
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

An array of "anything" occupies a number of contiguous memory locations. The memory location associated with address of an element of an array of "anything"s can hold the value of a variable of type "anything."

For numeric data types:
Standard C++ has built-in integer data types of char, short int, int and long int.  There are built-in floating point data types  float and double.  For each of the integer data types, there is a corresponding unsigned variety.  Variables of unsigned data types occupy the same amount of memory as their signed counterparts.

Additionally...
Arduino has a typedef named "byte" which is a C++ unsigned char.  That is, the Arduino byte is not a new or different data type; it is just another name for a standard C++ unsigned char.

Now, the C++ standard does not require that ints, long ints, etc. have any specific lengths.  For a given C++ compiler, the sizeof() operator returns how many bytes of memory will be occupied by a given variable of a given data type.  The C++ standard does specify that sizeof(char) always returns 1.  I think that most computer architectures these days use one byte of memory to store a char.  They didn't have C or C++ back in the days we were running FORTRAN programs on CDC6600 computers (which had a memory granularity of 60 bits, not today's puny eight bits).

The C++ standard also specifies that the size of a short int is greater than or equal to the size of a char, the size of an int is greater than or equal to the size of a short int, and the size of a long int is greater than or equal to the size of an int.

For avr-gcc, which is Arduino's compiler, sizeof(int) returns a value of 2 and sizeof(long) returns a value of 4.  So---an int takes two bytes and a long int takes four bytes.

If you have an array of, say, 10 ints, then with avr-gcc, the array takes up 20 bytes.  For compilers that you are likely to run across on your desktop or laptop workstation these days, sizeof(int) will probably return a value of 4 (but that's not guaranteed).  For example, on my Linux computer, an array of 10 ints takes up 40 bytes of memory.

Bottom line: No matter what kind of C++ array you have, the number of elements in the array can be calculated (at compile time) by dividing the sizeof the array by the sizeof an element of the array.  The code that calculates the number of elements in an array of a given data type is portable to other compilers even if the sizes of the particular data types are different.

Code:
  number_of_elements_in_array = sizeof(array) / sizeof(array[0]);


Regards,

Dave


Footnote:
The current Standard C++ language definition does not allow variable-length arrays.  However, many C++ compilers do allow this (they got it from the current ISO C language standard).  If a compiler does allow this, then evaluations of expressions containing the sizeof operator for a variable-length array are done at run time, not compile time.  Since I am often interested in portability to other compilers, I usually eschew non-standard language features for routine calculations, so I rarely use variable-length arrays.  But that's just me.  I'm funny that way.
« Last Edit: February 04, 2011, 05:27:07 pm by davekw7x » Logged

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 207
Posts: 12928
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
It's possible the compiler word-aligns the whole lot

Nope.
Logged

nr Bundaberg, Australia
Offline Offline
Tesla Member
***
Karma: 128
Posts: 8524
Scattered showers my arse -- Noah, 2348BC.
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

It seems a heck of a coincidence that it's only the two spaces that are corrupted.

Try moving the array somewhere else (or change the size of the preceeding array) and see if the error is the same or pops up elsewhere.


______
Rob
Logged

Rob Gray aka the GRAYnomad www.robgray.com

Global Moderator
Offline Offline
Brattain Member
*****
Karma: 485
Posts: 18806
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

They are four bytes apart, which is interesting. Also interesting is that if it corrupts every 4th byte, then that is now outside that particular literal. So a runaway nearby array might be saving values at 4, 8, 12, 16 bytes past its end, for example.

I would be looking at the values (0xF1 and 0xE7) (241 and 231). Are they something you might store in a byte array somewhere? Maybe you are counting how many times certain transactions arrive (eg. 241 lots of transaction type 200). If so, is it possible that indexing into a nearby array, by transaction type, could exceed its boundaries?

Also be cautious how you are indexing into an array.

Notice this example:

Code:
byte counts [256];

void setup () {
Serial.begin(9600);
}

void loop () {
  for (int i = 0; i < sizeof (counts); i++)
    counts [i] = 0;
   
  char x = 200;
 
  counts [x] = counts [x] + 1;
 
  Serial.print ("count is ");
  Serial.println (counts [200], DEC);
 
  while (true);
 
}

Output is: "count is 0"

Now make a minor change:

Code:
byte counts [256];

void setup () {
Serial.begin(9600);
}

void loop () {
  for (int i = 0; i < sizeof (counts); i++)
    counts [i] = 0;
   
  byte x = 200;   // byte is unsigned
 
  counts [x] = counts [x] + 1;
 
  Serial.print ("count is ");
  Serial.println (counts [200], DEC);
 
  while (true);
 
}

Output is: "count is 1"

So, indexing into an array with a char as the index (eg. if you are counting how many times a particular byte arrives) can give you negative indices, thus corrupting nearby memory.
Logged


Global Moderator
Offline Offline
Brattain Member
*****
Karma: 485
Posts: 18806
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

By the way, this is why I was asking in another thread about changing compiler switches. Take this example program:

Code:
unsigned char counts [256];
char x = 200;

int main ()
  {
  counts [x]++;
  return 0;
  }

Compiled with default warnings, this gives no errors nor warnings. But compiled with -Wall you get:

Code:
test.cpp:6: warning: array subscript has type ‘char’

A little hint there that things might go wrong.
Logged


UK
Offline Offline
Jr. Member
**
Karma: 2
Posts: 90
It was like it when I found it
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Thanks guys, lots of interesting stuff here, much of it is raising more questions.

The application I’m having trouble with involves sending and receiving SMS’s, thus I’m doing a lot of fiddling about with strings. I’m using an SMS shield from the HW kitchen, and my code is based a much-modified version of the SMS library they provide. The library involves sending text commands to the SMS module, and processing the text that comes back, I was curious from day one as to why the incoming data from the SMS module is held in a byte array, and not a char array, to me it seemed to me more sensible to hold and process ascii text in character array, is this not the case?

If I’m going to be processing text (in the form of SMS commands) should I be storing that text in byte arrays, or char arrays? The basic string manipulation commands ( strpy(), strcmp(), strcat() ) seem to work with either, or am I playing with fire here?

Several of the above post have shown how isolated elements within an array can be corrupted by explicitly setting an array element outside of the bounds of that array (i.e. trying to set the 12th element in a 10 element array). I assume this sort of corruption can’t occur from the normal string manipulation commands (ie strcat() and strcpy() ) because even if you do attempt to move/copy a 20 element array into a 10 element array, then the corruption has to be linear – by this I mean the first 10 elements will be handled correctly, and the next 10 memory locations will be corrupted in order. Is this right?

Cheers

Mike
Logged

nr Bundaberg, Australia
Offline Offline
Tesla Member
***
Karma: 128
Posts: 8524
Scattered showers my arse -- Noah, 2348BC.
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
by this I mean the first 10 elements will be handled correctly, and the next 10 memory locations will be corrupted in order. Is this right?
I think so, none of the standard strxxx funcs will be that selective.

Which means it's your code smiley

Just doing some thinking out aloud here in the hope that something will click.

Code:
    sprintf(info,"WaitResp. Expected <%s>",expected_resp_string);
results in

Quote
WaitResp. Expected <>

so expected_resp_string is NULL, is that expected/reasonable?

then

Code:
strcat(info,", got <");

eventually shows

Quote
>,çgotñ<

where does the > come from?

Still no obvious reason to me.

Have you tried moving info to see if the error moves with it?

______
Rob

Logged

Rob Gray aka the GRAYnomad www.robgray.com

Pages: [1] 2   Go Up
Jump to: