strange array corruption

Hi, anyone got any ideas about this one? – I’ve got a array of characters that I’m using to create and display various error messages. It’s all very basic, I’m not doing rocket science here but it’s getting corrupted straight away.
Here’s the code that populates and displays the array….

  if (ret_val != RX_FINISHED_STR_RECV)
  {
    // Strip all the control characters from the output buffer, and use this to populate clean_buff...
    cleanBuffer();
    info[0]=0;   // reset string
    sprintf(info,"WaitResp. Expected <%s>",expected_resp_string);
    if (ret_val == RX_FINISHED_STR_NOT_RECV)
    {
      strcat(info,", got <");     // this is the bit that's getting corrupted
      strcat(info,clean_buff);  // add the actual error message
      strcat(info,">");
    }
    else if (ret_val == RX_TMOUT_ERR)
    {
      strcat(info,". Timeout");
    }
    Serial.println(info);                            // display message.
    loggerMiniGSM->writeLn(info);  // write message to log file (and echo to screen)

  }

And here’s what I see on the screen…..

WaitResp. Expected <>>,çgotñ<+CMS ERROR: 310>
33:24:27 WaitResp. Expected <>>,çgotñ<+CMS ERROR: 310>

As you can see the characters immediately before and after the word ‘got’ have somehow been corrupted. The second line (with the timestamp) is also written to the logfile. (I check the logfile, and those characters are also corrupted there) So whats going on?

One of the things I don’t get how the array can get corrupted so quickly, - the error message is formed, and then it’s displayed straight away, no other functions are getting called in the meantime so what’s causing the corruption?
Also, the ‘info’ array is constantly being reset (i.e. info[0]=0;) and constantly been used by other functions without problems, and yet the same corruption is always appearing in the same place.

Other points….
Info is declared as “char info[90+1];”
The error is repeatable.
The array ‘info’ is available to all functions in the class, each of the functions use ‘info’ to display the error message in a similar manner, but none of those functions is there any similar corruption.
The arduino application is cycling through the same code, but the corruption doesn’t appear straight away. In the above example, the corruption after ‘got’ appeared after about five minutes, and the corruption before it about 5 minutes later.
The corruption doesn’t get any worse – the above example is after 33 hours of running.
The whole application involves a dozen librarys and external hardware, so I can’t really post the whole thing.
It’s running on an Mega2560

The string ", got <" is stored in memory at compile time. Some time during execution of your program, something is overwriting parts of that memory space. Most likely this is a buffer overrun error somewhere. Tricky to try to find. You have to look at all of your array utilizations, and make sure that you are not exceeding the limits of any array anywhere.

Lima7:

.

.
.
    cleanBuffer();



I can’t really post the whole thing.

How about posting the code for cleanBuffer()?

Regards,

Dave

Thanks Paul, I know that buffer overflows are the traditional cause of problems like this, but buffers are linear aren't they? so how does a buffer overflow corrupt only the 2nd and 5th character, without effecting others?

My application uses lots of char arrays (which may explain why I'm having such trouble tracking this problem down) but they all contain only printable characters, if one off these buffers is overflowing into the memory allocated for ", got <" then wouldn't the corruption take the form of ascii characters?
Dave - cleanBuffer() takes the output from my SMS module and strips out all the non-printing characters. heres the code...

void MiniGSM::cleanBuffer()
{
  
  clean_buff[0]=0;  //empty the array
  
  int i=0;
  int y=0;
  while (comm_buf[i] != 0)  //check all characters until the end
  {
    if ((comm_buf[i] >= 32) && (comm_buf[i] <= 126))
    {
      clean_buff[y] = char(comm_buf[i]);
      y++;
    }
    i++;
  } 

  clean_buff[y] =0;  //stick a null terminator on the end
}

Thanks for your help guys

This code reproduces the corruption in the middle of a string:

char mybuf [90] = {0};

void setup () {
  Serial.begin(9600);
  
  // corrupt some literal strings in memory
  memcpy ((void *) 257, "\xE7", 1);
  memcpy ((void *) 261, "\xF1", 1);
  
}

void loop ()
{
  
  strcat (mybuf, ", got <");
  
  Serial.println (mybuf);


  Serial.println (", got <");
  Serial.println (", got <");
  Serial.println (", got <");

  while (true);
}

Output is:

,çgotñ<
,çgotñ<
,çgotñ<
,çgotñ<

Note that no char array has gone out of bounds. However if something else went wrong (eg. a byte array went negative, or something along those lines) then we see here that the string ", got <" has become permanently corrupted.

Obviously you won't have the egregious memcpy that I have there, but an array out of bounds, somewhere, could cause the same thing.

Lima7:
Thanks Paul, I know that buffer overflows are the traditional cause of problems like this, but buffers are linear aren't they? so how does a buffer overflow corrupt only the 2nd and 5th character, without effecting others?

Not necessarily. Consider:

byte foo [20];

foo [22] = 0xF1;
foo [25] = 0xE7;

There, corruption outside the bounds of foo, in a non-linear way.

Is a char array a contiguous series of bytes, or is it more a series of some sort of longer word length cells, each only paying attention to a bytes worth?

A contiguous series of bytes. To be pedantic, 8-bit bytes rather than the "byte" type.

Thus:

char foo [10];

occupies 10 bytes of memory. These are signed characters (that is, their values go from -128 to +127). For unsigned bytes you want:

byte foo [10];

The difference is relevant when you are trying to do things like compare if a particular element is greater than 127 or less than zero.

It's possible the compiler word-aligns the whole lot, I'm not sure about this one.

In other words:

char foo [9];
// possible gap here to align the next variable
char bar [7];

An array of "anything" occupies a number of contiguous memory locations. The memory location associated with address of an element of an array of "anything"s can hold the value of a variable of type "anything."

For numeric data types:
Standard C++ has built-in integer data types of char, short int, int and long int. There are built-in floating point data types float and double. For each of the integer data types, there is a corresponding unsigned variety. Variables of unsigned data types occupy the same amount of memory as their signed counterparts.

Additionally...
Arduino has a typedef named "byte" which is a C++ unsigned char. That is, the Arduino byte is not a new or different data type; it is just another name for a standard C++ unsigned char.

Now, the C++ standard does not require that ints, long ints, etc. have any specific lengths. For a given C++ compiler, the sizeof() operator returns how many bytes of memory will be occupied by a given variable of a given data type. The C++ standard does specify that sizeof(char) always returns 1. I think that most computer architectures these days use one byte of memory to store a char. They didn't have C or C++ back in the days we were running FORTRAN programs on CDC6600 computers (which had a memory granularity of 60 bits, not today's puny eight bits).

The C++ standard also specifies that the size of a short int is greater than or equal to the size of a char, the size of an int is greater than or equal to the size of a short int, and the size of a long int is greater than or equal to the size of an int.

For avr-gcc, which is Arduino's compiler, sizeof(int) returns a value of 2 and sizeof(long) returns a value of 4. So---an int takes two bytes and a long int takes four bytes.

If you have an array of, say, 10 ints, then with avr-gcc, the array takes up 20 bytes. For compilers that you are likely to run across on your desktop or laptop workstation these days, sizeof(int) will probably return a value of 4 (but that's not guaranteed). For example, on my Linux computer, an array of 10 ints takes up 40 bytes of memory.

Bottom line: No matter what kind of C++ array you have, the number of elements in the array can be calculated (at compile time) by dividing the sizeof the array by the sizeof an element of the array. The code that calculates the number of elements in an array of a given data type is portable to other compilers even if the sizes of the particular data types are different.

   number_of_elements_in_array = sizeof(array) / sizeof(array[0]);

Regards,

Dave

Footnote:
The current Standard C++ language definition does not allow variable-length arrays. However, many C++ compilers do allow this (they got it from the current ISO C language standard). If a compiler does allow this, then evaluations of expressions containing the sizeof operator for a variable-length array are done at run time, not compile time. Since I am often interested in portability to other compilers, I usually eschew non-standard language features for routine calculations, so I rarely use variable-length arrays. But that's just me. I'm funny that way.

It's possible the compiler word-aligns the whole lot

Nope.

It seems a heck of a coincidence that it's only the two spaces that are corrupted.

Try moving the array somewhere else (or change the size of the preceeding array) and see if the error is the same or pops up elsewhere.


Rob

They are four bytes apart, which is interesting. Also interesting is that if it corrupts every 4th byte, then that is now outside that particular literal. So a runaway nearby array might be saving values at 4, 8, 12, 16 bytes past its end, for example.

I would be looking at the values (0xF1 and 0xE7) (241 and 231). Are they something you might store in a byte array somewhere? Maybe you are counting how many times certain transactions arrive (eg. 241 lots of transaction type 200). If so, is it possible that indexing into a nearby array, by transaction type, could exceed its boundaries?

Also be cautious how you are indexing into an array.

Notice this example:

byte counts [256];

void setup () {
Serial.begin(9600);
}

void loop () {
  for (int i = 0; i < sizeof (counts); i++)
    counts [i] = 0;
    
  char x = 200;
  
  counts [x] = counts [x] + 1;
 
  Serial.print ("count is ");
  Serial.println (counts [200], DEC);
  
  while (true);
  
}

Output is: "count is 0"

Now make a minor change:

byte counts [256];

void setup () {
Serial.begin(9600);
}

void loop () {
  for (int i = 0; i < sizeof (counts); i++)
    counts [i] = 0;
    
  byte x = 200;   // byte is unsigned
  
  counts [x] = counts [x] + 1;
 
  Serial.print ("count is ");
  Serial.println (counts [200], DEC);
  
  while (true);
  
}

Output is: "count is 1"

So, indexing into an array with a char as the index (eg. if you are counting how many times a particular byte arrives) can give you negative indices, thus corrupting nearby memory.

By the way, this is why I was asking in another thread about changing compiler switches. Take this example program:

unsigned char counts [256];
char x = 200;

int main ()
  {
  counts [x]++;
  return 0;
  }

Compiled with default warnings, this gives no errors nor warnings. But compiled with -Wall you get:

test.cpp:6: warning: array subscript has type ‘char’

A little hint there that things might go wrong.

Thanks guys, lots of interesting stuff here, much of it is raising more questions.

The application I’m having trouble with involves sending and receiving SMS’s, thus I’m doing a lot of fiddling about with strings. I’m using an SMS shield from the HW kitchen, and my code is based a much-modified version of the SMS library they provide. The library involves sending text commands to the SMS module, and processing the text that comes back, I was curious from day one as to why the incoming data from the SMS module is held in a byte array, and not a char array, to me it seemed to me more sensible to hold and process ascii text in character array, is this not the case?

If I’m going to be processing text (in the form of SMS commands) should I be storing that text in byte arrays, or char arrays? The basic string manipulation commands ( strpy(), strcmp(), strcat() ) seem to work with either, or am I playing with fire here?

Several of the above post have shown how isolated elements within an array can be corrupted by explicitly setting an array element outside of the bounds of that array (i.e. trying to set the 12th element in a 10 element array). I assume this sort of corruption can’t occur from the normal string manipulation commands (ie strcat() and strcpy() ) because even if you do attempt to move/copy a 20 element array into a 10 element array, then the corruption has to be linear – by this I mean the first 10 elements will be handled correctly, and the next 10 memory locations will be corrupted in order. Is this right?

Cheers

Mike

by this I mean the first 10 elements will be handled correctly, and the next 10 memory locations will be corrupted in order. Is this right?

I think so, none of the standard strxxx funcs will be that selective.

Which means it's your code :slight_smile:

Just doing some thinking out aloud here in the hope that something will click.

    sprintf(info,"WaitResp. Expected <%s>",expected_resp_string);

results in

WaitResp. Expected <>

so expected_resp_string is NULL, is that expected/reasonable?

then

strcat(info,", got <");

eventually shows

,çgotñ<

where does the > come from?

Still no obvious reason to me.

Have you tried moving info to see if the error moves with it?


Rob

Lima7:
I assume this sort of corruption can’t occur from the normal string manipulation commands (ie strcat() and strcpy() ) because even if you do attempt to move/copy a 20 element array into a 10 element array, then the corruption has to be linear – by this I mean the first 10 elements will be handled correctly, and the next 10 memory locations will be corrupted in order. Is this right?

In principle, yes.

But my point was that the corruption is not necessarily caused by string manipulation commands. Any assignment to an array, particularly one that goes out of bounds, could cause it. The very fact that a single byte (well, two separate single bytes) seem to have been corrupted suggests it isn't a string manipulation (which generally would affect at least two bytes, if you include the terminating 0x00 byte), but rather an assignment to a char or byte array, or something similar.

thanks Rob, I think you're miss-quoting. expected_resp_string is actually ">" and the result is...

WaitResp. Expected <>>

that also explains the extra ">".

My application is in a state of flux, info has moved, and I'm no longer using clean_buff - but the corruption is still the same. It's nice to know there's some things you can rely on!

Any thoughts on wether I should be using char or byte arrays as per my last post?

Regards

OK, it's not obvious from the code we've seen why expected_resp_string is ">", and on the surface that doesn't make a lot of sense, but if it is it is :slight_smile:


Rob

The ">" is basically the 'cursor' - it's the response back from the SMS module that you expect to see to indicate the previous command was processed. Sorry, should have made it clearer.

Any thoughts on wether I should be using char or byte arrays as per my last post?

Signed char arrays (the default type, if you don't specify unsigned) hold values in the range -128 to 127. What purpose a -54 serves is beyond me. All the characters in the english language have ASCII values between 0 and 127, so signed char is fine for holding them. For other languages, the values between 128 and 255 ARE used, so the array type needs to change to byte or unsigned char.

The reason that the string functions don't care whether you use char or byte is because the values of interest in the arrays are all 8 bit values, and all positive. That the form of one type limits the positive values it can hold to no higher than 127 is of no concern to the string functions.