Unrolling a loop stops it crashing!

I have a very strange situation.

I have this code:

for (int b = 0; b <= CHANNEL_COUNT; ++b) {
    deleteChannel(b);
}

deleteChannel is declared as a private function within the class as follows:

void RemoteSign::deleteChannel(int ch)

CHANNEL_COUNT is a const int equal to 9.

On an ESP8266 the code crashes after the for loop. The functionality in deleteChannel() is carried out correctly.

If I replace the loop with this code:

 deleteChannel(0);
 deleteChannel(1);
 deleteChannel(2);
 deleteChannel(3);
 deleteChannel(4);
 deleteChannel(5);
 deleteChannel(6);
 deleteChannel(7);
 deleteChannel(8);
 deleteChannel(9);

Then there is no crash after doing all 10 calls.

Obviously I don't want to hard code the number of channels.

Any clues as to the possible cause?

Here is the stack trace:
18:37:04.408 -> --------------- CUT HERE FOR EXCEPTION DECODER ---------------
18:37:04.455 ->
18:37:04.455 -> Exception (29):
18:37:04.455 -> epc1=0x4000e143 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000001 depc=0x00000000
18:37:04.455 ->
18:37:04.455 -> >>>stack>>>
18:37:04.455 ->
18:37:04.455 -> ctx: cont
18:37:04.455 -> sp: 3fff26d0 end: 3fff2aa0 offset: 0190
18:37:04.455 -> 3fff2860: 00000006 00000005 3fff28c8 40231d28
18:37:04.455 -> 3fff2870: 3fff28b0 00000006 3fff0238 40231d28
18:37:04.455 -> 3fff2880: 0000000b 3fff0238 3fff0238 40231fa9
18:37:04.455 -> 3fff2890: 3ffe8a12 3fff28bc 0000000a 3ffef6dc
18:37:04.455 -> 3fff28a0: 0000000b 3fff0238 0000000a 40221f4a
18:37:04.455 -> 3fff28b0: 312f6800 00000030 80000009 00003000
18:37:04.455 -> 3fff28c0: 00000001 80000009 312f6863 40230030
18:37:04.501 -> 3fff28d0: 85ff29d0 3fff2924 3ffef6a0 3ffef6dc
18:37:04.501 -> 3fff28e0: 0000000b 0000000b 3ffef6a0 40224ae6
18:37:04.501 -> 3fff28f0: 00418937 00cb4c4a 00000000 0000000a
18:37:04.501 -> 3fff2900: 45521100 41544553 80004c4c 40100d3c
18:37:04.501 -> 3fff2910: 4000050c 00000030 00000010 00000000
18:37:04.501 -> 3fff2920: 3fff154c 45520000 41544553 80004c4c
18:37:04.501 -> 3fff2930: 40228518 3fff3604 00000001 43b35eac
18:37:04.501 -> 3fff2940: 00000000 00000000 3fff3b00 0010001f
18:37:04.501 -> 3fff2950: 80ffffff 3fffc6fc 00000001 0000000a
18:37:04.501 -> 3fff2960: 45534552 4c4c4154 88ff2a00 40231bf2
18:37:04.549 -> 3fff2970: 392f6863 3ffe8500 84ff0848 00000000
18:37:04.549 -> 3fff2980: 3fff35b4 00000010 3fff2a30 40231bf2
18:37:04.549 -> 3fff2990: 40233024 3ff00014 00000000 3fff0848
18:37:04.549 -> 3fff29a0: 0000000b 00000000 00000020 40100d07
18:37:04.549 -> 3fff29b0: 00000000 00000000 4bc6a7f0 3fff0848
18:37:04.549 -> 3fff29c0: 3fff35b4 0000000f 3fff2a30 40231cdf
18:37:04.549 -> 3fff29d0: 45521139 41544553 8a004c4c 4843447b
18:37:04.549 -> 3fff29e0: 003c007d 8500000f 3fff2a30 40231d28
18:37:04.549 -> 3fff29f0: 3fff2a30 00000000 00000000 40231f0c
18:37:04.549 -> 3fff2a00: 00000000 00000000 3fff2a30 40231f3c
18:37:04.549 -> 3fff2a10: 3fff2a48 3ffef684 3fff2a48 3fff0848
18:37:04.596 -> 3fff2a20: 3ffe8504 3ffef6a0 3ffef684 40226911
18:37:04.596 -> 3fff2a30: 3fff35cc 000f000f 00001388 4022685a
18:37:04.596 -> 3fff2a40: 007a1200 43bd3189 3fff35b4 000f000f
18:37:04.596 -> 3fff2a50: 0a005345 80efeffe 2d505300 43393830
18:37:04.596 -> 3fff2a60: 00000000 00000000 00000001 3fff1400
18:37:04.596 -> 3fff2a70: 3fffdad0 00000000 3fff13c0 40226a98
18:37:04.596 -> 3fff2a80: feefeffe feefeffe 3fff13c0 40233038
18:37:04.596 -> 3fff2a90: feefeffe feefeffe 3ffe8558 40100fcd
18:37:04.596 -> <<<stack<<<
18:37:04.596 ->

and the exception decoder results:

Exception 29: StoreProhibited: A store referenced a page mapped with an attribute that does not permit stores
PC: 0x4000e143
EXCVADDR: 0x00000001

Decoding stack results
0x40231d28: String::copy(char const*, unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 214
0x40231d28: String::copy(char const*, unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 214
0x40231fa9: String::operator=(char const*) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 285
0x40221f4a: RemoteSign::deleteChannel(int) at C:\ard\common/RemoteSign.cpp line 3421
0x40230030: EspClass::getSketchSize() at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\Esp.cpp line 582
0x40224ae6: RemoteSign::handleRS(String) at C:\ard\common/RemoteSign.cpp line 2086
0x40100d3c: malloc(size_t) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\umm_malloc\umm_malloc.cpp line 552
0x40228518: WiFiClient::operator=(WiFiClient const&) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\libraries\ESP8266WiFi\src\WiFiClient.cpp line 117
0x40231bf2: String::changeBuffer(unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 187
0x40231bf2: String::changeBuffer(unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 187
0x40233024: loop_wrapper() at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\core_esp8266_main.cpp line 192
0x40100d07: free(void*) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\umm_malloc\umm_malloc.cpp line 398
0x40231cdf: String::reserve(unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 146
0x40231d28: String::copy(char const*, unsigned int) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 214
0x40231f0c: String::operator=(String const&) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 262
0x40231f3c: String::String(String const&) at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\WString.cpp line 41
0x40226911: RemoteSign::run() at C:\ard\common/RemoteSign.cpp line 2833
0x4022685a: setup() at C:\Ard\GenericRS\GenericRS/GenericRS.ino line 43
0x40226a98: loop() at C:\Ard\GenericRS\GenericRS/GenericRS.ino line 47
0x40233038: loop_wrapper() at C:\Users\ms\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.7.4\cores\esp8266\core_esp8266_main.cpp line 197

Try

for (int b = 0; b < CHANNEL_COUNT; b++) {
    deleteChannel(b);
}

which iterates 0-8, or

for (int b = 0; b <= CHANNEL_COUNT; b++) {
    deleteChannel(b);
}

which iterates 0-9.

If you have a loop that does that again, try printing the control variable to verify that you are generating the correct values first.
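A minimal desktop sketch of that check, assuming CHANNEL_COUNT is 9 as in the first post and using printf where the ESP8266 would use Serial.println(). traceLoop() is a hypothetical helper that records every value the control variable takes for the `<` and `<=` forms, instead of calling deleteChannel():

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Assumed value, taken from the first post.
const int CHANNEL_COUNT = 9;

// Hypothetical helper: runs the loop with either '<' or '<=' and records
// (and prints) each value of the control variable instead of deleting anything.
std::vector<int> traceLoop(bool inclusive) {
    std::vector<int> seen;
    for (int b = 0; inclusive ? (b <= CHANNEL_COUNT) : (b < CHANNEL_COUNT); ++b) {
        std::printf("b = %d\n", b);  // Serial.println(b) on the ESP8266
        seen.push_back(b);
    }
    return seen;
}
```

traceLoop(false) should print 0 through 8 and traceLoop(true) 0 through 9, the two ranges quoted in the reply.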

I think your first attempt generates 1-10

No, the ++b, if that’s what you're grabbing at, happens at the right time; it doesn’t matter here.

Or why do you think it starts at 1?
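To illustrate the point about ++b: in a for statement the increment expression runs after each pass through the body and its value is discarded, so pre-increment and post-increment visit the same indexes, starting at 0. The two helpers below are hypothetical, written only to compare the sequences on a desktop compiler:

```cpp
#include <cassert>
#include <vector>

// Values seen by the loop body when the increment is written ++b.
std::vector<int> withPreIncrement(int last) {
    std::vector<int> seen;
    for (int b = 0; b <= last; ++b) {
        seen.push_back(b);
    }
    return seen;
}

// Same loop written with b++; the body sees exactly the same values.
std::vector<int> withPostIncrement(int last) {
    std::vector<int> seen;
    for (int b = 0; b <= last; b++) {
        seen.push_back(b);
    }
    return seen;
}
```

With last = 9 both return 0 through 9, so the choice of ++b over b++ cannot explain the crash.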

a7


@DaleSchultz Please try to make a complete and runnable example demonstrating this.

There is something else going on somewhere in your code, and the problem is not what you think.

I’ve had cases where inserting a comment changes things - once you do something wrong, even slightly wrongish, spurious clues can start coming out.

a7

Right. An MRE!!!

Yeah, of course... the class file has 3441 lines at present...

something somewhere.....

That's the problem.

There is nothing obviously wrong with your snippet.

There is obviously nothing wrong with your snippet.

The problem is not in what you have posted, but what you have omitted.

I can't (won't) tell you how many times the act of preparing an MRE has uncovered a problem I had overlooked. It has saved me the potential embarrassment of having a dumb error revealed, and saved the esteemed forum participants, who give of their time and experience, from having to bother.

So, same story as always: post your code. All of it, or prove your point by reducing it to the smallest example you can that does the same thing.

There have been many very crisply presented MREs in these fora. Usually, but not 100 percent of the time, it turns out to be a misunderstanding, you know on whose part… Sometimes it's something truly subtle, sometimes just dumb.

Let us help; we like doing it.

a7


Um... You have 9 channels and you are deleting all 10 of them?!?

The OP's unrolled version does delete 10 channels, so either

CHANNEL_COUNT is badly named

OR

there are only 9 channels, and something about deleting the last (non-existent) channel fails when done in the loop, while it does not (for whatever reason) when just called.
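A sketch of why the loop and the unrolled calls can behave differently in that second case: if the table really has only 9 slots, index 9 is past the end, and writing there through channels[ch] is undefined behaviour that may corrupt something unrelated, or appear to work. The model is an assumption (a std::array of 9 ints standing in for the real channel data, tryDeleteChannel() a hypothetical helper); std::array::at() makes the bad index visible by throwing std::out_of_range. It is written for a desktop compiler; on the ESP8266, whether C++ exceptions are available depends on the build settings.

```cpp
#include <array>
#include <cassert>
#include <stdexcept>

// Hypothetical model: a table sized by CHANNEL_COUNT holds indexes 0-8 only.
constexpr int CHANNEL_COUNT = 9;
std::array<int, CHANNEL_COUNT> channels{};

// Returns true if "deleting" channel ch stays inside the array.
// at() checks the index; channels[ch] would silently write past the
// end (undefined behaviour) when ch == CHANNEL_COUNT.
bool tryDeleteChannel(int ch) {
    try {
        channels.at(ch) = 0;  // in this sketch, "delete" just clears the slot
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

Calling tryDeleteChannel(9) returns false here, which is exactly the call a `b <= CHANNEL_COUNT` loop makes on its last iteration.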

@DaleSchultz how many channels you got, sir?

a7

That backtrace suggests a write through a bad pointer, maybe a NULL pointer: EXCVADDR=0x00000001 is the address the failed store targeted, one byte past NULL.

Sometimes Heisenbugs like this are a symptom of having run out of RAM, so perhaps instrument your memory usage (is there a FreeMemory library for ESP?).


There are 10 channels: elements zero through nine in the array.

Yes, the const would be better named UPPER_CHANNELS.

The name reflects the channel numbers that users can see, as I don't show end users channel zero as it confuses people (even programmers).

Sometimes I even 'waste' element zero of such arrays so that I don't have to add or subtract 1 every time I display a channel number or convert a 'user' number to the array index. Forgetting to do that can lead to nasty bugs.
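A minimal sketch of that "waste element zero" layout, with hypothetical names (UPPER_CHANNELS as suggested above; channelState, setUserChannel() and getUserChannel() are invented for the example): the array gets one extra slot so that user channel n is simply index n, and no +1/-1 conversion is ever needed.

```cpp
#include <cassert>

// Highest channel number users see (assumed: channels 1-9 are shown).
constexpr int UPPER_CHANNELS = 9;

// One extra slot so user channel n maps directly to index n.
// Element 0 is deliberately left unused.
int channelState[UPPER_CHANNELS + 1] = {};

void setUserChannel(int userChannel, int value) {
    assert(userChannel >= 1 && userChannel <= UPPER_CHANNELS);
    channelState[userChannel] = value;  // no "userChannel - 1" to forget
}

int getUserChannel(int userChannel) {
    assert(userChannel >= 1 && userChannel <= UPPER_CHANNELS);
    return channelState[userChannel];
}
```

The cost is one unused element; the benefit is that a display number and an array index are always the same value.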

Thanks, that is useful input. And yes, I have been increasing the memory usage and had wondered about memory. I had looked at the IDE compilation report:

Sketch uses 507020 bytes (48%) of program storage space. Maximum is 1044464 bytes.
Global variables use 45080 bytes (55%) of dynamic memory, leaving 36840 bytes for local variables. Maximum is 81920 bytes.

I took that to mean that I was using only 55% of the memory. Am I correct in understanding that all variables, including arrays, are statically allocated and are therefore included in that 45K bytes?

You are correct, it is not this code.
And I agree, indeed, I fully understand the value of creating an MRE.

In this case, I can normally only reach these lines of code once the device has used wifi manager to get online, loaded itself (which depends on a screen graphics library), opened its own TCP connection, accepted a TCP/IP client, and received and parsed a command over an API from that client.

I think that somewhere in that stack something is writing to memory that should not be used. It is most likely some of my code, but it could be in one of the libraries I am using too. I don't typically use pointers directly so I am usually not guilty of using null pointers or forgetting to dereference one.

I can predict that as soon as I start discarding functionality (by, say, not going online, hard-coding a command locally, etc.) the problem will appear to have gone away, because the loop will run just like we think it should. Just moving the loop code to another physical address may hide the fact that something is overwriting memory and breaking the loop code that uses that memory.

I have discovered another location where, in the constructor of my class, I store the value 255 in all elements of an array (indexes 0-16), and element 16 gets changed to a value of 32 by the next line of my code. If I extend the array size to 17, then elements 16 and 17 stay intact.

To me this suggest some sort of memory management issue. Some code is writing to memory it does not own. In one case a variable, and in another, where my loop is trying to execute. It could be two different bugs.

In the vain hope that the bug lies in one of the libraries, I might check whether any have an update and whether a newer version resolves the issue. There are two problems with that: first, a new version may simply push code to a different address and hide the fact that memory is getting messed up; second, I usually prefer to change library versions only when things are working well, so I can detect regressions. I can also stay away from issuing the command that leads to that loop code.

I may also see what sort of memory tools can shed light on what is going on.

Certainly the exception trace mentions WiFiClient, a memory allocation (malloc) and lots of String copy operations, and I am doing a lot of String operations, but I don't really know what all that means.

Only global and static variables and string literals are counted. Local variables and dynamically allocated space (contents of String variables, and memory allocated with 'malloc()' or 'new') are not counted but will use parts of the remaining memory at run time.
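To make the distinction concrete, here is a desktop C++ sketch that uses std::string as a stand-in for the Arduino String class (an assumption; the two differ in details such as how short strings are stored). The size of the global object is fixed at compile time and is the kind of thing a "Global variables use ..." figure counts. The characters assigned at run time live on the heap and never show up in that report. label and growLabel() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Global object: only sizeof(label) is known at compile time.
std::string label;

// Fills the global at run time. The n characters are stored in a heap
// buffer, which a compile-time memory report cannot account for.
std::size_t growLabel(std::size_t n) {
    label.assign(n, 'x');
    return label.capacity();
}
```

After growLabel(1000) the string holds 1000 characters, yet sizeof(label) is unchanged, so a sketch can report plenty of free "dynamic memory" at compile time and still run short at run time.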


Well, I think I found the problem!

I switch between C++ and another programming language where one declares arrays by the upper index rather than the element count. I had thus declared some arrays like this:

const byte maxpins = 16;
byte pins[maxpins];

This explains why the last element of one of my arrays was getting 'overwritten' - because that last element never even existed! What is needed 'of course' is

byte pins[maxpins+1]; // elements 0 - 16
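A sketch of the corrected declaration in context, keeping maxpins as the highest index. The typedef only exists so it compiles off-target (Arduino already defines byte), initPins() is a hypothetical helper, and the static_assert is an optional compile-time guard that the array really covers indexes 0 to maxpins:

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t byte;  // Arduino defines this already; needed only for a desktop build

// maxpins is the highest index, so the array needs maxpins + 1 elements (0-16).
const byte maxpins = 16;
byte pins[maxpins + 1];

static_assert(sizeof(pins) / sizeof(pins[0]) == maxpins + 1,
              "pins must hold indexes 0..maxpins");

// Fills every valid index. With byte pins[maxpins], the last iteration
// (b == 16) would write one element past the end of the array.
void initPins(byte value) {
    for (int b = 0; b <= maxpins; ++b) {
        pins[b] = value;
    }
}
```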

How did I find it?

I decided to update the ESP8266 core from 2.7.4 to 3.0.2. The new version seems to enable a more stringent level of compiler warnings, so it flagged some of my loops that iterate through a whole array as running past its end, since the array was one element too small!
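Another way to make this class of bug hard to write, sketched under assumptions (a 17-element pins array, hypothetical helpers clearPins() and setAllPins(), and a hand-rolled countOf() template rather than std::size() so it also builds with pre-C++17 compilers): take the loop bound from the array itself, or avoid the index entirely with a range-based for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint8_t byte;  // Arduino's byte type, for a desktop build

// Element count of a real array (not a pointer), known at compile time.
template <typename T, std::size_t N>
constexpr std::size_t countOf(const T (&)[N]) {
    return N;
}

byte pins[17];  // hypothetical array, indexes 0-16

// Index loop whose bound comes from the array, so it cannot disagree
// with the declaration the way a separately maintained constant can.
void clearPins() {
    for (std::size_t i = 0; i < countOf(pins); ++i) {
        pins[i] = 0;
    }
}

// Range-based for: no index, so no bound to get wrong.
void setAllPins(byte value) {
    for (byte &p : pins) {
        p = value;
    }
}
```

If the declaration changes size, both loops follow it automatically.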

Yay!

So… does that other language index from 1?

If you use C long enough, you start numbering just about everything, even IRL, from zero. 😑

a7

Yes, those arrays are also zero-based!

I also find that every time I switch languages, I still use the other one's comment syntax.