Buffer overflows and should beginners be trying to learn c-string manipulations?

Short answer, I don’t think so.

Consider the latest C char processing buffer overflow security issue discovered Unix sudo

The following Arduino code illustrates the buffer overflow found in sudo

// example of buffer over flow in Unix sudo from
// https://blog.qualys.com/vulnerabilities-research/2021/01/26/cve-2021-3156-heap-based-buffer-overflow-in-sudo-baron-samedit

char otherChars[] = "1234567890";
char input[] = {'e', 's', 'c', 'a', 'p', 'e', 'd', ' ', 'c', 'h', 'a', 'r', 's', ' ', '\\', '\\', ' ', '\\', 'r', ' ', '\\', 'n', ' ', '\\', 0x00, // the real INPUT, the last \ causes the sudo overlow
                'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 0x00   // data in following memory locations
               };

void setup() {
  Serial.begin(115200); // <<<<<<<<<<< make debug output as fast as possible
  for (int i = 10; i > 0; i--) {
    Serial.print(i); Serial.print(' ');
  }
  Serial.println();

  char *result = (char*)malloc(strlen(input)+1);  // allocate space for the result and a terminating null
  if (result == NULL) {
    while (1) {
      Serial.println(" out-of-memory!!");
      delay(5000);
    }
  }
  *result = '\0'; // terminate it to start with
  
  size_t otherMemorySize = strlen(otherChars)+1;
  char *otherMemory = (char*)malloc(strlen(otherChars)+1);
  if (otherMemory == NULL) {
    while (1) {
      Serial.println(" out-of-memory!!");
      delay(5000);
    }
  }
  strncpy(otherMemory,otherChars,otherMemorySize);  // initialize some other memory
  
  Serial.println(" Initial conditions");
  Serial.print("input '"); Serial.print(input); Serial.println("'");
  Serial.print("result '");Serial.print(result); Serial.println("'");
  Serial.print("otherMemory '");Serial.print(otherMemory); Serial.println("'");
  Serial.println();
  
  char* from = input;
  char* to = result;
  while (*from) {
    if (from[0] == '\\' && !isspace((unsigned char)from[1]))
      from++;
    *to++ = *from++;
  }
  *to = '\0'; // terminate result

  Serial.println(" After Escape processing");
  Serial.print("input '"); Serial.print(input); Serial.println("'");
  Serial.print("result '");Serial.print(result); Serial.println("'");
  Serial.print("otherMemory '");Serial.print(otherMemory); Serial.println("'");
}
void loop() {
}

The output is

10 9 8 7 6 5 4 3 2 1 
 Initial conditions
input 'escaped chars \\ \r \n \'
result ''
otherMemory '1234567890'

 After Escape called
input 'escaped chars \\ \r \n \'
result 'escaped chars \ r n '
otherMemory 'xyzxyzxyzxyzxyzxyz'

Notice how the otherMemory has been overwritten.

This ‘bug’, presumably by a experienced programmer and review by others, when undiscovered for over 10 years.
C++ Strings were introduced to avoid many of the c-string problems. However the use of Strings in Arduino is advised against.

SafeStrings (my library) provides a practical alternative to Arduino Strings.
Here is the above code re-written using SafeStrings

// example of buffer over flow in Unix sudo from
// https://blog.qualys.com/vulnerabilities-research/2021/01/26/cve-2021-3156-heap-based-buffer-overflow-in-sudo-baron-samedit
// using SafeString for the processing

#include <SafeString.h>
// available from Arduino library manager, tutorial at https://www.forward.com.au/pfod/ArduinoProgramming/SafeString/index.html

char otherChars[] = "1234567890";
char input[] = {'e', 's', 'c', 'a', 'p', 'e', 'd', ' ', 'c', 'h', 'a', 'r', 's', ' ', '\\', '\\', ' ', '\\', 'r', ' ', '\\', 'n', ' ', '\\', 0x00, // the real INPUT, the last \ causes the sudo overlow
                'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 0x00   // data in following memory locations
               };

void setup() {
  Serial.begin(115200); // <<<<<<<<<<< make debug output as fast as possible
  for (int i = 10; i > 0; i--) {
    Serial.print(i); Serial.print(' ');
  }
  Serial.println();
  SafeString::setOutput(Serial); // turn on error msgs

  size_t resultSize = strlen(input) + 1;
  char *result = (char*)malloc(resultSize);  // allocate space for the result and a terminating null
  if (result == NULL) {
    while (1) {
      Serial.println(" out-of-memory!!");
      delay(5000);
    }
  }
  *result = '\0'; // terminate it to start with

  size_t otherMemorySize = strlen(otherChars) + 1;
  char *otherMemory = (char*)malloc(strlen(otherChars) + 1);
  if (otherMemory == NULL) {
    while (1) {
      Serial.println(" out-of-memory!!");
      delay(5000);
    }
  }

  cSFPS(sfOtherMemory, otherMemory, otherMemorySize); // wrap in a SafeString
  sfOtherMemory = otherChars; // copy in chars with bounds checking and error msgs // initialize some other memory

  Serial.println(" Initial conditions");
  Serial.print("input '"); Serial.print(input); Serial.println("'");
  Serial.print("result '"); Serial.print(result); Serial.println("'");
  Serial.print("otherMemory '"); Serial.print(otherMemory); Serial.println("'");
  Serial.println();

  cSFA(sfInput, input);
  cSFPS(sfResult, result, resultSize);

  size_t idx = 0;
  while (sfInput[idx] != '\0') {
    if (sfInput[idx] == '\\' && !isspace((unsigned char)sfInput[idx + 1])) { //<< error msg for idx+1 here
      idx++;
    }
    sfResult += sfInput[idx++];// << error msg for idx here (incremented above)
  }

  Serial.println();
  Serial.println(" After Escape processing");
  Serial.print("input '"); Serial.print(input); Serial.println("'");
  Serial.print("result '"); Serial.print(result); Serial.println("'");
  Serial.print("otherMemory '"); Serial.print(otherMemory); Serial.println("'");

}
void loop() {
}

Here is the output

10 9 8 7 6 5 4 3 2 1 
 Initial conditions
input 'escaped chars \\ \r \n \'
result ''
otherMemory '1234567890'


 After Escape called
input 'escaped chars \\ \r \n \'
result 'escaped chars \ r n '
otherMemory '1234567890'

Which is the expected ‘correct’ result.
The code is not as ‘optimized’ as the top example but it has the great advantage that it works!

Turning on the SafeString error msgs gives

10 9 8 7 6 5 4 3 2 1 
 Initial conditions
input 'escaped chars \\ \r \n \'
result ''
otherMemory '1234567890'

Error:  sfInput[] index 24 >=  sfInput.length() : 24
        sfInput cap:49 len:24 'escaped chars \\ \r \n \'
Error:  sfInput[] index 24 >=  sfInput.length() : 24
        sfInput cap:49 len:24 'escaped chars \\ \r \n \'
Error: sfResult.concat() of '\0'
        sfResult cap:24 len:20 'escaped chars \ r n '

 After Escape called
input 'escaped chars \\ \r \n \'
result 'escaped chars \ r n '
otherMemory '1234567890'

Which lets you know you have an idx out of bounds problem.

So my question is, given the decades of buffer overflow security errors that c-string manipulations have caused, not to mention errors in user’s Arduino projects, should we be advising Arduino beginners to use them at all?

If they are practicing coding then using Strings is a better choice.
If they are just do an Arduino project then SafeStrings is a better choice.

i would argue that it is the expected output for the code that is written. it may not be the intended output by the author, but the code is contrived and ignores standard, if not "common" practices.

C, or any languages including C++, needs to be used properly. Stroustrup has suggested it takes 10 years to be proficient in C++. at qualcomm, we were given training to recognize poorly written code that nefarious actors can take advantage of

the irony is you suggest there is a significant "flaw" in a mainstream programming language that continues to be in widespread use after nearly a half century.

you might similar argue -- should beginners be trying to learn c pointers? -- because of the potential for misuse

so yes, i believe beginners should learn the proper use of the features of any software tool

If they are just do an Arduino project then SafeStrings is a better choice.

I somehow thought that might get a mention when I saw the topic title and author

"code is contrived"

Sadly the code that causes the problem is not contrived

864                 for (to = user_args, av = NewArgv + 1; (from = *av); av++) { 
865                     while (*from) { 
866                         if (from[0] == '\\' && !isspace((unsigned char)from[1])) 
867                             from++; 
868                         *to++ = *from++; 
869                     } 
870                     *to++ = ' '; 
871                 }

It is extracted from the Unix code base that has been in commercial release for 10 years

"the irony is you suggest there is a significant "flaw" in a mainstream programming language that continues to be in widespread use after nearly a half century"

No irony here. The 'flaws' in C/C++ are well known and documented.
See How security flaws work: The buffer overflow
" It's an exaggeration (but only a slight one) to lay the blame entirely on the C programming language and its more or less compatible offshoots, namely C++ and Objective-C. The C language is old, widely used, and essential to our operating systems and software. It's also appallingly designed, and while all these bugs are avoidable, C does its damnedest to trip up the unwary."

"should beginners be trying to learn c pointers?"

I thought that was already answered NO see Arduino Style Guide for Writing Libraries.
"Don’t assume knowledge of pointers. Beginning users of C find this the biggest roadblock, and get very confused"

Just because a feature is available does not mean it should be suggested to beginners

The Arduino delay() feature is widely advised against, as is Stream's readBytes, readBytesUntil() and readString() and Strings in general

On the other hand Arduino is not C (or C++). Arduino has it own libraries which don't match C's.
Most boards do not implement printf() or sprintf() or ungetc() which are 'standard' C methods and so Arduino is incomplete as a tool for learning C programming.

Arduino was designed, I believe, to make micro-programming easy and accessiable.

Writing reliable code to do c-string manipulation is not easy for beginners (or it has been proved over and over, for experienced system programmers either) and the general lack of a debugger for the intro Arduino boards makes it hard to find the problems.

So why would we point beginners to using functions that are know to be fragile and will cause future problems? It appears we already advise against delay()'s and readUntil() and String. I believe c-string manipulation methods should be added to that list.

Most questions I see on the forum are from non-programmers that just want something to work, not from budding micro-C programming students.

drmpf:
It is extracted from the Unix code base that has been in commercial release for 10 years

i saw some of the unix kernel when i worked at bell labs. from which part of the code does this come from?

drmpf:
Arduino was designed, I believe, to make micro-programming easy and accessiable.

[url=http://

drmpf:
Explainer: What is Arduino?[/url] suggests it was intended for "physical computing".

drmpf:
So why would we point beginners to using functions that are know to be fragile and will cause future problems?

maybe beginners should not be exposed to professional languages.

It’s not part of the kernel. It’s part of the “sudo” program.

So, drmpf - what 3rd party security experts have reviewed your “safestring” code?

"what 3rd party security experts have reviewed your "safestring" code?"

The library includes an extensive test suite which was used to find the initial bugs and to check the changes.
Are you offering to review the code? Your critique would be most welcome.

which was used to find the initial bugs

What about the other bugs?

It would be fool hardy to say any library is without bugs.

But I keep using SafeString in my projects and try applying it to various forum questions to see if I come across any bugs or limitations or confusing method descriptions.
There have been more than 20 revisions / bug fixes / extensions so far. The library is up to V3.0.3 but nothing outstanding at the moment.

Try it and let me know if you find anything.

I learned C back in '85, and had to learn to use CStrings right. If' its good enough for me, then it's good enough for todays whippersnappers. All them vidyagames have made them soft (and don't even get me started on 5th ed).

Having said that:

If you are going to the trouble of defining a class, why not bite the bullet and add println as a method? A no-arg println, and one that takes a lambda? Oh, add support for __flashstringhelper too, so that F() gets supported out-of-the-box.

...
char SafeString::nullBufferSafeStringBuffer[1] = {'\0'}; // use if buf arg is NULL
...
  buffer = nullBufferSafeStringBuffer;
...
  buffer[0] = '\0';
...

Why?

UKHeliBob:
I somehow thought that might get a mention when I saw the topic title and author

Likewise.

I don't think anyone can criticise cstrings until his own code has been in equally widespread use for 10 years without any problems being found.

I have two concerns about the Safestring library (which the OP is already aware of)

  • Its documentation
  • It contains extra features (for example serial input) that should be in a separate library so as not confuse newbies who just want the convenience of the Safestrings

...R

@PaulMurrayCbr

“If you are going to the trouble of defining a class, why not bite the bullet and add println as a method? A no-arg println, and one that takes a lambda? Oh, add support for __flashstringhelper too, so that F() gets supported out-of-the-box.”

Actually there is a lot of print() support in SafeString including support for F()
Apart from Serial.println(sf);
you can print directly to a SafeString as in
sf.print(5); sf.print(" test "); sf.print(5.542,3); and sf.println(F(" test2"));

As well as that there is a debug() print method that can be switched on and off
sf.debug(); sf.debug(F(" in setup()")); and sf.debug(false); // short debug output

Here is a short example sketch

#include <SafeString.h>
cSF(sf, 100);
void setup() {
  Serial.begin(115200);
  for (int i = 10; i > 0; i--) {
    Serial.print(i); Serial.print(' ');
    delay(500);
  }
  Serial.println();

  SafeString::setOutput(Serial); // turn on error msgs and debug output

  sf.print(5);  sf.print(" test "); sf.print(5.542, 3); sf.println(F(" test2"));
  Serial.println("      Serial.print(sf) output");
  Serial.print(sf);
  Serial.println();
  Serial.println("      sf.debug() output");
  sf.debug();
  Serial.println();
  Serial.println("     sf.debug with title output");
  sf.debug(F(" in setup()"));
  Serial.println();
  Serial.println("     sf.debug with verbose false output");
  sf.debug(false); // turn off verbose output
}
void loop() {
}

Here is the output

10 9 8 7 6 5 4 3 2 1 
      Serial.print(sf) output
5 test 5.542 test2

      sf.debug() output
SafeString sf cap:100 len:20 '5 test 5.542 test2
'

     sf.debug with title output
 in setup() sf cap:100 len:20 '5 test 5.542 test2
'

     sf.debug with verbose false output
SafeString sf cap:100 len:20

@Coding Badly
buffer[0] = '\0';
Yes a good question.

I am somewhat paranoid about terminating the internal buffer to prevent overflows.
buffer[0] = '\0';
sets the pre-condition for the internal buffer

There is a possible problem where SafeString c_str() returns const char pointer to the internal private buffer.
If a nefarious, or just naive, user does something like

char * buf = (char*)sf.c_str(); // where sf is an empty SafeString i.e. buffer[0] = '\0';
buf[0] = '1';

Then they can corrupt the internal buffer's termination.

You see a lot of buffer[0] = '\0'; in the code to at least overwrite possible corruption to the null and empty buffers.

Writing this response I realize I could 'protect' against c_str() corruptions by having two internal buffers and copying
from the private to the public one before return it.

Seems a bit excessive in memory use, but may be it could be added it as an option?

@Coding Badly
There is another alternative to protect the internal buffer against alternation via c_str() and that is to mark this SafeString as subject to external modification when c_str() is called.

At present you can 'wrap' a char or char* in a SafeString for processing using
createSafeStringFromCharArray or cSFA
createSafeStringFromCharPtr or cSFP
createSafeStringFromCharPtrWithSize or cSFPS

In those cases, it is very easy for the user to modify the wrapped array outside the SafeString methods, so for SafeStrings that wrap external char buffers, each time a SafeString method is called the first line calls cleanUp();

void SafeString::cleanUp() {
  if (!fromBuffer) {
    return; // skip scanning for length changes in the buffer in normal SafeString
  }
  buffer[_capacity] = '\0'; // make sure the buffer is terminated to prevent memory overruns
  len = strlen(buffer);  // scan for current length to pickup an changes made outside SafeString methods
}

So a solution to the danger of users modifying the c_str() return would be to set fromBuffer true in c_str() and then all subsequent calls to SafeString methods would perform a cleanUp.

I would expect that normally there would be no performance impact from this because c_str() is usually the last method called to pass the final SafeString contents to some other method expecting a char*
I think I have just talked myself in to adding this change. A big thank you to @Coding Badly

drmpf:
I am somewhat paranoid about terminating the internal buffer to prevent overflows.

@drmpf, you're making me nervous. The null terminator has nothing to do with preventing buffer overflows unless you've made a terrible design mistake. (All operations should be size checked making the null terminator essentially superfluous.)

There is a second much more important point. As coded you are not preventing a problem but are covering it up. The correct choice is to assert the null terminator is present in debug builds.

I reckon this Thread should be merged with the OP's other Thread titled " A Safe Alternative to using Strings in Arduino"

The Law of Leaky Abstractions is also worth reading.

...R

drmpf:
So a solution to the danger of users modifying the c_str() return would be to set fromBuffer true in c_str() and then all subsequent calls to SafeString methods would perform a cleanUp.

i don’t think using a library will prevent (or minimize) programmer mistakes. mistakenly using a “+” instead of a “-” can result in a lengthy debug, just as much if not more.

a programmer can’t be prevented from using language features by adding a library and told not to use language features. to prevent mis-use, the language must not have those features.

niklaus worth describes the use of pointers with Pascal to support algorithms using linked-lists and trees. valid pointers are always allocated because they can only defined using language constructs. this is not as flexible as in C where they are essential to accessing memory mapped hardware as well as for more sophisticated uses

awk is my go to language for basic text processing or algorithm development, but is not suited for embedded applications. i think it would be better suited for beginners

a more constrained language would be easier for beginners. but it seems like a lost cause to argue that some library will minimize beginner programming mistakes; there are no guard rails restricting a beginner’s naive use of other language features. of course beginner’s are pushing the envelop.

experienced programmers make mistakes too. we learn from them

@Coding Badly
You are correct I should not be papering over an error. I will catch it and log it.
Null terminators are essential within the library since is uses c-string methods (with a lot of checking) internally.

@Robin2
The Law of Leaky Abstractions was well worth a read. Thanks
I not believe SafeString has any 'leakeage', but let me know if you find any.

SafeString does not replace the need to understand char and null terminators and memory allocation, but I still believe that we should not be promoting c-string methods to beginners for string processing and parsing, just as we don't recommend delay() and readUntil() and Strings. Sure they work but give rise to problems later.
At the very least the use of c-string methods in code examples should come with warnings about the dangers.

I will post the library changes on the " A Safe Alternative to using Strings in Arduino" thread

Sure they work but give rise to problems later.

Only if they are not used properly

By your logic we should not suggest that beginners use arrays of any kind