Compiler treating distinct Unicode characters as equivalent

I have some code which takes different actions based on the contents of a string. Specifically, I have a switch statement which catches particular characters. Unfortunately, when trying to compile, I'm told that I have a "duplicate case value". It seems to think that "☻" and "»" are the same character, as well as "☠" and "〠".

If you can't see those, that's U+263B and U+00BB as well as U+2620 and U+3020.

How can I tell the compiler that these really are different characters?

Here's the minimum version of my sketch to reproduce the error:

void setup() {
      char c = 'A';
      int a = 0;
      switch (c) {
      case '«': 
        a = 104; 
        break;
      case '´': 
        a = 105; 
        break;
      case '»': 
        a = 106; 
        break;
      case 'ÿ': 
        a = 167; 
        break;
      case '€': 
        a = 168; 
        break;
      case '?': 
        a = 169; 
        break;
      case '?': 
        a = 170; 
        break;
      case '?': 
        a = 171; 
        break;
      case '?': 
        a = 172; 
        break;
      case '?': 
        a = 173; 
        break;
      case '?': 
        a = 174; 
        break;
      case '?': 
        a = 175; 
        break;
      case '?': 
        a = 176; 
        break;
      case '?': 
        a = 177; 
        break;  
      default : 
        a = 0; 
        break;
      }
}

I don't know whether Arduino can handle Unicode characters or not.

But I do know that the Unicode characters you're trying to use are 16-bit. So there's no way they're ever going to fit into an 8-bit char.

This compiles, but won't do what you might expect:

void setup ()
  {
  char c = 'a';
  
  switch (c)
    {
    case 'a' : break;
    case 'ab' : break;
    case 'abc' : break;
    case 'abcd' : break;
    }  // end of switch
  
  }  // end of setup

void loop () { }

Why? Only 'a' fits into a char; 'ab', 'abc' and so on do not. Each one is truncated to its last character (the low-order byte) to fit, so this is equivalent to:

void setup ()
  {
  char c = 'a';
  
  switch (c)
    {
    case 'a' : break;
    case 'b' : break;
    case 'c' : break;
    case 'd' : break;
    }  // end of switch
  
  }  // end of setup

void loop () { }

Now, try things that only differ in the high-order bytes and you get that error:

void setup ()
  {
  char c = 'a';
  
  switch (c)
    {
    case 'a' : break;
    case 'ba' : break;
    case 'cba' : break;
    case 'dcba' : break;
    }  // end of switch
  
  }  // end of setup

void loop () { }

sketch_jul13c.ino: In function 'void setup()':
sketch_jul13c:8: error: duplicate case value
sketch_jul13c:7: error: previously used here
sketch_jul13c:9: error: duplicate case value
sketch_jul13c:7: error: previously used here
sketch_jul13c:10: error: duplicate case value
sketch_jul13c:7: error: previously used here
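
The duplicates in the original sketch can be reproduced with plain arithmetic. This is only a minimal sketch, assuming the source file is saved as UTF-8 and that the compiler keeps just the last byte of each multi-byte character literal when it is squeezed into a char (the same truncation as in the 'ba' / 'cba' example). The helper name utf8LastByte is made up for illustration.

```cpp
#include <stdint.h>

// Hypothetical helper: returns the final byte of the UTF-8 encoding of a
// code point. ASCII is a single byte; every longer encoding ends in a
// continuation byte of the form 10xxxxxx carrying the low six bits.
constexpr uint8_t utf8LastByte(uint32_t codePoint) {
  return codePoint < 0x80
             ? static_cast<uint8_t>(codePoint)
             : static_cast<uint8_t>(0x80 | (codePoint & 0x3F));
}

// The two pairs from the question collide once only the last byte is kept.
static_assert(utf8LastByte(0x263B) == utf8LastByte(0x00BB),
              "U+263B and U+00BB share the same last byte");
static_assert(utf8LastByte(0x2620) == utf8LastByte(0x3020),
              "U+2620 and U+3020 share the same last byte");
```

Under those assumptions both pairs end in the same byte (0xBB for the first pair, 0xA0 for the second), which would explain why the compiler reports them as the same case value.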

Thanks for the clear, detailed response, Nick. I'm now off to research other ways of holding my characters & strings.

WhiteHotLoveTiger: Thanks for the clear, detailed response, Nick. I'm now off to research other ways of holding my characters & strings.

If they're coming in over serial, pretend they're numbers and hold them in an int. Then compare against the hex codes in the case labels of the switch statement.
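
Something along these lines is what that suggestion might look like. It is a rough sketch, not tested on hardware: the function name patternFor is made up, the return values are the ones assigned to a in the original sketch, and it assumes every symbol of interest fits in 16 bits.

```cpp
#include <stdint.h>

// Hypothetical lookup: maps a Unicode code point (held as a 16-bit number)
// to the pattern index used in the original sketch. Returns 0 when the
// code point is not recognised.
int patternFor(uint16_t codePoint) {
  switch (codePoint) {
    case 0x00AB: return 104;  // «
    case 0x00B4: return 105;  // ´
    case 0x00BB: return 106;  // »
    case 0x00FF: return 167;  // ÿ
    case 0x20AC: return 168;  // €
    default:     return 0;
  }
}
```

Because each case label is the full code point, nothing is truncated, so U+00BB and U+263B can no longer be confused with each other.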

Actually, it starts off as a char array which you define before compiling. The array gets looped through, and each character corresponds to a pattern for a series of LEDs to display (the LEDs display a pixelated version of the character).

Personally, I have no problem with storing my string as ints rather than characters from the beginning. However, this part of the code needs to be pretty easy for other people with various levels of experience to modify (in other words, they should be able to enter whatever text they want without having to look up values for special characters).

I haven't finished reading about this yet, but I've seen a few references to wchar. Perhaps this is what I need to use.

Delta_G: I don't know whether Arduino can handle Unicode characters or not.

I don't know either, but I've seen nothing to suggest that it supports multibyte characters; character literals, string literals, and the base char types all use single-byte ASCII characters. Unless somebody says otherwise, I would assume all text handling is single-byte ASCII, which means some indirection is needed between the Unicode code points and ASCII. UTF-8 encoding would be one way to bridge the gap, but I suspect UTF-16 would make the comparisons simpler. Either way, your character comparisons would need to be aware of multi-byte characters rather than relying on simple byte/char switch cases and == operations.
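
For the UTF-8 route, a decoder along the following lines could turn the bytes of the text into code points before any comparison is made. This is only a sketch: the name decodeUtf8 is made up, it assumes the text is stored as a NUL-terminated UTF-8 string, and it does minimal validation (it does not reject overlong encodings or surrogates).

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: decodes one UTF-8 sequence starting at s and stores
// the number of bytes consumed in *len. Returns U+FFFD for malformed input.
uint32_t decodeUtf8(const char *s, size_t *len) {
  const uint8_t first = static_cast<uint8_t>(s[0]);
  uint32_t codePoint;
  size_t total;

  // The lead byte tells us how many bytes the sequence occupies.
  if (first < 0x80)                { codePoint = first;        total = 1; }
  else if ((first & 0xE0) == 0xC0) { codePoint = first & 0x1F; total = 2; }
  else if ((first & 0xF0) == 0xE0) { codePoint = first & 0x0F; total = 3; }
  else if ((first & 0xF8) == 0xF0) { codePoint = first & 0x07; total = 4; }
  else { *len = 1; return 0xFFFD; }  // stray continuation or invalid byte

  // Each continuation byte (10xxxxxx) contributes six more bits.
  for (size_t i = 1; i < total; ++i) {
    const uint8_t next = static_cast<uint8_t>(s[i]);
    if ((next & 0xC0) != 0x80) { *len = i; return 0xFFFD; }
    codePoint = (codePoint << 6) | (next & 0x3F);
  }

  *len = total;
  return codePoint;
}
```

A loop over the text would call this repeatedly, advance the pointer by the returned length each time, and feed the resulting code point to whatever comparison or switch is used.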

I don't understand why they have to be smiley faces and other weird symbols. Why can't you just use regular letters? If the user will have to type in smiley-face Unicode characters, then he will have to know the hex codes. So if you insist on the funny symbols, why not just use the hex codes and be done?

Are there really more than 256 cases in that switch? If not, why can't you use, say, 'k' instead of some symbol I don't even know how to make? Is there something wrong with normal letters?

I think we need to step back a bit and ask two questions:

  • Where are these Unicode characters coming from (what input device)?
  • What output device is going to display them?

Try setting “preproc.substitute_unicode” to false.