How many bits is the char data type?

JChristensen · July 22, 2013, 2:30am

Would someone please try the sketch below? I must be missing something: the variable x is acting as though it were a signed 16-bit integer. I thought char was a signed 8-bit integer.

//Arduino 1.0.5, Arduino Uno.

void setup(void)
{
    Serial.begin(115200);
    char x = 0;
//    int8_t x = 0;    //gives same results as char
//    uint8_t x = 0;   //works as expected
    
    for (long i=0; i<32800; i++) {
        Serial.print(x, DEC);
        Serial.println();
        ++x;
    }
}

void loop(void)
{
}

The output I see:

liuzengqiang · July 22, 2013, 4:24am

Just did the test and got the same result. I then added a Serial.print(x); call, and the output of Serial.print(x, DEC); stayed within -128 to 127.

My code that reveals the cause:

void setup(void)
{
    Serial.begin(115200);
    volatile char x=0;
    char y=0;
    for (long i=0; i<512; i++) {
        Serial.print(x, DEC);
        Serial.print('\t');
        Serial.print(y, DEC);
        Serial.println();
        ++x;
        ++y;
    }
}

void loop(void)
{
}

Some output:

Same result if I do (int)y before sending it to print.
I suspect a compiler optimization bug. BTW, I read the definitions of print. There is no print(char c, int base). The closest is print(unsigned char, int).
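On the overload question, here is a minimal sketch of how C++ chooses between candidates when none of them takes a plain char. The `which` functions are hypothetical stand-ins, not the real Print methods; they only mirror the parameter types mentioned above. Because char to int is an integral promotion, it ranks ahead of the conversion to unsigned char, which fits the call to Print::print(int, int) (`_ZN5Print5printEii`) visible in the disassembly later in this thread.

```cpp
#include <string>

// Hypothetical stand-ins that mirror the shape of the print overloads.
// Each one reports which overload the compiler selected.
std::string which(unsigned char, int) { return "unsigned char"; }
std::string which(int, int)           { return "int"; }
std::string which(long, int)          { return "long"; }

// Calls the overload set with a plain char, as Serial.print(x, DEC) does.
std::string overload_for_char() {
    char c = 'A';
    return which(c, 10);  // promotion char -> int beats char -> unsigned char
}
```

Calling `overload_for_char()` returns "int", so a char argument reaches the int overload, where it is formatted as a 16-bit value on AVR.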

JChristensen · July 22, 2013, 4:34am

@liudr, thanks for the test, at least it's not just me. Sounds like a good theory. I wonder if all char or int8_t variables are taking up twice the amount of storage that people think.

I tried it on a few different releases. On Arduino 0022, it works as expected:

liuzengqiang · July 22, 2013, 4:45am

In a variation of the code I ran, I printed the address of two char array elements and they differ by 1, as expected. But the result is the same as using two char variables. I don't have 0022 any more. You should report this as a bug.

sigilwig444 · July 22, 2013, 5:03am

I am reading a book about programming the Arduino, and there is a chart with all of the integer types; char is listed as 8-bit.

econjack · July 22, 2013, 5:06am

Could it be that the compiler is using Unicode for the character set?

liuzengqiang · July 22, 2013, 5:22am

econjack:
Could it be that the compiler is using Unicode for the character set?

I don't think so. The char array element address I printed out told me it's 8-bit.

MichaelMeissner · July 22, 2013, 5:44am

liudr:
Same result if I do (int)y before sending it to print.
I suspect a compiler optimization bug. BTW, I read the definitions of print. There is no print(char c, int base). The closest is print(unsigned char, int).

Umm, technically, it is a program error and the program is undefined since the value overflows. The ISO standard defines the behavior for unsigned types (i.e. the expression is done in modulo arithmetic), but it is undefined if a signed type overflows.

nickgammon · July 22, 2013, 6:31am

The generated code is below:

000000c0 <setup>:
  c0:	cf 93       	push	r28
  c2:	df 93       	push	r29
  c4:	88 e9       	ldi	r24, 0x98	; 152
  c6:	91 e0       	ldi	r25, 0x01	; 1
  c8:	40 e0       	ldi	r20, 0x00	; 0
  ca:	52 ec       	ldi	r21, 0xC2	; 194
  cc:	61 e0       	ldi	r22, 0x01	; 1
  ce:	70 e0       	ldi	r23, 0x00	; 0
  d0:	0e 94 09 01 	call	0x212	; 0x212 <_ZN14HardwareSerial5beginEm>
  d4:	c0 e0       	ldi	r28, 0x00	; 0
  d6:	d0 e0       	ldi	r29, 0x00	; 0
  d8:	88 e9       	ldi	r24, 0x98	; 152
  da:	91 e0       	ldi	r25, 0x01	; 1
  dc:	be 01       	movw	r22, r28
  de:	4a e0       	ldi	r20, 0x0A	; 10
  e0:	50 e0       	ldi	r21, 0x00	; 0
  e2:	0e 94 a9 03 	call	0x752	; 0x752 <_ZN5Print5printEii>
  e6:	88 e9       	ldi	r24, 0x98	; 152
  e8:	91 e0       	ldi	r25, 0x01	; 1
  ea:	0e 94 c9 02 	call	0x592	; 0x592 <_ZN5Print7printlnEv>
  ee:	21 96       	adiw	r28, 0x01	; 1
  f0:	80 e8       	ldi	r24, 0x80	; 128
  f2:	c0 32       	cpi	r28, 0x20	; 32
  f4:	d8 07       	cpc	r29, r24
  f6:	81 f7       	brne	.-32     	; 0xd8 <setup+0x18>
  f8:	df 91       	pop	r29
  fa:	cf 91       	pop	r28
  fc:	08 95       	ret

As you can see, the value of x is kept in a register (R28/R29) and it does an adiw (add immediate word) at address 0xEE. This seems to me to be the bug. The fact that it is using a register means it is bypassing the normal truncation it would get if it actually put the data back into an 8-bit field.
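The listing can be modelled on a desktop compiler. This is only a sketch of what the generated code appears to do, not of what the source says: a single 16-bit register pair (r29:r28) acts as both the loop counter and the value passed to print, and the cpi/cpc pair compares it against 0x8020, which is 32800, the loop bound from the original sketch. The function name `simulate_optimized_loop` is made up for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Models the optimized loop: one 16-bit register pair is incremented with
// adiw and handed to print(int, int), which reads it as a signed 16-bit int.
std::vector<int> simulate_optimized_loop(std::uint32_t iterations) {
    std::vector<int> printed;
    printed.reserve(iterations);
    std::uint16_t reg = 0;  // ldi r28, 0 / ldi r29, 0
    for (std::uint32_t n = 0; n < iterations; ++n) {
        // Interpret the 16 bits as a two's complement signed value.
        int as_signed = (reg < 0x8000u) ? static_cast<int>(reg)
                                        : static_cast<int>(reg) - 0x10000;
        printed.push_back(as_signed);
        reg = static_cast<std::uint16_t>(reg + 1u);  // adiw r28, 0x01
    }
    return printed;
}
```

With 32800 iterations the model runs straight past 127 to 32767 and then continues from -32768, which is the pattern reported at the top of the thread.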

nickgammon · July 22, 2013, 6:37am

Here you are, it's not a bug.

In the C programming language, signed integer overflow causes undefined behavior, while unsigned integer overflow causes the number to be reduced modulo a power of two, meaning that unsigned integers "wrap around" on overflow.

Do a search for "c++ signed overflow undefined".

Basically since it is undefined the compiler is entitled to generate whatever code it wants to.

(edit) Like MichaelMeissner said.
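For contrast, the unsigned case is fully specified. A minimal sketch in plain C++ (the helper names are hypothetical, not Arduino functions) of the wraparound that the uint8_t variant of the original sketch relies on:

```cpp
#include <cstdint>

// Unsigned arithmetic wraps modulo 2^N, so this is well defined for every input.
std::uint8_t next_unsigned(std::uint8_t v) {
    return static_cast<std::uint8_t>(v + 1u);
}

// Applies next_unsigned `count` times, like the loop in the sketch does.
std::uint8_t after_increments(std::uint8_t start, unsigned count) {
    std::uint8_t v = start;
    for (unsigned n = 0; n < count; ++n) {
        v = next_unsigned(v);
    }
    return v;
}
```

Here `next_unsigned(255)` is 0, and 300 increments starting from 0 end at 44 (300 modulo 256).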

TCWORLD · July 22, 2013, 8:03am

The solution then is to simply replace:

++x;

with either this:

x = (byte)x + 1;

Or this:

x = (x+1) & 0xFF;

Both result in identical code due to compiler optimisation.
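Both rewrites above still store a value greater than 127 back into a signed char, which the language (before C++20) leaves implementation-defined; GCC documents that conversion as wrapping. If you would rather not depend on that, one option is to test the boundary explicitly. This is a sketch only, and `wrapping_increment` is a hypothetical helper, not part of the Arduino core:

```cpp
#include <cstdint>

// Increments a signed 8-bit value, wrapping from 127 to -128 explicitly
// so that no out-of-range value is ever converted to a signed type.
std::int8_t wrapping_increment(std::int8_t v) {
    if (v == INT8_MAX) {
        return INT8_MIN;  // 127 -> -128
    }
    return static_cast<std::int8_t>(v + 1);  // result is always in range here
}
```

In the sketch, `++x;` would become `x = wrapping_increment(x);`. It costs a compare and branch, so the masking versions may be preferable where code size matters.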

JChristensen · July 22, 2013, 1:19pm

Thanks everyone! Wow, very interesting. I found that declaring the variable as volatile makes it behave as I expected. In case you're wondering, I was just trying to demonstrate for a friend what happens when signed and unsigned integers overflow. The behaviour I expected for signed integers was for it to overflow from the maximum value (127) to the minimum value (-128). Kind of funny as I was ad-libbing at the time and of course got totally confused. I continue to be amazed at the optimization this compiler will do.

While I certainly cannot argue that the observed behaviour does not fit the definition of "undefined", I never would have expected "We'll promote your variable from 8 bits to 16, and continue to increment it, but when the 16 bits overflow, then we'll just let it go from the maximum value to the minimum value." The joke is certainly on me, hahaha.

system · July 22, 2013, 1:22pm

An earlier thread on the same subject

JChristensen · July 22, 2013, 2:18pm

AWOL:
An earlier thread on the same subject

Thanks. I'd done some searching but didn't find that thread. I agree that from a purely theoretical viewpoint of the language, the behaviour is somewhat surprising. But given the specific implementation, and considering the compiler optimizations and hardware characteristics (instruction set), "undefined" in this case just turns out to be this really weird thing. I assumed that I knew what was going to happen, and I did not realize that I was treading into "undefined" territory. Couple lessons there for sure!

liuzengqiang · July 22, 2013, 3:32pm

Nick,

Thanks for showing the assembly. It's clear the compiler didn't do any truncation, or anything else, on the registers. Doesn't the ATmega328 have an "inc" or "dec" instruction for such simple and often-needed incrementing and decrementing by 1?

Could you also demonstrate how the volatile keyword makes different assembled code? That would be great!

JChristensen · July 22, 2013, 3:45pm

Starting at address 0x40, ldd loads the byte from SRAM (the variable is forced into memory as a result of volatile), subi adds one (by subtracting -1, welcome to RISC!), and std puts the result back into SRAM.

00000000 <setup>:
   0:   0f 93           push    r16
   2:   1f 93           push    r17
   4:   df 93           push    r29
   6:   cf 93           push    r28
   8:   0f 92           push    r0
   a:   cd b7           in      r28, 0x3d       ; 61
   c:   de b7           in      r29, 0x3e       ; 62
   e:   80 e0           ldi     r24, 0x00       ; 0
  10:   90 e0           ldi     r25, 0x00       ; 0
  12:   40 e0           ldi     r20, 0x00       ; 0
  14:   52 ec           ldi     r21, 0xC2       ; 194
  16:   61 e0           ldi     r22, 0x01       ; 1
  18:   70 e0           ldi     r23, 0x00       ; 0
  1a:   0e 94 00 00     call    0       ; 0x0 <setup>
  1e:   19 82           std     Y+1, r1 ; 0x01
  20:   00 e0           ldi     r16, 0x00       ; 0
  22:   10 e0           ldi     r17, 0x00       ; 0
  24:   69 81           ldd     r22, Y+1        ; 0x01
  26:   77 27           eor     r23, r23
  28:   67 fd           sbrc    r22, 7
  2a:   70 95           com     r23
  2c:   80 e0           ldi     r24, 0x00       ; 0
  2e:   90 e0           ldi     r25, 0x00       ; 0
  30:   4a e0           ldi     r20, 0x0A       ; 10
  32:   50 e0           ldi     r21, 0x00       ; 0
  34:   0e 94 00 00     call    0       ; 0x0 <setup>
  38:   80 e0           ldi     r24, 0x00       ; 0
  3a:   90 e0           ldi     r25, 0x00       ; 0
  3c:   0e 94 00 00     call    0       ; 0x0 <setup>
  40:   89 81           ldd     r24, Y+1        ; 0x01
  42:   8f 5f           subi    r24, 0xFF       ; 255
  44:   89 83           std     Y+1, r24        ; 0x01
  46:   0f 5f           subi    r16, 0xFF       ; 255
  48:   1f 4f           sbci    r17, 0xFF       ; 255
  4a:   81 e0           ldi     r24, 0x01       ; 1
  4c:   0c 32           cpi     r16, 0x2C       ; 44
  4e:   18 07           cpc     r17, r24
  50:   01 f4           brne    .+0             ; 0x52 <setup+0x52>
  52:   0f 90           pop     r0
  54:   cf 91           pop     r28
  56:   df 91           pop     r29
  58:   1f 91           pop     r17
  5a:   0f 91           pop     r16
  5c:   08 95           ret

void setup(void)
{
    Serial.begin(115200);
    volatile char x = 0;
    
    for (long i=0; i<300; i++) {
        Serial.print(x, DEC);
        Serial.println();
        ++x;
    }
}

void loop(void)
{
}
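The subi trick in the listing above can be checked with ordinary 8-bit arithmetic. A small model, assuming only the register result matters (status flags are ignored), with a function name borrowed from the mnemonic for illustration:

```cpp
#include <cstdint>

// Models the 8-bit result of AVR `subi Rd, K`: Rd = Rd - K, modulo 256.
// Status flags are not modelled.
std::uint8_t subi(std::uint8_t rd, std::uint8_t k) {
    return static_cast<std::uint8_t>(rd - k);
}
```

Subtracting 0xFF is the same as adding 1 modulo 256: `subi(5, 0xFF)` gives 6, and `subi(0x7F, 0xFF)` gives 0x80, which read as a signed byte is the step from 127 to -128 that the volatile version shows.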

liuzengqiang · July 22, 2013, 3:51pm

That is illuminating! Thanks Jack. I never learned assembly for any RISC system, just 386 (x86) assembly. When I used assembly, I did tend to keep things in registers if I could. I read disassembled Turbo C code back in the 90s. It was moving between memory and registers so much that I couldn't stop laughing.

JChristensen · July 22, 2013, 3:58pm

liudr:
That is illuminating! Thanks Jack. I never learned assembly for any RISC system, just 386 (x86) assembly. When I used assembly, I did tend to keep things in registers if I could. I read disassembled Turbo C code back in the 90s. It was moving between memory and registers so much that I couldn't stop laughing.

Yeah same here, I've done a fair amount of assembler in the past on various CISC machines, but never on a RISC machine. So I'm just feeling my way along the walls in the dark here

liuzengqiang · July 22, 2013, 4:14pm

My signature used to be rep movsd; //do it
For anyone who programmed assembly on the 386 in real mode, this improves the data transfer rate by 100% via 32-bit operations. It works great if you are coding a 320*200, 256-color mode game and trying to copy your buffer onto the video card. When I was playing with the 32-bit stuff, I had no access to 32-bit assemblers, just 16-bit ones with real-mode debuggers. So once the CPU entered 32-bit "protected mode", I was running blind. Can't count how many times I had to restart my 486, like every minute.

JarkkoL · July 22, 2013, 9:12pm

Pfft, you were spoiled with 386 & VGA with all its fancy 32 bits & 256 colors
