AVR compiler not using CBI instructions

Simple code:

DDRB=0xff;
for(unsigned char i=0;i<4;i++) PORTB|=1<<i;
for(unsigned char i=0;i<4;i++) PORTB&=~(1<<i);

After disassembly:

DDRB=0xff;
00000011  SER R24		Set Register 
00000012  OUT 0x17,R24		Out to I/O location 
for(unsigned char i=0;i<4;i++) PORTB|=1<<i;
00000013  SBI 0x18,0		Set bit in I/O register 
00000014  SBI 0x18,1		Set bit in I/O register 
00000015  SBI 0x18,2		Set bit in I/O register 
00000016  SBI 0x18,3		Set bit in I/O register 
00000017  LDI R24,0x00		Load immediate 
00000018  LDI R25,0x00		Load immediate 
for(unsigned char i=0;i<4;i++) PORTB&=~(1<<i);
00000019  LDI R22,0x01		Load immediate 
0000001A  LDI R23,0x00		Load immediate 
--- No source file -------------------------------------------------------------
0000001B  IN R18,0x18		In from I/O location 
0000001C  MOVW R20,R22		Copy register pair 
0000001D  MOV R0,R24		Copy register 
--- No source file -------------------------------------------------------------
0000001E  RJMP PC+0x0003		Relative jump 
0000001F  LSL R20		Logical Shift Left 
00000020  ROL R21		Rotate Left Through Carry 
00000021  DEC R0		Decrement 
00000022  BRPL PC-0x03		Branch if plus 
00000023  COM R20		One's complement 
00000024  AND R18,R20		Logical AND 
00000025  OUT 0x18,R18		Out to I/O location 
00000026  ADIW R24,0x01		Add immediate to word 
00000027  CPI R24,0x04		Compare with immediate 
00000028  CPC R25,R1		Compare with carry 
00000029  BRNE PC-0x0E		Branch if not equal

The first loop is compiled to SBI instructions, but when I write 0s to the PORTB pins in a for loop, the generated code has far too much overhead. I tried all optimization levels and the output never contains CBI.

Code without the second for loop:

DDRB=0xff;
for(unsigned char i=0;i<4;i++) PORTB|=1<<i;
PORTB&=~(1<<PB3);
PORTB&=~(1<<PB2);
PORTB&=~(1<<PB1);
PORTB&=~(1<<PB0);

Output:

DDRB=0xff;
00000011  SER R24		Set Register 
00000012  OUT 0x17,R24		Out to I/O location 
for(unsigned char i=0;i<4;i++) PORTB|=1<<i;
00000013  SBI 0x18,0		Set bit in I/O register 
00000014  SBI 0x18,1		Set bit in I/O register 
00000015  SBI 0x18,2		Set bit in I/O register 
00000016  SBI 0x18,3		Set bit in I/O register 
PORTB&=~(1<<PB3);
00000017  CBI 0x18,3		Clear bit in I/O register 
PORTB&=~(1<<PB2);
00000018  CBI 0x18,2		Clear bit in I/O register 
PORTB&=~(1<<PB1);
00000019  CBI 0x18,1		Clear bit in I/O register 
PORTB&=~(1<<PB0);
0000001A  CBI 0x18,0		Clear bit in I/O register 
0000001B  RJMP PC-0x0000		Relative jump

Does this mean that manipulating bits in a for loop uses more ROM space and wastes more cycles than changing the bits manually one by one?
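
Going by the two listings above, SBI/CBI showed up wherever the bit number ended up as a constant. One way to keep that form is to spell out each bit with a constant index. If the pins do not have to change one after another, another way is to clear them all with a single mask. Here is a minimal sketch of both, written so it also builds on a PC. fake_portb, PORT, CLEAR_BIT and the function names are made up for illustration and stand in for PORTB from <avr/io.h>. Whether a given compiler version really emits CBI for this has to be confirmed in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in so the sketch builds on a PC.
   On the AVR, drop this and use PORTB from <avr/io.h> instead. */
static volatile uint8_t fake_portb;
#define PORT fake_portb

/* Clear one bit. The bit number is meant to be a literal constant,
   the same form as PORTB &= ~(1<<PB3) in the listing above. */
#define CLEAR_BIT(reg, bit) ((reg) &= (uint8_t)~(1u << (bit)))

/* Clears bits 0..3 one at a time, each with a constant bit number. */
uint8_t clear_one_by_one(uint8_t start)
{
    PORT = start;
    CLEAR_BIT(PORT, 0);
    CLEAR_BIT(PORT, 1);
    CLEAR_BIT(PORT, 2);
    CLEAR_BIT(PORT, 3);
    return PORT;
}

/* Clears bits 0..3 together in a single read-modify-write. */
uint8_t clear_with_one_mask(uint8_t start)
{
    PORT = start;
    PORT &= (uint8_t)~0x0Fu;
    return PORT;
}

/* Self-check: both approaches leave the port in the same final state. */
void check_same_result(void)
{
    for (unsigned v = 0; v < 256; v++) {
        assert(clear_one_by_one((uint8_t)v) == clear_with_one_mask((uint8_t)v));
    }
}
```

Note that the single-mask version changes all four pins in the same write, so the pin timing is not the same as in the loop.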

Success!

static const uint8_t bit_by_index[8] = 
  {
    0b00000001,
    0b00000010,
    0b00000100,
    0b00001000,
    0b00010000,
    0b00100000,
    0b01000000,
    0b10000000,
  };

static const uint8_t invert_by_index[8] = 
  {
    0b11111110,
    0b11111101,
    0b11111011,
    0b11110111,
    0b11101111,
    0b11011111,
    0b10111111,
    0b01111111,
  };

void setup( void )
{
  DDRB=0xff;
  for(unsigned char i=0;i<4;i++) PORTB |= bit_by_index[i];
  for(unsigned char i=0;i<4;i++) PORTB &= invert_by_index[i];
}

void loop( void )
{
}

Looks hardcore, but at least it generates CBI instructions :-)
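
The tables can be sanity-checked off-target. This is a minimal sketch, assuming a PC build. fake_portb and both function names are hypothetical. The table entries are written in hex instead of 0b literals so that compilers without binary-literal support also accept them. It only checks that the tables hold the same values as 1<<i and ~(1<<i). It says nothing about which instructions the AVR compiler picks.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the tables in the post, written in hex. */
static const uint8_t bit_by_index[8] =
  { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };

static const uint8_t invert_by_index[8] =
  { 0xFE, 0xFD, 0xFB, 0xF7, 0xEF, 0xDF, 0xBF, 0x7F };

/* Returns 1 if every table entry equals the shift expression it replaces. */
int tables_match_shifts(void)
{
    for (uint8_t i = 0; i < 8; i++) {
        if (bit_by_index[i] != (uint8_t)(1u << i)) return 0;
        if (invert_by_index[i] != (uint8_t)~(1u << i)) return 0;
    }
    return 1;
}

/* Runs the two loops from setup() on a plain variable instead of PORTB
   (hypothetical stand-in) and returns the final value. */
uint8_t run_on_fake_port(uint8_t start)
{
    volatile uint8_t fake_portb = start;

    for (unsigned char i = 0; i < 4; i++) fake_portb |= bit_by_index[i];
    /* After the first loop the low four bits must all be set. */
    assert((fake_portb & 0x0F) == 0x0F);

    for (unsigned char i = 0; i < 4; i++) fake_portb &= invert_by_index[i];
    return fake_portb;
}
```

The upper four bits pass through unchanged, which is the same behaviour as the original |= and &= lines.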