The compiler will produce code that does port I/O in a single instruction (cbi or sbi, which take two clock cycles, so about 125 nanoseconds on a 16 MHz board) if it knows the port, pin and state at compile time.
So if you can eliminate the intermediate variables in that code, your I/O will be much faster. It doesn't matter whether you code this using something like:
PORTC &= ~(1 << dataP);
Or
cbi PORTC,5
Either way, as long as dataP is a compile-time constant, the compiler will produce the latter, which is fast and atomic.
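To make the difference concrete, here is a minimal sketch of the two cases. It assumes PORTC can be stood in for by an ordinary variable so the snippet builds on a PC (on a board it comes from <avr/io.h>), and the names clearPinConstant and clearPinVariable are made up for illustration. With avr-gcc and optimization on, the first function would typically reduce to a single cbi, while the second generally has to build the mask at run time and do a separate read, modify and write.

```cpp
#include <stdint.h>

// Stand-in for the AVR register so this builds off-target (assumption);
// on an Arduino, PORTC is provided by <avr/io.h>.
volatile uint8_t PORTC = 0xFF;

// Pin known at compile time: the mask is a constant, so avr-gcc can
// emit a single cbi for this statement.
const uint8_t dataP = 5;
void clearPinConstant() {
  PORTC &= ~(1 << dataP);
}

// Pin only known at run time: the mask has to be computed with a shift,
// then the port is read, masked and written back (slower, not atomic).
void clearPinVariable(uint8_t pin) {
  PORTC &= ~(1 << pin);
}
```

Both functions clear the same bit when called with pin 5; only the generated code differs.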
I found it convenient to define a macro that converts Arduino ports to registers and pins.
Here is your code rewritten to use the macro:
for( int i = 0; i < 8; i++)
if (!!(val & (1 << (7 - i))) == LOW)
fastWriteA(5, LOW );
The machine instructions for this are as follows. Only one instruction is doing I/O; everything else is the for loop and the bit test.
for( int i = 0; i < 8; i++)
if (!!(val & (1 << (7 - i))) == LOW)
168: cb 01 movw r24, r22
16a: 02 2e mov r0, r18
16c: 02 c0 rjmp .+4 ; 0x172 <loop+0x7a>
16e: 95 95 asr r25
170: 87 95 ror r24
172: 0a 94 dec r0
174: e2 f7 brpl .-8 ; 0x16e <loop+0x76>
176: 80 ff sbrs r24, 0
fastWriteA(5, LOW );
178: 45 98 cbi 0x08, 5 ; 8 // set pin 5 of port C low (analogPin 5)
17a: 21 50 subi r18, 0x01 ; 1
17c: 30 40 sbci r19, 0x00 ; 0
17e: 8f ef ldi r24, 0xFF ; 255
180: 2f 3f cpi r18, 0xFF ; 255
182: 38 07 cpc r19, r24
184: 89 f7 brne .-30 ; 0x168 <loop+0x70>
Here are the macros:
// the following macro sets a digital pin high or low, pin must be between 0 and 13 inclusive
// usage: fastWrite(2,HIGH); fastWrite(13,LOW);
#define fastWrite(_pin_, _state_) ( (_pin_) < 8 ? ((_state_) ? (PORTD |= 1 << (_pin_)) : (PORTD &= ~(1 << (_pin_)))) : ((_state_) ? (PORTB |= 1 << ((_pin_) - 8)) : (PORTB &= ~(1 << ((_pin_) - 8)))) )
// the macro sets or clears the appropriate bit in port D if the pin is less than 8 or port B if between 8 and 13
// this macro does fast digital write on pins shared with the analog port
#define fastWriteA(_pin_, _state_) ( (_state_) ? (PORTC |= 1 << (_pin_)) : (PORTC &= ~(1 << (_pin_))) )
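As a usage example, here is a rough sketch of shifting a byte out MSB first with fastWriteA. It is not your original code: the wiring (data on analog pin 5, clock on analog pin 4) and the function name shiftOutFast are assumptions for illustration, and PORTC, HIGH and LOW are declared locally only so the snippet builds on a PC. The macro is repeated, with each argument parenthesized, to keep the snippet self-contained.

```cpp
#include <stdint.h>

// Stand-ins for the AVR register and Arduino constants (assumptions so
// the example builds off-target); on a board these come from Arduino.h.
volatile uint8_t PORTC = 0;
#define HIGH 1
#define LOW  0

#define fastWriteA(_pin_, _state_) ( (_state_) ? (PORTC |= 1 << (_pin_)) : (PORTC &= ~(1 << (_pin_))) )

// Hypothetical wiring: data on analog pin 5, clock on analog pin 4.
#define DATA_PIN  5
#define CLOCK_PIN 4

// Shift one byte out, most significant bit first. Pin and state are
// literals in every fastWriteA call, so on AVR each call can become a
// single sbi or cbi; the rest is just the loop and the bit test.
void shiftOutFast(uint8_t val) {
  for (int i = 0; i < 8; i++) {
    if (val & (1 << (7 - i)))
      fastWriteA(DATA_PIN, HIGH);
    else
      fastWriteA(DATA_PIN, LOW);
    fastWriteA(CLOCK_PIN, HIGH);  // rising edge: receiver samples the data pin
    fastWriteA(CLOCK_PIN, LOW);
  }
}
```

Note that the state is spelled out as HIGH or LOW in each branch rather than passed as a variable; passing a run-time value would still work, but the compiler could no longer pick sbi or cbi at compile time.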