Fast alternative to digitalRead/digitalWrite

I was very interested in this because I'm writing for dedicated hardware where my LCD data pins are contiguous and on the same port. In a simple benchmark, just converting the stock LiquidCrystal library to DigitalPin and nothing else gave a 32% speedup. Changing the write4bits method to shift the nibble directly into the port gave only an additional 1.1% speed increase over the pure DigitalPin version.

That is, unless I totally mangled my direct port code, which is a very real possibility:

PORTC = (PORTC & (~B00111100)) | ((value << 2) & B00111100); // D0-3 on A2-A5
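For anyone who wants to sanity-check that masking off the board, here is a minimal sketch of the same read-modify-write as a pure function. The names writeNibble, kDataMask and kDataShift are made up for illustration and are not part of LiquidCrystal or DigitalPin, and it assumes A2-A5 map to bits 2-5 of the port, as on an ATmega328-based board.

```cpp
#include <cstdint>

// Assumption: the four LCD data lines sit on bits 2-5 of one port
// (A2-A5 on an ATmega328-based board).
constexpr uint8_t kDataMask  = 0x3C;  // same value as B00111100
constexpr uint8_t kDataShift = 2;

// Hypothetical helper: returns the new port value with the low nibble of
// `value` placed on bits 2-5 and every other bit of `port` left unchanged.
constexpr uint8_t writeNibble(uint8_t port, uint8_t value) {
    return static_cast<uint8_t>((port & ~kDataMask) |
                                ((value << kDataShift) & kDataMask));
}

// Compile-time checks of the bit arithmetic.
static_assert(writeNibble(0x00, 0x0F) == 0x3C, "nibble lands on bits 2-5");
static_assert(writeNibble(0xFF, 0x00) == 0xC3, "bits 0,1,6,7 are preserved");
static_assert(writeNibble(0x81, 0x05) == 0x95, "mixed case");
```

On the actual board the equivalent call would be PORTC = writeNibble(PORTC, value); which should compile down to the same thing as the one-liner above. If the static_asserts pass, at least the masking isn't what I mangled.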

When I removed the section that sets the pins to output in each write, the difference between DigitalPin and direct port was only 0.06%.

Is that basically what you're getting at or did I miss the point entirely?