When I was having some issues with speed, I ran the following sketch.
I checked the output on pin 3 with an oscilloscope and determined that the digitalWrite function (in v7) took about 9us. Since the waveform was close to a 50% duty cycle, it didn't appear that the overhead of the loop function was impacting performance much. If it were, in the above example, the low portion of the wave would have been longer than the high portion. On a Tek 2225 (?) analog scope I couldn't see much difference between the two with the timebase set to 2us (I think).
A similar test might help you identify or quantify the problem. Also, if you are using the new v8, it is possible that something in the new version has altered the behavior described above.
I think I remember someone mentioning that the digitalWrite function was going to be sped up. If that happened, it is certainly possible that the shorter digitalWrite time means the loop overhead now has a greater impact, percentage-wise, in such simple loops.
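
To put rough numbers on that last point, here is a small back-of-the-envelope model in plain C++ (host code, not an Arduino sketch). It assumes the test loop does nothing but a digitalWrite HIGH followed by a digitalWrite LOW, and that the pin changes state at the same point inside each call. The names `Waveform`, `model_waveform`, `duty_cycle` and `overhead_share` are made up for illustration.

```cpp
#include <cassert>

// Rough timing model of a loop that only does digitalWrite(HIGH) followed
// by digitalWrite(LOW). This is an assumed test loop, not a measured one.
struct Waveform {
    double high_us;  // time the pin spends high per cycle
    double low_us;   // time the pin spends low per cycle
};

// The pin stays high for roughly one write call (the LOW write), and stays
// low for the loop wrap-around plus one write call (the HIGH write).
Waveform model_waveform(double write_us, double loop_overhead_us) {
    assert(write_us > 0.0 && loop_overhead_us >= 0.0);
    return Waveform{write_us, write_us + loop_overhead_us};
}

// Fraction of each period that the pin is high.
double duty_cycle(const Waveform& w) {
    return w.high_us / (w.high_us + w.low_us);
}

// Fraction of each period that is spent on loop overhead.
double overhead_share(const Waveform& w, double loop_overhead_us) {
    return loop_overhead_us / (w.high_us + w.low_us);
}
```

Only the 9us figure comes from the scope measurement; the rest are guesses. With a 9us write and a hypothetical 0.5us of loop overhead, `duty_cycle` comes out at about 0.486 and the overhead is under 3% of the period, which would be hard to see on an analog scope. With a hypothetical 2us write and the same overhead, the duty cycle drops to about 0.444 and the overhead is about 11% of the period.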