A basic for loop generally costs about two instructions of overhead per iteration (not sure about the ATmega, I could be off by one instruction either way). That doesn't account for compiler optimizations like loop unrolling.
Speaking of unrolling, you can stuff more work into your loop to mitigate the loop overhead. For example:
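float f1 = 0.0, f2 = 0.0, f3 = 0.0, f4 = 0.0, f5 = 0.0;
int i;
unsigned long start, end;   // millis() returns unsigned long

start = millis();
for (i = 0; i < 10000; i++)
{
  // five additions per pass, so the loop overhead is paid once per five adds
  f1 += 3.14159;
  f2 += 3.14159;
  f3 += 3.14159;
  f4 += 3.14159;
  f5 += 3.14159;
}
end = millis() - start;     // despite the name, this holds the elapsed time in ms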
...
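If you want to put a number on the loop overhead itself, one rough approach is to time the loop twice, once with a single addition in the body and once with five, and solve for the two unknowns. Here's a minimal sketch of that arithmetic in plain C++ (so it runs on a PC, not just the board). The names LoopCost and splitLoopCost are made up for this example, and it assumes each addition costs the same and the compiler hasn't optimized either loop away.

```cpp
#include <cassert>

// Hypothetical helper, not part of any Arduino library.
struct LoopCost {
    double overheadUs;  // loop overhead per iteration, in microseconds
    double addUs;       // cost of one float addition, in microseconds
};

// msOneAdd:   elapsed ms for the loop with ONE addition in the body
// msFiveAdds: elapsed ms for the loop with FIVE additions in the body
// iterations: loop count used for both runs (10000 in the post above)
//
// Model (an assumption): time per iteration = overhead + n * add
LoopCost splitLoopCost(double msOneAdd, double msFiveAdds, long iterations) {
    assert(iterations > 0);

    // Convert total milliseconds to microseconds per iteration.
    double perIter1 = msOneAdd * 1000.0 / iterations;
    double perIter5 = msFiveAdds * 1000.0 / iterations;

    LoopCost c;
    // The two runs differ by exactly four additions per iteration.
    c.addUs = (perIter5 - perIter1) / 4.0;
    // Whatever is left of the one-add run is loop overhead.
    c.overheadUs = perIter1 - c.addUs;
    return c;
}
```

As a made-up illustration (these are not measurements): if the one-add loop took 110 ms and the five-add loop took 510 ms over 10000 iterations, splitLoopCost(110.0, 510.0, 10000) works out to 10 us per addition and 1 us of overhead per iteration. Keep in mind millis() only has 1 ms resolution, so short runs will give noisy results.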
-j