Please use the code tags to properly format your code.
So basically you are saying that you need ~1 microsecond per count. Let's see what I would expect the while loop to translate to:
increment a two-byte integer
--> I would not expect this code to run much faster than that. I suggest using avr-objdump to disassemble the .elf file (for example with avr-objdump -d). Then use the instruction timings from the datasheet to count the cycles. Once you understand how many cycles this code takes, you can start to tune it.
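To put numbers on the cycle counting, here is a small sketch of the arithmetic. It assumes a 16 MHz clock, which is common on Arduino boards but is only an assumption about your hardware. The names F_CPU_HZ, cycles_to_ns and ns_to_cycles are made up for this example; the real cycle count of your loop has to come from the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 MHz CPU clock. Change this to match your board. */
#define F_CPU_HZ 16000000UL

/* Time in nanoseconds that `cycles` CPU cycles take at `f_cpu_hz`. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / f_cpu_hz);
}

/* Number of whole CPU cycles that fit into `ns` nanoseconds. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (uint32_t)(((uint64_t)ns * f_cpu_hz) / 1000000000ULL);
}
```

Under that assumption ns_to_cycles(1000, F_CPU_HZ) is 16, so ~1 microsecond per count means the loop body has a budget of about 16 cycles. If the disassembly adds up to roughly that many cycles, the loop is already doing what the hardware allows.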
I would expect that
while ((PIND & B00001000) == B00001000)
would be better implemented as
while (PIND & B00001000)
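The two loop conditions test the same thing as long as the mask has exactly one bit set. The following sketch checks that on a PC for all 256 possible port values. PIN_MASK and the function names are hypothetical, and 0x08 is written in place of B00001000 because the B... constants only exist in the Arduino headers. It says nothing about what the compiler generates; for that you still need the disassembly.

```c
#include <assert.h>
#include <stdint.h>

#define PIN_MASK 0x08u  /* same value as B00001000 (bit 3) */

/* Original form: mask the port value, then compare against the mask. */
static int pin_high_compare(uint8_t port_value)
{
    return (port_value & PIN_MASK) == PIN_MASK;
}

/* Shorter form: any nonzero result of the mask counts as true. */
static int pin_high_truthy(uint8_t port_value)
{
    return (port_value & PIN_MASK) != 0;
}

/* Returns 1 if both forms agree for every possible 8-bit port value. */
static int forms_agree(void)
{
    /* The equivalence relies on a single-bit mask. */
    assert((PIN_MASK & (PIN_MASK - 1u)) == 0u);

    for (unsigned v = 0; v <= 0xFFu; ++v) {
        if (pin_high_compare((uint8_t)v) != pin_high_truthy((uint8_t)v)) {
            return 0;
        }
    }
    return 1;
}
```

With a mask of more than one bit the two forms differ: the comparison requires all masked bits to be set, while the shorter form is true if any of them is set.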
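Also, I would suggest using a 1-byte integer for the counter if the count fits into 0..255. The AVR is an 8-bit CPU, so incrementing a two-byte integer takes more work than incrementing a single byte.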