Personally, I wouldn't use clock cycles for timing, since the clock rate can change. The exception is very short intervals, where rounding becomes an issue, and in that case the delay has to be integrated with the operations around it.
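To put numbers on the rounding issue, here is a minimal sketch that converts a delay in nanoseconds into a whole number of clock cycles, rounding up. The name `cycles_for_ns` and the clock rates used with it are hypothetical, chosen only for illustration. The point is that the same delay maps to a different cycle count as soon as the clock rate changes.

```c
#include <stdint.h>

/* Hypothetical helper: whole clock cycles needed to cover at least `ns`
   nanoseconds at `cpu_hz`, rounded up so the delay is never too short. */
static uint64_t cycles_for_ns(uint32_t cpu_hz, uint32_t ns)
{
    /* cpu_hz * ns is the cycle count scaled by 1e9; 64 bits avoids overflow. */
    uint64_t scaled = (uint64_t)cpu_hz * ns;
    return (scaled + 999999999ULL) / 1000000000ULL;
}
```

At an assumed 16 MHz, 100 ns is 1.6 cycles, so `cycles_for_ns(16000000UL, 100)` rounds up to 2 cycles, which is really 125 ns. Halve the clock and every cycle count written into the code is off by a factor of two.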
You can't get in and out of a function in much under a microsecond, so if you need better resolution than that, the delay has to be inline. For very short delays you can...
__asm__( "nop\n\t"
"nop\n\t");
... for as many nops as you need, at one cycle per nop. That can become a significant code size burden for longer delays. If you have a register to spare, there are probably two-cycle instructions that give better code density. At some point it is better to loop and count.
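A loop-and-count delay might look like the sketch below. It assumes a GCC-style compiler and a target whose assembler accepts `nop`. The name `spin_delay` is hypothetical. The number of cycles each iteration takes depends on the target and on what the compiler generates, so treat the count as something to calibrate rather than a known constant.

```c
#include <stdint.h>

/* Hypothetical helper: burn time by counting down to zero.
   The volatile asm statement keeps the compiler from deleting the
   otherwise empty loop. Cycles per iteration are not fixed by this
   source; check the generated code for the real figure. */
static inline void spin_delay(uint16_t count)
{
    while (count--) {
        __asm__ __volatile__("nop");
    }
}
```

The total cost of `spin_delay(n)` is some fixed setup overhead plus `n` times the per-iteration cost, and both numbers have to come from the disassembly or from measurement.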
If you need this sort of resolution, you will also want to use objdump to disassemble your code and see what the compiler did. The compiler has some amusing reordering rules that can put initialization computations inside your time-critical sections.