To give you a feel for it, take this sketch:
volatile unsigned long foo = 20;
volatile unsigned long bar = 22;
volatile unsigned long fubar;
void setup () { }
void loop ()
{
  fubar = foo + bar;
}
The loop function compiles to this code:
void loop ()
{
fubar = foo + bar;
be: 20 91 00 01 lds r18, 0x0100
c2: 30 91 01 01 lds r19, 0x0101
c6: 40 91 02 01 lds r20, 0x0102
ca: 50 91 03 01 lds r21, 0x0103
ce: 80 91 04 01 lds r24, 0x0104
d2: 90 91 05 01 lds r25, 0x0105
d6: a0 91 06 01 lds r26, 0x0106
da: b0 91 07 01 lds r27, 0x0107
de: 82 0f add r24, r18
e0: 93 1f adc r25, r19
e2: a4 1f adc r26, r20
e4: b5 1f adc r27, r21
e6: 80 93 1a 01 sts 0x011A, r24
ea: 90 93 1b 01 sts 0x011B, r25
ee: a0 93 1c 01 sts 0x011C, r26
f2: b0 93 1d 01 sts 0x011D, r27
}
f6: 08 95 ret
The lds/sts instructions take 2 clock cycles each, the add/adc instructions take 1.
12 x 2 + 4 = 28 cycles.
28 x 62.5 ns = 1.750 µs (one cycle is 62.5 ns at 16 MHz).
So a single add of two unsigned longs takes 1.750 µs, not counting the ret. I made them volatile to force the compiler to generate code; you wouldn't normally do that. Anyway, you can see that you don't want to be doing much more than 20 such things inside loop, or you will exceed 40 µs (22 of them already come to 38.5 µs).
However, when people ask very specific questions like this, it often helps to know why there is this requirement. Perhaps the problem can be solved in other ways (e.g. timers, interrupts).