You are right, of course. I should have said "constant" instead of "variable". And since I was particularly interested in minimum pulse generation, I wasn't concerned with a variable value. The change to separate set, clear, and toggle registers in the newer AVRs was really necessitated by the limited number of bit-addressable I/O registers and the need for atomic bit set and clear operations with the larger number of ports. The ugly case of the ATmega2560, which has some ports whose bits can be set or cleared individually but others that can't, shows the limitations of the original design.
I do see a pulse width of 20ns, so for some reason there is a difference between the code generated by my macros and the code generated by your inline function (which presumably will optimize out the if statement if val is constant). The short pulse does rely on all the register loading of address and mask being done before the stores into OUTSET and OUTCLR. I see that is the case for the macros.
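To make the comparison concrete, here is a rough sketch of the two styles as I understand them. This is not the actual code from either of us: `port_t` is a hypothetical stand-in for the port register block so the snippet compiles anywhere (on a real part you would use the port definitions from `<avr/io.h>`), and the names `PIN_SET`, `PIN_CLR`, `PIN_PULSE` and `pin_write` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a port register block with separate
   set/clear/toggle registers. On real hardware, use the definitions
   from <avr/io.h> instead of this struct. */
typedef struct {
    volatile uint8_t OUT;
    volatile uint8_t OUTSET;  /* on hardware: writing 1s sets those pins    */
    volatile uint8_t OUTCLR;  /* on hardware: writing 1s clears those pins  */
    volatile uint8_t OUTTGL;  /* on hardware: writing 1s toggles those pins */
} port_t;

/* Macro style: each operation is a single store, no read-modify-write. */
#define PIN_SET(port, mask)  ((port)->OUTSET = (mask))
#define PIN_CLR(port, mask)  ((port)->OUTCLR = (mask))

/* Shortest pulse: set immediately followed by clear. */
#define PIN_PULSE(port, mask) \
    do { PIN_SET(port, mask); PIN_CLR(port, mask); } while (0)

/* Inline-function style: when val is a compile-time constant the
   compiler can be expected to drop the branch and keep one store. */
static inline void pin_write(port_t *port, uint8_t mask, uint8_t val)
{
    if (val)
        port->OUTSET = mask;
    else
        port->OUTCLR = mask;
}
```

With a constant val, both styles should boil down to plain stores. Whether the two stores of a pulse end up back to back depends on the compiler loading the address and mask into registers before the first store, which is worth checking in the generated assembly rather than assuming.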
I did also notice that if I generate a string of pulses, occasionally (but deterministically, and seemingly for no reason) there will be a delay of 20ns thrown in, as though an instruction fetch is not being done in parallel. So yes, the timing isn't consistent.