Help with passing variables at global scope into inline assembly.

I'll be stuck with direct I/O, which is half as fast as assembly.

It wasn't half as fast in my test. Perhaps you should be concentrating on making the C code faster. I mean, ultimately, if you ask the compiler to do "sensible" things (e.g. use bytes rather than ints) then the generated assembler is going to be similar to what you are trying to force it to do directly.

I'd like a user to be able to do:
(supposing I name my class IOMonster)
IOMonster io(4,5,6,7,8) // cmd clock, i/o clock, i/o data, cmd_data, cmd_latch

Well, this means the numbers have to be stored in variables, right? And ldi loads a constant, not a variable. So ultimately the compiler/assembler has to access the memory location where you stored the 4 and get the 4 out of it, rather than emitting "ldi r24,4". If you want variables, that is the price you pay.
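To make the constant-versus-variable point concrete, here is a rough sketch. Everything in it apart from the IOMonster name is my own invention for illustration: fakePort stands in for a real output register so it also compiles on a PC, the class takes a single bit position rather than your five pins, and I'm assuming the pin is simply a bit number (0 to 7) within one port.

```cpp
#include <stdint.h>

// Stand-in for an AVR output register such as PORTD (assumption), so the
// sketch also compiles on a PC. On the board you would use the real register.
volatile uint8_t fakePort = 0;

// Pin fixed at compile time: the mask is a constant, so the compiler is free
// to encode it directly in the instruction as an immediate operand.
const uint8_t FIXED_DATA_BIT = 6;   // hypothetical bit position

void fixedDataHigh() {
  fakePort |= (uint8_t)(1 << FIXED_DATA_BIT);
}

// Pin chosen at run time: the mask is stored in a data member, so in general
// it has to be read from memory before it can be used.
class IOMonster {
public:
  // dataBit is assumed to be a bit position 0..7 within the one register.
  explicit IOMonster(uint8_t dataBit) : dataMask_((uint8_t)(1 << dataBit)) {}

  void dataHigh() const { fakePort |= dataMask_; }

private:
  uint8_t dataMask_;   // precomputed once, so no shifting is needed per write
};
```

Working out the mask once in the constructor at least keeps the per-write cost down to a load and an OR. Whether the optimizer can do better than that depends on how much it can see at compile time.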

You can probably get the syntax right to pull the variable contents in, though I must admit examples were thin on the ground. But once again, the C compiler doesn't just sprinkle in "extra stuff to make it slower" for its amusement. If you code the same general idea that you were trying to do in assembler, in C (e.g. direct port access), then it should run as fast.

Try the C version ... if it is still much slower than you expect post your code. Your loops may not be written optimally. For example in your original you had:

void shiftout(void)
{
  uint8_t mask=1;
  for (int i=0; i<8; i++) {
    if ((outbyte & mask)==0)
      digitalWrite(dpin, LOW);
    else
      digitalWrite(dpin, HIGH);
    HL(dclk);  // Toggle clockpin.
  }
  HL(dlat);    // Toggle latchpin
}

There are a few problems there. For one, I don't see mask changing. For another, you are using an int in the loop where you could be using a byte. For a third, you are calling another function (HL) when you could toggle the pin inline. Ditto for the latch. Plus, of course, you are using digitalWrite rather than direct port access. Tidy all that up and you should have a nice fast routine.
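Something along these lines is what I have in mind. Treat it as a sketch, not tested on your hardware. I'm assuming all three pins sit on the same port, that HL means "drive high, then low", and that you want the least significant bit sent first (since your mask started at 1). The names port, DATA_MASK, CLOCK_MASK and LATCH_MASK are made up, and port is an ordinary variable standing in for something like PORTD so the code also compiles on a PC.

```cpp
#include <stdint.h>

// Stand-in for an AVR output register such as PORTD (assumption).
// On the board, replace this with the real register.
volatile uint8_t port = 0;

// Hypothetical bit positions within that one register.
const uint8_t DATA_MASK  = 1 << 6;
const uint8_t CLOCK_MASK = 1 << 5;
const uint8_t LATCH_MASK = 1 << 4;

uint8_t outbyte = 0;   // byte to send, as in the original routine

void shiftout() {
  // The mask itself is the loop counter: it walks from bit 0 to bit 7 and
  // becomes 0 when the set bit is shifted out of the byte, ending the loop.
  for (uint8_t mask = 1; mask != 0; mask <<= 1) {
    if (outbyte & mask)
      port |= DATA_MASK;              // data line high
    else
      port &= (uint8_t)~DATA_MASK;    // data line low

    port |= CLOCK_MASK;               // pulse the clock inline: high...
    port &= (uint8_t)~CLOCK_MASK;     // ...then low
  }

  port |= LATCH_MASK;                 // pulse the latch the same way
  port &= (uint8_t)~LATCH_MASK;
}
```

Using the mask as the loop variable fixes the "mask never changes" bug and gets rid of the int counter in one go. If your shift register needs a minimum clock pulse width, check that the back-to-back writes are not too fast for it.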