Help with passing variables at global scope into inline assembly.

I'm trying to control a parallel LCD with a 595 shift register, and control >that< 595 with another 595, and using some of my leftover outputs on the second 595 to control a 597 (I'm low on pins...).

So, dissatisfied with code like:

void shiftout(void)
{
  uint8_t mask=1;
  for (int i=0; i<8; i++) {
    if ((outbyte & mask)==0)
      digitalWrite(dpin, LOW);
    else
      digitalWrite(dpin, HIGH);
    HL(dclk);  // Toggle clockpin.
  }
  HL(dlat);    // Toggle latchpin
}

because of the massive overhead that comes along with digitalWrite, I decided to try in assembly.
(Oh, this code worked just fine, but was only able to output about 2 characters per millisecond, or 20kbps. Waaay too slow for an ISR...)

I wrote:

void shiftout_asm(void)
{
  asm volatile (
    "SBI %[datport], %[datbit]"    "\n\t"  // digital write HIGH
    "SBRS %[outb], 0"              "\n\t"  // is bit 0 set?
    "CBI %[datport], %[datbit]"    "\n\t"  // No. dW low...
    "SBI %[clkport], %[clkbit]"    "\n\t"  // Clock high
    "CBI %[clkport], %[clkbit]"    "\n\t"  // and low..
    "SBI %[datport], %[datbit]"    "\n\t"  // digital write HIGH
    "SBRS %[outb], 1"              "\n\t"  // is bit 1 set?
    "CBI %[datport], %[datbit]"    "\n\t"  // No. dW low...
    "SBI %[clkport], %[clkbit]"    "\n\t"  // Clock high
    "CBI %[clkport], %[clkbit]"    "\n\t"  // and low..
    // ... rinse and repeat for bits 2-6, code omitted for brevity
    "SBI %[datport], %[datbit]"    "\n\t"  // digital write HIGH
    "SBRS %[outb], 7"              "\n\t"  // is bit 7 set?
    "CBI %[datport], %[datbit]"    "\n\t"  // No. dW low...
    "SBI %[clkport], %[clkbit]"    "\n\t"  // Clock high
    "CBI %[clkport], %[clkbit]"    "\n\t"  // and low..
    "SBI %[latport], %[latbit]"    "\n\t"  // latch high
    "CBI %[latport], %[latbit]"    "\n\t"  // And low, and we're
                                           // shifted and happy.
    : // no output operands
    : [datport] "I" (_SFR_IO_ADDR(PORTD)), [datbit] "I" (6), //PIN6
      [clkport] "I" (_SFR_IO_ADDR(PORTD)), [clkbit] "I" (3), //PIN3
      [latport] "I" (_SFR_IO_ADDR(PORTD)), [latbit] "I" (7), //PIN7
      [outb] "r" (outbyte)
    : // No clobbers
  );
}

And, after several hours of compiler-fighting, got it to work. Much more impressive, with upwards of 25kCPS, 250kbps...
If I stick this:
volatile uint8_t *data_port=&PORTD;
and the corresponding SFR in the input operands section to:
_SFR_IO_ADDR(*data_port)
It compiles and works fine.
However, if I move the definition of uint8_t data_port to global scope, it won't compile.
I get an "impossible constraint in ASM" error with the input operand that is defined at global scope.
I've not investigated this a great deal, but I can't even pass one of the immediate values (the pin numbers) if they're in variables declared at global scope.

This forum request for help may be a bit premature, but I'm on vacation, and this is driving me batty, trying to figure this out with awful wireless in hotels in Oregon.

Anyone been through this have an "Oh! Here's what you need to do" to throw at me?

Thanks.

because of the massive overhead that comes along with digitalWrite, I decided to try in assembly

You missed out the intermediate and much simpler option of direct port manipulation.

See discussion here:

http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1230286016

According to that, direct port manipulation was measured at 20x as fast as digitalWrite. Your figures are about 12.5x as fast. So you might want to look at direct ports, and let the compiler optimize your intentions.

Here is an example:

void setup ()
{
 pinMode (3, OUTPUT); 
 pinMode (4, OUTPUT); 
 pinMode (5, OUTPUT); 
}  // end of setup


void shiftout(byte c)
{
  
  for (int i=0; i<8; i++)
    {
    if (c & 1)
      PORTD |= 0x8;
    else
      PORTD &= ~0x8;
    
    PORTD |= 0x10;   // Toggle clockpin.
    PORTD &= ~0x10;
    
    c >>= 1;    // shift out LSB
    }
    PORTD |= 0x20;   // Toggle latchpin.
    PORTD &= ~0x20;
}  // end of shiftout

void loop ()
{
  
shiftout (0xA9);      // test
}   // end of loop

I hard-coded the bit values because I was lazy. But it demonstrates the idea. The timing?

The time between the start of each byte was 11.1667 uS. So taking the inverse of that gives you an output rate of 89551 bytes per second! That is 87 K bytes/second. So, not bad, huh?

(edit)

Changing the loop to use a byte rather than an int (all you need after all), reduced the time per byte to 10.0833 uS. That increased the data rate to 99174 bytes per second.

 for (byte i=0; i<8; i++)
...

And now that I look at the fact that you are using 3 pins, why not just use SPI? That clocks out data on one pin with the clock on the second. And then you just toggle the latch after each byte. But if the pins don't work out for you the direct ports are pretty fast.

Agree with the suggestion to use the hardware SPI pins for your shift register if they are available.

You could also use I2C I/O expanders like the MCP23017, and hang lots of them off the hardware I2C pins.

I'm building a box that has 8 total wires headed into it, and I want to be able to:

  1. Read a rotary encoder
  2. Drive an LCD display
  3. Read a 3-position switch
  4. Read 3 SPST switches
  5. (Optional) drive a high-intensity LED with PWM

I can scrap the LED if need be.

I've got a 597 and a 595 working together nicely, with the 595 talking to a cheapo LCD in 4-bit mode (d7-d4 inclusive to the A-D outputs on the 595), RS (the command bit) coming from E on the 595, and the ENA pin on the LCD taking up an Arduino pin.
Then I stuck the 597's latch, load, and clock pins on the other three outputs of the 595.

So, to spit a byte to the LCD, I did:
ENA low
put RS followed by HIGH nibble into the 595
Latch
Put RS (again) and low nibble into 595
Latch again
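
The sequence above can be sketched as a helper that builds the two bytes to load into the 595. This is only a sketch: the bit layout (LCD D4-D7 on bits 0-3, RS on bit 4) is an assumption read from the wiring description, and LcdFrames / lcd_frames are hypothetical names, not part of any library.

```cpp
#include <stdint.h>

// Assumed wiring (the exact bit order is a guess from the description):
//   595 outputs A-D (bits 0-3) -> LCD D4-D7
//   595 output  E   (bit 4)    -> LCD RS
const uint8_t RS_MASK = 1u << 4;

// The two bytes to shift into the 595 for one LCD byte (hypothetical type).
struct LcdFrames {
  uint8_t high;  // RS + high nibble, sent first
  uint8_t low;   // RS + low nibble, sent second
};

// rs = true for character data, false for a command.
LcdFrames lcd_frames(uint8_t value, bool rs)
{
  const uint8_t rs_bit = rs ? RS_MASK : 0;
  LcdFrames f;
  f.high = static_cast<uint8_t>(rs_bit | (value >> 4));
  f.low  = static_cast<uint8_t>(rs_bit | (value & 0x0F));
  return f;
}
```

Each frame would be shifted out and latched in turn, with ENA handled separately as in the steps above.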

To mess about with the 597, I'd just scoot the bits over.
Problem came when I wanted to pack it all into an ISR. (Eventually, I'd like a CheapLCD.Print(const char *) method that will feed 8 or so bytes at a time into a circular buffer, and return when the buffer is empty. The same ISR that reads bytes off the queue will also (every 2ms or so) poll the 597. I should be able to debounce pretty quickly in the ISR.)
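
A circular buffer of the kind described could look roughly like this. It is a sketch under assumptions: ByteQueue and its size are made up for illustration, there is one writer (the main program) and one reader (the ISR), and the indices are single bytes so that each index read or write is one access on an 8-bit AVR.

```cpp
#include <stdint.h>

// Hypothetical single-producer / single-consumer byte queue.
// The size is a power of two so the index wrap is a cheap AND.
class ByteQueue {
 public:
  static const uint8_t kSize = 16;  // usable capacity is kSize - 1

  // Called from the main program (e.g. a Print method); false if full.
  bool push(uint8_t b)
  {
    const uint8_t next = static_cast<uint8_t>((head_ + 1) & (kSize - 1));
    if (next == tail_) return false;
    buf_[head_] = b;
    head_ = next;
    return true;
  }

  // Called from the ISR; false if nothing is queued.
  bool pop(uint8_t &out)
  {
    if (tail_ == head_) return false;
    out = buf_[tail_];
    tail_ = static_cast<uint8_t>((tail_ + 1) & (kSize - 1));
    return true;
  }

  bool empty() const { return head_ == tail_; }

 private:
  volatile uint8_t buf_[kSize];
  volatile uint8_t head_ = 0;  // written only by push()
  volatile uint8_t tail_ = 0;  // written only by pop()
};
```

A Print-style method would call push() for each character and spin on empty() before returning, while the timer ISR calls pop() once per tick.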
Problem was, writing this thing all nibbly like that, I had no real control over what the F, G and H bits out of the 595 (going to the 597) were without using yet another Arduino pin to toggle the Master Reset pin on the 595. So, I figure, why not talk to the LCD in 8-bit mode and devote an entire 595 to it? It can share a clock with the 597, and if I latch both of them through another 595, I can do input or output through a 595.
I did direct port manipulation, and got 3097 characters in 251 msec.
With assembly, I got 3097 characters in 128 msec.
With digitalWrite, I got 3097 characters in 1530 msec.
My routines are a bit more convoluted than the example I posted to open this thread up. Mostly, I was (and continue to be) puzzled by why I can't pass a value into inline assembly that's declared at global scope.
For example:

void setup(void)
{
  // meh;
}

void loop(void)
{
  // double meh
  asm_blob();
  delay(1200);

}
int bar[3]={ 1, 2, 7  };
void asm_blob(void)
{

  uint8_t var = bar[2];

  asm volatile (
    "push r24"                "\n\t"
    "ldi r24, %[someval]"    "\n\t"
    "pop r24"
    :
    : 
    [someval] "I" (var)
    :
  );

}

This doesn't compile.

Move the declaration of bar into function asm_blob, however...

void setup(void)
{
  // meh;
}

void loop(void)
{
  // double meh
  asm_blob();
  delay(1200);

}
void asm_blob(void)
{
  int bar[3]={ 1, 2, 7  };

  uint8_t var = bar[2];

  asm volatile (
    "push r24"                "\n\t"
    "ldi r24, %[someval]"    "\n\t"
    "pop r24"
    :
    :  [someval] "I" (var)
    :
  );
}

That works fine.

Global scope. Why u no compile?

dsacmul:
I did direct port manipulation, and got 3097 characters in 251 msec.
With assembly, I got 3097 characters in 128 msec.
With digitalWrite, I got 3097 characters in 1530 msec.

Well with my direct port manipulation I got a character out in 10.08 uS each whereas your assembly took 41.33 uS each. So you could still be better off tidying up the C version using port manipulation. And it will be easier to maintain later on.
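
The per-character figures come from simple division. A small sketch of the arithmetic (the function names are made up for this example): 128 ms over 3097 characters is about 41.33 uS each, 251 ms is about 81.05 uS each, and 1530 ms is about 494 uS each.

```cpp
// Average time per character in microseconds, from a total run time.
constexpr double us_per_char(double total_ms, double chars)
{
  return total_ms * 1000.0 / chars;
}

// Characters per second for a given time per character.
constexpr double chars_per_second(double us_each)
{
  return 1.0e6 / us_each;
}

// Figures quoted in this thread (3097 characters per run):
//   digitalWrite : 1530 ms
//   direct port  :  251 ms
//   inline asm   :  128 ms
```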

To save pins you can use an IO expander as I described here:

That gives you 16 input/output pins. Handy for reading the two switches for example. The MCP23017 is $US 1.44 from Digi-key (or $1.20 in lots of 10).

Why does:
int b=4;
void poo(void)
{
asm volatile (
"LDI r24, %[value]"
:
: [value] "I" (b)
:
);
}
not compile if b is at global scope, but does compile if b is in local scope?
I'm doing a bit more with my shift registers than I explained in the original post, and, when I'm done, I want a little board that'll handle a whole buncha stuff for $1.80 in parts. Plus, I'd like to be able to abstract the whole thing and encapsulate it in a class so that it'll be super spiffy. Right now, having to hardcode the ports is a definite downer.

I appreciate the wiring suggestions, but mostly I'm concerned with the scope issue.

Does the code compile if you change poo() to take an int value, and call it with a variable with global scope? The called function (poo()) will not know anything about the scope of the variable at the caller level.

I actually haven't tried that. I was thinking about it, but the question of why scope even matters was really bugging me.

dsacmul:

void asm_blob(void)
{
  int bar[3]={ 1, 2, 7  };

  uint8_t var = bar[2];

  asm volatile (
    "push r24"                "\n\t"
    "ldi r24, %[someval]"    "\n\t"
    "pop r24"
    :
    :  [someval] "I" (var)
    :
  );
}

That works fine.

Global scope. Why u no compile?

Nothing to do with scope, except indirectly.

ldi r24, %[someval]

ldi is "load immediate" (ie a number). You are passing a variable. It only works here because the compiler manages to optimize away your intentions by working out to generate:

ldi r24, 7

Actually...
"ldi r24, %[someval]" "\n\t"
"pop r24"
:
: [someval] "I" (var)

The "I" constraint is a constrained immediate value (on AVR, a constant in the range 0 to 63).
And, in my actual code (not the examples I posted here to flesh out the scope problem) the array in question is:
const uint8_t pinbits[14]={0, 1, ...};
so they're already consts, and my intention in this case is to actually pass the dang number in, but I can't if the array it's contained in is defined at global scope.

Try it yourself! It really does compile.

You can even specify (in the input operands section) an "I" constrained value directly, like [someval] "I" (7)
Near as I can tell, when gcc sees an "I" operand, it hardcodes in the value of the expression after the "I". What I can't understand is why scope matters.

The compiler may not be aware that your globally scoped variable can't change. Have you tried either:

a #define
-or-
using 'const'

?

A define works, as you might expect. This works too:

enum bar  { a = 1, b = 2, c = 7  };

void setup(void) {}

void loop(void)
{
  asm_blob();
  delay(1200);
}

void asm_blob(void)
{

  asm  volatile (
    "push r24"                "\n\t"
    "ldi r24, %[someval]"    "\n\t"
    "pop r24"
    :
    :  [someval] "I" (c)
    :
  );
}

This keeps the spirit of having things in a neat list.
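
As a rough illustration that can be compiled on a PC, a C++ template argument, like an "I" operand, has to be a compile-time constant. This is only an analogy and not what avr-gcc does internally: the "I" constraint is checked after optimization, which is presumably why the local-array version compiled. All names below are made up for the example, and it assumes a C++11 compiler.

```cpp
#include <stdint.h>

#define LATCH_BIT 7                 // preprocessor constant
enum { CLOCK_BIT = 3 };             // enumerator: always a constant
const uint8_t DATA_BIT = 6;         // const integer with a constant initializer
uint8_t runtime_bit = 6;            // ordinary global: value can change at run time

// Accepts only compile-time constants, loosely like an "I" operand.
template <uint8_t Bit>
constexpr uint8_t bit_mask()
{
  static_assert(Bit < 8, "bit number must be 0..7");
  return static_cast<uint8_t>(1u << Bit);
}

const uint8_t latch_mask = bit_mask<LATCH_BIT>();
const uint8_t clock_mask = bit_mask<CLOCK_BIT>();
const uint8_t data_mask  = bit_mask<DATA_BIT>();
// bit_mask<runtime_bit>() would be rejected: runtime_bit is not a constant.
```

The #define, the enumerator and the const integer are all accepted; the plain global is not, because nothing stops it from changing before the function runs.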

... my intention in this case is to actually pass the dang number in ...

I still think your original problem is that the compiler no longer views those variables as constant numbers and is complaining about the way you are trying to "ldi" them.

Nick:
Boy, I'm sure hoping you're wrong.
I'd like a user to be able to do:
(supposing I name my class IOMonster)
IOMonster io(4,5,6,7,8); // cmd clock, i/o clock, i/o data, cmd_data, cmd_latch

io.WriteLCD("This is a test.");
if (io.InputAvailable()) {
  // ... some meaningful code ...
}
io.SetLED1(50); // Set led1 at 50% duty cycle...

This is all dependent (if I'm using assembly) on having the relevant port and bit passed into my assembly routine.
At one point (when I was walking along the Oregon coastline) I thought about coding a huge bunch of #ifdefs with assembly inside them to correspond to every possible port / bit combo, but decided that was a non-starter.

If I can't pass an "I" constrained value defined at global scope into my assembly, I'll be stuck with direct I/O, which is half as fast as assembly.

Anyone know about the rules concerning passing variables at global scope into inline asm?

dsacmul:
This is all dependent (if I'm using assembly) on having the relevant port and bit passed into my assembly routine.
At one point (when I was walking along the Oregon coastline) I thought about coding a huge bunch of #ifdefs with assembly inside them to correspond to every possible port / bit combo, but decided that was a non-starter.

Code's already been written (on the Google Code Archive) for direct port IO with arbitrarily defined pins, but you won't get a speed advantage unless the pins are constants and known to the compiler. Maybe something like this might work:

  switch(pin) {
    case 0: digitalWriteFast(0, val); break;
    // ...
  }

According to Mr. Gammon's data, direct port manipulation is quite a bit faster than your assembly - it's also simpler, and doesn't have this passing-variable issue.

I'll be stuck with direct I/O, which is half as fast as assembly.

It wasn't half as fast in my test. Perhaps you should be concentrating on making the C code faster. I mean, ultimately, if you ask the compiler to do "sensible" things (eg. use bytes rather than ints) then the generated assembler is going to be similar to what you are trying to force it to do directly.

I'd like a user to be able to do:
(supposing I name my class IOMonster)
IOMonster io(4,5,6,7,8); // cmd clock, i/o clock, i/o data, cmd_data, cmd_latch

Well, this means the numbers have to be stored in variables, right? And ldi loads a number, not a variable. So ultimately the compiler/assembler has to access the memory location where you stored the 4 and get the 4 out of it, rather than emit "ldi r24,4". If you want variables, that is the price you pay.

You can probably get the syntax right to pull the variable contents in; I must admit examples were thin on the ground. But once again, the C compiler doesn't just sprinkle in "extra stuff to make it slower" for its amusement. If you code the same general idea that you were trying to do in assembler in C (e.g. direct port access), then it should run as fast.
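
One way to keep a tidy user-facing declaration while still handing the compiler constants is to pass the pins as template arguments rather than constructor arguments. A minimal sketch, assuming a C++11 toolchain: ShiftPins and LcdPins are hypothetical names, not an existing library, and the port is passed by reference so the same code can be tried on a PC with a plain variable standing in for PORTD.

```cpp
#include <stdint.h>

// Pin positions are template parameters, so every mask below is a
// compile-time constant. Hypothetical class for illustration only.
template <uint8_t DataBit, uint8_t ClockBit, uint8_t LatchBit>
class ShiftPins {
  static_assert(DataBit < 8 && ClockBit < 8 && LatchBit < 8,
                "bit numbers must be 0..7");

 public:
  static constexpr uint8_t data_mask()  { return static_cast<uint8_t>(1u << DataBit); }
  static constexpr uint8_t clock_mask() { return static_cast<uint8_t>(1u << ClockBit); }
  static constexpr uint8_t latch_mask() { return static_cast<uint8_t>(1u << LatchBit); }

  // Drive the data line high or low.
  static void set_data(volatile uint8_t &port, bool high)
  {
    if (high) port = static_cast<uint8_t>(port | data_mask());
    else      port = static_cast<uint8_t>(port & ~data_mask());
  }

  // Pulse a line high then low, leaving it low.
  static void pulse_clock(volatile uint8_t &port) { pulse(port, clock_mask()); }
  static void pulse_latch(volatile uint8_t &port) { pulse(port, latch_mask()); }

 private:
  static void pulse(volatile uint8_t &port, uint8_t mask)
  {
    port = static_cast<uint8_t>(port | mask);
    port = static_cast<uint8_t>(port & ~mask);
  }
};

// Example instantiation: data on bit 6, clock on bit 3, latch on bit 7.
typedef ShiftPins<6, 3, 7> LcdPins;
```

On the Arduino the call would be something like LcdPins::set_data(PORTD, true). Whether that ends up as a single sbi depends on inlining and optimization settings, so it is worth checking the generated assembly rather than assuming.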

Try the C version ... if it is still much slower than you expect post your code. Your loops may not be written optimally. For example in your original you had:

void shiftout(void)
{
  uint8_t mask=1;
  for (int i=0; i<8; i++) {
    if ((outbyte & mask)==0)
      digitalWrite(dpin, LOW);
    else
      digitalWrite(dpin, HIGH);
    HL(dclk);  // Toggle clockpin.
  }
  HL(dlat);    // Toggle latchpin
}

There's a few problems there. For one, I don't see mask changing. For another you are using an int in the loop where you could be using a byte. For a third you are calling another function (HL) when you could toggle the pin inline. Ditto for the latch. Plus of course you are using digitalWrite rather than direct port access. Tidy all that up and you should have a nice fast routine.
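
Putting those fixes together might look something like the following. It is a sketch only: the pin assignments are borrowed from the direct-port example earlier in the thread, and SimPort is a made-up stand-in for PORTD with a crude shift-register model attached, so the bit order can be checked on a PC. On the Arduino the write() calls would become PORTD |= ... and PORTD &= ~... instead.

```cpp
#include <stdint.h>

const uint8_t DATA_MASK  = 1u << 3;  // assumed data pin (PD3)
const uint8_t CLOCK_MASK = 1u << 4;  // assumed clock pin (PD4)
const uint8_t LATCH_MASK = 1u << 5;  // assumed latch pin (PD5)

// Host-only stand-in for PORTD with a crude shift-register model attached.
struct SimPort {
  uint8_t value;    // current pin levels
  uint8_t shifted;  // bits clocked in so far, newest in bit 0
  uint8_t latched;  // copy of 'shifted' taken on a latch rising edge

  SimPort() : value(0), shifted(0), latched(0) {}

  void write(uint8_t next)
  {
    const uint8_t rising = static_cast<uint8_t>(next & ~value);
    if (rising & CLOCK_MASK)
      shifted = static_cast<uint8_t>((shifted << 1) | ((next & DATA_MASK) ? 1 : 0));
    if (rising & LATCH_MASK)
      latched = shifted;
    value = next;
  }
};

void shiftout(SimPort &port, uint8_t outbyte)
{
  uint8_t mask = 1;
  for (uint8_t i = 0; i < 8; i++) {        // byte counter, not int
    if (outbyte & mask)
      port.write(port.value | DATA_MASK);
    else
      port.write(port.value & ~DATA_MASK);
    port.write(port.value | CLOCK_MASK);   // toggle clock inline
    port.write(port.value & ~CLOCK_MASK);
    mask <<= 1;                            // move on to the next bit
  }
  port.write(port.value | LATCH_MASK);     // toggle latch inline
  port.write(port.value & ~LATCH_MASK);
}
```

Because the mask starts at 1, bits go out least significant first, so the model ends up holding the bit-reversed byte.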