Interrupt Speed Optimisation

Hi guys,

working on a project here with a TLC5917 and need to drive it at a particular rate to be compatible with other equipment. The code works, but the ISR is too slow. I need this to fire off every 17uS but getting the execution time below this is proving difficult. (I can see that the interrupts are stacking up on the oscilloscope by monitoring the frequency of the CLK pin which as you will see, toggles on every call to the the ISR). By throwing my ISR into a loop on it's own, it seems to take around 100us give-or-take to execute.

Any tips or advice would be most appreciated.

#include <TimerOne.h>

#define SDI 2
#define CLK 3
#define LE  4
#define OE  5

uint8_t DATA_LENGTH = 3;
uint8_t DATA_BUF[3] = {0x55, 0x55, 0x55};

unsigned long cycleNo = 0;
uint8_t oddCycle = 0;


void setup()
{
  pinMode(SDI, OUTPUT);
  pinMode(CLK, OUTPUT);
  pinMode(LE, OUTPUT);
  pinMode(OE, OUTPUT);

  //Timer1.initialize(17);
}

void loop()
{
  Timer1.attachInterrupt( sendDataISR );
  
  delay(1000);
}

void sendDataISR()
{
  if(cycleNo == (240 * DATA_LENGTH))  //(24 cycles per bit, 10 bits per byte)
  {
    Timer1.detachInterrupt();
    bitClear(PORTD, SDI);
    bitClear(PORTD, CLK);
    bitClear(PORTD, LE);
    bitSet(PORTD, OE);
    
    cycleNo = 0;
    return;
  }
  
  
  //CLK
  oddCycle = cycleNo % 2;
  if(oddCycle)                       //If odd cycle
    bitSet(PORTD, CLK);
  else
    bitClear(PORTD, CLK);
    
    
    
  //SDI
  uint8_t rawBitNo = cycleNo / 24;   //when bytes are combined
  uint8_t curBitNo = rawBitNo % 10;  //10 bits per byte
  uint8_t curByteNo = rawBitNo / 10; //^^^^
  
  switch(curBitNo)
  {
    case 0:
      bitClear(PORTD, SDI);
      break;
    case 9:
      bitSet(PORTD, SDI);
      break;
    default:
      if(DATA_BUF[curByteNo] & (1 << curBitNo - 1))
        bitSet(PORTD, SDI);
      else
        bitClear(PORTD, SDI);
      break; 
  }
  
  
  //LE
  uint8_t div = cycleNo % 24;
  if(div >= 20 && div < 22)
    bitSet(PORTD, LE);  
  else
    bitClear(PORTD, LE);
    
    
  //OE
  if((cycleNo > 24) && oddCycle)
    bitClear(PORTD, OE);
  else
    bitSet(PORTD, OE);

  cycleNo++;
}

Do you need to detach the Timer1 interrupt each time the interrupt is triggered ?

 if(cycleNo == (240 * DATA_LENGTH))  //(24 cycles per bit, 10 bits per byte)

I suspect that the compiler may have already optimised this away but could you not use

 if(cycleNo == 720)  //(24 cycles per bit, 10 bits per byte)

instead ?

Instead of using bitSet() and bitClear() would it be faster to manipulate the bitwise values using the & and | operators ?

Hi, thanks for your reply.

I do need to detach the timer again in order to stop manipulating the chip, else my routine will run indefinitely.

With regards to 240 * DATA_LENGTH, in this situation I believe it should have been optimised out, but in the actual application of my ISR, DATA_LENGTH will vary. I don't believe a single multiplication of two bytes should cause too much issue though.

I had wondered if bitSet and bitClear might have taken too long, so I wrote an iteration using a dummy variables and settings the value of PORTD at the end of the ISR but this didn't make any noticeable difference.

Commenting out the SDI section yields a large decrease in execution time (100us down to 25us). Commenting out LE aswell brings execution time down to around 6us.

I have found what I believe to be part of the issue:

uint8_t div = cycleNo % 24;

commenting this line of code out brings execution time with the rest of LE in (SDI still out) down from 25us to 7us.

Modulo 24 is going to be awkward - maybe a simple counter instead?

Okay, I used a counter to replace the modulo 24 line so that's LE sorted (thanks for that). SDI is a bit awkward though. %10 isn't causing too much issue; it's the switch statement.

Nested if statements don't seem to be helping though.

running at 8us without the switch statement currently. Will double check my timings with nested if statements and with switch.

EDIT: Scratch that. I think without the switch statement, the variable definitions are getting optimised out.

working on a project here with a TLC5917 and need to drive it at a particular rate to be compatible with other equipment.

Something to consider.

If you're using 3 TLC5917s cascaded (24 bit) and 57,600 baud (17.36µs) and using an AVR type Arduino @ 16MHz, then from this table, the error is 2.1% which works out to 8.75µs error at the 24th bit.

At 38,400 baud, the timing error is 1.25µs at the 24th bit.

Stop using that timer lib! and use on of the built in timer/counters with a real ISR .

The attach/detach methods are expensive! Much better to use a flag inside the ISR to control what the ISR does.

Mark

Cosford:
I had wondered if bitSet and bitClear might have taken too long

They are macros for direct port writes, so no problem there.

Commenting out the SDI section yields a large decrease in execution time (100us down to 25us). Commenting out LE aswell brings execution time down to around 6us.

I have found what I believe to be part of the issue:

uint8_t div = cycleNo % 24;

commenting this line of code out brings execution time with the rest of LE in (SDI still out) down from 25us to 7us.

There is no divide instruction in the processor so it is a slow operation. I had to do an 8-bit division by a fixed number in an ISR recently and found that in my case it took about 6us. You have four of them, so unless the specific dividends require more processing it doesn't add up. One would also hope that the compiler would be smart enough to reduce the divde and modulo into a single divide, but who knows without looking at the assembly output.

In any case, what I ended up doing was replace the division with a multiply-shift operation. In my case it went from about 6us to a around 0.6us. Now my whole ISR takes less time than that division took.

The long thread is here, but the gist is that instead of writing x = y/26, you would instead do something like x = y*79<<11. The modulo could be obtained from this with an additional multiply and subtraction. The trick is finding the correct multiplier and shift values. There is a link within that thread that gives a method for this.

mod24 problem is best solved with an additional counter.

ISR()
{
  ...
  int8_t div = cycleNo % 24;
  ...
  cycleNo++;
}

==>

ISR()
{
  ...
  int8_t div = mod24cycleNo;
  ...
  mod24cycleNo++;
  if (mod24cycleNo == 24) mod24cycleNo = 0;
  cycleNo++;
}

robtillaart:
mod24 problem is best solved with an additional counter.

Oh, I see I missed that he'd already dealt with that problem.

I looked at the assembly output and unfortunately the compiler (at least my version) generates two division function calls when doing x=y/10 and x=y%10 one right after the other. I expected it to be smarter.

And it is true that removing the switch results in these two divisions disappearing due to optimization.

Have you tested this code with the hardware connected? I'm wondering if it will work.

The switch statement looks like it's intended to append a start and stop bit to each byte. I don't see anything in the TLC5917 datasheet that suggests that it needs start and stop bits, and it looks like they'll only interfere with the actual data.

Can you tell us what you expect this code to do? It looks like it's intended to send each bit 24 times - enough to set every output pin on three TLC5917's to the same state. At the same time, it only sets the CLK pin on odd-numbered cycles, so it will actually send each bit twelve times. What are you expecting to see on the LEDs?

You might be optimizing prematurely - before you have working hardware and code. I'd recommend that you build it, and test at a low data rate, to make sure that the code does what you want.

I would stop using the TimerOne library as suggested above. Instead make your own ISR routine, eg.

ISR (TIMER1_COMPA_vect)          // <---- Timer A compare match ISR
{
  // whatever
}

The overhead of the library calling a function from its copy of this ISR adds some time, plus the compiler than has to push and pop a whole lot more registers. I think you can save possibly 1.2 µS by doing that.

unsigned long cycleNo = 0;
uint8_t oddCycle = 0;

Variables altered in an ISR should be declared volatile.

I adjusted the sketch to use the timer directly:

const byte SDI = 2;
const byte CLK = 3;
const byte LE  = 4;
const byte OE  = 5;

const byte DEBUGPIN = 9;

const uint8_t DATA_LENGTH = 3;
volatile uint8_t DATA_BUF[3] = {  0x55, 0x55, 0x55};

volatile unsigned long cycleNo = 0;
volatile uint8_t oddCycle = 0;

void setup()
{
  pinMode(SDI, OUTPUT);
  pinMode(CLK, OUTPUT);
  pinMode(LE, OUTPUT);
  pinMode(OE, OUTPUT);

  pinMode (DEBUGPIN, OUTPUT);
  
  // timer 1 off
  TCCR1A = 0;
  TCCR1B = 0;
  
  bitSet (TCCR1A, COM1A0);   // toggle OC1A on compare
  bitSet (TIMSK1, OCIE1A);   // turn on ISR
  
}  // end of setup

void loop()
{
  bitSet (TIFR1,  OCF1A);              // clear any outstanding interrupt flag
  OCR1A = (16 * 17) - 1;               // zero relative - 17 µS at 16 MHz
  TCNT1 = 0;                           // timer back to zero
  TCCR1B = bit (WGM12) | bit (CS10);   // CTC, prescaler of 1
    
  delay(1000);    // debugging
}  // end of loop

// here when timer fires
ISR (TIMER1_COMPA_vect) 
{
  // toggle CLK
  bitSet (PIND, 3);  // D3 happens to be PIND, bit 3
  
  if(cycleNo == (240 * DATA_LENGTH))  //(24 cycles per bit, 10 bits per byte)
  {
    TCCR1B = 0;                 // timer off
    bitClear(PORTD, SDI);
    bitClear(PORTD, CLK);
    bitClear(PORTD, LE);
    bitSet(PORTD, OE);
    cycleNo = 0;
    return;
  }

 /*
  //SDI
  uint8_t rawBitNo = cycleNo / 24;   //when bytes are combined
  uint8_t curBitNo = rawBitNo % 10;  //10 bits per byte
  uint8_t curByteNo = rawBitNo / 10; //^^^^

  switch(curBitNo)
  {
  case 0:
    bitClear(PORTD, SDI);
    break;
  case 9:
    bitSet(PORTD, SDI);
    break;
  default:
    if(DATA_BUF[curByteNo] & (1 << curBitNo - 1))
      bitSet(PORTD, SDI);
    else
      bitClear(PORTD, SDI);
    break; 
  }

  //LE
  uint8_t div = cycleNo % 24;
  if(div >= 20 && div < 22)
    bitSet(PORTD, LE);  
  else
    bitClear(PORTD, LE);

  //OE
  if((cycleNo > 24) && oddCycle)
    bitClear(PORTD, OE);
  else
    bitSet(PORTD, OE);
*/

  cycleNo++;
}  // end of ISR (TIMER1_COMPA_vect)

I commented out the code that takes a while. You can work on improving its speed. You can toggle CLK by writing to PIND as illustrated, which should save a few cycles.

On the logic analyzer I measure 1.69µS between when the timer fires and when the ISR is entered. I measured that by outputting the clock signal on pin D9.

Sometimes it took a bit longer (eg. 2.5µS, but that is to be expected because of other interrupts (Timer 0) firing).

With the code commented out, the ISR takes 2.25µS to execute, so you have some leeway there (out of the 17µS).

In other words, around 4µS from the timer firing to being about to leave the ISR. So you can probably reckon on about 11 or so µS for doing other stuff. That would be 176 clock cycles, so used wisely, you should get a fair bit done.

const byte SDI = 2;
const byte CLK = 3;
const byte LE  = 4;
const byte OE  = 5;

...

  pinMode(SDI, OUTPUT);
  pinMode(CLK, OUTPUT);
  pinMode(LE, OUTPUT);
  pinMode(OE, OUTPUT);


...

    bitClear(PORTD, SDI);

This is pretty "iffy" by the way.

You are confusing Arduino pin numbers with processor bit numbers. For example, although D2 happens to be port D, bit position 2, that does not apply to all pins.

For example, D9 on the board is PORT B, bit number 1.

The numbers happen to agree for the first 7 pins, but after that they don't.

To be consistent, stick to "port" numbers, in which case the pinMode should be changed to:

   bitSet (DDRD, SDI);
   bitSet (DDRD, CLK);
   bitSet (DDRD, LE);
   bitSet (DDRD, OE);

You can replace:

bitClear(PORTD, SDI);
bitClear(PORTD, CLK);
bitClear(PORTD, LE);
bitSet(PORTD, OE);

with:

PORTD = b00100000;

JimEli:
You can replace:

bitClear(PORTD, SDI);

bitClear(PORTD, CLK);
bitClear(PORTD, LE);
bitSet(PORTD, OE);




with:


PORTD = b00100000;

You could replace it with that, but it would be better to make it more legible.

AWOL:
You could replace it with that, but it would be better to make it more legible.

ok,

PORTD = b00100000; //clear SDI, CLK, LE and set OE bits

JimEli:
ok,

PORTD = b00100000; //clear SDI, CLK, LE and set OE bits

Laugh?

I thought I'd never start.

FWIW, OP wanted execution time improvement.

  bitClear(PORTD, SDI);
  bitClear(PORTD, CLK);
  bitClear(PORTD, LE);
  bitSet(PORTD, OE);
  //compiles to:
  //  cbi  0x0b, 2 ;2 cycles
  //  cbi  0x0b, 3 ;2 cycles
  //  cbi  0x0b, 4 ;2 cycles
  //  sbi  0x0b, 5 ;2 cycles

  PORTD = 0b00100000; //clear SDI, CLK, LE bit and set OE bit
  //compiles to:
  //  ldi  r24, 0x20 ;1 cycle
  //  out  0x0b, r24 ;1 cycle

Result is 1/4 of the execution time.