Fastest possible array indexing ?

Hi,
I want to be able to read values from a ring buffer of bytes as fast as possible.

I think the absolute fastest is to simply increment the lo byte of a pointer to a 256-byte buffer
(and somehow arrange for the buffer to be on a 256-byte boundary)

However, this wastes a lot of RAM since I only really need 16 bytes or so. (I did this, now need more RAM :slight_smile: )

The next fastest I've found is to arrange a pointer to a 16-byte buffer aligned to a 32-byte boundary, increment the lo byte of the pointer and clear the 4th bit.
i.e.

union _bp{
  volatile uint8_t *ptr;  // we set this pointer into sampleBuffer at a 32-byte boundary
  struct _hl{
    uint8_t lo;           // and use this as the index by incrementing then clearing bit 4
    uint8_t hi;
  }hilo;
}samplePtr;


ISR(TIMER2_COMPA_vect)
{
  val= *samplePtr.ptr;        // extract value from ring buffer
  samplePtr.hilo.lo++;        // directly increment pointer lo byte
  samplePtr.hilo.lo &= 0xEF;  // clear bit 4    
  ...

...currently I allocate a 48-byte buffer and search for an appropriate entry point and set 'samplePtr', which wastes 32bytes.

So... a couple of questions:

  1. Is there any faster way of implementing such a ring buffer ?

  2. Is there some compiler directive/pragma/whatever which would align a 16-byte buffer to a 32-byte boundary?

Yours,
TonyWilk

P.S.
If anyone's interested, this is why...

I'm writing code for an Arduino Pro Mini (ATmega328) which has to update a bunch of serially-addressed LEDs and output audio at 16Kz sampling.

The LEDs are a pain because the datarate is stupidly high and the chain of LEDs resets if there's more than about 6uS gap in the bit train. The Audio out is a pain because even small delays in sample outputs generates an audible 'click'.

So, the LEDs are driven in mainline code and the PWM updates are done by the Timer2 ISR
The ISR is pared down to the absolute minimum: get a sample from a buffer, stuff it in the PWM register.
It is this indexing into the sample buffer which is critical.

Please show us the entire ISR.

Here's the ISR:

ISR(TIMER2_COMPA_vect) { // Called when TCNT2 == OCR2A
  
  //LED_ON();    // DEBUG - led on

  val= *samplePtr.ptr;        // extract value from ring buffer
  samplePtr.hilo.lo++;        // directly increment pointer lo byte
  samplePtr.hilo.lo &= 0xEF;  // clear bit 4   

  OCR1AL = val;               // Update the PWM output

  if( fastISR == 0 )          // if not in 'fastISR' mode
    refillBuffer();           // - we refill the buffer

  //LED_OFF();    // DEBUG - led off
}

Now you're gonna ask... what's 'fastISR' and 'refillBuffer()'

fastISR is a flag which is set just before the LED data is bit-banged out (which takes about 6 sample periods) so the ISR only takes 6.1uS during that time.

and this...

  //------------------------------------------
  // refill the sample buffer
  // - adds one sample
  // - then adds another if the buffer is not full
  // - called from the Timer2 ISR
  //
  void refillBuffer() __attribute__((used));
  void refillBuffer()
  {
    // refill sample buffer
    *(pSampleBuffer + bufFillIndex)= pGetSampleFunc();
    bufFillIndex++;
    bufFillIndex &= 0xEF;
    
    // catch-up fill sample buffer if not full
    if( bufFillIndex != (samplePtr.hilo.lo & 0x1f)){
      *(pSampleBuffer + bufFillIndex)= pGetSampleFunc();
      bufFillIndex++;
      bufFillIndex &= 0xEF;
    }  
  }

Yours,
TonyWilk

P.S. pGetSampleFunc is a pointer to the currently-active sample generation function which gets a wav sample or generates sounds by DDS etc.

That's a clever way of doing it. But sometimes it's not the clever C code that's the fastest. It all depends how the compiler turns it into machine code.

This code takes 16 clock cycles, not counting the pushes and pops:

ISR(TIMER2_COMPA_vect)
{
  byte val = *samplePtr.ptr;        // extract value from ring buffer
  samplePtr.hilo.lo++;        // directly increment pointer lo byte
  samplePtr.hilo.lo &= 0xEF;  // clear bit 4    
  OCR1AL = val;
}

Generated Assembly:
2	lds r30,samplePtr
2	lds r31,samplePtr+1
2	ld r25,Z
1	ldi r30,lo8(samplePtr)
1	ldi r31,hi8(samplePtr)
2	ld r24,Z
1	subi r24,lo8(-(1))
1	andi r24,lo8(-17)
2	st Z,r24
2	sts 136,r25
--------
16 total clock cycles

You can get the same 16 clock cycles but save a push/pop register pair by doing this:

ISR(TIMER2_COMPA_vect)
{
  OCR1AL = *samplePtr.ptr;        // extract value from ring buffer
  samplePtr.hilo.lo++;        // directly increment pointer lo byte
  samplePtr.hilo.lo &= 0xEF;  // clear bit 4    
}

Generated Assembly:
2	lds r30,samplePtr
2	lds r31,samplePtr+1
2	ld r24,Z
2	sts 136,r24
1	ldi r30,lo8(samplePtr)
1	ldi r31,hi8(samplePtr)
2	ld r24,Z
1	subi r24,lo8(-(1))
1	andi r24,lo8(-17)
2	st Z,r24
--------
16 total clock cycles

This is a little bit simpler, and saves you one clock cycle:

byte circularBuffer[16];
volatile byte cursor = 0;
ISR(TIMER1_COMPA_vect)
{
  OCR1AL = circularBuffer[cursor];        // extract value from ring buffer
  cursor = (cursor + 1) & 15;
}

Generated Assembly:
2	lds r30,cursor
1	ldi r31,0
1	subi r30,lo8(-(circularBuffer))
1	sbci r31,hi8(-(circularBuffer))
2	ld r24,Z
2	sts 136,r24
2	lds r24,cursor
1	subi r24,lo8(-(1))
1	andi r24,lo8(15)
2	sts cursor,r24
------------
15 total clock cycles

In general, I find that any time you can use a byte instead of an int, it'll save time. Enough time in this case to make up for the code it takes to index into the array. This way also saves you the extra wasted space making sure you're on a 16 byte boundary.

Excellent, just what I was after - thank you!

Been staring at this with a 'scope for too long, couldn't see the wood for the trees :slight_smile:

That simplifies everything a lot.

Thanks again,
TonyWilk