How to fix Slow RAM access

Consider the following code

uint8_t buff[1024];

volatile double notefact[4]={0,0,0,0};
volatile int8_t noteID[4] = {0,4,7,10};
volatile uint32_t noteTime[4] = {0,0,0,0};
volatile uint8_t temp_byte,  temp_bytefinal,finalbit;   //these are global temp variables
volatile uint16_t temp2_byte;
volatile double temp_dbl;
volatile const int sampleRate = 8000;
#define SRRatio 267904.57886781300191011839336786/sampleRate

#define TWOPOWER1_12 1.0594630943592952645618252949463
#define TWOPOWER_1_12 0.94387431268169349664191315666753

double noteFactor(uint8_t y)
{
  temp_dbl=1;
  
  while(y>0)
  {
    temp_dbl *= TWOPOWER1_12;
    y--;
  }
  return temp_dbl*SRRatio;
}

void setup()
{
  Serial.begin(9600);
  DDRD |= B11111011;
  DDRB |= B00000001;
  for(int i=0; i<1024; ++i) buff[i] = i;
  for(int i=0; i<4; ++i) notefact[i] = noteFactor(noteID[i]);
}
int cntr,t0;

void loop()
{
  int t;
  if(!cntr) t0 = millis();
  while(cntr<10000)
  {
    temp2_byte = 0;
    
    for(int i=0; i<4; ++i)
    {
      if(noteID[i] != -1)
      {
        
        t = ((uint32_t)(notefact[i]*noteTime[i]+0.5)) & 0x000003FF;
        
        temp2_byte += *(buff+t);              //THIS is the line of discussion henceforth 
                                              //refferred to as LOD
        noteTime[i]++;
      }
    }  
    temp_byte = temp2_byte>>2;
   
    temp_bytefinal = (temp_byte&B11111000)| (temp_byte>>1)&B00000011;
    finalbit = temp_byte&1;
    PORTD = (PORTD&B00000100) | (temp_bytefinal);
    PORTB = (PORTB&B111110) | finalbit;
    cntr++;
  }
  if(cntr == 10000)
  {
    Serial.print(millis()- t0);
    ++cntr;
  }
}

This is a code to time test the particular routine.

The Problem is with the LOD(look at code for LOD). with the LOD the output is around 1100 ms.
without it it is around 150 ms.
also if i replace 't' with a constant then output is around 150-200 ms.

So why is the access to the RAM taking So Much Time? and is it possible to optimise this code?

What makes you think RAM access time ever varies?

As i sai, if i replace the variable 't' with a constant, the time taken for the routine to run 10000 times decreases more than five times. so obviously, something about accessing the ram with a variable index, is making it go terribly slow.

by the way, just in case you haven't read the code., please have a look at the line commented as 'Line of Discussion' to see what i mean.

Maybe you shouldn't be doing so much mixed-type calculations. That code makes me shudder.

How about looking at the machine code the compiler generates from your source?

Making everything volatile is just asking the compiler to slow down.


  DDRD |= B11111011;
  DDRB |= B00000001;

Just use pinMode. Much more readable.


The LOD seems to generate a lot of code:

        temp2_byte += *(buff+t);              //THIS is the line of discussion henceforth 
 1f0:	a0 90 39 05 	lds	r10, 0x0539
 1f4:	b0 90 3a 05 	lds	r11, 0x053A
 1f8:	0e 94 1f 06 	call	0xc3e	; 0xc3e <__floatunsisf>
 1fc:	9b 01       	movw	r18, r22
 1fe:	ac 01       	movw	r20, r24
 200:	c8 01       	movw	r24, r16
 202:	b7 01       	movw	r22, r14
 204:	0e 94 ad 06 	call	0xd5a	; 0xd5a <__mulsf3>
 208:	20 e0       	ldi	r18, 0x00	; 0
 20a:	30 e0       	ldi	r19, 0x00	; 0
 20c:	40 e0       	ldi	r20, 0x00	; 0
 20e:	5f e3       	ldi	r21, 0x3F	; 63
 210:	0e 94 27 05 	call	0xa4e	; 0xa4e <__addsf3>
 214:	0e 94 f3 05 	call	0xbe6	; 0xbe6 <__fixunssfsi>
 218:	dc 01       	movw	r26, r24
 21a:	cb 01       	movw	r24, r22
 21c:	93 70       	andi	r25, 0x03	; 3
 21e:	8a 5e       	subi	r24, 0xEA	; 234
 220:	9e 4f       	sbci	r25, 0xFE	; 254
 222:	fc 01       	movw	r30, r24
 224:	80 81       	ld	r24, Z
 226:	a8 0e       	add	r10, r24
 228:	b1 1c       	adc	r11, r1
 22a:	b0 92 3a 05 	sts	0x053A, r11
 22e:	a0 92 39 05 	sts	0x0539, r10
                                              //refferred to as LOD

You seem to have confused the compiler. Confused me too. It's not the RAM access that is slow, it's the nature of what leads up to it. Can you explain what you are doing exactly?

The routine im trying to timetest is part of my project to sample various sound waveforms at different frequencies.

the logic is that 'buff' will contain 8-bit samples of a single wave pulse(i.e 0 to 2 pi) and this can be parsed at different rates to give different frequencies.
e.g. if i want twice the freq at the same sample rate, i'll read every alternate sample. thrice, every one in three samples. so on and so forth. so the calculation of t is very necessary and im not sure if it causes much of a problem because without the LOD or if 't' is replaced with a constant, speeds increase dramatically.

declaring variables to be volatile was necessary bcuz these variables are shared betw interrupt routines in the original project.

Yes, but calculating t is quite complicated, isn't it?

        t = ((uint32_t)(notefact[i]*noteTime[i]+0.5)) & 0x000003FF;

Why cast to a 32-bit integer when you are indexing into an array of only 1024 elements?

declaring variables to be volatile was necessary bcuz these variables are shared betw interrupt routines in the original project.

Just because something is shared between interrupt and background doesn't necessarily mean they have to be qualified as volatile.
You need to analyze their usage carefully.

maharjun:
e.g. if i want twice the freq at the same sample rate, i'll read every alternate sample. thrice, every one in three samples. so on and so forth. so the calculation of t is very necessary and im not sure if it causes much of a problem because without the LOD or if 't' is replaced with a constant, speeds increase dramatically.

Can't you move the calculation of t outside the loop? You don't change the sample rate, per instance of i do you?

maharjun:
so obviously, something about accessing the ram with a variable index, is making it go terribly slow.

More obviously to me, calculating that variable index is the slow part.

Good old Occam's Razor.

maharjun:
The routine im trying to timetest is part of my project to sample various sound waveforms at different frequencies.

the logic is that 'buff' will contain 8-bit samples of a single wave pulse(i.e 0 to 2 pi) and this can be parsed at different rates to give different frequencies.
e.g. if i want twice the freq at the same sample rate, i'll read every alternate sample. thrice, every one in three samples. so on and so forth. so the calculation of t is very necessary and im not sure if it causes much of a problem because without the LOD or if 't' is replaced with a constant, speeds increase dramatically.

declaring variables to be volatile was necessary bcuz these variables are shared betw interrupt routines in the original project.

The routine im trying to timetest is part of my project to sample various sound waveforms at different frequencies.

Your test is testing your test method and not actually sampling waveforms at different frequencies.
Even if you fix the method, it's not going to tell you what you want.

ok. looks like i have a lot of things to clarify.

in my original project, the routine is the code of the timer1_compA interrupt vector. so basically, visualise the loop as repeated instances of the interrupt. the slow nature severely limits the Sample rate of the audio.

Can't you move the calculation of t outside the loop? You don't change the sample rate, per instance of i do you?

no im not sure which loop u mean. if its the while loop then obviously not because then it wouldnt be much of a simulation (see above). Also, the calculation depends on noteTime[y] (ive used y istead of i due to italics confusion.) which gets incremented each time.

if its the for loop then no obviously because it depends on i.

Why cast to a 32-bit integer when you are indexing into an array of only 1024 elements?

because noteTime[y] *notefact[y] can get pretty big pretty fast and i have no idea how overflows work when converting from a floating point to an integer type.

More obviously to me, calculating that variable index is the slow part.

This is the FACT:
you remove the LOD(pls read code of my first post loop() function) things are fast;
i dunno how this could be related to the previous line of calculation.
if anyone could please explain this, id be grateful.

Your test is testing your test method and not actually sampling waveforms at different frequencies.

yes obviously. im only trying to see how much time the routine takes. this code will sample 4 notes (simultaneously) if i copy and paste it back into my interrupt routine and fillup buff[] with proper waveform data.

AWOL:

declaring variables to be volatile was necessary bcuz these variables are shared betw interrupt routines in the original project.

Just because something is shared between interrupt and background doesn't necessarily mean they have to be qualified as volatile.
You need to analyze their usage carefully.

I dispute that glib statement - in what circumstances will a non-volatile variable be guaranteed safe to communicate between threads or between ISR and main code?

maharjun:
in my original project, the routine is the code of the timer1_compA interrupt vector. so basically, visualise the loop as repeated instances of the interrupt. the slow nature severely limits the Sample rate of the audio.

I saw that code. It needs major changing as in rewrite.

How many samples per second do you need? And how many bits each? The ADC can give you 8-bit samples more quickly than 10-bit samples (at 105 usec each if you stay on the same pin, else when you change pins it's sample twice and throw the first out). That still lets you go for about 2000 10-bit samples per second with cycles left over a pretty decent bit of processing. If you can tighten that up then you might get 4000-5000 per sec and if you go to 8-bit samples you can get more.

OTOH I like my audio at either 22050 or 44100.

GoForSmoke:
I saw that code. It needs major changing as in rewrite.

How many samples per second do you need? And how many bits each? The ADC can give you 8-bit samples more quickly than 10-bit samples (at 105 usec each if you stay on the same pin, else when you change pins it's sample twice and throw the first out). That still lets you go for about 2000 10-bit samples per second with cycles left over a pretty decent bit of processing. If you can tighten that up then you might get 4000-5000 per sec and if you go to 8-bit samples you can get more.

OTOH I like my audio at either 22050 or 44100.

ya i rewrote that code and things are working fine except that there is an upper cap of 8000 Samples/sec in order to support 4 note polyphony. that is why i am concerned as to why this routine is taking such a long time. someone above has posted the assembly thats produced by the LOD. if you or anyone could explain why thats necessary, and/or how it could be modified it would be of tremendous help.

for my case i would like it to be atleast 20000 Hz

I think that you might need a DSP chip or audio shield in your project. 20000/sec is 50 usec to sample and store. At 8 bits per sample your polyphony is going to get muddy even if you could go that fast.

MarkT:
I dispute that glib statement - in what circumstances will a non-volatile variable be guaranteed safe to communicate between threads or between ISR and main code?

This one, for example:

volatile const int sampleRate = 8000;

It doesn't change, it hardly needs to be volatile.

maharjun:

More obviously to me, calculating that variable index is the slow part.

This is the FACT:
you remove the LOD(pls read code of my first post loop() function) things are fast;
i dunno how this could be related to the previous line of calculation.
if anyone could please explain this, id be grateful.

I can explain it, easily.

Consider this:

   if(noteID[i] != -1)
      {
        
        t = ((uint32_t)(notefact[i]*noteTime[i]+0.5)) & 0x000003FF;
        
        temp2_byte += *(buff+t);              //THIS is the line of discussion henceforth 
                                              //refferred to as LOD
        noteTime[i]++;
      }

It's an optimizing compiler. That's the only place you use t. So if you remove the memory access:

   if(noteID[i] != -1)
      {
        
        t = ((uint32_t)(notefact[i]*noteTime[i]+0.5)) & 0x000003FF;
        
        noteTime[i]++;
      }

The compiler says "hey, he never uses t, why calculate it?" so it generates:

   if(noteID[i] != -1)
      {
        noteTime[i]++;
      }

... if 't' is replaced with a constant, speeds increase dramatically ...

As you say, if you replace t with a constant it is fast. Surely that must convince you that it isn't the memory access that is slow, it is calculating t. What other proof do you need? You admit it yourself.

maharjun:
if you or anyone could explain why thats necessary, and/or how it could be modified it would be of tremendous help.

I have explained it. You just are not listening.