Extremely efficient code.

Hi,

I am trying to write a Timer ISR which needs to be as efficient as possible. Basically the shorter it is, the faster I can clock the timer (the faster the better).
I am attempting to control a 4.3" LCD with an Arduino Mega PWM module - I know, there are better processors to use, and better ways of doing it, but I am just curious how well I can make it work!
The Interrupt needs to occur once for every pulse of the pixel clock in order to read 24bit data from a parralel SRAM, and then output it to three 8bit ports (R,G,B). It also needs to be able to determine whether the clock occured inside the VPorch or HPorch as no data is needed there.
To read the data, it has to output the correct address to the SRAM which it does so on a mixture of PORTL (bits 2-9), PH0 (bit 10). It also uses PE3-4 (bits 0,1) to select the data for a given colour (0 = Red, 1=Green, 3=Blue) as it is only an 8bit SRAM.

So far I have managed to push the pixel clock to just of 222kHz by carefully choosing which ports to use so as to allow sbi() and cbi() to be favoured over the longer commands needed on higher ports. I have also sacraficed a couple of boolean variables and am instead saving the variable to a couple of the atmega1280's pins not used by the arduino IDE (PortG3,4), as this is amusing much more efficient!

If anyone has any suggestions on if this code can be made more efficient, I welcome it. I haven't tested this with the actual LCD yet, but am using a logic analyser to verify that the output matches the screens datasheet. With the current pixel clock, I estimate it is possible to get 1.35 FPS, which is on the low side, but still rather impressive for a 480x272px screen. At the moment, the DCLK timer has a period of 72 clock cycles.

Following is the C code, and compiler output (assembler).

C-Code (I have ommited large parts including timer setup, but basically Timer2 is the DCLK, Timers 1,3 are DataEn, and HSync. The DE and HSync use seperate timers which are set to use an External clock source. The Timer2 ISR, then writes to the external input pins, allowing it to provide a clock):

#define DCLKPort PORTH
#define DCLKDir DDRH
#define DCLKMask 0b01000000
#define DCLKUnMask 0b10111111

#define T3Port PORTE
#define T3Dir DDRE
#define T3Bit 6
#define T3Mask 0b01000000
#define T3UnMask 0b10111111
#define DEPort PORTE
#define DEDir DDRE
#define DEMask 0b00100000
#define DEUnMask 0b11011111

#define T4Port PORTH
#define T4Dir DDRH
#define T4Bit 7
#define T4Mask 0b10000000
#define T4UnMask 0b01111111
#define VSyncPort PORTH
#define VSyncDir DDRH
#define VSyncMask 0b00100000
#define VSyncUnMask 0b11011111

#define T1Port PORTD
#define T1Dir DDRD
#define T1Bit 6
#define T1Mask 0b01000000
#define T1UnMask 0b10111111
#define HSyncPort PORTB
#define HSyncDir DDRB
#define HSyncMask 0b10000000
#define HSyncUnMask 0b01111111

ISR(TIMER2_OVF_vect) { //DCLK Rising
  //----Route T2 to T3 and T1--------------
  mask(T3Port,T3Mask); //rising edge increments DE timer
  mask(T1Port,T1Mask); //rising edge increments HSync timer
  //---------------------------------------
  if(TIFR3 & (1<<OCF3B)){
    //Check for interrupt of DE first (will occur one cpu clock cycle after mask(T3Port,T3Mask). This speeds it up as there is then only one ISR to run, not two.
    setBit(TIFR3,OCF3B); //clear the interrupt so its ISR isnt triggered
    if(PORTG & 0b00010000){
      mask(PORTG,0b00001000);
      //This bit is in two places as the compiler then will optimise and use rjmp commands to use the same bit of code for both the two (saves program space).
      unMask(ColourPort,ColourUnMask);
    }
  } else if(PORTG & 0b00001000){//!HPorch){}
    //Valid data clock
    //output the dcount-th data onto the RGB pins here
    setReg(RedPort,DataPin); //Read data from SRAM, and output to Red port. Saves ~15 instructions compared with saving to a variable.
    mask(ColourPort,ColourGreen);
    setReg(GreenPort,DataPin);
    mask(ColourPort,ColourBlue); 
    setReg(BluePort,DataPin);
    byte dcount = ColLowPort;
    dcount++;
    setReg(ColLowPort,dcount);
    if(dcount == 0){
      mask(ColHighPort,ColHighMask); //When dcount rolls over (i.e. it == 0, then set the A8 bit)
    }
    //This bit is in two places as the compiler then can optimise and use rjmp commands to link the two (saves program space).
    unMask(ColourPort,ColourUnMask);
  }
  unMask(T3Port,T3UnMask); //return low
  unMask(T1Port,T1UnMask); //return low
}

Assembler (again, just for the interrupt).

0000014c <__vector_15>:
     14c:	1f 92       	push	r1
     14e:	0f 92       	push	r0
     150:	0f b6       	in	r0, 0x3f	; 63
     152:	0f 92       	push	r0
     154:	11 24       	eor	r1, r1
     156:	8f 93       	push	r24
     158:	76 9a       	sbi	0x0e, 6	; 14
     15a:	5e 9a       	sbi	0x0b, 6	; 11
     15c:	c2 9b       	sbis	0x18, 2	; 24
     15e:	05 c0       	rjmp	.+10     	; 0x16a <__vector_15+0x1e>
     160:	c2 9a       	sbi	0x18, 2	; 24
     162:	a4 9b       	sbis	0x14, 4	; 20
     164:	1c c0       	rjmp	.+56     	; 0x19e <__vector_15+0x52>
     166:	a3 9a       	sbi	0x14, 3	; 20
     168:	17 c0       	rjmp	.+46     	; 0x198 <__vector_15+0x4c>
     16a:	a3 9b       	sbis	0x14, 3	; 20
     16c:	18 c0       	rjmp	.+48     	; 0x19e <__vector_15+0x52>
     16e:	8f b1       	in	r24, 0x0f	; 15
     170:	82 b9       	out	0x02, r24	; 2
     172:	73 9a       	sbi	0x0e, 3	; 14
     174:	8f b1       	in	r24, 0x0f	; 15
     176:	88 b9       	out	0x08, r24	; 8
     178:	74 9a       	sbi	0x0e, 4	; 14
     17a:	8f b1       	in	r24, 0x0f	; 15
     17c:	80 93 08 01 	sts	0x0108, r24
     180:	80 91 0b 01 	lds	r24, 0x010B
     184:	8f 5f       	subi	r24, 0xFF	; 255
     186:	80 93 0b 01 	sts	0x010B, r24
     18a:	88 23       	and	r24, r24
     18c:	29 f4       	brne	.+10     	; 0x198 <__vector_15+0x4c>
     18e:	80 91 02 01 	lds	r24, 0x0102
     192:	81 60       	ori	r24, 0x01	; 1
     194:	80 93 02 01 	sts	0x0102, r2
     198:	8e b1       	in	r24, 0x0e	; 14
     19a:	87 7e       	andi	r24, 0xE7	; 231
     19c:	8e b9       	out	0x0e, r24	; 14
     19e:	76 98       	cbi	0x0e, 6	; 14
     1a0:	5e 98       	cbi	0x0b, 6	; 11
     1a2:	8f 91       	pop	r24
     1a4:	0f 90       	pop	r0
     1a6:	0f be       	out	0x3f, r0	; 63
     1a8:	0f 90       	pop	r0
     1aa:	1f 90       	pop	r1
     1ac:	18 95       	reti

Quick question about general registers, what affects 'r1' ?

I know the 'mul' instruction does, but are there any others? I am wondering if I can remove the lines:

14c: 1f 92 push r1
154: 11 24 eor r1, r1
......
1a2: 1f 90 pop r1

from the interrupt routine in my last post, as I can't see anything that changes r1, or relies on it being a specific value. It would save 5 clock cycles which equates to an extra 0.14 FPS.

You have already done some good optimizations.
Your use of the 'dcount' variable can be written shorter in 'C' (with less program lines), but the compiler won't make smaller code.

I don't know if user defined compiler options is already implemented. Arduino uses common optimizations for size, so it doesn't use the most extreme compiler options for speed.

What are the mask(), unMask() and setReg() functions or macro's ? If setReg() is for a register, why use setBit() for a register ?

Oh right yeah, forgot to post those:

#define setBit(r,b) r |= (1<<b)
#define clrBit(r,b) r &= ~(1<<b)
#define mask(r,m) r |= m
#define unMask(r,u) r &= u
#define setReg(r,v) r = v
#define getReg(v,r) v = r

I just got tired of writing the same thing out over and over again. There are probably better ways to name them, but this made sense to me at the time.
unMask should probably be r &= ~u, and then there would be no need for all the Timer2UnMask etc. defines, but I wasn't sure if the compiler would do the bitwise not of a constant automatically when it optimised. I think from what I have seen now that it should, but not to worry.

I have managed to get some more out of it by moving the Address8 bit to port G bit 5. That way an LDS,ORI,STS is just replaced with SBI, saving 5 clocks.
I also change the ISR definition to:

ISR(TIMER2_OVF_vect) __attribute__((naked)); //dont add additional ISR prep
ISR(TIMER2_OVF_vect) { //DCLK Rising
  asm("push r0"); //Save the current value of r0
  asm("in r0, 0x3f; 63"); //read the current SREG
  asm("push r0"); //Save the current SREG
  asm("push r24"); //Save the current value of r24 (it will be modified during ISR)

  ...

  asm("pop r24"); //Recover old r24
  asm("pop r0"); //Read back saved SREG
  asm("out 0x3f, r0; 63"); //update SREG to what it used to be
  asm("pop r0"); //recover old r0
  asm("reti"); //return from interrupt
}

Thus saving another 5 clocks, as I am not sure r1 is changed by my ISR, but the compiler added in code to save and recover its value anyway. With both of those, It is now up to 1.53FPS.

what affects 'r1' ?

gcc uses r1 as a "known zero." If you call other C functions in your ISR (which is a bad idea, since it would make it slower), you would need to make sure that r1 contains zero.

You can probably also get rid of the second push of r0, if you're not using it and don't call anything else.

You may possibly find this helpful:

I clocked out VGA signals there, both in monochrome, and RGB colour (2 bits per colour) at various rates.

My pixel clock for monochrome was 125 nS, and for colour (which was a bit harder) 375 nS.

I did not use assembler, the critical loops were done by code which the C compiler seems to have done a good a job as was possible.

You seem to be doing full 24-bit RGB, which will necessarily be slower. Do you need that much? 16-bit might give pretty good results on the LCD screen, and be correspondingly faster.

@Nick Gammon
In terms of 16bit vs 24bit, it doesn't make much difference as the red, green and blue are on seperate ports. But i suppose I could use RGB565, and then use only ports A and C. If I had already stored the colour in the SRAM in that format then it would concievably save about 6 clock cycles, and £10 as I could use a 4MBit rather than 8MBit SRAM chip. Its worth considering, thanks. I think that would get me up to 1.7FPS.

@westfw
There arent any functions called by the ISR, so thanks for that, makes me feel more confident to have removed "push r1" and "pop r1" instructions. I think the second push of r0 is used to back up the SREG register, which will probably be modified by the ISR so should be saved.

Thanks for that link, I had a quick look, and will give it a more detailed look in the morning. I gather from the link that using PWM outputs to control a screen is not as far fetched an idea as I thought it would be when I came up with this crazy plan! ]:smiley:

Switching to 16bit colour increases the frame rate to 1.79FPS. It also has a knock on effect on improving the DE interrupt by giving me an extra whole unused port. Before for the row address I had to use a mixture of portD and portB which meant alot of masking when updating the address. Now I can use the whole of PortK, which used to be fore 8bits of blue. It is also possible to swap PortC and PortL which saves two more instruction clocks.

I think judging on this C and assembler there probably aren't any more optimisations that can be made:

C-Code:

#define DCLKPort PORTH
#define DCLKDir DDRH
#define DCLKMask 0b01000000
#define DCLKUnMask 0b10111111

#define T3Port PORTE
#define T3Dir DDRE
#define T3Bit 6
#define T3Mask 0b01000000
#define T3UnMask 0b10111111
#define DEPort PORTE
#define DEDir DDRE
#define DEMask 0b00100000
#define DEUnMask 0b11011111

#define T4Port PORTH
#define T4Dir DDRH
#define T4Bit 7
#define T4Mask 0b10000000
#define T4UnMask 0b01111111
#define VSyncPort PORTH
#define VSyncDir DDRH
#define VSyncMask 0b00100000
#define VSyncUnMask 0b11011111

#define T1Port PORTD
#define T1Dir DDRD
#define T1Bit 6
#define T1Mask 0b01000000
#define T1UnMask 0b10111111
#define HSyncPort PORTB
#define HSyncDir DDRB
#define HSyncMask 0b10000000
#define HSyncUnMask 0b01111111

#define LowColourPort PORTA
#define LowColourDir DDRA
#define LowColourMask 0b11111111
#define LowColourUnMask 0b00000000
#define HighColourPort PORTL
#define HighColourDir DDRL
#define HighColourMask 0b11111111
#define HighColourUnMask 0b00000000

#define DataPort PORTF
#define DataPin PINF
#define DataDir DDRF
#define DataMask 0b11111111
#define DataUnMask 0b00000000
#define ColourPort PORTE
#define ColourDir DDRE
#define ColourMask 0b00010000
#define ColourUnMask 0b11101111
#define ColLowPort PORTC
#define ColLowDir DDRC
#define ColLowMask 0b11111111
#define ColLowUnMask 0b00000000
#define ColHighPort PORTE
#define ColHighDir DDRE
#define ColHighMask 0b00001000
#define ColHighUnMask 0b11110111
#define RowLowPort PORTK
#define RowLowDir DDRK
#define RowLowMask 0b11111111
#define RowLowUnMask 0b00000000
#define RowHighPort PORTD
#define RowHighDir DDRD
#define RowHighMask 0b10000000
#define RowHighUnMask 0b01111111


//Register macros. Saves time (dont have to jump to functions)
#define setBit(r,b) r |= (1<<b)
#define clrBit(r,b) r &= ~(1<<b)
#define mask(r,m) r |= m
#define unMask(r,u) r &= u
#define setReg(r,v) r = v
#define getReg(v,r) v = r


ISR(TIMER2_OVF_vect) __attribute__((naked)); //dont add additional ISR prep
ISR(TIMER2_OVF_vect) { //DCLK Rising
  //----Route T2 to T3 and T1-------------- Do this first as needs to be done straight away. Will only use sbi, so dont need to have backed up any registers beforehand
  mask(T3Port,T3Mask); //rising edge increments DE timer
  mask(T1Port,T1Mask); //rising edge increments HSync timer
  //---------------------------------------
  asm("push r0"); //Save the current value of r0
  asm("in r0, 0x3f; 63"); //read the current SREG
  asm("push r24"); //Save the current value of r24 (it will be modified during ISR)
  
  if(TIFR3 & (1<<OCF3B)){
    //Check for interrupt of DE first (will occur one cpu clock cycle after mask(T3Port,T3Mask). This speeds it up as there is then only one ISR to run, not two.
    setBit(TIFR3,OCF3B); //clear the interrupt so its ISR isnt triggered
    if(PORTG & 0b00010000){
      mask(PORTG,0b00001000);
      //This bit is in two places as the compiler then will optimise and use rjmp commands to use the same bit of code for both the two (saves program space).
      unMask(ColourPort,ColourUnMask);
    }
  } else if(PORTG & 0b00001000){//!HPorch){}
    //Valid data clock
    //output the dcount-th data onto the RGB pins here
    setReg(LowColourPort,DataPin); //Read data from SRAM, and output to Red port. Saves ~15 instructions compared with saving to a variable.
    mask(ColourPort,ColourMask);
    setReg(HighColourPort,DataPin);
    
    byte dcount = ColLowPort;
    dcount++;
    setReg(ColLowPort,dcount);
    if(dcount == 0){
      mask(ColHighPort,ColHighMask); //When dcount rolls over (i.e. it == 0, then set the A8 bit)
    }
    //This bit is in two places as the compiler then can optimise and use rjmp commands to link the two (saves program space).
    unMask(ColourPort,ColourUnMask);
  }
  unMask(T3Port,T3UnMask); //return low
  unMask(T1Port,T1UnMask); //return low
  
  asm("pop r24"); //Recover old r24
  asm("out 0x3f, r0; 63"); //update SREG to what it used to be
  asm("pop r0"); //recover old r0
  asm("reti"); //return from interrupt
}

Assembler:

0000014c <__vector_15>:
     14c:	76 9a       	sbi	0x0e, 6	; 14
     14e:	5e 9a       	sbi	0x0b, 6	; 11
     150:	0f 92       	push	r0
     152:	0f b6       	in	r0, 0x3f	; 63
     154:	8f 93       	push	r24
     158:	c2 9b       	sbis	0x18, 2	; 24
     15a:	05 c0       	rjmp	.+10     	; 0x166 <__vector_15+0x1a>
     15c:	c2 9a       	sbi	0x18, 2	; 24
     15e:	a4 9b       	sbis	0x14, 4	; 20
     160:	12 c0       	rjmp	.+36     	; 0x186 <__vector_15+0x3a>
     162:	a3 9a       	sbi	0x14, 3	; 20
     164:	0f c0       	rjmp	.+30     	; 0x184 <__vector_15+0x38>
     166:	a3 9b       	sbis	0x14, 3	; 20
     168:	0e c0       	rjmp	.+28     	; 0x186 <__vector_15+0x3a>
     16a:	8f b1       	in	r24, 0x0f	; 15
     16c:	82 b9       	out	0x02, r24	; 2
     16e:	74 9a       	sbi	0x0e, 4	; 14
     170:	8f b1       	in	r24, 0x0f	; 15
     172:	80 93 0b 01 	sts	0x010B, r24
     176:	88 b1       	in	r24, 0x08	; 8
     178:	8f 5f       	subi	r24, 0xFF	; 255
     17a:	88 b9       	out	0x08, r24	; 8
     17e:	88 23       	and	r24, r24
     180:	09 f4       	brne	.+2      	; 0x184 <__vector_15+0x38>
     182:	73 9a       	sbi	0x0e, 3	; 14
     184:	74 98       	cbi	0x0e, 4	; 14
     186:	76 98       	cbi	0x0e, 6	; 14
     188:	5e 98       	cbi	0x0b, 6	; 11
     18a:	8f 91       	pop	r24
     18e:	0f be       	out	0x3f, r0	; 63
     190:	0f 90       	pop	r0
     192:	18 95       	reti

I think the second push of r0 is used to back up the SREG register

SREG is saved by copying it into r0. If you don't otherwise use r0, and don't call any C code that feels free to use r0 as a tmp, then you don't also need to save on the stack...

You are correct, I hadn't thought of that, thanks! Thats another 4 clocks saving. (I've modified the code in the last post)

The ISR now currently takes only 51 instruction clock cycles to complete, giving 1.9FPS.

I find it amazing how much code can be optimised with a bit of care. When I started, the pixel clock was running so slowly that I estimated that 0.05FPS was the most possible. But this is teaching me a great deal about ways to improve code.

Though it doesn't save any time, I have moved the mask(T3Port,T3Mask) and mask(T1Port,T1Mask) above the pushing to the stack as both commands translate into sbi instructions which wont affect the SREG, or r0, or r24. It doesn't cost anything to do, but it means that DE and HSync are better aligned with the start of the clock cycles.

If I can just save 3 more instruction clocks, then I will be able to reach 2FPS with which I would be satisfied... :astonished:

I found a way to save another clock,

The bit of code:

    byte dcount = ColLowPort;
    dcount++;
    setReg(ColLowPort,dcount);
    if(dcount == 0){
      mask(ColHighPort,ColHighMask); //When dcount rolls over (i.e. it == 0, then set the A8 bit)
    }

Checks to see if dcount overflowed. But because of where I put setReg(ColLowPort,dcount); the overflow flag is ignored when it does dcount++, and then it has to check by and-ing the dcount with itself to see if it is 0.

By rearranging to:

    byte dcount = ColLowPort;
    if(++dcount == 0){
      mask(ColHighPort,ColHighMask); //When dcount rolls over (i.e. it == 0, then set the A8 bit)
    }
    setReg(ColLowPort,dcount);

It can use the same command that increments dcount to also check if it is 0. :smiley:
assembler becomes:

174:	88 b1       	in	r24, 0x08	; 8
     176:	8f 5f       	subi	r24, 0xFF	; 255
     178:	09 f4       	brne	.+2      	; 0x17c <__vector_15+0x30>
     17a:	73 9a       	sbi	0x0e, 3	; 14
     17c:	88 b9       	out	0x08, r24	; 8

Now at 1.94FPS :smiley:

Let me tell you about one efficiency gain I made with the VGA output ...

I was reading font information from PROGMEM. I needed it in columns then rows (because you draw horizontally) but the data was in rows, then columns. So to access it, the indexing required multiplying by 8 each time (the size of the font) to get from column to column. By simply re-arranging the static data (by an offline script) the data was in the faster order. That saved a few clock cycles accessing it.

I figured rather than fussing around with PROGMEM, it would be faster to just access it from an external 4MBit Parallel SRAM IC. It takes only 1 instruction cycle to read in the data, plus another 4-5 to select the address each time.
To speed things up, the data will be saved already formatted in RGB565. That way it can be read in and written out with no calculation necessary.