Go Down

Topic: Improving SIG_OVERFLOW (timer0, millis, etc) (Read 2870 times) previous topic - next topic

westfw

I like the new  algorithm used by millis() (and the underlying support); I've been having odd problems with time-based code that I finally figured out were because millis() has 32-bit overflow issues long before timer0_overflow_count does, confusing everyone.

However, I'm depressed by how long the ISR for timer0 overflows has become; all that 32bit math.  Looping!

By noticing that each timer tick is some intergal number of milliseconds long, plus some additional number of microseconds long, and that the number of microseconds can be scaled to fit in a single byte, I THINK I can chop off about 72 bytes of the 184 byte ISR (now 112 bytes, but not "thoroughly" tested.)  It looks like this:
Code: [Select]

volatile unsigned long timer0_millisecs;
static unsigned char timer0_smicrosecs; //scaled microseconds
#define MS_PER_TICK 1            /* millisecs per clock tick */
#define S_US_PER_TICK (24>>2)      /* scaled microsecs/tick (additional) */
#define S_US_PER_MS (1000>>2)      /* scaled microsecs per millisec */

SIGNAL(SIG_OVERFLOW0)
{
   register unsigned long tmp = timer0_millisecs;

   timer0_smicrosecs += S_US_PER_TICK; /* increment microsecond fraction */
   if (timer0_smicrosecs > S_US_PER_MS) {
     timer0_smicrosecs -= S_US_PER_MS;
     tmp += 1;
   }
   tmp += MS_PER_TICK;                  /* increment milliseconds */
   timer0_millisecs = tmp;
}

Unsigned long millis()
{
   return timer0_millisecs;
}

(it's currently got constants for ms_per_tick and s_us_per_tick, but I'm pretty sure those can be derived from the existing defined constants like F_CPU and etc.)
I'm not very impressed by the code produced by the compiler; it spends time saving R1 and R0 and then doesn't use them, for some reason it doesn't use adiw for the math on tmp (although it did when the math was done directly on timer0_millisecs (but added extras load/store cause of the "volatile.")), and of course it doesn't seem to peephole optimize multibyte math, but I suppose the additional 20bytes or so one might save by using assembler isn't really worth it...


mellis

I love the idea of having a more efficient timer 0 overflow routine, but remember that we need to support multiple clock speeds.  In particular, the LilyPad runs at 8 MHz, but it would also be good if the routine worked for arbitrary clock speeds, for people using the core with other hardware.  It might also make sense to optimize the 8 MHz / 16 MHz version and have a slower one for other clock speeds.  

If you can take a shot at deriving your new constants from F_CPU, that would be a good start.

I believe I suggested this in another post, but perhaps there could be three overflow ISRs: one hardcoded and maximally efficient for 16 MHz, one hardcoded and maximally efficient for 8 MHz, and one generic one that uses F_CPU.  You could use #ifdefs to determine which ISR is compiled, and it would be possible to add in hardcoded versions for other clock speeds with more #ifdefs.  It's more code, but I think the decreased time spent in the ISR would make it worthwhile, especially for Arduino users with applications that require precision in microseconds.

- Ben

mellis

That's not a bad idea, especially if someone wants to create the optimized versions.

westfw

Quote
but remember that we need to support multiple clock speeds.

On an 8MHz system, clock0 interrupts occur every 0.002048 s, rather than changing the timer parameters, right?
I think my scheme should work well for any clock speed where the "remainder" microseconds is an integral multiple of 4, which should include all the "(power of 2) MHz" clock speeds, but not (for example) 20Mhz (819.2 microseconds per tick.)

I'll take a look at basing things from F_CPU, since you seem to basically approve of the idea...

How do you feel about having conditional compilation end up generating significantly different code for different clock frequencies?  It would make me nervous!
Code: [Select]
#if ((US_PER_TICK & 3) == 0)
// code using scaled microseconds in a byte variable
#else
// code using clocks per tick in long variable (like it is now)
#endif

westfw

How's this?  Appears to generate correct constants for 8MHz and 16MHz, appropriate error message for 20MHz, and "looks" ok actually running at 16MHz.
Code: [Select]

/*
* Count milliseconds since initialization.  Use the timer0 overflow.
* Timer0 prescaler is set to 64, and overflows every 256 counts, so:
*    ((1/F_CPU) * 256 * 64)  =  seconds per tick.
*    ((1/F_CPU) * 256 * 64) * 1e6 =  microseconds per tick.
* But we only have integers, so rearrange:
*     (64e6/F_CPU) * 256 = us/tick
* or  (1024e6/F_CPU) * 16
*/
#define _US_PER_TICK ((1024000000L/F_CPU)*16)
#define MS_PER_TICK (_US_PER_TICK/1000)
/*
* Now, to fit at least one millisecond worth of microseconds in a single
* byte, we need to scale by  factor of at least 4 (1 count = 4 uS)
*/
#define S_US_PER_MS (1000>>2)      /* scaled microsecs per millisec */
#define S_US_PER_TICK ((_US_PER_TICK % 1000)>>2)
/*
* Check whether this will cause errors; all the math should reverse
*/
#if ((((MS_PER_TICK*1000)+(S_US_PER_TICK<<2)) /16) * F_CPU) != 1024000000

#error Microseconds per tick not scalable!
/*
* Insert code here for supported "odd" CPU frequencies.
*/

#else // FAST_SIG_OVERFLOW

volatile unsigned long timer0_millisecs;
static unsigned char timer0_smicrosecs; //scaled microseconds

SIGNAL(SIG_OVERFLOW0)
{
   register unsigned long tmp = timer0_millisecs;

   timer0_smicrosecs += S_US_PER_TICK; /* increment microsecond fraction */
   if (timer0_smicrosecs >= S_US_PER_MS) {
     timer0_smicrosecs -= S_US_PER_MS;
     tmp += 1;
   }
   tmp += MS_PER_TICK;            /* increment milliseconds */
   timer0_millisecs = tmp;
}

unsigned long millis()
{
   return timer0_millisecs;
}
#endif // FAST SIG_OVERFLOW

westfw

The C compiler weird.  Here's some of the code it generated for the above source:
Code: [Select]
   timer0_smicrosecs += S_US_PER_TICK; /* increment microsecond fraction */
 26:   90 91 00 00     lds     r25, 0x0000
 2a:   89 2f           mov     r24, r25
 2c:   8a 5f           subi    r24, 0xFA       ; 250
 2e:   80 93 00 00     sts     0x0000, r24
   if (timer0_smicrosecs >= S_US_PER_MS) {
 32:   8a 3f           cpi     r24, 0xFA       ; 250
 34:   38 f0           brcs    .+14            ; 0x44 <__vector_16+0x44>
       timer0_smicrosecs -= S_US_PER_MS;
 36:   94 5f           subi    r25, 0xF4       ; 244
 38:   90 93 00 00     sts     0x0000, r25
       tmp += 1;
 3c:   2f 5f           subi    r18, 0xFF       ; 255
 3e:   3f 4f           sbci    r19, 0xFF       ; 255
 40:   4f 4f           sbci    r20, 0xFF       ; 255
 42:   5f 4f           sbci    r21, 0xFF       ; 255
   }

Aside from the confusing coincidence that for 16MHz, (-S_US_PER_TICK ==  S_US_PER_MS), look at the constant it uses at locations 36: (244); it took me a while to figure out where that came from; I guess it must be an artifact of sign issues on byte arithmetic or something.
Give up?  at 2E:, we store R24, containing (timer0_smicrosecs + 6), back to timer0_smicrosecs (x + 6 = x - (-6); Not "addi" instruction in AVR!)  But R25 still contains a copy of the original value of timer0_smicrosecs, so at 36: it's actually calculating (oldval - (-6+250)) or (ta da!) (oldval-244) !
Not the way I would have written it :-(   I was hoping to avoid multiple stores by avoiding the volatile qualifier on timer0_smicrosecs.  Sigh.

westfw

Eight bytes shorter; due to rather obscure reasons.  (I was hoping to get rid of 4 from that extra store.  But the long ended up in different registers, permitting the use of addiw for its math, saving another 4...)
Code: [Select]
SIGNAL(SIG_OVERFLOW0)
{
   register unsigned long mstmp = timer0_millisecs;
   register unsigned char ustmp = timer0_smicrosecs;

   ustmp += S_US_PER_TICK; /* increment microsecond fraction */
   if (ustmp >= S_US_PER_MS) {
     ustmp -= S_US_PER_MS;
     mstmp += 1;
   }
   mstmp += MS_PER_TICK;            /* increment milliseconds */
   timer0_smicrosecs = ustmp;
   timer0_millisecs = mstmp;
}

kg4wsv

You can always embed your own assembly in the C code.  That's still practiced in device drivers, when it's necessary for timing, etc...

-j


mellis

westfw: cool.  How about combining that with code for a generic overflow interrupts for other cpu speeds?

westfw

Quote
You can always embed your own assembly in the C code.

We have a policy at work that you're only allowed to do that if you can demonstrate a significant benefit.  In this case, I did some hand optimization of the assembly that the compiler produced, and I was only able to save about 6 instructions:
  • (2) unnecessary save & restore of r0
  • (3) unnecessary save, restore, and use of r19
  • (1) peephole optimization of multi-byte increment (share adc's of upper bytes) (thought this would save more, but no...)

that's not worth it, especially since one of my personal goals is to run Arduino SW on a non-AVR cpu.

westfw

Bah; it doesn't quite work at F_CPU=8MHz.
I think it loses a millisecond each time (timer0_smicroseconds + usec_per_tick) wraps around instead of becoming greater than s_microsec_per_millisec.  Since the latter is 250, and wraparound happens at 256, this will happen some of the time whenever the remainder microseconds is more than 24 (scaled --> 6)
It looks like I can scale by another factor of 2 and still work with 1,2,4,8.16 MHz, and not really be any worse off at other common frequencies...

westfw

So I made this neat little test sketch, which allows the timer code to be tested (sort of) at various values of F_CPU, even though my only real HW runs at 16Mhz.  It does rely on extra debug code in the ISR to track total ticks separately from whatever magic is done to try to count milliseconds.   The latest code seems to work ok at 8, 16, and 20MHz...

The hack at Serial.begin might be generally useful  :-)

Code: [Select]
extern unsigned long test_total_ticks ;
extern unsigned long test_us_per_tick;
extern unsigned char test_ms_per_tick;
extern unsigned char test_s_us_per_tick;

long time = 300;
static unsigned long timer;

void setup()
{
   char i;
   pinMode(13, OUTPUT);  // something to blink
   digitalWrite(13,0);
   timer = millis();
   /*
    * Test tick vs time algorithm at multiple clock speeds, even though our
    * real hardware only goes 16Mhz.  Scale baud rate appropriately.
    */
   Serial.begin(9600 * (F_CPU/10000)/1600);
}
char itoh (int h)
{
   h += '0';
   if (h > '9')
       h += ('A'-'9')-1;
   return h;
}
void printlnlong (unsigned long x)
{
   int i;
   for (i = 28; i > 0; i -= 4) {
       Serial.print(itoh((x>>i) & 0xF), BYTE);
   }
   Serial.println(itoh(x & 0xF), BYTE);
}

void loop()
{
   int cmd;
   unsigned long mil;
   mil = millis();
   static unsigned char state = 0;

   if (mil >= timer) {
       /*
        * time to do something ?
        */
       state ^= 1;
       digitalWrite(13, state);

       timer = millis() + time;
   }
   cmd = Serial.read();
   if (cmd > 0) {
       unsigned long mymillis, myticks;
       uint8_t oldSREG = SREG;

       // disable interrupts while we read timers or we might get an
       // inconsistent value (e.g. in the middle of the timer0_millis++)
       cli();
       mymillis = millis();   // note: fetch these two values at the same time!
       myticks = test_total_ticks;
       SREG = oldSREG;  // restore interrupt status


       if (cmd == 'x') {
           Serial.println("--------------");
           Serial.print("ticks = 0x");
           printlnlong(myticks);
           Serial.print("millis()                       = 0x");
           printlnlong(mymillis);
           Serial.print("Real time (ticks*256*64/F_CPU) = 0x");
           printlnlong((myticks*256L*64L)/(F_CPU/1000));
           Serial.print("Constants:   ms_per_tick=");
           Serial.print((int) test_ms_per_tick);
           Serial.print(", us_per_tick=");
           Serial.print((int)test_us_per_tick);
           Serial.print(", remainder us_per_tick=");
           Serial.println((unsigned int)test_s_us_per_tick<<2);
           Serial.print("Difference (ticks * upt) - (millis * 1000) =");
           Serial.print((unsigned int) ((myticks * test_us_per_tick) - (mymillis * 1000L)));
           Serial.print(" or 0x");
           printlnlong((myticks * test_us_per_tick) - (mymillis * 1000L));
           Serial.println(" ");  
       }
       else if (cmd == ' ') {
           Serial.println(" ");
       }
       else if (cmd == 'r') {
           timer = millis() + time;
       }
       else {
           Serial.print("Unknown Command: ");
           Serial.println(cmd, BYTE);
       }
   }
}

mellis

westfw: do you want to post an updated version that works at 8 MHz as well?  And create the #ifdefs for choosing between the optimizing and non-optimized?

I'm still not entirely convinced it's worth the trouble of having two (or more) separate routines.  How much time and space does the optimized version save?  How often does that matter?

westfw

The new code seems to save 44 bytes over the top-of-svn-tree code (the conditional version uses that code when F_CPU doesn't work out right for the new algorithm.)  Plus a bit more in terms of cpu cycles, since the old code will loop sometimes (and it looks like it loops more when the CPU frequency is lower.)  That's not a huge amount of savings in general, but it IS an ISR wew're talking about, and it's pretty significant for one function...

Code: [Select]
#define TEST_NEW_TIMER 0

/*
* Count milliseconds since initialization.  Use the timer0 overflow.
* Timer0 prescaler is set to 64, and overflows every 256 counts, so:
*    ((1/F_CPU) * 256 * 64)  =  seconds per tick.
*    ((1/F_CPU) * 256 * 64) * 1e6 =  microseconds per tick.
* But we only have integers, so rearrange:
*     (64e6/F_CPU) * 256 = us/tick
* or  (1024e6/F_CPU) * 16
*/
#define _US_PER_TICK ((1024000000L/F_CPU)*16)
#define MS_PER_TICK (_US_PER_TICK/1000)
/*
* Now, to fit at least one millisecond worth of microseconds in a single
* byte, we need to scale by  factor of at least 4 (1 count = 4 uS)
*/
#define S_US_PER_MS (1000>>3)      /* scaled microsecs per millisec */
#define S_US_PER_TICK ((_US_PER_TICK % 1000)>>3)

#ifdef TEST_NEW_TIMER  // Assign some values to variables so they can be seen.
unsigned long test_total_ticks = 0;
unsigned long test_us_per_tick = _US_PER_TICK;
unsigned char test_ms_per_tick = MS_PER_TICK;
unsigned char test_s_us_per_tick = S_US_PER_TICK;
#endif
/*
* Check whether this will cause errors; all the math should reverse
*/
#if ((((MS_PER_TICK*1000)+(S_US_PER_TICK<<3)) /16) * F_CPU) != 1024000000
/*
* generic but slower/bigger code for "odd" CPU frequencies.
*/
#warning Odd CPU frequency; bigger and slower code in SIG_OVERFLOW

volatile unsigned long timer0_clock_cycles = 0;
volatile unsigned long timer0_millis = 0;

SIGNAL(SIG_OVERFLOW0)
{
#ifdef TEST_NEW_TIMER
   test_total_ticks++;
#endif
     // timer 0 prescale factor is 64 and the timer overflows at 256
     timer0_clock_cycles += 64UL * 256UL;
     while (timer0_clock_cycles > clockCyclesPerMicrosecond() * 1000UL) {
           timer0_clock_cycles -= clockCyclesPerMicrosecond() * 1000UL;
           timer0_millis++;
     }
}

unsigned long millis()
{
     unsigned long m;
     uint8_t oldSREG = SREG;
     
     // disable interrupts while we read timer0_millis or we might get an
     // inconsistent value (e.g. in the middle of the timer0_millis++)
     cli();
     m = timer0_millis;
     SREG = oldSREG;
     
     return m;
}

#else // FAST_SIG_OVERFLOW
#define FAST_SIG_OVERFLOW 1

volatile unsigned long timer0_millisecs;
static unsigned char timer0_smicrosecs; //scaled microseconds

SIGNAL(SIG_OVERFLOW0)
{
   unsigned long mstmp = timer0_millisecs;
   unsigned char ustmp = timer0_smicrosecs;

#ifdef TEST_NEW_TIMER
   test_total_ticks++;
#endif
   ustmp += S_US_PER_TICK; /* increment microsecond fraction */
   if (ustmp >= S_US_PER_MS) {
     ustmp -= S_US_PER_MS;
     mstmp += 1;
   }
   mstmp += MS_PER_TICK;            /* increment milliseconds */
   timer0_smicrosecs = ustmp;
   timer0_millisecs = mstmp;
}

unsigned long millis()
{
   unsigned long m;
   uint8_t oldSREG = SREG;

   // disable interrupts while we read timer0_millis or we might get an
   // inconsistent value (e.g. in the middle of the timer0_millis++)
   cli();
   m = timer0_millisecs;
   SREG = oldSREG;
   return m;
}

#if 0
unsigned long micros()
{
   unsigned long m;
   uint8_t oldSREG = SREG;

   // disable interrupts while we read timer0_millis or we might get an
   // inconsistent value (e.g. in the middle of the timer0_millis++)
   cli();
   m = (timer0_millisecs * 1000) + (timer0_smicrosecs<<3);
   m += ((unsigned long)TCNT0) * 1000000 / (F_CPU/64);
   SREG = oldSREG;
   return m;
}
#endif
#endif // FAST SIG_OVERFLOW

Go Up