How to optimize for speed (-On) instead of for size (-Os)

Hi all,

I've noticed that Arduino makes avr-gcc compile binaries optimized for size (-Os).

I'm designing a project with an ATmega running at ultra-low speed (128kHz), and I'm interested in changing the optimization flag to -O2.

Does anyone know if this is possible, or is it hardcoded in the Arduino binary?


You could always turn on verbose output, copy the commands into a script, edit them the way you want, and run the script.

Because most AVR instructions take 1 cycle, and only some take 2 or more, optimizing for size usually gives results very similar to optimizing for speed. If you do get the option switched somehow, I'd be very interested in hearing about any actual differences you find!

True, with regard to instruction cycles. I'd also like to try things such as -funroll-loops; for some math-heavy applications it could bring a nice boost (encryption comes to mind).

I'll see if I can find anything. I was hoping I could make it 'stick' so I wouldn't have to leave the Arduino environment every time.

Have a look at template metaprograms. TMP allows you to create code that generates an unrolled loop.

This is one I created for a library. I provide the small loop based version as default, and if I have space left over I can toggle this and unroll my loops.

      // General case: copy element _Iterations - 1, then recurse into the next smaller case.
      template< int _Iterations > struct FILL_PINS{ 
        static inline void RUN( PinInfo *p_Info, const PinBlock &p_Block ){
          p_Info[  _Iterations - 1 ] = p_Block.b_Pins[ _Iterations - 1 ];
          FILL_PINS<  _Iterations - 1 >::RUN(  p_Info, p_Block );
        }
      };
      // Base case: nothing left to copy, so the recursion ends here.
      template<> struct FILL_PINS< 0 >{ static inline void RUN( PinInfo *p_Info, const PinBlock &p_Block ){ return; } };

This produces code like:

      p_Info[  7 ] = p_Block.b_Pins[ 7 ];
      p_Info[  6 ] = p_Block.b_Pins[ 6 ];
      p_Info[  5 ] = p_Block.b_Pins[ 5 ];
      p_Info[  4 ] = p_Block.b_Pins[ 4 ];
      p_Info[  3 ] = p_Block.b_Pins[ 3 ];
      p_Info[  2 ] = p_Block.b_Pins[ 2 ];
      p_Info[  1 ] = p_Block.b_Pins[ 1 ];
      p_Info[  0 ] = p_Block.b_Pins[ 0 ];

instead of the runtime loop below:

  template< const byte _PinCount >       
    void Parallel< _PinCount >::SetPins( PinInfo *p_InfoPtr, const PinBlock &p_Block, const unsigned int i_Underhang ){
        if( i_Underhang ) p_InfoPtr -= i_Underhang;
        // Count b_Index down to 0, copying one pin per pass; the loop body itself is empty.
        for( byte b_Index = 8 - i_Underhang ; b_Index-- ; *p_InfoPtr-- = p_Block.b_Pins[ b_Index ] );
    }

The usage is similar too:

SetPins( &this->p_Pins[ 7 ], p_1to8 ); //compiled loop
FILL_PINS< 8 >::RUN( this->p_Pins + 7, p_1to8 ); //TMP version
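If anyone wants to try the idea outside the library, here is a minimal self-contained sketch. PinInfo and PinBlock below are stand-ins I made up for illustration (the real library types are not shown in this thread), and FillPins and fillPinsDemo are hypothetical names.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical stand-ins for the library's types (assumptions, not the real definitions).
typedef uint8_t PinInfo;
struct PinBlock { PinInfo b_Pins[ 8 ]; };

// General case: copy element N - 1, then recurse into the N - 1 case.
template< int N > struct FillPins {
  static inline void run( PinInfo *p_Info, const PinBlock &p_Block ) {
    p_Info[ N - 1 ] = p_Block.b_Pins[ N - 1 ];
    FillPins< N - 1 >::run( p_Info, p_Block );
  }
};

// Base case: nothing left to copy, so the recursion stops here.
template<> struct FillPins< 0 > {
  static inline void run( PinInfo *, const PinBlock & ) {}
};

// Small self-check: copies 8 pins and reports whether all of them match.
inline bool fillPinsDemo() {
  PinBlock block = { { 10, 11, 12, 13, 14, 15, 16, 17 } };
  PinInfo pins[ 8 ] = { 0 };
  FillPins< 8 >::run( pins, block );
  bool ok = true;
  for ( int i = 0; i < 8; ++i ) ok = ok && ( pins[ i ] == block.b_Pins[ i ] );
  assert( ok );
  return ok;
}
```

Note that this sketch passes the start of the array, since the template indexes upward from p_Info[ 0 ]. Whether the compiler really flattens the recursion into straight-line stores depends on inlining, so it is worth checking the generated assembly.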

1) 128 kHz is not super slow. Super slow is 32 kHz plus the full clock prescaler ;)
2) If you do not want to go all the way to templates, you can also use preprocessor macros for loop unrolling.
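One possible shape for macro-based unrolling, as a rough sketch only: the REPEAT_* macros and copy8 are hypothetical names invented for this example, not something from an existing library.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical helper macros (names are assumptions): each level pastes its
// argument twice as many times as the level below it.
#define REPEAT_2( stmt ) stmt stmt
#define REPEAT_4( stmt ) REPEAT_2( stmt ) REPEAT_2( stmt )
#define REPEAT_8( stmt ) REPEAT_4( stmt ) REPEAT_4( stmt )

// Copies 8 bytes with no loop counter or branch: the preprocessor pastes the
// statement 8 times before the compiler ever sees it.
inline void copy8( uint8_t *dst, const uint8_t *src ) {
  REPEAT_8( *dst++ = *src++; )
}

// Small self-check: copies 8 bytes and reports whether all of them match.
inline bool copy8Demo() {
  const uint8_t src[ 8 ] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  uint8_t dst[ 8 ] = { 0 };
  copy8( dst, src );
  bool ok = true;
  for ( int i = 0; i < 8; ++i ) ok = ok && ( dst[ i ] == src[ i ] );
  assert( ok );
  return ok;
}
```

The statement passed to the macro must not contain a top-level comma, otherwise the preprocessor treats it as several arguments.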

Doesn't gcc have a pragma to set command-line options from source these days? Not quite as slick as #pragma optimize from MSVC, but pretty workable last I used it. With luck, that works on avr-gcc too. Google for it!
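For reference, a sketch of what this looks like in GCC: the optimize function attribute and the #pragma GCC optimize / push_options / pop_options pragmas. As far as I know these arrived in GCC 4.4, so whether they work depends on the avr-gcc version bundled with your Arduino install; I have not tested them there. The functions sum16 and xor8 are made-up examples. The GCC manual also describes the attribute as intended mainly for debugging, so treat the result as an experiment.

```cpp
#include <cassert>
#include <stdint.h>

// Function-level override: ask GCC to compile just this function with -O2 and
// -funroll-loops, regardless of the -Os given on the command line.
__attribute__(( optimize( "O2", "unroll-loops" ) ))
uint16_t sum16( const uint8_t *data, uint8_t count ) {
  uint16_t total = 0;
  for ( uint8_t i = 0; i < count; ++i ) total += data[ i ];
  return total;
}

// Region-level alternative: everything between push_options and pop_options
// is compiled with -O2; the previous options are restored afterwards.
#pragma GCC push_options
#pragma GCC optimize ( "O2" )
uint8_t xor8( const uint8_t *data, uint8_t count ) {
  uint8_t acc = 0;
  for ( uint8_t i = 0; i < count; ++i ) acc ^= data[ i ];
  return acc;
}
#pragma GCC pop_options
```

Comparing the disassembly (avr-objdump -d) with and without the attribute is probably the quickest way to see whether the option actually took effect.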

That would be good to find. I also want to see if it allows C++0x; I have not yet had a chance to try the new C++ features.

This Instructable may help.