Self-modifying code on Arduino?

Is it reasonable to write SMC for Arduino? I have heard there are limits on how many times you can write to program memory, which makes me a bit doubtful about this, but I would like to hear some feedback. Basically I'm thinking of a case where I would be writing to given positions in the code at 50 Hz.

Thanks, Jarkko

Search the forum some, this has been discussed before, and not too long ago IIRC.

I wasn't paying much attention, because it's a terrible practice anyway, but I think the bottom line is that only code in the bootloader section can write to program memory. So there might be some convoluted machinations that a person could go through to do it, but it's probably easier just to learn to program properly ;)

There are some optimizations in very performance intensive code that would be possible due to SMC, thus the question. Anyway, after searching it seems that there's a limit of 10k writes to program memory, so at 50 writes per second the board would last only about 3 minutes in my case /:
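The three-minute figure is just the rated endurance divided by the write rate. A minimal sketch of that arithmetic, assuming the 10k figure from the search and exactly one flash write per 50 Hz update (the function name is made up for illustration):

```cpp
#include <cstdint>

// Seconds until the rated flash endurance is used up, assuming every update
// rewrites the same flash page exactly once.
constexpr uint32_t seconds_until_worn(uint32_t rated_writes, uint32_t writes_per_second)
{
  return rated_writes / writes_per_second;
}

// 10k writes at 50 Hz are gone after 200 s, i.e. a bit over 3 minutes.
static_assert(seconds_until_worn(10000, 50) == 200, "10k writes at 50 Hz last 200 s");
```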

Well I won't disallow that that is a possibility. But yeah, if it's executing so much that it's a performance focus, then the flash is gonna wear out in a hurry. Plus I don't think writing to flash is all that fast. Maybe assembler is the best bet then?

What do you expect to get out of SMC that you can't get from data-driven code, perhaps a set of state machines?

I could store some of the constant values for given iteration of the loop into immediates that I otherwise have to fetch from memory.

You should make a study of Finite State Machines and C pointers, especially function pointers as well. Combine the two or just stick to state machines and your execution path can be totally responsive to events and process state. There's about nothing that SMC can do that beats that.
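For anyone following along, here is a minimal sketch of a state machine built on function pointers. The states, events, and names are made up for illustration and have nothing to do with the mixer code in this thread:

```cpp
#include <cstdint>

// Hypothetical events for a two-state machine.
enum event : uint8_t { EV_NOTE_OFF = 0, EV_NOTE_ON = 1 };

// Each state is a function that handles one event and returns the next state.
struct State;
typedef State (*StateFn)(uint8_t ev);
struct State { StateFn fn; };

State silent(uint8_t ev);
State playing(uint8_t ev);

State silent(uint8_t ev)  { State next = { ev == EV_NOTE_ON  ? &playing : &silent  }; return next; }
State playing(uint8_t ev) { State next = { ev == EV_NOTE_OFF ? &silent  : &playing }; return next; }

// Feed one event to the machine and move to whatever state it selects.
inline void step(State &s, uint8_t ev) { s = s.fn(ev); }
```

The execution path changes by swapping one pointer in RAM, so nothing in program memory is ever rewritten.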

OTOH I could be the one that doesn't know what you mean by SMC instead of you.

Perhaps if you start with telling what you are trying to do and how you expect to get there using SMC?

I'm well aware of FSMs and function pointers, and they don't help here. It's a bit of a moot point to discuss SMC since it doesn't seem to be an option on Arduino, but: I have audio mixer code that's very time-intensive. I'd like to be able to mix several 8-bit audio channels at ~48 kHz and run the code on an Arduino Uno @ 16 MHz. The mixer loop is executed at 50 Hz, so each time the mixer loop is called it mixes 960 samples. While mixing those samples there are various values for each audio channel that remain constant: sample pointer, sample speed, volume, sample end & loop length. Sample position is basically the only thing that changes from sample to sample. Currently when mixing samples I need to fetch all these audio channel parameters from memory for each mixed sample. With SMC I could write these audio channel constants into the code before executing the loop and save a lot of cycles for data fetching, and also save registers to keep more channel data in the registers.

Currently I'm writing the mixer loop in asm since GCC doesn't optimize the code too well. The current mixer is able to mix only 4 audio channels @ 20 kHz, so I'd like to push this to more channels and a higher frequency.

In my near-ignorance of such things, that sounds like a whole lot to ask of an 8-bit 16MHz processor. What about an ARM processor? 32 bits, faster clocks, and some have DSP instructions...

If there’s enough room in a core, a Propeller might be a good choice as well. The code you load is from off-chip and can be determined at run time.

I don't think that's Jarkko's point. Sometimes it's more interesting to work within the constraints than solve the problem by throwing more hardware at it. Constraints bring out the creativity in us.

Jarkko, have you considered that the audio mixing frequency is much higher than the playroutine update frequency? You could perhaps reorganize the code so that you mix a batch in sync with the playroutine, so that a batch never crosses a playroutine update boundary. This way the channels always have fixed parameters during a batch and the parameters could be kept in registers.

Another idea: I think you don't need an audio mixing buffer at all, you could do the mixing directly in the timer interrupt. This would save a few cycles, but the downside is that there's not much uninterrupted CPU time left for the playroutine...

After rereading your post I realized you are already doing this. Sorry for the nonsense!

Ok, so the problem is that you don't have enough registers to keep all those 16-bit params in registers.

There are some optimizations in very performance intensive code that would be possible due to SMC,

...which would be beaten into insignificance by the length of time it took to write pages of program memory (even assuming infinite erase/write cycles)

Ok, after having my morning coffee, I think I can finally post something sensible.

At 16 MHz and with a 48 kHz sampling frequency you have about 333 cycles per output sample. With four channels, that’s about 83 cycles per channel. This is assuming that the MCU does nothing else. In practice of course the playroutine and timer interrupt steal cycles.
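The budget is plain integer division. A small sketch (the helper names are mine, not from any library) that also shows the budget for the current 4-channel, 20 kHz mixer:

```cpp
#include <cstdint>

// CPU cycles available for each output sample.
constexpr uint32_t cycles_per_sample(uint32_t cpu_hz, uint32_t sample_hz)
{
  return cpu_hz / sample_hz;
}

// CPU cycles available per channel per output sample, assuming the MCU does
// nothing but mix (no playroutine, no timer interrupt).
constexpr uint32_t cycles_per_channel(uint32_t cpu_hz, uint32_t sample_hz, uint32_t channels)
{
  return cpu_hz / sample_hz / channels;
}

static_assert(cycles_per_sample(16000000, 48000) == 333, "48 kHz on a 16 MHz AVR");
static_assert(cycles_per_channel(16000000, 48000, 4) == 83, "4 channels at 48 kHz");
static_assert(cycles_per_channel(16000000, 20000, 4) == 200, "4 channels at 20 kHz");
```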

I roughly cycle estimated the inner mixing loop. I think it’s possible to run it in less than 50 cycles per channel:

  • load channel params (4 x int16, 1 x int8 volume = 9 bytes); 18 cycles
  • compute sample address in flash; 2-4 cycles
  • load sample from flash; 3 cycles
  • advance sample position; 2 cycles, assuming sample pos is kept in register
  • apply volume and accumulate to result; ~5 cycles
  • handle looping; on average about 10 (?) cycles because looping can be skipped most of the time

That’s a little over 40 cycles (+ misc stuff I forgot :))
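To double-check the total, the per-step estimates above can be added up in code. These are my rough guesses, not measurements, and the struct and function names are made up for this sketch:

```cpp
#include <cstddef>
#include <cstdint>

// One row per step of the estimate; min and max differ only where a range
// was given (the 2-4 cycle address computation).
struct step_cost { const char *what; uint8_t min_cycles; uint8_t max_cycles; };

static const step_cost k_steps[] = {
  {"load channel params (9 bytes)",   18, 18},
  {"compute sample address in flash",  2,  4},
  {"load sample from flash",           3,  3},
  {"advance sample position",          2,  2},
  {"apply volume and accumulate",      5,  5},
  {"handle looping (rough average)",  10, 10},
};

// Sum either the optimistic or the pessimistic column.
inline unsigned total_cycles(bool pessimistic)
{
  unsigned sum = 0;
  for (size_t i = 0; i < sizeof(k_steps) / sizeof(k_steps[0]); ++i)
    sum += pessimistic ? k_steps[i].max_cycles : k_steps[i].min_cycles;
  return sum;
}
```

That gives 40 to 42 cycles per channel before the forgotten misc stuff, against a budget of about 83.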

If you can keep channel params in registers at least for some channels, it will save a lot of cycles.

Of course, the playroutine and timer interrupt need some time too.

In practice, the problem might be that the timer interrupt running at 48 kHz will slow down the mixing routine too much. You might need to write it in asm too. And if it’s still too slow, you can do away with the interrupt and interleave the audio output with everything else, i.e. use the out instruction every 333rd cycle to write to PORTD. This would be really painful to cycle count and code (been there) but would save a lot of cycles.

Overall, this sounds like a really fun project! :)

In my view, rather than trying to modify the code, your time would be better spent helping the compiler optimise your code for performance. To start with, you will probably want to change the compilation options to optimise for performance rather than size.

With SMC I could write these audio channel constants into the code before executing the loop and save a lot of cycles for data fetching, and also save registers to keep more channel data in the registers.

Why not keep these "audio channel constants" in EEPROM? Is reading them from EEPROM going to be any different from reading them from flash?

That misses the point, constants compiled into the code are inserted into the actual instructions as immediate constant fields, potentially. The read-decode-execute hardware is pipelined and costs as little as one cycle per instruction, whereas accessing EEPROM takes many cycles.

The bottom line is that the AVR Harvard architecture with code in flash is simply not able to support compiling inline or SMC. (Apart from anything else, writing flash is done a page at a time, not an instruction at a time.)

ARM based Arduinos would be the answer (and anyway are faster which is the basic issue in the first place - an 8-bit AVR MCU is not a DSP chip...).

Reading from EEPROM takes 5-6 clock cycles; there is a nice compact example in the datasheet.
I don’t know how these 5-6 cycles (about 0.3-0.4 µs at 16 MHz) would impact the overall processing time.

unsigned char EEPROM_read(unsigned int uiAddress)
{
  /* Wait for completion of previous write */
  while(EECR & (1<<EEPE))
    ;
  /* Set up address register */
  EEAR = uiAddress;
  /* Start eeprom read by writing EERE */
  EECR |= (1<<EERE);
  /* Return data from Data Register */
  return EEDR;
}

"The EEPROM Read Enable Signal EERE is the read strobe to the EEPROM. When the correct address is set up in
the EEAR Register, the EERE bit must be written to a logic one to trigger the EEPROM read. The EEPROM read
access takes one instruction, and the requested data is available immediately. When the EEPROM is read, the
CPU is halted for four cycles before the next instruction is executed.

The user should poll the EEPE bit before starting the read operation. If a write operation is in progress, it is neither
possible to read the EEPROM, nor to change the EEAR Register."

MarkT:

That misses the point, constants compiled into the code are inserted into the actual instructions as immediate constant fields, potentially. The read-decode-execute hardware is pipelined and costs as little as one cycle per instruction, whereas accessing EEPROM takes many cycles.

Are these different sets so large and/or diverse that code for each set would be too big to fit, yet quickly transferable blocks would not? SMC would have to pick and choose, then write that along with the rest of the code to flash (which AVRs can do, just not with the existing Arduino bootloader code; AVR-Forth certainly does run-time flash writes) more efficiently than other ways? I really doubt it.

VLSI Solutions make DSP chips with built-in MCU and GPIO pins. Their programming code is free. Look into a VS1053 breakout board for example, lots of room for customizing and in quantity the chips are economic. You can shoehorn a project into something tight but that doesn't necessarily make the best product and certainly not the best use of development resources. Chances are they may have something that does the job or close and modifiable.

VLSI Solutions is a Finnish company BTW, with very good engineers and help.

Like PetriH said, this is more of an optimization challenge/exercise for the 8-bit AVR and trying to push the Uno to its limits (I’ve got a Teensy 3.0 I could use but that’s beside the point). GCC isn’t really able to squeeze the cycles out of the C++ code: even the latest AVR GCC version, 4.7.2, needs 94 cycles for the inner loop, while my hand-optimized asm is currently at 55 cycles, although there are still some bugs.

Something I have been thinking of doing is changing the loop to mix one channel at a time into a 16-bit buffer instead of mixing all channels at once into an 8-bit buffer, which would help keep the channel data in registers without constantly fetching it from memory. It’ll consume twice the buffer memory, but I think I should be able to handle it (e.g. halve the buffer length and call the mixing loop twice as frequently).
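A rough C++ sketch of that idea, just to show the shape of the loop. The struct is a simplified stand-in for audio_channel (the names are hypothetical), the sample data is read from RAM instead of flash so it runs on a PC, and sample end and loop handling are left out entirely:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified, hypothetical channel layout for this sketch only.
struct channel_sketch
{
  const int8_t *sample;   // sample data (RAM here; flash on the AVR)
  uint32_t sample_pos;    // 16.8 fixed point, as in the mixer loop
  uint16_t sample_speed;  // 8.8 fixed point
  uint8_t volume;
};

// Add one channel to a 16-bit accumulation buffer. Position, speed, and
// volume are copied to locals once, so they can live in registers for the
// whole batch instead of being reloaded for every sample.
void mix_channel(channel_sketch &ch, int16_t *acc, size_t count)
{
  const int8_t *smp = ch.sample;
  uint32_t pos = ch.sample_pos;
  const uint16_t speed = ch.sample_speed;
  const uint8_t vol = ch.volume;
  for (size_t i = 0; i < count; ++i)
  {
    acc[i] += (smp[pos >> 8] * vol) >> 8;  // same scaling as the C++ mixer
    pos += speed;
  }
  ch.sample_pos = pos;  // write the position back once per batch
}
```

A final pass would then scale and clamp the 16-bit accumulators down to the 8-bit output buffer.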

Regarding EEPROM, that would be no improvement over reading the data from RAM, like I’m doing now. These audio channel constants are not constant over the lifetime of the entire program, only for the 1/50th of a second while the mixing loop is running. I used this SMC technique for this purpose a long time ago on a 286, but it has a different memory model, which makes it a feasible solution on that platform.

If anyone is curious, here’s the mixer loop I’m optimizing (the optimized asm and the original C++ implementations):

void mod_player::mix_buffer_batch()
{
  // mix batch of samples
  uint8_t *buf=m_buffer+(m_buffer_batch_write_idx?buffer_batch_size:0), *buf_end=buf+buffer_batch_size;
  m_buffer_batch_write_idx^=1;
  audio_channel *channel_begin=m_channels, *channel_end=m_channels+modplayer_max_channels;
  asm volatile
  (
    "sample_mix: \n\t"
    "ldi r16, %[center_lo] \n\t" // r16-17 = res
    "ldi r17, %[center_hi] \n\t"
    "ldi r18, %[num_channels] \n\t"
    "movw r28, %[channel_begin] \n\t" // Y=[channel_begin]
    "channel_mix: \n\t"
    "ld r19, Y+ \n\t"       // r19-21 = sample_pos (16.8fp)
    "ld r20, Y+ \n\t"
    "ld r21, Y+ \n\t"
    "ld r0, Y+ \n\t"
    "ld r30, Y+ \n\t"       // Z(r30-31) = sample_addr
    "ld r31, Y+ \n\t"
    "add r30, r20 \n\t"     // Z = sample_addr + sample_pos>>8
    "adc r31, r21 \n\t"
    "lpm r22, Z\n\t"        // r22 = smp
    "ld r23, Y+ \n\t"       // r23 = vol
    "mulsu r22, r23 \n\t"   // r0-r1 = smp*vol
    "mov r0, r1 \n\t"       // res+=(smp*vol)>>8
    "lsl r1 \n\t"           //   ...
    "sbc r1, r1 \n\t"       //   ...
    "add r16, r0 \n\t"      //   ...
    "adc r17, r1 \n\t"      //   ...
    "eor r0, r0 \n\t"       //   ...
    "ld r22, Y+ \n\t"       // r22-r23 = sample_speed (8.8fp)
    "ld r23, Y+ \n\t"
    "add r19, r22 \n\t"     // sample_pos+=sample_speed
    "adc r20, r23 \n\t"     //  ...
    "adc r21, r0 \n\t"      //  ...
    "ld r22, Y+ \n\t"       // r22-r23 = sample_end
    "ld r23, Y+ \n\t"       //  ...
    "brcs sample_end \n\t"  // if(sample_pos>=sample_end) goto sample_end;
    "cp r20, r22 \n\t"      //  (carry from the add above means overflow)
    "cpc r21, r23 \n\t"     //  ...
    "brcc sample_end \n\t"  //  cp/cpc set C when pos<end, so branch on C clear
    "sbiw r28, 11 \n\t"     // store sample pos back to memory
    "next_channel: \n\t"
    "st Y+, r19 \n\t"       //  ...
    "st Y+, r20 \n\t"       //  ...
    "st Y+, r21 \n\t"       //  ...
    "adiw r28, %[channel_size]-3 \n\t" // proceed to the next channel
    "dec r18 \n\t"
    "brne channel_mix \n\t" //  ...
    "asr r17 \n\t"          // res>>=2;
    "ror r16 \n\t"
    "asr r17 \n\t"
    "brne clamp_res \n\t"
    "ror r16 \n\t"
    "st X+, r16 \n\t"
    "cp r26, %[buf_end] \n\t"
    "cpc r27, %B2 \n\t"
    "brne sample_mix \n\t"
    "jmp mix_end \n\t"

    "clamp_res: \n\t"       // res=res<0?0:255;
    "lsl r17 \n\t"          //  ...
    "sbc r16, r16 \n\t"     //  ...
    "com r16 \n\t"          //  ...
    "st X+, r16 \n\t"
    "cp r26, %[buf_end] \n\t"
    "cpc r27, %B2 \n\t"
    "brne sample_mix \n\t"
    "jmp mix_end \n\t"

    "sample_end: \n\t"
    "ld r22, Y+ \n\t"       // r22-23 = loop_len
    "ld r23, Y+ \n\t"       //  ...
    "sbiw r28, 13 \n\t"
    "sub r20, r22 \n\t"     // sample_pos-=loop_len;
    "sbc r21, r23 \n\t"     //  ...
    "or r22, r23 \n\t"      // if(loop_len) goto next_channel;
    "brne next_channel \n\t"
    "clr r19 \n\t"
    "clr r20 \n\t"
    "clr r21 \n\t"
    "std Y+6, r0 \n\t"      // volume = 0
    "std Y+7, r0 \n\t"      // sample_speed = 0
    "std Y+8, r0 \n\t"      //  ...
    "rjmp next_channel \n\t"

    "mix_end: \n\t"
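    "clr r1 \n\t"           // restore __zero_reg__ (avr-gcc expects r1==0)
    :[buf] "+x" (buf)       // X is advanced by "st X+", so it is read-write
    :[channel_begin] "r" (channel_begin)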
    ,[buf_end] "r" (buf_end)
    ,[num_channels] "I" (modplayer_max_channels)
    ,[center_lo] "I" ((modplayer_max_channels*0x80)&0xff)
    ,[center_hi] "I" ((modplayer_max_channels*0x80)>>8)
    ,[channel_size] "I" (sizeof(audio_channel))
    :"r16", "r17", "r18", "r19", "r20", "r21", "r22", "r23", "r24", "r25", "r28", "r29", "r30", "r31", "memory"
  );
}
//----

#if 0
void mod_player::mix_buffer_batch()
{
  // mix batch of samples
  uint8_t *buf=m_buffer+(m_buffer_batch_write_idx?buffer_batch_size:0), *buf_end=buf+buffer_batch_size;
  m_buffer_batch_write_idx^=1;
  audio_channel *channel_begin=m_channels, *channel_end=m_channels+modplayer_max_channels;
  do
  {
    // mix sample and advance all channels
    int16_t res=0x80*modplayer_max_channels;
    audio_channel *channel=channel_begin;
    do
    {
      // mix channel sample and advance sample position
      int8_t smp=(int8_t)pgm_read_byte(channel->sample+(channel->sample_pos>>8));
      uint8_t vol=channel->volume;
      res+=(smp*vol)>>8;
      channel->sample_pos+=channel->sample_speed;
      if((channel->sample_pos>>8)>=channel->sample_end)
      {
        channel->sample_pos-=long(channel->loop_len)<<8;
        if(!channel->loop_len)
        {
          channel->sample_pos=0;
          channel->sample_speed=0;
          channel->volume=0;
        }
      }
    } while(++channel!=channel_end);

    // clip sample and write it to the buffer
    res/=modplayer_max_channels;
    *buf++=res<0?0:res>0xff?0xff:res;
  } while(buf!=buf_end);
}
//----
#endif