Some performance-related questions

I have a few questions regarding performance on the Due.

Are there any compiler flags, apart from -O3, that help with performance in general on the Due?

I saw that gcc has options to supply the L1 and L2 cache sizes. How much cache does the SAM3X8E processor have? I could not find it in the datasheet.

The SAM3X is a 32-bit processor, so are int operations generally faster than byte or short operations?

Are there any interrupts or other background tasks that I can switch off while executing code that requires high performance?

doors666:
I saw that gcc has options to supply the L1 and L2 cache sizes. How much cache does the SAM3X8E processor have? I could not find it in the datasheet.

No cache. The SRAM is described in section 9.1.1 (or you could say that all the RAM is cache memory and there's no main memory, unless you are driving external memory).

The SAM3X is a 32-bit processor, so are int operations generally faster than byte or short operations?

ARM Cortex-M3 processors are documented by ARM; you can go and study the instruction set if you need to. Normally in C a shorter type isn't any slower, because the arithmetic is done as int and the result is only truncated when it is written to memory.
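To make the promotion rule above concrete, here is a minimal sketch. The function names are made up for illustration; the only thing it relies on is the standard C/C++ integer promotion of 8-bit operands to int.

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: both 8-bit operands are promoted to int before the
// addition, so the sum is computed at full register width.
inline int add_promoted(uint8_t a, uint8_t b) {
  return a + b;  // int arithmetic, no 8-bit wrap-around
}

// Hypothetical helper: same addition, but the int result is narrowed back
// to 8 bits when it is returned (the "truncated when written" step).
inline uint8_t add_truncated(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>(a + b);  // keeps only the low 8 bits
}

// Small self-check that can be called from setup() or from a host program.
inline void promotion_demo() {
  assert(add_promoted(200, 100) == 300);  // computed as int
  assert(add_truncated(200, 100) == 44);  // 300 modulo 256
}
```

The narrowing step is the only place where the smaller type costs anything, and it only happens when the value has to be squeezed back into 8 bits.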

Is there a way to place a particular function in SRAM and run it from there? I thought it was not possible because of the Harvard architecture, but I saw this in the TRM (on page 89):
SRAM - Instruction fetches and data accesses are performed over the system bus.
So it is possible to execute code from SRAM. How do I do this?

I would also like to try assembly for the loop function. Is it possible to write a separate assembly file containing the loop function, run gas on it and link the generated .o file, rather than mucking around with the clumsy syntax of gcc's inline assembly? Does anyone have some sample code that can get me started on this?

You'll very likely achieve lower performance by placing your code into the RAM, because all of the SAM3's RAM is located above address 0x20000000.

The Cortex-M3 CPU has 2 buses. The code bus is used for all memory access from 0x00000000 to 0x1FFFFFFF, which is the flash memory and boot ROM on this chip. The system bus is used for 0x20000000 to 0xDFFFFFFF, which is the RAM and peripherals. Higher addresses are generally reserved for stuff built into the CPU core, which doesn't involve data transfer on these 2 buses that link the CPU to the rest of the chip.
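As a rough illustration of that address split, the sketch below maps an address to the bus that would carry the access. It follows the simplified three-way map described above, not the full ARM memory map, and the names Bus and bus_for_address are invented for this example.

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical names, simplified map taken from the description above.
enum class Bus { Code, System, CoreInternal };

// Returns which bus carries an access to the given address.
constexpr Bus bus_for_address(uint32_t addr) {
  return addr < 0x20000000u ? Bus::Code            // flash and boot ROM
       : addr < 0xE0000000u ? Bus::System          // RAM and peripherals
                            : Bus::CoreInternal;   // built into the CPU core
}

// Small self-check with one address from each region.
inline void bus_demo() {
  assert(bus_for_address(0x00080000u) == Bus::Code);
  assert(bus_for_address(0x20000000u) == Bus::System);
  assert(bus_for_address(0xE000E010u) == Bus::CoreInternal);
}
```

Any function placed in RAM lands in the Bus::System range, the same range its data and stack live in, which is where the contention described next comes from.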

When executing from flash, instructions like load, store, branch+link (ARM's version of a call instruction) and interrupt entry/exit leverage both buses. While your load instruction is bringing data into the chip over the system bus, the code bus is bringing the next opcodes into the CPU prefetch buffer to keep the pipeline filled. Likewise, during interrupt entry, the code bus begins fetching the opcodes of the ISR while the system bus is saving registers to the stack.

If you locate code in the RAM, you'll cause bus contention. Any possible savings you might have hoped to achieve by using lower latency RAM will be destroyed by forcing the CPU to wait for access to the single system bus. Two buses operating in parallel really are dramatically faster. The flash memory is also pretty well optimized, with a wide path that provides a lot of bandwidth and a small cache memory between the flash and the CPU's prefetch buffer.

On the Freescale chips used on Teensy 3 (full disclosure: I'm the author of Teensyduino) a portion of the RAM is below the 0x20000000 boundary, so you can do this sort of optimization. But it's a mixed blessing, since less of the RAM is accessible from the system bus that doesn't conflict with the code bus. It's important (for performance) to always locate the stack on memory accessed by the system bus. Atmel chose to locate all their RAM above 0x20000000 on these SAM3 parts.

Just to prove the above point:

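// Runs from flash (the default placement).
// The result is returned and printed, so the loop cannot be optimized away.
// Note that r overflows, which is formally undefined for a signed int.
int testrom(){
  int r=0;
  for(int x=1;x<=1000000;x++){r*=x;r--;}
  return r;
}

// Same loop, placed in the .ramfunc section so that it executes from SRAM.
int testram() __attribute__ ((section(".ramfunc")));
int testram() {
  int r=0;
  for(int x=1;x<=1000000;x++){r*=x;r--;}
  return r;
}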


void setup() {

  Serial.begin(115200);
}

void loop() {
  unsigned long t;  // micros() returns unsigned long
  int r;

  // Time the flash version.
  t=micros();
  r=testrom();
  t=micros()-t;
  Serial.print("Result: ");Serial.println(r);
  Serial.print("Time ROM: ");Serial.println(t);

  // Time the RAM version.
  t=micros();
  r=testram();
  t=micros()-t;
  Serial.print("Result: ");Serial.println(r);
  Serial.print("Time RAM: ");Serial.println(t);
}

Output:
Result: -1628288769
Time ROM: 107012
Result: -1628288769
Time RAM: 131112

The RAM function is over 20% slower.
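For anyone who wants to check the figure, the slowdown can be worked out from the two timings printed above. This is just the arithmetic, with a helper name (slowdown_percent) made up for the example.

```cpp
#include <assert.h>

// Hypothetical helper: relative slowdown of slow_us versus fast_us, in percent.
constexpr double slowdown_percent(double fast_us, double slow_us) {
  return (slow_us - fast_us) / fast_us * 100.0;
}

// Self-check using the timings measured above (microseconds).
inline void slowdown_demo() {
  const double p = slowdown_percent(107012.0, 131112.0);
  assert(p > 22.5 && p < 22.6);  // (131112 - 107012) / 107012 is about 22.5%
}
```

The exact number will vary with compiler version and flags, so treat it as a measurement from this one setup.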