SAM3: How many clock cycles per instruction?

Hi guys,
after looking through the datasheet for quite a while I still haven't figured out how many clock cycles per instruction the ARM core needs. All I found is the statement "3-stage pipeline" in the chapter about the Cortex implementation.
So is this three cycles for a "NOP" instruction? And what about the different memory access instructions?

As far as I know, PIC microcontrollers need 4 oscillator cycles per instruction and AVRs only one. Does that mean the SAM3, when handling byte-size content at 84MHz (divided by three, so 28MHz), is only about 40% faster than our old friend, the 328P at 20MHz?

It depends on the instruction. Most are single-cycle, though.

As Leon said, most instructions are single cycle on the Cortex chip. However, it is even better than that. It must have two execution pipelines, because it is rated at 1.25 MIPS per MHz. So 84MHz * 1.25 = 105 MIPS. That makes it about five times faster than a 328P running at 20MHz. However, it is also natively 32-bit and so can crunch 32-bit numbers in a single cycle. So for 32-bit math it is probably around 6x faster again (for a total of about 30x faster at 32-bit math). If you can rework your algorithms so that operations on bytes are packed into 32-bit integers and processed four at a time, then you could see upwards of a 30x speed-up.
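To make that arithmetic easy to replay, here is a minimal sketch. The function names are made up for illustration, and the per-MHz figures are just the ones quoted in this thread (1.25 for the Cortex-M3, roughly 1.0 for the AVR), not measurements.

```cpp
#include <cassert>

// Rough throughput estimate: clock in MHz times a "MIPS per MHz" rating.
// The rating is a marketing/benchmark figure, so treat the result as a ballpark.
inline double estimated_mips(double clock_mhz, double mips_per_mhz) {
    assert(clock_mhz > 0.0 && mips_per_mhz > 0.0);
    return clock_mhz * mips_per_mhz;
}

// Ratio of two throughput estimates.
inline double speedup(double fast_mips, double slow_mips) {
    assert(slow_mips > 0.0);
    return fast_mips / slow_mips;
}
```

With these, `estimated_mips(84.0, 1.25)` gives 105 and `speedup(105.0, estimated_mips(20.0, 1.0))` gives 5.25, which is the "about five times" figure.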

But talking about processor speed is a very complicated topic. Take, for instance, the talk of pipelines. A three-stage pipeline means that each instruction goes through three stages (probably one per clock), but generally each stage can hold a different instruction, so it doesn't slow down the actual number of instructions processed. Under ideal circumstances this means that there are always three instructions in the pipe and things are working at one clock per instruction. But a cache miss or instruction dependency can stall or clear the pipe, causing it to need to reload. This wastes cycles. It is not really that big of a deal with a three-stage pipe. Modern processors can have pipelines over 20 stages deep, and there a pipe getting invalidated is a bad thing.
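A toy model of that effect, as a sketch only: it assumes one instruction retires per cycle once the pipe is full, and that every flush costs a full refill of `stages - 1` cycles. Real cores differ, and `pipeline_cycles` is a name invented here.

```cpp
#include <cassert>
#include <cstdint>

// Toy pipeline model (an assumption, not how any specific core behaves):
//  - the first instruction needs (stages - 1) cycles to travel down the pipe,
//  - after that one instruction completes per cycle,
//  - every flush throws the pipe away and costs another (stages - 1) cycles.
inline std::uint64_t pipeline_cycles(std::uint64_t instructions,
                                     std::uint64_t flushes,
                                     unsigned stages) {
    assert(stages >= 1);
    if (instructions == 0) {
        return 0;
    }
    const std::uint64_t refill = stages - 1;
    return refill + instructions + flushes * refill;
}
```

For 1000 instructions with 100 flushes, the 3-stage case comes out at 1202 cycles and a 20-stage pipe at 2919, which shows why flushes hurt deep pipelines so much more.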

Hello transistorfips,
Here is a brief summary of the Cortex-M3 processor, on which the SAM3X8E is based, taken from Joseph Yiu's book "The Definitive Guide to the ARM CORTEX-M3":

Many instructions, including multiplication, are single cycle. Separate data and instruction buses (Harvard architecture) allow simultaneous data and instruction accesses to be performed. Also, up to two instructions can be fetched in one cycle (they share the same memory space). Thanks to the Thumb-2 instruction set, there is no need to switch between 32-bit and 16-bit instruction states; both can be used together in one operation state (no state-switching overhead). In other words, saving both execution time and instruction space gives the Cortex-M3 processor higher performance efficiency.

If you are familiar with the legacy Arduino AVR family (Due, Mega, etc.), you will very soon notice the Due's louder roar. Regards!

The answer is complicated.

It's not at all like AVR, where every instruction always takes a fixed number of cycles and the most complex thing you need to worry about is branches that take 1 extra cycle if the branch is taken, or skip instructions that take an extra cycle to skip past a 2-word instruction.

ARM is far more complex.

The CPU runs much faster than the flash memory. Between the slow flash and fast instruction fetching is a small cache implemented within the flash controller, and a small buffer within the CPU. Often the needed data will already be in the buffer. Most instructions are 16 bits. The data bus between the CPU and flash is 32 bits. I believe in the SAM3X the flash bus feeding its cache is 128 bits. But the cache isn't huge, and there can be worst cases where the CPU needs to wait for the slow flash.

On top of that, the bus structure inside the chip allows for multiple bus masters. So far, it seems most of Due's code uses simple software polling where the CPU is always the bus master. But the SAM3X has 2 different types of DMA which can request bus access. The situation is far more involved than just a simple wait state inserted on a single bus. The CPU has 2 separate buses which feed into a structure of other buses to the various other areas of the chip, which lessens the odds of conflicts. But when 2 bus masters access the same internal bus, of course one must wait. I haven't studied the DMA controllers in Due in depth, but I can tell you the DMA that's in Teensy3 has many complex options, including some to trade-off performance vs hogging the bus. It's very complex.

Some memory accesses are special, sometimes requiring extra cycles. For example, writing to the bitband region is implemented by turning a single write from the CPU into an atomic read-modify-write operation. Of course that's implemented by hardware which adds wait states to the CPU side while it performs the 2 bus operations to the memory or peripheral.
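For reference, the alias address you write to for a given bit follows the standard Cortex-M3 mapping: alias base + byte offset * 32 + bit number * 4. A minimal sketch, assuming the usual Cortex-M3 memory map (1 MB bit-band regions at 0x20000000 for SRAM and 0x40000000 for peripherals, with the alias regions 0x02000000 above them); `bitband_alias` is just a name picked for this example.

```cpp
#include <cassert>
#include <cstdint>

// Compute the bit-band alias word address for one bit of a byte.
// Assumes the standard Cortex-M3 layout; check the chip's datasheet
// before relying on it for a particular peripheral.
inline std::uint32_t bitband_alias(std::uint32_t byte_addr, unsigned bit) {
    assert(bit < 8);                                          // bit within the byte
    const std::uint32_t region = byte_addr & 0xF0000000u;     // SRAM or peripheral base
    assert(region == 0x20000000u || region == 0x40000000u);
    const std::uint32_t offset = byte_addr - region;
    assert(offset < 0x00100000u);                             // bit-band regions are 1 MB
    return region + 0x02000000u + offset * 32u + bit * 4u;
}
```

A 32-bit write of 0 or 1 to the returned address is what the hardware turns into the read-modify-write described above. For example, bit 7 of the byte at 0x20000000 maps to the alias word at 0x2200001C.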

Most peripherals, by the way, are accessed through a special bus bridge. I don't actually know if it adds wait cycles on the CPU side. I know it can delay issuing actual writes. I haven't needed to study the peripheral bridges yet, at least in much detail...

Cortex-M3 also has a feature where long instructions can be interrupted, where the partial results are abandoned, and then the entire instruction is restarted after the interrupt. I believe the load and store instructions that move several registers to/from memory are the main example. So in terms of interrupt latency, those many-cycle instructions (which themselves are subject to the variable bus timing) don't delay interrupt response, but in terms of total execution time, they can take even longer if they're interrupted with partial results discarded and then the whole thing redone after the interrupt.

All that (and probably more issues I don't personally know about) is on top of the usual variable timing for control flow changes due to the 3-stage pipeline. But other than branches, and writing directly to R15, which is the program counter and forces a branch, most instructions take 1 cycle if they involve only registers, or 1 or 2 cycles if they involve memory.

Also, regarding the 1.25 MIPS per MHz claim, I believe that's only a quirk of the Dhrystone MIPS benchmark, which assumes specific algorithms require certain numbers of instructions (which are far from optimal, being based on older CPU architectures). It does not come from 2 parallel execution units within the chip. The Cortex-M4 does have a very limited set of SIMD instructions, mainly for manipulating 16-bit integers, but that feature is not present in the Cortex-M3.

Did you ever get your answer? Seems like if different instructions take different amounts of cycles, there should be a chart of instructions vs cycles needed...

I'm new to Arduino, and I have the same question.
I want to make short pulses using some Arduino thingy and need to know how fast I can turn on/off a port to do that. Guess it might be quicker to just write the code and measure it.

Will let you know when I do the work.

You can find it on the ARM website, AFAIK. The maximum you can do is 21MHz. You need to use DMA, PWM, or inline assembly to get there.
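One way to see where a figure like 21MHz could come from, as a back-of-the-envelope sketch: assume each pin write is a store that costs 2 cycles and that writes are issued back to back with no loop overhead. Both are assumptions, and the function names are invented for this example.

```cpp
#include <cassert>

// One full square-wave period needs two writes (set, then clear).
// clock_mhz: CPU clock; cycles_per_write: assumed cost of one store to the port.
inline double max_square_wave_mhz(double clock_mhz, unsigned cycles_per_write) {
    assert(clock_mhz > 0.0 && cycles_per_write > 0);
    return clock_mhz / (2.0 * cycles_per_write);
}

// Shortest pulse under the same assumptions: the time between the
// "set" write and the "clear" write, in nanoseconds.
inline double min_pulse_ns(double clock_mhz, unsigned cycles_per_write) {
    assert(clock_mhz > 0.0 && cycles_per_write > 0);
    return cycles_per_write * 1000.0 / clock_mhz;
}
```

At 84MHz with 2 cycles per write this gives 21MHz and a pulse of roughly 24ns. Any loop branch, flash wait state, or bus contention adds to that, so measuring on a scope is still the way to be sure.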