The answer is complicated.
It's not at all like AVR, where every instruction takes a fixed number of cycles, and the most complex things you need to worry about are branches that take 1 extra cycle if the branch happens, or skip instructions that take an extra cycle to skip past a 2-word instruction.
ARM is far more complex.
The CPU runs much faster than the flash memory. Between the slow flash and the fast instruction fetching sit a small cache implemented within the flash controller, and a small buffer within the CPU. Often the needed data will already be in the buffer. Most instructions are 16 bits, and the data bus between the CPU and flash is 32 bits. I believe in the SAM3X the flash bus feeding its cache is 128 bits. But the cache isn't huge, and there can be worst cases where the CPU needs to wait for the slow flash.
On top of that, the bus structure inside the chip allows for multiple bus masters. So far, it seems most of Due's code uses simple software polling where the CPU is always the bus master. But the SAM3X has 2 different types of DMA which can request bus access. The situation is far more involved than just a simple wait state inserted on a single bus. The CPU has 2 separate buses which feed into a structure of other buses to the various other areas of the chip, which lessens the odds of conflicts. But when 2 bus masters access the same internal bus, of course one must wait. I haven't studied the DMA controllers in Due in depth, but I can tell you the DMA that's in Teensy3 has many complex options, including some to trade-off performance vs hogging the bus. It's very complex.
Some memory accesses are special, sometimes requiring extra cycles. For example, writing to the bitband region is implemented by turning a single write from the CPU into an atomic read-modify-write operation. Of course that's implemented in hardware, which adds wait states on the CPU side while it performs the 2 bus operations to the memory or peripheral.
Most peripherals, by the way, are accessed through a special bus bridge. I don't actually know if it adds wait cycles on the CPU side. I know it can delay issuing actual writes. I haven't needed to study the peripheral bridges yet, at least in much detail...
Cortex-M3 also has a feature where long instructions can be interrupted: the partial results are abandoned, and the entire instruction is restarted after the interrupt. I believe the load and store instructions that move several registers to/from memory are the main example. So in terms of interrupt latency, those many-cycle instructions (which are themselves subject to the variable bus timing) don't delay interrupt response. But in terms of total execution time, they can take even longer if they're interrupted, with the partial results discarded and the whole thing redone after the interrupt.
All that (and probably more issues I don't personally know about) is on top of the usual variable timing for control flow changes due to the 3-stage pipeline. But other than branches, and writes directly to R15 (the program counter, which forces a branch), most instructions take 1 cycle if they involve only registers, or 1 or 2 cycles if they involve memory.
Also, regarding the 1.25 MIPS per MHz claim, I believe that's only a quirk of the Dhrystone MIPS benchmark, which assumes specific algorithms require specific numbers of instructions (far from optimal, based on older CPU architectures). It's not evidence of 2 parallel execution units within the chip. The Cortex-M4 does have a very limited set of SIMD instructions, mainly for manipulating 16-bit integers, but that feature is not present in Cortex-M3.