Nucleo-F746ZG Performance!

I just bought an STMicroelectronics NUCLEO-F746ZG for $23.

This is an amazing board. Here are some key features.

STM32F746ZGT6 in LQFP144 package
ARM® 32-bit Cortex®-M7 + FPU + Chrom-ART™ Accelerator
216 MHz max CPU frequency
1 MB Flash
320 KB SRAM
12-bit ADCs with 24 channels (3)
12-bit DAC channels (2)
USART/UART (8)
I2C (4)
SPI (6)
Advanced-control Timer (2)
Low-power Timer (1)
General Purpose Timers (12)
Watchdog Timers (2)
CAN 2.0B active (2)
SAI (2)
USB 2.0 OTG HS
USB 2.0 OTG FS
Ethernet 10/100Mbps
Virtual Com port
Mass storage (USB Disk drive) for drag'n'drop programming
Debug port

All CPU pins of the LQFP144 are available on the board.

I ran benchmarks with the ChibiOS/RT RTOS at 216 MHz CPU speed.

This Cortex-M7 board can do 4,909,064 context switches per second. That's about 0.2 µs for a context switch!

This is 3.5 times faster than a 100 MHz Cortex-M4 NUCLEO-F411RE.

I am using ChibiStudio for development. It's a great Eclipse based IDE.

This is 3.5 times faster than a 100 MHz Cortex-M4 NUCLEO-F411RE.

OK. How is that possible? Aren't they both 1 clock for RAM access?

Cortex-M7 can "execute two instructions per cycle". I am still looking at the architecture, but this is what I read in some popular articles.

"The Cortex-M7 has a superscalar pipeline which can execute two instructions simultaneously," an ARM source told us.

"The Cortex-M4 can execute just one instruction at one time. This is where most of the speed-up comes from."

Here is another.

The ARMv7E-M six-stage pipeline architecture increases throughput compared to the Cortex-M4, which uses the three-stage pipeline ARMv7E-M architecture with the Thumb/Thumb-2 instruction set. The Cortex-M7’s superscalar pipeline allows the processor to execute more instructions/second.

The six-stage pipeline delivers a performance of 2.14 DMIPS/MHz, improving on the Cortex-M4’s 1.25 DMIPS/MHz, to fulfil the capability requirements that are normally only seen at the high end of the market. The increase in instructions per cycle has led to a modest improvement in clock rate, says York, for twice the number of instructions per cycle.

Cortex-M7 is designed to run at up to 800 MHz on a 28 nm process. My old cheap STM32F746ZG runs at 216 MHz, so much more speed is possible. 300 MHz Cortex-M7 chips are available now.

Here is a bit from the ST datasheet for my chip.

The Cortex-M7 processor is a highly efficient high-performance processor featuring:
– Six-stage dual-issue pipeline
– Dynamic branch prediction
– Harvard caches (4 Kbytes of I-cache and 4 Kbytes of D-cache)
– 64-bit AXI4 interface
– 64-bit ITCM interface
– 2x32-bit DTCM interfaces

Huh. I still would have expected context switch time to be limited by memory bandwidth (which I assume hasn't changed much?)

There are lots of instructions executed in a context switch, so the 64-bit instruction bus helps, but a big part of the improvement may be the improved data paths and D-cache in Cortex-M7.

I think the task stacks are in DTCM (Data Tightly Coupled Memory).

Two 32-bit TCM interfaces that are optimized for various types of data accesses. It
supports two simultaneous word data transfers at the same time, which enables the
Cortex-M7 processor to achieve excellent performance.

The DTCM and ITCM RAMs can be accessed as bytes, half-words (16 bits), full
words (32 bits) or double words (64 bits). These memories can be addressed at maximum
system clock frequency without wait states.

Edit:
I looked at the ld script and the process stacks are in the 64 KB DTCM.

The main system stack is also in DTCM.

/* RAM region to be used for Main stack. This stack accommodates the processing
   of all exceptions and interrupts*/