Nucleo-F746ZG Performance!

I just bought a STMicroelectronics NUCLEO-F746ZG for $23.

This is an amazing board. Here are some key features.

– STM32F746ZGT6 in LQFP144 package
– ARM® 32-bit Cortex®-M7 + FPU + Chrom-ART™ Accelerator
– 216 MHz max CPU frequency
– 1 MB Flash
– 320 KB SRAM
– 12-bit ADCs with 24 channels (3)
– 12-bit DAC channels (2)
– USART/UART (8)
– I2C (4)
– SPI (6)
– Advanced-control Timer (2)
– Low-power Timer (1)
– General Purpose Timers (12)
– Watchdog Timers (2)
– CAN 2.0B active (2)
– SAI (2)
– USB 2.0 OTG HS
– USB 2.0 OTG FS
– Ethernet 10/100 Mbps
– Virtual Com port
– Mass storage (USB Disk drive) for drag'n'drop programming
– Debug port

All CPU pins of the LQFP144 are available on the board.

I ran benchmarks with the ChibiOS/RT RTOS at 216 MHz CPU speed.

This Cortex-M7 board can do 4,909,064 context switches per second. That's about 0.2 µs for a context switch!
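For anyone who wants to redo the arithmetic, here is a minimal sketch in C. The function names are my own (hypothetical, not from ChibiOS); the inputs are just the measured rate and the CPU clock quoted above.

```c
#include <assert.h>

/* Time for one context switch in nanoseconds, from a measured rate. */
double switch_time_ns(double switches_per_second)
{
    assert(switches_per_second > 0.0);
    return 1e9 / switches_per_second;
}

/* CPU clock cycles spent per context switch at a given core clock. */
double cycles_per_switch(double cpu_hz, double switches_per_second)
{
    assert(switches_per_second > 0.0);
    return cpu_hz / switches_per_second;
}
```

With 4,909,064 switches per second at 216 MHz, this works out to roughly 204 ns and roughly 44 CPU cycles per context switch.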

This is 3.5 times faster than a 100 MHz Cortex-M4 NUCLEO-F411RE.

I am using ChibiStudio for development. It's a great Eclipse-based IDE.

This is 3.5 times faster than a 100 MHz Cortex-M4 NUCLEO-F411RE.

OK. How is that possible? Don't they both take one clock cycle for a RAM access?

Cortex-M7 can "execute two instructions per cycle". I am still looking at the architecture but this is what I read in some popular articles.

"The Cortex-M7 has a superscalar pipeline which can execute two instructions simultaneously," an ARM source told us.

"The Cortex-M4 can execute just one instruction at one time. This is where most of the speed-up comes from."

Here is another.

The ARMv7E-M six-stage pipeline architecture increases throughput compared to the Cortex-M4, which uses the three-stage pipeline ARMv7E-M architecture with the Thumb/Thumb-2 instruction set. The Cortex-M7’s superscalar pipeline allows the processor to execute more instructions per second.

The six-stage pipeline delivers a performance of 2.14 DMIPS/MHz, improving on the Cortex-M4’s 1.25 DMIPS/MHz, to fulfil the capability requirements that are normally only seen at the high end of the market. The increase in instructions per cycle has led to a modest improvement in clock rate, says York, for twice the number of instructions per cycle.
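As a rough sanity check, the DMIPS/MHz figures quoted above can be combined with the two clock rates to estimate the expected speed-up. This is only a sketch: it assumes context-switch throughput scales like Dhrystone throughput, which is my assumption and not something the article claims, and the function name is hypothetical.

```c
#include <assert.h>

/* Estimated speed ratio of CPU A over CPU B from DMIPS/MHz and clock (MHz). */
double dmips_speedup(double dmips_per_mhz_a, double mhz_a,
                     double dmips_per_mhz_b, double mhz_b)
{
    assert(dmips_per_mhz_b > 0.0 && mhz_b > 0.0);
    return (dmips_per_mhz_a * mhz_a) / (dmips_per_mhz_b * mhz_b);
}
```

For the Cortex-M7 at 216 MHz against the Cortex-M4 at 100 MHz, dmips_speedup(2.14, 216.0, 1.25, 100.0) gives about 3.7, which is in the same range as the 3.5 times measured for context switches.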

Cortex-M7 is designed to run at up to 800 MHz with 28 nm features. My old cheap STM32F746ZG runs at 216 MHz, so much more speed is possible. 300 MHz Cortex-M7 chips are available now.

Here is a bit from the ST datasheet for my chip.

The Cortex-M7 processor is a highly efficient high-performance processor featuring:
– Six-stage dual-issue pipeline
– Dynamic branch prediction
– Harvard caches (4 Kbytes of I-cache and 4 Kbytes of D-cache)
– 64-bit AXI4 interface
– 64-bit ITCM interface
– 2x32-bit DTCM interfaces

Huh. I still would have expected context switch time to be limited by memory bandwidth (which I assume hasn't changed much?)

There are lots of instructions executed in a context switch, so the 64-bit instruction bus helps, but a big part of the improvement may be the improved data paths and D-cache in Cortex-M7.

I think the task stacks are in DTCM (Data Tightly Coupled Memory).

Two 32-bit TCM interfaces that are optimized for various types of data accesses. It
supports two simultaneous word data transfers at the same time, which enables the
Cortex-M7 processor to achieve excellent performance.

The DTCM and ITCM RAMs can be accessed as bytes, half-words (16 bits), full
words (32 bits) or double words (64 bits). These memories can be addressed at maximum
system clock frequency without wait states.

Edit:
I looked at the ld script and the process stacks are in the 64 KB DTCM.

The main system stack is also in DTCM.

/* RAM region to be used for Main stack. This stack accommodates the processing
   of all exceptions and interrupts*/
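
One way to double-check where a stack really ends up is to compare an address against the DTCM range. The sketch below assumes the STM32F746 memory map places the 64 KB DTCM at 0x20000000 (worth confirming in the ST reference manual); the macro and function names are hypothetical and not part of ChibiOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed DTCM placement on the STM32F746: 64 KB starting at 0x20000000. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (64u * 1024u)

/* True if addr lies inside the assumed DTCM range. */
bool addr_in_dtcm(uintptr_t addr)
{
    return addr >= DTCM_BASE && addr < (uintptr_t)DTCM_BASE + DTCM_SIZE;
}

/* Offset of addr from the start of DTCM; addr must lie inside DTCM. */
uint32_t dtcm_offset(uintptr_t addr)
{
    assert(addr_in_dtcm(addr));
    return (uint32_t)(addr - DTCM_BASE);
}
```

On the target, passing the address of a local variable from inside a thread, for example addr_in_dtcm((uintptr_t)&local), should show whether that thread's stack is in DTCM.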