Really...Really Fast

Just thought I would note something really amazing about the GIGA R1. I took my FFT library and did a quick measurement of how fast it can do a forward and inverse 1024-point 32-bit FFT. I'm seeing 0.7 ms. This is about 20 to 25 times faster than the RP2040. I know the processor clock is roughly 3 times higher, but I was not expecting this level of speedup.

I did verify that the FFT is producing correct results with expected accuracy.

It's only one test... but wow. That is fast.

With that kind of speed and the amazing sample rates of the ADC and DAC you can make a very high-quality audio processor.

Interesting, can you share the code to do the test?

Sure, but you'll need my library as well to run it. I created a test pattern of 1024 samples covering 16 periods of a sine wave, with 12-bit values. Then I define a 1024-point FFT pipeline. I pass the test pattern into the forward FFT, then take those results and pass them back through an IFFT. The results are correct and I've checked the intermediate values as well, so I know it's actually doing the computation.
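For anyone who would rather generate the test pattern than paste a 1024-entry table, here is a minimal sketch in plain C++. The offset of 2640 and the amplitude of 200 are assumptions read off the posted table, not values the author stated, and `make_test_pattern` is a hypothetical helper name. Rounding down with `floor` matched the table entries I spot-checked.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build `samples` points of a sine wave containing `periods` full cycles,
// centred on `offset` with peak amplitude `amplitude`.
// ASSUMPTION: the defaults 2640 and 200 are inferred from the posted table.
std::vector<uint16_t> make_test_pattern(std::size_t samples = 1024,
                                        std::size_t periods = 16,
                                        double offset = 2640.0,
                                        double amplitude = 200.0) {
    const double pi = std::acos(-1.0);
    const std::size_t period_len = samples / periods;  // 64 samples per cycle by default
    std::vector<uint16_t> pattern(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        // Reduce the index to one period so every cycle is bit-identical.
        const double phase = 2.0 * pi * static_cast<double>(i % period_len) /
                             static_cast<double>(period_len);
        // Round down, which is what the posted values appear to use.
        pattern[i] = static_cast<uint16_t>(
            std::floor(offset + amplitude * std::sin(phase)));
    }
    return pattern;
}
```

On the board, the same loop could fill `in_buffer_re` directly in `setup()` instead of copying from `testpattern_data`.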

I assert an output pin when starting the FFT and deassert it when the IFFT completes. I then look at this pin with a logic analyzer sampling at 400 MHz. The time between the pin going high and going low is about 770 µs. I have a capture of the waveform, but no way to post it here.

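#define FIRMWARE_VERSION "0.0.0"

#include <fft.h>  // the poster's own, unpublished FFT library

int debug_output = 3;  // pin toggled around the FFT/IFFT pair for the logic analyzer

FFT<1024,false,true> fft; //1024-point FFT with split-radix processing, no mirroring

int32_t in_buffer_re[1024];  // real part, filled from testpattern_data in setup()
int32_t in_buffer_im[1024];  // imaginary part, zeroed in setup()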


uint16_t testpattern_data[]= {
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620,
2640,2659,2679,2698,2716,2734,2751,2766,2781,2794,2806,2816,2824,2831,2836,2839,
2840,2839,2836,2831,2824,2816,2806,2794,2781,2766,2751,2734,2716,2698,2679,2659,
2640,2620,2600,2581,2563,2545,2528,2513,2498,2485,2473,2463,2455,2448,2443,2440,
2440,2440,2443,2448,2455,2463,2473,2485,2498,2513,2528,2545,2563,2581,2600,2620};



//End globals



void setup() 
{ 
  Serial.begin(9600);
  delay(1000);
  pinMode(debug_output,OUTPUT);
  digitalWrite(debug_output,0);
  for(uint16_t i=0;i<1024;i++)
  {
    in_buffer_re[i]=testpattern_data[i];
    in_buffer_im[i]=0;
  }
  
}


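void loop() {
  digitalWrite(debug_output, HIGH);      // pin high: start of the timed region
  fft.fft(in_buffer_re, in_buffer_im);   // forward FFT on the test pattern
  fft.ifft(in_buffer_re, in_buffer_im);  // inverse FFT on the forward results (same buffers)
  digitalWrite(debug_output, LOW);       // pin low: end of the timed region
}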

Your observations regarding the speed are spot on! I ran a simple Sieve of Eratosthenes and a Mandelbrot set, for both integer and double, on the M7 and M4 cores. When the results are shown on a bar chart next to other popular Arduino boards, the bars for the GIGA are barely visible, while the UNO R3 naturally takes up the full length of the axis.

Regards.

I have not looked at floating point (double). I am hoping those results are also very good. Right now I am only using 32-bit integers for the FFT, but doubles would eliminate some error from the result if it's fast enough.

I'm discovering (after one year) this very nice post.

What FFT library are you using? CMSIS-DSP (CMSIS DSP Software Library) or something else?
If something else, should I expect the same level of performance?
How do you compile the code? Is it plain code generation, or some version of Thumb?
Also, the 0.7ms are for the round-trip FFT/IFFT?

I'm considering using the Arduino GIGA in a sound processing project, and I'm trying to make sure the computing power is adequate before I start programming.

It's my own custom library. I don't want to share it because it's never been validated. Sorry. But I can tell you I did nothing special. It's a basic decimation-in-frequency FFT. Since I was inputting only real numbers, I used a folding trick to make it faster.
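Since the library is not published, the exact folding trick is unknown. One common way to exploit real-only input is to pack the N real samples into N/2 complex values, run a half-size complex FFT, and untangle the result. The sketch below shows that technique in plain C++ with `double` for clarity (the author's code uses 32-bit integers); `fft` and `real_fft` are illustrative names, not the author's API.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
const double kPi = std::acos(-1.0);

// Textbook recursive radix-2 complex FFT; the length must be a power of two.
std::vector<cplx> fft(const std::vector<cplx>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) return a;
    std::vector<cplx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    const std::vector<cplx> e = fft(even), o = fft(odd);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const cplx t = std::polar(1.0, -2.0 * kPi * k / n) * o[k];
        out[k] = e[k] + t;
        out[k + n / 2] = e[k] - t;
    }
    return out;
}

// FFT of N real samples using one complex FFT of length N/2.
// Returns bins 0..N/2; the remaining bins are complex conjugates of these.
std::vector<cplx> real_fft(const std::vector<double>& x) {
    const std::size_t n = x.size(), h = n / 2;
    assert(n >= 2);
    // Fold: even-indexed samples become real parts, odd-indexed imaginary parts.
    std::vector<cplx> z(h);
    for (std::size_t i = 0; i < h; ++i) z[i] = cplx(x[2 * i], x[2 * i + 1]);
    const std::vector<cplx> Z = fft(z);

    std::vector<cplx> X(h + 1);
    for (std::size_t k = 0; k < h; ++k) {
        const cplx zc = std::conj(Z[(h - k) % h]);
        const cplx even = 0.5 * (Z[k] + zc);               // FFT of even samples
        const cplx odd = cplx(0.0, -0.5) * (Z[k] - zc);    // FFT of odd samples
        X[k] = even + std::polar(1.0, -2.0 * kPi * k / n) * odd;
    }
    X[h] = cplx(Z[0].real() - Z[0].imag(), 0.0);           // Nyquist bin
    return X;
}
```

The half-size transform does roughly half the butterflies of a full complex FFT on zero-padded imaginary input, at the cost of one extra pass of N/2 complex multiplies to untangle the bins.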

Thanks for the reply. I spent a day porting the FFT from the ESP-DSP library and tried it on the ESP32. I'm impressed by the radix-4 FFT performance. Even without the assembly routines it takes 0.7 s to execute 1000 instances of a 1024-point FFT in Float32, i.e. about 0.7 ms each. Enough for my first target application (a voice coder). :)

I know! This device has legitimate integer and floating-point units. Transcendentals are still fairly slow, but you don't need those for real-time FFT.

Interesting. I have two questions if you have time:

  • Should I expect significant gains from using assembly? Or is simple tiling to exploit the cache enough to get the best performance?
  • Does the newer ESP32-S3 provide better compute performance? I/O-wise it seems less powerful than the traditional ESP32.

Modern C++ compilers are pretty good at optimizing. I doubt assembly would help much. Tiling will definitely help. Also, as a thought, utilizing both cores might help. The one core is 240MHz and the other is 120MHz (?). But if you can offload work to the slower core, it's still somewhat better. The problem with FFT is the number of partial terms you have to share, so you'll need a shared memory scheme. It's something I've thought about but never spent much time on.
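One way to limit how many partial terms have to be shared: in a decimation-in-frequency FFT, only the first stage of butterflies touches the whole input. After that stage, the even-numbered and odd-numbered output bins come from two independent half-size transforms, which could in principle each run on its own core. The plain C++ sketch below shows just that split, single-threaded, with a naive DFT standing in for the half-size transforms. All names are illustrative, and how the two cores would actually see the same buffers is left open, as in the discussion above.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Reference O(N^2) DFT, standing in for whatever transform each core would run.
std::vector<cplx> naive_dft(const std::vector<cplx>& a) {
    const double pi = std::acos(-1.0);
    const std::size_t n = a.size();
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += a[j] * std::polar(1.0, -2.0 * pi * static_cast<double>(k * j) /
                                                 static_cast<double>(n));
    return out;
}

struct SplitInput {
    std::vector<cplx> for_even_bins;  // feeds X[0], X[2], X[4], ...
    std::vector<cplx> for_odd_bins;   // feeds X[1], X[3], X[5], ...
};

// One decimation-in-frequency stage: N/2 butterflies over the full input.
// This is the only step that needs both halves of the data at once.
SplitInput dif_first_stage(const std::vector<cplx>& a) {
    const double pi = std::acos(-1.0);
    const std::size_t n = a.size(), h = n / 2;
    SplitInput s{std::vector<cplx>(h), std::vector<cplx>(h)};
    for (std::size_t j = 0; j < h; ++j) {
        s.for_even_bins[j] = a[j] + a[j + h];
        s.for_odd_bins[j] = (a[j] - a[j + h]) *
                            std::polar(1.0, -2.0 * pi * static_cast<double>(j) /
                                                static_cast<double>(n));
    }
    return s;
}

// Full transform built from the split; the two half-size calls are independent.
std::vector<cplx> dft_via_split(const std::vector<cplx>& a) {
    const SplitInput s = dif_first_stage(a);
    const std::vector<cplx> even_bins = naive_dft(s.for_even_bins);  // could run on one core
    const std::vector<cplx> odd_bins = naive_dft(s.for_odd_bins);    // could run on the other
    std::vector<cplx> out(a.size());
    for (std::size_t k = 0; k < even_bins.size(); ++k) {
        out[2 * k] = even_bins[k];
        out[2 * k + 1] = odd_bins[k];
    }
    return out;
}
```

With cores of unequal speed an even split is not ideal, since the faster core would sit idle waiting for the slower one; the split only shows where the data dependency ends.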

I don't know much about the ESP32-S3, so I can't help you there.

Well, in my experience, on modern cores (x86-64, AARCH64), modern C++ compilers are quite pitiful, reaching only 5-7% of theoretical peak CPU performance for classical high-performance applications (like matrix multiplications, convolutions, and I assume it's similar for the FFT). You need specialized libraries or ML compilers to approach theoretical peak performance. Tiling is clearly necessary, and blocking to use the caches.

My first idea was also to use the second core, but the CPU->memory interconnect is very peculiar. I'm not even sure that variables can be shared between cores. Have you tried it?

I've been told the memory can be shared, but never looked into it. However, I've played around with this library from the Pico SDK (#include <pico/multicore.h>). It allows you to set up a second core as a separate loop(). I imagine this would allow sharing of the entire memory space. I've pasted some code below. There is also an RPC.h library, which provides a way to communicate between two completely independent programs running on different cores. I imagine this is pretty inefficient for what you need.

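#include <pico/multicore.h>

// Runs on the second core: turns the LED off every 600 ms.
void loop1() {
  while (true) {
    digitalWrite(LED_BUILTIN, LOW);
    sleep_ms(600);
  }
}

void setup() {
  Serial.begin(115200);
  pinMode(LED_BUILTIN, OUTPUT);

  sleep_ms(5000);
  multicore_launch_core1(loop1);  // start loop1() on the second core
  digitalWrite(LED_BUILTIN, HIGH);
  sleep_ms(500);
}

// Runs on the first core: turns the LED on every 500 ms.
// loop() is already called repeatedly, so no inner while loop is needed.
void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  sleep_ms(500);
}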