SPI SS to SCLK timing

Using the following sketch (on an ESP32-WROVER) I'm hoping to communicate with an SPI slave:

#include<SPI.h>

int SS1 = 13;

void setup() {
 Serial.begin(115200); 

 pinMode(SS1, OUTPUT);
 digitalWrite (SS1, HIGH);
 pinMode(SS, OUTPUT);
 digitalWrite (SS, HIGH);
 
 SPI.begin(18, 19, 23, 5); // sck, miso, mosi, ss
 
 int SPI_Clock_MHz = 40;
 SPI.beginTransaction(SPISettings(SPI_Clock_MHz*1000000, MSBFIRST, SPI_MODE0));

 Serial.print("SPI clock (MHz): ");
 Serial.println(SPI_Clock_MHz);
 Serial.println();
  
 Serial.print("MOSI: ");
 Serial.println(MOSI);
 Serial.print("MISO: ");
 Serial.println(MISO);
 Serial.print("SCK: ");
 Serial.println(SCK);
 Serial.print("SS: ");
 Serial.println(SS);
 Serial.print("SS1: ");
 Serial.println(SS1);
 
}

void loop() {
  uint8_t spiBuffer[] = {0x55};//, 0xAA, 0x77};
  const uint32_t numBytes = sizeof(spiBuffer)/sizeof(spiBuffer[0]);
  digitalWrite(SS, LOW);
  SPI.transfer(spiBuffer,numBytes);
  digitalWrite(SS, HIGH);
  delay(100);
}

When I measure the time between the SS transitions and SCLK activity, I see about 750ns from SS going LOW to the first SCLK edge, and about 500ns from the last SCLK edge to SS going HIGH.
Together this adds about 1.25us of overhead per transaction, which dramatically reduces the number of SPI transactions I can make per second.
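To put the measured figures in perspective, here is a rough back-of-the-envelope calculation as plain C++ (it runs on a PC, not on the ESP32). The function names are made up for illustration, and it assumes the only costs are the clocked bits plus the two measured gaps; the delay(100) in loop() and any time between transactions are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Time spent clocking the bits themselves, in nanoseconds.
// Hypothetical helper: 8 clock cycles per byte, no inter-byte gaps assumed.
inline double transferTimeNs(uint32_t numBytes, double clockHz) {
    assert(clockHz > 0.0);
    return numBytes * 8.0 * 1e9 / clockHz;
}

// Time SS is held LOW for one transaction: lead-in gap + clocked bits + trailing gap.
inline double transactionTimeNs(uint32_t numBytes, double clockHz,
                                double leadNs, double trailNs) {
    return leadNs + transferTimeNs(numBytes, clockHz) + trailNs;
}

// Upper bound on back-to-back transactions per second.
inline double maxTransactionsPerSecond(uint32_t numBytes, double clockHz,
                                       double leadNs, double trailNs) {
    return 1e9 / transactionTimeNs(numBytes, clockHz, leadNs, trailNs);
}
```

With the values from the sketch, transferTimeNs(1, 40e6) is 200ns for the single byte, while transactionTimeNs(1, 40e6, 750, 500) is 1450ns. So the gaps account for most of the time SS is low, and the upper bound drops from about 5 million to roughly 690 thousand single-byte transactions per second.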

Is there a way to reduce this overhead time?

You might save some time by not using digitalWrite() and directly manipulating the port registers instead.

I tried replacing digitalWrite() with direct register writes, using GPIO.out_w1ts/GPIO.out_w1tc and also REG_WRITE(GPIO_OUT_W1TS_REG, BIT13)/REG_WRITE(GPIO_OUT_W1TC_REG, BIT13), but the time before and after SCLK activity remains the same.
It looks like the delay comes from the SPI.transfer() function itself.

I may be wrong here, but the SPI library may be interrupt driven and use buffers. I imagine that the gap you see between SS going low and data actually appearing on MOSI is a result of the SPI.transfer() function copying the data into the Tx buffer first.

You could write your own SPI routine that accesses the registers directly, as SPI is fairly easy to do, and see if that gets rid of enough of the overhead.

I found the "ESP32DMASPI" library.
With this library, the delay between the SS transition and MOSI/SCLK activity is minimal.