I can speak about the "real" SD card interface, which exists on Teensy 3.5 & 3.6. I'm the guy who makes Teensy...
The most important point is that you need to transfer large blocks (e.g., 4096 bytes) and implement a caching layer with a substantial amount of RAM (for a microcontroller) to really reap the benefit of the native SD interface.
On Teensy, we ship a modified SD library where SD.begin(BUILTIN_SDCARD) tells the library to access the card using the native 4 bit SD interface, rather than using SPI with a chip select pin. It defaults to 4 bits at 24 MHz, and if you edit the code you can enable 4 bit 48 MHz mode. On Teensy, SPI to the SD card uses 24 MHz, but obviously only 1 bit instead of 4.
Using only the SD library, 4 bit SD is faster than SPI, but not anywhere near 4X faster, even if you run it at 48 MHz. Usually the speedup is around 50% to 75%, a far cry from 4X. The problem is that these cards have substantial command latency, and tremendous internal overhead if you write less than the native block size of the card (which can be huge on modern cards). When you have a system like Linux running on a PC or Raspberry Pi with a gigabyte of RAM, dedicating many megabytes of RAM to reading and writing large blocks is "easy". But on a microcontroller with only 8K to 512K of RAM, allocating a big table of 4K blocks and merging writes up to the card's 32K or 64K native block size quickly burns up too much memory.
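To see why bus width alone doesn't buy you 4X, here is a rough back-of-the-envelope model in plain C++. It assumes every transfer pays a fixed per-command latency on top of the raw bus time. The function names are made up for illustration, and any latency you plug in is a placeholder of your own, not a measured value for any real card.

```cpp
#include <cassert>
#include <cmath>

// Raw bus rate in bytes per second: one bit per data line per clock.
// Ignores start/stop bits, CRC and command traffic, so it is an upper bound.
constexpr double rawBytesPerSec(int busBits, double clockHz) {
    return busBits * clockHz / 8.0;
}

// Time for one transfer of `bytes`, paying a fixed latency per command.
// `latencySec` is a hypothetical placeholder, NOT a measured card figure.
constexpr double transferSec(double bytes, int busBits, double clockHz,
                             double latencySec) {
    return latencySec + bytes / rawBytesPerSec(busBits, clockHz);
}

// How many times faster a 4 bit bus is than a 1 bit bus at the same clock,
// for a given transfer size and per-command latency.
constexpr double speedup(double bytes, double clockHz, double latencySec) {
    return transferSec(bytes, 1, clockHz, latencySec) /
           transferSec(bytes, 4, clockHz, latencySec);
}
```

With an assumed 200 microsecond latency at 24 MHz, `speedup(512, 24e6, 200e-6)` comes out at roughly 1.5X, while `speedup(32768, 24e6, 200e-6)` is about 3.8X. The latency term is the same for both buses, so only big transfers let the wider bus show.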
Bill Greiman has done some excellent work to support these fast 4 bit native SD features in his SdFat library. But it's not simple to use. If you want to get a taste of the incredible performance, Bill's library is free and it runs very well on Teensy 3.5 & 3.6. Last time I looked, it didn't have a caching layer that handles everything automatically, as you get with Linux. It requires a lot of work to manage huge buffers and interact with the card in ways that involve a lot more thought than simply writing data to the card (as you do with the regular SD library, and with userspace programs on Linux).
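As a minimal sketch of the buffer management involved, here is a write-coalescing helper in plain C++. `CoalescingWriter` and its `Sink` callback are hypothetical names, not part of SdFat or the SD library. The sink stands in for whatever call actually writes a large block to the card. It only merges sequential writes into one buffer, which is far short of a real caching layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: collects small writes and passes them to `sink`
// only in full `blockSize` chunks, plus one partial chunk on flush().
class CoalescingWriter {
public:
    using Sink = std::function<void(const uint8_t* data, size_t len)>;

    // Assumes blockSize > 0. The buffer costs blockSize bytes of RAM.
    CoalescingWriter(size_t blockSize, Sink sink)
        : buf_(blockSize), used_(0), sink_(std::move(sink)) {}

    // Copy the caller's data into the buffer, emitting each time it fills.
    void write(const uint8_t* data, size_t len) {
        while (len > 0) {
            size_t n = std::min(len, buf_.size() - used_);
            std::memcpy(buf_.data() + used_, data, n);
            used_ += n;
            data += n;
            len -= n;
            if (used_ == buf_.size()) {
                sink_(buf_.data(), used_);
                used_ = 0;
            }
        }
    }

    // Push out whatever is left, e.g. before closing the file.
    void flush() {
        if (used_ > 0) {
            sink_(buf_.data(), used_);
            used_ = 0;
        }
    }

private:
    std::vector<uint8_t> buf_;
    size_t used_;
    Sink sink_;
};
```

Feeding this one hundred 100-byte records with a 4096-byte block size makes the sink see two full 4096-byte writes and, after `flush()`, one final 1808-byte write, instead of one hundred small ones. The price is the RAM for the buffer, which is exactly the tradeoff that hurts on a small microcontroller.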
It's easy to get caught up in talking about nondisclosure agreements and the SD Association's exorbitant membership fees. But the truth is the "simplified" specs have nearly all the needed info, at least up to version 2 (which allows 4 bits at 50 MHz). Bill's SdFat library and my hacks to the SD library (which are partly based on some sample code NXP published) should be proof enough that using 4 bit mode really is feasible. I didn't sign any NDA or pay anything to the SD Association, and I'm pretty sure Bill didn't either.
Now SD versions 3 & 4 are another matter. Info about how to really use 4 bit DDR mode at 100+ MHz and switch from 3.3V to 1.8V might be "out there", but I haven't been able to find it. Today very few microcontrollers with native 4 bit SD support anything beyond version 2. Perhaps in the future we'll have microcontrollers with GHz clock speeds and SD ports capable of these speeds? But when/if such hardware becomes available, the main limitation will be the software side. The small-memory, no-caching-layer, single-sector-at-a-time approach of the SD library is already inadequate for the 4 bit SD ports microcontrollers have today.