Using interrupts, it's quite straightforward to reach that precision. On the ESP8266 I've reached better than 10 ns timing precision using oversampling (on 4-10 ms signals, with a 12.5 ns clock period, i.e. 80 MHz; I haven't tried running the processor at 160 MHz, as this is good enough for me).
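To illustrate why oversampling can beat the 12.5 ns tick, here is a rough simulation rather than actual ESP8266 code. It assumes the signal edges are not synchronised to the CPU clock, so each measurement starts at a random phase within a tick, and it ignores interrupt latency jitter. The names `measure_once` and `oversampled_estimate` are made up for this sketch.

```python
import math
import random

TICK_NS = 12.5  # one CPU cycle at 80 MHz


def measure_once(true_ns, rng):
    """One tick-quantised measurement of an interval of true_ns nanoseconds.

    The interval starts at a random phase within a clock tick (assumption:
    the signal is asynchronous to the CPU clock).
    """
    start = rng.uniform(0.0, TICK_NS)
    ticks = math.floor((start + true_ns) / TICK_NS) - math.floor(start / TICK_NS)
    return ticks * TICK_NS


def oversampled_estimate(true_ns, n, seed=0):
    """Average n quantised measurements of the same interval."""
    rng = random.Random(seed)
    return sum(measure_once(true_ns, rng) for _ in range(n)) / n


if __name__ == "__main__":
    true_ns = 5_000_003.7  # a ~5 ms interval, not a whole number of ticks
    single = measure_once(true_ns, random.Random(1))
    averaged = oversampled_estimate(true_ns, 2000)
    print("single measurement error (ns):", single - true_ns)
    print("averaged error (ns):          ", averaged - true_ns)
```

A single measurement can be off by up to one tick, while the average over many repeats converges on the true interval, because the random phase makes the quantisation error average out.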
Indeed, the big issue is in the system itself: mostly the production and detection of the sound waves, and changes to the medium (air) they travel through, which throw off your accuracy.
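To put numbers on that, here is a small sketch comparing a 10 ns timing error with a 1 °C error in the assumed air temperature. It uses the common linear approximation for the speed of sound in dry air, c ≈ 331.3 + 0.606·T m/s with T in °C. The 1 m path and the 20 °C / 21 °C figures are just example values, and the function names are made up.

```python
def speed_of_sound(temp_c):
    """Approximate speed of sound in dry air (m/s), temp_c in degrees Celsius."""
    return 331.3 + 0.606 * temp_c


def timing_error_m(timing_error_s, temp_c):
    """Path-length error caused purely by a timing error."""
    return timing_error_s * speed_of_sound(temp_c)


def temperature_error_m(path_m, assumed_c, actual_c):
    """Path-length error from assuming the wrong air temperature."""
    travel_time = path_m / speed_of_sound(actual_c)  # what is really measured
    estimated = travel_time * speed_of_sound(assumed_c)  # what gets computed
    return abs(estimated - path_m)


if __name__ == "__main__":
    print("10 ns timing error:", timing_error_m(10e-9, 20.0), "m")
    print("1 degC error over 1 m:", temperature_error_m(1.0, 20.0, 21.0), "m")
```

With these example values the timing error is worth a few micrometres of path, while the temperature error is worth more than a millimetre, so the medium dominates by several hundred times.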