You still have the problem of ambient light interference, so you need to use modulation, like IR remote controls.
With IR, a typical modulation frequency is 38 kHz, which is a period of about 26 µs. The problem now is: after how many carrier pulses will the receiver recognise that the modulation has started, and how can you be sure it didn't miss the first one? That would kill your accuracy there and then. Maybe have the transmitter send a number of pulses of a defined length so the receiver knows the message is coming, then one of a different length (say, double the length), and start the sound pulse at the end of that long pulse?
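To put rough numbers on this, here is a minimal Python sketch of the receiver-side logic. It is only an illustration under assumptions of mine: a speed of sound of 343 m/s, a 600 µs short pulse, a 25 % length tolerance, and the names `range_error_mm` and `find_sync_time` are all made up for the example, not taken from any library.

```python
CARRIER_HZ = 38_000        # typical IR remote-control carrier
SPEED_OF_SOUND = 343.0     # m/s in air at about 20 °C (assumed)


def carrier_period_us(freq_hz=CARRIER_HZ):
    """Length of one carrier cycle in microseconds."""
    return 1e6 / freq_hz


def range_error_mm(missed_cycles, freq_hz=CARRIER_HZ):
    """Distance error if detection is late by `missed_cycles` carrier cycles."""
    return missed_cycles / freq_hz * SPEED_OF_SOUND * 1000.0


def find_sync_time(pulses, short_us, tolerance=0.25, min_preamble=3):
    """Find the time reference in a list of received pulses.

    pulses: (start_us, length_us) tuples as measured by the receiver.
    Returns the end time of the first long pulse (about 2 x short_us)
    that follows at least `min_preamble` short pulses, or None.
    """
    short_seen = 0
    for start_us, length_us in pulses:
        is_short = abs(length_us - short_us) <= tolerance * short_us
        is_long = abs(length_us - 2 * short_us) <= tolerance * 2 * short_us
        if is_short:
            short_seen += 1
        elif is_long and short_seen >= min_preamble:
            return start_us + length_us  # sound pulse starts here
        else:
            short_seen = 0  # unexpected length: start over
    return None


# Four short preamble pulses, then one long pulse (hypothetical timings).
pulses = [(0, 600), (1200, 600), (2400, 600), (3600, 600), (4800, 1200)]
sync_all = find_sync_time(pulses, short_us=600)
sync_missed_first = find_sync_time(pulses[1:], short_us=600)
```

With these numbers both calls give the same reference time, so losing the first preamble pulse doesn't matter as long as enough of them remain. `range_error_mm(1)` comes out at roughly 9 mm, which is what a one-cycle detection delay would cost in a sound time-of-flight measurement. Any fixed delay the receiver adds on the trailing edge of the long pulse would still have to be measured and subtracted.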
Both LEDs and phototransistors should be fast enough for this part.