I know how to significantly reduce the number of clock cycles required for A/D conversion. I'll post a link once I get home (about 5 hrs) because I don't have it with me, although the issue that caused it might have been sorted in the 1.5.2 IDE. I don't know for certain that this is the limiting factor for sampling, but IIRC it is. I'll also run a test and see what rate I can get. I've been working to speed up my analogRead() for some FFT data capture and calcs, but I can't remember offhand all the changes I made.
I know that analogRead() can complete in 3.3 µs (the number sticks in my mind, but I can't remember the source), which theoretically works out to about 303 kHz (1 / 3.3 µs), and I've been told on good authority that it's a pretty tidy function (not much to be gained from programming it at a lower level). This is close to your result and isn't anywhere near 1 Msps. I wonder if deactivating all but one pin would help; I know it does for the Atmel chips.
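For anyone who wants to check the arithmetic, here is a rough sketch of the conversion from time per read to sample rate: 1 / 3.3 µs comes to about 303 kHz. The function names are just ones I made up for illustration, and it is plain C++ rather than a full sketch, so it only does the maths and doesn't touch the ADC.

```cpp
#include <cassert>
#include <cstdint>

// Theoretical ceiling on the sample rate (in Hz) if every read takes
// `microsPerRead` microseconds and nothing else runs between reads.
// Hypothetical helper name, not part of the Arduino API.
double maxSampleRateHz(double microsPerRead) {
    assert(microsPerRead > 0.0);
    return 1e6 / microsPerRead;
}

// Sample rate (in Hz) actually achieved by a timed burst:
// `samples` reads completed in `elapsedMicros` microseconds.
// Hypothetical helper name, not part of the Arduino API.
double measuredSampleRateHz(std::uint32_t samples, std::uint32_t elapsedMicros) {
    assert(elapsedMicros > 0);
    return static_cast<double>(samples) * 1e6 / static_cast<double>(elapsedMicros);
}
```

On the board itself, the idea would be to take micros() before and after a loop of, say, 1000 analogRead() calls and pass the count and the time difference to the second function. That measures the loop overhead as well, so the real figure will come out a bit below the theoretical one.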
As I said, I will follow up later with the documentation I found.
Regards
Andrew