Using a camera with the Arduino by itself is close to impossible; there just isn't enough on-board memory (especially for things like color processing). What cameras like the CMUcam (and probably others) do is include a more powerful on-board microcontroller with more RAM (or external RAM) to handle the processing.
With that said, there still might be some options.
You might try hacking the sensor from an optical mouse. It will be low-resolution at best (64 x 64 pixels; some sensors are higher-resolution, but you don't have much memory to play with on the Arduino anyway), black and white, and slow (1 fps or worse). You would also need to come up with your own optical system and focus it properly. If you need to detect a particular color of object, you could try putting colored gel filters in front of the sensor before scanning the scene, or light the scene with colored high-brightness LEDs...
Another option might be to hack an old QCIF web-camera and use only a portion of its pixels. You could also hack an old Game Boy Camera (if you can find one). Again, you might be stuck with black-and-white images (the Game Boy Camera is b/w only; QCIF cameras came in both b/w and color). You could try building your own camera from a linear optosensor element with custom optics, sweeping it across the image with a servo. You could even use a single phototransistor (or several) and scan them across an x/y plane using servos.
You might also want to look into extremely low-resolution vision systems for microcontrollers - basically arranging simple sensor elements (phototransistors, LDRs, etc.) in various ways, with various filters (maybe a motorised filter selector), to detect things and build up an "image" over multiple scans. It wouldn't be fast, and you would need to do some simulation and other testing on a PC with scan data first - but it can be made to work. If you look around, you'll find this is an interesting and fairly active topic of research: it uses simpler sensors (for lower cost) and less up-front processing, but needs more intelligence on the back end. The guiding question is "how does an insect see, with its simpler eyes and much smaller nervous system - 'brain', if you will?" - and the idea is to apply insights from that research.
Remember that you can also stream the data (whatever and wherever it is coming from) to an off-board EEPROM or flash memory, then slowly operate on and process it from there.
Finally - there's this option:
The owner of this company is a frequent member here - if you look around, you can find his announcement for the product. Basically, it takes a composite video input (which you can get from any number of cheap, small "pinhole" security cameras) and captures the data (black and white only - no gray scale!) into the memory the TVOut library uses. It uses a modified version of the TVOut library (by now the changes nootropicdesign made may have been incorporated into the standard version). From there, you can process the data; just know that there isn't much RAM left over, depending on the resolution you select for TVOut - but if you keep the resolution small (say 32 x 32), you can get away with it. You would still need external filters and the like to pick out specific colors.
Also note that you could conceivably use one Arduino (or ATmega328) to do the video processing (using as much memory as it needs), then have it communicate over SPI/I2C or other means (TTL serial, perhaps) with a "main" Arduino that controls your robot. It would be a more expensive solution, but it avoids the "RAM starvation" of doing everything on a single Arduino; in essence you would be re-creating something like the CMUcam (on a reduced processing and resolution scale, of course).
Good luck - hope this response at least gets you thinking, if nothing else!
