Even if I might be talking to myself, I want to share some insight into the progress made so far.
As I had already removed plenty of Arduino stuff from the sources, I moved on to real plain C coding with avr-gcc, Eclipse, the AVR plug-in for Eclipse, ... on Mac OS X. As Arduino is mostly a wrapper around avr-libc functions, I hoped for more speed improvements. And to learn something during that task. (Also, I ordered an STM32F0 Discovery board just in case I fail with the OV7670 on Arduino.) So I'm all set up for switching between platforms already.
Loading the code above into the plain C environment resulted in plenty of errors and warnings. Some things to consider:
- Arduino offers timing functions like millis(). In plain C, you have to code that yourself with a timer and an interrupt handler. Plenty of samples are available on the net, but for the ATmega328 they are rare and/or not the best solution for the problem in my eyes (using the 16-bit Timer1 is like using a whole plane to transport a single letter).
- Serial console output. You need something like HyperTerminal (goSerial works quite nicely on Mac OS X; set the line length to 120 chars in the options). Needless to say, you need to code the serial output yourself as well: there is no Serial.begin() and the like.
- String output is different. Either use a uart_puts() function or redirect stdout and use mighty (s)printf.
- Find the console port yourself and enter it into the GUI. (ls /dev/cu.* helps with that.)
- Variable types are proper C now. There is no boolean, it's bool.
- Compiler optimizations set for the project aren't necessarily applied to your own source files. I was wondering about the bad speed (>7 s) of my RAM_bank sample and finally found out that -O0 was set for main.cpp. -Os (size optimization) led to a runtime of 5.2 seconds, -O2 and -O3 both to the same 3.029 seconds.
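To illustrate the first point in the list, here is a minimal sketch of a millis() replacement that uses the 8-bit Timer0 in CTC mode instead of Timer1. It is not the code from this post. It assumes an ATmega328 running at 16 MHz (as on an Arduino Uno), and the names ctc_compare_value(), millis_tick() and millis_init() are made up for this example. The hardware parts are wrapped in #ifdef __AVR__ so the arithmetic can also be tried on a PC.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#include <avr/interrupt.h>
#include <util/atomic.h>
#endif

#ifndef F_CPU
#define F_CPU 16000000UL  /* assumption: 16 MHz clock, as on an Arduino Uno */
#endif

/* Millisecond counter, incremented once per timer tick. */
static volatile uint32_t ms_ticks;

/* Compare value for a CTC timer that should fire tick_hz times per second. */
uint32_t ctc_compare_value(uint32_t f_cpu, uint32_t prescaler, uint32_t tick_hz)
{
    return f_cpu / prescaler / tick_hz - 1UL;
}

/* Called once per millisecond (from the ISR when running on the AVR). */
void millis_tick(void)
{
    ms_ticks++;
}

/* Milliseconds since millis_init(); the 32-bit read is made atomic on the AVR. */
uint32_t millis(void)
{
    uint32_t now;
#ifdef __AVR__
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) { now = ms_ticks; }
#else
    now = ms_ticks;
#endif
    return now;
}

#ifdef __AVR__
ISR(TIMER0_COMPA_vect)
{
    millis_tick();
}

void millis_init(void)
{
    TCCR0A = (1 << WGM01);               /* CTC mode, TOP = OCR0A */
    TCCR0B = (1 << CS01) | (1 << CS00);  /* prescaler 64 */
    OCR0A  = (uint8_t)ctc_compare_value(F_CPU, 64, 1000);  /* 249 at 16 MHz */
    TIMSK0 = (1 << OCIE0A);              /* enable compare match A interrupt */
    sei();
}
#endif
```

On the AVR, millis_init() would be called once at startup; after that millis() can be used much like the Arduino function. With a prescaler of 64, a 16 MHz clock gives exactly 250 timer counts per millisecond, so the counter does not drift.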
Output of the -O2/-O3 build:
SPI Ram Tests Begin. Milliseconds: 0
Writing to Module 0: ......... Done. 128 kBytes written. ms: 287 -> 456 kByte/s.
Writing to Module 1: ......... Done. 128 kBytes written. ms: 288 -> 455 kByte/s.
Writing to Module 2: ......... Done. 128 kBytes written. ms: 287 -> 456 kByte/s.
Writing to Module 3: ......... Done. 128 kBytes written. ms: 287 -> 456 kByte/s.
Writing to Module 4: ......... Done. 128 kBytes written. ms: 287 -> 456 kByte/s.
Reading from Module 0: ......... Done. 128 kBytes read and compared. ms: 313 -> 418 kByte/s.
Reading from Module 1: ......... Done. 128 kBytes read and compared. ms: 313 -> 418 kByte/s.
Reading from Module 2: ......... Done. 128 kBytes read and compared. ms: 313 -> 418 kByte/s.
Reading from Module 3: ......... Done. 128 kBytes read and compared. ms: 313 -> 418 kByte/s.
Reading from Module 4: ......... Done. 128 kBytes read and compared. ms: 313 -> 418 kByte/s.
RAM-module 0 is OK!
RAM-module 1 is OK!
RAM-module 2 is OK!
RAM-module 3 is OK!
RAM-module 4 is OK!
Ram Tests Finished. Milliseconds: 3029
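The kByte/s figures in the log are consistent with dividing the number of bytes (128 x 1024 = 131072) by the elapsed milliseconds and dropping the fraction, which treats 1 kByte as 1000 bytes. This is inferred from the numbers only, not taken from the posted code, and the function name is hypothetical:

```c
#include <stdint.h>

/* Bytes per millisecond, truncated. Numerically this equals kByte/s
   when 1 kByte is taken as 1000 bytes. */
uint32_t throughput_kbyte_per_s(uint32_t bytes, uint32_t elapsed_ms)
{
    if (elapsed_ms == 0) {
        return 0;  /* avoid division by zero for very short runs */
    }
    return bytes / elapsed_ms;
}
```

For example, 131072 bytes in 287 ms gives 456, and 131072 bytes in 313 ms gives 418, matching the write and read lines above.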
And there is the problem: at 3.029 seconds, it's a tiny bit slower now than the optimized Arduino version (2.996 seconds). The reason for this may be my less-than-optimal serial communication, among other things.
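For reference, the serial output described in the list above (a uart_puts() function, or stdout redirected so that printf works) could look roughly like the sketch below. It is not the code from this post: it assumes an ATmega328 at 16 MHz, is written as C rather than C++, and uses simple polled (blocking) transmission. uart_ubrr(), uart_putc() and uart_init() are names chosen for this example, and the capture buffer in the #else branch is only a stand-in so the logic can be tried on a PC.

```c
#include <stdint.h>
#include <stdio.h>

#ifdef __AVR__
#include <avr/io.h>
#endif

#ifndef F_CPU
#define F_CPU 16000000UL  /* assumption: 16 MHz clock, as on an Arduino Uno */
#endif

/* UBRR value for normal-speed asynchronous mode, rounded to nearest. */
uint16_t uart_ubrr(uint32_t f_cpu, uint32_t baud)
{
    return (uint16_t)((f_cpu + 8UL * baud) / (16UL * baud) - 1UL);
}

#ifdef __AVR__
/* Blocking transmit of one byte on USART0. */
void uart_putc(char c)
{
    while (!(UCSR0A & (1 << UDRE0))) { }  /* wait for empty data register */
    UDR0 = (uint8_t)c;
}
#else
/* Host stand-in (assumption for testing only): collect bytes in a buffer. */
char uart_capture[64];
static size_t uart_capture_len;

void uart_putc(char c)
{
    if (uart_capture_len + 1 < sizeof uart_capture) {
        uart_capture[uart_capture_len++] = c;
        uart_capture[uart_capture_len] = '\0';
    }
}
#endif

/* Send a zero-terminated string byte by byte. */
void uart_puts(const char *s)
{
    while (*s) {
        uart_putc(*s++);
    }
}

#ifdef __AVR__
/* Adapter with the signature avr-libc expects for a stream "put" function. */
static int uart_putchar(char c, FILE *stream)
{
    (void)stream;
    uart_putc(c);
    return 0;
}

static FILE uart_stdout = FDEV_SETUP_STREAM(uart_putchar, NULL, _FDEV_SETUP_WRITE);

void uart_init(uint32_t baud)
{
    uint16_t ubrr = uart_ubrr(F_CPU, baud);
    UBRR0H = (uint8_t)(ubrr >> 8);
    UBRR0L = (uint8_t)(ubrr & 0xFF);
    UCSR0B = (1 << TXEN0);                   /* transmitter only */
    UCSR0C = (1 << UCSZ01) | (1 << UCSZ00);  /* 8 data bits, no parity, 1 stop bit */
    stdout = &uart_stdout;                   /* printf() now goes to the UART */
}
#endif
```

After uart_init(9600), both uart_puts("text") and printf("ms: %lu\n", ...) would write to the serial port. Because uart_putc() blocks until each byte has left the data register, printing inside a timed section adds to the measured runtime; an interrupt-driven transmit buffer would be one way to reduce that.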
Anyhow, this might be useful for others digging into the same stuff, so I'm posting the code here.