OV7670 with FIFO, data transmission much slower than RCK clock

Hi, I'm working on an OV7670 with FIFO and have run into a problem. I have already got the OV7670 without FIFO working, and I have been trying to convert that code so that it works with the FIFO version too. I'll do my best to give you enough information. You might be able to help even if you have never used the OV7670.

Board: Arduino Mega2560
OV7670 internal clock: 12 MHz
RCK (read clock): 1 MHz
WCK (write clock): 12 MHz
Baud rate: 1 Mbit/s
Format: YUV
Resolution: QVGA
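
For reference, here is a rough sketch of how much data these settings imply per frame. It assumes the YUV output is 4:2:2 at 2 bytes per pixel and that the serial link uses 8N1 framing (10 bits on the wire per byte). The function names are only illustrative and are not part of my sketch.

```c
#include <stdint.h>

/* Bytes the camera produces for one frame. */
uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

/* Bytes left after keeping one byte out of every `stride` bytes. */
uint32_t kept_bytes(uint32_t total_bytes, uint32_t stride) {
    return total_bytes / stride;
}

/* Milliseconds needed to push `bytes` through a UART.
   bits_per_byte = 10 assumes 8N1 framing (an assumption, not a measured value). */
uint32_t uart_time_ms(uint32_t bytes, uint32_t baud, uint32_t bits_per_byte) {
    return (uint32_t)(((uint64_t)bytes * bits_per_byte * 1000ULL) / baud);
}
```

With QVGA at 2 bytes per pixel this gives 153600 bytes per frame, 76800 bytes if only every other byte is sent, and about 768 ms to send those over a 1 Mbit/s link under the 8N1 assumption.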

On every rising edge of RCK a byte appears on the Mega's input pins. I read one byte, skip the next rising edge, and repeat this for every row of a frame. I toggle a test signal once before the transmission of a byte starts and once after it completes. As you can see in the attached picture, by the time one byte has been transmitted I have missed about 7 clock pulses.
Every byte transmitted to my laptop takes 8 µs, and I somehow have to do this in less than 2 µs to keep up with RCK. I cannot lower the RCK frequency, because 1 MHz is the minimum value I must use.
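
To put numbers on the mismatch, here is a minimal sketch of the timing budget. It assumes 8N1 framing for the theoretical UART figure, and the helper names are made up for illustration. It only does the arithmetic and does not touch any hardware.

```c
#include <stdint.h>

/* Time to send one byte over a UART, in nanoseconds.
   bits_per_frame = 10 corresponds to 8N1 (1 start + 8 data + 1 stop),
   which is an assumption about the serial settings. */
uint32_t uart_byte_time_ns(uint32_t baud, uint32_t bits_per_frame) {
    return (uint32_t)(((uint64_t)bits_per_frame * 1000000000ULL) / baud);
}

/* Time available per transmitted byte when one byte is kept out of every
   `stride` read-clock cycles (stride = 2 for "read one, skip one"). */
uint32_t read_budget_ns(uint32_t rck_hz, uint32_t stride) {
    return (uint32_t)(((uint64_t)stride * 1000000000ULL) / rck_hz);
}

/* Number of RCK cycles that elapse while one byte is being sent, rounded up. */
uint32_t rck_cycles_per_byte(uint32_t byte_time_ns, uint32_t rck_hz) {
    uint64_t period_ns = 1000000000ULL / rck_hz;
    return (uint32_t)((byte_time_ns + period_ns - 1) / period_ns);
}
```

At 1 Mbit/s with 8N1 a byte needs 10000 ns on the wire, while reading one byte out of every two RCK cycles at 1 MHz leaves only 2000 ns. With the measured 8 µs per byte, 8 RCK cycles pass for each byte sent, that is the one I read plus about 7 that are missed.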

I have checked the timing requirements of the other signals in the datasheet against my logic analyser capture, and everything appears to be where it should be, so I think the issue above is the only thing left that is not right?

Relevant part of code:

void StringPgm(const char * str){
  do{
      while (!(UCSR0A & (1 << UDRE0)));//wait until the UART data register is empty
      UDR0 = pgm_read_byte_near(str);
      while (!(UCSR0A & (1 << UDRE0)));//wait until the byte has left the data register
  } while (pgm_read_byte_near(++str));
}

static void captureImg(uint16_t wg, uint16_t hg){
  uint16_t y, x;
  StringPgm(PSTR("*RDY*"));
  
  //Write operation
  //---------------------
  while (!(PINA & 16));//wait for VSYNC high
  while ((PINA & 16));//wait for VSYNC low
  delay(1);
  PORTA = PORTA & ~(1<<2); //set pin 24(/WRST) LOW
  delay(1);
  PORTA = PORTA | (1<<2); //set pin 24(/WRST) HIGH
  PORTA = PORTA | (1<<0); //set pin 22(WR) HIGH
  while (!(PINA & 16));//wait for VSYNC high
  PORTA = PORTA & ~(1<<0); //set pin 22(WR) LOW
  delay(1);
  
  //Read operation
  //---------------------
  PORTA = PORTA & ~(1<<3); //set pin 25(/RRST) LOW
  delay(1);
  PORTA = PORTA | (1<<3); //set pin 25(/RRST) HIGH
  PORTA = PORTA & ~(1<<1); //set pin 23(/OE) LOW
  y = hg;
  while (y--){
    uint8_t*b=buf,*b2=buf;
    x = wg;
    while (x--){
      while (!(PINL & 8));//wait for RCK HIGH
      PORTH = PORTH | (1<<4); //set pin 7 HIGH //TEST!!!
      UDR0= (PINC & 255);
      while (!(UCSR0A & (1 << UDRE0)));//wait for byte to transmit
      PORTH = PORTH & ~(1<<4); //set pin 7 LOW //TEST!!!
      while (!(PINL & 8));//wait for RCK HIGH
      //skip byte
    }
  }
  PORTA = PORTA | (1<<1); //set pin 23(/OE) HIGH
  delay(2);
}

void setup(){
  pinMode(7,OUTPUT);
  digitalWrite(7,LOW);
  arduinoUnoInut();
  camInit();
  setRes();
  setColor();
  wrReg(REG_CLKRC, 1);
}


void loop(){
  captureImg(320, 240);
}

32.jpg