Arduino/ATmega328 C64 Emulator

Thanks

It is a bit tight on memory, but I think I can squeeze a 1000-byte video buffer into RAM.

To shift out a 4 MHz pixel rate I had to give up the serial RX/TX port and use the USART in SPI master mode.
So the video is on the Arduino TX pin and the sync is on digital pin 2.
Works really great.

I managed to get it to sync.
The problem was my 3.5" LCD-TV. I replaced it with a 5" LCD-TV and it syncs both with NTSC and PAL modes.
Now I have a 40x25 textmode output.
Yes, it supports a graphicsmode of 160x100 with a character ROM set containing 2x4 blockdrawing characters.

Using the UART in SPImode, I noticed no 9th bit problem. It shifts 8bits/byte.

Because the pixel clock is 8 MHz, I'm now struggling to load a video shift byte in just 16 clock cycles.
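The 16-cycle figure follows from the clock ratio: with the 16 MHz CPU clock mentioned later in the thread, each pixel lasts two CPU cycles at 8 MHz, and one shifted byte carries 8 pixels. A minimal sketch of that arithmetic, where the function name is only illustrative:

```c
/* CPU cycles available to fetch and load one video byte (8 pixels).
   Assumes the CPU clock is an integer multiple of the pixel clock. */
static unsigned cycles_per_video_byte(unsigned long cpu_hz, unsigned long pixel_hz)
{
    unsigned long cycles_per_pixel = cpu_hz / pixel_hz;
    return (unsigned)(cycles_per_pixel * 8UL);
}
```

With a 16 MHz CPU this gives 16 cycles per byte at an 8 MHz pixel clock, and 32 cycles at the 4 MHz pixel clock used for the smaller video map.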

It may have to be done with inline assembly but I have no idea how to get a pointer to my videoRAM/charROM passed to the inline assembly code.

Pointer-register
A very special extra role is defined for the register pairs R26:R27, R28:R29 and R30:R31. The role is so important that these pairs have extra names in assembler: X, Y and Z. These pairs are 16-bit pointer registers, able to hold 16-bit addresses that point into SRAM (X, Y or Z) or into program memory (Z).

The lower byte of the 16-bit address is located in the lower register, the higher byte in the upper register. Both parts have their own names, e.g. the higher byte of Z is named ZH (=R31), the lower byte is ZL (=R30). These names are defined in the standard header file for the chips. Splitting a 16-bit pointer value into its two bytes is done as follows:

.EQU Adress = RAMEND ; RAMEND is the highest 16-bit address in SRAM
   LDI YH,HIGH(Adress) ; Set the MSB
   LDI YL,LOW(Adress) ; Set the LSB

Accesses via pointers are programmed with specially designed commands. Read access is named LD (LoaD), write access named ST (STore), e.g. with the X-pointer: 


Pointer  Sequence                                                                Examples
X        Read/write from address X, don't change the pointer                     LD R1,X    ST X,R1
X+       Read/write from/to address X and increment the pointer afterwards       LD R1,X+   ST X+,R1
-X       Decrement the pointer by one, then read/write from/to the new address   LD R1,-X   ST -X,R1

Similarly you can use Y and Z for that purpose.

There is only one command for read access to the program storage. It is defined for the pointer pair Z and it is named LPM (Load from Program Memory). The command copies the byte at address Z in the program memory to the register R0. As the program memory is organised word-wise (one command on one address consists of 16 bits or two bytes or one word) the least significant bit selects the lower or higher byte (0=lower byte, 1=higher byte). Because of this the original word address must be multiplied by 2, and the word address is limited to 15 bits, i.e. 32 K words (64 kB) of program memory. Like this:

   LDI ZH,HIGH(2*Adress)
    LDI ZL,LOW(2*Adress)
    LPM

Following this command the address must be incremented to point to the next byte in program memory. As this is used very often, a special pointer increment command has been defined to do this:

   ADIW ZL,1
    LPM

ADIW means ADd Immediate Word and a maximum of 63 can be added this way. Note that the assembler expects the lower register of the pointer pair, ZL, as the first parameter. This is somewhat confusing, as the addition is done as a 16-bit operation.
The complementary command, subtracting a constant value between 0 and 63 from a 16-bit pointer register, is named SBIW (SuBtract Immediate Word). ADIW and SBIW are possible for the pointer register pairs X, Y and Z and for the register pair R25:R24, which does not have an extra name and does not allow access to SRAM or program memory locations. R25:R24 is ideal for handling 16-bit values.

 As incrementation after reading is very often needed, newer AVR types have the instruction

   LPM R,Z+

This loads the byte read into any register R and auto-increments the pointer register.
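The byte addressing that LPM uses (word address times two, with the lowest bit of Z selecting the low or high byte) can be modeled in plain C. This is only a host-side sketch: `lpm_model` and the `flash_words` array are hypothetical names standing in for the real flash, not anything from the AVR toolchain.

```c
#include <stdint.h>

/* Model of LPM: flash is an array of 16-bit words, z is the byte address
   held in the Z pointer (word address * 2 + byte select). */
static uint8_t lpm_model(const uint16_t *flash_words, uint16_t z)
{
    uint16_t word = flash_words[z >> 1];          /* bits 15..1 pick the word */
    return (z & 1u) ? (uint8_t)(word >> 8)        /* bit 0 = 1: higher byte   */
                    : (uint8_t)(word & 0xFFu);    /* bit 0 = 0: lower byte    */
}
```

Reading the byte at z and then at z + 1 walks through the low and high byte of the same word, which is why incrementing Z by one after each LPM steps through a table byte by byte.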

Hmm.., Thinking..

This is the pgm_read_byte function from pgmspace.h

(__extension__({                \
    uint16_t __addr16 = (uint16_t)(addr); \
    uint8_t __result;           \
    __asm__                     \
    (                           \
        "lpm" "\n\t"            \
        "mov %0, r0" "\n\t"     \
        : "=r" (__result)       \
        : "z" (__addr16)        \
        : "r0"                  \
    );                          \
    __result;                   \
}))

How do I interpret this?
Does it load Z from addr16?

Ok, I'm beginning to understand this :slight_smile:

Constraint  Used for                          Range
a           Simple upper registers            r16 to r23
b           Base pointer register pairs       y, z
d           Upper register                    r16 to r31
e           Pointer register pairs            x, y, z
q           Stack pointer register            SPH:SPL
r           Any register                      r0 to r31
t           Temporary register                r0
w           Special upper register pairs      r24, r26, r28, r30
x           Pointer register pair X           x (r27:r26)
y           Pointer register pair Y           y (r29:r28)
z           Pointer register pair Z           z (r31:r30)
G           Floating point constant           0.0
I           6-bit positive integer constant   0 to 63
J           6-bit negative integer constant   -63 to 0
K           Integer constant                  2
L           Integer constant                  0
l           Lower registers                   r0 to r15
M           8-bit integer constant            0 to 255
N           Integer constant                  -1
O           Integer constant                  8, 16, 24
P           Integer constant                  1
Q           (GCC >= 4.2.x) A memory address based on Y or Z pointer with displacement
R           (GCC >= 4.3.x) Integer constant   -6 to 5
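So the answer to the earlier question is yes: the "z" input constraint tells the compiler to place the operand in r31:r30 before the asm body runs, which is how the macro gets Z loaded from __addr16. A minimal sketch of passing a pointer the same way, assuming an enhanced-core AVR such as the ATmega328 (which has the `lpm Rd, Z` form). The helper name `read_rom_byte` is made up for illustration, and the non-AVR branch exists only so the sketch can be compiled and tried on a PC.

```c
#include <stdint.h>

/* Hypothetical helper: read one byte through a pointer passed in Z.
   On AVR the pointer must refer to program memory (flash). */
static inline uint8_t read_rom_byte(const uint8_t *p)
{
#if defined(__AVR__)
    uint8_t result;
    __asm__ (
        "lpm %0, Z" "\n\t"
        : "=r" (result)   /* output: any register        */
        : "z" (p)         /* input: pointer in r31:r30   */
    );
    return result;
#else
    return *p;            /* host fallback, plain memory read */
#endif
}
```

For a pointer into video RAM instead of flash, the "e" constraint from the table (any of X, Y, Z) together with an LD instruction would be the analogous choice.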

janost:
How do I interpret this?

Inline assembly is messy with this compiler.

This page helped me a lot: Inline Assembler Cookbook

Thanks. Yes, it helps.

The character ROM will need 1024 bytes for char codes 0-127.
Char codes 128-255 are the same characters in reverse video, so a simple XOR lets them reuse the same images.
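A minimal sketch of that XOR trick. It assumes 8 bytes per character stored consecutively (128 x 8 = 1024 bytes), which matches the size given above, but the exact layout and the name `glyph_row` are assumptions for illustration.

```c
#include <stdint.h>

#define GLYPH_ROWS 8   /* 1024 bytes / 128 characters */

/* Return one row of pixels for a char code. Codes 128-255 reuse the
   image of codes 0-127 with all pixels inverted (reverse video). */
static uint8_t glyph_row(const uint8_t *rom, uint8_t charcode, uint8_t row)
{
    uint8_t bits = rom[(uint16_t)(charcode & 0x7F) * GLYPH_ROWS + row];
    return (charcode & 0x80) ? (uint8_t)~bits : bits;
}
```

Inverting all eight bits is the same as XOR with 0xFF, so the upper half of the character set costs no extra ROM.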

There are 2 charsets on a CBM. To support switching takes another 1024bytes.

To support Hires ( :fearful: ) graphics at 160x100, the block graphic charset requires 2048 bytes.
Could the same XOR trick work here too, to reduce it to 1024 bytes?

The video code clocks in at just a bit over 1K.

My videocode supports 22x23 videomap with 4MHz pixelclock and 40x25 videomap with 8MHz pixelclock.
Both in PAL and NTSC. It also has a selectable bordercolor, CBM style, of black or white.
Using SCART for color or 4 resistors for 16shade B/W it can do color but needs additional RAM for storing colors.

It is interrupt-driven with Timer 0 at 15625/15748 Hz, so other tasks can run when it is not blasting pixels.
Pixel output takes 184/200 of the 262/312 lines, so it's like running the AVR at 6 MHz instead of 16 MHz.
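A rough check of that estimate, under the simplifying assumption that the ISR holds the CPU for the whole of every picture line and costs nothing on the remaining lines (in practice the sync-only lines take some time too). The function name is illustrative only.

```c
#include <stdint.h>

/* Effective CPU clock in kHz left for other tasks when the video ISR
   occupies `active_lines` out of `total_lines` scanlines per frame. */
static uint32_t effective_khz(uint32_t cpu_khz, uint32_t total_lines,
                              uint32_t active_lines)
{
    return cpu_khz * (total_lines - active_lines) / total_lines;
}
```

For a 16 MHz CPU this works out to about 4.8 MHz with 184 of 262 lines and about 5.7 MHz with 200 of 312 lines, in the same ballpark as the 6 MHz quoted.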

All on a single 328 chip and 2 resistors.

I would say its much more capable than the TVout library?
Perhaps I will release this as a standalone cbmTVout library?

janost:
It is interrupt-driven with Timer 0 at 15625/15748 Hz, so other tasks can run when it is not blasting pixels.
Pixel output takes 184/200 of the 262/312 lines, so it's like running the AVR at 6 MHz instead of 16 MHz.

All on a single 328 chip and 2 resistors.

I would say its much more capable than the TVout library?

Definitely. 320x200 is more than double the resolution of 'tvout'.

I'm currently looking to build a Space Invaders machine based on Arduino. Space Invaders needs 256x224 resolution so that fits perfectly. With character mapping it should be possible to squeeze all the graphics in.

janost:
Perhaps I will release this as a standalone cbmTVout library?

Can we see it? Just a demo of the video display code will do.

I'm at work at the moment but will put something out tonight.

It is using the UART in SPI mode like Nick's VGA code, so it's very similar.
So the TX pin is the video output from the videoshifter and D2 is sync.

You lose the TX/RX serialport on the Arduino but I don't think it matters for the application.

My code still outputs a square box for every character, since I have yet to solve loading the video shift byte in 16 cycles.

janost:
I'm at work at the moment but will put something out tonight.

No hurry. This is a medium-term project I'm thinking about.

janost:
It is using the UART in SPI mode like Nick's VGA code, so it's very similar.
So the TX pin is the video output from the videoshifter and D2 is sync.

You lose the TX/RX serialport on the Arduino but I don't think it matters for the application.

It makes program development a real pain though.

Can it use the real SPI port instead?

janost:
My code still outputs a square box for every character, since I have yet to solve loading the video shift byte in 16 cycles.

That might be important... :slight_smile:

You only lose it while the sketch is running.
It still uploads with the Arduino IDE even with the videoresistors connected.
Just no serial debug.

The SPI interface runs with 9bits, not 8 so you get a white pixelgap on every byte.
The USART in SPImode runs in MSPIM mode hence no 9bit problem, just 8pixels/byte, back to back :slight_smile:

janost:
You only lose it while the sketch is running.
It still uploads with the Arduino IDE even with the videoresistors connected.

Ok, it was the upload I was worried about. I can debug over the I2C interface.

janost:
The SPI interface runs with 9bits, not 8 so you get a white pixelgap on every byte.
The USART in SPImode runs in MSPIM mode hence no 9bit problem, just 8pixels/byte, back to back :slight_smile:

Oh, I forgot about clock cycle 9.

(And I didn't know the USART SPI doesn't do it)

Ok, here is the ISR.
It fires 15748 times/sec.

Remember, this is just a proof of concept and not optimized.

ISR(TIMER0_COMPA_vect) {  // Timer 0 compare interrupt, fires once per scanline
  // Blank lines above and below the picture area: sync only
  if (((scanline > 17) && (scanline < 40)) || (scanline > 239)) {
    //DDRD=DDRD|0x02;
    UCSR0B = _BV(TXEN0);
    PORTD = 0;    // Hsync
    UDR0 = 0x00;  // Load first byte
    for (byte x = 0; x < 3; x++) {
      // wait for transmitter ready
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      // send pixelbyte
      UDR0 = 0x00;
    }
    while ((UCSR0A & _BV(UDRE0)) == 0) {}
    if (border == 0) UCSR0B = 0;
    PORTD = 4;
  }

  // Vertical sync lines: also reset the per-frame state
  if (scanline < 18) {
    //UCSR0B = 0;
    UCSR0B = _BV(TXEN0);
    PORTD = 4;    // Vsync
    UDR0 = 0x00;  // Load first byte
    for (byte x = 0; x < 3; x++) {
      // wait for transmitter ready
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      // send pixelbyte
      UDR0 = 0x00;
    }
    PORTD = 0;
    UCSR0B = 0;
    videoptr = 0;
    row = 0;
  }

  // Picture lines 40..239 (200 lines)
  if ((scanline > 39) && (scanline < 240)) {
    UCSR0B = _BV(TXEN0);
    PORTD = 0;    // Hsync
    UDR0 = 0x00;  // Load first byte
    for (byte x = 0; x < 3; x++) {
      // wait for transmitter ready
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      // send pixelbyte
      UDR0 = 0x00;
    }

    while ((UCSR0A & _BV(UDRE0)) == 0) {}
    // send colorburst
    UDR0 = B10110010;
    PORTD = 0;
    while ((UCSR0A & _BV(UDRE0)) == 0) {}
    PORTD = 4;

    // three blank bytes after the burst
    for (byte x = 0; x < 3; x++) {
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      UDR0 = 0x00;
    }
    // eight bytes of left border
    for (byte x = 0; x < 8; x++) {
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      UDR0 = border;
    }

    // 40 character columns. The video RAM lookup is still commented out,
    // so every column shows the same glyph row for now.
    for (byte x = 0; x < 40; x++) {
      //charcode=videomem[videoptr++];

      // wait for transmitter ready
      while ((UCSR0A & _BV(UDRE0)) == 0) {}
      // send pixelbyte
      UDR0 = pgm_read_byte(&charROM[row]);
    }

    while ((UCSR0A & _BV(UDRE0)) == 0) {}
    UDR0 = border;  // Front porch

    if (border == 0) UCSR0B = 0;

    // start of the current text row in video RAM, and glyph row 0..7
    videoptr = ((scanline - 64) & 0xF8) * 40;
    row = scanline & 0x07;
  }

  scanline++;
  if (scanline > 261) scanline = 0;
}

40x25 CBMvideo.jpg

janost:
Because the pixel Clock is 8MHz I'm now struggling with loading a videoshift byte in just 16 clockcycles.

I got a character out of program memory for VGA output in 17 cycles, if that helps:

Didn't need assembly either, although I looked at it to make sure optimal code was generated.

Ok, I see. The scan lines are started on an interrupt. They send a complete line of video, then return.

The problem is to read a byte from a charmap, look up a byte of data from the character ROM based on that, and output it to the USART, all in 16 clock cycles.

I think the AVR chip can do that in ten cycles, a loop will add three cycles, you still need three NOPs to pad it to 16!

Coercing the compiler into doing it is another matter. You might have to resort to inline assembly language.

janost:
The USART in SPImode runs in MSPIM mode hence no 9bit problem, just 8pixels/byte, back to back

I think I used MSPIM mode, I got a bit distracted and went onto something else before I solved the 17th clock cycle issue. It might be possible.

fungus:
you still need three NOPs to pad it to 16!

...but you can use those three cycles if you want to invert the char when the top bit is set.

fungus:
Coercing the compiler into doing it is another matter. You might have to resort to inline assembly language.

This C loop compiles to the 17-cycle output loop:

// blit pixel data to screen    
  while (i--)
    UDR0 = pgm_read_byte (linePtr + (* messagePtr++));

You might be able to optimize away one cycle, and in any case 16 cycles is the absolute minimum, of course. My testing at some point revealed the 17th cycle was necessary or the hardware threw away some output.
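Part of what keeps that loop short is the font layout: if the font is stored one pixel row at a time (all characters for row 0, then all characters for row 1, and so on), the per-character lookup is a single add of the char code to a pointer that is set up once per scanline. A host-side sketch of the idea, writing into a buffer instead of UDR0. The flat 8 x 256 layout and the names here are assumptions for illustration, not the code from either project.

```c
#include <stdint.h>
#include <stddef.h>

#define FONT_ROWS  8
#define FONT_CHARS 256

/* font: FONT_ROWS blocks of FONT_CHARS bytes, one block per pixel row.
   text: char codes for this text line; out receives one byte per column. */
static void blit_scanline(const uint8_t *font, const uint8_t *text,
                          size_t columns, uint8_t row, uint8_t *out)
{
    const uint8_t *linePtr = font + (size_t)row * FONT_CHARS; /* once per line */
    while (columns--)
        *out++ = linePtr[*text++];   /* one add and one load per character */
}
```

With a character-major layout (8 consecutive bytes per character) the inner loop would also need a multiply by 8 or three shifts per character, which is hard to fit into a 16-cycle budget.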

Incidentally, this code:

 (1)    f0e:	e7 fd       	sbrc	r30, 7
 (1)    f10:	f0 95       	com	r31

Is exactly what you need to invert the data if the top bit is set, thus proving it can be done.

That looks like it cut it down to 16 cycles. And it shows you don't need to muck around with assembler to achieve it. :slight_smile:

     ee8:       8d 91           ld      r24, X+   (2)
     eea:       f9 01           movw    r30, r18   (1)
     eec:       e8 0f           add     r30, r24   (1)
     eee:       f1 1d           adc     r31, r1   (1)
     ef0:       e8 59           subi    r30, 0x98       ; 152   (1)
     ef2:       ff 4f           sbci    r31, 0xFF       ; 255   (1)
     ef4:       e4 91           lpm     r30, Z+   (3)
     ef6:       e0 93 c6 00     sts     0x00C6, r30   (2)
     efa:       a4 17           cp      r26, r20   (1)
     efc:       b5 07           cpc     r27, r21   (1)
     efe:       a1 f7           brne    .-24            ; 0xee8 <_Z13doOneScanLinev+0x64>   (1/2)
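Adding up the cycle counts in parentheses from the listing above, and counting the brne as 2 cycles because the branch is taken on every pass except the last, confirms the total. A tiny sketch of the sum, with the counts copied from the listing:

```c
#include <stddef.h>

/* ld, movw, add, adc, subi, sbci, lpm, sts, cp, cpc, brne (taken) */
static const unsigned char loop_cycles[] = { 2, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2 };

static unsigned total_cycles(const unsigned char *cycles, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; i++)
        total += cycles[i];
    return total;
}
```

The sum is 16, which is exactly one byte time at two CPU cycles per pixel.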

Changed line:

  register byte * messagePtr = (byte *) & (message [messageLine] [0] );

One of my other attempts (using the main SPI hardware) had a couple of clock cycles up its sleeve:

  // pre-load pointer for speed
  const register byte * linePtr = &screen_font [ (vLine >> 1) & 0x07 ] [0];
  register char * messagePtr =  & (message [messageLine] [0] );

  // how many pixels to send
  register byte i = horizontalBytes;

  // turn transmitter on 
  SPSR = _BV (SPI2X);
  SPCR = _BV (SPE) | _BV (MSTR);

  // blit pixel data to screen    
  while (i--)
    {
    SPDR = pgm_read_byte (linePtr + (* messagePtr++));
    nop; nop;
    }

And believe me, I wouldn't have thrown in NOPs if they weren't necessary. :slight_smile: