
Topic: Arduino 6502 emulator + BASIC interpreter

janost

#15
Oct 30, 2013, 09:05 am Last Edit: Oct 30, 2013, 09:37 am by janost Reason: 1
It wasn't that slow.

FOR I=0 TO 1000:NEXT completes in 30sec.
I think it should take around 1sec, and the problem is the Arduino I2C, which runs at 100kHz.

Can the Wire library run at 1MHz?
That would make it more up to speed?

Or cache memory reads/writes in RAM?
Like a CPU-cache?

janost

#16
Oct 30, 2013, 11:59 pm Last Edit: Oct 31, 2013, 12:00 am by janost Reason: 1
Ok, the FOR I=0 TO 1000:NEXT loop is now down to 3sec.
That is a tenfold increase in speed :)

I implemented a CPU FIFO L1-Cache with 4 TLB buffers of 128bytes.
It works great, even with random PEEKs all over the EEPROM.

If the data is in the cache it is returned from RAM.
Otherwise the oldest buffer gets flushed (written back if dirty) and a new one is read in from the EEPROM.

But there is still a bit of a penalty when there is a cache miss on writes.

Perhaps a greater number of smaller buffers would be better?
Like 32 buffers of 16 bytes?

How many bytes does EhBasic use for different variables not including strings?

fungus


Ok, the FOR I=0 TO 1000:NEXT loop is now down to 3sec.
That is a tenfold increase in speed :)

I implemented a CPU FIFO L1-Cache with 4 TLB buffers of 128bytes.
It works great, even with random PEEKs all over the EEPROM.


Good idea!


Perhaps a greater number of smaller buffers would be better?
Like 32 buffers of 16 bytes?


Only one way to find out...

I'm fairly sure the i2c can be made faster, too.
No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

janost

#18
Oct 31, 2013, 09:08 am Last Edit: Oct 31, 2013, 09:16 am by janost Reason: 1
Yes, I changed the I2C speed to 400kHz with:
Code: [Select]

Wire.begin();
TWBR = 12;


It can still be optimized more by using block read/write.
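
For reference, the TWBR value maps to a bus clock through the ATmega328P datasheet formula f_SCL = F_CPU / (16 + 2 * TWBR * prescaler). A minimal sketch of that arithmetic in plain C++ follows; the helper names `sclFrequency` and `twbrFor` are made up for illustration and are not part of the Wire library.

```cpp
#include <cassert>
#include <cstdint>

// SCL clock from the ATmega328P datasheet formula:
//   f_SCL = F_CPU / (16 + 2 * TWBR * prescaler)
// The prescaler is 1, 4, 16 or 64 (TWPS bits); 1 is assumed by default here.
constexpr uint32_t sclFrequency(uint32_t fCpu, uint8_t twbr, uint8_t prescaler = 1) {
    return fCpu / (16UL + 2UL * twbr * prescaler);
}

// Inverse of the formula for prescaler 1: TWBR needed for a target SCL clock.
// (Hypothetical helper, integer division rounds down.)
uint8_t twbrFor(uint32_t fCpu, uint32_t fScl) {
    assert(fScl > 0 && fCpu / fScl >= 16);      // below 16 the result would be negative
    uint32_t twbr = (fCpu / fScl - 16) / 2;
    assert(twbr <= 255);                        // TWBR is an 8-bit register
    return static_cast<uint8_t>(twbr);
}
```

With a 16MHz clock, TWBR = 12 gives 400kHz and TWBR = 72 gives 100kHz. On an 8MHz clock the same TWBR = 12 gives only 200kHz, and 400kHz needs TWBR = 2. The formula allows 1MHz at 16MHz (TWBR = 0), but whether the EEPROM and the bus wiring tolerate that is a separate question.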

Here is the code for the 4 buffer FIFO L1-Cache.
The writecache resides in the readcache function and only writes back a buffer if it's dirty.

Code: [Select]

 void writeEEPROM(int deviceaddress, unsigned int eeaddress, byte data ) {
 int cache;
 int page=(eeaddress >> 7);
 if (cachepage[0]==page) {
  cacheram[(eeaddress & 127)]=data;
  cachedirty[0]=1;
  return;
 }
 if (cachepage[1]==page) {
  cacheram[(eeaddress & 127)+128]=data;
  cachedirty[1]=1;
  return;
 }  
 if (cachepage[2]==page) {
  cacheram[(eeaddress & 127)+256]=data;
  cachedirty[2]=1;
  return;
 }
 if (cachepage[3]==page) {
  cacheram[(eeaddress & 127)+384]=data;
  cachedirty[3]=1;
  return;
 }
 readcache(page,nextcache);
 cache=nextcache;
 nextcache++;
 if (nextcache>3) nextcache=0;
 cacheram[(eeaddress & 127)+(cache << 7)]=data;
 cachedirty[cache]=1;  
}

byte readEEPROM(int deviceaddress, unsigned int eeaddress ) {
 int cache;
 int page=(eeaddress >> 7);
 if (cachepage[0]==page) return cacheram[(eeaddress & 127)];
 if (cachepage[1]==page) return cacheram[(eeaddress & 127)+128];
 if (cachepage[2]==page) return cacheram[(eeaddress & 127)+256];
 if (cachepage[3]==page) return cacheram[(eeaddress & 127)+384];
 readcache(page,nextcache);
 cache=nextcache;
 nextcache++;
 if (nextcache>3) nextcache=0;
 return cacheram[(eeaddress & 127)+(cache << 7)];
}
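
The `readcache` function called above is not shown in the post. Below is a guess at its shape, written as plain C++ so it can run off-target: the `eeprom` array stands in for the I2C device, the `lookup` helper and the `eepromWrites` counter are invented for illustration, the `deviceaddress` parameter is dropped, and empty buffers are marked with page -1, which is an assumption about initialization.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

const int PAGE_SIZE = 128;
const int NUM_BUFFERS = 4;

uint8_t eeprom[32768];   // hypothetical stand-in for the I2C EEPROM
int eepromWrites = 0;    // counts page write-backs, for illustration only

int cachepage[NUM_BUFFERS] = {-1, -1, -1, -1};  // assumption: -1 marks an empty buffer
uint8_t cachedirty[NUM_BUFFERS] = {0, 0, 0, 0};
uint8_t cacheram[NUM_BUFFERS * PAGE_SIZE];
int nextcache = 0;       // FIFO pointer: the buffer that gets replaced next

// Evict buffer `slot` (writing it back only if dirty), then load `page` into it.
void readcache(int page, int slot) {
    assert(slot >= 0 && slot < NUM_BUFFERS);
    uint8_t *buf = &cacheram[slot * PAGE_SIZE];
    if (cachedirty[slot] && cachepage[slot] >= 0) {
        std::memcpy(&eeprom[cachepage[slot] * PAGE_SIZE], buf, PAGE_SIZE);
        eepromWrites++;
    }
    std::memcpy(buf, &eeprom[page * PAGE_SIZE], PAGE_SIZE);
    cachepage[slot] = page;
    cachedirty[slot] = 0;
}

// Return the buffer holding `page`, loading it FIFO-style on a miss.
int lookup(int page) {
    for (int i = 0; i < NUM_BUFFERS; i++)
        if (cachepage[i] == page) return i;
    int slot = nextcache;
    readcache(page, slot);
    nextcache = (nextcache + 1) % NUM_BUFFERS;
    return slot;
}

uint8_t readEEPROM(unsigned int eeaddress) {
    int slot = lookup(eeaddress >> 7);
    return cacheram[slot * PAGE_SIZE + (eeaddress & 127)];
}

void writeEEPROM(unsigned int eeaddress, uint8_t data) {
    int slot = lookup(eeaddress >> 7);
    cacheram[slot * PAGE_SIZE + (eeaddress & 127)] = data;
    cachedirty[slot] = 1;
}
```

Writes to four different pages stay in RAM; the first write-back only happens when a fifth page forces the oldest buffer out.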

fungus


The writecache resides in the readcache function and only writes back a buffer if it's dirty.


Your cache misses are very expensive (you have to use I2C!)  so more, smaller pages is probably good.


Code: [Select]

int page=(eeaddress >> 7);
if (cachepage[0]==page) ...



That code will optimize horribly on Arduino. Bit shifting 16-bit values is really slow, the compiler usually generates a loop for it.

You can eliminate all the if statements and bit shifting by using a simple hash, eg.

Code: [Select]

// eg. Sixteen pages of sixteen bytes each
byte cachePage[16];
byte cacheRam[16][16];

byte a = byte(eeaddress);  // Bottom 8 bits of address
byte p = eeaddress>>8;     // Page of RAM we need (the compiler usually figures out how to optimize a shift by 8 bits)

byte n = a>>4;    // Top 4 bits of 'a' are cache page
byte r = a&0x0f;  // Bottom 4 bits of 'a' are cache index
if (cachePage[n] == p) return cacheRam[n][r];
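
The snippet above covers only the hit path. A possible complete version, as a sketch: each 16-byte line can hold exactly one block, chosen by address bits 4 to 7, so a miss just replaces that line. The `eeprom` array, the `cacheValid`/`cacheDirty` flags, the `cacheLine` helper and the `misses` counter are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef uint8_t byte;    // Arduino's `byte`, defined here so the sketch builds off-target

byte eeprom[32768];      // hypothetical stand-in for the I2C EEPROM
unsigned misses = 0;     // illustration only

byte cachePage[16];      // high address byte currently held by each line
byte cacheRam[16][16];
bool cacheValid[16];     // assumption: needed because 0 is a legitimate page number
bool cacheDirty[16];

// Make sure the line for `eeaddress` is loaded and return a pointer to it.
byte *cacheLine(unsigned int eeaddress) {
    assert(eeaddress < sizeof(eeprom));
    byte a = byte(eeaddress);   // bottom 8 bits of address
    byte p = eeaddress >> 8;    // page of RAM we need
    byte n = a >> 4;            // top 4 bits of 'a' select the cache line
    if (!cacheValid[n] || cachePage[n] != p) {
        misses++;
        if (cacheValid[n] && cacheDirty[n])   // write the old contents back first
            std::memcpy(&eeprom[(cachePage[n] << 8) | (n << 4)], cacheRam[n], 16);
        std::memcpy(cacheRam[n], &eeprom[(p << 8) | (n << 4)], 16);
        cachePage[n] = p;
        cacheValid[n] = true;
        cacheDirty[n] = false;
    }
    return cacheRam[n];
}

byte readEEPROM(unsigned int eeaddress) {
    return cacheLine(eeaddress)[eeaddress & 0x0f];
}

void writeEEPROM(unsigned int eeaddress, byte data) {
    byte n = byte(eeaddress) >> 4;
    cacheLine(eeaddress)[eeaddress & 0x0f] = data;
    cacheDirty[n] = true;
}
```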


Later on when you're happy you have the best performance change it to something like:
Code: [Select]

// No bit-shifting please, we're an AVR
byte a = byte(eeaddress);  // Bottom 8 bits of address
byte p = eeaddress>>8;     // Page of RAM we need (the compiler usually figures out how to optimize a shift by 8 bits)
byte r = a&0x0f;
switch (a&0xf0) {
 case 0x00:  if (cachePage[0] == p) return cacheRam[0][r];  break;  // break: don't fall through into the next line's check
 case 0x10:  if (cachePage[1] == p) return cacheRam[1][r];  break;
 ...
 case 0xf0: if (cachePage[15] == p) return cacheRam[15][r];  break;
}



janost

Thanks for your tips.

More optimization tonight.
I'm at work at the moment so can't try anything.

When I get this running smoothly I'll use a 24LC512 for the full 64KB.
Then the 6502 ROMs will fit there also.

And also implement I/O in the address space.

janost


Your cache misses are very expensive (you have to use I2C!)  so more, smaller pages is probably good.


I was thinking that cache writes only happen when the interpreter accesses variables.
In that case 128bytes is better?

If it has to cache 6502 machinecode, 16bytes is better?

fungus


I was thinking that cache writes only happen when the interpreter accesses variables.


Misses on read are expensive too.


If it has to cache 6502 machinecode, 16bytes is better?


With the same amount of cache, smaller pieces is almost always better.

janost

Yes, but on cachewrite there is a 5ms delay on each write.
To allow the EEPROM to store the data.

I'll try it with your 16x16byte cache.

The emulatorcode can also be optimized with inline assembler to make it even faster.

I want to run this on a standalone 328P chip with internal 8MHz osc.
No external components other than the EEPROM.

So I really need to get to the 1sec loop test, and faster, to be able to take the cache misses.


fungus


Yes, but on cachewrite there is a 5ms delay on each write.
To allow the EEPROM to store the data.


Yep.

More reason to make the blocks very small...

With delays like that it might be worth flagging the dirty bytes individually. You could have a block size of 16 and a 16-bit int with one bit for each dirty byte in the block. :)
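
A sketch of that per-byte dirty mask, assuming a 16-byte block; `Block`, `blockWrite`, `blockFlush` and the `eeprom` stand-in array are hypothetical names, not code from the emulator.

```cpp
#include <cassert>
#include <cstdint>

uint8_t eeprom[32768];   // hypothetical stand-in for the I2C EEPROM

struct Block {
    uint8_t data[16];
    uint16_t dirtyMask;  // bit i set => data[i] was modified since the block was loaded
};

// Store one byte in the block and remember which byte changed.
void blockWrite(Block &b, uint8_t offset, uint8_t value) {
    assert(offset < 16);
    b.data[offset] = value;
    b.dirtyMask |= uint16_t(1u << offset);
}

// Write back only the modified bytes; returns how many bytes were written.
int blockFlush(Block &b, uint16_t baseAddress) {
    int written = 0;
    for (uint8_t i = 0; i < 16; i++) {
        if (b.dirtyMask & (1u << i)) {
            eeprom[baseAddress + i] = b.data[i];
            written++;
        }
    }
    b.dirtyMask = 0;     // block is clean again
    return written;
}
```

Whether this pays off depends on how the EEPROM charges for writes: it only helps if writing fewer bytes is cheaper than writing the whole block in one operation.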

janost



Yes, but on cachewrite there is a 5ms delay on each write.
To allow the EEPROM to store the data.


Yep.

More reason to make the blocks very small...

With delays like that it might be worth flagging the dirty bytes individually. You could have a block size of 16 and a 16-bit int with one bit for each dirty byte in the block. :)



No need for dirtybytes.
The delay is 5ms on a single byte and the same on a 16byte block.

It is per write operation.

janost

#26
Nov 01, 2013, 09:10 am Last Edit: Nov 01, 2013, 01:46 pm by janost Reason: 1
I rewrote the cache code so that the LED on pin 13 flashes whenever it caches, like a disk activity LED.

With 16-byte pages it flashes constantly whenever a BASIC program runs.
Loading a buffer on a cache miss takes 75µs, plus another 5ms if the existing one is dirty.

With 128-byte pages it seldom flashes, but if I write a BASIC program that adds strings it starts to flash.
Loading a buffer on a cache miss takes 600µs, plus another 31ms if the existing one is dirty.

Finding the right cachepage size is not so easy.

But now it uses block read/write, so there are only 6 I2C read/write operations on a flush/load.
Very fast, hardly any impact on the running BASIC program.

And the EEPROM lasts 115 days of running 24/7 nonstop.

An I2C RAM of the same size would be better, but I did not find any.

janost

#27
Nov 01, 2013, 09:20 am Last Edit: Nov 01, 2013, 09:24 am by janost Reason: 1
I found the SID library and it fits in my current setup.
I'll add that with a memorymap like the C64 and try some real programs to see how it performs.

I have the ROM images for both VIC-20 and C64 and will perhaps try booting with those instead.

But I think the exec6502 function needs optimization.
The FOR NEXT loop indicates that the 6502 runs at about 300KHz.

fungus


With 16-byte pages it flashes constantly whenever a BASIC program runs.


That will be different for every program and every caching scheme.

Maybe my idea for a hash function is bad - it can lead to pathological cases. Maybe the original 'list of if statements' causes fewer cache misses.
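
One such pathological case, sketched with tag-only simulations (no data is moved, only misses are counted): two addresses exactly 256 bytes apart map to the same line in the hashed cache and evict each other on every access, while a cache that searches all its lines keeps both. `DirectMapped`, `Fifo` and `pingPongMisses` are illustrative names, and both caches are given 16 lines of 16 bytes so the comparison is like for like.

```cpp
#include <cassert>
#include <cstdint>

// Hashed cache: the line is fixed by address bits 4..7, the tag is the high byte.
struct DirectMapped {
    int tag[16];
    unsigned misses = 0;
    DirectMapped() { for (int &t : tag) t = -1; }   // -1 = line empty
    void access(uint16_t addr) {
        int n = (addr >> 4) & 0x0f;
        int p = addr >> 8;
        if (tag[n] != p) { tag[n] = p; misses++; }
    }
};

// "List of if statements" cache: any block may sit in any line, oldest line is replaced.
struct Fifo {
    int block[16];
    int next = 0;
    unsigned misses = 0;
    Fifo() { for (int &b : block) b = -1; }         // -1 = line empty
    void access(uint16_t addr) {
        int blk = addr >> 4;                        // 16-byte block number
        for (int b : block) if (b == blk) return;   // hit
        block[next] = blk;
        next = (next + 1) % 16;
        misses++;
    }
};

// Alternate between two addresses exactly 256 bytes apart, `rounds` times,
// and report how many misses the given cache model counted.
template <typename Cache>
unsigned pingPongMisses(Cache cache, int rounds) {
    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        cache.access(0x0100);
        cache.access(0x0200);
    }
    return cache.misses;
}
```

For 100 rounds of this ping-pong pattern the hashed cache misses 200 times and the FIFO cache twice. The price is that the FIFO version compares against up to 16 tags per access.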


Finding the right cachepage size is not so easy.


:)

janost

Well, I managed to boot both the VIC-20 and C64 ROMs on it :)
