
Topic: Mega2560 Memory Expansion Shield


If you could add more memory to a Mega2560, would you?

The 4K RAM that is available with the 2560 is ok for certain kinds of projects, but for others, it can be a really limiting factor. After doing a number of searches, it doesn't seem like there is a shield that is available to do this. For a current development project, I am designing an expansion shield to do just that, and a little more.

The Mega2560 has the capability of operating in an expansion memory mode. Certain ports are co-opted for use as a data bus, an address bus, and three control lines. For guys like me who worked on embedded systems, this turns the micro-controller into something more like a micro-processor, which allows one to begin adding things like memory chips to an Arduino, at least with the Mega2560.

Although 19 I/O lines are kind of expensive to lose, the gain is significant because this interface is quite fast. And if you need more memory, you will likely want to have quick access to it. And this interface isn't just limited to memory; other peripheral chips can be added to give the Mega2560 fast access to additional functionality, like a better A/D chip, or digital I/O chips that can move 8 bits at a time more quickly than using standard digital port functions.

The designers of the Mega2560 were kind enough to include all of the necessary I/O lines (as well as a number of extras), on that big digital port that adds lines 22 through 53. This makes it easy to develop a shield to add expansion capabilities.

For my current development project, the shield will have the following basic specifications:

  • A minimum of 32K of RAM, with up to 2MB of RAM possible

  • A total of 16 decoded chip select lines to be used with peripheral ICs

  • A pass-thru connector for the remaining available digital I/O lines not used for the expansion bus

Memory access is limited to 32K chunks. Chunks are selected by writing to a paging register. I/O chip selects cover 256 bytes of address space.

Without getting into more technical details, that covers the basics. So, if you would want to add memory and I/O to a Mega 2560, I'd like to know.


ATmega2560 has 8K SRAM.
ATmega1284P has 16K SRAM.

More memory for the 2560 already exists:

For many other things, SPI transfers to ADCs, DACs, etc at 8 MHz are quite fast, if the IO device can work that fast.  Parallel transfers would of course be quicker when port reads and writes are available.
Designing & building electrical circuits for over 25 years.  Screw Shield for Mega/Due/Uno,  Bobuino with ATMega1284P, & other '328P & '1284P creations & offerings at  my website.


Yes, 8K, not 4K. Too many data sheets have passed under my eyes lately.  :*

SPI is certainly good. But once you have more than one device, things can get tricky. In my case, SPI is already being used by an Ethernet shield. And while it is easy enough to add, say, an SPI A/D, things get difficult when you need to make A/D readings at specific time intervals, and still have Ethernet activity going on. It all comes down to the kind of performance you are needing.

If you only need to make 10 or even 100 readings a second, it is possible to have multiple SPI devices, with a proper approach to the coding.

I just figured that if you need more memory, you have to enable expanded memory mode anyway. So adding the I/O decoding feature to the shield would be simple enough. It only takes a CPLD to handle the lower address byte latch, the paging register, and the I/O lines, and doesn't use any additional I/O lines to operate as upper address lines, like the example you linked in.


Feb 06, 2014, 08:15 pm Last Edit: Feb 06, 2014, 08:19 pm by pito Reason: 1
A long time back we had a long discussion on another forum about how to wire a larger sram (min 2MB) to an mcu without losing too many i/o lines. I maintained a table of what would be feasible (http://retrobsd.org/viewtopic.php?f=5&t=3179); some concepts were proved (serial mram, bitbanged sdram, bitbanged 8bit CPLD sram).


This design would use 8 bit SRAM and a CPLD for handling the XMEM interface. I/O count is costly since 19 lines are needed for XMEM. But this design needs fast access to RAM, so XMEM is the only way to go. RAM access is limited to 32K pages, but a page switch is a single write to the paging register which is located in memory mapped I/O. So the page write is also very fast.
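To make the paging idea concrete, here is a minimal host-side model, not the actual shield logic. It assumes the 32K window sits at 0x8000 through 0xFFFF; the names `PagedRam`, `kWindowBase` and `kPageBytes` are made up for illustration. On real hardware `selectPage()` would be a single store to the CPLD's memory-mapped paging register rather than a member variable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint16_t kWindowBase = 0x8000;  // assumed start of the 32K window
const std::size_t kPageBytes = 0x8000;     // 32K per page

// Host-side stand-in for the shield: several 32K pages, one visible at a time.
class PagedRam {
public:
    explicit PagedRam(std::size_t pageCount)
        : pages_(pageCount, std::vector<std::uint8_t>(kPageBytes, 0)), page_(0) {}

    // On hardware this would be one write to the paging register.
    void selectPage(std::uint8_t page) {
        assert(page < pages_.size());
        page_ = page;
    }

    // cpuAddr is the address the micro-controller puts on the bus.
    void write(std::uint16_t cpuAddr, std::uint8_t value) {
        assert(cpuAddr >= kWindowBase);
        pages_[page_][cpuAddr - kWindowBase] = value;
    }

    std::uint8_t read(std::uint16_t cpuAddr) const {
        assert(cpuAddr >= kWindowBase);
        return pages_[page_][cpuAddr - kWindowBase];
    }

private:
    std::vector<std::vector<std::uint8_t> > pages_;
    std::uint8_t page_;
};
```

The same CPU address reaches different storage depending on the selected page, which is why any pointer into the window is only meaningful together with its page number.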

The SRAM is small and cheap, up to 4 SOIC chips of either 128Kx8 or 512Kx8. More could be added, but I would switch to a different package and go with TSOP to get higher density 2Mx8 or 4Mx8 parts. Working with wider bus width parts is possible, but it would require a fair amount of CPLD real estate to handle the buffering necessary to make it work properly. As long as 8 bit memory is still easy to get, seems best to keep it simple.

Memory mapped I/O decode lines are provided since decoding already has to be done inside the CPLD to map the RAM chips. Enough I/O pins are available on the 9536 to get 16 I/O decodes with a few pins to spare. So it doesn't hurt to put them out there to support fast I/O access.
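As a rough sketch of what 16 decodes of 256 bytes each look like from the software side: the block below assumes the I/O region starts at 0x7000, which is purely a placeholder since the thread does not say where the region lives. `kIoBase`, `kIoSelectSize` and `ioChipSelect` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical placement; only the 256-byte select size and the count of 16
// come from the description above.
const std::uint16_t kIoBase = 0x7000;        // assumed start of the I/O block
const std::uint16_t kIoSelectSize = 0x0100;  // 256 bytes per chip select
const int kIoSelectCount = 16;

// Returns the chip select index (0..15) that an address would assert,
// or -1 if the address is outside the assumed I/O block.
int ioChipSelect(std::uint16_t addr) {
    if (addr < kIoBase) {
        return -1;
    }
    int index = (addr - kIoBase) / kIoSelectSize;
    return index < kIoSelectCount ? index : -1;
}
```

With these assumptions the 16 selects occupy 4K of address space in total, and the low 8 address bits are left for the peripheral's own registers.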


The major issue we faced was the srams are expensive :) A 4MB sram costs $25, 8MB psram $4, 32MB sdram $2. Serial ones (sram, fram, mram) are most expensive (you may see the prices/2MB in the table, still valid afaik). So it depends on how big ram space, access time and free i/o lines you want to get.


SDRAM is so much cheaper, definitely.

Putting together an interface would prove to be interesting, but the SDRAM modes and timing are clearly a very advanced topic.  Initialization of the SDRAM and single byte to multi-byte conversion alone would probably take an FPGA.

Clock frequencies would have to be pretty high to achieve the number of cycles required to allow the Arduino to do a single byte write, since that would require doing a read first to buffer all of the bytes for a memory location, updating the specific byte, then performing a write to put all of the bytes back.
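The read-modify-write sequence can be sketched in software, assuming (only for illustration) a memory that moves 32 bits per access. Real SDRAM widths and burst lengths vary, and `WideMemory`, `writeByte` and `readByte` are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of a memory that can only be accessed one 32-bit word at a time.
struct WideMemory {
    std::vector<std::uint32_t> words;
    explicit WideMemory(std::size_t wordCount) : words(wordCount, 0) {}
};

// Single byte write done as read, update, write back.
void writeByte(WideMemory& mem, std::uint32_t byteAddr, std::uint8_t value) {
    std::uint32_t index = byteAddr / 4;
    unsigned shift = (byteAddr % 4) * 8;
    std::uint32_t word = mem.words[index];                   // read the whole word
    word &= ~(static_cast<std::uint32_t>(0xFF) << shift);    // clear the target byte
    word |= static_cast<std::uint32_t>(value) << shift;      // insert the new byte
    mem.words[index] = word;                                 // write the whole word back
}

// Single byte read: fetch the word and pick out one byte lane.
std::uint8_t readByte(const WideMemory& mem, std::uint32_t byteAddr) {
    unsigned shift = (byteAddr % 4) * 8;
    return static_cast<std::uint8_t>(mem.words[byteAddr / 4] >> shift);
}
```

Every byte write costs two full-width memory accesses here, which is the overhead the interface logic would have to hide.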

For the amount of head scratching and hair being torn out of my head trying to implement it, I'd rather pay for SRAM and keep what looks I have left largely intact.  :smiley-yell:


Feb 07, 2014, 12:03 am Last Edit: Feb 10, 2014, 10:28 pm by Graynomad Reason: 1
Large SRAMs are pretty cheap, for example the AS6C4008-55ZIN (512K x 8) is $3.37 at Digikey for 1 off.

I love the idea of XMEM, brings back the "old days" of memory-mapped peripherals etc. And there are some very fast ADCs you could use.

In fact I'm working on a 2560 with XMEM design as we speak.


Rob Gray aka the GRAYnomad www.robgray.com


Yes, those old days. Although I'm not very gray, I've definitely done my share of memory mapped I/O. Just stumbled upon the whole Arduino world about a month ago and it sure looks like you can do a lot for a little, especially if you have the engineering know-how.

I'm really curious to see how far I can push it as far as performance goes. Memory and I/O is the first step. Proper use of interrupts is next. Going from polling loops to event driven code can definitely help to stretch the clock cycles. Might as well have some fun with it and maybe spin a few shields out along the way.  ;)


Proper use of interrupts is next. Going from polling loops to event driven code can definitely help to stretch the clock cycles.

I've got hardware support for up to 64 "vectored" interrupts, sort of like the Z80 used to have although not as automatic as they were. The CPU gets an interrupt, reads 6 bits from the hardware and jumps directly to the appropriate handler.
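One way such a software dispatch could look, as a sketch only and not the poster's actual code: a table of 64 handlers indexed by the 6-bit value read from the hardware. How that value is read is left out, and `attachVector`, `dispatchVector` and `demoHandler` are hypothetical names.

```cpp
#include <cstdint>

typedef void (*Handler)();

const int kVectorCount = 64;                   // 6 bits from the hardware
static Handler g_handlers[kVectorCount] = {};  // unset entries stay null

// Register a handler for one of the 64 vectors.
void attachVector(std::uint8_t vector, Handler handler) {
    g_handlers[vector & 0x3F] = handler;
}

// Would be called from the single CPU interrupt with the value read from
// the hardware. Returns true if a handler was installed and run.
bool dispatchVector(std::uint8_t rawVector) {
    Handler handler = g_handlers[rawVector & 0x3F];  // keep only the 6 vector bits
    if (handler == nullptr) {
        return false;
    }
    handler();
    return true;
}

// Example handler used only to show the mechanism.
int g_demoCount = 0;
void demoHandler() { ++g_demoCount; }
```

The table lookup replaces a chain of "who interrupted me?" polls with one indexed call, which is where the time saving comes from.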

Are you proficient with CPLD design? If so which family?



Did a ton of PLD stuff back in the old days, especially state machine design. Just getting into Xilinx 9500 series. Logic is easy, but the ISE software has its quirks. Looks like I'll have to learn some VHDL to implement state machines.

First PLD in the works already is the logic for the memory expansion board. Handles the low address latch, paging register and I/O decoding. Will have space on the board for additional decoder chips as needed. Just add one and you get 8 additional I/O decodes.

This card will also serve as an interface for shields that need the expanded I/O bus. Plug this shield into a 2560, then plug high-speed shields into it. I'll probably stick a spot on the board to install a power jack for any connected shields.

Next PLD is going to be a helper SPI interface designed specifically for the Ethernet shield. It will use the expansion shield. By tightening up the timing between command bytes, increasing the SPI clock speed and maybe putting in a buffer pointer auto-increment register, the throughput can get boosted a bit. Target is 200KBytes/S for sustained output or input.

The last PLD on the immediate horizon may actually be an FPGA, not certain about the capacity of a CPLD. But I need to be able to create a FIFO using an SRAM. That was done at a former company and is far cheaper than buying actual FIFO chips, and you get a ton more storage.

Those three would be a big help in getting the 'tool box' set up for a number of bigger projects in the works. I definitely love this platform. It is an excellent starting point for putting together so many different things.


Memory access is limited to 32K chunks.

Why?  Doesn't XMEM support 64k-<internalRamSize>?

Chunks are selected by writing to a paging register. I/O chip selects cover 256 bytes of address space.

If you're doing CPLD decoders anyway, how about 16k of always-mapped RAM and 32k of page-switchable RAM.  Having all your memory in a page that switches out is painful.  (better yet, 3x 16k independently mappable pages?)


In expanded mode there is no differentiation between internal and external, so the first 8704 bytes of address space (0x0000 through 0x21FF) are not available by default, since those are being used by the micro-controller. There is a caveat to this. You can mask some of that internal memory, but that means losing it. Without knowing where certain parts of the Arduino environment are located, like the C++ call stack, you risk memory corruption.

Internal memory is going to be faster than external memory. Because one byte of the address bus is multiplexed with the data bus lines, it takes 4 processor clock cycles to perform an external read or write. So you want to keep speed critical stuff located internally. On the 2560 the maximum raw transfer rate to and from external memory would be 2Mbyte/S. 

In the particular development task I am exploring, I would need memory mapped I/O for high-speed access to other peripherals. For decoding and address translation purposes, starting the RAM at the 32K mark simplifies things greatly. Only three RAM address lines need to be translated, and that translation is simply the content of the paging register.  Any RAM greater than 32K requires a minimum of six RAM address lines to be translated, and the last page of RAM is going to be less than the RAM page size.
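With a 32K window starting at 0x8000, the translation really is just concatenation, which a short sketch can show. The function names are hypothetical and the page width is left open; the CPLD would do this with wires rather than code.

```cpp
#include <cstdint>

// The RAM window is selected whenever CPU address bit A15 is high
// (0x8000 through 0xFFFF).
bool inRamWindow(std::uint16_t cpuAddr) {
    return (cpuAddr & 0x8000u) != 0;
}

// Physical RAM address: paging register on the upper lines,
// CPU A14..A0 passed straight through on the lower 15 lines.
std::uint32_t ramAddress(std::uint8_t page, std::uint16_t cpuAddr) {
    return (static_cast<std::uint32_t>(page) << 15) | (cpuAddr & 0x7FFFu);
}
```

No arithmetic is involved: the page bits and the CPU bits never overlap, so the decoder needs no adder.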

If you wanted more space for RAM, say, 48K, it becomes very difficult to map that physically into addresses for the RAM chip(s). Let's assume a 128K RAM chip which would need a total of 17 address lines, and 2 and 2/3 pages total of expansion RAM. The first 48K of RAM isn't so hard, just set the paging register to 0.  The addresses from the micro-controller would be: 0x4000 through 0xFFFF. The physical RAM address would be 0x00000 through 0x0BFFF. If we look at those in binary, the micro-controller address range is:

0100 0000 0000 0000 to 1111 1111 1111 1111

The physical RAM address range is:

0 0000 0000 0000 0000 to 0 1011 1111 1111 1111

So the upper 2 address bits from the micro-controller would have to be re-mapped. The lower 14 address bits will work as is. And the top RAM address bit remains at 0. But moving on to the next 48K gets more difficult. We have the same micro-controller address range:

0100 0000 0000 0000 to 1111 1111 1111 1111

The physical RAM address range now is:

0 1100 0000 0000 0000 to 1 0111 1111 1111 1111

The same upper 2 address bits from the micro-controller need to be re-mapped, as well as the top physical RAM address bit, and the translation is no longer just the content of the paging register: the page base, a multiple of 48K, has to be added to the incoming address. The same holds true for the last 2/3 of a page, so every RAM address line above A13 has to be translated by the CPLD. The last RAM page would only be 32K in length.

Adding more RAM increases the number of address lines to translate and increases the I/O count on the CPLD. So, odd page sizes can be done to increase the total amount of visible RAM, but it is just easier to do 32K page sizes. I wanted to try to keep within a 44 pin device, and the additional inputs required to decode for the page register inside the CPLD, and perform the address/data de-multiplexing, as well as the I/O line decoding for I/O expansion, just fit.
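For comparison, here is the same translation worked as arithmetic for a 48K window (0x4000 through 0xFFFF) on a 128K chip. This is my own sketch of the calculation, with hypothetical names; the point is that the page base is a multiple of 48K, so the decoder needs an add instead of a plain bit copy.

```cpp
#include <cstdint>

const std::uint32_t kWindowBase48 = 0x4000;  // CPU window 0x4000..0xFFFF
const std::uint32_t kPageSize48 = 0xC000;    // 48K per page
const std::uint32_t kChipSize = 0x20000;     // 128K RAM chip

// Returns the physical RAM address for a page and CPU address, or -1 if the
// CPU address is below the window or the result falls off the end of the chip.
std::int32_t ramAddress48(std::uint8_t page, std::uint16_t cpuAddr) {
    if (cpuAddr < kWindowBase48) {
        return -1;
    }
    // The multiply-and-add is what the CPLD would have to implement.
    std::uint32_t addr = page * kPageSize48 + (cpuAddr - kWindowBase48);
    return addr < kChipSize ? static_cast<std::int32_t>(addr) : -1;
}
```

Page 2 only reaches up to CPU address 0xBFFF before the chip runs out, which is the short last page mentioned above.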

That means 1 RAM chip and CPLD with 44 pins, plus three buffers for voltage translation will give you 128K without using any additional Arduino I/O lines besides the ones needed for XMEM. You can get away without the buffers, but 5V SRAM chips are becoming less available. 3.3V chips are easier to find and cheaper, but not all of them are tolerant of 5V input levels. A bigger CPLD will fit, but it is twice the cost. For a production shield, cost is a big factor.


Feb 11, 2014, 05:43 pm Last Edit: Feb 11, 2014, 05:58 pm by pito Reason: 1
Maybe off the topic - is there any plan the compiler might support the (data/program) bank switching? Some time back I toyed with 8051 clones and both iar and keil support bank switching for both data and program (up to several MB of data and program space available, you just configure the xmem for data and program, you specify the pins used for additional addresses and then you may write megabytes large code and you may use megabytes of data with a simple 8051/2). That would be nice to have with atmegas as well..
PS: iar 256banks of data and program, keil 64banks of data and program (or vice versa).. A variable cannot be bigger than single bank size..


Being a newbie, I can't say anything regarding the compiler. I've come across an article by a guy that added expansion memory support and used I/O lines to hold the page values. He used a library he designed to perform malloc() functions to stash variables in specific memory banks.

My specific idea just has to support the creation of data buffers and create an xmem object that keeps track of one or two pointers (two for circular buffers), as well as the page number where the buffer resides. There are probably a few other bits of data that need to be included, but these objects would always reside in internal RAM. Inheriting the streams library would also make sense because you could set up a stream to move data, up to 32K of course.
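A rough sketch of what such a descriptor might hold, assuming one circular buffer per page. `XmemRing` and its helper functions are hypothetical; the struct only tracks offsets, and the actual byte transfer to the selected page is left out.

```cpp
#include <cstdint>

// Descriptor kept in internal RAM; the buffer itself lives in one 32K page.
struct XmemRing {
    std::uint8_t page;       // page number where the buffer resides
    std::uint16_t capacity;  // buffer length in bytes (at most 32K)
    std::uint16_t head;      // offset of the next write
    std::uint16_t tail;      // offset of the next read
    std::uint16_t count;     // bytes currently stored
};

XmemRing makeRing(std::uint8_t page, std::uint16_t capacity) {
    XmemRing ring = {page, capacity, 0, 0, 0};
    return ring;
}

// Reserves the offset for one byte to be written; false if the buffer is full.
bool ringPush(XmemRing& ring, std::uint16_t& offsetOut) {
    if (ring.count == ring.capacity) {
        return false;
    }
    offsetOut = ring.head;
    ring.head = static_cast<std::uint16_t>((ring.head + 1) % ring.capacity);
    ++ring.count;
    return true;
}

// Yields the offset of the oldest byte; false if the buffer is empty.
bool ringPop(XmemRing& ring, std::uint16_t& offsetOut) {
    if (ring.count == 0) {
        return false;
    }
    offsetOut = ring.tail;
    ring.tail = static_cast<std::uint16_t>((ring.tail + 1) % ring.capacity);
    --ring.count;
    return true;
}
```

Keeping the page number next to the two offsets means the caller can always select the right page before touching the returned offset.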

The memory decoder CPLD fits and is currently being simulated to check the design out. I also added lines to the design that would support an optional DMA controller. That may fit into a larger CPLD; at least, I want to keep it in a CPLD if at all possible.

The DMA controller would support transfers between expansion memory and the I/O bus that is part of this shield. Any I/O and memory in the micro-controller is not able to be touched, of course. There may be enough room for memory to memory transfers as well. That is going to take at least 146 data flip flops, so I'm not exactly confident it can be done in a CPLD. Anyway, for some of my projects, DMA would be great. Multi-channel DMA would be even better.  :D

Some of the I/O lines are dedicated for the DMA controller, if present. There is a DMA interrupt, and a couple of lines pull DMA status out in serial format so a transfer can be checked for how many bytes have been transferred.

DMA transfers would not be able to take place at the same time as expansion memory operations by the micro-controller in this implementation. The only dead space that I can see in an expansion memory cycle is about 3/4 of a clock cycle right after ALE goes high, at T 0.5. All micro-controller driven lines are buffered in this design, so they would be taken off the bus during a DMA cycle. But that sort of interleaving is pretty complex, and without access to the internal processor clock, there is going to be a certain degree of slop between the DMA controller clock and the processor memory cycle.

So, offline transfers, meaning the micro-controller has xmem disabled, would be the easiest to implement. Oh, since the DMA controller has as many address lines as I want, it doesn't have to worry about memory pages. All memory is available to the DMA controller. And offline transfers can occur at a faster rate, as fast as attached peripherals can handle it. The DMA controller would also have programmed wait states.

DMA isn't too far off topic regarding the original post, but this question kind of is.

Are you aware of any other high speed I/O implementation out there for the ATmega2560? The thing is, you have to have the components to support xmem to add a bunch of memory, and I/O decoding just goes along for the ride. This implementation would leave a total of 14 address lines and 8 data lines for adding peripheral I/O. There is also an I/O enable line, a read and a write line, and a few interrupt request lines, plus limited power and ground.

Anyway, if there is such a thing already out there, it would make sense to be compatible with it, if it has a decent following. Either way, I am doing this for my own projects anyway. So if a new high-speed shield interface comes out of it, that would be fine with me.  :)
