If you could add more memory to a Mega2560, would you?
The 4K RAM that is available with the 2560 is ok for certain kinds of projects, but for others, it can be a really limiting factor. After doing a number of searches, it doesn't seem like there is a shield that is available to do this. For a current development project, I am designing an expansion shield to do just that, and a little more.
The Mega2560 has the capability of operating in an expansion memory mode. Certain ports are co-opted for use as a data bus, an address bus, and three control lines. For guys like me who worked on embedded systems, this turns the micro-controller into something more like a micro-processor, which allows one to begin adding things like memory chips to an Arduino, at least with the Mega2560.
Although 19 I/O lines are kind of expensive to lose, the gain is significant because this interface is quite fast. And if you need more memory, you will likely want quick access to it. And this interface isn't limited to just memory; other peripheral chips can be added to give the Mega2560 fast access to additional functionality, like a better A/D chip, or digital I/O chips that can move 8 bits at a time more quickly than using standard digital port functions.
The designers of the Mega2560 were kind enough to include all of the necessary I/O lines (as well as a number of extras), on that big digital port that adds lines 22 through 53. This makes it easy to develop a shield to add expansion capabilities.
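To give a feel for what "expansion memory mode" means on the software side, enabling XMEM on the ATmega2560 is just a couple of register writes. The register and bit names below (XMCRA, SRE, SRW10) are the real ones from the datasheet, but they are stubbed as plain variables here so the fragment compiles stand-alone; on the actual chip they come from `<avr/io.h>`:

```cpp
#include <cstdint>

// On a real ATmega2560 these registers come from <avr/io.h>; they are
// stubbed as variables here so the sketch compiles stand-alone.
static uint8_t XMCRA = 0, XMCRB = 0;
static const uint8_t SRE   = 7;  // external memory enable bit in XMCRA
static const uint8_t SRW10 = 2;  // one of the wait-state select bits

void enableXmem() {
    XMCRA = (1u << SRE) | (1u << SRW10);  // enable XMEM, add a wait state
    XMCRB = 0;                            // keep all of PC7..PC0 as A15..A8
}
```

After this, external addresses are reachable with ordinary pointers; no special instructions are needed.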
For my current development project, the shield will have the following basic specifications:
- A minimum of 32K of RAM, with up to 2MB of RAM possible
- A total of 16 decoded chip select lines to be used with peripheral ICs
- A pass-thru connector for the remaining available digital I/O lines not used for the expansion bus
Memory access is limited to 32K chunks. Chunks are selected by writing to a paging register. I/O chip selects cover 256 bytes of address space.
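To make the paging scheme concrete, here is a sketch of how a program might use it. The paging-register address and the window location are assumptions for illustration (on the shield the page register would be a memory-mapped location in the I/O decode space, and the window the paged 32K region at 0x8000..0xFFFF); both are stubbed with ordinary storage here so the fragment runs stand-alone, which also means all pages alias the same array in this stub:

```cpp
#include <cstdint>

// Stand-ins for the hardware. On the real shield PAGE_REG would be a
// memory-mapped register and WINDOW the paged 32K region; in this stub
// every page aliases the same array.
static uint8_t pageRegStub;
static uint8_t windowStub[0x8000];
volatile uint8_t *const PAGE_REG = &pageRegStub;
volatile uint8_t *const WINDOW   = windowStub;

void pagedWrite(uint8_t page, uint16_t offset, uint8_t value) {
    *PAGE_REG = page;                 // a single write selects the 32K chunk
    WINDOW[offset & 0x7FFF] = value;  // then access it like ordinary memory
}

uint8_t pagedRead(uint8_t page, uint16_t offset) {
    *PAGE_REG = page;
    return WINDOW[offset & 0x7FFF];
}
```

The point of the design is that the page switch is that single register write, nothing more.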
Without getting into more technical details, that covers the basics. So, if you would want to add memory and I/O to a Mega 2560, I'd like to know.
ATmega2560 has 8K SRAM.
ATmega1284P has 16K SRAM.
More memory for the 2560 already exists:
For many other things, SPI transfers to ADCs, DACs, etc at 8 MHz are quite fast, if the IO device can work that fast. Parallel transfers would of course be quicker when port reads and writes are available.
Yes, 8K, not 4K. Too many data sheets have passed under my eyes lately. :*
SPI is certainly good. But once you have more than one device, things can get tricky. In my case, SPI is already being used by an Ethernet shield. And while it is easy enough to add, say, an SPI A/D, things get difficult when you need to make A/D readings at specific time intervals, and still have Ethernet activity going on. It all comes down to the kind of performance you are needing.
If you only need to make 10 or even 100 readings a second, it is possible to have multiple SPI devices, with a proper approach to the coding.
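A minimal sketch of that "proper approach to the coding": give the ADC priority at its sampling deadline and let the Ethernet work fill the gaps on the shared bus. Time and both SPI devices are simulated here (the tick costs are made up), so only the scheduling logic itself is real:

```cpp
#include <cstdint>
#include <vector>

// Cooperative sharing of one SPI bus: guarantee an ADC sample every
// `period` ticks, run Ethernet service in the gaps. Simulated timing.
struct Sim {
    uint32_t now = 0;
    std::vector<uint32_t> sampleTimes;
    void readAdc()         { sampleTimes.push_back(now); now += 2; } // short transaction
    void serviceEthernet() { now += 7; }                             // longer transaction
};

void run(Sim &s, uint32_t period, uint32_t end) {
    uint32_t nextSample = 0;
    while (s.now < end) {
        if (s.now >= nextSample) {   // the ADC wins at its deadline
            s.readAdc();
            nextSample += period;
        } else {
            s.serviceEthernet();     // otherwise do Ethernet work
        }
    }
}
```

On real hardware the same idea is usually done with a timer interrupt and SPI transactions, but the priority decision is the same.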
I just figured that if you need more memory, you have to enable expanded memory mode anyway. So adding the I/O decoding feature to the shield would be simple enough. It only takes a CPLD to handle the lower address byte latch, the paging register, and the I/O lines, and doesn't use any additional I/O lines to operate as upper address lines, like the example you linked in.
Long time back we had a long discussion on another forum about how to wire a larger SRAM (min 2MB) to an MCU without losing too many I/O lines. I maintained a table of what would be feasible (http://retrobsd.org/viewtopic.php?f=5&t=3179), and some concepts were proven (serial MRAM, bitbanged SDRAM, bitbanged 8-bit CPLD SRAM).
This design would use 8 bit SRAM and a CPLD for handling the XMEM interface. I/O count is costly since 19 lines are needed for XMEM. But this design needs fast access to RAM, so XMEM is the only way to go. RAM access is limited to 32K pages, but a page switch is a single write to the paging register which is located in memory mapped I/O. So the page write is also very fast.
The SRAM is small and cheap, up to 4 SOIC chips of either 128Kx8 or 512Kx8. More could be added, but I would switch to a different package and go with TSOP to get higher density 2Mx8 or 4Mx8 parts. Working with wider bus width parts is possible, but it would require a fair amount of CPLD real estate to handle the buffering necessary to make it work properly. As long as 8 bit memory is still easy to get, seems best to keep it simple.
Memory mapped I/O decode lines are provided since decoding already has to be done inside the CPLD to map the RAM chips. Enough I/O pins are available on the 9536 to get 16 I/O decodes with a few pins to spare. So it doesn't hurt to put them out there to support fast I/O access.
The major issue we faced was that SRAMs are expensive :) A 4MB SRAM costs $25, an 8MB PSRAM $4, a 32MB SDRAM $2. The serial ones (SRAM, FRAM, MRAM) are the most expensive (you may see the prices per 2MB in the table, still valid afaik). So it depends on how much RAM space, what access time, and how many free I/O lines you want to get.
SDRAM is so much cheaper, definitely.
Putting together an interface would prove to be interesting, but the SDRAM modes and timing are clearly a very advanced topic. Initialization of the SDRAM and single byte to multi-byte conversion alone would probably take an FPGA.
Clock frequencies would have to be pretty high to achieve the number of cycles required to allow the Arduino to do a single byte write since that would require doing a read first to buffer all of the bytes for a memory location, update the specific byte, then performing a write to put all of the bytes back.
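The read-modify-write the controller would have to do looks something like this. A 32-bit-wide SDRAM is assumed purely for illustration (real parts vary), and the memory is simulated with an array so the fragment runs stand-alone:

```cpp
#include <cstdint>

// Why a single byte write into wide memory is expensive: read the whole
// word, patch one byte, write the whole word back.
static uint32_t mem[256];  // simulated 32-bit-wide memory

void writeByte(uint32_t addr, uint8_t value) {
    uint32_t word  = mem[addr >> 2];          // 1) read the full word
    uint32_t shift = (addr & 3u) * 8;
    word &= ~(0xFFu << shift);                // 2) clear the target byte
    word |=  ((uint32_t)value << shift);      // 3) insert the new byte
    mem[addr >> 2] = word;                    // 4) write the word back
}
```

Every one of those steps costs memory-cycle time, which is where the high clock requirement comes from.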
For the amount of head scratching and hair being torn out of my head trying to implement it, I'd rather pay for SRAM and keep what locks I have left largely intact. :smiley-yell:
Large SRAMs are pretty cheap; for example the AS6C4008-55ZIN (512K x 8) is $3.37 at Digikey in single quantities.
I love the idea of XMEM, brings back the "old days" of memory-mapped peripherals etc. And there are some very fast ADCs you could use.
In fact I'm working on a 2560 with XMEM design as we speak.
Yes, those old days. Although I'm not very gray, I've definitely done my share of memory mapped I/O. Just stumbled upon the whole Arduino world about a month ago and it sure looks like you can do a lot for a little, especially if you have the engineering know-how.
I'm really curious to see how far I can push it as far as performance goes. Memory and I/O is the first step. Proper use of interrupts is next. Going from polling loops to event driven code can definitely help to stretch the clock cycles. Might as well have some fun with it and maybe spin a few shields out along the way. ;)
Proper use of interrupts is next. Going from polling loops to event driven code can definitely help to stretch the clock cycles.
I've got hardware support for up to 64 "vectored" interrupts, sort of like the Z80 used to have although not as automatic as they were. The CPU gets an interrupt, reads 6 bits from the hardware and jumps directly to the appropriate handler.
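In C that dispatch might look something like the sketch below. `readVector()` stands in for the 6-bit hardware read (a port input on the real board) and is stubbed so the fragment compiles stand-alone:

```cpp
#include <cstdint>

// One CPU interrupt entry reads a 6-bit vector number from the hardware
// and jumps through a 64-entry handler table.
typedef void (*Handler)();
static Handler handlers[64];
static int lastFired = -1;
static uint8_t pendingVector = 0;      // set by the simulated hardware

uint8_t readVector() { return pendingVector & 0x3F; }
void defaultHandler() { lastFired = -2; }

void initHandlers() {
    for (int i = 0; i < 64; ++i) handlers[i] = defaultHandler;
}

void irqEntry() {                      // would be the single real ISR
    handlers[readVector()]();          // direct jump to the right routine
}
```

The win over polling a status register is that the CPU never has to search for the source; the hardware hands it the index.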
Are you proficient with CPLD design? If so which family?
Did a ton of PLD stuff back in the old days, especially state machine design. Just getting into Xilinx 9500 series. Logic is easy, but the ISE software has its quirks. Looks like I'll have to learn some VHDL to implement state machines.
First PLD in the works already is the logic for the memory expansion board. Handles the low address latch, paging register and I/O decoding. Will have space on the board for additional decoder chips as needed. Just add one and you get 8 additional I/O decodes.
This card will also serve as an interface for shields that need the expanded I/O bus. Plug this shield into a 2560, then plug high-speed shields into it. I'll probably stick a spot on the board to install a power jack for any connected shields.
Next PLD is going to be a helper SPI interface designed specifically for the Ethernet shield. It will use the expansion shield. By tightening up the timing between command bytes, increasing the SPI clock speed, and maybe putting in a buffer pointer auto-increment register, the throughput can get boosted a bit. Target is 200KBytes/s sustained, output or input.
The last PLD on the immediate horizon may actually be an FPGA, not certain about the capacity of a CPLD. But I need to be able to create a FIFO using an SRAM. That was done at a former company and is far cheaper than buying actual FIFO chips, and you get a ton more storage.
Those three would be a big help in getting the 'tool box' set up for a number of bigger projects in the works. I definitely love this platform. It is an excellent starting point for putting together so many different things.
Memory access is limited to 32K chunks.
Why? Doesn't XMEM support 64k-<internalRamSize>?
Chunks are selected by writing to a paging register. I/O chip selects cover 256 bytes of address space.
If you're doing CPLD decoders anyway, how about 16k of always-mapped RAM and 32k of page-switchable RAM. Having all your memory in a page that switches out is painful. (better yet, 3x 16k independently mappable pages?)
In expanded mode there is no differentiation between internal and external, so the first 8704 bytes of address space (0x0000 through 0x21FF) are not available by default, since those are being used by the micro-controller. There is a caveat to this. You can mask some of that internal memory, but that means losing it. Without knowing where certain parts of the Arduino environment are located, like the C++ call stack, you risk memory corruption.
Internal memory is going to be faster than external memory. Because one byte of the address bus is multiplexed with the data bus lines, it takes 4 processor clock cycles to perform an external read or write. So you want to keep speed critical stuff located internally. On the 2560 the maximum raw transfer rate to and from external memory would be 2Mbyte/S.
In the particular development task I am exploring, I would need memory mapped I/O for high-speed access to other peripherals. For decoding and address translation purposes, starting the RAM at the 32K mark simplifies things greatly. Only a few RAM address lines need to be translated, and that translation is simply the content of the paging register. Any page size greater than 32K requires more RAM address lines to be translated, and the last page of RAM is going to be less than the RAM page size.
If you wanted more space per page, say 48K, it becomes very difficult to map that physically into addresses for the RAM chip(s). Let's assume a 128K RAM chip, which needs a total of 17 address lines, and gives 2 and 2/3 pages of 48K expansion RAM. The first 48K of RAM isn't so hard, just set the paging register to 0. The addresses from the micro-controller would be 0x4000 through 0xFFFF. The physical RAM addresses would be 0x00000 through 0x0BFFF. If we look at those in binary, the micro-controller address range is:
0100 0000 0000 0000 to 1111 1111 1111 1111
The physical RAM address range is:
0 0000 0000 0000 0000 to 0 1011 1111 1111 1111
So the upper 4 address bits from the micro-controller would have to be re-mapped. The lower 12 address bits will work as is, and the top RAM address bit remains at 0. But moving on to the next 48K gets more difficult. We have the same micro-controller address range:
0100 0000 0000 0000 to 1111 1111 1111 1111
The physical RAM address range now is:
0 1100 0000 0000 0000 to 1 0111 1111 1111 1111
The same upper 4 address bits from the micro-controller need to be re-mapped, and now the top physical RAM address bit changes as well. The same holds true for the last 1 and 2/3 pages: a total of 5 RAM address lines have to be translated by the CPLD. The last RAM page would only be 32K in length.
Adding more RAM increases the number of address lines to translate and increases the I/O count on the CPLD. So, odd page sizes can be done to increase the total amount of visible RAM, but it is just easier to do 32K page sizes. I wanted to try to keep within a 44 pin device, and the additional inputs required to decode for the page register inside the CPLD, and perform the address/data de-multiplexing, as well as the I/O line decoding for I/O expansion, just fit.
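The difference shows up clearly if you write the two translations out as functions (window locations taken from the discussion above: the paged window starts at 0x8000 for 32K pages and at 0x4000 for 48K pages):

```cpp
#include <cstdint>

// 32K pages: the CPLD simply concatenates the page register with the low
// 15 CPU address bits -- pure wiring, no arithmetic.
uint32_t phys32k(uint8_t page, uint16_t cpuAddr) {
    return ((uint32_t)page << 15) | (cpuAddr & 0x7FFFu);
}

// 48K pages: the page offset (page * 48K) is not a power of two, so the
// CPLD would need a real adder to form the physical address.
uint32_t phys48k(uint8_t page, uint16_t cpuAddr) {
    return (uint32_t)page * 0xC000u + (cpuAddr - 0x4000u);
}
```

The 32K version is why the translation "is simply the content of the paging register"; the 48K version is what drives up the CPLD gate count.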
That means 1 RAM chip and CPLD with 44 pins, plus three buffers for voltage translation will give you 128K without using any additional Arduino I/O lines besides the ones needed for XMEM. You can get away without the buffers, but 5V SRAM chips are becoming less available. 3.3V chips are easier to find and cheaper, but not all of them are tolerant of 5V input levels. A bigger CPLD will fit, but it is twice the cost. For a production shield, cost is a big factor.
Maybe off the topic - is there any plan for the compiler to support (data/program) bank switching? Some time back I toyed with 8051 clones, and both IAR and Keil support bank switching for both data and program (up to several MB of data and program space is available; you just configure the xmem for data and program, specify the pins used for the additional addresses, and then you may write megabytes of code and use megabytes of data with a simple 8051/2). That would be nice to have with ATmegas as well.
PS: IAR gives 256 banks of data and program, Keil 64 banks of data and program (or vice versa). A variable cannot be bigger than a single bank size.
Being a newbie, I can't say anything regarding the compiler. I've come across an article by a guy who added expansion memory support and used I/O lines to hold the page values. He used a library he designed to perform malloc() functions to stash variables in specific memory banks.
My specific idea just has to support the creation of data buffers and create an xmem object that keeps track of one or two pointers (two for circular buffers), as well as the page number where the buffer resides. There are probably a few other bits of data that need to be included, but these objects would always reside in internal RAM. Inheriting the streams library would also make sense because you could set up a stream to move data, up to 32K of course.
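To make that concrete, here is a rough sketch of such an xmem buffer object. The names and layout are my own invention: the descriptor itself lives in internal RAM, while the data sits in whatever page it records (passed in here as `window`, the paged-in 32K region):

```cpp
#include <cstdint>

// Bookkeeping object for a buffer in external paged RAM. The descriptor
// stays in internal RAM; only `page` says where the data actually lives.
struct XmemBuffer {
    uint8_t  page;       // which 32K page holds the buffer
    uint16_t head = 0;   // write offset (for circular use)
    uint16_t tail = 0;   // read offset
    uint16_t size;       // buffer length, <= 32K

    XmemBuffer(uint8_t p, uint16_t n) : page(p), size(n) {}

    bool put(uint8_t b, uint8_t *window) {        // window = paged-in RAM
        uint16_t next = (uint16_t)((head + 1) % size);
        if (next == tail) return false;           // full
        window[head] = b;
        head = next;
        return true;
    }
    bool get(uint8_t &b, uint8_t *window) {
        if (head == tail) return false;           // empty
        b = window[tail];
        tail = (uint16_t)((tail + 1) % size);
        return true;
    }
};
```

A stream wrapper would then just select the page and call put()/get() in a loop.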
The memory decoder CPLD fits and is currently being simulated to check the design out. I also added lines to the design that would support an optional DMA controller. That may fit into a larger CPLD, at least I want to keep it in that if at all possible.
The DMA controller would support transfers between expansion memory and the I/O bus that is part of this shield. Any I/O and memory inside the micro-controller cannot be touched, of course. There may be enough room for memory to memory transfers as well. That is going to take at least 146 data flip flops, so I'm not exactly confident it can be done in a CPLD. Anyway, for some of my projects, DMA would be great. Multi-channel DMA would be even better. :D
Some of the I/O lines are dedicated for the DMA controller, if present. There is a DMA interrupt, and a couple of lines pull DMA status out in serial format so a transfer can be checked for how many bytes have been transferred.
DMA transfers would not be able to take place at the same time as expansion memory operations by the micro-controller in this implementation. The only dead space that I can see in an expansion memory cycle is about 3/4 of a clock cycle right after ALE goes high, at T 0.5. All micro-controller driven lines are buffered in this design, so they would be taken off the bus during a DMA cycle. But that sort of interleaving is pretty complex, and without access to the internal processor clock, there is going to be a certain degree of slop between the DMA controller clock and the processor memory cycle.
So, offline transfers, meaning the micro-controller has xmem disabled, would be the easiest to implement. Oh, since the DMA controller has as many address lines as I want, it doesn't have to worry about memory pages. All memory is available to the DMA controller. And offline transfers can occur at a faster rate, as fast as attached peripherals can handle it. The DMA controller would also have programmed wait states.
DMA isn't too far off topic regarding the original post, but this question kind of is.
Are you aware of any other high speed I/O implementation out there for the Atmega2560? The thing is, you have to have the components to support xmem to add a bunch of memory, and I/O decoding just goes along for the ride. This implementation would leave a total of 14 address lines and 8 data lines for adding peripheral I/O. There is also an I/O enable line, a read and a write line, and a few interrupt request lines, plus limited power and ground.
Anyway, if there is such a thing already out there, it would make sense to be compatible with it, if it has a decent following. Either way, I am doing this for my own projects anyway. So if a new high-speed shield interface comes out of it, that would be fine with me. :)
is there any plan the compiler might support the bank switching?
You know, I'm a great believer in 8bit processors. I don't think they're going away any time soon.
But I have decided, as a personal metric, that when an 8bit system starts to have to implement complicated memory access: bank switching, trampolines, cpld decoders, compiler patches, special linker maps... in order to support my application, then it's time to move that application to a 32bit microcontroller that can handle all my data without ... kludges. It's not like you can't get an ARM, MIPS, Coldfire, or PPC and build a system that has about the same end price as AVR+Kludges, and is a whole lot less work and more pleasant to work with. I mean, there are $7 PIC32s with 512k of flash, 128k of RAM, and built-in ethernet MAC...
In any design approach there does come a tipping point where too much external hardware glue is just too much. By the same token, and I have seen this happen with other products, when you throw a much more powerful processor into the mix, software/firmware engineers tend to get sloppy. After all, you can do anything in code.
The print() and printf() support is a great example of that. Those are great functions to have. But using them consumes a ton of processor cycles, as was measured when I looked at the throughput of the Ethernet shield.
In all of the product designs I have been associated with over the years, there is always a certain amount of hardware glue that is needed. Most designs have needed a CPLD or two and an FPGA. You definitely don't want too much, but there always seems to be the need for some.
Anyway, the Arduino platform is great because it is modular. You can put together many different things and have something up and running pretty quickly. This particular design approach has merits and drawbacks, of course. But it is a good lesson to follow for certain sorts of low cost products with modest performance levels.
For my current development efforts, a modular approach, whether it is literal physical modules of some kind, or it is circuitry modules that are contained within a single PCB, will present more flexibility for implementing a wider array of potential product capabilities. You put together the pieces that you need for a product from a 'toolbox' of proven modules and libraries. Some modules contain optional elements. 128K of RAM may be just fine for one product, but on another, the whole 2MB would be needed. Populate the hardware as needed and offer a variety of expansion shields.
The main reason for this posting is that some of these modules would easily fit and function as Arduino shields. And I am simply curious to know if there is any interest. Regardless of implementation, if it is cost-effective and works well, that seems to be one of the key points in the Arduino philosophy. Of course, it is easier the generate interest once you have working shields to offer. But it never hurts to do a little bit of marketing research ahead of time.
Apart from SRAM what peripherals have you identified as being useful/available in a memory-mapped version.
I was looking at the LTC1851/2/3/4 ADCs but I think they are 16-bit interfaces and not a great match. And there are some UARTS like the SC68C752B and NS16C2552, they are 8 bits so would be suitable.
I like the idea of implementing the XMEM interface, always have, but I admit I'm having trouble justifying it. SRAM may be enough but a few sweet peripheral chips would seal the deal.
ADCs, DACs and Digital I/O are the main candidates. These are the bread and butter functions of typical DAQ applications. Pairing these up with big buffers and DMA is a fundamental goal of my design. So XMEM is a necessity. As far as data bus width mismatches go, I have to fall back on a CPLD or some discrete logic to deal with that issue.
One important design note: everything is 3.3V. It is just easier to deal with 3.3V parts nowadays because they are cheaper and easier to find. This means that there are buffers between this circuit and the Arduino. Based on the Atmel specifications, a logic high is 0.7*VCC, which is 3.5 volts. Neither the CPLDs nor the memory can deliver that logic level. On the peripheral side, the logic level may be an issue. If so, buffers would be needed to go from 3.3V to 5V.
For a totally Arduino-compatible shield, an alternative approach might be to implement XMEM, buffer and the desired function completely onboard. This would eliminate the need for having to use an expansion shield to get the I/O bus and a separate shield with the desired function to plug into that bus. This would be more in the Arduino spirit, so to speak, where a single shield implements the capability you are looking for.
Some of the PWM channels could be used to provide pacer clock signals. A few digital lines could be used for decoding, say three lines to act as module selects so that you could stack more than one module. And then there is the simple serial interface to check the status of a buffer transfer. A trigger line would also be needed to start things off, at least for triggered acquisitions. And an interrupt line, definitely an interrupt line!
Anyway, the script is the same. Enable XMEM, configure your chip, configure the buffer transfer byte count, configure the buffer mode (one-shot or circular), buffer direction (if needed), set up a PWM channel for the pacer signal (if needed), disable XMEM, and then either trigger immediately or wait for something that tells you to trigger. Then you wait for the interrupt to signal end of acquisition or output or whatever is meaningful for the type of function that the board providing. This offloads all of the data movement stuff into hardware.
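As a sanity check on the ordering, here is that script sketched as code. Every register name is made up for illustration and stubbed as a plain variable; on the real shield these would be memory-mapped locations reached through XMEM:

```cpp
#include <cstdint>

// Hypothetical shield registers, stubbed so the sketch runs stand-alone.
static uint8_t  xmemEnabled = 0;
static uint16_t dmaCount    = 0;
static uint8_t  dmaMode     = 0;
static uint8_t  trigger     = 0;

enum { ONE_SHOT = 0, CIRCULAR = 1 };

void setupAcquisition(uint16_t byteCount, uint8_t mode) {
    xmemEnabled = 1;          // enable XMEM to reach the shield registers
    dmaCount    = byteCount;  // program the transfer byte count
    dmaMode     = mode;       // one-shot or circular buffer
    xmemEnabled = 0;          // disable XMEM, freeing the bus for the transfer
    trigger     = 1;          // trigger immediately (or arm a hardware trigger)
}
```

From the Arduino's point of view the whole acquisition then runs without it, until the completion interrupt arrives.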
The all in one approach dedicates the buffer to the specific board and function, but you still have XMEM to use to access both the buffer, the chip, and the memory controller. For an ADC, you could perform an acquisition, then pull chunks of acquired data through XMEM into internal memory for transmission over USB, Ethernet, or some other interface. For dynamic DAC, you would load the buffer with the waveform data from your com interface and then set it up to play out a single buffer cycle, or loop a specific number of times, or indefinitely. Digital I/O works the same way, although that can either be paced or be asynchronous, and handshaking would be supported if needed.
Once the simulation for the memory controller has been completed, the next step is to work on DMA. I'd really like to keep it in a CPLD, but the total gate count is borderline. If it does fit in a CPLD, then you are looking at about $3.50 for the logic. The cost of memory is $1.72 for 128K or $3.37 for 512K. Up to four RAM chips can be installed (all the same size). 5V to 3.3V buffers run about $1.65. Voltage regulator is about $1.20. These are quantity of 1 costs. You know the rest, PCB, connectors, the peripheral chip, some passive parts, etc. Anyway, that gives a ballpark idea.
I did look into DRAM, but SDRAM is really what is available now. The main issue with SDRAM is that it is QByte based. Refresh isn't such a big deal since they still can use CAS before RAS refresh, but getting around the QByte issue looks to be sticky, to say the least. For now, SRAM is the easiest to implement.
then it's time to move that application to a 32bit microcontroller that can handle all my data without ... kludges
Yea, that is why I've been toying with the stm32f4, as it includes everything listed above and is 250x faster with my FP single precision math sketches.
There is no question that the 32 bit micro-controllers are considerably more capable. The Atmel SAM4E would do everything that is needed for my applications. But I always get a little suspicious about libraries and code bloat whenever a micro with a wider bus and more memory comes into play.
For example, Ethernet is a must-have interface for me. So I did a bit of poking around and after spending several hours researching how that would be done with the SAM4E, I came across just such an example. Ethernet is implemented using a port of LWIP. Although I couldn't find exact references, it looks like it uses around 40K of code space and 10K or so of RAM. So that may not sound like much, but the file dependency list is more than a page long! The code is full of lots of things that I don't need. So that makes me wonder just how bloated the remaining hardware support libraries are.
Sure, I could port a driver for the W5500 ethernet chip and eliminate a software TCP/IP stack completely, and that would be fine. But when you have an application that consists of several different hardware modules interacting, the prospect of trying to debug code with such a large number of dependent files can be a real killer.
At a former company we needed to add Ethernet support to a couple of different product lines. The solution was to use a PowerPC processor and a Linux kernel. It worked, but the interface took several months to implement. It functions fine and does the job, but all the time it took to get things working, in combination with the complexity of the code, is an example of why trying to do everything in code isn't always the best approach.
A while back I needed to implement division in a micro-controller without a lot of horsepower. There was no support at the assembly level for division. After a bit of research I stumbled across a different division method called the Kenyan double-and-halve method. No, it isn't Kenyan in origin, so I have no clue where the name came from. Anyway, it uses shift and add instructions and is considerably faster than successive subtraction. For this particular project, implementing that form of division was about 120x faster than successive subtraction in situations where the divisor was considerably smaller in magnitude than the number to be divided.
At the end of the day, the method worked well and allowed a slow 8-bit processor to do the job without having to do a slow, floating point implementation.
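For the curious, here is a sketch of that double-and-halve division in C++. This is the classic shift-and-subtract formulation (double the divisor up toward the dividend, then subtract back down, accumulating power-of-two multiples); I'm not claiming it matches the original assembly:

```cpp
#include <cstdint>

// Division using only shifts, compares and subtracts.
uint32_t divShiftSub(uint32_t n, uint32_t d, uint32_t &rem) {
    if (d == 0) { rem = n; return 0; }  // guard against divide-by-zero
    uint32_t q = 0, dd = d, bit = 1;
    // Double the divisor until one more doubling would pass n (or overflow).
    while (!(dd & 0x80000000u) && (dd << 1) <= n) { dd <<= 1; bit <<= 1; }
    // Halve back down, subtracting wherever the doubled divisor still fits.
    while (bit) {
        if (n >= dd) { n -= dd; q |= bit; }
        dd >>= 1; bit >>= 1;
    }
    rem = n;
    return q;
}
```

The loop count is proportional to the bit difference between dividend and divisor, which is why it beats successive subtraction so badly when the divisor is small relative to the dividend.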
A long time ago (showing my age a little), in the days of older Windows, floating point was needed for calibration routines. However, floating point was not available at the device driver level. The solution was to use large integers. The raw data was in integer format. All of the numbers to be used were scaled up by doing a multiply instruction to add a bunch of trailing zeros. Integer division and multiplication were performed instead. Once the math was complete, the result was scaled back down. No digits of precision were lost, and the device driver worked just fine.
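A minimal sketch of that scaled-integer technique, with a made-up calibration gain for illustration (the scale factor just appends four decimal digits of precision):

```cpp
#include <cstdint>

// Scaled-integer arithmetic: pre-multiply by a power of ten, do all the
// math with integers, divide the scale back out at the end.
static const int64_t SCALE = 10000;  // four digits of precision

// gainScaled is the real-valued gain times SCALE, e.g. 1.2345 -> 12345.
int64_t calibrate(int64_t raw, int64_t gainScaled) {
    return (raw * gainScaled) / SCALE;  // integer multiply, then divide
}
```

As long as the intermediate product fits the integer width, no digits of precision are lost, exactly as in the device-driver case above.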
Those two examples simply illustrate that there are ways of getting more performance out of slower processors, or processors with more limited capabilities. Although it may seem easier to throw more performance at a problem, it is easy to get into the habit of lazy coding because it seems like there will always be more than enough horsepower for everything. So, sometimes doing things a little differently, and combining in a little custom hardware where needed, can still do the same job and be more manageable. It all depends on the situation, of course.
Well FWIW I've decided to have 512k x 8 external SRAM on my current design, I'm sure I'll think of a use for it :)
Just to offer a simple solution to the lost 8K problem.
- I solved this years ago when I set this up.
You can add a 512Kx8 / 64Kx8 chip,
but skip using the A15 pin out - and so add an additional Page Address pin.
This way you have the 32Kx8 page repeated again
(or if you OR'ed the A15 out with the A16 (page) out - you could get the next 32Kx8 page).
The disadvantage is that your access is in 32Kx8 pages instead of 64Kx8.
- You are accessing 64Kx8 as a dual / repeat of the same 32Kx8 twice.