
Topic: Mega2560 Memory Expansion Shield (Read 14962 times)


Is there any plan for the compiler to support bank switching?

You know, I'm a great believer in 8-bit processors.  I don't think they're going away any time soon.
But I have decided, as a personal metric, that when an 8-bit system starts to have to implement complicated memory access: bank switching, trampolines, CPLD decoders, compiler patches, special linker maps... in order to support my application, then it's time to move that application to a 32-bit microcontroller that can handle all my data without ... kludges.  It's not like you can't get an ARM, MIPS, ColdFire, or PPC and build a system that has about the same end price as AVR+kludges, and is a whole lot less work and more pleasant to work with.  I mean, there are $7 PIC32s with 512K of flash, 128K of RAM, and a built-in Ethernet MAC...


In any design approach there does come a tipping point where too much external hardware glue is just too much. By the same token, and I have seen this happen with other products, when you throw a much more powerful processor into the mix, software/firmware engineers tend to get sloppy. After all, you can do anything in code.

The print() and printf() support is a great example of that. Those are great functions to have. But using them consumes a ton of processor cycles, as was measured when I looked at the throughput of the Ethernet shield.

In all of the product designs I have been associated with over the years, there is always a certain amount of hardware glue that is needed. Most designs have needed a CPLD or two and an FPGA. You definitely don't want too much, but there always seems to be the need for some.

Anyway, the Arduino platform is great because it is modular. You can put together many different things and have something up and running pretty quickly. This particular design approach has merits and drawbacks, of course. But it is a good lesson to follow for certain sorts of low cost products with modest performance levels.

For my current development efforts, a modular approach, whether it is literal physical modules of some kind, or it is circuitry modules that are contained within a single PCB, will present more flexibility for implementing a wider array of potential product capabilities. You put together the pieces that you need for a product from a 'toolbox' of proven modules and libraries. Some modules contain optional elements. 128K of RAM may be just fine for one product, but on another, the whole 2MB would be needed. Populate the hardware as needed and offer a variety of expansion shields.

The main reason for this posting is that some of these modules would easily fit and function as Arduino shields. And I am simply curious to know if there is any interest. Regardless of implementation, if it is cost-effective and works well, that seems to be one of the key points in the Arduino philosophy. Of course, it is easier to generate interest once you have working shields to offer. But it never hurts to do a little bit of marketing research ahead of time.


Apart from SRAM what peripherals have you identified as being useful/available in a memory-mapped version.

I was looking at the LTC1851/2/3/4 ADCs, but I think they have 16-bit interfaces and are not a great match. And there are some UARTs like the SC68C752B and NS16C2552; they are 8-bit, so they would be suitable.

I like the idea of implementing the XMEM interface, always have, but I admit I'm having trouble justifying it. SRAM may be enough but a few sweet peripheral chips would seal the deal.

Rob Gray aka the GRAYnomad www.robgray.com


ADCs, DACs and Digital I/O are the main candidates. These are the bread and butter functions of typical DAQ applications. Pairing these up with big buffers and DMA is a fundamental goal of my design. So XMEM is a necessity. As far as data bus width mismatches go, I have to fall back on a CPLD or some discrete logic to deal with that issue.

One important design note: everything is 3.3V. It is just easier to deal with 3.3V parts nowadays because they are cheaper and easier to find. This means that there are buffers between this circuit and the Arduino. Based on the Atmel specifications, a logic high is 0.7*VCC, which is 3.5 volts at a 5V supply. Neither the CPLDs nor the memory can deliver that logic level. On the peripheral side, the logic level may also be an issue. If so, buffers would be needed to go from 3.3V to 5V.

For a totally Arduino-compatible shield, an alternative approach might be to implement XMEM, buffer and the desired function completely onboard. This would eliminate the need for having to use an expansion shield to get the I/O bus and a separate shield with the desired function to plug into that bus. This would be more in the Arduino spirit, so to speak, where a single shield implements the capability you are looking for.

Some of the PWM channels could be used to provide pacer clock signals. A few digital lines could be used for decoding, say three lines to act as module selects so that you could stack more than one module. And then there is the simple serial interface to check the status of a buffer transfer. A trigger line would also be needed to start things off, at least for triggered acquisitions. And an interrupt line, definitely an interrupt line!

Anyway, the script is the same. Enable XMEM, configure your chip, configure the buffer transfer byte count, configure the buffer mode (one-shot or circular) and buffer direction (if needed), set up a PWM channel for the pacer signal (if needed), disable XMEM, and then either trigger immediately or wait for something that tells you to trigger. Then you wait for the interrupt to signal end of acquisition or output, or whatever is meaningful for the type of function the board is providing. This offloads all of the data movement into hardware.
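For reference, the "enable XMEM" step on an ATmega2560 comes down to a couple of register writes. A minimal sketch (the wait-state choice here is an assumption; tune the SRW bits to the SRAM's access time):

```c
#include <avr/io.h>

void xmem_init(void)
{
    /* SRE enables the external memory interface on PORTA/PORTC;
       SRW10 adds one wait state across the whole region
       (assumed -- pick per the SRAM's speed grade). */
    XMCRA = (1 << SRE) | (1 << SRW10);

    /* XMM2:0 = 0 releases the full A15..A8 high-address byte;
       bus keeper (XMBK) left off. */
    XMCRB = 0;
}
```

After this, any read or write to addresses above internal SRAM goes out over the external bus automatically.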

The all-in-one approach dedicates the buffer to the specific board and function, but you still have XMEM to use to access the buffer, the chip, and the memory controller. For an ADC, you could perform an acquisition, then pull chunks of acquired data through XMEM into internal memory for transmission over USB, Ethernet, or some other interface.  For a dynamic DAC, you would load the buffer with the waveform data from your com interface and then set it up to play out a single buffer cycle, loop a specific number of times, or loop indefinitely. Digital I/O works the same way, although that can be either paced or asynchronous, and handshaking would be supported if needed.

Once the simulation for the memory controller has been completed, the next step is to work on DMA. I'd really like to keep it in a CPLD, but the total gate count is borderline. If it does fit in a CPLD, then you are looking at about $3.50 for the logic. The cost of memory is $1.72 for 128K or $3.37 for 512K. Up to four RAM chips can be installed (all the same size). 5V to 3.3V buffers run about $1.65. Voltage regulator is about $1.20. These are quantity of 1 costs. You know the rest, PCB, connectors, the peripheral chip, some passive parts, etc. Anyway, that gives a ballpark idea.

I did look into DRAM, but SDRAM is really what is available now. The main issue with SDRAM is that it is QByte based. Refresh isn't such a big deal since they still can use CAS before RAS refresh, but getting around the QByte issue looks to be sticky, to say the least. For now, SRAM is the easiest to implement.


then it's time to move that application to a 32bit microcontroller that can handle all my data without ... kludges

Yeah, that is why I've been toying with the STM32F4, as it includes everything listed above and is 250x faster with my single-precision FP math sketches..


There is no question that the 32-bit microcontrollers are considerably more capable. The Atmel SAM4E would do everything that is needed for my applications. But I always get a little suspicious about libraries and code bloat whenever a micro with a wider bus and more memory comes into play.

For example, Ethernet is a must-have interface for me. So I did a bit of poking around and after spending several hours researching how that would be done with the SAM4E, I came across just such an example. Ethernet is implemented using a port of LWIP. Although I couldn't find exact references, it looks like it uses around 40K of code space and 10K or so of RAM. So that may not sound like much, but the file dependency list is more than a page long! The code is full of lots of things that I don't need. So that makes me wonder just how bloated the remaining hardware support libraries are.

Sure, I could port a driver for the W5500 ethernet chip and eliminate a software TCP/IP stack completely, and that would be fine. But when you have an application that consists of several different hardware modules interacting, the prospect of trying to debug code with such a large number of dependent files can be a real killer.

At a former company we needed to add Ethernet support to a couple of different product lines. The solution was to use a PowerPC processor and a Linux kernel. It worked, but the interface took several months to implement. It functions fine and does the job, but the time it took to get things working, combined with the complexity of the code, is an example of why trying to do everything in code isn't always the best approach.

A while back I needed to implement division on a microcontroller without a lot of horsepower. There was no support at the assembly level for division. After a bit of research I stumbled across a different division method called the Kenyan double-and-halve method. No, it isn't Kenyan in origin, so I have no clue where the name came from. Anyway, it uses shift and add instructions and is considerably faster than successive subtraction. For this particular project, that form of division was about 120x faster than successive subtraction in situations where the divisor was considerably smaller in magnitude than the number to be divided.
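The double-and-halve idea can be sketched in portable C: double the divisor until it is the largest shifted copy that still fits under the dividend, then halve back down, subtracting wherever a copy fits. Each successful subtraction sets one quotient bit, so only shifts, compares, and subtracts are needed (function name and 32-bit width are illustrative):

```c
#include <stdint.h>

uint32_t div_double_halve(uint32_t dividend, uint32_t divisor)
{
    uint32_t q = 0, bit = 1;

    if (divisor == 0)
        return 0;                      /* caller's responsibility */

    /* Double: shift the divisor (and its quotient bit) up until the
       next doubling would overshoot the dividend or overflow. */
    while (!(divisor & 0x80000000u) && (divisor << 1) <= dividend) {
        divisor <<= 1;
        bit <<= 1;
    }

    /* Halve: walk back down, subtracting each copy that fits. */
    while (bit) {
        if (dividend >= divisor) {
            dividend -= divisor;
            q |= bit;
        }
        divisor >>= 1;
        bit >>= 1;
    }
    return q;      /* what is left in dividend is the remainder */
}
```

Because the work is proportional to the bit-length gap between divisor and dividend, it shines exactly in the case described: a divisor much smaller than the dividend.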

At the end of the day, the method worked well and allowed a slow 8-bit processor to do the job without having to do a slow, floating point implementation.

A long time ago (showing my age a little), in the days of older Windows, floating point was needed for calibration routines. However, floating point was not available at the device driver level. The solution was to use large integers. The raw data was in integer format. All of the numbers to be used were scaled up by doing a multiply instruction to add a bunch of trailing zeros. Integer division and multiplication were performed instead. Once the math was complete, the result was scaled back down. No digits of precision were lost, and the device driver worked just fine.
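The scaled-integer trick above can be shown in a few lines. Here the gain value and scale factor are hypothetical, and a 64-bit intermediate stands in for the "large integers" of the original driver:

```c
#include <stdint.h>

#define SCALE 10000L   /* four implied decimal digits */

/* Apply a fractional calibration gain to a raw integer sample using
   only integer multiply and divide.  gain_scaled is the gain times
   SCALE, e.g. 10234 represents a gain of 1.0234 (illustrative). */
int32_t apply_gain(int32_t raw, int32_t gain_scaled)
{
    /* Multiply up, round, then divide the scale factor back out. */
    return (int32_t)(((int64_t)raw * gain_scaled + SCALE / 2) / SCALE);
}
```

The multiply adds the trailing digits, the divide removes them, and no fractional precision is lost along the way as long as the intermediate product fits.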

Those two examples simply illustrate that there are ways of getting more performance out of slower processors, or processors with more limited capabilities. Although it may seem easier to throw more performance at a problem, it is easy to get into the habit of lazy coding when it seems like there will always be more than enough horsepower for everything. So sometimes doing things a little differently, and combining a little custom hardware where needed, can still do the same job and be more manageable. It all depends on the situation, of course.


Well FWIW I've decided to have 512k x 8 external SRAM on my current design, I'm sure I'll think of a use for it :)

Rob Gray aka the GRAYnomad www.robgray.com


Aug 09, 2019, 08:18 pm Last Edit: Aug 09, 2019, 08:19 pm by jlsilicon
Just to offer a simple solution to the lost 8K problem.
- I solved this years ago when I set this up.

You can add a 512Kx8 or 64Kx8 chip, but skip using the A15 output and add an additional page-address pin instead. This way the same 32Kx8 page is repeated (or, if you OR the A15 output with the A16 page output, you can get the next 32Kx8 page).

The disadvantage is that your access is in 32Kx8 pages instead of 64Kx8.
- You are accessing the 64Kx8 as a repeat of the same 32Kx8 twice.
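The address mapping that results from this wiring can be sketched in a few lines of host-side C (pin naming and the single page line are illustrative, not the actual board):

```c
#include <stdint.h>

/* With the SRAM's A15 tied to a spare "page" output instead of the
   AVR's own A15, every XMEM access lands in a 32K page selected by
   that line; the AVR's A15 is simply ignored, so addresses repeat
   every 32K within a page. */
uint32_t phys_addr(uint16_t xmem_addr, uint8_t page)
{
    return ((uint32_t)(page & 1) << 15) | (xmem_addr & 0x7FFFu);
}
```

So 0x0100 and 0x8100 on the AVR side hit the same SRAM cell within a page, and flipping the page line moves the whole 32K window.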
