ADCs, DACs and digital I/O are the main candidates. These are the bread-and-butter functions of typical DAQ applications. Pairing these up with big buffers and DMA is a fundamental goal of my design, so XMEM is a necessity. As far as data bus width mismatches go, I have to fall back on a CPLD or some discrete logic to deal with that issue.
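From the firmware side, a width mismatch mostly shows up as a sample that arrives in more than one byte. Here is a minimal sketch, assuming the glue logic presents a 16-bit converter word as two consecutive bytes with the low byte first. That byte ordering and the function name are assumptions for illustration only, not something the design has settled on.

```c
#include <stdint.h>

/* Hypothetical helper: rebuild one 16-bit sample from two bytes read over
 * the 8-bit XMEM bus. Assumes the glue logic stores the low byte at the
 * lower address; swap the indices if the hardware does the opposite. */
uint16_t sample_from_bytes(const volatile uint8_t *p)
{
    uint16_t lo = p[0];
    uint16_t hi = p[1];
    return (uint16_t)((hi << 8) | lo);
}
```

A 12-bit converter would work the same way, with the unused upper bits masked off after the two bytes are combined.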
One important design note: everything is 3.3V. It is just easier to deal with 3.3V parts nowadays because they are cheaper and easier to find. This means that there are buffers between this circuit and the Arduino. Based on the Atmel specifications, the minimum input level for a logic high is 0.7*VCC, which is 3.5 volts with the Arduino running at 5V. Neither the CPLDs nor the memory can deliver that logic level from a 3.3V supply. On the peripheral side, the logic level may also be an issue. If so, buffers would be needed to go from 3.3V to 5V.
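The arithmetic behind that conclusion is small enough to write down. This is only a sketch that takes the 0.7*VCC figure quoted above at face value; the function names are made up for the example, and voltages are in millivolts so the math stays in integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum input-high voltage in millivolts, using the 0.7*VCC figure
 * quoted above. Integer math keeps the result exact. */
uint32_t vih_min_mv(uint32_t vcc_mv)
{
    return (vcc_mv * 7u) / 10u;
}

/* True if a driver that can guarantee voh_mv on its output will be read
 * as a logic high by a receiver powered from vcc_mv. */
bool drives_valid_high(uint32_t voh_mv, uint32_t vcc_mv)
{
    return voh_mv >= vih_min_mv(vcc_mv);
}
```

Even with the most optimistic output level for a 3.3V part, which is the 3.3V rail itself, `drives_valid_high(3300, 5000)` is false because the threshold works out to 3500 mV. That is why the buffers are there.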
For a totally Arduino-compatible shield, an alternative approach might be to implement XMEM, buffer and the desired function completely onboard. This would eliminate the need for having to use an expansion shield to get the I/O bus and a separate shield with the desired function to plug into that bus. This would be more in the Arduino spirit, so to speak, where a single shield implements the capability you are looking for.
Some of the PWM channels could be used to provide pacer clock signals. A few digital lines could be used for decoding, say three lines to act as module selects so that you could stack more than one module. And then there is the simple serial interface to check the status of a buffer transfer. A trigger line would also be needed to start things off, at least for triggered acquisitions. And an interrupt line, definitely an interrupt line!
Anyway, the script is the same. Enable XMEM, configure your chip, configure the buffer transfer byte count, configure the buffer mode (one-shot or circular), buffer direction (if needed), set up a PWM channel for the pacer signal (if needed), disable XMEM, and then either trigger immediately or wait for something that tells you to trigger. Then you wait for the interrupt to signal end of acquisition or output or whatever is meaningful for the type of function that the board is providing. This offloads all of the data movement stuff into hardware.
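To make that script concrete, here is a rough sketch of the sequence in C. Everything about the register layout is hypothetical: the `daq_regs_t` struct, the bit names and the helper functions are invented for illustration, since the memory controller is still in simulation. The XMEM enable bit, trigger pin and interrupt are replaced by plain variables so the flow can be followed on a PC; on an ATmega1280/2560, enabling XMEM would mean setting the SRE bit in XMCRA instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register block for the buffer controller. On the real board
 * it would sit somewhere in the XMEM window; here it is a plain struct. */
typedef struct {
    uint8_t  ctrl;        /* mode and direction bits  */
    uint32_t byte_count;  /* transfer length in bytes */
} daq_regs_t;

enum {
    DAQ_CTRL_CIRCULAR = 0x01, /* 0 = one-shot, 1 = circular */
    DAQ_CTRL_OUTPUT   = 0x02  /* 0 = acquire,  1 = play out */
};

static bool xmem_enabled;            /* stands in for the XMEM enable bit */
static bool trigger_line;            /* stands in for the trigger pin     */
static volatile bool transfer_done;  /* set from the interrupt handler    */

static void xmem_enable(void)  { xmem_enabled = true;  }
static void xmem_disable(void) { xmem_enabled = false; }

/* Enable XMEM, program byte count, mode and direction, disable XMEM.
 * Peripheral chip setup would go in the same window; the PWM pacer
 * setup is left out of this sketch. */
void daq_configure(daq_regs_t *regs, uint32_t byte_count,
                   bool circular, bool output)
{
    uint8_t ctrl = 0;
    if (circular) ctrl |= DAQ_CTRL_CIRCULAR;
    if (output)   ctrl |= DAQ_CTRL_OUTPUT;

    xmem_enable();
    regs->byte_count = byte_count;
    regs->ctrl = ctrl;
    xmem_disable();
    transfer_done = false;
}

/* Start the transfer by raising the trigger line. */
void daq_trigger(void) { trigger_line = true; }

/* Body of the end-of-transfer interrupt handler. */
void daq_on_interrupt(void)
{
    trigger_line = false;
    transfer_done = true;
}

/* Main code checks this (or sleeps) until the hardware is finished. */
bool daq_is_done(void) { return transfer_done; }
```

A triggered one-shot acquisition would then be `daq_configure(&regs, 4096, false, false)`, followed by `daq_trigger()` when the start condition arrives, with the main loop free to do other work until `daq_is_done()` returns true.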
The all-in-one approach dedicates the buffer to the specific board and function, but you still have XMEM to access the buffer, the chip, and the memory controller. For an ADC, you could perform an acquisition, then pull chunks of acquired data through XMEM into internal memory for transmission over USB, Ethernet, or some other interface. For a dynamic DAC, you would load the buffer with the waveform data from your com interface and then set it up to play out a single buffer cycle, or loop a specific number of times, or indefinitely. Digital I/O works the same way, although that can either be paced or be asynchronous, and handshaking would be supported if needed.
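The ADC case, pulling acquired data out in chunks, might look something like the sketch below. It assumes the external buffer is reachable through an ordinary pointer while XMEM is enabled, and it ignores any bank switching that a buffer larger than the XMEM window would need. The names `drain_buffer`, `chunk_sink_t` and `counting_sink` are made up for the example; the sink stands in for whatever USB or Ethernet send routine is actually used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Callback that ships one chunk out over USB, Ethernet, etc. */
typedef void (*chunk_sink_t)(const uint8_t *data, size_t len, void *ctx);

/* Copy `total` bytes from the external buffer into the small internal
 * `chunk` array, one chunk at a time, handing each chunk to `sink`.
 * Returns the number of bytes delivered. */
size_t drain_buffer(const uint8_t *ext_buf, size_t total,
                    uint8_t *chunk, size_t chunk_size,
                    chunk_sink_t sink, void *ctx)
{
    size_t sent = 0;
    if (chunk_size == 0) {
        return 0; /* nothing can be moved without a staging area */
    }
    while (sent < total) {
        size_t n = total - sent;
        if (n > chunk_size) {
            n = chunk_size; /* the last chunk may be shorter */
        }
        memcpy(chunk, ext_buf + sent, n);
        sink(chunk, n, ctx);
        sent += n;
    }
    return sent;
}

/* Example sink that only counts what it was given. */
typedef struct { size_t calls; size_t bytes; } sink_stats_t;

void counting_sink(const uint8_t *data, size_t len, void *ctx)
{
    sink_stats_t *s = (sink_stats_t *)ctx;
    (void)data;
    s->calls++;
    s->bytes += len;
}
```

The chunk size is the knob here: it only has to be as big as the internal RAM you can spare, since the bulk of the data stays in the external buffer until it is sent.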
Once the simulation for the memory controller has been completed, the next step is to work on DMA. I'd really like to keep it in a CPLD, but the total gate count is borderline. If it does fit in a CPLD, then you are looking at about $3.50 for the logic. The cost of memory is $1.72 for 128K or $3.37 for 512K. Up to four RAM chips can be installed (all the same size). 5V to 3.3V buffers run about $1.65. The voltage regulator is about $1.20. These are quantity-one prices. You know the rest: PCB, connectors, the peripheral chip, some passive parts, etc. Anyway, that gives a ballpark idea.
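Adding those numbers up for a given configuration is simple arithmetic. The sketch below does it in cents, using only the prices quoted above. It assumes everything fits in a single CPLD and that one regulator is enough. Whether the $1.65 buffer figure is per chip, and how many buffer chips are needed, is not pinned down above, so the buffer count is left as a parameter.

```c
#include <stdbool.h>

/* Quantity-one prices from the text above, in US cents. */
enum {
    COST_CPLD      = 350,
    COST_RAM_128K  = 172,
    COST_RAM_512K  = 337,
    COST_BUFFER    = 165,
    COST_REGULATOR = 120
};

/* Ballpark cost of the logic, memory, buffers and regulator, in cents.
 * ram_chips must be 1..4 (all the same size). buffer_chips is an
 * assumption supplied by the caller. Returns -1 for invalid input.
 * PCB, connectors, peripheral chip and passives are not included. */
int logic_cost_cents(int ram_chips, bool use_512k, int buffer_chips)
{
    if (ram_chips < 1 || ram_chips > 4 || buffer_chips < 0) {
        return -1;
    }
    int ram = use_512k ? COST_RAM_512K : COST_RAM_128K;
    return COST_CPLD + ram_chips * ram
         + buffer_chips * COST_BUFFER + COST_REGULATOR;
}
```

With one 128K chip and one buffer chip this comes to 807 cents, or about $8.07. Four 512K chips with one buffer chip comes to 1983 cents, or about $19.83.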
I did look into DRAM, but SDRAM is really what is available now. The main issue with SDRAM is that it is QByte based. Refresh isn't such a big deal since they can still use CAS-before-RAS refresh, but getting around the QByte issue looks to be sticky, to say the least. For now, SRAM is the easiest to implement.