Ah, thanks. Those changes do speed things up. I made another sketch, mem2mem2.ino, to test DUEling memcpy's. I added a DMA interrupt to the sketch and altered the DMA memcpy32() to run asynchronously. I then let the DMA memcpy battle the library memcpy() by starting the DMA memcpy32() and immediately starting the library memcpy(), each operating on its own src/dst pair of 1024-word (32-bit) vectors.
The DMA copy won the DUEl, finishing first in 34us; the memcpy() took 80us, for a total elapsed time of 82us. Using SRAM1 for one of the src/dst pairs resulted in the DMA running in 33us and the memcpy() taking 60us, with a total elapsed time of 70us.
On the Maple, the memcpy() finished first (64us) and the DMA took 122us.
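To compare these timings as throughput, here is a rough helper. It assumes each vector is 1024 32-bit words (4096 bytes), as in the test above. The name copy_mb_per_sec is made up for this post and is not in mem2mem2.ino.

```cpp
#include <cstdint>

// Hypothetical helper (not part of mem2mem2.ino): copy throughput in MB/s,
// with 1 MB = 10^6 bytes, from a word count and an elapsed time in microseconds.
// Bytes per microsecond is numerically equal to 10^6 bytes per second.
double copy_mb_per_sec(uint32_t words, uint32_t elapsed_us) {
    if (elapsed_us == 0) return 0.0;  // guard against a zero timing reading
    const double bytes = static_cast<double>(words) * sizeof(uint32_t);
    return bytes / static_cast<double>(elapsed_us);
}
```

In a sketch you could print it with Serial.println(copy_mb_per_sec(1024, us)). For the DUE numbers above, 80us for memcpy() works out to 51.2 MB/s, and 34us for the DMA to roughly 120 MB/s.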
I don't know what magic the DUE MATRIX might provide...
mem2mem2.ino can be found in my DUEZoo: https://github.com/manitou48/DUEZoo