tuxedo:
but noone seems to have any clue how the emulation is done?!
Well, I have a pretty good idea about this, so I wouldn't say no one does!
During the Arduino Zero beta test period, I wrote a couple lengthy messages on a private forum Arduino used to coordinate beta testers, regarding my thoughts on implementing the EEPROM library on top of SAMD's flash. Sadly, nobody seems to have attempted to implement it. Those messages never became public. I no longer have access to them.
If you doubt I might know this stuff, I am the creator of Teensy and I implemented EEPROM emulation on Teensy-LC, which uses a 48 MHz Cortex-M0+ processor pretty similar to Arduino Zero. On Teensy-LC, sketches that use the EEPROM library work properly, as long as they access only the first 128 EEPROM bytes. That's not nearly as much EEPROM memory as most AVR-based Arduino boards have, but nearly all libraries and example sketches which store configuration in the EEPROM use less than 100 bytes (or have a #define or variable to edit for the total size to use). Writing to the emulated EEPROM stalls program execution and even reading has very different timing than AVR, but nearly all existing sketches that need EEPROM storage aren't very sensitive to write or even read timing. Despite these limitations, Teensy-LC's compatibility with most sketches storing data with the EEPROM library is very good. Arduino Zero could do the same, if anyone is willing to do the programming work....
Obviously the first step involves choosing a region of flash memory to be reserved for emulating the EEPROM. Unfortunately, the tools used for Zero erase the entire flash memory. Changing the bootloader to preserve that memory might be possible. The EDBG chip might be harder! You can still emulate EEPROM without this, but Arduino users expect EEPROM contents to be preserved across every upload. FWIW, on Teensy-LC the top 2K of flash was set aside from the beginning as space for EEPROM emulation, so the bootloader never erases it. The linker script also prevents compiled code from using that space. If you get the rest working, maybe the Arduino devs will consider changing the published tools?
Assuming you have space reserved, you need to come up with a data storage scheme that works with the properties of your flash memory. Flash is erased in large blocks and written in small words. Erase turns all the bits within a block to 1s. Writing can turn those 1s into 0s. Some chips allow re-writing words, others don't. Even if they do, the only way to get 0 bits back to 1s is with an erase of the entire block.
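To make those rules concrete, here is a minimal sketch of them as a RAM-only model in C. It does not touch any real flash controller. The 512 byte block size, the type name and the function names are all assumptions chosen for illustration, and the model assumes a chip that tolerates re-writing an already-written word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512u  /* erase granularity in bytes (assumed for illustration) */

/* One simulated flash block, held in ordinary RAM. */
typedef struct {
    uint8_t bytes[BLOCK_SIZE];
} flash_block_t;

/* Erase: every bit in the whole block becomes 1. */
static void flash_erase(flash_block_t *blk)
{
    memset(blk->bytes, 0xFF, sizeof blk->bytes);
}

/* Program one aligned 32-bit word. Programming can only turn 1s into 0s,
   so the stored result is the AND of the old contents and the new value. */
static void flash_program_word(flash_block_t *blk, uint32_t offset, uint32_t value)
{
    uint32_t old;
    assert(offset % 4u == 0u && offset + 4u <= BLOCK_SIZE);
    memcpy(&old, &blk->bytes[offset], sizeof old);
    old &= value;
    memcpy(&blk->bytes[offset], &old, sizeof old);
}

/* Read one aligned 32-bit word back. */
static uint32_t flash_read_word(const flash_block_t *blk, uint32_t offset)
{
    uint32_t value;
    assert(offset % 4u == 0u && offset + 4u <= BLOCK_SIZE);
    memcpy(&value, &blk->bytes[offset], sizeof value);
    return value;
}
```

In this model, programming 0xFFFFFFFF over a word that already holds 0s changes nothing. Only flash_erase() brings the 1s back, and it does so for the entire block at once.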
Freescale's FTFA flash controller has 512 byte block erase size. Write words are 32 bits, but can be used multiple times to turn any 1s into 0s.
Many things about SAMD aren't perfectly clear to me. For example, section 21.6.3 seems to say the erase block size is 256 bytes (4 pages of 64 bytes each), but other places seem to suggest individual pages may be erasable. 21.6.5.3 pretty clearly says the write word size is either 16 or 32 bits. 8 bit writes aren't possible. It's not clear if you can turn more 1s into 0s within the same already-written word. Some chips allow this, others don't. Atmel's documentation isn't clear to me on this point. In fact, some parts seem to suggest 64 bytes have to be written at once.
When I looked into this during the Zero beta test, I found an extra restriction in Atmel's software documentation about a limited number of writes being allowed within each page before erasing. It doesn't seem to be mentioned in their datasheet at all, though I must admit I've not read every part. I just looked and couldn't find this info again.... so perhaps I'm remembering incorrectly. Anyone actually trying to implement this might look for it.
Working within the block erase size and word write size and any other erase/write hardware guidelines, you need to come up with a data storage scheme which spreads the write activity (hopefully) evenly across the memory. Flash memory has much lower write endurance than real EEPROM, so this "wear leveling" is needed to allow EEPROM-like endurance.
For Teensy-LC, I used a simple scheme where every EEPROM write stores 2 bytes within the flash blocks, one for address and the other for data. Because I only implemented 128 emulated EEPROM bytes, addresses 128 to 255 mean the 2 bytes are unused. Reading requires a linear search to find the last location with the desired address. I used a very simple scheme where the entire 2K is erased when you need to write past the last location. Because only 128 bytes can be stored, simple code just reads the whole thing to collect up all the data into a 128 byte buffer in RAM, then the entire 2K is erased, and all non-255 bytes from the buffer are written to the beginning of the freshly erased memory. It's a very simple scheme, but it works. A good number of people have used sketches that depend on the EEPROM library for storing small amounts of data, with happy results.
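The scheme just described can be sketched roughly like this. To be clear, this is not the actual Teensy-LC code: it is a RAM-only simulation with hypothetical names (flash_sim, emu_format, emu_read, emu_write), and it ignores the real controller's 32 bit write word and the need to run the flash commands from RAM. It only shows the address+data log, the linear search, and the collect, erase and rewrite step.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLASH_BYTES  2048u               /* reserved region, simulated in RAM */
#define LOG_ENTRIES  (FLASH_BYTES / 2u)  /* each entry: 1 address byte + 1 data byte */
#define EEPROM_SIZE  128u                /* emulated EEPROM bytes */

static uint8_t  flash_sim[FLASH_BYTES];  /* stand-in for the real flash region */
static unsigned erase_count;             /* erases performed (the wear) */

/* Erase the whole region: all bytes read back as 0xFF, i.e. "unused". */
void emu_format(void)
{
    memset(flash_sim, 0xFF, sizeof flash_sim);
    erase_count++;
}

/* Linear search: the last entry carrying this address wins. */
uint8_t emu_read(uint8_t addr)
{
    uint8_t value = 0xFF;                /* never-written bytes read as 0xFF */
    if (addr >= EEPROM_SIZE) return 0xFF;
    for (unsigned i = 0; i < LOG_ENTRIES; i++) {
        uint8_t a = flash_sim[2u * i];
        if (a >= EEPROM_SIZE) break;     /* address 128..255: unused, end of log */
        if (a == addr) value = flash_sim[2u * i + 1u];
    }
    return value;
}

/* Append an (address, data) pair, compacting first if the log is full. */
void emu_write(uint8_t addr, uint8_t data)
{
    unsigned i = 0;
    if (addr >= EEPROM_SIZE) return;
    while (i < LOG_ENTRIES && flash_sim[2u * i] < EEPROM_SIZE) i++;
    if (i == LOG_ENTRIES) {
        uint8_t buf[EEPROM_SIZE];
        /* Collect the current contents, erase, rewrite only non-0xFF bytes. */
        for (unsigned a = 0; a < EEPROM_SIZE; a++) buf[a] = emu_read((uint8_t)a);
        emu_format();
        i = 0;
        for (unsigned a = 0; a < EEPROM_SIZE; a++) {
            if (buf[a] != 0xFF) {
                flash_sim[2u * i] = (uint8_t)a;
                flash_sim[2u * i + 1u] = buf[a];
                i++;
            }
        }
    }
    assert(i < LOG_ENTRIES);             /* at most 128 entries after compaction */
    flash_sim[2u * i] = addr;
    flash_sim[2u * i + 1u] = data;
}
```

In this sketch you call emu_format() once so the simulated region starts out erased, then use emu_read() and emu_write() like the byte-wise EEPROM calls. Every write appends to the log, so even hammering a single address only costs one erase of the region per roughly a thousand writes.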
Atmel appears to have designed something similar. It's documented in application note AT03265. However, Atmel's scheme appears to be designed around storage of 60 byte blocks, rather than individual bytes. It also uses a RAM-based cache, which is an opportunity for data loss and very unlike what most users of the EEPROM library would expect. Atmel recommends implementing the brownout early warning interrupt to flush the cache.
A simple but inefficient approach might be to build the EEPROM library on top of Atmel's code, by placing just a single byte in the 60-byte logical page. This would allow immediate flushing, which is closest to what people expect from AVR. Or you could pack pairs of address+data into the 60 byte pages, which might give good wear leveling even with unusual write patterns, but at a cost of searching to read (similar to what I did on Teensy-LC, which has worked well in practice). Or each 60 byte page could implement 60 bytes of emulated EEPROM, which is probably the most memory efficient way, and still might have decent wear leveling if you limit the emulated address size. Maybe?
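For the last option, where each 60 byte logical page holds 60 bytes of emulated EEPROM, the address arithmetic might look like the sketch below. It only computes which logical page and which offset a byte lands in. It does not call any of Atmel's code, and the names (emu_location_t, emu_locate) and the 240 byte emulated size are assumptions picked for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LOGICAL_PAGE_BYTES 60u   /* payload bytes per logical page, as described above */
#define EMU_EEPROM_BYTES   240u  /* emulated size: 4 logical pages (assumed) */

/* Where one emulated EEPROM byte lives. */
typedef struct {
    uint16_t page;    /* logical page number */
    uint8_t  offset;  /* byte offset inside that page's 60 byte payload */
} emu_location_t;

/* Map an emulated EEPROM address onto (logical page, offset). */
static emu_location_t emu_locate(uint16_t eeprom_addr)
{
    emu_location_t loc;
    assert(eeprom_addr < EMU_EEPROM_BYTES);
    loc.page   = (uint16_t)(eeprom_addr / LOGICAL_PAGE_BYTES);
    loc.offset = (uint8_t)(eeprom_addr % LOGICAL_PAGE_BYTES);
    return loc;
}
```

With this layout a single byte write means reading the 60 byte page, changing one byte and writing the page back, so addresses that share a page also share wear. That is one reason limiting the emulated address size might matter.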
Or you could not use any of Atmel's code and try working directly with the hardware registers documented in section 21.8. Since none of the 256K chips have the RWW feature, you'd probably have to place a small amount of code in RAM to do the actual dirty work, which you'd call with interrupts disabled. That's what I did for Freescale's chip on Teensy-LC.
Of course, if you only care about getting your own project to work, rather than developing a good emulation of the EEPROM library, perhaps Atmel's software with 60 byte blocks could meet your needs? Or just adding an external EEPROM chip might be the path of least resistance?