Anyone run the SdFat library for long periods at SPI_FULL_SPEED without corruption?

Hi,

Just curious whether anyone has run the SD/SdFat library using SPI_FULL_SPEED for long periods of time without experiencing data corruption?

I see the default is SPI_HALF_SPEED, and I think it's safe to say it is set to half speed for a good reason. Is the reason that SPI_FULL_SPEED does not really work? Since the option is there, I'm sure there are people out there like me who will be tempted to set it to full speed. But on my setup it runs only a few days, maybe a little over a week, before the SD card's file system is corrupted. That is why I specifically asked whether anyone has actual experience running a program at full speed for long periods of time (months or years) without corruption.

I just changed my program yesterday and switched back to the default half speed. I also use the datetime callback; I can't say whether that is related to the file corruption, but I may take it out as well, since the timestamps on the SD card are useless to me anyway.

thanks.

The default is 8 MHz, full speed for SdFat.

The default for SD.h is half speed because many users have breadboards or long wires.

Noise is another problem. How do you power your Arduino?

You have an SPI bus or SD module hardware problem if changing the SPI speed causes errors.

SdFat runs at 42 MHz on Due with very short wires connecting the SD socket to the SPI header.

SdFat runs at 8 MHz on AVR Arduinos and I test it by writing entire SD cards at full speed.

There is no hardware CRC on Arduino but I have implemented software CRC. You can enable CRC to catch SPI bus errors by editing SdFatConfig.h and setting USE_SD_CRC nonzero.

#define USE_SD_CRC 2

If an SdFat function fails, check the SD error code like this:

  Serial.println(sd.card()->errorCode(), HEX);

Error code values are defined in Sd2Card.h.

A read CRC error is:

/** invalid read CRC */
uint8_t const SD_CARD_ERROR_READ_CRC = 0X1B;

Write CRC errors will cause an SD command to fail. For example:

/** card returned an error token as a response to a write operation */
uint8_t const SD_CARD_ERROR_WRITE = 0X13;

Thanks for the reply.

I use a Mega with an Ethernet shield, powered by a 12 V supply stepped down to 5 V with a buck regulator.
I have used the SD library, modified to use SPI_FULL_SPEED, since day 1. I only switched back to the default SPI_HALF_SPEED yesterday.

My program uses a lot of timer and hardware interrupts:
one timer fires 20 times per second, a second one fires once per second, and two pins are hooked to hardware interrupts.
I run a modified version of TinyWebserver.
The program logs to SD once every 10 minutes, plus a second log that is event driven.
A third log records ethernet activity (HTTP requests, authentication failures, etc.). In addition, the HTTP server reads from the SD to serve GET requests. There is actually only a single 40 KB index.htm file; the rest of the requests retrieve/view the log files.
So this is far from the ideal lab test case where the only thing going on is a single write.

I imagine an ethernet request could ask for a particular log file at the same time the program attempts to write a log entry to that file. But the SD and ethernet libraries should take care of serializing those requests, right? So really only one request is processed at a time.
Do you have any test that uses SD and ethernet at the same time? Those are the only two devices on SPI in my setup.

Thanks

I don't think I've experienced any garbled reads or read errors.
Only writes suddenly fail. I think checking for write errors would be too late, since by then the damage to the file system has likely already been done.

One time the error occurred on a file: the file could no longer be read, and a directory listing would not show any files beyond it. Other than the problem with that file, everything else continued to work.
The second time, the whole root directory was corrupted, and opening any file resulted in an error.

I think checking for write errors would be too late, since by then the damage to the file system has likely already been done.

The reason to enable CRC is to catch a bus error when it occurs. There is no other way to be sure the error occurred during an SPI transfer.

You are doing very little I/O to the SD card so I have some doubt that it is a bus problem.

But the SD and ethernet libraries should take care of serializing those requests, right?

The application must ensure that a file is not opened multiple times.

This is from the SdFat html documentation:

It is possible to open a file with two or more instances of SdFile. A file may be corrupted if data is written to the file by more than one instance of SdFile.

The most common reason for file system corruption is when an application writes over the SdFat cache. This is typical behavior when that happens.

a directory listing would not show any files beyond that file.

This happens when a zero is written in the first byte of a cached directory entry. When this directory entry is written to the SD, the FAT file system assumes all remaining directory entries are unused.

all my file operations are like this

if ((file = SD.open(filename, O_READ))) {
  // read the data
  file.close();
}

I don't keep any file handles open, nor do I have multiple handles to the same file.
For writes, I always call flush followed by close.

I checked and double checked and just checked again that all file operations are closed.

Is there anything else I should do or can do?
file.sync() is the same as file.flush(), right?

I call SD.mkdir; do I need to call flush after that? That's the only write where I don't do a flush.
But that only occurs at program startup, and I don't see the error until days later.

So how does an application write over the SdFat cache? It should not occur as long as flush and close are called and only one file handle is open to a file, right?

When a web request comes in to get a log file, the code is executing the webserver path, reading the file from SD and writing it to the socket. My logging routine, which is in the main loop, will never execute until that web request completes and returns to the main loop. So I do not see how my sketch could have two file handles open to the same file. I don't have any code that reads from ethernet and writes to SD, only code that reads from SD and writes to ethernet; that is the only time both SPI devices are active (or at least alternately active). The file logging occurs in the main loop, independent of ethernet activity.

I will keep the program running and see whether the SD file corruption occurs again at half speed.

I'm running a breadboarded setup that's reading an SD card for 144 bytes and pushing it out to an LED string, constantly. I've compiled it with SPI_FULL_SPEED and it's been running for days sitting on my desk. Granted, it's still too slow for what I want it to do, but for a test it works. I need to swap out some hardware and see what happens. But yeah, it's been running.

So how does an application write over the SdFat cache? It should not occur as long as flush and close are called and only one file handle is open to a file, right?

Some part of an application writes outside of an array and hits the cache.

This is exactly the pattern that causes file system corruption. When you close the file, the directory entry for the file remains in the cache even though it has been written to the SD. If a bug in other software causes a write over the 512 byte cached directory block you will see a corrupt file system since the cached block is likely to be used.

A call to close() does a flush so you only need to open, write, and close the file. Unfortunately when you reopen a file, the cached directory block will likely be used. In your application a directory block is almost always cached so it's a perfect target for an out of bounds write for days on end.

Often the problem is in another library. An application that runs for long periods, like yours, often reveals this type of bug.

I will keep the program running and see whether the SD file corruption occurs again at half speed.

Why not enable CRC? If there is a bus problem all transfers will be checked and even operations that wouldn't cause file system corruption will fail with CRC errors. All commands and data transfers, both read and write, are CRC protected.

If the user code is overwriting the SD cache, then it won't matter what SPI speed is used, right?
Hence I want to test at half speed: if the problem does not occur after a week, then the problem is not in my code but in SD/SPI. But if it's SPI, then ethernet should also have problems; my program does a lot more ethernet reads/writes than SD.

I'll add the CRC check based on your recommendation, and I'll skip the redundant call to flush.

KirAsh4, thanks for the reply. If you are only doing SD reads, then you probably won't corrupt the SD file system.

It seems the copy of Sd2Card.cpp that comes with the SD library is completely stripped of the CRC code.

SD.h never had CRC. The version of SdFat in SD.h is missing more than three years of SdFat bug fixes and features.

The Arduino company also introduced some bugs in their wrapper for SdFat. I have no connection with maintenance of SD.h.

If the user code is overwriting the SD cache, then it won't matter what SPI speed is used, right?

True.

if the problem does not occur after a week, then the problem is not in my code but in SD/SPI.

Maybe. It is a rare bug and may depend on the SD directory layout so it may take much longer to start again. The whole point of CRC is to be sure the bus is O.K. I wrote the CRC code after tracing over a dozen instances of this bug to memory overwrite. A 16-bit CRC is at least thousands of times better than your method and provides a sure test of hardware transfers.

But if it's SPI, then ethernet should also have problems

No, the decoding of clock and data is totally different in SD cards and the Ethernet controller.

Granted, I see the SdFat that comes with the SD library is from 2009, so I may be hitting a bug.

If lowering the speed fixes the problem (perhaps that is why half speed is the default for SD), I may just leave it at that, and revise my code to work directly with SdFat later if I find I need the faster performance.

thanks

Good luck. None of the bugs I know about in SD.h depend on SPI speed.

I ran some read/write tests to compare half vs. full speed.
I did 1 KB buffered reads/writes on a 40 KB file.

Write is only marginally faster, but read shows about a 30% difference:

half speed read - 0.771 secs
half speed write - 5.410 secs
full speed read - 0.541 secs
full speed write - 5.389 secs

If my data file gets any larger, I may consider switching to full speed and using SdFat directly.

half speed write
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.417s
user 0m0.015s
sys 0m0.015s
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.407s
user 0m0.000s
sys 0m0.046s
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.406s
user 0m0.000s
sys 0m0.046s
full speed write
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.389s
user 0m0.000s
sys 0m0.046s
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.388s
user 0m0.015s
sys 0m0.031s
$ time curl -0 -T m-apex.htm http://192.168.1.15:8000/upload/index.htm
real 0m5.391s
user 0m0.000s
sys 0m0.046s
full speed read
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.550s
user 0m0.000s
sys 0m0.046s
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.536s
user 0m0.000s
sys 0m0.030s
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.538s
user 0m0.015s
sys 0m0.015s
half speed read
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.773s
user 0m0.015s
sys 0m0.061s
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.769s
user 0m0.031s
sys 0m0.000s
$ time curl http://192.168.1.15:8000/index.htm>/dev/null
real 0m0.772s
user 0m0.015s
sys 0m0.030s