Go Down

Topic: Ethernet (w5100) sketch hangs (Read 25 times) previous topic - next topic

draythomp

Has anyone taken a look at the code I mentioned above?

Next, I think measuring this thing causes it to work better.  Aside from the pin 4 fix, I have monitoring code in the ethernet library that doesn't change anything, just reports stuff.  It wouldn't fail.  So, I made a batch file to just nail it as fast as it could with http get requests.
I even asked for the favicon.ico that is so annoying when it shows up constantly in the logs.  It ran through a couple of thousand iterations.  This is on an arduino that is making client requests to several other internet devices and web sites at the same time.  My packet count shot up well over 300K and the darn thing just kept on trucking.  It did the other requests, monitored an XBee network and never a hiccup.  We're talking well over 2000 get requests to the device as fast as it could respond.

A few days ago, it would never have survived that.  So, I'm going back to the library I started with and comment out the pin 4 code and try it again.  If it fails, I'm going to step through the changes and monitoring code one at a time until I get more information.  I'm also leaving in the changes I made to my sketch to cut down on packets.  Things that tried to client.print(variable) and such.  I suspect, that Surfer is right, lots of little packets will kill it.  But, I'll see.

Sometimes you just have to start over.
Trying to keep my house under control http://www.desert-home.com/

draythomp

Quote
You don't really ever need to look at TD_RD. Use the free space register.

The W5100 only updates its internal copy of TX_WR after you send a "SEND" command.


What I was wondering about is that, since we have two processors using the same block of memory without locks on the memory, if the 5100 could be handling a retransmission or etx or something at the same time that we were copying data into the send buffer.  Since the buffer is circular, it could be possible that the 5100 could be reading from it at the same time that we are writing and the consequent updates to the pointer would clash.  It didn't

The 5100 datasheet is clear that the two pointers TX_RX and TX_WR have to be equal and the status register should be correct before copying data into the transmit buffer.  The code in the 5100 routines doesn't do this or check it like that.  However, extensive testing hasn't shown a problem.  That's what I was comparing to see if there was a problem.
Trying to keep my house under control http://www.desert-home.com/

draythomp

OK, found at least one problem.  I took my device all the way back (except for single byte packets) and started over.  I have my curl script that pounds it as fast as the arduino can respond so I can test a lot in a little time.

Ran it and it just worked fine.  I mean thousands of interaction (get /) and no problems at all.  Remember, pin 4 was not initialized.

Got mad and changed my code to send one of my long strings out in a loop that would cause it to send a LOT of one byte packets, around 80, right in the middle of the page...twice because it is in a loop.  It lasted about 100 iterations before it locked up.  Second test 60 iterations.  It got as low as 30 and as high as 162 before locking up.  But it did lock up.

Put the code back in for pin 4 and reran the test.  It locked up in a similar fashion, time after time.

I put a check in the write data processing routine to check the values of the registers against what they should be and no problem was reported even when it locked up.

So, I took the code out that separated it into multiple single byte packets and ran the test again.  It worked a full 1000 times, so I ran it again...and again.  It's setting there blissfully working just fine and I have taken my changes out of the ethernet library.

Net, multiple 1 byte packets will kill it.  I don't know exactly why, but it definitely does.  Pin 4 didn't seem to make any difference at all, but I am setting it high just in case (and, because it's an unknown if I don't).  I also notice there was a very visible slow down in the amount of traffic that I was able to generate with the 1 byte packets (pretty freakin' obvious, right?). 

So, all you folks out there that are using the streams library to talk to the ethernet or client.print(variable); are in for a surprise.

I'm just going to set back and watch it now...........sigh.

Trying to keep my house under control http://www.desert-home.com/

SurferTim

#153
Oct 21, 2011, 11:25 am Last Edit: Oct 21, 2011, 12:33 pm by SurferTim Reason: 1
Quote
Got mad and changed my code to send one of my long strings out in a loop that would cause it to send a LOT of one byte packets, around 80, right in the middle of the page...twice because it is in a loop.  It lasted about 100 iterations before it locked up.  Second test 60 iterations.  It got as low as 30 and as high as 162 before locking up.  But it did lock up.


This is the bug I am talking about. It is only when you send a bunch of packets in a row that it happens, and that was what you tried, wasn't it?

Add: Now for the not-so-good news. My "correction code" will not prevent that. I just ran another test, and the w5100 has less manners than I thought.

I inserted another register read immediately after the send. The w5100 indicates it has read the data from the buffer.

// The registers are read just prior to the send:

TX_RD = 32
TX_WR = 32

// 30 byte send here and immediate register read
// the w5100 changed the TX_RD to equal TX_WR, indicating it is finished reading

TX_RD = 62
TX_WR = 62

// BUT IT ISN'T!! 20 seconds later with NO send, the values are now this:

TX_RD = 1216
TX_WR = 1216

Bad manners! If it had any manners or knew how to share, the registers would still be

TX_RD = 62
TX_WR = 62


fungus


What I was wondering about is that, since we have two processors using the same block of memory without locks on the memory, if the 5100 could be handling a retransmission or etx or something at the same time that we were copying data into the send buffer.  Since the buffer is circular, it could be possible that the 5100 could be reading from it at the same time that we are writing and the consequent updates to the pointer would clash.  It didn't


That makes no sense. The W5100 has to keep the data in the buffer until it receives an ACK from the destination. It will only update the TX_RD pointer after the ACK packet has arrived.


The 5100 datasheet is clear that the two pointers TX_RX and TX_WR have to be equal before copying data into the transmit buffer.


Where does it say that? It's not in my copy of the docs.

If that's true then why bother with the whole circular buffer thing? What's the free space register for...?


(nb. I assume you meant "TX_RD", not "TX_RX")
No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

fungus


Quote
Got mad and changed my code to send one of my long strings out in a loop that would cause it to send a LOT of one byte packets, around 80, right in the middle of the page...twice because it is in a loop.  It lasted about 100 iterations before it locked up.  Second test 60 iterations.  It got as low as 30 and as high as 162 before locking up.  But it did lock up.


This is the bug I am talking about. It is only when you send a bunch of packets in a row that it happens, and that was what you tried, wasn't it?

Add: Now for the not-so-good news. My "correction code" will not prevent that. I just ran another test, and the w5100 has less manners than I thought.



What happens if you wait for the "SEND_OK" bit to appear in the interrupt register after you send each packet?

This bit lets you know that the W5100 has fully processed your "SEND" command.

It might be possible that if you do two "SEND" commands very quickly you could cause bad things to happen ... or that the W5100 has limited memory to remember where the start/end of all the individual packets are in the transmit buffer.

One of the things I don't like about the standard driver is that it doesn't use the interrupt bits to check for completion of actions.

(Remember, to clear the "SEND_OK" bit you write a 4 to the interrupt register - this isn't very clear in the docs)
No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

SurferTim

Thanks again, fungus. I'll look into that also. Something has to be the flag that lets the Arduino know it is finished with that register. Obviously the TX_RD and TX_WR registers being equal isn't it.




fungus



// 30 byte send here and immediate register read
// the w5100 changed the TX_RD to equal TX_WR, indicating it is finished reading

TX_RD = 62
TX_WR = 62

// BUT IT ISN'T!! 20 seconds later with NO send, the values are now this:

TX_RD = 1216
TX_WR = 1216



I find that impossible to believe...there must be something else going on.

No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

SurferTim

#158
Oct 21, 2011, 12:49 pm Last Edit: Oct 21, 2011, 01:26 pm by SurferTim Reason: 1
Quote
I find that impossible to believe...there must be something else going on.

Me too! That is the part that is so confusing.

At first I did think it was the SPI collision with the SD reader. But that problem is gone. This is too repeatable to be a coincidence. And this doesn't happen just now and then. Those registers are different EVERY 20 SECONDS!. They are NEVER the same as when I left them after the last send. Always different by at least a thousand.

Add: If it is using that TX_WR pointer to build a packet, that is ok. I could deal with that if it would change the registers in the correct order. Like this:

TX_WR = 62
TX_RD = 62

// 30 byte send

TX_WR = 92
TX_RD = 62

// Arduino wants to send again, but it knows the w5100 isn't finished because TX_WR != TX_RD
// so it waits

// w5100 needs to build a packet, so it grabs some memory (TX_WR += 1000 as an example), but leaves the TX_RD register alone
// the registers have not been equal yet.

TX_WR = 1092
TX_RD = 62

// Arduino tries the send again, but it knows the w5100 still isn't finished because TX_WR != TX_RD
// so it waits again

// w5100 finishes the send and sets the TX_RD register

TX_WR = 1092
TX_RD = 1092

// Now the Arduino knows the w5100 is finished. TX_WR = TX_RD

fungus

#159
Oct 21, 2011, 02:21 pm Last Edit: Oct 21, 2011, 02:23 pm by fungus Reason: 1

Add: If it is using that TX_WR pointer to build a packet, that is ok. I could deal with that if it would change the registers in the correct order. Like this:


In my experience the W5100 doesn't change TX_WR.

Try printing out the socket number as well - make sure you're looking at the same socket (it's all I can think of...)

No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

SurferTim

#160
Oct 21, 2011, 02:50 pm Last Edit: Oct 21, 2011, 03:18 pm by SurferTim Reason: 1
I know why mine is showing different TX_RD and TX_WR values now. It is because I am an idiot! I am opening and closing a new socket for each send. That will not create this error, and why the values are different for each send. My bad.

The error should occur sending multiple packets on the same socket connection. In that case, my "correction code" should work. If I break up my send into multiple packets and check the registers between packets, they increment as I expect.

Here is a thought. Is there anything that prevents that buffer from wrapping around and overwriting itself if the w5100 gets behind sending packets?

And another thought. If it isn't building the packets in this buffer, where is it building them? How many packets can it store there before that memory overflows.

fungus


Here is a thought. Is there anything that prevents that buffer from wrapping around and overwriting itself if the w5100 gets behind sending packets?


Nope. That's entirely up to you. You can write random values to TX_WR if you want to but don't expect useful results.


And another thought. If it isn't building the packets in this buffer, where is it building them? How many packets can it store there before that memory overflows.


In the "Reserved" RAM from 0x0800-0x4000...?

It has to store the start/size of each packet as you write them to the buffer so all that info probably goes in the reserved RAM too. This is why I suggested above that it might run out of space if you send lots of single byte packets without checking the interrupt flag for success after each one.
No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

draythomp

Quote
Here is a thought. Is there anything that prevents that buffer from wrapping around and overwriting itself if the w5100 gets behind sending packets?


Yes there is.  In socket send it hangs in a loop until there's enough room to put the stuff in.  That's one of the loops I talked about earlier that could (possibly) hang up forever.  There's some weird use of the various flags that is difficult to understand.

On page 33 of the datasheet is where they talk about the tx RR and RW registers being equal to be part of a comprehensive check of the end of a transmission sequence.  The other parts of that are the IR register being right and the SR having the proper value.  In their support forum, they call out that ALL that stuff has to be in place before copying data into the buffer, and the ethernet library appears to be doing a good job of checking that stuff.

While it's true that you can write random stuff into the memory of the 5100, it's not true that you can if you are using the ethernet library.  There are a lot of checks to prevent that.  Also, like Surfer, I checked the tx RW register pretty heavily for changes and it was always right.  The chip indicates that it isn't done messing with the tx buffer using the tx RSR register.  That's why they check it twice in getTXFreeSize, although their pseudocode in the the datasheet shows hanging in a loop until it doesn't change as another method.  That's essentially what Surfer did (if I read it right).

The ethernet code DOES check the IR register for success after each packet sent in socket send.  However, there is an escape in there for when the other end closes the connection.  That's another loop that I wonder about.  There are two loops in there that have the possibility of waiting forever.  Especially if one of those tiny packets gets lost somewhere and there's some obscure bug in the ACK, RETRY logic.

So, I'm convinced that the problem we're seeing is due to tiny packets going into the buffer faster than the chip can service them since it has to actually send them before it can move on.  I'm not sure what to do about it though.  The board isn't run in interrupt mode, it's in poll mode and polling takes time that allows for things like this to happen.
Trying to keep my house under control http://www.desert-home.com/

SurferTim

#163
Oct 21, 2011, 08:11 pm Last Edit: Oct 21, 2011, 08:21 pm by SurferTim Reason: 1
I submitted an enhancement request on the Arduino IDE code website a while back. I just went back there, and someone had left a comment that it had fixed his problem. It concerned accessing the transmit buffer free space register and sending a large image file.
http://code.google.com/p/arduino/issues/detail?id=642&start=200

And he was sending each byte in its own packet, then checking TXFreeSize(). That is what client.free() in this code returns.
That is what this would do, right?
Code: [Select]
while (client.free() > 0) {
   client.write(currentFile.read());
}


Edit: If you didn't notice, this guy has figured out the SS lines for the two SPI interfaces. He is reading from the SD card, and writing to the w5100 without changing the SS lines.  :)

fungus


On page 33 of the datasheet is where they talk about the tx RR and RW registers being equal to be part of a comprehensive check of the end of a transmission sequence ...... ALL that stuff has to be in place before copying data into the buffer


Doesn't that make the whole circular buffer thing pointless?


The ethernet code DOES check the IR register for success after each packet sent in socket send.


Oh, yes, in "socket.cpp"...


However, there is an escape in there for when the other end closes the connection.  That's another loop that I wonder about.  There are two loops in there that have the possibility of waiting forever.  Especially if one of those tiny packets gets lost somewhere and there's some obscure bug in the ACK, RETRY logic.


All arrows in the state diagram on page 29 lead back to "CLOSED" so it can't loop forever without a bug in the W5100.

It's easy to find out if that's where the crash is happening...add a timer to the loop which tests IR and print messages:

Code: [Select]

uint16_t sendto(...)
{
 ...
 unsigned long start = millis();
 while ( (W5100.readSnIR(s) & SnIR::SEND_OK) != SnIR::SEND_OK )
 {
   /* m2008.01 [bj] : reduce code */
   if ( W5100.readSnSR(s) == SnSR::CLOSED )
   {
     close(s);
     return 0;
   }
   if ((millis()-start)>10000) {
     start = millis();
     Serial.println("wait for IR is taking too long!");
   }
 }
 ...





So, I'm convinced that the problem we're seeing is due to tiny packets going into the buffer faster than the chip can service them since it has to actually send them before it can move on.


If that's true it should have been solved by waiting for (TX_RD==TX_WR) before sending each packet.

No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

Go Up