UDP freezes, possible cure found(???)

So yeah,
I’ve asked a lot of questions here, so now that I’m on the trail of something I’d like to contribute!

Been struggling for a while now. I have multiple (25+) Arduino Ethernet devices that send and receive UDP at a rate of 10 to 40 Hz, with packets of over 40 bytes. During testing we found that the units froze at random. Some make it to 24 hours, but most freeze after 2 to 4 hours. Resetting them with a watchdog did the trick, but that caused another issue: the PWM output was reset as well.

I tried a whole bunch of things… from disabling the SD socket (holding its SPI select pin high) to resetting the W5100 with an output pin…
After a typical debugging session I found out that it froze during a routine that sends the UDP packets, and after some more debugging I found out that the function

Udp.endPacket();

caused all the problems. The sketch calls the function but never returns from it…
Digging into the library source, I found a lot of “while” loops that wait until a certain condition is met. One of them, in socket.cpp, is the following (found after, yes, more debugging!):

int sendUDP(SOCKET s)
{
  W5100.execCmdSn(s, Sock_SEND);
  /* +2008.01 bj */
  while ((W5100.readSnIR(s) & SnIR::SEND_OK) != SnIR::SEND_OK )
  {
    if (W5100.readSnIR(s) & SnIR::TIMEOUT)
    {
      /* +2008.01 [bj]: clear interrupt */
      W5100.writeSnIR(s, (SnIR::SEND_OK|SnIR::TIMEOUT));
      return 0;
    }
  }

  /* +2008.01 bj */	
  W5100.writeSnIR(s, SnIR::SEND_OK);

  /* Sent ok */
  return 1;
}

And sometimes it never recovers from this while loop!

So what I did, was the most DIRTY way of finding out what actually happens:

int sendUDP(SOCKET s)
{
  W5100.execCmdSn(s, Sock_SEND);

  /* Poll counter; it has to start at 0, otherwise the limit below means nothing */
  int t = 0;

  /* +2008.01 bj */
  while (((W5100.readSnIR(s) & SnIR::SEND_OK) != SnIR::SEND_OK) && t < 25500)
  {
    if (W5100.readSnIR(s) & SnIR::TIMEOUT)
    {
      /* +2008.01 [bj]: clear interrupt */
      W5100.writeSnIR(s, (SnIR::SEND_OK|SnIR::TIMEOUT));
      return 0;
    }
    t++;
  }

  /* +2008.01 bj */
  W5100.writeSnIR(s, SnIR::SEND_OK);

  /* Counter (nearly) ran out: neither SEND_OK nor TIMEOUT showed up in time */
  if (t > 25000)
  {
    return 2;
  }

  /* Sent ok */
  return 1;
}

Just increment an integer, so that when it goes above a set amount the code exits the while loop and lets me know.

And guess what: The devices now run for over 5 days, without issues… It does return “2” on occasion, but continues to run without problems!

So, I know it’s not the cleanest way of finding out, but it works for me!
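A somewhat cleaner variant of the same idea would be to bound the wait by elapsed time instead of by loop iterations. What follows is only a minimal sketch, not the library's code: `waitForSend`, `SendResult`, `kSendOk` and `kTimeout` are names made up for illustration, and the two bit masks are assumed to match SnIR::SEND_OK and SnIR::TIMEOUT in the W5100 header. The register read and the clock are passed in as callables so the logic can be tried off the board.

```cpp
#include <cstdint>

// Outcome of a bounded wait. The numeric values mirror the return codes
// used in the modified sendUDP() above (0 = chip timeout, 1 = ok, 2 = gave up).
enum class SendResult : uint8_t { Timeout = 0, Ok = 1, GaveUp = 2 };

// ASSUMPTION: bit masks of SEND_OK and TIMEOUT in the Sn_IR register.
constexpr uint8_t kSendOk  = 0x10;
constexpr uint8_t kTimeout = 0x08;

// readIR: callable returning the current Sn_IR value of the socket.
// nowMs:  callable returning a millisecond counter (millis() on Arduino).
template <typename ReadIR, typename NowMs>
SendResult waitForSend(ReadIR readIR, NowMs nowMs, uint32_t maxWaitMs)
{
  const uint32_t start = nowMs();
  // Unsigned subtraction keeps working when the millisecond counter wraps.
  while (static_cast<uint32_t>(nowMs() - start) < maxWaitMs) {
    const uint8_t ir = readIR();
    if (ir & kSendOk)  return SendResult::Ok;
    if (ir & kTimeout) return SendResult::Timeout;
  }
  return SendResult::GaveUp;  // neither flag appeared within maxWaitMs
}
```

On the Arduino side the two callables would presumably wrap W5100.readSnIR(s) and millis(). The caller would still have to clear the interrupt bits with W5100.writeSnIR() afterwards, as the original loop does.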

I’d really like your thoughts and constructive criticism, and I’d like to know whether anyone else has this issue or another way of solving it.

My question is: why is it not timing out?

    if (W5100.readSnIR(s) & SnIR::TIMEOUT)
    {
      /* +2008.01 [bj]: clear interrupt */
      W5100.writeSnIR(s, (SnIR::SEND_OK|SnIR::TIMEOUT));
      return 0;
    }

edit: Have you changed the setRetransmissionTime setting in the w5100?

SurferTim: My question is: why is it not timing out?

I did indeed "play around" with both the retransmission time and count, but it didn't seem to have much effect. Why it doesn't time out is a mystery to me too. When I first dove into the problem I skipped that loop, thinking: there is a timeout check in there, so it must be OK! But when nothing else had any effect, I wanted to be sure...

Last I checked, the timeout worked. I'll check it when I get a chance.

OK, I checked. The UDP.endPacket() call will time out and return 0 according to the settings in setRetransmissionCount and setRetransmissionTime. It does not freeze.

FYI: The setRetransmissionCount must be a value of 1 or greater. The default is 8. The setRetransmissionTime is in 100us increments. The default is 2000 (200ms).

edit: Last time I checked, the setRetransmissionCount caused a failure if set to 0. It never sent a packet.

edit2: I just checked again, and that is true. A value of 0 never sends a packet.
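To put a number on those defaults, here is a small worked calculation. It assumes the overall timeout is the retransmission time multiplied by (count + 1), which is how I read the W5100 datasheet; treat that formula as an assumption and check it against the datasheet. `worstCaseTimeoutMs` is a made-up helper name, not part of the Ethernet library.

```cpp
#include <cstdint>

// rtr: retransmission time register value, in units of 100 us.
// rcr: retransmission count register value.
// ASSUMPTION: total wait = rtr * 100 us * (rcr + 1).
// Dividing by 10 converts 100 us units to milliseconds.
constexpr uint32_t worstCaseTimeoutMs(uint16_t rtr, uint8_t rcr)
{
  return (static_cast<uint32_t>(rtr) * (static_cast<uint32_t>(rcr) + 1u)) / 10u;
}

// Defaults quoted above: 2000 (200 ms) and 8 -> 200 ms * 9 = 1800 ms.
static_assert(worstCaseTimeoutMs(2000, 8) == 1800, "default settings");
```

If that formula holds, any extra wait limit shorter than this figure would give up before the chip reports its own timeout.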

SurferTim: OK, I checked. The UDP.endPacket() call will time out and return 0 according to the settings in setRetransmissionCount and setRetransmissionTime. It does not freeze.

I didn't try it with "0" for the count, but good to know. After another night of testing I found that the timeout does trigger most of the time, but every so often it doesn't, and the W5100 freezes. This happens both with the default values and with adjusted values for the retransmission time and count. The only explanation I can think of is that somewhere between the call of the function and the timeout, a buffer overflows or something like that. As I said, I have around 20 to 30 devices that send and receive packets 10 to 40 times a second. The smallest packet is 42 bytes and the biggest around 100 bytes. By returning from the UDP.endPacket() function with a non-default value and then resetting the W5100 (by pulling its reset pin low with a digital output), I'm able to carry on with reliable devices.
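The recovery step described above could be sketched roughly like this. It is an illustration under assumptions, not the code running on the devices: `recoverIfStuck` is a hypothetical helper, the value 2 is the extra return code from the modified sendUDP(), and the reset and re-initialisation actions are passed in as callables because they depend on the wiring and the sketch.

```cpp
#include <cstdint>

// sendResult: value returned by Udp.endPacket() with the modified sendUDP()
//             (0 = chip timeout, 1 = sent ok, 2 = wait loop gave up).
// pulseReset: callable that drives the W5100 reset pin low briefly, then high.
// reinit:     callable that sets up Ethernet and UDP again after the reset.
// stuckCount: running total of give-ups, useful for logging.
// Returns true when a reset was performed.
template <typename PulseReset, typename Reinit>
bool recoverIfStuck(int sendResult, PulseReset pulseReset, Reinit reinit,
                    uint32_t &stuckCount)
{
  if (sendResult != 2) {
    return false;  // normal outcome, leave the chip alone
  }
  ++stuckCount;
  pulseReset();
  reinit();
  return true;
}
```

Keeping the count separate from the reset makes it easy to report how often the give-up path is actually taken.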

You may be right, but I don't have that many devices to test with. Normally my Arduino is the UDP "client" and a Linux box (RPi) is the "server".

edit: How do you have all these communicating with each other? Are they all working as both a client and server? I know there is not really a client and server, but do they initiate a send and receive on the same socket?

The point is; could the socket have received a packet from another device just before you are trying to send one? Maybe that would cause a fail.

SurferTim: The point is; could the socket have received a packet from another device just before you are trying to send one? Maybe that would cause a fail.

They don't "really" communicate with each other. There is a central server with software controlling them all. The devices respond to the host with a broadcast on a device-specific port (not my idea, customer demand). All devices have individual IP and port settings to keep them apart.

I see. So they all communicate through a central "server". That shouldn't present a problem. I use the same technique, and have never had a lockup. But, like I said, I don't have 25 devices communicating with one server. However, NTP servers don't have problems with this. ??

Do you really mean a broadcast? Most UDP is unicast.

SurferTim: I see. So they all communicate through a central "server". That shouldn't present a problem. I use the same technique, and have never had a lockup. But, like I said, I don't have 25 devices communicating with one server. However, NTP servers don't have problems with this. ??

Do you really mean a broadcast? Most UDP is unicast.

Yeah it's really broadcast. The person responsible for the server/host requested this. My guess is it'll make his software easier to create...

OK. DHCP uses broadcast and it works ok.

Does the w5100 quit locking up with your new timeout feature installed, or do you need to reset it?

SurferTim: OK. DHCP uses broadcast and it works ok.

Does the w5100 quit locking up with your new timeout feature installed, or do you need to reset it?

At first it did quit and started to behave normally. I added an external counter to see how many times it returned the "wrong" value. After a few times it froze completely until a hard reset was triggered. It didn't pass any packets to or from the uP. And thanks for your help btw!

Have you installed a way to examine the socket status? Maybe there is a clue there. Add this function to your sketch and call it when the socket fails. You can call it any time to examine the socket status.

#include <utility/w5100.h>

byte socketStat[MAX_SOCK_NUM];

void ShowSockStatus()
{
  for (int i = 0; i < MAX_SOCK_NUM; i++) {
    Serial.print(F("Socket#"));
    Serial.print(i);
    uint8_t s = W5100.readSnSR(i);
    socketStat[i] = s;
    Serial.print(F(":0x"));
    Serial.print(s,16);
    Serial.print(F(" "));
    Serial.print(W5100.readSnPORT(i));
    Serial.print(F(" D:"));
    uint8_t dip[4];
    W5100.readSnDIPR(i, dip);
    for (int j=0; j<4; j++) {
      Serial.print(dip[j],10);
      if (j<3) Serial.print(".");
    }
    Serial.print(F("("));
    Serial.print(W5100.readSnDPORT(i));
    Serial.println(F(")"));
  }
}

I will, thanks.

I'll run it for a while and let you know what it outputs!

Let me know how it looks after the fail. We can add other checks to that function if it isn't showing enough socket data to determine the cause of the fail.

A status list for that function:
0x0 = available
0x14 = waiting for a connection
0x17 = connected
0x1C = connected waiting for close
0x22 = UDP
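If the raw hex is awkward to read in the serial monitor, the list above can be turned into a small lookup. This is just a sketch: `socketStatusName` is a made-up helper, and it only knows the five values listed here, so anything else comes back as "other".

```cpp
#include <cstdint>

// Translate a socket status byte (as returned by W5100.readSnSR) into the
// descriptions from the status list. Unlisted values map to "other".
const char *socketStatusName(uint8_t status)
{
  switch (status) {
    case 0x00: return "available";
    case 0x14: return "waiting for a connection";
    case 0x17: return "connected";
    case 0x1C: return "connected waiting for close";
    case 0x22: return "UDP";
    default:   return "other";
  }
}
```

Inside ShowSockStatus() this could be printed next to the hex value, for example with Serial.print(socketStatusName(s)).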

Very strange...

Kept it running overnight and it didn't freeze. Normally freezes every couple of hours, but now: Nothing!

So it got me thinking. The only 2 things I did differently were to include your piece of code (nothing strange there) and to put the bootloader back so I could upload over a serial line. I had been uploading through an external programmer over the ISP header, and I was fed up with removing the device from its case and so on...

Sorry I didn't mention that last part earlier...

Could the bootloader have an effect on the stability of the W5100 and the UDP part of my story? And why DOES it crash without the bootloader? It should work normally with or without! The further I get, the more questions I have about the why/how/what of these problems...

Hi,

I appear to be hitting this issue.

I am not calling setRetransmissionCount and setRetransmissionTime, so they should be default.

I am having an issue where it will run for hours while checking a time server every minute, then freeze. Without the NTP time queries, it keeps going for as long as it is left running.

I read that this can happen when a ping is sent to the socket prior to the NTP request, but I am not sure how to move forward. Is using setRetransmissionCount and setRetransmissionTime likely to solve the problem, and is the ping problem likely connected to this? Another thread discusses the ping:

http://forum.arduino.cc/index.php?topic=397002

which was a spin-off from a thread similar to this one:

https://forum.arduino.cc/index.php?topic=396156

I believe there is still a problem to solve here?