Strange buffer corruption with W5200 & Arduino EthernetUDP

Hi,

I've been grappling with this problem for almost a week, so I'm hoping that by posting it someone might know a bit more than me! Any help greatly appreciated! :slight_smile:


I am encountering a strange transient problem with the W5200 ethernet chip coupled with the Arduino's EthernetUDP library. Yes, I am using a version of the library that is compatible with the W5200 - specifically this one.

In the project, we are sending fixed-length (64-byte) UDP packets to an Arduino Due via the W5200. At the moment, the actual contents of the packets are (almost) ignored; we just count how many packets of the correct type we received. Obviously in future we will examine the contents, but the problem is already happening before we've reached that stage.

We're sending a UDP packet every 1-5ms. At the moment, the Due is sending packets back only occasionally (on user input), but eventually it will have to do so at much the same rate.

Everything works swimmingly initially, but after a non-specific number of packets received (usually a few hundred thousand, though we've had ~200 and ~4 million as outliers when trying to diagnose this problem), parsePacket() and available() on the UDP library start reporting absurd packet sizes. As best as I can tell, the incoming UDP datagram is for some reason not being written correctly, or is being overwritten, or is written offset from where it should be. When the UDP library then tries to extract the packet length from the UDP header, it is in fact looking at the wrong data (some other part of the datagram), causing the absurd values.

I have code in place that should call flush() if parsePacket() returns anything other than 64 or 0, clearing the buffer so that reception can recover gracefully. It doesn't work: once the failure occurs, reception stays broken until a hard reset is applied.

My code is somewhat in-depth and complex, but I have posted the part I think is most relevant below. If you need other parts, please let me know.

void UDPOQT_RECEIVER::receive()
{

    byte packetBuffer[UDPOQT_SIZE_MESSAGE];
    byte parseStatus;
    
    //Actually receive data
    int packetLength = _udpEngine->parsePacket();   //Look for a packet
    switch (packetLength) {
      case UDPOQT_SIZE_MESSAGE:  
      
        //Packet with correct length found - read it in.
        _udpEngine->read(packetBuffer, UDPOQT_SIZE_MESSAGE);
        //Flush out any rubbish on the end
        _udpEngine->flush();
          
        parseStatus = _message.parseMessage(packetBuffer, UDPOQT_SIZE_MESSAGE);
        if (parseStatus == 0) {
          //Message was successfully added.
          _messageValid = true;
        } else {
          DLOG("! -> Message failed to parse due to upstream error #");
          DLOGLN(parseStatus);
        }
        
        break;
      case 0:  //No packet detected
        ;  //Do nothing.
        break;
      default:  //Invalid packet size - discard
        DLOG("! -> Discarded a packet with invalid length ");
        DLOGLN(packetLength);
        while (_udpEngine->available() > 0) {
          if (_udpEngine->available() > UDPOQT_SIZE_MESSAGE) {
            _udpEngine->read(packetBuffer, UDPOQT_SIZE_MESSAGE);
            DHEX(packetBuffer, UDPOQT_SIZE_MESSAGE);
          } else {
            int len = _udpEngine->available();
            _udpEngine->read(packetBuffer, len);
            DHEX(packetBuffer, len);
          }
        }
        _udpEngine->flush();
    }
}

Note that DLOG, DLOGLN and DHEX are debug functions that do Serial.print, Serial.println and Serial.println of a byte array converted to hex.
Normally the while loop under the switch's default case would not be present; it's only there as a result of my diagnostic efforts.

I have attached a text file containing the Serial log from the most recent failure, which occurred after ~2 million packets. Caution: it's a big file!

The first failure, on line 16, is followed by a dump of the 16128 bytes supposedly contained in the absurd packet. First, there are 212 lines which represent 106 identical repetitions of the content of one UDP datagram (sans header), each being 64 bytes long. The interpretation of that packet is in the following code blob:

Version 05 00 03 00    [two unsigned shorts, representing version 5.3, constant]
Type    01             [unsigned byte, constant in this test]
Channel 0E             [unsigned byte, constant in this test]
Resrvd  00 00
MsgID   77 42 2E 00    [unsigned long, should increment]
SrcIP   0A E9 A6 0C    [unsigned long, representing IP address of source computer]
Missed  00 00 00 00    [unsigned long, constant in this test]
Resrvd  00 00 00 00
VarName 63 6C 75 74 63 68 5F 70 65 64 61 6C 5F 70 6F 73 69 74 69 6F B0 08 07 20 26 40 00 00 75 01 08 00	[Null-padded 32-byte string, random in this test]
VarData 64 0A 07 20 DF 25 08 00 [double, random in this test]

(don't ask, I didn't design the payload's format)
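For anyone who wants to decode the dump themselves, here is a minimal sketch of the field offsets implied by the table above. The names (`kOffMsgId`, `MessageSummary`, `summarize` and so on) are made up for this post and are not taken from my actual code, and I'm assuming the version shorts are little endian like the message ID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offsets inside the 64-byte payload, read off the table above.
// All names here are hypothetical and exist only for illustration.
constexpr std::size_t kMessageSize = 64;
constexpr std::size_t kOffVersion  = 0;   // two unsigned shorts (4 bytes)
constexpr std::size_t kOffType     = 4;   // 1 byte
constexpr std::size_t kOffChannel  = 5;   // 1 byte, then 2 reserved bytes
constexpr std::size_t kOffMsgId    = 8;   // 4 bytes
constexpr std::size_t kOffSrcIp    = 12;  // 4 bytes
constexpr std::size_t kOffMissed   = 16;  // 4 bytes, then 4 reserved bytes
constexpr std::size_t kOffVarName  = 24;  // 32 bytes
constexpr std::size_t kOffVarData  = 56;  // 8 bytes
static_assert(kOffVarData + 8 == kMessageSize, "fields must total 64 bytes");

// Assemble little-endian values byte by byte, so the result does not
// depend on the alignment or endianness of the machine running this.
inline uint16_t readU16LE(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

inline uint32_t readU32LE(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

struct MessageSummary {
  uint16_t versionMajor;
  uint16_t versionMinor;
  uint8_t  type;
  uint8_t  channel;
  uint32_t msgId;
};

// Extract the fixed fields at the front of one 64-byte payload.
inline MessageSummary summarize(const uint8_t* payload, std::size_t len) {
  assert(len >= kMessageSize);
  MessageSummary s{};
  s.versionMajor = readU16LE(payload + kOffVersion);
  s.versionMinor = readU16LE(payload + kOffVersion + 2);
  s.type         = payload[kOffType];
  s.channel      = payload[kOffChannel];
  s.msgId        = readU32LE(payload + kOffMsgId);
  return s;
}
```

Fed the bytes from the table, this gives version 5.3, type 0x01, channel 0x0E and a message ID of 0x002E4277.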

At line 229, things change. There are 20 bytes which look like the end of a valid datagram, but obviously the previous 44 bytes are missing. Then there is an 8-byte UDP header; "0A E9 A6 0C" is the source IP, "E4 C8" is the checksum and "00 40" is the length of the following packet (0x40 being 64 bytes, as it should be). Then another valid datagram follows, though the Message ID is discontinuous with the previous one that was repeated 106 times. Thereafter the pattern repeats, with incrementing Message IDs up to the end of the 16128-byte "packet" at line 773.
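To make the header layout concrete, here is a small sketch of how I am reading those 8 bytes. `PacketInfo` and `decodePacketInfo` are hypothetical names for this post, not part of the EthernetUDP library, and bytes 4-5 are kept as a raw value since only the length matters for this problem.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not part of the EthernetUDP library.
struct PacketInfo {
  uint8_t  sourceIp[4];  // bytes 0-3 of the header
  uint16_t bytes4to5;    // bytes 4-5, kept raw and not interpreted here
  uint16_t length;       // bytes 6-7, big-endian length of the payload
};

// Decode the 8 bytes that precede each datagram in the receive buffer.
inline PacketInfo decodePacketInfo(const uint8_t* header) {
  PacketInfo info{};
  for (int i = 0; i < 4; ++i) {
    info.sourceIp[i] = header[i];
  }
  info.bytes4to5 = static_cast<uint16_t>((header[4] << 8) | header[5]);
  info.length    = static_cast<uint16_t>((header[6] << 8) | header[7]);
  return info;
}
```

With the header seen in the dump (0A E9 A6 0C E4 C8 00 40), `length` comes out as 64, as expected for a healthy packet.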

Then something interesting happens. The next erroneous packet continues the pattern, except that there are eight bytes missing. I believe those eight bytes were consumed by the UDP library, which treated them as the UDP header of the packet. Since those bytes (in this case) happen to correspond with the last four bytes of VarName and the first four bytes of VarData, it has interpreted that as the source IP, packet checksum and packet length, giving the incorrect packet length that it then uses.

This is confirmed at the next packet boundary (line 1079), where the missing bytes correspond to Type, Channel, Reserved and MsgID. The two bytes interpreted as the packet length here are therefore the final two bytes of the Message ID at that point. Since the message IDs increment (and are little endian), these bytes are the same as those on the immediately surrounding packets, so I know they are "2E 00", which read as 0x2E00 equals 11776 in decimal: exactly the packet length reported at this point.
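The misalignment described above can be reproduced with a toy model on a PC. This is only a sketch of my theory, not the real library or chip behaviour: `appendRecord` and `parseAt` are invented names, and the "receive buffer" is just a vector holding 8-byte headers followed by 64-byte payloads.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kHeaderSize  = 8;
constexpr std::size_t kPayloadSize = 64;

// Append one record (8-byte header + 64-byte payload) to the model buffer.
// Header and leading payload bytes are copied from the dump; the rest of
// the payload is left as zeros because it does not matter here.
inline void appendRecord(std::vector<uint8_t>& stream, uint32_t msgId) {
  const uint8_t header[kHeaderSize] = {0x0A, 0xE9, 0xA6, 0x0C,
                                       0xE4, 0xC8, 0x00, 0x40};
  stream.insert(stream.end(), header, header + kHeaderSize);

  uint8_t payload[kPayloadSize] = {0x05, 0x00, 0x03, 0x00,   // Version
                                   0x01, 0x0E, 0x00, 0x00};  // Type, Channel, Resrvd
  for (int i = 0; i < 4; ++i) {
    payload[8 + i] = static_cast<uint8_t>(msgId >> (8 * i));  // MsgID, little endian
  }
  stream.insert(stream.end(), payload, payload + kPayloadSize);
}

// Toy parse step: treat the 8 bytes at `pos` as a header, return the
// big-endian length in bytes 6-7 and advance `pos` past the header.
// Returns 0 if fewer than 8 bytes remain.
inline uint16_t parseAt(const std::vector<uint8_t>& stream, std::size_t& pos) {
  if (pos + kHeaderSize > stream.size()) {
    return 0;
  }
  const uint16_t length =
      static_cast<uint16_t>((stream[pos + 6] << 8) | stream[pos + 7]);
  pos += kHeaderSize;
  return length;
}
```

Starting at offset 0 of a record with message ID 0x002E4277, `parseAt` reports 64. Starting at offset 12 (four bytes into the payload, so that Type, Channel, Resrvd and MsgID are consumed as the "header") it reports 11776, matching the log.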


AAAAAAnnnnnddd.... that's where I am right now. I don't know why it's failing to start with, I don't know why subsequent calls to flush() don't reset the buffer and allow normal operation thereafter, and I don't know how to fix it. I have a feeling that a pointer somewhere is getting mucked up and pointing to the wrong place, but I don't know which or how, or why it's not reset by flush().

Sorry for the long post, but I thought I should give as much information as I have. If anyone can offer any ideas, I will be most grateful :slight_smile:

dump1.txt (882 KB)