DHCP request (Ethernet.begin(mac)) hangs in parseDHCPResponse of Dhcp.cpp

Hi,

First, as a first-time poster I'd just like to state that the Arduino hardware, software, libraries, and user communities are fantastic. I'm truly amazed at the quality.

I am planning to use an Arduino to monitor power buses for our computer room in a Fortune 500 company. I purchased an Ethernet shield a while back and ran a couple of the examples, and everything ran properly out of the box (the primary test being UdpNtpClient).

I finally brought the hardware into the office and started trying to make it work here. Ethernet.begin(mac) failed (hung) right off the bat. I temporarily bypassed the problem by using static IP addressing, but it troubled me DHCP wasn't working so I came back to figure it out.

I found that the call to parseDHCPResponse never returned. Digging deeper, I found that this function continues trying to process DHCP options after the end option ($FF) is hit.

When endOption is parsed, it just drops out of the switch statement:
case endOption :
    break;

There should be nothing left in the buffer at this point, so the while loop should exit:
while (_dhcpUdpSocket.available() > 0)

BUT for some reason it doesn't. I examined the packet with a protocol analyzer, and there is nothing in the packet after the FF option.

I don't understand what is happening well enough to figure out why the buffer isn't empty but I am able to make it empty by modifying the endOption case like this:

case endOption :
    _dhcpUdpSocket.flush();
    break;

This works and DHCP no longer hangs when I use our work DHCP server.

I can only guess that the reason the original Dhcp.cpp code works at home and not in the office has something to do with the options the office DHCP server transmits. It transmits these options: 53, 1, 58, 59, 51, 54, 3, 16, 14, $FF.

My 'fix' works, but it doesn't address the underlying reason as to why either the buffer isn't empty or the function at least doesn't think it is empty.
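
For anyone following along, here is a minimal sketch of how a DHCP option walker can stop at the end option. It is plain C++ for a desktop compiler, not the code in Dhcp.cpp: `walkOptions` and `codesSeen` are made-up names for illustration, and it reads from a byte vector rather than from the UDP socket.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Ethernet library.
// Walks DHCP options laid out as (code, length, value...) and returns the
// offset just past the end option (0xFF). If no end option is found, or the
// data is truncated, it returns buf.size().
std::size_t walkOptions(const std::vector<uint8_t>& buf,
                        std::vector<uint8_t>& codesSeen)
{
    const uint8_t padOption = 0x00;
    const uint8_t endOption = 0xFF;

    std::size_t i = 0;
    while (i < buf.size()) {
        const uint8_t code = buf[i++];
        if (code == endOption) {
            return i;               // stop here; later bytes are not options
        }
        if (code == padOption) {
            continue;               // pad is a single byte with no length
        }
        if (i >= buf.size()) {
            break;                  // truncated: length byte is missing
        }
        const uint8_t len = buf[i++];
        if (buf.size() - i < len) {
            break;                  // truncated: value runs past the buffer
        }
        codesSeen.push_back(code);
        i += len;                   // skip the value without interpreting it
    }
    return buf.size();
}
```

With the bytes 35 01 02 01 04 FF FF E0 00 FF 0A 02, the walker records options 0x35 and 0x01 and returns offset 10. The FF bytes inside the subnet mask value are skipped because the length byte covers them, and the trailing 0A 02 is never parsed.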

Dan

It probably thinks it has stuff in the rx buffer when it is empty. The "605 Bug" is the most likely suspect. It affects nearly all Ethernet functions, including UDP and DHCP.
http://code.google.com/p/arduino/issues/detail?id=605

Here is a thread where another user had about the same problem with dhcp:

Thanks for the response.

It sounded like this should fix the problem, yet somehow doesn't. I spent the afternoon trying to figure out the problem myself. Even with your change, when I first get back the DHCPOffer packet, it reports 614 bytes in the UDP part of the packet. Looking at Wireshark, the correct size should have been 311.

Here is how I examine the size of the udp packet (see Serial.println):

uint8_t DhcpClass::parseDHCPResponse(unsigned long responseTimeout, uint32_t& transactionId)
{
    uint16_t avail;
    uint16_t cc = 0;
    uint8_t type = 0;
    uint8_t opt_len = 0;

    unsigned long startTime = millis();

    while((avail = _dhcpUdpSocket.parsePacket()) <= 0)
    {
        if((millis() - startTime) > responseTimeout)
        {
            return 255;
        }
        delay(50);
    }
    // start reading in the packet
    RIP_MSG_FIXED fixedMsg;
    Serial.print("~bytes at top of parseDHCPResponse=");Serial.println(_dhcpUdpSocket.available());
    _dhcpUdpSocket.read((uint8_t*)&fixedMsg, sizeof(RIP_MSG_FIXED));

I checked this all the way to the code you suggest modifying and it returns what I see here.

The other puzzling part of this is I used this same arduino/ethernet shield and version of the compiler at home and it worked fine.

It sounded like this should fix the problem, yet somehow doesn't.

Somehow doesn't? Does it somehow still lock up in the Ethernet.begin(mac) dhcp routine? Did your shield somehow get an ip assigned or somehow not?

Correct, it never returns from Ethernet.begin because it never finds the end of the packet before it goes loopy.

One difference between my home and office environments: at the office there is a redundant DHCP server. The DHCP request gets 2 offers from 2 servers. I wonder if somehow those two packets are in the buffer together and that is why the code thinks there is more data when there should not be. I wouldn't think that would be possible, but I've never worked this close to hardware before.

Unfortunately it would be difficult for me to filter out the 2nd server's offer, but I'm thinking when I get into the office tomorrow I'll dump the entire buffer and look at it to see if I see 2 offers in it.
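
A rough sketch of the kind of dump meant here, written as plain desktop C++ so it can be tried off the board. `hexDump` is a made-up helper name, not something from the Ethernet library; on the Arduino itself the bytes would have to be read from the socket and printed over Serial instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats bytes as two-digit uppercase hex,
// 16 bytes per line, separated by single spaces.
std::string hexDump(const uint8_t* data, std::size_t len)
{
    std::string out;
    char cell[4];   // two hex digits, one separator, terminating NUL
    for (std::size_t i = 0; i < len; ++i) {
        std::snprintf(cell, sizeof(cell), "%02X ",
                      static_cast<unsigned>(data[i]));
        out += cell;
        // End the line after every 16th byte and after the last byte.
        if ((i + 1) % 16 == 0 || i + 1 == len) {
            out.back() = '\n';   // replace the trailing space with a newline
        }
    }
    return out;
}
```

Passing the four bytes 02 01 06 00 gives the single line "02 01 06 00" followed by a newline.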

Two dhcp servers on the same localnet, both sending an offer, may confuse the shield.

Sure enough, both packets are in the hardware buffer, which explains why the code doesn't stop when you would expect it to:

02 
01 06 00 00 00 03 31 00 00 80 00 00 00 00 00 0A 
02 EF 13 0A FE 01 1F 0A 02 E1 05 90 A2 DA 0D 02 
6B 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 63 82 53 63 35 
01 02 01 04 FF FF E0 00 3A 04 00 13 C6 80 3B 04 
00 22 9B 60 33 04 00 27 8D 00 36 04 0A FE 01 1F 
03 04 0A 02 E1 01 06 0C 0A 02 C1 10 0A 01 0A 1E 
0A FE 01 1F 0F 07 64 72 31 2E 65 69 00 FF 0A 02 
E1 04 00 43 01 2F 02 01 06 00 00 00 03 31 00 00 
80 00 00 00 00 00 0A 02 EF 13 0A FE 01 1F 0A 02 
E1 04 90 A2 DA 0D 02 6B 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 63 82 53 63 35 01 02 01 04 FF FF E0 00 3A 
04 00 13 C6 80 3B 04 00 22 9B 60 33 04 00 27 8D 
00 36 04 0A FE 01 1F 03 04 0A 02 E1 01 06 0C 0A 
02 C1 10 0A 01 0A 1E 0A FE 01 1F 0F 07 64 72 31 
2E 65 69 00 FF
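
The eight bytes between the first FF and the second reply (0A 02 E1 04 00 43 01 2F) look like the header the W5100 stores in front of each received UDP datagram. The sketch below decodes them, assuming the layout is 4 bytes peer IP, 2 bytes peer port, and 2 bytes data length, all big-endian. `UdpRxHeader` and `decodeUdpRxHeader` are made-up names for illustration, not library code.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical struct for illustration; field names are not from the library.
struct UdpRxHeader {
    std::string peerIp;   // dotted-quad sender address
    uint16_t peerPort;    // sender UDP port
    uint16_t dataLen;     // payload bytes that follow this header
};

// Decodes an 8-byte per-datagram header, assuming:
// bytes 0-3 peer IP, bytes 4-5 peer port, bytes 6-7 payload length.
UdpRxHeader decodeUdpRxHeader(const uint8_t h[8])
{
    char ip[16];   // "255.255.255.255" plus NUL
    std::snprintf(ip, sizeof(ip), "%u.%u.%u.%u",
                  static_cast<unsigned>(h[0]), static_cast<unsigned>(h[1]),
                  static_cast<unsigned>(h[2]), static_cast<unsigned>(h[3]));

    UdpRxHeader out;
    out.peerIp = ip;
    out.peerPort = static_cast<uint16_t>((h[4] << 8) | h[5]);
    out.dataLen = static_cast<uint16_t>((h[6] << 8) | h[7]);
    return out;
}
```

For the bytes in the dump this gives 10.2.225.4, port 67, length 303. That length matches the first reply as dumped (240 bytes up to and including the magic cookie, plus 63 bytes of options). It also seems to line up with the 614 bytes reported earlier: 303 + 8 + 303 = 614, if the first header had already been read when available() was called.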

My fix in Dhcp.cpp fixes the problem though I don't know if it is the best solution (perhaps it will also flush something it shouldn't):

case endOption :
    Serial.println("end option hit");
    _dhcpUdpSocket.flush(); // dwh
    break;

Granted, this problem is going to be rare; not many individuals or companies are running redundant DHCP servers. But should I try to notify someone of this problem and the possible fix? It has definitely burned up several hours of mine!

Dan

Very rare.

In your application, the best you can hope for is for you to find the solution. I do not recommend two dhcp servers on the same localnet.

My routers have a routine that checks for what is referred to as a "rogue dhcp server" on each localnet. If it finds one, it will attempt to "arp poison" any contact with that mac address.

I had this exact same issue - worked great at home, hung at work. DanH’s fix worked great for me, thanks!

Cool, I'm glad I was able to help someone!

I went back and looked carefully at our network and found the problem wasn't due to a backup DHCP server, but to a routing problem that caused the same DHCP reply to arrive through both the primary connection back to HQ and the backup connection.

I am experiencing this symptom (hang on Ethernet.begin(mac)), so I want to try DanH's fix, but it's not obvious to me where to add his line of code. Did he have a "case endOption" in his sketch, or did he somehow manage to add it to the Ethernet library code? If the latter, how?

The fix was made in the file Dhcp.cpp in the procedure parseDHCPResponse. If you look at

https://code.google.com/p/arduino/issues/attachmentText?id=716&aid=7160017002&name=Dhcp.cpp&token=dc0FWLnNWk097X5Gg0mwu8iETSE%3A1330153164746

the fix would be inserted after line 295.

DanH