help with EthernetClient.connect problem

We are having great difficulty establishing a connection between the Arduino and our Linode (custom server app). This app is doing a traditional accept -> fork child to handle the socket.

It seems that even though the server side sees a socket connect come through (i.e. the child gets spawned), connect() on the Arduino is returning 0. I see three possible places where this could occur in the source code for the connect method.

========= begin code snip from EthernetClient.cpp ==========

if (_sock == MAX_SOCK_NUM)
  return 0;

_srcport++;
if (_srcport == 0) _srcport = 1024;
socket(_sock, SnMR::TCP, _srcport, 0);

if (!::connect(_sock, rawIPAddress(ip), port)) {
  _sock = MAX_SOCK_NUM;
  return 0;
}

while (status() != SnSR::ESTABLISHED) {
  delay(1);
  if (status() == SnSR::CLOSED) {
    _sock = MAX_SOCK_NUM;
    return 0;
  }
}

========= end snip ==============================

Since our server app is seeing the connect come through I'm guessing the return 0 is the last one. (Any ideas how I can add debug to be certain?).

I understand the socket() -> connect() sequence because it is the same sequence our Linux-based C client uses (which works every time, by the way).
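One detail of the socket() step in the snippet that differs from a typical Linux client: the library picks the source port itself (the _srcport lines), whereas a Linux client that does not call bind() leaves that choice to the kernel. The sketch below isolates that port-advance logic so it can be checked on a desktop. It assumes _srcport is a 16-bit unsigned value; nextSourcePort is a hypothetical helper name, not something in the library.

```cpp
#include <cstdint>

// Assumption: the library keeps the source port in a 16-bit unsigned value.
// Mirrors "_srcport++; if (_srcport == 0) _srcport = 1024;" from the snippet.
uint16_t nextSourcePort(uint16_t current) {
    ++current;                          // wraps from 65535 to 0
    if (current == 0) current = 1024;   // skip the low ports after a wrap
    return current;
}
```

Printing the value this logic produces before each connect attempt would show which source port each try uses.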

What is the while() loop doing?

Thanks in advance for your help.

bob

By "playing" with return code values we have determined that the connect method is exiting in this block

while (status() != SnSR::ESTABLISHED) {
  delay(1);
  if (status() == SnSR::CLOSED) {
    _sock = MAX_SOCK_NUM;
    return 0;
  }
}

Next step is to add a counter to see how many times this while loop goes round before the CLOSED (ie. timeout) status is noted.
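A rough desktop-side sketch of what that counter could look like follows. It keeps the same loop shape as the block above but takes the status and delay calls as parameters, so it can be run without the shield. SockStatus, ConnectResult and waitForEstablished are hypothetical names introduced for illustration, and the enum values are not the real W5100 status codes.

```cpp
#include <cstdint>
#include <functional>

// Stand-ins for the W5100 socket status codes (illustrative values only).
enum class SockStatus : uint8_t { Closed, SynSent, Established };

struct ConnectResult {
    int code;        // 1 = established, 0 = socket closed before establishing
    uint32_t polls;  // number of times the loop body ran
};

// Same shape as the wait loop in the connect method, plus a poll counter.
ConnectResult waitForEstablished(const std::function<SockStatus()>& status,
                                 const std::function<void()>& delayOneMs) {
    uint32_t polls = 0;
    while (status() != SockStatus::Established) {
        delayOneMs();
        ++polls;
        if (status() == SockStatus::Closed) {
            return {0, polls};   // where the library would return 0
        }
    }
    return {1, polls};
}
```

On the Arduino itself the equivalent would be a local counter incremented next to delay(1) and printed with Serial.println() just before the return 0.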

I tried to find out where that timeout is set in the W5100 datasheet, without any luck.

Any suggestions?

As I indicated in my previous email, the socket connection is succeeding as far as the server is concerned. It has accepted the socket and forked the child to handle it. However, it appears that the W5100 state machine isn't seeing this acceptance and is timing out (setting its status to CLOSED). As I also noted, the Linux C client making the same connection always succeeds against this same server.

Thanks in advance for your help.

bob
PS.
In my travels I came across this curious piece of code in w5100.cpp

writeMR(1<<RST);

where RST is apparently = 0x80

Does this not mean shifting 1 to the left 128 times? Strange.
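The shift can be checked in isolation. The sketch below assumes RST is actually defined as the bit position 7 rather than the mask 0x80 (an assumption worth verifying against w5100.h); in that case 1<<RST produces the mask 0x80, i.e. only bit 7 set. RST_BIT and bitMask are stand-in names, not library identifiers.

```cpp
#include <cstdint>

// Assumption: the reset flag is stored as a bit position (7), not as a mask.
constexpr uint8_t RST_BIT = 7;

// Build a one-bit mask from a bit position, as in writeMR(1<<RST).
constexpr uint8_t bitMask(uint8_t position) {
    return static_cast<uint8_t>(1u << position);
}

// If RST is 7, the value written to the mode register is 0x80.
static_assert(bitMask(RST_BIT) == 0x80, "bit 7 corresponds to mask 0x80");
```

If RST really were 0x80, the shift count would exceed the width of the type and the expression would not be meaningful, which is why the bit-position reading seems the more plausible one.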

The server side code is the traditional accept socket -> fork child handler methodology.

====== begin server side snip =======
socket_fd=acceptSocket(mySocket);

fcLogx(__FILE__, fn,
    globalMask,
    TCL_SURROGATE_TRACK,
    "fork child socket_fd=%d",
    socket_fd
    );

myChild = fork();
if(myChild == 0) // in child
    {
    isThisChild=1;
    sprintf(me,"tclSurro_%03d",counter);

    close(mySocket);

    initChild(counter);

    // reset the select stuff
    my_fds[0] = socket_fd; // socket
    my_fds[1] = whatsMyRecvfd(); // fifo
    my_fds[2] = whatsMyReplyfd(); // reply fifo
    FD_ZERO(&watchset);
    FD_SET(my_fds[0], &watchset);
    FD_SET(my_fds[1], &watchset);
    FD_SET(my_fds[2], &watchset);

    for(maxfd=my_fds[0],i=1; i<3; i++)
        {
        if(my_fds[i] > maxfd) maxfd=my_fds[i];
        }
    }
else // in parent
    {
    close(socket_fd);

#ifdef ZOMBIE
    killZombies();
#endif
    fcLogx(__FILE__, fn,
        globalMask,
        TCL_SURROGATE_MARK,
        "in parent"
        );

    counter++;
    if(counter > 999) counter=0;
    }
====== end server side snip ========

where the accept socket function is coded as:

====== begin snip 2 ============
int acceptSocket(int s)
{
static char *fn="acceptSocket";
int ns;

/***********************************************************************/
/*                                                                     */
/* Accept the client connect() function. Like answering a telephone    */
/*                                                                     */
/***********************************************************************/

ns = accept(s, 0, 0);
if (ns == -1)
    {
    printf("File:%s line:%d process:%s %s\n"
        ,__FILE__
        ,__LINE__
        ,fn
        ,"cannot create new (service) socket for client"
        );
    }
return(ns);
}
====== end snip 2 =============

This is very traditional code, where the forked child immediately closes its duplicate of the parent's listener socket and the parent immediately closes its duplicate of the accepted socket. The child then handles the socket communication and the parent returns to listening for connections on the port.

We have full "control" over the Linode server code. Are there any suggestions for debug changes to this server side code that might help shed some light on why the Arduino w5100 chip apparently thinks that the socket (it successfully connects as evidenced by the forking of a child process) has timed out and returned to CLOSED status?
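One server-side debug change that might help is logging the peer address and source port of every accepted connection, so each forked child can be matched against a specific Arduino attempt. The sketch below is only an illustration under assumptions: it is written as C++ for IPv4, and formatPeer and acceptSocketWithPeer are hypothetical names, not part of the existing server code. The only change to the accept() call itself is that it is given a sockaddr to fill in instead of the two zeros.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Render an IPv4 peer address as "a.b.c.d:port" for a log line.
std::string formatPeer(const sockaddr_in& peer) {
    char ip[INET_ADDRSTRLEN] = {0};
    if (inet_ntop(AF_INET, &peer.sin_addr, ip, sizeof(ip)) == nullptr) {
        return "unknown";
    }
    return std::string(ip) + ":" + std::to_string(ntohs(peer.sin_port));
}

// Hypothetical variant of acceptSocket() that also reports who connected.
int acceptSocketWithPeer(int listener, std::string& peerText) {
    sockaddr_in peer{};
    socklen_t len = sizeof(peer);
    int ns = accept(listener, reinterpret_cast<sockaddr*>(&peer), &len);
    if (ns != -1) {
        peerText = formatPeer(peer);   // e.g. pass this on to the logger
    }
    return ns;
}
```

The resulting string could be added to the existing "fork child socket_fd=%d" log message.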
Thanks in advance for your help.
bob

We have added a counter to the block of code in EthernetClient.connect that exits with return 0. It appears to loop about 31,000 times before exiting, or roughly 30 seconds at one delay(1) per pass. During this test the connect actually "succeeded" on the Linode side: a child process was spawned.
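For comparison, here is a rough calculation of how long the chip's own retransmission timeout would run. It rests on assumptions that should be checked against the W5100 datasheet: a default retry time register (RTR) of 2000 in units of 100 microseconds, a default retry count (RCR) of 8, and a timeout that doubles after each attempt for as long as the doubled value still fits in 16 bits. tcpTimeoutMs is a hypothetical helper name.

```cpp
#include <cstdint>

// Total time in milliseconds before the chip gives up, assuming the timeout
// doubles after each attempt while the doubled value still fits in 16 bits.
// rtr: initial timeout in units of 100 microseconds; rcr: retry count.
uint32_t tcpTimeoutMs(uint32_t rtr, uint32_t rcr) {
    uint32_t total = 0;        // accumulated time in 100 us units
    uint32_t current = rtr;    // timeout used for the current attempt
    for (uint32_t attempt = 0; attempt <= rcr; ++attempt) {
        total += current;
        if (current * 2 <= 0xFFFF) current *= 2;
    }
    return total / 10;         // 100 us units -> milliseconds
}
```

With those assumed defaults, tcpTimeoutMs(2000, 8) comes to 31800 ms, which is close to the roughly 31,000 one-millisecond passes counted in the loop.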

What is going on here? We are at an impasse now. The server code appears to be accepting the socket connection from the Arduino, but the w5100 chip appears to not see that happening and times out.

Thanks in advance for any help you can offer.

bob

We have run some more tests using tcpdump on the Linode side to see if we can shed some light on the Arduino client connect problem. We ran these 3 tests all from the same network node.

i) Arduino connecting to Linode on port 8000 (reports a failure - connect times out)

17:05:24.569579 IP dyn-dial-mb-216-168-109-79.nexicom.net.34110 > li78-47.members.linode.com.8000: Flags [S], seq 1785336902, win 2048, options [mss 1400], length 0
17:05:24.569659 IP li78-47.members.linode.com.8000 > dyn-dial-mb-216-168-109-79.nexicom.net.34110: Flags [S.], seq 3277542125, ack 1785336903, win 5840, options [mss 1460], length 0
17:05:24.619341 IP dyn-dial-mb-216-168-109-79.nexicom.net.34110 > li78-47.members.linode.com.8000: Flags [.], ack 1, win 2048, length 0

NOTE: very often the Arduino is simply unable to send any packets to the Linode (tcpdump sees no traffic) and we get the same result: a timeout. Very occasionally the Arduino will succeed and connect to port 8000. It is very much hit and miss.

ii) Arduino connecting to Linode on port 80 (always succeeds - exactly the same code with the port number changed)

17:04:29.794894 IP dyn-dial-mb-216-168-109-79.nexicom.net.48786 > li78-47.members.linode.com.www: Flags [S], seq 1804884288, win 2048, options [mss 1400], length 0
17:04:29.794977 IP li78-47.members.linode.com.www > dyn-dial-mb-216-168-109-79.nexicom.net.48786: Flags [S.], seq 2430990952, ack 1804884289, win 5840, options [mss 1460], length 0
17:04:29.844972 IP dyn-dial-mb-216-168-109-79.nexicom.net.48786 > li78-47.members.linode.com.www: Flags [.], ack 1, win 2048, length 0

iii) Linux test stub connecting to Linode on port 8000 (always succeeds)

17:02:20.915522 IP dyn-dial-mb-216-168-109-79.nexicom.net.48676 > li78-47.members.linode.com.8000: Flags [S], seq 1900867220, win 14600, options [mss 1400,sackOK,TS val 241846 ecr 0,nop,wscale 7], length 0
17:02:20.915646 IP li78-47.members.linode.com.8000 > dyn-dial-mb-216-168-109-79.nexicom.net.48676: Flags [S.], seq 417864843, ack 1900867221, win 5792, options [mss 1460,sackOK,TS val 295109769 ecr 241846,nop,wscale 5], length 0
17:02:20.966519 IP dyn-dial-mb-216-168-109-79.nexicom.net.48676 > li78-47.members.linode.com.8000: Flags [.], ack 1, win 115, options [nop,nop,TS val 241898 ecr 295109769], length 0

As far as I can tell from the tcpdump, the packet sequence is the same in all cases:

[S]  Arduino (or Linux node) -> Linode
[S.] Linode -> Arduino (or Linux node)
[.]  Arduino (or Linux node) -> Linode

Interpacket timing looks very similar. The only visible difference is that the Linux test stub advertises a larger window and extra TCP options (sackOK, timestamps, window scaling) in its packets.

What is going on here? Why does the Arduino always connect to port 80 on the Linode but almost always fail on port 8000? Why does the Linux test stub always connect to port 8000 from the same network node?

Any suggestions for further testing would be appreciated. Thanks in advance.

bob

Two more tests in the matrix. We took down Apache and put our custom daemon up on port 80. The Arduino could connect without difficulty! So it isn't something related to the construction of the daemon as Apache and this custom code accept the connections from the Arduino equally well.

However, when that same custom code is moved to port 8000 we get the behavior described earlier.

We have also modified the Arduino connect to put it into a loop and displayed debug info about how many false starts it has. Here is the code:

======== begin snip ========
#include <SPI.h>
#include <Ethernet.h>

void setup()
{
  EthernetClient client;
  int port = 8000;
  //int port = 80;
  const char *serverName = "www.icanprogram.ca";
  byte mac[] = {0x90, 0xA2, 0xDA, 0x0D, 0x0D, 0x77};
  int ret;

  Serial.begin(9600);
  Serial.println();
  Serial.println("starting up");

  // start ethernet
  if (!Ethernet.begin(mac))
  {
    Serial.println("error on ethernet");
    while(1);
  }

  // give the ethernet shield a chance to initialize
  delay(1000);

  // connect to server, retrying up to 100 times
  for (int i = 0; i < 100; i++)
  {
    Serial.print("connect ");
    Serial.println(i+1);
    ret = client.connect(serverName, port);
    if (ret == 1)
      break;
  }

  // report the result of the last attempt
  switch (ret)
  {
    case 1:
      Serial.println("success");
      delay(10000);
      client.stop();
      break;
    case 0:
      Serial.println("failure: 0");
      break;
    case -1:
      Serial.println("failure: -1");
      break;
    case -2:
      Serial.println("failure: -2");
      break;
    case -3:
      Serial.println("failure: -3");
      break;
    default:
      Serial.println("failure: default");
  }
}

void loop()
{
  Serial.println("loop started ... going to spin now");
  while(1);
}
======== end snip =========

This produces a very orderly and reproducible pattern. The first time, after a cold boot, it takes 3 tries to succeed. The second time, after a reset, it takes 4 tries. The third time, after another reset, it takes 5 tries, and so on.

Why this pattern? Each try waits ~30sec for timeout to occur.

This problem has us completely baffled. Why would a higher port number affect the ability of the code to connect to that port? It is almost as if a number as large as 8000 is causing an internal "overflow" of some kind that takes a timeout to reset.

Any help would be greatly appreciated.

bob