Some Yun processes can access the web, some can't anymore (2.5 days problem?)

Hi,
I have a Yun (image as of November 2014, ex-rooted as instructed) running as a webserver and gateway to some home automation, on two networks (eth1 and wlan0).

Everything "south of the bridge" (on the MCU side) works flawlessly (controlling other Arduinos via an nRF24xx network), continuously, with no need for resets, no memory leaks, etc.

Everything "north of the bridge" (on the openWrt side) works flawlessly:

  • communicating with the MCU through the bridge,
  • controlling Kankun outlets on my WiFi network,
  • serving the Yun / luci interface on port 80 on both networks (192.xxx and 10.xxx),
  • serving my python flask web interface on port 5000 on both networks (192.xxx and 10.xxx).

I watch my processes on the Linux side through the bridge, restart them if they fail, and log all of that into a database, so I have quite good insight into the stability of my sketch and the scripts running on both processors. They are stable, at least for weeks.

The problem is that after a few hours or days, some of the services can no longer access the www.

Namely those are "yalertunnel":

Apr 25 10:07:26 webserver user.info sysinit: 2015-04-25'T'14:07:26'Z':connection failure:148:No route to host:try.yaler.net:80@yalertunnel.c:1036
Apr 25 10:07:26 webserver user.info sysinit: 2015-04-25'T'14:07:26'Z':connection failure:148:No route to host:try.yaler.net:80@yalertunnel.c:1036
Apr 25 10:07:26 webserver user.info sysinit: 2015-04-25'T'14:07:26'Z':connection failure:148:No route to host:try.yaler.net:80@yalertunnel.c:1036
Apr 25 10:07:26 webserver user.info sysinit: 2015-04-25'T'14:07:26'Z':connection failure:148:No route to host:try.yaler.net:80@yalertunnel.c:1036

And one of my python processes:

import urllib2 
f = urllib2.urlopen('some .json out on the web')

Traceback (most recent call last):
  File "x.py", line 231, in <module>
    f = urllib2.urlopen('http://api.wunderground.com/api/somePrivateInfoHere.json')
(...)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 145] Connection timed out>

The Yun is logged into two different networks, both have access to the www:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
default         192.168.1.1     0.0.0.0         UG    0      0        0 eth1
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
10.0.0.0        *               255.255.255.0   U     0      0        0 wlan0
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1

When I open a shell, I may or may not get ping replies:

root@webserver:/mnt/sda1# ping www.arduino.cc

PING www.arduino.cc (107.22.183.212): 56 data bytes
--- www.arduino.cc ping statistics ---
26 packets transmitted, 0 packets received, 100% packet loss

root@webserver:/mnt/sda1# ping www.google.com

PING www.google.com (64.233.168.104): 56 data bytes
64 bytes from 64.233.168.104: seq=0 ttl=42 time=69.098 ms
64 bytes from 64.233.168.104: seq=1 ttl=42 time=77.026 ms
64 bytes from 64.233.168.104: seq=2 ttl=42 time=66.581 ms
^C
--- www.google.com ping statistics ---
4 packets transmitted, 3 packets received, 25% packet loss
round-trip min/avg/max = 66.581/70.901/77.026 ms

I do have a second, identical (!) setup (same code, but fewer devices controlled) somewhere else (same internet service provider), that just works (up for 3 weeks now, and running strong).

Any suggestions?

Thanks,
jafrei

ping www.arduino.cc

always fails, because the site's firewall blocks ICMP!

Is it a bad idea for a firewall to block ICMP?

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
default         192.168.1.1     0.0.0.0         UG    0      0        0 eth1
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
10.0.0.0        *               255.255.255.0   U     0      0        0 wlan0
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1

Which one is your ISP gateway? 10.0.0.1 or 192.168.1.1?

root@webserver:/mnt/sda1# ping www.google.com

PING www.google.com (64.233.168.104): 56 data bytes
64 bytes from 64.233.168.104: seq=0 ttl=42 time=69.098 ms
64 bytes from 64.233.168.104: seq=1 ttl=42 time=77.026 ms
64 bytes from 64.233.168.104: seq=2 ttl=42 time=66.581 ms
^C
--- www.google.com ping statistics ---
4 packets transmitted, 3 packets received, 25% packet loss

25% packet loss to www.google.com is not a good sign.

Use ping -c 4 www.google.com to test it again.

...
Everything "south of the bridge" (on the MCU side) works flawlessly...

Everything "north of the bridge" (on the openWrt side) works flawlessly:
...

I like your definition:

The southbridge typically implements the slower capabilities of the system.

The northbridge typically implements the faster capabilities of the system.

Now I have an off-topic question:

Why is north faster than south?

@jafrei

Your route shows you have three (3) default routes.

This one is duplicated:

default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0

I doubt that is the problem, but it is not helping.

I think the problem is the wifi. This kind of fault tends to be intermittent and inconsistent between devices. That is to say, it is possible that one Yun will work without flaw, and the other will have this consistent behaviour of dropping out every few days.

NOTE, other people have this problem. The workaround is to reboot the system. However, I think just shutting down the wifi for a few seconds will do.

Given this, are you willing to run some experiments, or would you just like to reset (or reboot) the device on a regular basis?

TIA
Jesse

@sonnyyu: 25% loss of pings to google is only because I Ctrl-C'ed it before the 4th packet came back.
@sonnyyu: 10.0.0.1 and 192.168.1.1 are both gateways to the internet (one WiFi router and one Ethernet router). The 192.-router is connected via Ethernet to the 10.-router, which is the actual gateway, so both networks have www access this way.

@jessemonroy650: I'd prefer to run experiments, and only resort to resets if the situation can't be solved via software. Please let me know what you would like me to try.

jafrei

jafrei:
@sonnyyu: 25% loss of pings to google is only because I Ctrl-C'ed it before the 4th packet came back.
...

That is why I asked you to use

ping -c 4 www.google.com

to test it again.

Your routing table smells fishy. Besides the duplicate entry Jesse pointed out, the second default gateway (192.168.1.1 on eth1) has no metric set: it shows 0, the same as the first.

Mine:

 route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         192.168.0.1     0.0.0.0         UG    0      0        0 wlan0
default         192.168.0.1     0.0.0.0         UG    10     0        0 eth1
192.168.0.0     *               255.255.255.0   U     0      0        0 wlan0
192.168.0.0     *               255.255.255.0   U     10     0        0 eth1
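
The two observations above (a duplicated default route, and two default routes sharing metric 0) can be checked mechanically. Below is a minimal sketch, assuming the text comes from the plain `route` command in the column layout shown in this thread. The function names parse_routes and check_default_routes are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
# Sketch: flag suspicious default routes in the text printed by `route`.
# SAMPLE is the routing table jafrei posted earlier in this thread.
SAMPLE = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
default         192.168.1.1     0.0.0.0         UG    0      0        0 eth1
default         10.0.0.1        0.0.0.0         UG    0      0        0 wlan0
10.0.0.0        *               255.255.255.0   U     0      0        0 wlan0
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1
"""


def parse_routes(text):
    """Turn `route` output into a list of dicts, one per route line."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Route rows have exactly 8 columns; skip the title and header lines.
        if len(fields) != 8 or fields[0] == "Destination":
            continue
        routes.append({
            "destination": fields[0],
            "gateway": fields[1],
            "metric": int(fields[4]),
            "iface": fields[7],
        })
    return routes


def check_default_routes(routes):
    """Return warnings about duplicated or equal-metric default routes."""
    defaults = [r for r in routes if r["destination"] in ("default", "0.0.0.0")]
    warnings = []
    seen = set()
    by_metric = {}
    for r in defaults:
        key = (r["gateway"], r["iface"], r["metric"])
        if key in seen:
            warnings.append("duplicate default route via %s on %s"
                            % (r["gateway"], r["iface"]))
        seen.add(key)
        by_metric.setdefault(r["metric"], set()).add((r["gateway"], r["iface"]))
    for metric, targets in sorted(by_metric.items()):
        if len(targets) > 1:
            names = ", ".join("%s/%s" % t for t in sorted(targets))
            warnings.append("default routes with equal metric %d: %s"
                            % (metric, names))
    return warnings


for warning in check_default_routes(parse_routes(SAMPLE)):
    print(warning)
```

For jafrei's table this reports the duplicated wlan0 entry and the two gateways that share metric 0; for the table in this post it reports nothing.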

Some more insight:

The problem is intermittent: it comes and goes without the device being reset (neither a forced reset nor a system failure and reboot).

Since I log all user actions and background job actions into a database, I found that the Atheros / openWrt side has been up for 12 days now.

When I look at the log for the last two days, I see for example:

www access was available (log every 3 hours)
2015-04-24 19:23 through 2015-04-25 12:53 no
2015-04-25 15:53 through 2015-04-25 18:53 yes
2015-04-25 21:53 through 2015-04-26 20:50 no

And I know that at 2015-04-25 15:00 and 15:02 a user action was made on the web interface from the www, not from within the LAN. So the Yun processes had connectivity through WiFi or Ethernet.

It seems like the problem goes away for several hours in a row, and then comes back for several more hours / days.

I don't have exact times, and can't tell if during the 3h logging intervals the status changed more than once.
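
One way to narrow this down would be to probe much more often than every 3 hours but store only the moments when the state flips, so the database does not fill up. A minimal sketch follows; the function name state_changes and the sample timestamps are made up for illustration and are not taken from the real log.

```python
def state_changes(samples):
    """Keep only the samples whose up/down state differs from the previous one.

    samples: iterable of (timestamp, is_up) pairs in time order.
    Returns a list of (timestamp, is_up) pairs, starting with the first sample.
    """
    changes = []
    previous = None
    for timestamp, is_up in samples:
        if previous is None or is_up != previous:
            changes.append((timestamp, is_up))
        previous = is_up
    return changes


# Hypothetical one-minute probes; only the three transitions get stored.
probes = [
    ("12:00", False),
    ("12:01", False),
    ("12:02", True),
    ("12:03", True),
    ("12:04", False),
]
for timestamp, is_up in state_changes(probes):
    print(timestamp, "up" if is_up else "down")
```

With transitions recorded this way, each line in the log marks a change of state with a resolution of one probe interval.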

Use Nagios/Icinga to monitor both interfaces (WiFi and Ethernet) as well as the HTTP service.

Nagios (open source) is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. Nagios is installed on a Linux host machine.

It sends out an alert via email or SMS once a threshold is reached. Leave the USB connection in place for debug sessions.

http://forum.arduino.cc/index.php?topic=227693.msg1645492#msg1645492

Additionally, install nrpe and nagios-plugins on the Yun to monitor memory, CPU usage, processes, load, etc.

http://www.smallbusinesstech.net/more-complicated-instructions/nagios/setting-up-nagios-on-a-debian-server-to-remotely-monitor-an-openwrt-router

Please upgrade your Yún - latest is 1.5.3

http://forum.arduino.cc/index.php?topic=279008.0

Old Yun OS = unstable system

@jafrei,

I just noticed you might be routing both gateways to a single backend (third) gateway. Dual-homed is not unheard of, but it is tricky. Generally, the second gateway routes to an internal network and never hits the internet. I'll need to check with a friend who works as a Network Administrator to see if this is an issue.

In the meantime, I see you have a logging system that is pretty good. Before we start, could you explain or sketch out how you expect the network traffic to travel through your network and out to the Internet? This will make it easier to design some tests.

TIA
Jesse

@jafrei: using wget to test "access to the Web" might deliver more relevant results than ping.

E.g. the yalertunnel daemon mentioned above should work if wget -Sq try.yaler.net works, while ping is not supported at all.

Kind regards, Thomas (founder of Yaler.net)
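
For scripts that already run in Python, the same kind of check can be done by opening a TCP connection instead of sending a ping. This is a minimal sketch, not part of yalertunnel: the function name tcp_reachable and the connect parameter (there only so the check can be exercised without a network) are made up for illustration. socket.create_connection is available in Python 2.7 and Python 3.

```python
import socket


def tcp_reachable(host, port=80, timeout=10, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be opened.

    Unlike ping (ICMP), this exercises name resolution plus the TCP path
    that wget, urllib2 and yalertunnel depend on.
    """
    try:
        conn = connect((host, port), timeout)
    except socket.error:
        # Covers refused connections, timeouts, "No route to host"
        # and DNS failures.
        return False
    conn.close()
    return True
```

Calling tcp_reachable("try.yaler.net") from a monitoring script should then fail in the situations where wget fails, even while ping still gets replies.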

I'm running 1.5.3 (November 2014).

I was working with tamberg on this problem in parallel, since one of the symptoms is always that Yaler doesn't work anymore. It is quite obvious at this point that the TCP connection fails "after a while", but can also recover for hours, before it fails again.

While the TCP connection fails (tested with wget), ping still works.

I checked out the "Re: Measuring if the Yun is internet connected" thread. The idea there is to ping a server on the www, but ping always works on my Yun.

What fails are

  • wget (run from the shell)
  • urllib2.urlopen( ) (run from a python script)
  • yalertunnel (my way to make the Yun accessible from the www)

I checked out Nagios/Icinga. Those are monitoring systems that don't solve the problem, but will help alert me when the problem occurs, and can also reboot the system?

I'm afraid that I will lock myself out of the system if I reboot automatically as soon as the TCP connection fails. There is potential to start an endless loop of rebooting, right?
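
One way to guard against such a loop is to limit how many recovery actions are allowed within a time window, and to keep the record in a file so that it survives a reboot. The following is only a sketch under assumptions: the function name allow_recovery, the state file and the limits are made up, and it relies on the system clock being roughly right after a reboot, which may not hold on a board that gets its time from the network.

```python
import json
import time


def allow_recovery(state_path, max_attempts=3, window=3600, now=None):
    """Return True if another recovery action (reboot, wifi restart) is allowed.

    Timestamps of past attempts are kept as a JSON list in state_path.
    At most max_attempts are allowed within any window of `window` seconds.
    """
    now = time.time() if now is None else now
    try:
        with open(state_path) as handle:
            attempts = json.load(handle)
    except (IOError, ValueError):
        # Missing or unreadable state file: start with an empty history.
        attempts = []
    if not isinstance(attempts, list):
        attempts = []
    # Forget attempts that are older than the window.
    attempts = [t for t in attempts if now - t < window]
    allowed = len(attempts) < max_attempts
    if allowed:
        attempts.append(now)
    with open(state_path, "w") as handle:
        json.dump(attempts, handle)
    return allowed
```

A watchdog would call allow_recovery(...) before rebooting and simply log the failure when it returns False, so the box stays reachable on the LAN instead of cycling.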

What would be the python commands to turn off and on the WiFi?

And why doesn't the Ethernet route work? After all, the Yun should have access to two different internet gateways, one through WiFi, one through Ethernet cable.

My expectation how the traffic should travel through my network:

The ISP's (IP addresses 10.0.0.x) router does not belong to me, nor should I mess with it. I know the admin password, but I keep out of it. The Yun's WiFi is connected to this router's network, and pulls a variable IP address via DHCP. The ISP does not give this router a static IP address. The owner of this router can access the web interface running on the Yun through http://10.0.0.x:5000 locally, if necessary.

My router is connected to the ISP's router via Ethernet cable. My router serves the addresses 192.168.1.x, and assigns the Yun's Ethernet adapter a static IP address.

My expectation is that any traffic to and from the web is routed either through the WiFi of the Yun <-> 10.0.0.0 gateway, or through the Ethernet adapter of the Yun <-> 192.168.1.0 gateway (and from there my router has a dynamic address at the ISP router which has a dynamic address on the www).

I have another Yun that functions exactly the same way (WiFi into the ISP's router, Ethernet into a second router, which is plugged into a WiFi extender that connects to the ISP's router). This Yun runs stable, and as I pointed out, the Yun with the issues works every now and then for hours. There doesn't seem to be something basically wrong with this.

If the double-dipping into two networks via WiFi and Ethernet is the problem: My solution does not depend on this, I could turn WiFi off altogether. My solution does, however, depend on stable internet access.

jafrei

@jafrei,

the best solution is the path of least resistance. Since you have a second YUN that routes the same way and runs nearly identical software, I'm fairly confident that you have a flaky wifi module. We could find a solution, such as turning it off and on again for a few seconds (then again, it could take minutes), but I would recommend just turning off the wifi on that unit and running on the wired connection.

FWIW: here are the wifi reset instructions:
http://wiki.openwrt.org/doc/faq/faq.wireless#how_do_i_reset_wifi_interface

Let us know what you decide

TIA
Jesse

Hi Jesse,

when I pull the Ethernet cable on the Yun and log into the 10.0.0.x WiFi network, I can perfectly access the web interface running on the Yun (via WiFi, obviously), while the Yun processes that require TCP can't access the www (and I checked that the 10.0.0.0 gateway on the WiFi actually does allow access to the web).

IMHO that doesn't point towards a flaky WiFi module, but to routing issues within the openWrt distribution running on the Yun. Also, if it works, it works for hours, so there seems to be a piece of software or a setting that causes a problem / doesn't respond to whatever change happens every now and then within the network, maybe?

I'll now turn off the Yun's WiFi and see if it is stable through Ethernet.

jafrei

When I run

wifi down && sleep 5 && wifi

I get "Successfully initialized wpa_supplicant", and then immediately wget works, and yalertunnel, and my python scripts that pull web content.

So it can only be a software issue of the openWrt distribution.

I now have a workaround, and in case you want me to do some more research for the next Yun update image, I'd be glad to help.

jafrei
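
For the earlier question about doing this from Python: the shell workaround above can be wrapped with subprocess. This is a minimal sketch, assuming the script runs as root on the Yun, where the `wifi` command used above is available. The function name restart_wifi and the run and sleep parameters (there only so the function can be tried without touching the radio) are made up for illustration.

```python
import subprocess
import time


def restart_wifi(pause=5, run=subprocess.call, sleep=time.sleep):
    """Python equivalent of `wifi down && sleep 5 && wifi`.

    Returns True if both commands exit with status 0.
    """
    if run(["wifi", "down"]) != 0:
        # Mirror the shell `&&`: stop if taking the interface down failed.
        return False
    sleep(pause)
    return run(["wifi"]) == 0
```

A watchdog script could call restart_wifi() when its TCP check fails, ideally rate-limited so that a longer outage at the ISP does not make it cycle the radio continuously.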

@jafrei,
in the end it will be a combination of both. I think for now the best thing is for you to have a workable system, even if it relies on a workaround. Let's leave it for now. I will work on the problem and talk to the original author. Maybe I can get him to fix it.

Jesse

@OP;-

Since your second box works fine, why not clone its OS with the dd command (see "Clone Yun OS at Linux Box by dd Command") and use the image on the first box?

http://forum.arduino.cc/index.php?topic=319452.0