Consistent http failures on every 5th request

Hi,

I have some code that looks like this which I am trying to run on the Giga R1 (approx Portenta H7) using the latest IDE and arduino mbed code version v4.1.5 with ArduinoHttpClient 0.6.1

#include <Arduino.h>
#include <WiFi.h>
#include <HttpClient.h>

void setup() {
  Serial.begin(115200);
  delay(1000);
  Serial.println("HI");
  // put your setup code here, to run once:
  auto wifi = WiFiInterface::get_default_instance();
  wifi->set_dhcp(true);
  wifi->set_credentials("ZZZZ", "ZZZZ", nsapi_security_t::NSAPI_SECURITY_WPA2);
  Serial.println(wifi->connect());
}

void loop() {
  // put your main code here, to run repeatedly:
  Serial.println("1");
   WiFiClient client;
  
  Serial.println("2");
   HttpClient httpClient(client, "httpforever.com", 80);
  Serial.println("3");
   httpClient.beginRequest();
  Serial.println("4");
   httpClient.get("/");
  Serial.println(WiFiInterface::get_default_instance()->get_connection_status() != NSAPI_STATUS_DISCONNECTED);
   httpClient.endRequest();
  Serial.println("5");
   Serial.println(httpClient.contentLength());
  Serial.println("6");
  delay(1000);
  Serial.println("7");
}

This fails on every 5th request and the http client's content length returned is -1 (after a long delay). This happens irrespective of the http url I choose. Translating this to python and running it on my Mac does not exhibit this behaviour.

Does anyone else do network stuff on their Giga board? Are you seeing issues? Am I just doing it wrong?

I guess it's better than 4.1.1 where the WifiClient destructor ends up calling join() which often hangs due to a bug in MbedClient but still. ugh.

Andy

... I think it's related to a timeout in the mbed code. If I make the delay 10s it's fine. Note this is not a website timeout, my python code is happily retrieving / on the same site every second.

Hi, try a call to client.setSocketTimeout() varying the timeout value (zero for non-blocking).

You may need to tune the libmbed.a to get decent wifi performance (I did)

Thank you for taking the time to reply. Whatā€™s put me off considering this a performance problem is the issue that got me (on mbed core version 4.1.1) here was a similar thing but instead of returning -1 and recovering it failed to create an underlying mbed socket entirely and then couldnā€™t recover as the mbed stack was basically broken at that point. It wasnā€™t leaking sockets as I checked the lwip socket arena and it wasnā€™t all allocated outā€¦. And WifiClient regularly hung on stop until I patched it. I see a quite a few fixes in this area post 4.1.1 so I donā€™t think that 4.1.1 issue was specific to me tbh.

Are you finding the mbed based network stuff reliable and robust in the face of disconnects and consistent reconnects?

Nevertheless, I will see how the socket timeout impacts this.

What did you tune in mbed? Buffer sizes or something else? Since youā€™re turning I assume you are running with your own build of arduino core mbed?

It has certainly improved a lot from when I started, in no small part due to the work of @Juraj.

There are a few tweaks to the mbed config that really helped me (see links) but one of the biggest wins was realising the overhead of using multiple calls to client.print and especially println as it makes two calls to wifi, the provided data and then a newline. Batching up the data where possible and adding newlines to it made a huge difference.

I'm using the wifi in AP mode these days but spent a long time going via my router. Latency
did cause me some instability at higher workloads. AP mode is fast and I'd give it 9/10 for stability.

Maximizing throughput | lwIP Wiki | Fandom
Tuning TCP | lwIP Wiki | Fandom

update: I played around with a bunch of these settings with mixed results. Setting "lwip.tcp-mss": 1460, improved everything. I'm on the stock 4.1.5 with just some tweaks to compiler flags

p.s. I should add I'm not using the HttpClient class as I do my own thing

1 Like

Thank you. My WiFi requirements are not latency sensitive but require full robustness and appropriate recovery on all the edge cases. Iā€™m running BearSSL over the top in real life using the ECC08x which has complicated matters as the Arduino integration of that is not without its own issues in the face of common failure modes like ā€œserver accepts but doesnā€™t respondā€. I do worry that Iā€™ve chosen the wrong OS given the n/w code issues but right now Iā€™m 8 months pregnant so a switch isnā€™t on the cards so I need to make this work. Thanks again.

p.s. I should add I'm not using the HttpClient class as I do my own thing

Good to know. I have started to do the same in other areas recently. I was worrying about the NIH factor but I seem to get a better outcome.

My experience suggests mbed os is solid. The Arduino layer is a work-in-progress for sure.

I have worked around the deficiencies I've found and, in a few cases, used the mbed:: classes directly to good effect. However, if you need industrial strength in short order then you may need to look elsewhere.

Thank you for the advice

Debugging this issue it's actually identical to the mbed 4.1.1 issue. On the failing call TCPSocket::open returns -3005 (NSAPI_ERROR_NO_SOCKET) from the MbedClient::connect. There's then a delay of 30s before it returns and then the next go is a success. I think I'm going to have to do an instrumented build to get to the bottom of this issue and fix it or, as you say, work directly at the mbed layer and figure out how to get BearSSL running over this.

I have my WiFi Servers and Clients global. May be useful for you depending on your reqts.

You may be running out of socks as the wifi thread may not be releasing quickly enough on the destructor (a guess)

I checked the socket arena of the lwip library and it hadn't filled up (with a hideous hack to get round the private access on the relevant variables). It had the exact number of sockets in the area in-use as you'd expect, i.e. one. If I move the WifiClient globally it breaks with -3003 (bad parameter). The TCPSocket is definitely released on the main thread of MbedClient without delay (once the join terminates). That being said, there is something in what you're saying, but maybe it's a buffer or something. Worth a look for sure.

I raised a ticket as well: Every 5th TCP connection fails on Arduino Giga R1. Ā· Issue #937 Ā· arduino/ArduinoCore-mbed Ā· GitHub

30secs is the default sock timeout. I adjust between 0 and 5ms in my project and handle the result.

1 Like

try to add client.stop();

This does stop the -3003 errors (thanks). This is clearly a defect. There should be a virtual destructor on Client and WiFi client should implement it to call stop() if not already stopped. This does not fix the no socket (-3005) issue.

07:31:34.551 -> Client.stop: Fetching content length from httpforever.com:80
07:31:34.748 -> Return code from connect0
07:31:35.543 -> took=990ms length: 5124
07:31:37.555 -> Client.stop: Fetching content length from httpforever.com:80
07:31:37.727 -> Return code from connect0
07:31:38.519 -> took=961ms length: 5124
07:31:40.524 -> Client.stop: Fetching content length from httpforever.com:80
07:31:40.819 -> Return code from connect0
07:31:41.643 -> took=1121ms length: 5124
07:31:43.621 -> Client.stop: Fetching content length from httpforever.com:80
07:31:43.785 -> Return code from connect0
07:31:44.611 -> took=970ms length: 5124
07:31:46.609 -> Client.stop: Fetching content length from httpforever.com:80
07:31:46.609 -> Return code from TCPSocket::open-3005

Changed code for ref:

WiFiClient client;
void loop() {
  // put your main code here, to run repeatedly:
 
   Serial.println("Client.stop: Fetching content length from httpforever.com:80");
   auto then = millis();
   HttpClient httpClient(client, "httpforever.com", 80);
   httpClient.beginRequest();
   httpClient.get("/");
   httpClient.endRequest();
   auto len = httpClient.contentLength();
   Serial.print("took="); Serial.print((millis() - then)); Serial.print("ms "); Serial.print("length: "); Serial.println(len);
   delay(1000);
   httpClient.stop();
   client.stop();
}```

could you try these changes? (without the stop() in sketch)

The result after applying the changes:

void arduino::MbedClient::readSocket() {
  uint8_t data[SOCKET_BUFFER_SIZE];

  while (sock != nullptr) {
    event->wait_any(0xFF, 100);
    int ret = NSAPI_ERROR_WOULD_BLOCK;
    do {
      mutex->lock();
      if (sock == nullptr) { mutex->unlock(); goto cleanup; }
      if (rxBuffer.availableForStore() == 0) {
        mutex->unlock();
        yield();
        continue;
      }
      ret = sock->recv(data, rxBuffer.availableForStore());
      if (ret < 0 && ret != NSAPI_ERROR_WOULD_BLOCK) {
        mutex->unlock();
        goto cleanup;
      }
      if (ret == NSAPI_ERROR_WOULD_BLOCK || ret == 0) {
        mutex->unlock();
        break;
      }
      for (int i = 0; i < ret; i++) {
        rxBuffer.store_char(data[i]);
      }
      mutex->unlock();
      _status = true;
    } while (true);
  }
cleanup:
  _status = false;
  return;
}

This doesn't seem to address the issue (assuming I applied the patch correctly).

As an aside, the above code might be thread unsafe. I can't immediately see why the gcc optimiser would keep re-reading sock. There's no writer in this loop and sock is neither volatile or atomic. I am guessing the mutex might (??) result in a fence of some form but what's to stop the compiler keeping sock in a register and not reloading it? I appreciate the default compiler is old and -Os is favoured but since this is in a different thread from where sock is assigned null this maybe a problem with a modern gcc at -O3 (I've seen this in real life on x86 before but no idea re: arm).

Without the stop and HttpClient local:

14:36:07.975 -> Client.stop: Fetching content length from httpforever.com:80
14:36:08.160 -> Return code from connect0
14:36:08.979 -> took=1005ms length: 5124
14:36:10.982 -> Client.stop: Fetching content length from httpforever.com:80
14:36:11.114 -> Return code from connect0
14:36:11.909 -> took=957ms length: 5124
14:36:13.939 -> Client.stop: Fetching content length from httpforever.com:80
14:36:14.071 -> Return code from connect0
14:36:14.939 -> took=1020ms length: 5124
14:36:16.952 -> Client.stop: Fetching content length from httpforever.com:80
14:36:17.115 -> Return code from connect0
14:36:17.904 -> took=965ms length: 5124
14:36:19.902 -> Client.stop: Fetching content length from httpforever.com:80
14:36:19.902 -> Return code from TCPSocket::open-3005

With the stop and HttpClient global:

14:29:35.430 -> Client.stop: Fetching content length from httpforever.com:80
14:29:35.595 -> Return code from connect0
14:29:36.420 -> took=998ms length: 5124
14:29:38.404 -> Client.stop: Fetching content length from httpforever.com:80
14:29:41.326 -> Return code from connect0
14:29:42.087 -> took=3688ms length: 5124
14:29:44.099 -> Client.stop: Fetching content length from httpforever.com:80
14:29:44.296 -> Return code from connect0
14:29:45.123 -> took=1013ms length: 5124
14:29:47.100 -> Client.stop: Fetching content length from httpforever.com:80
14:29:47.265 -> Return code from connect0
14:29:48.053 -> took=957ms length: 5124
14:29:50.060 -> Client.stop: Fetching content length from httpforever.com:80
14:29:50.060 -> Return code from TCPSocket::open-3005

With the stop and HttpClient local:

14:28:28.124 -> Client.stop: Fetching content length from httpforever.com:80
14:28:28.322 -> Return code from connect0
14:28:29.181 -> took=1037ms length: 5124
14:28:31.155 -> Client.stop: Fetching content length from httpforever.com:80
14:28:31.319 -> Return code from connect0
14:28:32.145 -> took=970ms length: 5124
14:28:34.126 -> Client.stop: Fetching content length from httpforever.com:80
14:28:34.361 -> Return code from connect0
14:28:35.149 -> took=1026ms length: 5124
14:28:37.163 -> Client.stop: Fetching content length from httpforever.com:80
14:28:37.333 -> Return code from connect0
14:28:38.156 -> took=994ms length: 5124
14:28:40.164 -> Client.stop: Fetching content length from httpforever.com:80
14:28:40.164 -> Return code from TCPSocket::open-3005

does the HttpClient send header "Connection: close"?

Just took a look at HttpClient.cpp. I am surprised the sketch gets as far as it does given the multitude of small iClient->print statements and my experiences with such.

You might want to check out the functions in 4.1.5\libraries\SocketWrapper\src\utility. These http handlers look much more efficient - but I've not used them

1 Like

Thanks for the pointer. Iā€™m just sorting out my cmake to accommodate a local build of mbed rather than the Arduino supplied build and then Iā€™m gonna start testing various layers of the stack.