Retrieving data from a website

I would like to retrieve data (numbers, and perhaps text) from a website, using the Arduino Mega and an Ethernet Shield.

This question has been asked a number of times in the last few days, and it is something I have not solved myself yet. Is there a standard way to do this? Are there libraries for this?

Some web pages have a lot of html code. Somewhere inside that code is a number I would like to read every day. Looking at the html source code, I can identify a special sequence of html code to look for, but how do I find that? Sometimes the html code is split into lines, but sometimes all of it is on a single line, so reading line by line is not possible. Reading chunks of html code is not possible, since the special sequence of html code could be split over two chunks. I don't want to read at a fixed offset either, since the page could change (change of date, advertising, and so on).

A shifting buffer would be possible, where each character is placed into a buffer and the buffer is checked for the special sequence of html code every time. Or is that too complicated?
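For illustration, the shifting-buffer idea could look roughly like the sketch below. It is plain C++ so it can be tried on a PC; `WindowMatcher`, `feed` and `kMaxLen` are made-up names, not part of any Arduino library. Because the window keeps the most recent characters, it does not matter where the server splits the response.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: keeps the last N characters seen, where N is the
// length of the search string, and compares them after every character.
class WindowMatcher {
public:
  static const std::size_t kMaxLen = 32;  // assumed upper limit for the search string

  explicit WindowMatcher(const char* pattern)
      : pattern_(pattern), len_(std::strlen(pattern)), filled_(0) {
    assert(len_ > 0 && len_ <= kMaxLen);
  }

  // Feed one character; returns true when the window equals the search string.
  bool feed(char c) {
    if (filled_ < len_) {
      window_[filled_++] = c;                        // still filling the window
    } else {
      std::memmove(window_, window_ + 1, len_ - 1);  // shift left by one
      window_[len_ - 1] = c;                         // newest character at the end
    }
    return filled_ == len_ && std::memcmp(window_, pattern_, len_) == 0;
  }

  // Forget everything seen so far, e.g. before a new request.
  void reset() { filled_ = 0; }

private:
  const char* pattern_;
  std::size_t len_;
  std::size_t filled_;
  char window_[kMaxLen];
};
```

In a sketch, each byte returned by client.read() would be passed to feed(). The cost is one memmove and one memcmp per character, which should be acceptable for short search strings.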

I would like to retrieve data (numbers, and perhaps text) from a website

Perhaps you need to use a better term to describe where you want to get the data from.

Is there a standard way to do this?

Of course. Make a GET request, and read the server’s response.

Are there libraries for this?

The Ethernet library.

Looking at the html source code, I can identify a special sequence of html code to look for, but how do I find that?

That depends on how you save the server response.

Reading chunks of html code is not possible

Sure it is. The chunk size is 1. Read and store each chunk. If you are looking for data between two sets of html tags, you can move the array data around, and reset the index when you encounter a closing html tag that is not the one you want.

Or is that too complicated?

It’s essential.

My question is how to find numbers in a large html page. I know how to use the Ethernet shield.
I don't want to be restricted to the html tags, like <tag>hello</tag>.
So is the only option to read a character, add it to a buffer, shift the buffer to the left, and test whether the search string is found?

I don't want to be restricted to the html tags

How do you propose, then, to determine what IS interesting?

For example, to find out how many followers @arduino has on Twitter, I would look for “follower_stats\" data-nav=\"followers\" >\n<strong>”. The number following that would be the number of followers.
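A rough sketch of that "find the marker, then read the number" idea, one character at a time and without storing the page, might look like this. It is plain C++ for testing on a PC; `NumberAfterMarker`, `feed` and `value` are hypothetical names. The marker search uses a Knuth-Morris-Pratt failure table so that a partial match such as "<str<strong>" is handled correctly. It assumes the digits follow the marker directly and contain no separators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: waits for `marker` in a character stream, then
// collects the unsigned integer that immediately follows it.
class NumberAfterMarker {
public:
  static const std::size_t kMaxMarker = 64;  // assumed upper limit for the marker

  explicit NumberAfterMarker(const char* marker)
      : marker_(marker), len_(std::strlen(marker)),
        matched_(0), state_(SEARCHING), value_(0) {
    assert(len_ > 0 && len_ <= kMaxMarker);
    // fail_[i]: length of the longest proper prefix of marker[0..i]
    // that is also a suffix of marker[0..i].
    fail_[0] = 0;
    std::size_t k = 0;
    for (std::size_t i = 1; i < len_; ++i) {
      while (k > 0 && marker_[i] != marker_[k]) k = fail_[k - 1];
      if (marker_[i] == marker_[k]) ++k;
      fail_[i] = k;
    }
  }

  // Feed one character; returns true once the number is complete.
  bool feed(char c) {
    if (state_ == SEARCHING) {
      // On a mismatch, fall back to the longest prefix that still fits.
      while (matched_ > 0 && c != marker_[matched_]) matched_ = fail_[matched_ - 1];
      if (c == marker_[matched_]) ++matched_;
      if (matched_ == len_) state_ = READING;
      return false;
    }
    if (state_ == READING) {
      if (c >= '0' && c <= '9') {
        value_ = value_ * 10 + static_cast<unsigned long>(c - '0');
        return false;
      }
      state_ = DONE;  // first non-digit ends the number
    }
    return true;
  }

  unsigned long value() const { return value_; }

private:
  enum State { SEARCHING, READING, DONE };
  const char* marker_;
  std::size_t len_;
  std::size_t matched_;
  State state_;
  unsigned long value_;
  std::size_t fail_[kMaxMarker];
};
```

In a sketch, every byte from client.read() would go into feed() until it returns true. If the first character after the marker is not a digit, value() stays 0.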

A simple example of capturing some text.

//zoomkat 5-13-13
//simple client test
//for use with IDE 1.0
//open serial monitor and send an e to test
//for use with W5100 based ethernet shields

#include <SPI.h>
#include <Ethernet.h>
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED }; //physical mac address
char serverName[] = "web.comporium.net"; // zoomkat's test web page server
EthernetClient client;

String readString, readString1;
int x=0; //for counting line feeds
char lf=10; //line feed character
//////////////////////

void setup(){

  Serial.begin(9600); // start serial first so the DHCP error below can be printed
  if (Ethernet.begin(mac) == 0) {
    Serial.println("Failed to configure Ethernet using DHCP");
    // no point in carrying on, so do nothing forevermore:
    while(true);
  }
  Serial.println("Better client test 5/13/13"); // so I can keep track of what is loaded
  Serial.println("Send an e in serial monitor to test"); // what to do to test
}

void loop(){
  // check for serial input
  if (Serial.available() > 0) //if something in serial buffer
  {
    byte inChar; // sets inChar as a byte
    inChar = Serial.read(); //gets byte from buffer
    if(inChar == 'e') // checks to see byte is an e
    {
      sendGET(); // call sendGET function below when byte is an e
    }
  }  
} 

//////////////////////////

void sendGET() //client function to send/receive GET request data.
{
  if (client.connect(serverName, 80)) {  //starts client connection, checks for connection
    Serial.println("connected");
    client.println("GET /~shb/arduino.txt HTTP/1.1"); //download text
    client.println("Host: web.comporium.net");
    client.println("Connection: close");  //close 1.1 persistent connection  
    client.println(); //end of get request
  } 
  else {
    Serial.println("connection failed"); //error message if no client connect
    Serial.println();
    return; //nothing to read, so skip the rest
  }

  while(client.connected() && !client.available()) delay(1); //waits for data
  while (client.connected() || client.available()) { //connected or data available
    if (!client.available()) continue; //no byte buffered yet, read() would return -1
    char c = client.read(); //gets byte from ethernet buffer
    Serial.print(c); //prints raw feed for testing
    if (c==lf) x=(x+1); //counting line feeds
    if (x==9) readString += c; //building readString
  }

  Serial.println();  
  Serial.println();
  Serial.print("Current data row:" );
  Serial.print(readString); //the 10th line captured
  Serial.println();
  readString1 = (readString.substring(0,8)); //extracting "woohoo!"
  Serial.println();
  Serial.print("How we feeling?: ");
  Serial.println(readString1);
  Serial.println();      
  Serial.println("done");
  Serial.println("disconnecting.");
  Serial.println("==================");
  Serial.println();
  readString = ("");
  readString1 = ("");
  x = 0; //reset the line feed counter for the next request
  client.stop(); //stop client
}

Thank you zoomkat, I could do that already.

I want to search for a specific string, and read data, search for another string, read other data. You use the String class; I don't think I can use that, so I would have to use a shifting buffer. If I publish the sketch, I also need some kind of timeout instead of delay.
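As an illustration of a timeout that replaces an open-ended wait, here is a small sketch in plain C++. `Timeout` is a hypothetical helper, and the time values are passed in so it can be tested on a PC. The counter is assumed to be 32 bits wide, like millis() on AVR boards. Unsigned subtraction keeps the comparison correct when the counter rolls over.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a deadline based on a 32-bit millisecond counter
// such as the value returned by Arduino's millis().
class Timeout {
public:
  Timeout(std::uint32_t nowMs, std::uint32_t durationMs)
      : start_(nowMs), duration_(durationMs) {
    assert(durationMs > 0);
  }

  // True once at least durationMs have passed since construction.
  // The subtraction wraps modulo 2^32, so rollover of the counter is harmless.
  bool expired(std::uint32_t nowMs) const {
    return static_cast<std::uint32_t>(nowMs - start_) >= duration_;
  }

private:
  std::uint32_t start_;
  std::uint32_t duration_;
};

// Intended use inside a sketch (assumption, not compiled here):
//   Timeout t(millis(), 5000);
//   while (client.connected() && !client.available()) {
//     if (t.expired(millis())) { client.stop(); return; }
//   }
```

On an AVR board the headers would be <stdint.h> and <assert.h>, or the assert could simply be dropped.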

The TextFinder utility below might be of interest.

http://playground.arduino.cc/Code/TextFinder

//
// Read Yahoo Weather API XML
// 03.09.2012
// http://arduino-praxis.ch


#include <SPI.h>
#include <Ethernet.h>
#include <TextFinder.h>

byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xAD };
byte ip[] = { 192, 168, 1, 102 };
byte gateway[] = { 192, 168, 1, 1 };
byte subnet[] = { 255, 255, 255, 0 };

// Server Yahoo
IPAddress server(87,248,122,181);

EthernetClient client;
TextFinder  finder( client );  

char place[50];
char hum[30];


void setup()
{
  // Start Ethernet
  Ethernet.begin(mac, ip);
  // Start Serial Port
  Serial.begin(9600);
  Serial.println("Setup...");
}


void loop()
{
  if (client.connect(server, 80))
  {
    // Call the weather API
    // w: ID of your city
    // http://weather.yahooapis.com/forecastrss?w=12893459&u=c
    Serial.println("Connect to Yahoo Weather...");
    client.println("GET /forecastrss?w=12893459&u=c HTTP/1.0");
    client.println("Host: weather.yahooapis.com");
    client.println(); // blank line ends the request headers
    Serial.println("Connected...");
  } 
  else
  {
    Serial.println(" connection failed");
  } 
 

  if (client.connected())
  {
    
    // Humidity
   if ( (finder.getString("<yweather:atmosphere humidity=\"", "\"",hum,4)!=0) )
   {
     Serial.print("Humidity:  ");
     Serial.println(hum);
   } 
   else
   {
     Serial.print("No Humidity Data");
   }
    
    
    // Place/City
    if ( (finder.getString("<title>Conditions for ", " ",place,50)!=0) )
    {
      Serial.print("City:  ");
      Serial.println(place);
    }
    
    
    // Temperature
    if(finder.find("temp=") )
    {
     int temperature = finder.getValue();
     Serial.print("Temp C:  ");
     Serial.println(temperature);
   }
   else
   {
     Serial.print("No Temperature Data");
   }
   
         
  // END XML
  }
  else
  {
    Serial.println("Disconnected"); 
  }
 
  client.stop();
  client.flush();
  delay(60000); 
}

Parsing that is a little more involved.

// Include description files for other libraries used (if any)
#include <SPI.h>
#include <Ethernet.h>

// Define Constants
// Max string length may have to be adjusted depending on data to be extracted
#define MAX_STRING_LEN  20

// Setup vars
char tagStr[MAX_STRING_LEN] = "";
char dataStr[MAX_STRING_LEN] = "";
char tmpStr[MAX_STRING_LEN] = "";
char endTag[3] = {'<', '/', '\0'};
int len;

// Flags to differentiate XML tags from document elements (ie. data)
boolean tagFlag = false;
boolean dataFlag = false;

// Ethernet vars
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
byte ip[] = { 192, 168, 1, 102 };
byte server[] = { 140, 90, 113, 200 }; // www.weather.gov

// Start ethernet client
EthernetClient client;

void setup()
{
  Serial.begin(9600);
  Serial.println("Starting WebWx");
  Serial.println("connecting...");
  Ethernet.begin(mac, ip);
  delay(1000);

  if (client.connect(server, 80)) {
    Serial.println("connected");
    client.println("GET /xml/current_obs/KRDU.xml HTTP/1.0");    
    client.println();
    delay(2000);
  } else {
    Serial.println("connection failed");
  }  
}

void loop() {

  // Read serial data in from web:
  while (client.available()) {
    serialEvent();
  }

  if (!client.connected()) {
    Serial.println();
    Serial.println("Disconnected");
    Serial.println("==================================");
    Serial.println("");
    client.stop();

    // Time until next update
    //Serial.println("Waiting");
    for (int t = 1; t <= 15; t++) {
      delay(60000); // 1 minute
    }

    if (client.connect(server, 80)) {
      //Serial.println("Reconnected");
      client.println("GET /xml/current_obs/KRDU.xml HTTP/1.0");    
      client.println();
      delay(2000);
    } else {
      Serial.println("Reconnect failed");
    }      
  }
}

// Process each char from web
void serialEvent() {

   // Read a char
   char inChar = client.read();
   //Serial.print(".");
  
   if (inChar == '<') {
      addChar(inChar, tmpStr);
      tagFlag = true;
      dataFlag = false;

   } else if (inChar == '>') {
      addChar(inChar, tmpStr);

      if (tagFlag) {      
         strncpy(tagStr, tmpStr, strlen(tmpStr)+1);
      }

      // Clear tmp
      clearStr(tmpStr);

      tagFlag = false;
      dataFlag = true;      
      
   } else if (inChar != 10) {
      if (tagFlag) {
         // Add tag char to string
         addChar(inChar, tmpStr);

         // Check for </XML> end tag, ignore it
         if ( tagFlag && strcmp(tmpStr, endTag) == 0 ) {
            clearStr(tmpStr);
            tagFlag = false;
            dataFlag = false;
         }
      }
      
      if (dataFlag) {
         // Add data char to string
         addChar(inChar, dataStr);
      }
   }  
  
   // If a LF, process the line
   if (inChar == 10 ) {

/*
      Serial.print("tagStr: ");
      Serial.println(tagStr);
      Serial.print("dataStr: ");
      Serial.println(dataStr);
*/

      // Find specific tags and print data
      if (matchTag("<temp_f>")) {
         Serial.print("Temp: ");
         Serial.print(dataStr);
      }
      if (matchTag("<relative_humidity>")) {
         Serial.print(", Humidity: ");
         Serial.print(dataStr);
      }
      if (matchTag("<pressure_in>")) {
         Serial.print(", Pressure: ");
         Serial.print(dataStr);
         Serial.println("");
      }

      // Clear all strings
      clearStr(tmpStr);
      clearStr(tagStr);
      clearStr(dataStr);

      // Clear Flags
      tagFlag = false;
      dataFlag = false;
   }
}

/////////////////////
// Other Functions //
/////////////////////

// Function to clear a string
void clearStr (char* str) {
   int len = strlen(str);
   for (int c = 0; c < len; c++) {
      str[c] = 0;
   }
}

//Function to add a char to a string and check its length
void addChar (char ch, char* str) {
   const char *tagMsg  = "<TRUNCATED_TAG>";
   const char *dataMsg = "-TRUNCATED_DATA-";

   // Check the max size of the string to make sure it doesn't grow too
   // big.  If string is beyond MAX_STRING_LEN assume it is unimportant
   // and replace it with a warning message.
   if (strlen(str) > MAX_STRING_LEN - 2) {
      if (tagFlag) {
         clearStr(tagStr);
         strcpy(tagStr,tagMsg);
      }
      if (dataFlag) {
         clearStr(dataStr);
         strcpy(dataStr,dataMsg);
      }

      // Clear the temp buffer and flags to stop current processing
      clearStr(tmpStr);
      tagFlag = false;
      dataFlag = false;

   } else {
      // Add char to string
      str[strlen(str)] = ch;
   }
}

// Function to check the current tag for a specific string
boolean matchTag (const char* searchTag) {
   return strcmp(tagStr, searchTag) == 0;
}

Thank you! That was what I was looking for.

I use this function to scrape values from html.

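// Assumes the sketch declares these globals elsewhere (not shown in this post;
// the types are a guess): EthernetClient client; char serverName[]; float ecTemp, ecHum;
void getECData()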
{
  Serial.print("Connecting...");
  if (client.connect(serverName, 80)) {  //starts client connection, checks for connection
    Serial.println("connected");
    client.println("GET /servernode/envCanPull.php HTTP/1.1"); //download text
    client.println("Host: 192.168.2.235");
    client.println("User-Agent: arduino-ethernet");
    client.println("Connection: close");  //close 1.1 persistent connection  
    client.println(); //end of get request
  } else {
    Serial.println("connection failed");
  }
  if (client.connected()) {
    if (client.find("<td id='ecTemp'>")) {
      ecTemp = client.parseFloat();
      Serial.print("ecTemp: ");
      Serial.println(ecTemp);  // value is printed
    } else {
      Serial.println("Could not find field - ecTemp");
    }
    if (client.find("<td id='ecHum'>")) {
      ecHum = client.parseFloat();
      Serial.print("ecHum: ");
      Serial.println(ecHum);  // value is printed
    } else {
      Serial.println("Could not find field - ecHum");
    }
  } else {
    Serial.println("Disconnected");
  }
  client.stop();
  client.flush();
  //delay(5000); // 5 seconds between each connect attempt
}

Thanks! Using it like that is how the TextFinder library is meant to be used, I think.