Reading HTML and Parsing it

Hi,

In my project I make a GET or POST request to an API. In the response I get raw HTML.

<html><div>Inner Text</div></html>

From this HTML I want to get the Inner Text. Note that this value is always changing, so it is not known in advance.

Is there any way I can parse the HTML using selectors to get my value?

So far all I have been able to find is JSON and HTTP parsing, unless I am misunderstanding it.

Thanks,
Anton

What sort of variable is the HTML in?
A string (a null-terminated char array) or a String (the Arduino class)?

I have not actually started writing code yet, so I can store it whichever way I like, I think.

Ideally I am looking for a library that will allow me to select the inner text of the HTML tags using selectors or XPath or something, i.e. html > div.innertext, if that makes sense.

There have been previous discussions of getting data from html files, so the search function in the upper right of the page is probably a good source. If the data is always in the same spot in the html code, then you might be able to count line feeds and similar. Also search for "textfinder".

Thanks, I'll have a look into TextFinder. I'm really surprised no one has developed an HTML parsing library, or at least not one that I can find.

I'm sure web "scraping" for phone numbers, email addresses, key words/phrases, and similar is well developed. If the position of the data is well defined and does not change in the page's HTML, you can probably write the code you need to find the data and capture it. Do you have an example of the page?

Maybe use:

  char buffer[100];
  // readBytesUntil() discards the terminator and does not store it in buffer.
  client.readBytesUntil('<', buffer, (sizeof buffer) - 1); // Find start of "<html>"
  client.readBytesUntil('<', buffer, (sizeof buffer) - 1); // Find start of "<div>"
  client.readBytesUntil('>', buffer, (sizeof buffer) - 1); // Find end of "<div>"
  size_t count = client.readBytesUntil('<', buffer, (sizeof buffer) - 1); // Read to start of "</div>"
  buffer[count] = '\0';  // Terminate right after the last byte read.
  // buffer now contains Inner Text

Warning: This assumes that “Inner Text” does not contain any ‘<’ characters. If it MIGHT contain such characters, the parsing problem becomes much greater.
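If the whole response fits in memory, one way around the '<' limitation is to search for the complete closing tag instead of a single character. The following is only a rough sketch in standard C++ using std::string; innerText is a hypothetical helper name, not part of any library. It assumes the tag is not nested inside itself, that no attribute value contains '>', and that no longer tag name starts with the same letters (searching for "div" would also match "<divider").

```cpp
#include <string>

// Hypothetical helper (not from any library): returns the text between the
// first <tag ...> and the next </tag>, or an empty string if either is missing.
std::string innerText(const std::string& html, const std::string& tag) {
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";

    // Locate the opening tag.
    std::size_t start = html.find(open);
    if (start == std::string::npos) return "";

    // Skip any attributes up to the '>' that ends the opening tag.
    start = html.find('>', start);
    if (start == std::string::npos) return "";
    ++start;  // first character of the inner text

    // Search for the full closing tag, so a bare '<' in the text is harmless.
    const std::size_t end = html.find(close, start);
    if (end == std::string::npos) return "";

    return html.substr(start, end - start);
}
```

Calling innerText(html, "div") on the example page should give back the text inside the div. On an Arduino, the String class has indexOf() and substring(), which should allow the same approach if the page is small enough to buffer.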

Note: HTML text is likely to contain escaped characters which you will have to check for and replace. There are probably libraries to do that.
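As a rough illustration of that replacement step, here is a minimal sketch in standard C++ that handles only the five common named entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;apos;). decodeEntities is a hypothetical helper name; numeric entities such as &amp;#39; and the many other named entities are deliberately left untouched.

```cpp
#include <string>

// Hypothetical helper: replaces five common named entities in a single pass.
// Anything else that starts with '&' is copied through unchanged.
std::string decodeEntities(const std::string& in) {
    static const struct { const char* name; char ch; } table[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'},
        {"&quot;", '"'}, {"&apos;", '\''},
    };

    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& e : table) {
                const std::string name = e.name;
                // Compare the entity name against the input at position i.
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.ch;
                    i += name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += in[i++];  // ordinary character, copy as-is
    }
    return out;
}
```

Because the input is scanned once from left to right, already-decoded output is never decoded a second time.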

Thanks for the replies. I was going to try this but had to order an FTDI adapter for my SparkFun ESP8266 Thing, as it did not have one built in as I had thought...

Also found this: Web scraper for ardunio running on ESP8266 based boards. · GitHub

I think using readBytesUntil will be a good start!

Thanks!