The topic of web scraping first appeared on this blog in 2018. At that time, the article dealt with how to create a simple web scraper on the ESP8266 / ESP32 platform. That implementation used the ESP in client mode with two independent connections to two web servers: a socket connection (a plain TCP client) for one and an HTTPClient for the other.
The socket connection loaded the source code of the web page line by line and sent each line to another web server via HTTPClient. The web server that received the line extracted its content using a regular expression and, if it found an e-mail address or telephone number, stored it in a MySQL database, dynamically expanding the existing set of such data.
However, a full-fledged web scraper should run on a single device and should not depend on other systems, since that complicates its management and control. For this reason, I decided to extend the topic today to a full-fledged client-side web scraper for the Arduino platform with an Ethernet shield and for the WiFi platforms ESP8266 and ESP32. The web scraper in this article serves exclusively educational purposes.
The information obtained was not saved. It was retrieved only once to demonstrate how a web scraper works, over a socket connection in the same way any ordinary browser loads a web page. The data was handled ethically and the retrieved content was not published on other websites. It was held only in the RAM of the microcontrollers for buffering purposes.
The first step in implementing a web scraper on such an embedded platform is to analyze the specific website from which we want to obtain data. Part of the analysis is to look for clues that can simplify the whole scraping process.
When we load dynamic data from a web page, we can assume where it will be located in the structure of the loaded HTML source code, for example between a paired <td> ... </td> HTML tag. For a more precise match, we can search based on classes, identifiers and other attributes that HTML elements can carry. Keep in mind, however, that the source code has to be loaded across the whole range of these tags and not strictly line by line, as the HTML structure can place tags and information on separate lines.
Let's now look at examples of web scraping on the mentioned platforms Arduino, ESP8266 and ESP32.
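Extracting the content between a pair of tags can be sketched as follows. This is a minimal illustration in standard C++ (std::string), so it can be tried on a PC; the function names extractBetween and trim are my own and not part of any library. On the microcontroller the same idea can be expressed with the indexOf and substring methods of the Arduino String class.

```cpp
#include <cassert>
#include <string>

// Returns the text between the first `open` marker and the next `close`
// marker in `html`, or an empty string if either marker is missing.
// The search runs over the whole buffer, so it does not matter whether
// the tags and the data sit on one line or on several.
std::string extractBetween(const std::string& html,
                           const std::string& open,
                           const std::string& close,
                           std::size_t from = 0) {
    std::size_t start = html.find(open, from);
    if (start == std::string::npos) return "";
    start += open.size();
    std::size_t end = html.find(close, start);
    if (end == std::string::npos) return "";
    return html.substr(start, end - start);
}

// Strips leading and trailing whitespace, including line terminators.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}
```

For example, trim(extractBetween(page, "<td>", "</td>")) returns the content of the first table cell even when the opening tag, the data and the closing tag are each on a separate line.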
- Scraping an Internet discussion
Let's look at an online discussion about a series, pick one of the comments, and then examine the HTML source code of the page at that place.
From the source code analysis, it is clear that the entire comment is wrapped in a div container with the class item. Inside it, individual spans with the classes name, date, title and message contain the corresponding information: name, date with time, title and message. Based on these spans, we are able to parse the given information. This way we can get all the data that appears on the page, as we know which HTML elements it sits between.
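Parsing one comment by its spans could look roughly like this. The exact markup is an assumption: the sketch expects the form <div class="item"> and <span class="name">...</span> with double quotes and no nested spans, which may differ from the real page. The names Comment, spanText and parseComment are hypothetical helpers written in standard C++.

```cpp
#include <cassert>
#include <string>

// One scraped discussion comment.
struct Comment {
    std::string name, date, title, message;
};

// Returns the inner text of the first <span class="cls"> found at or after
// position `from`. Assumed markup: double-quoted class, no nested spans.
std::string spanText(const std::string& html, const std::string& cls,
                     std::size_t from) {
    const std::string open = "<span class=\"" + cls + "\">";
    std::size_t start = html.find(open, from);
    if (start == std::string::npos) return "";
    start += open.size();
    std::size_t end = html.find("</span>", start);
    if (end == std::string::npos) return "";
    return html.substr(start, end - start);
}

// Fills `out` from the first <div class="item"> container in `html`.
// Returns false if no such container is present.
bool parseComment(const std::string& html, Comment& out) {
    std::size_t item = html.find("<div class=\"item\">");
    if (item == std::string::npos) return false;
    out.name    = spanText(html, "name", item);
    out.date    = spanText(html, "date", item);
    out.title   = spanText(html, "title", item);
    out.message = spanText(html, "message", item);
    return true;
}
```

The sketch handles only the first comment in the buffer. To walk through all comments, the search would be repeated from the position after each processed container.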
- Scraping of the advertising portal
Advertising portals are some of the easiest targets for web scraping. Imagine a model search situation: looking for ads in the Trucks category. In addition to the data from the title of each advertisement in the search results, we can also get additional information that is not directly visible, namely a link to the entire ad. We can then use this link to scrape the body of the ad, which yields other interesting information: phone number, advertiser name, the vehicle's year of manufacture, mileage.
The source code analysis revealed other interesting information as well. We see a direct link to the body of the advertisement, which can be used for crawling, and the price of the vehicle in Slovak crowns, which is converted into euros. Furthermore, it is possible to find all the listing details (category, status, offer / purchase, date). Note the HTML entities (escaped special characters) inside the HTML elements, which require an additional conversion into readable form, such as €.
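Collecting the links to the individual ads from a search result page can be sketched as below. The function nextHref is a hypothetical helper in standard C++, and it assumes that links are written as href="..." with double quotes, which may not hold for every portal.

```cpp
#include <cassert>
#include <string>

// Finds the next href="..." attribute at or after `pos`.
// On success stores the link in `link`, moves `pos` behind it and
// returns true; returns false when no further link is found.
bool nextHref(const std::string& html, std::size_t& pos, std::string& link) {
    const std::string key = "href=\"";
    std::size_t start = html.find(key, pos);
    if (start == std::string::npos) return false;
    start += key.size();
    std::size_t end = html.find('"', start);
    if (end == std::string::npos) return false;
    link = html.substr(start, end - start);
    pos = end + 1;
    return true;
}
```

Calling nextHref in a loop with the same pos variable visits the links one by one, so each of them can be requested and scraped without storing the whole list in RAM.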
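The conversion of escaped characters can be sketched like this. The sketch assumes that the euro sign arrives as the named entity &euro; and covers only a handful of common entities; the table and the function name decodeEntities are illustrative, not a complete HTML entity decoder. It is written in standard C++ and outputs UTF-8.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Replaces a small, hand-picked set of named HTML entities with their
// readable form. Unknown entities are left untouched.
std::string decodeEntities(const std::string& in) {
    static const struct { const char* name; const char* text; } table[] = {
        {"&euro;", "\xE2\x82\xAC"},  // euro sign encoded as UTF-8
        {"&amp;",  "&"},
        {"&lt;",   "<"},
        {"&gt;",   ">"},
        {"&quot;", "\""},
        {"&nbsp;", " "},             // simplified to an ordinary space
    };
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        bool replaced = false;
        if (in[i] == '&') {
            for (const auto& e : table) {
                std::size_t len = std::strlen(e.name);
                if (in.compare(i, len, e.name) == 0) {
                    out += e.text;
                    i += len;
                    replaced = true;
                    break;
                }
            }
        }
        if (!replaced) out += in[i++];  // copy ordinary characters as-is
    }
    return out;
}
```

The input is processed in a single pass, so already decoded output is never decoded a second time.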
The mentioned platforms Arduino, ESP8266 and ESP32 can also perform a specific GET / POST request over HTTP / HTTPS (HTTPS only on the ESP8266 and ESP32). This makes it possible, for example, to submit an HTML form with the necessary input, which filters the output of the page based on that input (a keyword on advertising and search portals), or to pass a parameter so that we get the exact result we expect and can subsequently parse the data from it.
We will show an example of scraping with an HTTP POST request to the page of the Ministry of the Interior of the Slovak Republic, where it is possible to check whether a vehicle is reported stolen by entering its registration number into an HTML form.
The form contains a text entry that represents the registration number of the vehicle.
<input name="ec" id="ec" type="text">
Based on this, we can send a request to the destination web page and include the ec parameter in the body of the request. The web page processes it as form input and returns a result for it (if the given record exists).
When the request is executed, the web server responds with an HTML page that contains information about the record. If there is a positive record, value = 1, the page also lists other information about the vehicle, which we can process with the scraper. This includes the make, model and color of the vehicle, the date and time of the theft for the given EVČ (vehicle registration number), its VIN and the like. The application running on the microcontroller also reacts to the value 1 by displaying information about the announced search and about contacting the police.
If a positive record is not found, the page prints value = 0 without further information about the vehicle. The web scraper took the searched vehicle registration number from the UART input (variable my_datas). In addition to the ec (EVČ) parameter, the form can also process the vehicle's VIN under the vin parameter. The output HTML page is the same, including the vehicle data in the case of a positive result for a stolen vehicle.
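Assembling such a POST request by hand can be sketched as follows. The host and path used in the example call are placeholders, not the real address of the ministry's form, and the helpers urlEncode and buildPostRequest are hypothetical names. The code is standard C++ so that it can be tried on a PC.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Percent-encodes a form value so that it is safe inside a request body.
std::string urlEncode(const std::string& value) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : value) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Builds a complete HTTP/1.1 POST request with a form-encoded body.
std::string buildPostRequest(const std::string& host,
                             const std::string& path,
                             const std::string& body) {
    std::string req;
    req += "POST " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Content-Type: application/x-www-form-urlencoded\r\n";
    req += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n";  // empty line separates headers from the body
    req += body;
    return req;
}
```

A call such as buildPostRequest("example.com", "/check", "ec=" + urlEncode(number)) produces the text of the request. On the microcontroller this text would be written with print() to an already connected WiFiClient or EthernetClient.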
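The reaction of the application to both outcomes can be sketched like this. How exactly the page marks the result is an assumption here, so the two marker strings are passed in as parameters instead of being hard-coded. SearchResult, classifyResponse and userMessage are hypothetical names, and the message texts are only illustrative.

```cpp
#include <cassert>
#include <string>

enum class SearchResult { Stolen, NotFound, Unknown };

// Classifies the response page by looking for a positive or a negative
// marker. The markers must be taken from the real page source.
SearchResult classifyResponse(const std::string& html,
                              const std::string& positiveMarker,
                              const std::string& negativeMarker) {
    if (html.find(positiveMarker) != std::string::npos) return SearchResult::Stolen;
    if (html.find(negativeMarker) != std::string::npos) return SearchResult::NotFound;
    return SearchResult::Unknown;  // neither marker found, e.g. an error page
}

// Text that the application could print to the UART for each outcome.
std::string userMessage(SearchResult result) {
    switch (result) {
        case SearchResult::Stolen:
            return "Vehicle is in the search database. Contact the police.";
        case SearchResult::NotFound:
            return "No record found for this vehicle.";
        default:
            return "Response could not be evaluated.";
    }
}
```

Keeping a separate Unknown state avoids reporting a vehicle as clean when the request failed or the page layout changed.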
For the socket connection, it is possible to start from the built-in WiFiClient / EthernetClient examples in the Arduino IDE for all mentioned platforms. The web scraper should be written as efficiently as possible in terms of memory usage.
In my test implementations, I used the String class for loading, which is dynamic and can hold text of variable length. The string is stored in RAM. Since the Arduino Uno has only 2 kB of RAM, the source code has to be loaded in smaller pieces. The ESP8266 has 96 kB of RAM, the ESP32 over 500 kB. The running application itself also takes up a certain percentage of that memory.
Part of the web scraper implementation prints output to the UART based on the content of lines and terminators and extracts information from the website: type, make, vehicle model, etc. The source is read individually line by line, i.e. up to each line terminator, continuing onto the subsequent line if the information was one line lower in the HTML code.
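One way to keep memory usage bounded is to process the page character by character and keep only the text between two markers. The class below is a sketch of that idea and not the implementation used in my tests; StreamExtractor and its methods are hypothetical names. It is written in standard C++, and on the microcontroller the characters would come from the read() method of the client.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Extracts the text between `open` and `close` markers from a character
// stream while holding at most maxCapture + close.size() characters.
class StreamExtractor {
public:
    StreamExtractor(std::string open, std::string close, std::size_t maxCapture)
        : open_(std::move(open)), close_(std::move(close)),
          maxCapture_(maxCapture) {}

    // Feeds one character. Returns true when a capture has just completed;
    // value() then holds the captured text until the next opening marker.
    bool feed(char c) {
        if (!capturing_) {
            tail_ += c;  // sliding window as long as the opening marker
            if (tail_.size() > open_.size()) tail_.erase(0, 1);
            if (tail_ == open_) {
                capturing_ = true;
                tail_.clear();
                captured_.clear();
            }
            return false;
        }
        captured_ += c;
        if (endsWith(captured_, close_)) {
            captured_.erase(captured_.size() - close_.size());
            capturing_ = false;
            return true;
        }
        if (captured_.size() >= maxCapture_ + close_.size()) {
            captured_.clear();  // content too long: drop it, search again
            capturing_ = false;
        }
        return false;
    }

    const std::string& value() const { return captured_; }

private:
    static bool endsWith(const std::string& s, const std::string& suffix) {
        return s.size() >= suffix.size() &&
               s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
    }

    std::string open_, close_, tail_, captured_;
    std::size_t maxCapture_;
    bool capturing_ = false;
};
```

With a small maxCapture the extractor fits even into the 2 kB of the Arduino Uno, at the cost of silently skipping cells whose content is longer than the limit.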
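The line-by-line approach, including the case where the value sits one line below its label, can be sketched as follows. The markup and the labels in the example are assumptions and the function names stripTags and valueForLabel are hypothetical. The sketch uses standard C++ with std::getline; on the microcontroller the lines could be read with readStringUntil('\n') on the client instead.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Removes everything between '<' and '>' and trims surrounding whitespace.
std::string stripTags(const std::string& line) {
    std::string out;
    bool inTag = false;
    for (char c : line) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) out += c;
    }
    const char* ws = " \t\r\n";
    std::size_t b = out.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = out.find_last_not_of(ws);
    return out.substr(b, e - b + 1);
}

// Reads `html` line by line and returns the text that follows `label`,
// either on the same line or, if nothing is left there, on the next line.
std::string valueForLabel(const std::string& html, const std::string& label) {
    std::istringstream in(html);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = line.find(label);
        if (pos == std::string::npos) continue;
        std::string value = stripTags(line.substr(pos + label.size()));
        if (value.empty() && std::getline(in, line)) {
            value = stripTags(line);  // information was one line lower
        }
        return value;
    }
    return "";  // label not present
}
```

Only one line is held in memory at a time, which suits the limited RAM of these boards, but a value that spans more than one following line would be cut off.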
The project could be extended with the ESP-CAM module, which contains a 2 Mpix camera. It could be used, for example, to read a vehicle's EVČ in real time and send a request to the minv page to verify whether the vehicle is stolen.
Disadvantages of web scrapers built on the Arduino, ESP8266 and ESP32 platforms:
- The scraper cannot get past CAPTCHA protection
- The scraper does not run JavaScript
- The scraper will not see data that JavaScript renders dynamically on the page (it reads only the source code delivered when the page is loaded)
- Relatively little memory for more complex and extensive texts