Topic: Web scrapper/crawler for Arduino or ESP8266 or ESP32

martinius96

Oct 10, 2018, 10:46 am Last Edit: Oct 10, 2018, 10:51 am by martinius96
I'm offering a web scraper program for embedded boards that I finished several months ago. The scraper connects to the target site specified in the program and sends the site's source code to a database, where the following information is extracted from the full source:
Telephone number
E-mail address
Product price
Product name
and so on. There are several examples for various websites with different types of information, such as e-shops selling electronics or clothing.

HTTP sites can be scraped with an Arduino and a W5100 Ethernet shield or a Wiznet W5500 Ethernet module. For HTTPS pages, an ESP8266 (e.g. NodeMCU) or an ESP32 DevKit can be used. Each board reads the page character by character and sends the source row by row to the database, where a PHP file on the server side processes it and slices out the data. A web server on the Internet or on the local network is therefore required. The PHP script extracts the data using regular expressions.
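
To give a rough idea of the server-side extraction step, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and the patterns, the `extract_fields` name, and the sample markup are all assumptions, not the expressions the program actually uses.

```python
import re

# Hypothetical patterns for illustration only; the real PHP expressions are not shown in this post.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PRICE_RE = re.compile(r"(\d+(?:[.,]\d{2})?)\s*(?:€|EUR)")

def extract_fields(html: str) -> dict:
    """Pull e-mail addresses and prices out of raw page source."""
    emails = sorted(set(EMAIL_RE.findall(html)))  # de-duplicate repeated addresses
    prices = PRICE_RE.findall(html)               # keep prices in page order
    return {"emails": emails, "prices": prices}

# Made-up sample markup standing in for the source a board would forward.
sample = (
    '<p>Contact: <a href="mailto:shop@example.com">shop@example.com</a></p>'
    '<span class="price">19,99 €</span>'
)
print(extract_fields(sample))
```

Real pages would most likely need one pattern set per site, since markup and price formats differ from shop to shop.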

Each of these boards connects to the site and reads only the source code; it cannot run client-side scripts. The board therefore does not show up in tools such as Google Analytics or Smartlook, which minimizes the risk of the crawler being banned. The program connects to the website once per hour, downloads the source code, and sends it on in parts.
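
Sending the source "in parts" could look roughly like the sketch below. It assumes fixed-size pieces, which is only one possible way to split a page; the `iter_parts` name and the 512-character default are made up for the example, and the code is Python for readability rather than board firmware.

```python
def iter_parts(source: str, max_chars: int = 512):
    """Yield consecutive pieces of `source`, each at most `max_chars` long.

    Assumption: fixed-size slicing; a real board might split on line ends instead.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    for start in range(0, len(source), max_chars):
        yield source[start:start + max_chars]

# Usage: each part would be posted to the web server in its own request.
page = "<html><body>" + "x" * 1000 + "</body></html>"
parts = list(iter_parts(page, max_chars=512))
print(len(parts), [len(p) for p in parts])
```

Small pieces keep the memory needed on the board low, at the cost of more requests per page.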

The ESP32 can also be connected to a corporate network that uses 802.1X authentication. The ESP8266 supports Wi-Fi networks with PSK encryption. I'll explain everything and show what the program does and how it works. Uploading the program takes about 20 seconds. The program also includes a watchdog that restarts the board if it stops responding. Note that the information must be present in the page's source code for the data to be retrieved.
E-mail: martinius96@gmail.com
Website: https://arduino.php5.sk
Arduino and website programmer

PaulS

Quote
Web scrapper/crawler for Arduino or ESP8266 or ESP32
I doubt that many people are interested in a web scrapper.
The art of getting good answers lies in asking good questions.

ballscrewbob

To me, a web scrape is usually more than just a few snippets: it is a LIFT of the site, x levels deep, for local offline use.

Nice idea all the same, though, and I am sure somebody will adapt it for other uses.

It may not be the answer you were looking for but it's the one I am giving based on either experience, educated guess, google or the fact that you gave nothing to go with in the first place so I used my wonky crystal ball.

martinius96

I like web-oriented solutions where you can get dynamic data from a website...
News, for instance.
Crawl every hour and save the title, date, and article.
Later you can build RSS feeds from that, and a lot more.
I think that in the near future there will be farms of devices like this that crawl websites in a short time.

A big advantage for me is that the boards don't run JavaScript, so you can visit a page and some anti-robot protections cannot tell that you visited the website with a board.
If the content is protected, you can log in with HTTP authentication :)
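
As a small illustration, assuming the site uses HTTP Basic authentication (other schemes work differently), the request header can be built like this. The function name and the credentials are placeholders, and the sketch is in Python rather than board code.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header line for HTTP Basic authentication."""
    # Basic auth sends "username:password" encoded as Base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"

# Placeholder credentials; the board would add this line to its GET request.
print(basic_auth_header("user", "pass"))
```

Basic authentication only encodes the credentials and does not encrypt them, so it is best combined with HTTPS.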
I like that a lot. Maybe I will add some code in the future...
For instance, I have now finished a scraper (PHP, server side) that uses a regular expression to find e-mails in these formats:

xl97

I'm a bit confused... (so please explain)..

Why?

If you need a webserver anyways... what is the benefit of using an Arduino/ESP8266..etc?

Why not just have the same PHP program on your web server reach out and scrape?

Also.. your 'scrape' seems very particular? name/phone/email.. your scraping conditions/criteria would most likely need to be tweaked for each use...no? (maybe I'm not understanding it all correctly?)..

Are you scraping (taking inventory of) a whole site? i.e. all its source code/markup? Or just specifically targeting DOM elements from known sites you plan on scraping?



Maybe point out WHY this is a good idea to use?  Or why to use this over a regular web app?


I see you posted this:

"boards aren't running Javascript"... which is ok.. but nowadays most sites won't allow you to do much WITHOUT having JavaScript enabled..


example article:
https://developers.slashdot.org/story/18/11/01/010259/google-wont-let-you-sign-in-if-you-disabled-javascript-in-your-browser

There are already many headless/browserless devices out there that scrape pages..etc..
