Web scraping relies on the HTML structure of a page, so it can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading this article: by the time you read it, the CSS selectors used here may already be outdated.
Almost every PHP developer has scraped some data from the Web at some point. Often we need data that is available only on some website, and we want to pull it out and save it somewhere. It looks like we open a browser, walk through the links and copy the data we need. But the same thing can be automated with a script. In this tutorial, I will show you how you can increase the speed of your scraper by making requests asynchronously.
The Task
We are going to create a simple web scraper that parses movie information from an IMDB movie page:
Here is an example of the Venom movie page. We are going to request this page to get:
- title
- description
- release date
- genres
IMDB doesn’t provide any public API, so if we need this kind of information we have to scrape it from the site.
Why should we use ReactPHP and make requests asynchronously? The short answer is speed. Let’s say we want to scrape all movies from the Coming Soon page: 12 pages, one for each month of the upcoming year. Each page has approximately 20 movies, so in total we are going to make around 240 requests. Making these requests one after another can take some time…
And now imagine that we can run these requests concurrently. This way, the scraper is going to be significantly faster. Let’s try it.
Set Up
Before we start writing the scraper we need to download the required dependencies via composer.
We are going to use an asynchronous HTTP client called buzz-react, a library written by Christian Lück. It is a simple PSR-7 HTTP client for the ReactPHP ecosystem.
For traversing the DOM I’m going to use the Symfony DomCrawler component, together with its CSS-selector component, which allows us to use jQuery-like selectors for traversing.
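All three packages are published on Packagist, so pulling them in would look like this:

```bash
composer require clue/buzz-react
composer require symfony/dom-crawler
composer require symfony/css-selector
```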
Now, we can start coding. This is our start:
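A minimal bootstrap sketch, using the class names from the library versions that were current when this article was written (React\EventLoop\Factory and Clue\React\Buzz\Browser):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use React\EventLoop\Factory;

// The event loop drives all asynchronous work.
$loop = Factory::create();

// The asynchronous PSR-7 HTTP client.
$client = new Browser($loop);
```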
We create an instance of the event loop and the HTTP client. The next step is making requests.
Making Request
The public interface of the client’s main Clue\React\Buzz\Browser class is very straightforward. It has a set of methods named after HTTP verbs: get(), post(), put() and so on. Each method returns a promise. In our case, to request a page we can use the get($url, $headers = []) method:
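A sketch that fetches the movie page and prints the raw HTML (tt1270797 is the IMDB ID of Venom):

```php
use Psr\Http\Message\ResponseInterface;

$client->get('http://www.imdb.com/title/tt1270797/')
    ->then(function (ResponseInterface $response) {
        // The promise fulfills with a PSR-7 response; print its body.
        echo (string) $response->getBody();
    });

$loop->run();
```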
The code above simply outputs the requested page on the screen. When a response is received, the promise fulfills with an instance of Psr\Http\Message\ResponseInterface. So, we can handle the response inside a callback and then return processed data as a resolution value from the promise.
Unlike the ReactPHP HttpClient, clue/buzz-react buffers the response and fulfills the promise once the whole response is received. This is the default behavior, and you can change it if you need streaming responses.
So, as you can see, the whole process of scraping is very simple:
- Make a request and receive the promise.
- Add a fulfillment handler to the promise.
- Inside the handler traverse the response and parse the required data.
- If needed repeat from step 1.
Traversing DOM
The page that we need doesn’t require any authorization. If we look at the source of the page, we can see that all the data we need is already available in the HTML. The task is very simple: no authorization, form submissions or AJAX calls. Sometimes analyzing the target site takes several times longer than writing the scraper itself, but not this time.
After we have received the response, we are ready to start traversing the DOM. And here the Symfony DomCrawler comes into play. To start extracting information we need to create an instance of the Crawler. Its constructor accepts an HTML string:
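A sketch of building the crawler inside the fulfillment handler, continuing the request from the previous section:

```php
use Symfony\Component\DomCrawler\Crawler;

$client->get('http://www.imdb.com/title/tt1270797/')
    ->then(function (ResponseInterface $response) {
        // Build a crawler from the raw HTML of the response body.
        $crawler = new Crawler((string) $response->getBody());
    });
```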
Inside the fulfillment handler, we create an instance of the Crawler and pass it the response cast to a string. Now, we can start using jQuery-like selectors to extract the required data from the HTML.

Title
The title can be taken from the h1 tag:
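A sketch (as the disclaimer at the top of this article says, the exact selectors may have changed since):

```php
$title = trim($crawler->filter('h1')->text());
```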
The filter() method is used to find an element in the DOM, and then we extract the text from it. The same line in jQuery looks very similar:
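Something like:

```javascript
var title = $('h1').text().trim();
```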
Genres And Description
Genres are taken as the text contents of the corresponding links:
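A sketch, assuming the itemprop-based markup the page exposed at the time (again, the selector may be outdated):

```php
// extract() pulls values from every matched node at once.
$genres = $crawler->filter('[itemprop="genre"] a')->extract(['_text']);
```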
The extract() method is used to extract attribute and/or node values from the list of nodes. In the ->extract(['_text']) call, the special attribute _text represents a node’s text value. The description is also taken as the text value of the appropriate tag:
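A sketch along the same lines (the itemprop selector is assumed from the old markup):

```php
$description = trim($crawler->filter('[itemprop="description"]')->text());
```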
Release Date
Things become a little tricky with the release date:
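The markup looked roughly like this (a simplified reconstruction; IMDB has redesigned this block many times since):

```html
<div class="txt-block">
  <h4 class="inline">Release Date:</h4>
  16 February 2018 (USA)
  <span class="see-more inline">See more »</span>
</div>
```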
As you can see, it is inside a <div> tag, but we cannot simply extract the text from it. If we did, the release date would come out as Release Date: 16 February 2018 (USA) See more », and this is not what we need. Before extracting the text from this DOM element, we need to remove all tags inside of it:
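A sketch of that cleanup (the #titleDetails .txt-block selector is assumed from the old Details-section markup):

```php
use Symfony\Component\DomCrawler\Crawler;

// Strip all child tags from each details <div>, leaving only plain text.
$crawler->filter('#titleDetails .txt-block')->each(function (Crawler $div) {
    foreach ($div->children() as $child) {
        $child->parentNode->removeChild($child);
    }
});

// The release date block is the fourth one (index 3).
$releaseDate = trim($crawler->filter('#titleDetails .txt-block')->eq(3)->text());
```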
Here we select all <div> tags from the Details section. Then we loop through them and remove all their child tags, which leaves our <div>s free of inner markup. To get the release date we select the fourth element (at index 3) and grab its text (now free from other tags).
The last step is to collect all this data into an array and resolve the promise with it:
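Returning a value from the then() handler fulfills the chained promise with it, so the end of the handler is simply:

```php
return [
    'title'        => $title,
    'genres'       => $genres,
    'description'  => $description,
    'release_date' => $releaseDate,
];
```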
Collect The Data And Continue Synchronously
Now, it’s time to put all the pieces together. The request logic can be extracted into a function (or class), so we can feed different URLs to it. Let’s extract a Scraper class:
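A sketch of the class; the private parseHtml() helper (that name is mine, not from the original code) wraps the DOM-traversal logic from the previous sections:

```php
use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

final class Scraper
{
    private $client;

    private $scraped = [];

    public function __construct(Browser $client)
    {
        $this->client = $client;
    }

    public function scrape(array $urls)
    {
        $this->scraped = [];

        foreach ($urls as $url) {
            $this->client->get($url)->then(
                function (ResponseInterface $response) {
                    $this->scraped[] = $this->parseHtml((string) $response->getBody());
                });
        }
    }

    public function getMovieData()
    {
        return $this->scraped;
    }

    private function parseHtml($html)
    {
        $crawler = new Crawler($html);

        // ... the extraction logic from the previous sections ...

        return [
            'title' => trim($crawler->filter('h1')->text()),
        ];
    }
}
```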
It accepts an instance of the Browser as a constructor dependency. The public interface is very simple and consists of two methods: scrape(array $urls) and getMovieData(). The first one does the job: it runs the requests and traverses the DOM. The second one is just for receiving the results when the job is done.
Now, we can try it in action. Let’s try to asynchronously scrape two movies:
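A sketch of the client code (Venom plus one more IMDB title page, picked here purely as an example):

```php
$loop = \React\EventLoop\Factory::create();

$scraper = new Scraper(new \Clue\React\Buzz\Browser($loop));
$scraper->scrape([
    'http://www.imdb.com/title/tt1270797/', // Venom
    'http://www.imdb.com/title/tt4154756/', // another movie page, as an example
]);

// Run the loop until all pending requests are done.
$loop->run();

print_r($scraper->getMovieData());
```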
In the snippet above we create a scraper and provide it an array of two URLs for scraping. Then we run the event loop. It runs as long as it has something to do (until our requests are done and we have scraped everything we need). As a result, instead of waiting for the sum of all requests, we only wait for the slowest one. The output will contain the parsed data for both movies.
You can continue with these results as you like: store them in different files or save them into a database. The main idea of this tutorial was to show how to make asynchronous requests and parse responses.
Adding Timeout
Our scraper can also be improved by adding a timeout. What if the slowest request becomes too slow? Instead of waiting for it, we can provide a timeout and cancel all slow requests. To implement request cancellation we will use event loop timers. The idea is the following:
- Get the request promise.
- Create a timer.
- When the timer fires, cancel the promise.
Now, we need an instance of the event loop inside our Scraper. Let’s provide it via constructor:
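The updated constructor could look like this (a sketch; LoopInterface here is React\EventLoop\LoopInterface, and the loop is stored for use in scrape() below):

```php
private $loop;

public function __construct(Browser $client, LoopInterface $loop)
{
    $this->client = $client;
    $this->loop = $loop;
}
```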
Then we can improve scrape() method and add optional parameter $timeout:
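A sketch, using the loop’s addTimer() and the cancel() method of ReactPHP promises:

```php
public function scrape(array $urls, $timeout = 5)
{
    $this->scraped = [];

    foreach ($urls as $url) {
        $promise = $this->client->get($url)->then(
            function (ResponseInterface $response) {
                $this->scraped[] = $this->parseHtml((string) $response->getBody());
            });

        // When the timer fires, cancel the request if it is still pending.
        $this->loop->addTimer($timeout, function () use ($promise) {
            $promise->cancel();
        });
    }
}
```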
If no $timeout is provided, we use the default of 5 seconds. When the timer fires, it tries to cancel the provided promise. This way, all requests that last longer than 5 seconds will be cancelled. If the promise is already settled (the request is done), the cancel() method has no effect.
For example, if we don’t want to wait longer than 3 seconds the client code is the following:
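That is, we just pass the timeout as the second argument:

```php
$scraper->scrape($urls, 3); // cancel anything that takes longer than 3 seconds
```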
A Note on Web Scraping: some sites don’t like being scraped. Scraping data for personal use is generally OK, but you should always scrape politely. Try to avoid making hundreds of concurrent requests from one IP. The site may not like it and may block your scraper. To avoid this and to improve your scraper, read the next article about throttling requests.
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.
We’d like to continue the sequence of our posts about the Top 5 Popular Libraries for Web Scraping in 2020 with a new programming language - JavaScript.
JS is a well-known language with wide adoption and strong community support. It can be used for both client-side and server-side scripting, which makes it well suited for writing your scrapers and crawlers.
Most of these libraries’ advantages can be obtained by using our API, and some of these libraries can be used in a stack with it.
So let’s check them out.
The 5 Top JavaScript Web Scraping Libraries in 2020
1. Axios
Axios is a promise-based HTTP client for the browser and Node.js. But why exactly this library? There are a lot of libraries that can be used instead of the well-known request: got, superagent, node-fetch. But Axios is a suitable solution not only for Node.js but for client-side usage too.
Simplicity of usage is shown below:
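A minimal sketch of fetching a page (the URL is a placeholder):

```javascript
const axios = require('axios');

axios.get('https://example.com')
    .then(response => {
        // response.data holds the response body (the HTML of the page here).
        console.log(response.data);
    })
    .catch(error => {
        console.error(error.message);
    });
```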

Promises are cool, aren’t they?
To get this library, use whichever of the standard package managers you prefer:
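The commands below are the ones listed in the Axios README:

```bash
# npm
npm install axios

# bower
bower install axios

# yarn
yarn add axios
```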
GitHub repository: https://github.com/axios/axios
2. Cheerio
Cheerio implements a subset of core jQuery. In simple words, you can just swap your jQuery environment for Cheerio when web scraping. And guess what? It has the same benefit that Axios has: you can use it from the client and from Node.js as well.
For a sample of usage, you can check another of our articles: Amazon Scraping. Relatively easy.
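For a quick taste, here is a minimal sketch close to the one in the Cheerio README:

```javascript
const cheerio = require('cheerio');

// Load an HTML fragment and query it with jQuery-like selectors.
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

console.log($('h2.title').text()); // "Hello world"
```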
Also, check out the docs:
- Official docs URL: https://cheerio.js.org/
- GitHub repository: https://github.com/cheeriojs/cheerio
3. Selenium
Selenium is a popular Web Driver that has wrappers for most programming languages. Quality assurance engineers, automation specialists, developers, data scientists: all of them have used this perfect tool at least once. For web scraping it’s like a Swiss Army knife, with no additional libraries needed. Any action can be performed with a browser just like a real user would: opening a page, clicking a button, filling in a form, solving a CAPTCHA and much more.
Selenium may be installed via npm with:
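The JavaScript bindings live in the selenium-webdriver package:

```bash
npm install selenium-webdriver
```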
And the usage is simple too:
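A sketch with the JavaScript bindings that opens a page in Chrome and reads its title (the URL is a placeholder; a matching chromedriver is assumed to be available):

```javascript
const { Builder } = require('selenium-webdriver');

(async () => {
    // Spin up a real Chrome instance driven via WebDriver.
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://example.com');
        console.log(await driver.getTitle());
    } finally {
        await driver.quit();
    }
})();
```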
More info can be found via the documentation:
- Official docs URL: https://selenium-python.readthedocs.io/
- GitHub repository: https://github.com/SeleniumHQ/selenium
4. Puppeteer
There are a lot of things we can say about Puppeteer: it’s a reliable and production-ready library with great community support. Basically, Puppeteer is a Node.js library that offers a simple and efficient API for controlling Google’s Chrome or Chromium browser. So you can run a particular site’s JavaScript (just as with Selenium) and scrape single-page applications built with Vue.js, React.js, Angular, etc.
We have a great example of using Puppeteer for scraping an Angular-based website; you can check it here: AngularJS site scraping. Easy deal?
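A minimal sketch of the open-a-page-and-read-its-title flow (the URL is a placeholder):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // By this point a real Chromium instance has executed the page's JavaScript.
    console.log(await page.title());
    await browser.close();
})();
```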
Also, we’d like to suggest you check out a great curated list of awesome Puppeteer resources: https://github.com/transitive-bullshit/awesome-puppeteer
As well, useful official resources:
- Official docs URL: https://developers.google.com/web/tools/puppeteer
- GitHub repository: https://github.com/GoogleChrome/puppeteer
5. Playwright
Not as well-known a library as Puppeteer, but it can be called Puppeteer 2, since Playwright is maintained by former Puppeteer contributors. Unlike Puppeteer, it supports Chromium, WebKit and Firefox backends.
To install it just run the following command:
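Installing the playwright package also downloads the browser binaries it drives:

```bash
npm install playwright
```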
To be sure that the API is pretty much the same, you can take a look at the example below:
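A sketch mirroring the Puppeteer example above (the URL is a placeholder):

```javascript
const { chromium } = require('playwright');

(async () => {
    // Nearly the same API as Puppeteer, but with a choice of browser engines.
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(await page.title());
    await browser.close();
})();
```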
- Official docs URL: https://github.com/microsoft/playwright/blob/master/docs/README.md
- GitHub repository: https://github.com/microsoft/playwright
Conclusion
It’s always up to you to decide what to use for your particular web scraping case, but it’s also pretty obvious that the amount of data on the Internet is increasing exponentially, and data mining is becoming a crucial instrument for business growth.
But remember: instead of choosing a fancy tool that may not be of much use, you should focus on finding the tool that suits your requirements best.
