PHP: Scraping data with Goutte

By yuseferi, 17 October, 2016

We're still on our way to better data scraping. We did a great job last time, but it's time for improvements. Now I want to introduce another, even higher-level tool for data scraping - Goutte. It was originally written by Fabien Potencier, the creator of the Symfony framework, and is now maintained by FriendsOfPHP.

Instrument: Goutte

Goutte, a simple PHP Web Scraper

This library is distributed using composer, so the installation process is quite simple.

Installation

To install Goutte you just need to run:

composer require fabpot/goutte

That's it.

Imitating user behaviour

As you know, DomCrawler already handles one part of the job - retrieving data. Goutte complements it with the other part - sending data. Let's look at how the two work together.

First of all, we need to create our Client:

$client = new \Goutte\Client();

With DomCrawler we can select links on a page:

$crawler = $client->request('GET', $url);
$link = $crawler->selectLink('Link text')->link();

With Goutte we can follow them:

$pageCrawler = $client->click($link);

The `click` method works in one of two ways:

1) For an ordinary link, it follows the link and returns the new page

2) If the object you pass is a form rather than a plain link, it submits the form and returns the new page

This is pretty convenient.

The next example covers buttons and forms:

$buttonCrawler = $crawler->selectButton('Button text');

You can't click this button directly, but you can find its parent form, then fill it in and submit it:

$form = $buttonCrawler->form();
$pageCrawler = $client->submit(
    $form,
    [
        'param1' => 'value1',
        'param2' => 'value2',
    ]
);

Goutte also supports history like browsers do. You can imitate the `back` and `forward` browser actions, too:

$pageCrawler = $client->back();
$pageCrawler = $client->forward();

What if the target site blocks frequent requests from the same IP?

When we scrape a lot of data from the same site, we can get banned for a number of reasons. Here are just a few of them:

- a lot of requests that follow one after another

- a lot of requests from the same IP

If you're lucky enough and the problem is only in timing, then you can easily solve it by adding a random delay to your requests:

sleep(mt_rand(1, 3));
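Whole-second pauses are fairly coarse. As a sketch of a finer-grained variant of the same idea, the helpers below pick a random pause in microseconds and then sleep for it. The names `randomDelayMicroseconds()` and `pauseBetweenRequests()` and the default 1-3 second bounds are assumptions for illustration; they are not part of Goutte.

```php
<?php

/**
 * Hypothetical helper (not part of Goutte): pick a random pause length,
 * in microseconds, between $minSeconds and $maxSeconds inclusive.
 */
function randomDelayMicroseconds(float $minSeconds, float $maxSeconds): int
{
    if ($minSeconds < 0 || $maxSeconds < $minSeconds) {
        throw new InvalidArgumentException('Expected 0 <= min <= max.');
    }

    $min = (int) round($minSeconds * 1000000);
    $max = (int) round($maxSeconds * 1000000);

    return mt_rand($min, $max);
}

/**
 * Hypothetical helper: sleep for a random time within the given bounds.
 */
function pauseBetweenRequests(float $minSeconds = 1.0, float $maxSeconds = 3.0): void
{
    $delay = randomDelayMicroseconds($minSeconds, $maxSeconds);

    // Sleep whole seconds and the remainder separately, because usleep()
    // above one second may not be supported on every operating system.
    sleep(intdiv($delay, 1000000));
    usleep($delay % 1000000);
}
```

Calling `pauseBetweenRequests()` before each request then spreads the requests out less predictably than a fixed delay would.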

If they block you by IP, you can still handle this problem. You can create a list of proxies and use them in your clients:

$proxies = [
    //...
];
$goutteClient
    ->getClient()
    ->setDefaultOption('proxy', $proxies[mt_rand(0, count($proxies) - 1)])
;

This way you can randomly choose a proxy server to do all requests from your client.
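The selection step on its own can be sketched in plain PHP. The helper name `pickRandomProxy()` is hypothetical, and the proxy addresses are placeholders from the documentation-only range 192.0.2.0/24, not real servers.

```php
<?php

/**
 * Hypothetical helper: return one random entry from a non-empty proxy list.
 */
function pickRandomProxy(array $proxies): string
{
    if ($proxies === []) {
        throw new InvalidArgumentException('The proxy list must not be empty.');
    }

    // array_rand() returns a random key, so this also works for lists
    // whose keys are not a gap-free 0..n-1 sequence.
    return $proxies[array_rand($proxies)];
}

// Placeholder addresses, for illustration only.
$proxies = [
    'tcp://192.0.2.10:8080',
    'tcp://192.0.2.11:8080',
    'tcp://192.0.2.12:8080',
];

$proxy = pickRandomProxy($proxies);
```

The returned value can then be passed as the `proxy` option. Picking by key with `array_rand()` also avoids the off-by-one mistakes that are easy to make with manual index arithmetic.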

How to make everything shine?

Another problem you may encounter is the time needed to scrape a lot of data. If the site you have to scrape is quite big, say more than 100-200 thousand items, then the scraping process will take a long time. We can solve this problem in a few ways:

Asynchronous requests with GuzzleHttp\Client

You can use the lower-level Guzzle client and rely on its `future` option. Let's imagine that we have to scrape data from an endpoint that takes a while to respond:

    <?php
     
    // some long processes...
    sleep(5);

For example, suppose we have to scrape data from a set of such URLs. We can dramatically reduce the time needed to complete this task using the `future` option:

$urls = [
    //...
];
$goutte = new \Goutte\Client();

// Then we need to get the Guzzle client

/** @var GuzzleHttp\Client $guzzle */
$guzzle = $goutte->getClient();

foreach ($urls as $url) {
    // All we need is to set the `future` option to `true`

    /** @var \GuzzleHttp\Ring\Future\FutureInterface $futureResponse */
    $futureResponse = $guzzle->get($url, ['future' => true]);

    $futureResponse->then(function ($response) {
        // Get data from the response and put it into a DB
    });
}

Here we just let the client know that we don't need to process the response right away, and we register callbacks to be called later. It's very similar to promises in jQuery, so you should find it easy to understand and use. You can set up callbacks for errors, too. This way all our requests are processed asynchronously and we save a lot of time.

Request Pool

If you don't like the callback approach above, there's an alternative way to do the same work. Again, we have a set of URLs and the Goutte client:

$urls = [
    //...
];
$goutte = new \Goutte\Client();

Then we need to create and prepare requests:

$requests = [];
 
/** @var GuzzleHttp\Client $guzzle */
$guzzle = $goutte->getClient();
 
foreach ($urls as $url) {
    $requests[] = $guzzle->createRequest('GET', $url);
}

And finally, we can send them all using the `Pool::batch` method:

    $responses = \GuzzleHttp\Pool::batch($guzzle, $requests);

    foreach ($responses as $response) {
        // Do whatever you want with the response
    }

The `batch` method sends all the requests we prepared asynchronously and then returns a GuzzleHttp\BatchResults object. That object is countable and iterable, so you can walk through it in a loop. Each element in the BatchResults object is a ResponseInterface (or an exception, for a request that failed).
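With hundreds of thousands of URLs you probably don't want to prepare every request up front. One way to keep things bounded, sketched here in plain PHP, is to split the URL list into fixed-size chunks and hand each chunk to the batch step in turn. The helper name `processInChunks()`, the chunk size of 100 and the example.com URLs are assumptions for illustration; in real code the callback would create the requests for its chunk and send them as one batch.

```php
<?php

/**
 * Hypothetical helper: feed $urls to $sendBatch in chunks of $chunkSize
 * and return how many chunks were processed.
 */
function processInChunks(array $urls, int $chunkSize, callable $sendBatch): int
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }

    $processed = 0;
    foreach (array_chunk($urls, $chunkSize) as $chunk) {
        $sendBatch($chunk);
        $processed++;
    }

    return $processed;
}

// Build 250 placeholder URLs.
$urls = [];
for ($i = 1; $i <= 250; $i++) {
    $urls[] = 'http://example.com/items/' . $i;
}

$chunkSizes = [];
$chunkCount = processInChunks($urls, 100, function (array $chunk) use (&$chunkSizes) {
    // Stand-in for the real work: create the requests for this chunk
    // and send them as one batch.
    $chunkSizes[] = count($chunk);
});
```

Here the 250 URLs are handled as three chunks, so only one chunk's worth of requests and responses needs to be held in memory at a time.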

You may have noticed that we dug from Goutte down into Guzzle a little, but you can still wrap Guzzle responses in a Crawler and get the benefits of both.
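To illustrate the parsing half without any dependencies, here is a sketch that uses PHP's built-in DOMDocument and DOMXPath rather than Crawler. The HTML string stands in for a response body; the markup, the `title` class and the function name `extractItemTitles()` are made up for the example.

```php
<?php

/**
 * Hypothetical helper: extract the text of every <h2 class="title"> element.
 * Uses the built-in DOM extension, not Symfony's Crawler.
 */
function extractItemTitles(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings quietly
    // and restore the previous error-handling mode afterwards.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $titles = [];
    foreach ($xpath->query('//h2[@class="title"]') as $node) {
        $titles[] = trim($node->textContent);
    }

    return $titles;
}

// Stand-in for the body of a response.
$body = '<html><body>'
    . '<div><h2 class="title"> First item </h2></div>'
    . '<div><h2 class="title">Second item</h2></div>'
    . '</body></html>';

$titles = extractItemTitles($body);
```

With Crawler the same extraction would typically be a one-line selector call, which is the convenience the wrapper buys you.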

The Process component

This component allows you to run commands as sub-processes, asynchronously. You can parallelize your task and scrape data much, much faster. I'll describe it in detail next time. If you're really interested and can't wait, you can read the official docs on the Process component.
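Until then, the underlying idea can be sketched with PHP's built-in `proc_open()` instead of the Process component: start several sub-processes first, then collect their output. The function name `runInParallel()` and the toy commands are assumptions for illustration; real workers would each scrape a slice of the URL list.

```php
<?php

/**
 * Hypothetical helper: start every command as a sub-process, then collect
 * each command's standard output. Uses proc_open(), not Symfony Process.
 */
function runInParallel(array $commands): array
{
    $running = [];

    // Start all sub-processes before reading from any of them,
    // so that they run side by side.
    foreach ($commands as $key => $command) {
        $pipes = [];
        $process = proc_open($command, [1 => ['pipe', 'w']], $pipes);
        if ($process === false) {
            throw new RuntimeException('Could not start: ' . $command);
        }
        $running[$key] = [$process, $pipes[1]];
    }

    $outputs = [];
    foreach ($running as $key => [$process, $stdout]) {
        // Blocks until this sub-process closes its output.
        $outputs[$key] = stream_get_contents($stdout);
        fclose($stdout);
        proc_close($process);
    }

    return $outputs;
}

// Toy workers: each one is a separate PHP process that prints a label.
$php = escapeshellarg(PHP_BINARY);
$outputs = runInParallel([
    'a' => $php . ' -r ' . escapeshellarg('echo "worker a";'),
    'b' => $php . ' -r ' . escapeshellarg('echo "worker b";'),
]);
```

Because every worker is started before any output is read, the total time is roughly that of the slowest worker rather than the sum of all of them.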

Pros

+ can imitate basic user actions

+ quite fast

+ supports async requests

+ doesn't require a browser

Cons

- doesn't support JavaScript

- can't take screenshots

Conclusion

As Goutte is a thin wrapper on top of a few great components, it's quite challenging to show all its strengths and API methods in such a small article. That's why I recommend you read the official Guzzle docs. With Goutte in our arsenal, we can write quite powerful scripts with minimum effort. If you need to scrape data from a site without a lot of JavaScript and related troubles, use Goutte; otherwise, use CasperJS.