How to write a simple web scraper with PHP

In this article I will help you build a simple web scraper in PHP. As a base we will use the PHP library fabpot/goutte.

To get started with our project we will install the dependency manager Composer. You can find out how to install Composer on the official website: https://getcomposer.org/

After the installation you should be able to use composer on the command line.
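
To check that it is available, you can for example print the installed version:

composer --version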

Now open a terminal in the directory of your project. To require the dependency, type in the following command.

composer require fabpot/goutte

This will create the files “composer.json”, “composer.lock” and a directory named “vendor”.
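
The generated composer.json should look roughly like this; the exact version constraint depends on the latest Goutte release at the time you run the command:

{
    "require": {
        "fabpot/goutte": "^3.2"
    }
}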

If you want to know more about Composer: Composer: It’s All About the Lock File

Now that we have the dependency installed, we create a file named index.php:

<?php

/*
 * This loads the composer auto loader so we can use 
 * classes without having to require them manually first.
 */
require __DIR__ . '/vendor/autoload.php';

Using Goutte we can now start building our web scraper. For this example we will try to get all the links from the Hacker News front page:

<?php

/*
 * This loads the composer auto loader so we can use
 * classes without having to require them manually first.
 */
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$crawler = $client->request('GET', 'https://news.ycombinator.com/');


foreach ($crawler->filter('.athing td.title > a') as $article) {
    /** @var DOMElement $article */
    echo 'Title: ' . $article->textContent . "\n";
    echo 'Link: ' . $article->getAttribute('href') . "\n\n";
}

Let’s go step by step through the code.

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

With these lines we import the classes, so we can use them without having to write out the whole namespace every time in the code.
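
To illustrate the difference, without the use statement we would have to reference the class by its fully qualified name:

// Without "use Goutte\Client;" the full namespace is needed.
$client = new \Goutte\Client();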

You can learn more about namespaces in the official PHP documentation: Using namespaces: Basics

$client = new Client();
$crawler = $client->request('GET', 'https://news.ycombinator.com/');

Next we create a new instance of the Goutte client. Then we use the request method to make an HTTP GET request to the Hacker News front page.

You can find more usage examples in the GitHub repository of Goutte: FriendsOfPHP/Goutte
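
For instance, a rough sketch of following a link with the client could look like this (the page and the link text here are just placeholders):

// Request a page, pick a link by its text and follow it.
$crawler = $client->request('GET', 'https://example.com/');
$link = $crawler->selectLink('Next page')->link();
$crawler = $client->click($link);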

The request method returns a Crawler instance containing the HTML of the response. We can use it to filter and search the HTML.

foreach ($crawler->filter('.athing td.title > a') as $article) {

We now filter the response HTML for all submitted articles by using a CSS selector.

For more examples of how to use the Crawler, see the Symfony documentation: The DomCrawler Component
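
As a small example of what the Crawler offers, a minimal sketch that collects just the titles with its each() method (this is where the Crawler import from above comes in handy) could look like this:

$titles = $crawler->filter('.athing td.title > a')->each(function (Crawler $node) {
    // Each matched node is wrapped in its own Crawler instance.
    return $node->text();
});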

In our script, the result of the filter call is an iterator which we can use in a foreach construct.

    /** @var DOMElement $article */
    echo 'Title: ' . $article->textContent . "\n";
    echo 'Link: ' . $article->getAttribute('href') . "\n\n";
}

For each article we find, we can now output its title and its link.
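
One thing to keep in mind: some entries on Hacker News (for example “Ask HN” posts) have relative links, so the href is not always a full URL. A minimal sketch for turning those into absolute links, assuming the base URL stays the same, could look like this:

$href = $article->getAttribute('href');

// Prefix relative links (e.g. "item?id=123") with the base URL.
if (strpos($href, 'http') !== 0) {
    $href = 'https://news.ycombinator.com/' . $href;
}

echo 'Link: ' . $href . "\n\n";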

To use the web scraper we can run it from the command line:

php index.php
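
If everything works, the output should look roughly like this; the actual titles and links depend on whatever is on the front page at that moment (the entries below are just placeholders):

Title: Example article title
Link: https://example.com/some-article

Title: Another example article
Link: https://news.ycombinator.com/item?id=123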

As you can see, we get a neat list with all the links from the Hacker News front page. You should now be able to create a simple web scraper for any other service on the web.

If you have any problems following this tutorial, please leave a comment and I’ll respond as soon as I can.
