Building a News Scraper With PHP

Do you want to extract data and information from news articles? Is copying and pasting that information too time-consuming? Consider building a news scraper with PHP.

You can use the news scraper to extract data from various news sites. Then, you can use the data however you like.

Before You Begin

To build a news scraper with PHP, you need PHP 5 or later installed on your computer. You should also have some experience using PHP. If you’ve never built a PHP web scraper before, you may also need to learn some basic HTML.
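Before going further, it may help to confirm that the PHP installation meets these requirements. The sketch below is one way to do that. The function name check_requirements is hypothetical, and the SimpleXMLElement check is an assumption based on the XML step this article uses to store the output.

```php
<?php
// Hypothetical helper: report whether this PHP installation has what the scraper needs.
function check_requirements()
{
    return array(
        // PHP 5 or later, as stated in the requirements above.
        'php_version_ok' => version_compare(PHP_VERSION, '5.0.0', '>='),
        // SimpleXMLElement is used to convert the scraped data to XML.
        'simplexml_available' => class_exists('SimpleXMLElement'),
    );
}

// Print one line per check.
foreach (check_requirements() as $name => $ok) {
    echo $name . ': ' . ($ok ? 'yes' : 'no') . PHP_EOL;
}
```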

You can start by looking for news articles that you want to use with your news scraper. Consider starting with a news site that you already read.

Next, decide which browser you will use to inspect the articles’ HTML. Any browser can work well, though their developer tools differ slightly in how they display the HTML.

Choose a PHP Library

When web scraping news articles, you will need two tools. First, you will use the Simple HTML DOM library to scrape the news data. This library works with PHP 5 and later, and it tolerates invalid HTML.

That way, you can use it to extract data from a news source even if its markup contains errors. You can also get the HTML contents with one line of code. It finds tags on a page of HTML with jQuery-style selectors.

Next, you will use SimpleXMLElement, a class from PHP’s built-in SimpleXML extension, to convert the PHP data into XML. Then, you will be able to save the information to a file and view it later.
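As a rough illustration of that second step, the sketch below builds a tiny XML document with SimpleXMLElement. The element names (articles, article, title), the sample headline, and the function name build_sample_xml are made up for this example.

```php
<?php
// Build a small XML document in memory. All names and values are illustrative.
function build_sample_xml()
{
    $root = new SimpleXMLElement('<?xml version="1.0"?><articles></articles>');

    // addChild() appends a child element and returns it.
    $article = $root->addChild('article');

    // Escape the value so that characters such as & produce valid XML.
    $article->addChild('title', htmlspecialchars('Markets & Weather'));

    // asXML() with no argument returns the document as a string.
    return $root->asXML();
}

echo build_sample_xml();
```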

Download HTML DOM Parser

If you haven’t used HTML DOM parser before, you will need to download it.

After the download completes, open the ZIP archive to extract its files. The archive should have a name like:

simplehtmldom_1_5.zip

After extracting it, you should see a folder named “simple_dom”, which you can use to house your news scraper while you work on it. If you use a PHP web scraper for other things, you can follow these same initial steps to get the project going.

Whether you have a Windows PC or a Mac, you can download the HTML DOM parser. It’s a great option for web scraping news articles and other information.

Create a New PHP File

Go to the “simple_dom” folder and find the file named “simple_html_dom.php”, which should be near the top of the folder. Then, in the same folder, create a file with the name:

Scraper.php

Go into the new file and add the following code to it.

<?php

require_once 'simple_html_dom.php';

Whether you want to extract a few news stories or a lot of them, you can use this new file. Depending on the information you want to record, you can create a database to classify or analyze information.

You can use PHP web scraping to extract a lot of information quickly. Then, you can use the information for whatever you want.

Extract the HTML

Next, you will use the file_get_html function to extract HTML from the news site of your choice. If you haven’t already, decide on the page you want to scrape. Then, you can open the page in a new tab in your favorite browser.

The code should look like:

<?php

require_once 'simple_html_dom.php';

// Get the HTML content from the site.
$dom = file_get_html('https://www.example.com', false);

You can replace the example URL with the address of the page you want to scrape. If you want to scrape multiple news sites, you can reuse the same code for each one and change only the URL.
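One way to reuse the same code for several sites is to loop over a list of URLs. The sketch below uses PHP’s built-in file_get_contents in place of file_get_html so that it runs without the library. The helper names fetch_html and fetch_all, the timeout, and the user-agent string are all assumptions for illustration.

```php
<?php
// Hypothetical helper: return the raw HTML for one URL, or null if the request fails.
function fetch_html($url, $timeoutSeconds = 10)
{
    // These options apply to http:// and https:// requests.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleNewsScraper/0.1',
        ),
    ));

    // Suppress the warning on failure and signal it with null instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

// Hypothetical helper: fetch several pages, keyed by URL, skipping failures.
function fetch_all(array $urls)
{
    $pages = array();
    foreach ($urls as $url) {
        $html = fetch_html($url);
        if ($html !== null) {
            $pages[$url] = $html;
        }
    }
    return $pages;
}
```

With Simple HTML DOM loaded, the same loop could call file_get_html for each URL instead and store the returned DOM objects.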

Scrape the Fields

Now it’s time to start web scraping news articles. Go to the tab with the news article you want to use. Then, right-click on that page to get a menu with a few options.

You should see the option to “Inspect” or something similar, depending on the browser you use. Click on that to bring up the HTML for the web page you’re viewing.

Once you have the HTML up, you can move your cursor throughout the HTML to highlight certain areas. Then, you can find the information that you want to scrape from the news article.

Take note of the HTML elements that contain the information you want. You can then target them by attributes such as id or class, or by tag names such as div. You’ll use these selectors to filter the information when you put it into PHP.
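To make the idea of targeting by id or class concrete, the sketch below runs against a small inline HTML sample. It uses PHP’s built-in DOMDocument and DOMXPath rather than Simple HTML DOM so that it runs on its own. The sample markup, the class and id names, and the function name extract_sample_fields are invented for this example.

```php
<?php
// Hypothetical helper: pull a headline and an author out of a sample page.
function extract_sample_fields($html)
{
    $doc = new DOMDocument();

    // Real-world pages often contain invalid HTML, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Target by class: any h1 whose class list contains "headline".
    $headline = $xpath->query(
        '//h1[contains(concat(" ", normalize-space(@class), " "), " headline ")]'
    )->item(0);

    // Target by id: the element with id="author".
    $author = $xpath->query('//*[@id="author"]')->item(0);

    return array(
        'title'  => $headline ? trim($headline->textContent) : null,
        'author' => $author ? trim($author->textContent) : null,
    );
}

// Invented sample markup.
$sample = '<div class="story"><h1 class="headline big">Local Team Wins</h1>'
        . '<span id="author">Jane Doe</span></div>';

print_r(extract_sample_fields($sample));
```

In Simple HTML DOM, the corresponding lookups would pass selector strings such as “.headline” and “#author” to the find method.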

In your PHP program, use the following code:

// Collect the scraped fields into an array.
$answer = array();

if (!empty($dom)) {
    $i = 0;

    foreach ($dom->find("[insert tag/info here]") as $divClass) {
        // Title
        foreach ($divClass->find(".title") as $title) {
            $answer[$i]['title'] = $title->plaintext;
        }

        // Rating (replace "example" with your own selector)
        foreach ($divClass->find("example") as $example) {
            $answer[$i]['rate'] = trim($example->plaintext);
        }

        // Content
        foreach ($divClass->find('div[class=text show-more__control]') as $desc) {
            // ENT_QUOTES also decodes single-quote entities such as &#39;.
            $answer[$i]['content'] = html_entity_decode($desc->plaintext, ENT_QUOTES);
        }

        $i++;
    }
}

print_r($answer);
exit;

You may need to edit certain elements, such as “[insert tag/info here]” to fit with your goals for the news scraper. In some cases, you can delete entire sections. Consider which pieces of data you want to scrape from the news site, and you can add those to the code.
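The text that comes back from a page often carries HTML entities and stray whitespace, which is why the code above calls trim and html_entity_decode. A small helper along these lines could centralize that cleanup. The name clean_text is hypothetical, and the UTF-8 encoding is an assumption about the source page.

```php
<?php
// Hypothetical helper: normalize a scraped string.
function clean_text($raw)
{
    // Decode entities, including quote entities such as &#39; and &quot;.
    $text = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

echo clean_text("  It&#39;s   a &quot;big&quot;\n story &amp; more ");
// prints: It's a "big" story & more
```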

View the Output

Next, you can view the output from your PHP coding. You can see the information you requested from the news article. Make sure it includes everything you want to scrape, and you can edit the code if you missed something.
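A short loop can make the results easier to check field by field than a raw array dump. The sketch below assumes the scraped array is a list of items with optional title and content keys, as in the scraping code. The sample data and the function name format_articles are invented.

```php
<?php
// Hypothetical helper: turn the scraped array into readable lines of text.
function format_articles(array $answer)
{
    $lines = array();
    foreach ($answer as $index => $item) {
        // Flag any field the scraper failed to capture.
        $title   = isset($item['title']) ? $item['title'] : '(missing title)';
        $content = isset($item['content']) ? $item['content'] : '(missing content)';
        $lines[] = ($index + 1) . '. ' . $title . ': ' . $content;
    }
    return implode(PHP_EOL, $lines);
}

// Invented sample data in the same shape as the scraped array.
$sample = array(
    array('title' => 'First headline', 'content' => 'First summary'),
    array('title' => 'Second headline'),
);

echo format_articles($sample) . PHP_EOL;
```

A line that shows “(missing title)” or “(missing content)” points to a selector that needs adjusting.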

Store the Output

When building a news scraper with PHP, you also need to use an XML file to store the output. You will need to use SimpleXMLElement to convert the PHP array to an XML element that you can add to the file.

You’ll use a function named “array_to_xml” and the variables “$xmlContent” and “$xml_user_info” to build the output. Here’s an example of code you can use.

// Function definition to convert an array to XML.
function array_to_xml($array, $xml_user_info) {
    foreach ($array as $key => $value) {
        if (is_array($value)) {
            // Each scraped item gets its own node, e.g. <Review0>.
            $subnode = $xml_user_info->addChild("Review$key");
            foreach ($value as $k => $v) {
                // Add the fields to the item node, not to the root.
                $subnode->addChild("$k", htmlspecialchars("$v"));
            }
        } else {
            $xml_user_info->addChild("$key", htmlspecialchars("$value"));
        }
    }

    return $xml_user_info->asXML();
}

// Create a SimpleXMLElement object to act as the root node.
$xml_user_info = new SimpleXMLElement('<?xml version="1.0"?><root></root>');

// Convert the array to XML and return the whole XML content, including the declaration.
$xmlContent = array_to_xml($answer, $xml_user_info);

If you don’t convert the PHP array to XML, you won’t be able to save and view your scraped data later. Luckily, you don’t need any special programs for the conversion.
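Before saving, it can be useful to check that the generated XML is well formed and to indent it for reading. The sketch below does both with PHP’s built-in DOMDocument. The name pretty_xml is a hypothetical helper, and the sample XML string stands in for the real converted output.

```php
<?php
// Hypothetical helper: return an indented copy of an XML string,
// or null if the string is not well-formed XML.
function pretty_xml($xml)
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false;
    $doc->formatOutput = true;

    // Collect parse errors quietly instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $loaded ? $doc->saveXML() : null;
}

// Invented sample standing in for the converted output.
$sampleXml = '<?xml version="1.0"?><root><Review0><title>Sample</title></Review0></root>';

echo pretty_xml($sampleXml);
```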

Create a New XML File

Now, you can create an XML file to store your content. You can name the file something relevant to the news article, such as the article title. Then, you can store “$xmlContent” in the new file.

Here’s how the code may look:

// Create an XML file.
$my_file = 'example.xml';
$handle = fopen($my_file, 'w') or die('Cannot open file: ' . $my_file);

// Show a success or error message based on the result of the write.
if (fwrite($handle, $xmlContent)) {
    echo 'XML file has been generated successfully.';
} else {
    echo 'XML file generation error.';
}

// Close the file handle once the write is done.
fclose($handle);

Finally, you can run the entire script and make sure the XML file looks good. If the output doesn’t have everything you want, go back through the steps and fix any issues.
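As an alternative to fopen and fwrite, PHP’s file_put_contents writes a string in one call, and simplexml_load_file can read the result back to confirm that it parses. The sketch below is one possible arrangement. The name save_xml is a hypothetical helper, and the path and sample XML are placeholders.

```php
<?php
// Hypothetical helper: write XML to a file and confirm it can be parsed again.
function save_xml($path, $xmlContent)
{
    // LOCK_EX takes an exclusive lock on the file while writing.
    if (file_put_contents($path, $xmlContent, LOCK_EX) === false) {
        return false;
    }

    // Read the file back; simplexml_load_file returns false if parsing fails.
    $previous = libxml_use_internal_errors(true);
    $parsed = simplexml_load_file($path);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $parsed !== false;
}

// Placeholder path in the system temp directory and invented sample content.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'example.xml';
$ok = save_xml($path, '<?xml version="1.0"?><root><title>Sample</title></root>');

echo $ok ? 'XML file saved.' : 'XML file could not be saved.';
```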

After Building Your News Scraper

Software and web development have moved away from manual, line-by-line coding. We live in an age of rapidly deployable websites using services like Code Capsules, resources like GitHub, and pre-built frameworks.

Against that backdrop, building a news scraper by hand with PHP can sound complicated. However, it doesn’t have to be complex if you have the right software, a working script, and a bit of PHP and HTML experience.

Once you build your first web scraper with PHP, you can create more. Then, you can scrape as many news articles and web pages as you want.

Christoph is a code-loving father of two beautiful children. He is a full-stack developer and a committed team member at Zenscrape.com – a subsidiary of saas.industries. When he isn’t building software, Christoph can be found spending time with his family or training for his next marathon.