Screen scraping your way into RSS

(Page 2 out of 3)

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds.

We'll want to generate an RSS feed from the content listed on the front page. The first step in screen scraping is getting the complete page. In PHP this can be done very easily by using implode('', file('[the url here]')), provided your web host allows it (the allow_url_fopen setting must be enabled for file() to open URLs). If you can't use file(), you'll have to use a different method of getting the page, e.g. the cURL library.
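As a rough sketch of that fallback, the helper below tries file() first and only reaches for cURL when that fails. The name fetch_page() and the 10 second timeout are assumptions made for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper (not from the article): fetch a page as one string.
function fetch_page(string $url): string
{
    $isRemote = (bool) preg_match('#^https?://#i', $url);

    // file() works for local paths, and for URLs when allow_url_fopen is on.
    if (!$isRemote || ini_get('allow_url_fopen')) {
        $lines = @file($url);
        if ($lines !== false) {
            return implode('', $lines);
        }
    }

    // Fallback: the cURL extension, if it is loaded.
    if ($isRemote && function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
        $body = curl_exec($ch);
        if ($body !== false) {
            return $body;
        }
    }

    throw new RuntimeException("Could not fetch $url");
}
```

Calling fetch_page("http://phpit.net/") would then return the page source as a single string, or throw if neither method is available.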

Now that we have the page, we can parse it using regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is a <div> wrapped around each item: <div ...>[Content Here]</div>. You can verify this yourself by going to the main page of PHPit and viewing the source.
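To make the idea concrete, here is a small sketch of collecting every occurrence of a repeating wrapper with preg_match_all(). The sample markup and the class name "item" are invented for the example; the real page will use its own tags, so the pattern has to be adapted to whatever the source shows.

```php
<?php
// Hypothetical sample markup; the class name "item" is an assumption.
$html = <<<HTML
<div class="item"><h3><a href="/a">First</a></h3></div>
<div class="item"><h3><a href="/b">Second</a></h3></div>
HTML;

// (.*?) is lazy, so each match stops at the first closing </div>.
// The s modifier lets the dot match newlines inside an item.
preg_match_all('/<div class="item">(.*?)<\/div>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured inner HTML.
$items = $matches[1];
```

A pattern like this breaks as soon as items contain nested <div> elements, which is one reason to check the source carefully before settling on a wrapper.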

Now that we have a match, we can get all the content items. The next step is to retrieve the individual pieces of information, i.e. url, title, author and text. This can be done by using some more regular expressions and str_replace() on each of the content items.
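A minimal sketch of that step on a single item might look like the following. The item markup and the helper name first_group() are made up for illustration; the point is only that each field gets its own small pattern and that a failed match is handled instead of assumed away.

```php
<?php
// Hypothetical item markup; the real structure should be checked in the page source.
$item = '<h3><a href="/tutorials/example">Example title</a></h3>'
      . '<p>Short&nbsp;teaser text.</p><span>By Jane Doe</span>';

// Return the first capture group of $pattern in $subject, or '' if there is no match.
function first_group(string $pattern, string $subject): string
{
    return preg_match($pattern, $subject, $m) === 1 ? trim($m[1]) : '';
}

$url    = first_group('/<a href="([^"]*)"/', $item);
$title  = strip_tags(first_group('/<h3>(.*?)<\/h3>/s', $item));
$text   = first_group('/<p>(.*?)<\/p>/s', $item);
$author = first_group('/By (.*?)<\/span>/s', $item);

// str_replace() is handy for tidying up leftovers, e.g. non-breaking spaces.
$text = str_replace('&nbsp;', ' ', $text);
```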

By now we have the following code:

<?php
// Screen scraping your way into RSS
// Example script, by Dennis Pallett
// http://phpit.net/tutorials/screenscrap-rss

// Get page
$url = "http://phpit.net/";
$data = implode("", file($url));

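// Get content items. The opening tag is matched loosely here ([^>]* accepts any
// attributes); tighten it to the exact tag found in the page source.
preg_match_all ("/<div[^>]*>([^`]*?)<\/div>/", $data, $matches);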

Like I said, the next step is to retrieve the individual pieces of information, but first let's make a start on our feed by setting the appropriate header (text/xml) and printing the channel information.

<?php
// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://phpit.net), screen scraped!</description>
<link>http://phpit.net</link>
<language>en-us</language>
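If you would rather build the start of the feed as a string than drop out of PHP mode, something along these lines could work. The function name rss_channel_open() is hypothetical, and the sketch assumes a plain RSS 2.0 feed without the extra namespaces; it also escapes the values so that a stray & in a title does not break the XML.

```php
<?php
// Hypothetical helper: build the opening of an RSS 2.0 feed as a string.
function rss_channel_open(string $title, string $description, string $link): string
{
    // Escape the XML special characters (&, <, >, " and ') in text nodes.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
    };

    return '<?xml version="1.0" encoding="ISO-8859-1"?>' . "\n"
         . '<rss version="2.0">' . "\n"
         . "\t<channel>\n"
         . "\t\t<title>" . $esc($title) . "</title>\n"
         . "\t\t<description>" . $esc($description) . "</description>\n"
         . "\t\t<link>" . $esc($link) . "</link>\n";
}

// header() must run before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/xml; charset=ISO-8859-1');
}
$feedStart = rss_channel_open('PHPit Latest Content', 'Tips & tricks', 'http://phpit.net');
```

Echoing $feedStart then prints the XML declaration and the opening channel elements in one go.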

Now it's time to loop through the items and print their RSS XML. We first loop through each item and get all the information we need, by using more regular expressions and preg_match(). After that the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match ("/\">([^`]*?)<\/a><\/h3>/", $match, $temp);
    $title = $temp[1];
    $title = strip_tags($title);
    $title = trim($title);

    // Second, get url (assumed pattern: the href of the item's first link)
    preg_match ("/<a href=\"([^\"]*?)\"/", $match, $temp);
    $url = $temp[1];
    $url = trim($url);

    // Third, get text (assumed pattern: the item's first paragraph)
    preg_match ("/<p>([^`]*?)<\/p>/", $match, $temp);
    $text = $temp[1];
    $text = trim($text);

    // Fourth, and finally, get author
    preg_match ("/By ([^`]*?)<\/span>/", $match, $temp);
    $author = $temp[1];
    $author = trim($author);

    // Echo RSS XML
    echo "\t\t<item>\n";
    echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
    echo "\t\t\t<link>http://phpit.net" . strip_tags($url) . "</link>\n";
    echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
    echo "\t\t\t<content:encoded><![CDATA[ \n";
    echo $text . "\n";
    echo " ]]></content:encoded>\n";
    echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "\t\t</item>\n";
}
?>
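One thing worth guarding against when printing items is scraped text that contains characters with a special meaning in XML. The sketch below wraps the item output in a hypothetical rss_item() function that decodes HTML entities, re-escapes the result for XML, and splits any "]]>" sequence so it cannot end the CDATA section early. It assumes the feed's root element declares the dc and content namespaces.

```php
<?php
// Hypothetical helper: turn one scraped item into an RSS <item> element.
function rss_item(string $title, string $link, string $text, string $author): string
{
    // Strip tags, decode HTML entities, then escape for XML,
    // so "&amp;" from the scraped page does not become "&amp;amp;".
    $esc = function (string $s): string {
        $plain = html_entity_decode(strip_tags($s), ENT_QUOTES, 'ISO-8859-1');
        return htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
    };

    // A CDATA section cannot contain "]]>", so split that sequence if it occurs.
    $cdata = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return "\t\t<item>\n"
         . "\t\t\t<title>" . $esc($title) . "</title>\n"
         . "\t\t\t<link>" . $esc($link) . "</link>\n"
         . "\t\t\t<description>" . $esc($text) . "</description>\n"
         . "\t\t\t<content:encoded><![CDATA[" . $cdata . "]]></content:encoded>\n"
         . "\t\t\t<dc:creator>" . $esc($author) . "</dc:creator>\n"
         . "\t\t</item>\n";
}

$xml = rss_item('Tom &amp; Jerry <b>x</b>', 'http://phpit.net/demo', 'a ]]> b', 'Jane Doe');
```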

And finally, the RSS file is closed off with the closing </channel> and </rss> tags.

That's all. If you put all the code together, like in the demo script, then you'll have a working RSS feed.


This article was posted on Thursday, August 19th, 2004 at 6:47 am.

4 Responses to “Screen scraping your way into RSS”

What the? Says:
January 22nd, 2006 at 3:19 am
When trying to view the code:

Security Alert! Breach has been logged for review (IP Address: xxx.xxx.xxx.xxx).

Robert M Says:
February 16th, 2006 at 5:30 pm
I’m getting the same in Firefox.

Security Alert! Breach has been logged for review (IP Address: xxx.xxx.xxx.xxx).

Jim M Says:
April 20th, 2006 at 8:44 pm
Viewing the code in Internet Explorer 6 works fine. Thanks for posting this example! :)

rigorouslogik Says:
July 25th, 2006 at 12:24 am
The code works great!
How could I scrape just meta data or links etc. into a text file?
Thanks for any input.

About the author: Dennis Pallett is the main contributor to PHPit. He owns several websites, including ASPit and Chill2Music. He is currently still studying.
