Crawling in Perl - A Quick Tutorial
CS 594/494 Fall 2000
by John Eblen
Getting Started
This tutorial shows you the basics of building Perl programs that
can access and crawl the web, using Perl's libwww packages. To use these
packages, you will need the following lines at the top of your program:
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor; # allows you to extract the links off of an HTML page.
In order to retrieve the contents of a page on the web, you can do the
following:
$contents = get($URL);
where $URL is the URL of the page. $URL should be a full address, such as
http://www.perl.com/. The entire contents of the page are
then stored in $contents. That was pretty easy, wasn't it? Perhaps
a little too easy, you might think, and you would be right. While this
method works fine in most cases, it does not allow you to specify
parameters for downloading, the most important one being a timeout
value. Thus, any program using this method can become stuck on this
particular line, waiting on a download that never completes.
By installing a Ctrl-C (SIGINT) handler, you can rig the program to stop a
download when Ctrl-C is pressed, if you don't mind watching your program as it
runs. You can also try a timeout signal (SIGALRM, set with alarm), but that
does not seem to work on our system (here at UT/CS).
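For the curious, here is a minimal sketch of the timeout-signal approach just mentioned. The helper name with_timeout is made up for this example and is not part of libwww. It assumes a Unix-like system where alarm delivers SIGALRM, and, as noted above, it may not interrupt a stuck download on every system.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of libwww): run $code, but give up after
# $seconds. Returns whatever $code returns, or undef on timeout.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result = eval {
        # die inside the handler so the eval block is exited when the alarm fires
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($seconds);
        my $value = $code->();
        alarm(0);                # finished in time: cancel the pending alarm
        $value;
    };
    alarm(0);                    # make sure no alarm is left pending
    if ($@) {
        die $@ unless $@ eq "timed out\n";   # pass along unrelated errors
        return;
    }
    return $result;
}
```

With LWP::Simple loaded, you could then write $contents = with_timeout(10, sub { get($URL) }); and check whether $contents is defined. Note that get itself also returns undef when a download fails.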
A Virtual Browser - a Better Way to Download Pages
As you might have guessed, Perl provides a more sophisticated way to
get pages. The following lines will initialize a virtual browser
that can be used for all your browsing needs and will set the timeout to
10 seconds. Note that these packages use Perl's object-oriented
features, so the syntax may look strange if you are new to Perl
or have never studied OO programming in Perl.
$browser = LWP::UserAgent->new();
$browser->timeout(10);
Experimentation shows that the actual timing out usually takes a little
longer than the amount given here, but at least the browser will
time out.
The next three lines will download the page indicated by $URL.
my $request = HTTP::Request->new(GET => $URL);
my $response = $browser->request($request);
if ($response->is_error()) {printf "%s\n", $response->status_line;}
From an OO programming perspective, here's what happens.
The first line creates a request object for $URL. The next line does
the actual downloading of that request and returns a response object.
is_error() is a method of the $response object that tells us whether
the request failed. If it did, the status_line method returns the
status code and message (for example, "404 Not Found"). If you are not
familiar with OO programming, you don't really need to understand all the
vocabulary used in this paragraph in order to use these packages.
Finally, the following retrieves the contents of the downloaded page:
$contents = $response->content();
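Putting these pieces together, a small helper might look like the sketch below. The name fetch_page is invented for this example and is not part of libwww; returning undef on failure is a design choice made here, not something the packages require.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper (not part of libwww): download $url with the given
# browser object. Returns the page contents, or undef if the request failed.
sub fetch_page {
    my ($browser, $url) = @_;
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);
    if ($response->is_error()) {
        # status_line is something like "404 Not Found"
        warn "Could not fetch $url: ", $response->status_line, "\n";
        return;
    }
    return $response->content();
}

# One browser object can be reused for every download.
my $browser = LWP::UserAgent->new();
$browser->timeout(10);
```

A call such as $contents = fetch_page($browser, $URL); then gives you either the page or undef, so the caller can simply skip pages that failed to download.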
Extracting Links From a Web Page
In order to do actual crawling, though, the above is not enough. You
need to download web pages, but you also need to be able to
follow the links on those pages. That is where the LinkExtor
module comes in. The following code parses $contents, the page downloaded
from $URL, and stores the links found on that page in an array, @links:
my ($page_parser) = HTML::LinkExtor->new(undef, $URL);
$page_parser->parse($contents)->eof;
@links = $page_parser->links;
Both $URL and $contents must be correct in order for this code to work, as
it uses $URL as the base for resolving any relative links it finds in the page.
WARNING:
If $URL names a directory, it should end in "/". Otherwise the last path
segment is treated as a file name and is replaced when relative links are resolved.
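To see why the trailing "/" matters, here is a small sketch using the URI module (assumed to be installed; libwww depends on it). The address www.example.com is only a placeholder.

```perl
use strict;
use warnings;
use URI;

# The same relative link, resolved against a base with and without the final "/".
my $with_slash    = URI->new_abs("page.html", "http://www.example.com/docs/");
my $without_slash = URI->new_abs("page.html", "http://www.example.com/docs");

print "$with_slash\n";      # http://www.example.com/docs/page.html
print "$without_slash\n";   # http://www.example.com/page.html
```

Without the slash, "docs" is taken to be a file in the top-level directory, so the relative link ends up one level too high.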
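Each element of @links is a reference to an array of the form
(tag, attribute name, URL, ...), so the first URL is at index 2.
The following code will display all of the links found on the page: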
foreach $link (@links) {print "$$link[2]\n";}
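Keep in mind that LinkExtor reports links from every kind of tag, including images. For crawling you will probably want only the href of anchor tags. The sketch below shows one way to do that; extract_hrefs is a name made up for this example, and the HTML and addresses in the usage note are placeholders.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the absolute URLs of all <a href="..."> links
# in $contents, resolving relative links against $base_url.
sub extract_hrefs {
    my ($contents, $base_url) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($contents);
    $parser->eof;

    my @hrefs;
    foreach my $link ($parser->links) {
        # Each link is [tag, attribute => URL, ...]
        my ($tag, %attrs) = @$link;
        next unless $tag eq 'a' && defined $attrs{href};
        push @hrefs, "$attrs{href}";    # quotes turn the URL object into a string
    }
    return @hrefs;
}
```

For example, calling extract_hrefs('<a href="a.html">A</a><img src="pic.gif">', 'http://www.example.com/dir/') returns the single URL http://www.example.com/dir/a.html; the image is skipped.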
Using What You've Learned
With the preceding tools, you should now be able to build a simple but
effective crawler in Perl. For any website you can begin at the main
page for that site, download it, extract the links, and repeat the
process with each of those links. You can continue until you retrieve
no more links inside that domain. Of course, the number of links
to be searched can grow exponentially, but by remembering which pages
you have already visited and capping the number of pages you fetch,
you can control the process. Be sure to restrict
the crawler to links inside the original domain, or you could
potentially head out and start crawling the entire web!
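The loop described above can be sketched roughly as follows. This is only one possible design, not the required one for your projects: the name crawl, the page limit, and the idea of passing the download routine in as a code reference are all assumptions made for this example. It stays on the starting host, follows only <a href> links, and never visits the same URL twice.

```perl
use strict;
use warnings;
use URI;
use HTML::LinkExtor;

# Hypothetical crawler sketch. $fetch is a code reference that takes a URL
# and returns the page contents, or undef on failure. Returns the list of
# URLs that were successfully downloaded, in the order visited.
sub crawl {
    my ($start_url, $fetch, $max_pages) = @_;
    my $start = URI->new($start_url)->canonical;
    my $host  = lc $start->host;

    my @queue = ($start->as_string);    # pages waiting to be downloaded
    my %seen  = ($start->as_string => 1);
    my @visited;

    while (@queue && @visited < $max_pages) {
        my $url      = shift @queue;
        my $contents = $fetch->($url);
        next unless defined $contents;
        push @visited, $url;

        my $parser = HTML::LinkExtor->new(undef, $url);
        $parser->parse($contents);
        $parser->eof;

        foreach my $link ($parser->links) {
            my ($tag, %attrs) = @$link;
            next unless $tag eq 'a' && defined $attrs{href};
            my $uri = URI->new("$attrs{href}");
            # Skip mailto:, javascript:, ftp: and so on
            next unless defined $uri->scheme && $uri->scheme =~ /^https?$/;
            # Stay inside the original host
            next unless lc($uri->host) eq $host;
            $uri->fragment(undef);      # page.html#top is the same page as page.html
            my $next = $uri->canonical->as_string;
            push @queue, $next unless $seen{$next}++;
        }
    }
    return @visited;
}
```

With LWP::Simple loaded, a call might look like @pages = crawl($URL, sub { get($_[0]) }, 50); and you can swap in a routine based on the virtual browser if you want a timeout.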
As you may have guessed, this tutorial provides an extremely sketchy,
bare bones description of the above packages. Here, you have seen only
what you would probably need for your projects and how to do only the
most essential operations. You can find much more information inside
the module files themselves that implement the packages. These
module files end in .pm. So look for Response.pm,
Request.pm, etc. Note: you can find the .pm files
on the UTK/CS system at /mix/usr/local/lib/utk_perl5.