Web Site Downloader

Introduction

WebDownloader can download a whole web site (see the note in the next paragraph) to the local disk. It duplicates the files, including HTML files and picture files, exactly as they are at the web site. You can have the WebDownloader either invoke Lynx to download individual files from the HTTP server (which is slower) or open a TCP connection with the HTTP server directly to download the files (which is faster).
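As an illustration of the direct TCP approach, here is a minimal Python sketch. It is not the WebDownloader's own code, which is not shown on this page. The function names (build_request, split_response, fetch), the use of HTTP/1.0, and the example host are all assumptions made for this sketch.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (host, port, request_bytes) for a plain HTTP/1.0 GET of url."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status_code, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0]
    status_code = int(status_line.split()[1])
    return status_code, body


def fetch(url, timeout=10.0):
    """Download url over a raw TCP socket; return (status_code, body_bytes)."""
    host, port, request = build_request(url)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # an HTTP/1.0 server closes the connection when done
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling fetch("http://www.example.com/index.html") would return the status code and the raw bytes of the file. Skipping the external Lynx process for every file is presumably what makes this route faster.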

Currently, the WebDownloader is not very intelligent: it only follows URLs. If an HTML page references an absolute URL, the WebDownloader checks whether that URL has the same machine name and domain name as the root URL. If this is true, the file is downloaded; otherwise it is not. As a result, an absolute URL that uses the IP address of the same machine as the root URL will, in some cases, not be downloaded. Absolute URLs are not changed to relative URLs, which may cause some trouble if you are browsing the files off-line. The conversion can easily be done with the sed utility of the UNIX system after all files have been downloaded. If an HTML page references a relative URL, it is always downloaded, provided that it exists at the web site. The WebDownloader therefore works best with web pages that use relative URLs consistently.
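The download rule and the off-line conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names should_download and make_relative and the example URLs are hypothetical, and Python stands in for the sed approach mentioned in the text.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def should_download(root_url, page_url, link):
    """Decide whether a link found on page_url belongs to the site at root_url."""
    target = urlsplit(urljoin(page_url, link))
    root = urlsplit(root_url)
    # A relative link resolves against page_url, so it inherits that host.
    # Host names are compared as text, so an IP address that refers to the
    # same machine does not match: the limitation described in the text.
    return target.hostname == root.hostname


def make_relative(page_url, link):
    """Rewrite an absolute same-host link as a path relative to page_url."""
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, link))
    if target.hostname != page.hostname:
        return link  # off-site links are left untouched
    page_dir = posixpath.dirname(page.path) or "/"
    # Simplification: query strings and fragments are dropped here.
    return posixpath.relpath(target.path or "/", start=page_dir)
```

For example, with the page http://www.example.com/docs/index.html, the link http://www.example.com/images/logo.gif would be rewritten to ../images/logo.gif, while a link to another host would be returned unchanged.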


Depending on how the author writes the web pages at the web site, some files may not be downloaded. For example, if there is an image file at the web site that is never referenced by an <IMG SRC> tag in any HTML page, the image file will not be downloaded. On the other hand, if an HTML page references a nonexistent local URL, or a URL that is password-protected or not readable, the WebDownloader will attempt to download it, but the downloaded file will not be correct.
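To show why an unreferenced file is never found, here is a minimal Python sketch of collecting links from a page. The class name LinkCollector, the function extract_links, and the tag table are assumptions made for this sketch. Only <IMG SRC> is named in the text above, and the table below is not the WebDownloader's actual list of followed tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the URLs carried by a small set of tag attributes."""

    # Hypothetical tag/attribute table, chosen for illustration only.
    FOLLOWED = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.FOLLOWED.get(tag)  # tag names arrive lowercased
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html_text):
    """Return the URLs referenced by followed tags, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

A file name that appears only in running text, or in no page at all, never shows up in the returned list, so a downloader driven by such a list cannot discover it.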

The HTML tags that this version of WebDownloader follows are:

Important Notice

The WebDownloader does not honor robots META tags (e.g. <META NAME="robots" CONTENT="index,follow">), and using it on a site without the prior permission of the site's owner would probably be a COPYRIGHT VIOLATION.

Source Code

You can download the source code and executables here.

The WebDownloader can be used to mirror a web site or to download web pages for off-line browsing. It can also be used to check the integrity of a web site, to get rid of unreferenced files or to detect bad links. If you are using a web cache, you can automatically pre-load the cache with frequently visited web sites. You can also easily modify the code (actually only by changing a == to !=) to turn the WebDownloader into a WebCrawler, which starts from one URL and crawls over the web by following the links. It is very interesting to see how you can start from one URL and end up at another URL that is completely unrelated.
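The integrity check mentioned above can be sketched in Python. This is an illustration only, working on an in-memory description of a site rather than on real downloads. The function name audit_site and its arguments are hypothetical and are not taken from the WebDownloader source.

```python
from collections import deque


def audit_site(files, links, start):
    """Return (unreferenced, bad_links) for a site described in memory.

    files: set of paths that exist at the site
    links: dict mapping a page path to the list of paths it references
    start: path of the root page (assumed to be in files)
    """
    reachable = set()
    bad_links = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in reachable:
            continue
        reachable.add(page)
        for target in links.get(page, []):
            if target not in files:
                bad_links.append((page, target))  # link to a missing file
            elif target not in reachable:
                queue.append(target)
    unreferenced = files - reachable  # files no followed link leads to
    return sorted(unreferenced), bad_links
```

Files that exist but are never reached from the root page are candidates for removal, and each (page, target) pair in bad_links names a page together with a link on it that points to a file that does not exist.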




Luojian Chen
Department of Computer Science
The University of Tennessee, Knoxville
Knoxville, TN 37996-1301
lchen@cs.utk.edu





Copyright © 1997 - 1999 Luojian Chen / lchen@cs.utk.edu
Last updated: Sat Jan 16 12:43:21 1999