Software and Processes
The Digital Projects Unit’s web archiving techniques.
There are two main pieces to our web archiving process: harvesting the live web content and providing access to the resulting archived files.
To display the archived content on the Web, we are migrating from the version of OpenWayback maintained by the International Internet Preservation Consortium (IIPC), which is no longer in development, to Webrecorder’s pywb. These applications rely on client-side and server-side scripts to rewrite links so that web requests are served from documents in the archive’s WARC files rather than from the live Web.
We begin by examining the site(s) that we will be archiving, looking for areas we do or do not want to capture and identifying potential crawler traps. We browse the site manually as a user might and also look at source code when crawler access appears questionable. If necessary, we write scripts to extract elusive URIs that may then be added to the crawl’s seed list.
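As an illustration of the kind of script mentioned above, here is a minimal Python sketch that pulls URIs out of a page's markup, including attributes a crawler might overlook. The class name, function name, and the set of attributes (such as `data-src`) are assumptions chosen for the example, not a description of the unit's actual scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URIs from selected tag attributes."""

    # Attributes assumed, for illustration, to carry URIs worth seeding.
    URI_ATTRS = {"href", "src", "data-src", "poster"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.uris = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URI_ATTRS and value:
                # Resolve relative references and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.uris.add(absolute)


def extract_uris(html, base_url):
    """Return a sorted list of unique URIs found in the given HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return sorted(parser.uris)


if __name__ == "__main__":
    sample = (
        '<a href="/about#team">About</a>'
        '<img data-src="img/logo.png">'
        '<a href="mailto:x@example.org">Mail</a>'
    )
    for uri in extract_uris(sample, "https://example.org/index.html"):
        print(uri)
```

The printed URIs could be appended to a seed list, one per line. Links that are built at run time by JavaScript would need a different approach, such as inspecting the requests a browser makes.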
With knowledge gained from site examination, we program our crawler to follow rules that instruct it to harvest content that we have deemed to be within our desired scope.
Next, we may do a test crawl to verify that our crawler is configured to download all of the URIs needed to render the archived site true to its live version. Completing a test capture also indicates how much time the final crawl should take, which depends on factors such as the amount of content, how it is organized, and any delays needed to keep from overwhelming the target server.
Once a crawl is complete, we create two indexes: a CDX file that indexes the downloaded items stored in the WARC files, and a second index that maps the WARC file names to their accessible locations. We then configure an instance of pywb, running on localhost, to use these indexes so that we can browse the archived website and check its quality. With the help of browser developer tools, we can discover files that were not archived in our crawl.
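The second index, which maps WARC file names to their locations, can be sketched in a few lines of Python. The tab-separated "name, then location" layout and the function names below are assumptions made for illustration; the exact format expected depends on the replay tool and its configuration.

```python
from pathlib import Path


def build_path_index(warc_dir):
    """Return sorted 'name<TAB>location' lines for WARC files under warc_dir."""
    lines = []
    for path in Path(warc_dir).rglob("*"):
        # Match both compressed and uncompressed WARC files.
        if path.is_file() and path.name.endswith((".warc", ".warc.gz")):
            lines.append(f"{path.name}\t{path.resolve()}")
    return sorted(lines)


def write_path_index(warc_dir, index_path):
    """Write the index to index_path and return the number of entries."""
    lines = build_path_index(warc_dir)
    text = "".join(line + "\n" for line in lines)
    Path(index_path).write_text(text, encoding="utf-8")
    return len(lines)
```

Sorting the lines keeps the file suitable for binary search by tools that expect a sorted index, which is a common convention for this kind of lookup file.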
If we missed desired areas of content, we modify our crawl configuration and execute another crawl to obtain the documents we lack.
Downloaded content is stored in WARC files. We do not manipulate these files once they are written, allowing us to keep a true record of a site at the time of its capture.
Current Challenges and Limitations
Although there is an active community developing and improving upon the tools and methods used in web archiving, there continues to be a common set of problems encountered during the process.
External links and externally hosted media, such as video, may be problematic to harvest since we must rely on third parties to supply the files. Even when we are able to download the media content files, some embedded media players do not function properly when replaying a site. Sites that embed media but also provide a link to a direct download of the media file help to ensure that users will be able to access these files from the archive.
When configuring a crawler for a harvest, we consider several settings, including:
- How many threads (processes) should be run at once
- How long the crawler should wait before retrievals
- How many times URIs should be retried
- Whether or not the crawler should comply with robots.txt
- In what format downloaded content should be written
The settings we apply change from crawl to crawl based on factors such as the resources and hardware we currently have available, what permission we have gained to harvest a site, any time limitations in place, and our goals for a specific capture. Two settings that remain consistent for every crawl specify our crawl operator information. Our crawler informs the web servers it visits of a URL that a webmaster noticing traffic from us may visit to read about our crawling activity. We also provide an e-mail address so that a webmaster who finds our crawler causing trouble for their servers, such as by making too many requests too quickly, can contact us about the issue.
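The settings listed above, together with the operator information, can be pictured as a small configuration object. The following Python sketch is illustrative only: the class, field names, default values, and the `example-crawler` name are all hypothetical and do not correspond to any particular crawler's configuration format. `From` is a standard HTTP request header for a contact e-mail address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSettings:
    # Operator information: required for every crawl.
    operator_url: str
    operator_email: str
    # Per-crawl settings; the defaults here are placeholders, not recommendations.
    threads: int = 5
    delay_seconds: float = 2.0
    max_retries: int = 3
    obey_robots: bool = True
    output_format: str = "warc"

    def __post_init__(self):
        if not self.operator_url.startswith(("http://", "https://")):
            raise ValueError("operator_url must be an http(s) URL")
        if "@" not in self.operator_email:
            raise ValueError("operator_email must be an e-mail address")
        if self.threads < 1 or self.delay_seconds < 0 or self.max_retries < 0:
            raise ValueError("threads, delay_seconds, or max_retries out of range")

    def request_headers(self, crawler_name="example-crawler"):
        """Headers that identify the crawl operator to the servers visited."""
        return {
            "User-Agent": f"{crawler_name} (+{self.operator_url})",
            "From": self.operator_email,
        }


if __name__ == "__main__":
    settings = CrawlSettings(
        operator_url="https://example.org/crawl-info",
        operator_email="crawl@example.org",
        delay_seconds=5.0,
    )
    print(settings.request_headers())
```

Making the operator fields required while leaving the rest as overridable defaults mirrors the distinction drawn above between settings that vary per crawl and those that never change.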
During configuration, we also define scope rules for the crawler to follow. Some of our most commonly applied rules are:
- Accept URIs based on SURT prefix
- Accept and reject URIs based on regular expressions
- Reject URIs based on too many path segments (potential crawler trap)
- Accept URIs based on number of hops from seed
- Use a transclusion rule that accepts embedded content hosted by otherwise out-of-scope domains
- Accept a URI based on an in-scope page linking to it
- Use a prerequisite rule that accepts otherwise out-of-scope URIs that are required to get something that is in scope
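To make a few of these rules concrete, here is a simplified Python sketch of a scope check that combines a SURT prefix test, a hop limit, a path-segment limit, and regular-expression rejects. The function names, the order in which rules are applied, and the simplified SURT conversion (which ignores ports, user info, and query strings) are assumptions for illustration; a real crawler's rule engine is more elaborate, and the transclusion and prerequisite rules are not modeled here.

```python
import re
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT form: http://www.example.org/a -> http://(org,example,www,)/a"""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


def in_scope(uri, hops, *, surt_prefixes, reject_patterns,
             max_path_segments, max_hops):
    """Return True if the URI passes every rule in this simplified scope."""
    # Accept only URIs whose SURT form starts with an allowed prefix.
    surt = to_surt(uri)
    if not any(surt.startswith(prefix) for prefix in surt_prefixes):
        return False
    # Reject URIs discovered too many hops away from a seed.
    if hops > max_hops:
        return False
    # Reject very deep paths, a common sign of a crawler trap.
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if len(segments) > max_path_segments:
        return False
    # Reject anything matching an explicit exclusion pattern.
    if any(pattern.search(uri) for pattern in reject_patterns):
        return False
    return True


if __name__ == "__main__":
    rules = dict(
        surt_prefixes=["http://(org,example,"],
        reject_patterns=[re.compile(r"/calendar/")],
        max_path_segments=10,
        max_hops=5,
    )
    for candidate in ("http://www.example.org/docs/a.html",
                      "http://example.org/calendar/2020/01",
                      "http://other.net/"):
        print(candidate, in_scope(candidate, 1, **rules))
```

Because the host labels are reversed in SURT form, the single prefix `http://(org,example,` covers `example.org` and all of its subdomains, which is what makes prefix matching convenient for defining scope.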