Research

The Digital Libraries’ recent research related to web archiving.

Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives, 2022

The UNT Libraries, in partnership with the University of Illinois Chicago’s Computer Science Department, received IMLS support for an applied research grant with the long-term objective of improving access to digital resources housed in web archives. This applied research project will build on findings from a previously funded IMLS research grant (LG-71-17-0202-17) that was a first effort in training machine learning models to help identify high-value documents and publications within web archives. This project seeks to incorporate existing bibliographic metadata related to state government document collections to better train machine learning models and reduce human effort, as the process is still time-consuming and requires highly trained content curators.

Programmatic Extraction of ‘Documents’ from Web Archives, 2017

The UNT Libraries and the University of Illinois Chicago’s Department of Computer Science received IMLS support under the National Digital Platform category for a two-year research project to evaluate the use of machine learning algorithms to identify and extract publications contained in existing Web archives. Identifying these documents will empower libraries, archives, and museums to meet their curatorial missions.

Current Quality Assurance Practices in Web Archiving, 2014

UNT’s team surveyed people and institutions involved in web archiving to understand the current climate and future needs for quality assurance. For more information, see our paper and presentation.

Classification of the End-of-Term Archive: Extending Collection Development to Web Archives (eotcd), 2010-2012

In this project, funded by the Institute of Museum and Library Services (IMLS), UNT partnered with the Internet Archive to investigate innovative solutions allowing libraries to better characterize, identify, and select archived Web materials in accordance with their collection development policies. The project used the SuDocs system to classify the materials in the 2008–2009 End-of-Term (EOT) Archive, collected by UNT and its partners, which represents the entirety of the federal government’s public Web presence immediately before and after the 2009 change in presidential administrations. The project also identified metrics to translate measurable units for selected materials in Web archives to units more familiar to libraries and more recognizable by university administrators. For more information, see the eotcd final report and archived web site.

The Web-at-Risk, 2004-2007

Funded by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, this project was a collaborative effort of the California Digital Library, UNT, and New York University to develop tools to enable curators to build collections of web-published materials. The project produced the Web Archiving Service as well as significant research in needs assessment. Many of the reports produced by the project are available in the UNT Digital Library.