IMLS 2022

Leveraging Existing Bibliographic Metadata to Improve Automatic Document Identification in Web Archives

The UNT Libraries, in partnership with the University of Illinois Chicago’s Computer Science Department, received Institute of Museum and Library Services (IMLS) support for an applied research grant with the long-term objective of improving access to digital resources housed in web archives. This applied research project will build on findings from a previously funded IMLS research grant (LG-71-17-0202-17), which was a first effort in training machine learning models to help identify high-value documents and publications within web archives. The project seeks to incorporate existing bibliographic metadata related to state government document collections to better train machine learning models and reduce human effort, as the process is still time-consuming and requires highly trained content curators. The award period for this project is August 1, 2022, through July 31, 2024.

Project Personnel

Mark Phillips, Ph.D., serves as Principal Investigator for the project. He has extensive experience in grant-funded projects for digital libraries and web archives as well as in grant-funded research projects. His responsibilities include overall project supervision and budget oversight; editing and submission of required reports and grant documentation; participation in project meetings; drafting project reports; and official communication with IMLS. Phillips is responsible for coordinating the bibliographic metadata dataset building and the acquisition of web archiving data used in the project. He supervises one of the graduate research assistants and coordinates external project communication and outreach.

Cornelia Caragea, Ph.D., serves as Co-Principal Investigator for the project. She has an extensive background in machine learning, deep learning, and natural language processing (NLP), and has worked on numerous externally funded projects in a variety of roles (e.g., PI, Co-PI) at the University of Illinois Chicago and the University of North Texas. Her project responsibilities include developing the machine learning, deep learning, and NLP methodologies used by the project; supervising the Computer Science graduate research assistant; performing data analysis and evaluation; and drafting project reports and publications.

Praneeth Rikka serves as a Graduate Research Assistant at the UNT Libraries on this project and contributes in areas including data collection, metadata mapping, and normalization.

Project Partnership

The Library of Michigan and Archive-It will work with the project team to create datasets containing bibliographic metadata from existing catalog and digital collections, together with web archives of the State of Michigan web domain collected by the Library of Michigan.

Advisory Board

  • Jefferson Bailey (web archives) - Internet Archive, Archive-It
  • Bernadette Bartlett (state publications) - Library of Michigan
  • Martin Klein (web archives, machine learning) - Los Alamos National Lab, Research Library
  • Raymond Mooney (machine learning) - University of Texas at Austin, Computer Science
  • Mark Myers (state publications) - Texas State Library and Archives Commission (TSLAC)
  • Tracy Seneca (web archiving, digital collections) - University of Illinois Chicago, University Libraries
  • Oksana Zavalina (cataloging and metadata) - University of North Texas, College of Information