The Library of Congress Is a Training Data Playground for AI Companies

The Library of Congress has become a valuable resource for AI companies because its extensive digital archives are in the public domain and free of copyright restrictions. But while the library offers access through its API, it is contending with data scraping that degrades website performance, and with concerns that AI models misinterpret historical documents.

The Library of Congress, a historic institution housing more than 180 million items, has recently become a significant source of training data for AI companies. Its collections include notable artifacts such as portraits of Rosa Parks, letters from Thomas Jefferson, and the Gutenberg Bible. The library's digital archives, some 185 petabytes of data, are drawing AI developers eager to use that material to improve their models. Judith Conklin, the library's Chief Information Officer, says interest from large language model companies in its digital holdings is growing.

The library's digital resources are especially appealing because they are in the public domain and therefore not subject to copyright restrictions. That sets the library apart from the many artists and organizations now locking down their data to keep AI companies from scraping it. By making its data freely available, the Library of Congress offers a rare alternative for AI developers, who otherwise must license data from publishers or fall back on synthetic data, which can degrade model performance.

The library's API, which lets users download data in a machine-readable format, has seen traffic climb sharply since its launch in September 2022. The library's website now receives between 20 million and 40 million hits a month, roughly a million of which are visits to the API itself. The demand reflects the breadth of the library's collections, which span content in more than 400 languages and a wide range of subjects, making them a valuable resource for AI training.
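For developers who want to try it, most loc.gov pages return JSON when the request includes the fo=json parameter, per the library's public API documentation. The following Python sketch runs a simple search; the sp (page) parameter and the results and title fields follow the documented JSON responses, though exact response shapes vary by endpoint.

```python
import requests

def search_loc(query: str, page: int = 1) -> dict:
    """Query the loc.gov search endpoint.

    Appending fo=json to most loc.gov URLs returns machine-readable
    JSON instead of HTML; sp selects the results page.
    """
    resp = requests.get(
        "https://www.loc.gov/search/",
        params={"q": query, "fo": "json", "sp": page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = search_loc("Rosa Parks")
    # Each search result carries metadata such as a title and an item URL.
    for item in data.get("results", [])[:5]:
        print(item.get("title"), "->", item.get("id"))
```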

Despite the library's openness, there are limits on how the data may be gathered. Users are expected to collect it through the API; scraping the website directly is prohibited. Even so, some AI companies try to scrape data as fast as they can, which degrades the performance of the library's website. Conklin said the practice has become a real hurdle, forcing the library to take measures to protect site performance while still keeping its archives accessible.
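In practice, respecting those constraints mostly means pacing requests. Below is a hedged sketch of polite bulk collection against the same search endpoint, paging with sp and the c (results per page) parameter; the delay and page cap are illustrative courtesy values, not official limits published by the library.

```python
import time
import requests

def collect_results(query: str, max_pages: int = 5, delay_s: float = 3.0) -> list:
    """Page through loc.gov search results without hammering the site.

    delay_s is an illustrative courtesy pause between requests,
    not an official rate limit.
    """
    items = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://www.loc.gov/search/",
            params={"q": query, "fo": "json", "sp": page, "c": 25},
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break  # past the last page of results
        items.extend(results)
        time.sleep(delay_s)  # spread requests out between pages
    return items

print(len(collect_results("Thomas Jefferson letters", max_pages=2)))
```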

The interest runs in both directions: major tech companies such as OpenAI, Amazon, and Microsoft also want to use AI to help librarians and researchers with tasks like navigating the catalog and summarizing documents. But there are concerns about accuracy, since models trained largely on contemporary data can misread historical context. Natalie Smith, the library's director of digital strategy, noted that AI models often skew toward modern concepts, which can produce inaccuracies when they are applied to historical documents.