The Library Of Congress Is A Training Data Playground For AI Companies

Forbes

With archives hosting about 180 million works, the world’s largest library is drawing interest from AI startups looking to train their large language models on content that won’t get them sued.

Read the full story on Forbes: https://www.forbes.com/sites/rashishrivastava/2024/09/17/the-library-of-congress-is-a-training-data-playground-for-ai-companies/

Transcript

00:00Today on Forbes, the Library of Congress is a training data playground for AI companies.

00:07Black and white portraits of Rosa Parks, letters penned by Thomas Jefferson,

00:12and the Giant Bible of Mainz, a 15th-century manuscript known to be one of the last handwritten Bibles in Europe.

00:19These are among the 180 million items, including books, manuscripts, maps, and audio recordings,

00:25housed within the Library of Congress.

00:29Every year, hundreds of thousands of visitors walk through the Library's high-ceilinged, pillared halls,

00:34passing beneath Renaissance-style domes embellished with murals and mosaics.

00:39But of late, the more than 200-year-old Library has attracted a new type of patron,

00:44AI companies that are eager to access the Library's digital archives

00:49and the 185 petabytes of data stored within them, to develop and train their most advanced AI models.

00:57For reference, one petabyte is equal to 1,000 terabytes, or one million gigabytes.
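To put the 185-petabyte figure in more familiar units, here is a minimal arithmetic sketch. It assumes the decimal (SI) convention the narration uses, where 1 petabyte is 1,000 terabytes; the function names are illustrative only.

```python
# Decimal (SI) storage units, matching the conversion given in the narration.
TERABYTES_PER_PETABYTE = 1_000
GIGABYTES_PER_TERABYTE = 1_000


def petabytes_to_terabytes(petabytes: int) -> int:
    """Convert petabytes to terabytes using decimal units."""
    return petabytes * TERABYTES_PER_PETABYTE


def petabytes_to_gigabytes(petabytes: int) -> int:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes_to_terabytes(petabytes) * GIGABYTES_PER_TERABYTE


# The archive size quoted in the video.
archive_petabytes = 185
print(petabytes_to_terabytes(archive_petabytes))  # 185000 terabytes
print(petabytes_to_gigabytes(archive_petabytes))  # 185000000 gigabytes
```

In other words, 185 petabytes is 185,000 terabytes, or 185 million gigabytes.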

01:04Judith Conklin, Chief Information Officer at the Library of Congress, told Forbes,

01:09"...we know that we have a large amount of digital material that large language model companies are very interested in.

01:15It's extraordinarily popular."

01:18The upsurge in interest in the Library's data is also reflected in the numbers.

01:23The Congress.gov website, which is managed by the Library of Congress and hosts data about bills, statutes, and laws,

01:30gets between 20 million and 40 million monthly hits on its API,

01:35an interface that allows programmers to download the Library's data in a machine-readable format.

01:41Conklin said the traffic to the Congress.gov API has consistently grown since it became available in September 2022.

01:49The Library's API now gets about a million visits every month.
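As a rough illustration of what "downloading data in a machine-readable format" can look like, the sketch below builds a search request against the Library's public JSON interface. This is not taken from the video: the endpoint (`https://www.loc.gov/search/`), the `fo=json` format switch, and the `c` and `sp` paging parameters are assumptions based on the publicly documented loc.gov API and should be checked against the current documentation. The helper names and the pause length are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed base URL for the loc.gov search endpoint (verify against the docs).
LOC_SEARCH_URL = "https://www.loc.gov/search/"


def build_search_url(query: str, page: int = 1, per_page: int = 25) -> str:
    """Build a search URL that asks for JSON instead of an HTML page."""
    params = {
        "q": query,     # free-text search terms
        "fo": "json",   # assumed parameter: request JSON output
        "c": per_page,  # assumed parameter: results per page
        "sp": page,     # assumed parameter: page number
    }
    return LOC_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url: str, pause_seconds: float = 2.0) -> dict:
    """Fetch one page of results, pausing first to keep request rates low."""
    time.sleep(pause_seconds)  # hypothetical politeness delay, not an official limit
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access.
print(build_search_url("rosa parks"))
# A real request would then be: results = fetch_json(build_search_url("rosa parks"))
```

Going through a documented interface like this, rather than crawling the HTML pages, is the access route the Library asks for later in the video.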

01:54The Library's digital archives host an abundance of rare, original, and authoritative information.

02:00It's also diverse. The collections feature content in more than 400 languages, spanning art, music, and most disciplines.

02:07But what makes this data especially appealing to AI developers is that these works are in the public domain,

02:13and not copyrighted or otherwise restricted.

02:16While a growing group of artists and organizations are locking up their data to prevent AI companies from scraping it,

02:22the Library of Congress has made its data reserves freely available to anyone who wants them.

02:28For AI companies that have already mined the entirety of the Internet,

02:31scraping everything from YouTube videos to copyrighted books, to train their models,

02:36the Library is one of the few remaining free resources.

02:40Otherwise, they must strike licensing deals with publishers or use AI-generated, so-called synthetic data,

02:46which can be problematic, leading to degraded responses from the model.

02:51The only caveat? People who want access to the Library's data must collect it via the API,

02:57a portal through which anyone, from a genealogist to an AI researcher, can download data.

03:03But they are prohibited from scraping content directly from the site,

03:06a common practice among AI companies and one that Conklin said has become a real, quote,

03:11hurdle for the Library because it slows public access to its archives.

03:16She said, quote,

03:29The hunt for data is just one part of the story.

03:32Companies like OpenAI, Amazon, and Microsoft are also courting the world's largest library as a customer.

03:39They claim AI models can help librarians and subject matter specialists

03:43with tasks like navigating catalogs, searching records, and summarizing long documents.

03:49This is certainly possible, but there are some rough edges that need to be ironed out first.

03:54Natalie Smith, the Library of Congress's Director of Digital Strategy,

03:58told Forbes that AI models trained on contemporary data sometimes struggle with historical accuracy,

04:05identifying a person holding a book as someone holding a cell phone, for example.

04:10She said, quote,

04:20For full coverage, check out Rashi Shrivastava's piece on Forbes.com.

04:26This is Kieran Meadows from Forbes. Thanks for tuning in.
