January 7, 2013
The U.S. Library of Congress is well on its way toward creating an archive of more than 170 billion tweets.
The Library has amassed all public tweets — 21 billion of them — from 2006 to 2010 courtesy of Twitter. The government agency also has secured the 150 billion public tweets posted since then.
The goal is to create a secure, sustainable process for receiving and safeguarding a constant stream of tweets through the present day, and to fashion a structure for organizing the entire archive by date.
“This month, all those objectives will be completed,” said the Library’s director of communications, Gayle Osterberg, in a blog post. “We now have an archive of approximately 170 billion tweets and growing. The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012.”
With roughly 500 million tweets appearing in Twitter’s public feed each day, archiving will be no small feat.
The current focus, Osterberg said, is addressing the considerable technology challenges to make the archive completely accessible to researchers.
“Twitter is a new kind of collection for the Library of Congress but an important one to its mission,” she said. “As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.
“Although the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.”
Although the Library of Congress has not laid out exactly how the ongoing archive will be used, the agency has written a white paper [PDF] that summarizes its work to date and outlines its current progress and challenges.
The white paper indicated the agency has two full copies of the archive of 170 billion tweets. The archive consists of roughly 133 terabytes of data. The white paper also stated that each tweet contains about 50 accompanying metadata fields.
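For a sense of scale, the figures above imply an average stored size per tweet. The sketch below is a back-of-the-envelope calculation, not a figure from the white paper. It assumes decimal terabytes (10^12 bytes), and because the article does not say whether the 133 terabytes covers one copy or both, it shows both readings. The function name `bytes_per_tweet` is made up for illustration.

```python
# Figures quoted in the article.
TWEETS = 170e9          # approximate number of archived tweets
ARCHIVE_BYTES = 133e12  # 133 terabytes, assuming 1 TB = 10**12 bytes


def bytes_per_tweet(total_bytes, tweets, copies=1):
    """Average stored bytes per tweet, given how many copies total_bytes covers."""
    return total_bytes / copies / tweets


# Reading 1 (assumption): 133 TB is the size of a single copy.
single_copy = bytes_per_tweet(ARCHIVE_BYTES, TWEETS, copies=1)

# Reading 2 (assumption): 133 TB is the combined size of both copies.
both_copies = bytes_per_tweet(ARCHIVE_BYTES, TWEETS, copies=2)

print(f"one copy:   about {single_copy:.0f} bytes per tweet")
print(f"two copies: about {both_copies:.0f} bytes per tweet")
```

Under either reading, the average comes to a few hundred bytes per tweet, which is plausible for a short message plus about 50 metadata fields if the data is stored compressed.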