Archiving every single tweet on Twitter: Two parallel initiatives

Tweets provide vital snapshots of current events. Though I use Twitter extensively in my citizen journalism as well as private communications, the platform itself does not provide an easy way to search and index tens of thousands of updates. Historical events such as the immediate fall-out of the Presidential Election in Sri Lanka on 26th January, covered in detail on Twitter, become difficult for a researcher or historian to access over time. This is why I wrote up Updates capturing aftermath of presidential elections on Groundviews, though I wish this task were made easier by in-built features on Twitter allowing for easier archival and grouping of updates.

Welcome news in this regard, though not really a solution to the lack of archival tools and services on Twitter, is that both the US Library of Congress and search giant Google will catalogue every single public tweet made since the service launched in 2006. Twitter reported in February 2010 that nearly 50 million tweets are now published every day, or around 600 every second.

According to Gigatweet, there were over 12.3 billion public tweets at the time of writing this post.

Google recently announced that it will provide access to all of Twitter’s public archives, which will allow users to dig through tweets by topic and date. Not to be outdone, the Library of Congress announced a similar initiative.

Matt Raymond, one of the Library’s official bloggers, notes that “important tweets in the past few years include the first-ever tweet from Twitter co-founder Jack Dorsey, President Obama’s tweet about winning the 2008 election, and a set of two tweets from a photojournalist who was arrested in Egypt and then freed because of a series of events set into motion by his use of Twitter.”

As I have covered in detail on this blog (Data loss whether you backup or not…), Ars Technica notes that,

Digital technologies pose a problem for the Library and other archival institutions, though. By making data so easy to generate and then record, they push archives to think hard about their missions and adapt to new technical challenges. While archiving the entire Web and all its changes is simply impossible, the Library of Congress has collected a curated, limited subset of Web content “since it began harvesting congressional and presidential campaign websites in 2000.” Today, it has 167TB of Web data.

Fifty years hence, will these vast repositories of tweets hold any meaning? Will semantic search engines make better sense of them than the tools we have today? Coupled with the likes of Yahoo’s Time Capsule from four years ago and efforts by the Internet Archive, will this archival of information result in more knowledge, or just fill up hard drives with nonsense? More importantly, how can we make sense of this information, for example, to help with peacebuilding and research in conflict transformation?

Update: 29 April 2010

The US Library of Congress has released a helpful FAQ about the nature and extent to which they will archive Twitter.
