Twitter historical dataset: March 21, 2006 (first tweet) to July 31, 2009 (3 years, 1.5 billion tweets)

doi:10.5281/zenodo.3833782

Published November 24, 2016 | Version v1

Dataset Open

Gayo-Avello, Daniel¹

1. University of Oviedo

Contact person:

Daniel Gayo-Avello¹

Data collectors:

Data curator:

Daniel Gayo-Avello¹

1. University of Oviedo

Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science of the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.

The dataset contains the tweet IDs of all tweets (in any language) published between March 21, 2006 and July 31, 2009, thus covering the first three whole years of Twitter since its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).

It covers several defining moments in Twitter's history, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US presidential election, Obama's first inauguration speech and the 2009 Iranian election protests (one of the so-called Twitter Revolutions).

Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.

The dataset was completed in November 2016, so the tweet IDs it contains were publicly available at that time. This means that some tweets that were public during the covered period may not appear in the dataset, and also that a substantial part of the tweets in the dataset have been deleted (or locked) since 2016.

To make the decay of tweet IDs in the dataset easier to understand, a number of representative samples (99% confidence level and a confidence interval of ±0.5 percentage points) are provided.
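
The sample sizes implied by those parameters can be estimated with a short calculation. The sketch below is an assumption, not the procedure documented for this dataset: it uses Cochran's formula with a finite population correction and the most conservative proportion (0.5). The function name `sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.005, proportion=0.5):
    """Cochran's sample size with finite population correction.

    margin is the half-width of the interval as a fraction
    (0.005 means 0.5 percentage points). proportion=0.5 is the
    most conservative assumption about the true ratio.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z * z * proportion * (1 - proportion) / (margin * margin)
    # Shrink the sample when the population itself is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Population sizes taken from the description: the whole dataset
# (about 1.5 billion tweets) and the first 90-day interval (5,512 tweets).
whole_dataset = sample_size(1_500_000_000)
first_interval = sample_size(5_512)
print(whole_dataset, first_interval)
```

Under these assumptions the whole dataset calls for a sample of roughly 66,000 IDs, while the earliest 90-day interval needs most of its 5,512 tweets, because the correction term dominates for small populations.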

Overall, 85.5% ±0.5 of the historical tweets were available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly across the three-year period covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).

In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:

March 21, 2006 to June 18, 2006: 88.4% ±0.5 (from 5,512 tweets).
June 18, 2006 to September 16, 2006: 82.7% ±0.5 (from 14,820 tweets).
September 16, 2006 to December 15, 2006: 85.7% ±0.5 (from 107,975 tweets).
December 15, 2006 to March 15, 2007: 88.2% ±0.5 (from 852,463 tweets).
March 15, 2007 to June 13, 2007: 89.6% ±0.5 (from 6,341,665 tweets).
June 13, 2007 to September 11, 2007: 88.6% ±0.5 (from 11,171,090 tweets).
September 11, 2007 to December 10, 2007: 87.9% ±0.5 (from 15,545,532 tweets).
December 10, 2007 to March 9, 2008: 89.0% ±0.5 (from 23,164,663 tweets).
March 9, 2008 to June 7, 2008: 66.5% ±0.5 (from 56,416,772 tweets; see below for more details on this).
June 7, 2008 to September 5, 2008: 78.3% ±0.5 (from 62,868,189 tweets; see below for more details on this).
September 5, 2008 to December 4, 2008: 87.3% ±0.5 (from 89,947,498 tweets).
December 4, 2008 to March 4, 2009: 86.9% ±0.5 (from 169,762,425 tweets).
March 4, 2009 to June 2, 2009: 86.4% ±0.5 (from 474,581,170 tweets).
June 2, 2009 to July 31, 2009: 85.7% ±0.5 (from 589,116,341 tweets).
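
As a consistency check on the figures above, the per-interval counts and percentages can be combined into a total and a count-weighted average. This is only an arithmetic sketch: the values are copied from the list, the per-interval percentages are themselves sample estimates, and the names `INTERVALS`, `total_tweets` and `weighted_availability` are hypothetical.

```python
# (tweets in the interval, percentage still public on May 19, 2020),
# copied in order from the list of 90-day intervals.
INTERVALS = [
    (5_512, 88.4),
    (14_820, 82.7),
    (107_975, 85.7),
    (852_463, 88.2),
    (6_341_665, 89.6),
    (11_171_090, 88.6),
    (15_545_532, 87.9),
    (23_164_663, 89.0),
    (56_416_772, 66.5),
    (62_868_189, 78.3),
    (89_947_498, 87.3),
    (169_762_425, 86.9),
    (474_581_170, 86.4),
    (589_116_341, 85.7),
]


def total_tweets(intervals):
    """Sum the tweet counts of all intervals."""
    return sum(count for count, _ in intervals)


def weighted_availability(intervals):
    """Average the percentages, weighting each interval by its tweet count."""
    total = total_tweets(intervals)
    return sum(count * pct for count, pct in intervals) / total


total = total_tweets(INTERVALS)
overall = weighted_availability(INTERVALS)
print(f"{total:,} tweets, about {overall:.1f}% still public")
```

The counts add up to roughly 1.5 billion tweets, and the weighted figure comes out at about 85%, broadly in line with the 85.5% ±0.5 estimated from the whole-dataset sample. The last two intervals dominate the average because they hold more than two thirds of all tweets.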

The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.

When cleaning the data to publish this dataset, there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all those that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs, the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.

In other words, what you see in that period (April to July 2008) is not a huge number of deleted tweets but the combination of deleted *and* non-public tweets (whose IDs should not be in the dataset, for performance reasons when rehydrating it).

Additionally, given that not everybody will need the whole period, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
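
A minimal sketch of how such a date index could be used to select only the IDs for a range of dates follows. It rests on two assumptions that are not confirmed by this description: that date-tweet-id.tsv has two tab-separated columns (an ISO date such as 2008-11-04, then the earliest tweet ID), and that tweet IDs of this era grow with time. All function names are hypothetical, and the sample rows are placeholders rather than real values from the file.

```python
import bisect
import csv
import io
from datetime import date


def load_date_index(tsv_file):
    """Read (date, earliest tweet ID) pairs from an open TSV file, sorted by date."""
    index = []
    for record in csv.reader(tsv_file, delimiter="\t"):
        try:
            index.append((date.fromisoformat(record[0].strip()), int(record[1])))
        except (IndexError, ValueError):
            continue  # skip blank lines, a header row, or malformed rows
    index.sort()
    return index


def id_bounds(index, start, end):
    """Return (first_id, next_id) covering the dates start..end inclusive.

    first_id is the earliest ID on the first indexed date >= start.
    next_id is the earliest ID on the first indexed date after end,
    or None when the index ends before that.
    """
    dates = [day for day, _ in index]
    lo = bisect.bisect_left(dates, start)
    hi = bisect.bisect_right(dates, end)
    if lo >= hi:
        raise ValueError("no indexed dates in the requested range")
    next_id = index[hi][1] if hi < len(index) else None
    return index[lo][1], next_id


def ids_in_bounds(tweet_ids, first_id, next_id):
    """Yield only the IDs that fall inside [first_id, next_id)."""
    for tweet_id in tweet_ids:
        if tweet_id >= first_id and (next_id is None or tweet_id < next_id):
            yield tweet_id


# Placeholder rows for illustration only; these are NOT real values from the file.
SAMPLE_TSV = "2008-11-03\t100\n2008-11-04\t200\n2008-11-05\t300\n2008-11-06\t400\n"
index = load_date_index(io.StringIO(SAMPLE_TSV))
first_id, next_id = id_bounds(index, date(2008, 11, 4), date(2008, 11, 5))
selected = list(ids_in_bounds([150, 200, 250, 399, 400], first_id, next_id))
```

With the real file, the index would be loaded with something like `open("date-tweet-id.tsv", newline="")` instead of the in-memory string, and the ID stream would come from the unzipped dataset. If the actual column layout or date format differs, only `load_date_index` needs to change.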

For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).

If you use this dataset in any way please cite that preprint (in addition to the dataset itself).

If you need to contact me, you can find me as @PFCdgayo on Twitter.

Files

Files (3.1 GB)

Name	MD5	Size
90-day-samples.zip	781f9adf8a3f1cc468b2e78d5a316604	1.9 MB
date-tweet-id.tsv	7f401a55d04bf71ee7f0290555911558	25.8 kB
generate-ids.m	6a0d01d05ab911a77c8e48b697d2b083	2.1 kB
Twitter-historical-20060321-20090731-sample.txt	f69d55938220d5c6de7a8e771fde987f	721.2 kB
Twitter-historical-20060321-20090731.zip	e7b847847380a236107c0e982b68b77b	3.1 GB
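
The MD5 checksums listed with the files can be used to confirm that a download is intact, which is worth doing for the 3.1 GB archive. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks so that the large archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the value published for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, assuming generate-ids.m has been downloaded to the working directory, `matches_checksum("generate-ids.m", "6a0d01d05ab911a77c8e48b697d2b083")` should return `True` for an uncorrupted copy.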

Additional details

Related works

Is supplement to: Preprint: https://arxiv.org/abs/1611.08144

References

Allen, Erin. "Update on the Twitter Archive at the Library of Congress." Library of Congress Blog. Vol. 4. 2013.
Bruns, Axel, and Katrin Weller. "Twitter as a first draft of the present: and the challenges of preserving it for the future." Proceedings of the 8th ACM Conference on Web Science. 2016.
King, Ryan. "Announcing snowflake." Twitter Engineering Blog (2010).
McCreadie, Richard, et al. "On building a reusable Twitter corpus." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. 2012.
McGill, Andrew. "Can Twitter fit inside the Library of Congress?" The Atlantic 4 (2016).
Raymond, Matt. "How tweet it is!: Library acquires entire Twitter archive." Library of Congress blog. Vol. 14. 2010.
Rogers, Richard. "Debanalizing Twitter: The transformation of an object of study." Proceedings of the 5th Annual ACM Web Science Conference. 2013.
Scola, Nancy. "Library of Congress' Twitter archive is a huge #FAIL." Politico, available at: http://www.politico.com/story/2015/07/library-of-congress-twitter-archive-119698.html (accessed 9 October 2015). 2015.
Zhuang, Yi. "Building a complete tweet index." Twitter Blogs. November 18 (2014).
Zimmer, Michael. "The Twitter Archive at the Library of Congress: Challenges for information practice and information policy." First Monday (2015).

	All versions	This version
Views	1,789	1,780
Downloads	614	609
Data volume	942.3 GB	932.9 GB
