Disclaimer: This dataset is distributed by Daniel Gayo-Avello, an associate professor at the Department of Computer Science of the University of Oviedo, for the sole purpose of non-commercial research, and it includes only tweet IDs.
The dataset contains tweet IDs for all the tweets (in any language) published between March 21, 2006 and July 31, 2009, thus comprising the first three full years of Twitter since its creation, that is, about 1.5 billion tweets (see file Twitter-historical-20060321-20090731.zip).
It covers several defining developments in Twitter, such as the invention of hashtags, retweets and trending topics, and it includes tweets related to the 2008 US Presidential Election, Obama's first inauguration speech and the 2009 Iran Election protests (one of the so-called Twitter Revolutions).
Finally, it contains tweets in many major languages (mainly English, Portuguese, Japanese, Spanish, German and French), so it should be possible, at least in theory, to analyze international events from different cultural perspectives.
The dataset was completed in November 2016 and, therefore, the tweet IDs it contains were publicly available at that time. This means that some tweets that were public between 2006 and 2009 may not appear in the dataset (because they were no longer available in 2016), and also that a substantial part of the tweets in the dataset have been deleted (or locked) since 2016.
To make it easier to understand the decay of tweet IDs in the dataset, a number of representative samples (99% confidence level and a confidence interval of ±0.5) are provided.
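As a rough check on what such a sample involves, the sketch below applies the standard sample-size formula for estimating a proportion, with a finite population correction. It assumes that the 0.5 confidence interval means a margin of error of ±0.5 percentage points and that the worst case p = 0.5 applies; the function name and its parameters are illustrative, and this is not necessarily how the published samples were drawn.

```python
import math
from statistics import NormalDist
from typing import Optional


def sample_size(confidence: float, margin: float,
                population: Optional[int] = None, p: float = 0.5) -> int:
    """Sample size needed to estimate a proportion.

    confidence: e.g. 0.99 for a 99% confidence level
    margin:     half-width of the interval as a fraction (0.005 = 0.5 points)
    population: total number of items, or None for an infinite population
    p:          expected proportion; 0.5 is the most conservative choice
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z * z) * p * (1 - p) / (margin * margin)
    if population is not None:
        # Finite population correction.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# About 1.5 billion tweets, 99% confidence, +/-0.5 percentage points.
n_full = sample_size(0.99, 0.005, population=1_500_000_000)
print(n_full)
```

Under these assumptions the result is in the order of 66,000 tweet IDs, and it barely changes with the population size, which is why a sample of similar size is also needed for each 90-day interval.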
In general terms, 85.5% ±0.5 of the historical tweets are available as of May 19, 2020 (see file Twitter-historical-20060321-20090731-sample.txt). However, since the number of tweets varies greatly throughout the three-year period covered by the dataset, additional representative samples are provided for 90-day intervals (see the file 90-day-samples.zip).
In that regard, the ratio of publicly available tweets (as of May 19, 2020) is as follows:
The apparent drop in available tweets from March 9, 2008 to September 5, 2008 has an easy, although embarrassing, explanation.
When I was cleaning the data to publish this dataset, there seemed to be a gap between April 1, 2008 and July 7, 2008 (actually, the data was not missing but stored in a different backup). Since tweet IDs are easy to regenerate for that Twitter era (source code is provided in generate-ids.m), I simply produced all the IDs that were created between those two dates. All those tweets actually existed, but a number of them were obviously private and not crawlable. For those regenerated IDs the actual ratio of public tweets (as of May 19, 2020) is 62.3% ±0.5.
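The idea behind regenerating IDs can be sketched as follows. This is not a port of generate-ids.m; it is a minimal illustration that assumes tweet IDs in that era were assigned as consecutive integers, so that every value between the first ID and the last ID of a period is a candidate tweet. The function name and the boundary values in the example are hypothetical.

```python
from typing import Iterator


def regenerate_ids(first_id: int, last_id: int) -> Iterator[int]:
    """Yield every candidate tweet ID from first_id to last_id, inclusive.

    Assumes IDs were handed out sequentially, so no value in the range
    is skipped. Some of the IDs produced will belong to private or
    deleted tweets and will fail to rehydrate.
    """
    if first_id > last_id:
        raise ValueError("first_id must not be greater than last_id")
    # range() is lazy, so very large intervals do not need to fit in memory.
    return iter(range(first_id, last_id + 1))


# Hypothetical boundaries, for illustration only.
candidates = list(regenerate_ids(1000, 1005))
print(candidates)
```

Because the range is produced lazily, the IDs can be written to disk or sent for rehydration in batches instead of being held in memory all at once.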
In other words, what you see in that period (April to July 2008) is not actually a huge number of deleted tweets but the combination of deleted *and* non-public tweets (ideally, the IDs of the latter would not be in the dataset at all, since they only slow down rehydration).
Additionally, given that not everybody will need the whole period of time, the earliest tweet ID for each date is provided in the file date-tweet-id.tsv.
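A possible way to use that file is sketched below: it finds the range of IDs that corresponds to an interval of dates, so that only the relevant part of the dataset needs to be processed. The sketch assumes that each line holds an ISO date (YYYY-MM-DD) and a tweet ID separated by a tab, with no header row; the actual layout of date-tweet-id.tsv may differ. The function names are illustrative and the IDs in the sample text are made up.

```python
import csv
import io
from bisect import bisect_left, bisect_right
from datetime import date
from typing import List, Optional, Tuple

DateIndex = List[Tuple[date, int]]


def load_date_index(tsv_text: str) -> DateIndex:
    """Parse 'date<TAB>earliest_id' lines into a list sorted by date."""
    rows = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        rows.append((date.fromisoformat(row[0]), int(row[1])))
    rows.sort()
    return rows


def id_range(index: DateIndex, start: date, end: date) -> Tuple[int, Optional[int]]:
    """Return (lo, hi) so that IDs with lo <= id < hi fall in [start, end].

    hi is None when the index has no date after 'end', meaning the range
    runs to the end of the dataset.
    """
    dates = [d for d, _ in index]
    i = bisect_left(dates, start)   # first indexed date >= start
    if i == len(dates):
        raise ValueError("start date is after the last indexed date")
    j = bisect_right(dates, end)    # first indexed date > end
    hi = index[j][1] if j < len(index) else None
    return index[i][1], hi


# Made-up sample in the assumed format.
SAMPLE_TSV = "2008-04-01\t1000\n2008-04-02\t1500\n2008-04-03\t2100\n"
index = load_date_index(SAMPLE_TSV)
print(id_range(index, date(2008, 4, 1), date(2008, 4, 2)))
```

With the real file, the text would be read from date-tweet-id.tsv instead of SAMPLE_TSV, and the resulting bounds could be used to filter the IDs in the main archive.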
For additional details regarding this dataset please see: Gayo-Avello, Daniel. "How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself." arXiv preprint arXiv:1611.08144 (2016).
If you use this dataset in any way please cite that preprint (in addition to the dataset itself).
If you need to contact me, you can find me as @PFCdgayo on Twitter.