Dataset Open Access

URLs from tweets for a 2014 sample of Twitter users and for a set of computer scientists

Robert Jäschke

Data collector(s)
Asmelash, Teka Hadgu

The files in this dataset are used to analyse the tweeting behaviour of computer scientists on Twitter. They comprise

  • a set of 989,529 tweet-URL pairs (tweets_2014_researcher.tsv.bz2) posted in 2014 by 6,271 users from the computer scientist sample at https://zenodo.org/record/12942, each pair specified by time, tweet id, user id, and URL,
  • a set of 300,053,850 tweet ids (tweets_2014_sample.tsv.bz2) from the 1% Twitter stream sample from 2014,
  • a set of 605,080 tweet-URL pairs (tweets_2014_sample_6694_users.tsv.bz2) posted by 6,694 users in the 1% Twitter stream sample from 2014, each pair specified by time, tweet id, user id, and URL,
  • a set of the top 10,000 host names (MAG_hosts_10000.tsv) from the Microsoft Academic Graph data (http://blogs.msdn.com/b/msr_er/archive/2015/06/26/announcing-the-microsoft-academic-graph-let-the-research-begin.aspx), specified by rank, URL count, and host name, and
  • a set of 340 host names of URL shortening services (url_shortening_services.tsv).
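
The tweet-URL files listed above can be read directly from their compressed form. The following is a minimal Python sketch, not the authors' code. It assumes that each line holds four tab-separated fields in the order time, tweet id, user id, URL, that there is no header row, and that the files are UTF-8 encoded; none of this is confirmed here beyond the field list. The function names iter_tweet_urls and count_hosts are hypothetical.

```python
import bz2
import csv
from collections import Counter
from urllib.parse import urlsplit


def iter_tweet_urls(path):
    """Yield (time, tweet_id, user_id, url) tuples from a bz2-compressed TSV file.

    Assumption: four tab-separated columns, no header row, UTF-8 text.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 4:
                continue  # skip lines that do not match the assumed layout
            yield tuple(row)


def count_hosts(rows):
    """Count how often each host name occurs in the URL column."""
    counts = Counter()
    for _time, _tweet_id, _user_id, url in rows:
        try:
            host = urlsplit(url).hostname  # lower-cased host, or None
        except ValueError:
            continue  # malformed URL
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    hosts = count_hosts(iter_tweet_urls("tweets_2014_researcher.tsv.bz2"))
    for host, n in hosts.most_common(10):
        print(n, host, sep="\t")
```

Because the rows are streamed through a generator, the file never has to be decompressed to disk or held in memory as a whole.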

In addition, the following rankings (based on the odds ratio) of domains, hosts, and URLs that appear in both the researcher dataset and the sample are included:

  • domains_by_odds_ratio.tsv.bz2 - a ranking of 61,860 domains,
  • hosts_by_odds_ratio.tsv.bz2 - a ranking of 80,384 hosts,
  • publisher_domains_by_odds_ratio.tsv.bz2 - a ranking of 924 publisher domains,
  • publisher_urls_by_odds_ratio.tsv.bz2 - a ranking of 4,227 publisher URLs.
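
How the odds ratio was computed for these rankings is not spelled out here. As an illustration only, the sketch below uses the textbook definition: the odds of an item (domain, host, or URL) occurring in the researcher dataset divided by the odds of it occurring in the sample. Whether the authors counted tweet-URL pairs, tweets, or users, and whether they applied any smoothing, is an assumption. The function names odds_ratio and rank_by_odds_ratio are hypothetical.

```python
def odds_ratio(count_researcher, total_researcher, count_sample, total_sample):
    """Textbook odds ratio of an item in the researcher data vs. the sample.

    Assumption: counts and totals refer to the same unit (e.g. tweet-URL pairs).
    """
    if not (0 < count_researcher < total_researcher):
        raise ValueError("researcher count must be strictly between 0 and the total")
    if not (0 < count_sample < total_sample):
        raise ValueError("sample count must be strictly between 0 and the total")
    odds_researcher = count_researcher / (total_researcher - count_researcher)
    odds_sample = count_sample / (total_sample - count_sample)
    return odds_researcher / odds_sample


def rank_by_odds_ratio(researcher_counts, sample_counts,
                       total_researcher, total_sample):
    """Rank items that appear in both count mappings, highest odds ratio first."""
    shared = researcher_counts.keys() & sample_counts.keys()
    scored = [
        (item, odds_ratio(researcher_counts[item], total_researcher,
                          sample_counts[item], total_sample))
        for item in shared
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical host counts; the totals are the tweet-URL pair counts
    # stated for the researcher file and the 6,694-user sample file.
    researcher = {"example.org": 1200, "example.com": 300}
    sample = {"example.org": 40, "example.com": 900}
    for host, ratio in rank_by_odds_ratio(researcher, sample, 989_529, 605_080):
        print(f"{ratio:.2f}\t{host}")
```

Restricting the ranking to items present in both mappings mirrors the statement that the rankings cover domains, hosts, and URLs appearing in both datasets, and it also avoids division by zero.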

This is an updated and extended version of 10.5281/zenodo.154583. A new sample of users has been used, resulting in an updated file tweets_2014_sample_6694_users.tsv.bz2. In addition, domain, host, and URL rankings have been added.

Files (2.3 GB)

Name                                      MD5                               Size
domains_by_odds_ratio.tsv.bz2             299e3ec2469d3a91582e592a2fc0aa1e  445.0 kB
hosts_by_odds_ratio.tsv.bz2               bd959f2b67bc50e746a4740d8969f18c  619.7 kB
MAG_hosts_10000.tsv                       bf92fe9d92a45949d44037a81356b82b  298.2 kB
publisher_domains_by_odds_ratio.tsv.bz2   10e489478e9076e76d158c18e95f51bc  8.1 kB
publisher_urls_by_odds_ratio.tsv.bz2      e5f563f85a2ea56fac3b20109e1c2402  84.3 kB
tweets_2014_researcher.tsv.bz2            6c466537064b5a5574734f418893b199  32.0 MB
tweets_2014_sample.tsv.bz2                d0ea5705cb86480a0f22a1c7439533b4  2.3 GB
tweets_2014_sample_6694_users.tsv.bz2     2dff10a6301cb97c53a653a65019199c  12.2 MB
url_shortening_services.tsv               1f040245142c7309b9c46f897f79f7ce  3.0 kB
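
The MD5 checksums listed for the files can be used to check that a download is complete. A minimal Python sketch, assuming the files have been saved under their original names in the current directory; the helper names md5_of_file and verify are hypothetical.

```python
import hashlib

# Checksums as listed for this dataset.
EXPECTED_MD5 = {
    "domains_by_odds_ratio.tsv.bz2": "299e3ec2469d3a91582e592a2fc0aa1e",
    "hosts_by_odds_ratio.tsv.bz2": "bd959f2b67bc50e746a4740d8969f18c",
    "MAG_hosts_10000.tsv": "bf92fe9d92a45949d44037a81356b82b",
    "publisher_domains_by_odds_ratio.tsv.bz2": "10e489478e9076e76d158c18e95f51bc",
    "publisher_urls_by_odds_ratio.tsv.bz2": "e5f563f85a2ea56fac3b20109e1c2402",
    "tweets_2014_researcher.tsv.bz2": "6c466537064b5a5574734f418893b199",
    "tweets_2014_sample.tsv.bz2": "d0ea5705cb86480a0f22a1c7439533b4",
    "tweets_2014_sample_6694_users.tsv.bz2": "2dff10a6301cb97c53a653a65019199c",
    "url_shortening_services.tsv": "1f040245142c7309b9c46f897f79f7ce",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(name, path=None):
    """Compare a downloaded file against its listed checksum."""
    return md5_of_file(path or name) == EXPECTED_MD5[name]


if __name__ == "__main__":
    print(verify("url_shortening_services.tsv"))
```

Reading in chunks keeps memory use constant, which matters for the 2.3 GB tweets_2014_sample.tsv.bz2 file.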
