Reverse geo-tagging included; duplicates removed

doi:10.5281/zenodo.11661

Published September 10, 2014 | Version V1.0

Software Open

George Fisher¹

1. George Fisher Advisors LLC

All of the tweets for this project have been processed and consolidated into a single file that can be downloaded with this link:

https://s3-us-west-2.amazonaws.com/healthcare-twitter-analysis/HTA_noduplicates.gz
1.85 GB zipped / 15.80 GB unzipped
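
If disk space is tight, the compressed file can probably be read without unpacking it first. The following is a minimal sketch, assuming the download is a plain gzip stream of the line-delimited JSON file (not a tar archive) and that the text is UTF-8 encoded; neither assumption is confirmed by this record, and the helper name `iter_tweets` is made up for illustration.

```python
import gzip
import json


def iter_tweets(path):
    """Yield one decoded tweet (a dict) per non-empty line of a gzip file.

    Assumption: the archive is plain gzip over line-delimited JSON.
    """
    # "rt" opens the archive in text mode and decompresses on the fly,
    # so the 15.80 GB unzipped file never has to exist on disk.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for row in f:
            row = row.strip()
            if row:  # skip blank lines
                yield json.loads(row)
```

A call such as `for tweet in iter_tweets("HTA_noduplicates.gz"): ...` would then process one tweet at a time while keeping memory use roughly constant.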

Each of the 4 million rows in this file is a tweet in JSON format containing the following information:

All the Twitter data in exactly the JSON format of the original
Unix time stamp
All the Topsy data
- originating file name
- score
- author screen name
- URLs

60% of the records have geographic information ...

Latitude & Longitude
Country name & ISO2 country code
City
For country code "US"
- Zipcode
- Telephone area code
- Square miles inside the zipcode
- 2010 Census population of the zipcode
- County & FIPS code
- State name & USPS abbreviation
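
The share of records that carry a given field can be checked with a sketch like the one below. The exact key names used for the geographic fields are not listed on this page, so the field name is passed in as a parameter, and the sketch assumes the field sits at the top level of each JSON row, which may not match the real layout.

```python
import json


def fraction_with_field(rows, field):
    """Return the fraction of JSON rows that have a non-empty `field`.

    `rows` is any iterable of JSON strings (for example an open file).
    Assumption: `field` is a top-level key of each decoded tweet.
    """
    total = 0
    present = 0
    for row in rows:
        row = row.strip()
        if not row:
            continue  # ignore blank lines
        total += 1
        tweet = json.loads(row)
        # Treat missing keys, nulls and empty containers as "absent".
        if tweet.get(field) not in (None, "", [], {}):
            present += 1
    return present / total if total else 0.0
```

For example, `fraction_with_field(open("HTA_noduplicates.json"), "latitude")` would report the share of rows with a latitude, where `"latitude"` is a hypothetical key name that should be replaced by whatever the file actually uses.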

The basic technique for using this file in Python is the following:

    import json

    with open("HTA_noduplicates.json", "r") as f:
        # convert each row in turn into json format and process
        for row in f:
            tweet = json.loads(row)
            text = tweet["text"]  # text of original tweet
            ...  # etc.

Python provides powerful analytical and plotting features, but R is also very handy. R does not work well with a dataset of this size, so Python can be used to create a targeted subset file that R can read (or Excel, or anything else for that matter).
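
One way to build such a subset file is sketched below. This is an illustration rather than the project's own program: the helper name `write_subset` and its `keep` filter are invented here, only top-level fields are handled, and `"text"` is the only field name taken from this page.

```python
import csv
import json


def write_subset(rows, out_path, fields, keep=lambda tweet: True):
    """Write selected top-level fields of matching tweets to a CSV file.

    rows     -- iterable of JSON strings, one tweet per row
    out_path -- CSV file to create
    fields   -- top-level keys to export as columns
    keep     -- predicate deciding whether a tweet is exported
    Returns the number of data rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            row = row.strip()
            if not row:
                continue  # ignore blank lines
            tweet = json.loads(row)
            if keep(tweet):
                # Missing fields become empty cells.
                writer.writerow({name: tweet.get(name, "") for name in fields})
                written += 1
    return written
```

Passing the open data file as `rows`, `["text"]` as `fields`, and a filter such as `lambda t: "flu" in t.get("text", "").lower()` would produce a small CSV that R can load with `read.csv`.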

For long-running jobs, I used Amazon Web Services' EC2 running Ubuntu 14.04, accessed via PuTTY and WebSCP; for local processing, I used a Windows 7 laptop with the data on a terabyte external hard drive.

The Status Report in the main repo contains:

a comprehensive explanation of the dataset
examples of analyses done with this dataset
a list of references to other healthcare-related Twitter analyses
instructions for using Amazon Web Services
sample programs using this file with Python, R and MongoDB.

Files

Files (19.8 MB)

Name	Size
healthcare_twitter_analysis-V1.0.zip (md5:ee75f67fe8831c08b75ac6d1d60c523b)	19.8 MB
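
The MD5 checksum listed for the archive can be used to confirm that a download is intact. A minimal sketch follows, assuming the zip file has been saved locally; the function name `md5_of_file` is illustrative.

```python
import hashlib

# Checksum published for healthcare_twitter_analysis-V1.0.zip on this page.
EXPECTED_MD5 = "ee75f67fe8831c08b75ac6d1d60c523b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file("healthcare_twitter_analysis-V1.0.zip")` with `EXPECTED_MD5` indicates whether the local copy matches the published file.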

Additional details

Is supplement to: https://github.com/grfiv/healthcare_twitter_analysis/tree/V1.0 (URL)

	All versions	This version
Views	265	133
Downloads	21	16
Data volume	452.4 MB	356.3 MB
