Raw data received from data finder as a starting point:
C:\external\SCORE_Fielding-Miller_covid_R3pV

date was long now str8
(57 missing values generated)

Contains data from data-raw\Script\merged_covid_usa_v2.dta
  obs:       340,819
 vars:            16                          4 Oct 2020 20:00
───────────────────────────────────────────────────────────────────────────────
                 storage   display    value
variable name    type      format     label      variable label
───────────────────────────────────────────────────────────────────────────────
date             str8      %9s
date_proper      float     %td..
county           str33     %33s
state            str24     %24s
fips             long      %12.0g
cases            long      %8.0g                 Cases
deaths           int       %8.0g
S1602_C01_001E   double    %10.0g
S2701_C05_011E   double    %10.0g
S2701_C05_012E   double    %10.0g
S1701_C03_001E   double    %10.0g
DP05_0001E       long      %10.0g
DP05_0024PE      double    %10.0g
LABOR            long      %10.0g                Value
A00002_002       float     %9.0g                 Population Density (Per Sq. Mile)
sip_effect       float     %9.0g
───────────────────────────────────────────────────────────────────────────────
Sorted by: fips
     Note: Dataset has changed since last saved.

(248,335 observations deleted)
Cleaning up observations with missing geography
. drop if county == "Unknown"
(1,040 observations deleted)

. mdesc

    Variable    │     Missing          Total     Percent Missing
────────────────┼───────────────────────────────────────────────
           date │           0         91,444           0.00
    date_proper │          57         91,444           0.06
         county │          57         91,444           0.06
          state │          57         91,444           0.06
           fips │          95         91,444           0.10
          cases │          57         91,444           0.06
         deaths │          57         91,444           0.06
   S1602_C~001E │         114         91,444           0.12
   S2701_C~011E │          97         91,444           0.11
   S2701_C~012E │          97         91,444           0.11
   S1701_C~001E │         130         91,444           0.14
     DP05_0001E │          97         91,444           0.11
    DP05_0024PE │          97         91,444           0.11
          LABOR │       2,910         91,444           3.18
     A00002_002 │          97         91,444           0.11
     sip_effect │           0         91,444           0.00
────────────────┼───────────────────────────────────────────────

Records with missing fips codes:

. preserve
. keep if mi(fips)
(91,349 observations deleted)
. fre county

county
──────────────────────┬────────────────────────────────────────────
                      │      Freq.    Percent      Valid       Cum.
──────────────────────┼────────────────────────────────────────────
Valid   Kansas City   │         38      40.00      40.00      40.00
        New York City │         57      60.00      60.00     100.00
        Total         │         95     100.00     100.00
──────────────────────┴────────────────────────────────────────────

. * fre state
. restore

. preserve
. keep if mi(date_proper)
(91,387 observations deleted)
. * fre fips
. distinct fips

       │        Observations
       │      total   distinct
───────┼──────────────────────
  fips │         57         57

. restore

Among them NY boroughs?
. count if inlist(fips, 36005, 36047, 36061, 36081, 36085)
  5

Excluding records for which no fips code is available, and records with no link to the cases/deaths data.
. drop if mi(fips)
(95 observations deleted)
. drop if mi(date_proper)
(57 observations deleted)

The authors specify that the analysis focuses on the 50 states, which most likely means that Northern Mariana Islands and Puerto Rico were excluded. All counties from these territories were excluded.
American Samoa             AS   60
Guam                       GU   66
Northern Mariana Islands   MP   69
Puerto Rico                PR   72
Virgin Islands             VI   78

. drop if state == "Northern Mariana Islands"
(0 observations deleted)
. drop if state == "Puerto Rico"
(0 observations deleted)
.
. * first two could be also achieved with
. * drop if fips >= 60000
. * su fips

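The commented-out alternative above (dropping by fips range) can be sketched on toy data. This is a minimal illustration, not the script's actual code: the five rows and the helper variable state_fips are made up for the example, and it assumes fips is the usual five-digit county code whose leading digits are the state code.

```stata
* Toy data: hypothetical rows, only the fips codes matter here
clear
input long fips str24 state
 1001 "Alabama"
36061 "New York"
56045 "Wyoming"
60010 "American Samoa"
72001 "Puerto Rico"
end

* State-level prefix of the county code (hypothetical helper variable)
gen byte state_fips = floor(fips / 1000)

* The territories listed above have prefixes 60 and higher
drop if state_fips >= 60

list fips state state_fips, noobs
```

Note that a missing fips would also satisfy state_fips >= 60, because Stata treats missing as larger than any number; here records with missing fips were already dropped in the previous step.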
Examples of missing covariates on the date of the analyses (2020-04-26):
. * ta county if mi(S1602_C01_001E) // & date == "2020-04-26"
. * ta county if mi(S1701_C03_001E) // & date == "2020-04-26"
. * ta county if mi(LABOR) // & date == "2020-04-26"
. distinct county if mi(S1602_C01_001E) & date_proper == date("2020-04-26", "YMD")

        │        Observations
        │      total   distinct
────────┼──────────────────────
 county │          1          1

. distinct county if mi(S1701_C03_001E) & date_proper == date("2020-04-26", "YMD")

        │        Observations
        │      total   distinct
────────┼──────────────────────
 county │          1          1

. distinct county if mi(LABOR) & date_proper == date("2020-04-26", "YMD")

        │        Observations
        │      total   distinct
────────┼──────────────────────
 county │         92         92

.
. fre state if mi(LABOR) & date_proper == date("2020-04-26", "YMD")

state
─────────────────────────────┬────────────────────────────────────────────
                             │      Freq.    Percent      Valid       Cum.
─────────────────────────────┼────────────────────────────────────────────
Valid   Alaska               │         14      15.22      15.22      15.22
        Colorado             │          3       3.26       3.26      18.48
        District of Columbia │          1       1.09       1.09      19.57
        Florida              │          3       3.26       3.26      22.83
        Georgia              │          9       9.78       9.78      32.61
        Kentucky             │          3       3.26       3.26      35.87
        Louisiana            │          3       3.26       3.26      39.13
        Maryland             │          1       1.09       1.09      40.22
        Michigan             │          3       3.26       3.26      43.48
        Missouri             │          1       1.09       1.09      44.57
        Nevada               │          2       2.17       2.17      46.74
        New Jersey           │          3       3.26       3.26      50.00
        North Carolina       │          1       1.09       1.09      51.09
        Pennsylvania         │          2       2.17       2.17      53.26
        Texas                │          2       2.17       2.17      55.43
        Virginia             │         34      36.96      36.96      92.39
        West Virginia        │          5       5.43       5.43      97.83
        Wisconsin            │          2       2.17       2.17     100.00
        Total                │         92     100.00     100.00
─────────────────────────────┴────────────────────────────────────────────

LABOR is particularly affected.
No reasonable response from data finder.
These records will be excluded from the analyses, since dropping observations with missing values is Stata's default behaviour and the paper does not describe how missingness was handled.
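The default behaviour referred to above is casewise (listwise) deletion: estimation commands use only observations that are complete on all model variables. A minimal sketch on made-up data, with regress standing in for whatever model the paper used (an assumption; the variable names y, x1, x2 are hypothetical):

```stata
* Toy data: 7 rows, 2 of them with a missing covariate
clear
input y x1 x2
1 2 3
2 1 .
3 4 1
4 3 5
5 6 2
6 5 .
7 8 4
end

* No missing-data option is given, so incomplete rows are silently dropped
regress y x1 x2

* Number of observations actually used in the estimation
display e(N)
```

Comparing e(N) with the number of rows in the data is a quick way to see how many counties are lost to missing LABOR or other covariates in a given model.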
Five of New York's counties were aggregated for counting cases/deaths in the paper, most likely following the data format used by the NYT.
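Counties/boroughs and their codes are: Bronx (36005), Kings/Brooklyn (36047), New York/Manhattan (36061), Queens (36081), Richmond/Staten Island (36085).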
These records were excluded at an earlier stage, since no information on cases was found for them and no fix for the fips code was made.
Explanation needed from data finder here?
It might be lucky that we only test the rural counties?
Not sure what the impact on the spatial setup is. Were these counties aggregated in the shapefiles to form one NY unit?
Already in the dataset
. ren S1602_C01_001E nonenglish
. order nonenglish, a(deaths)
. la var nonenglish "% nonenglish speaking hh"
. univar nonenglish, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
nonenglish 91273      2.0      2.9      0.0      0.4      1.0      2.3     38.6
───────────────────────────────────────────────────────────────────────────────

Already in the dataset
. gen farmwork = (LABOR / DP05_0001E) * 100
(2,805 missing values generated)
. order farmwork, a(nonenglish)
. la var farmwork "% engaged in farmwork"
. univar farmwork, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
farmwork   88487      2.1      3.1      0.0      0.5      1.2      2.5     45.9
───────────────────────────────────────────────────────────────────────────────

. drop LABOR DP05_0001E

Already in the dataset, but has to be constructed from two vars
. gen uninsured = S2701_C05_011E + S2701_C05_012E
. order uninsured, a(farmwork)
. la var uninsured "% uninsured under 65"
. univar uninsured, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
uninsured  91292     19.9     10.0      3.2     12.5     18.1     25.0     90.0
───────────────────────────────────────────────────────────────────────────────

. drop S2701_C05_011E S2701_C05_012E

Already in the dataset
. ren S1701_C03_001E poverty
. order poverty, a(uninsured)
. la var poverty "% below poverty line"
. univar poverty, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
poverty    91259     15.4      6.2      2.3     10.9     14.7     18.9     55.1
───────────────────────────────────────────────────────────────────────────────

Already in the dataset
. ren DP05_0024PE older
. order older, a(poverty)
. la var older "% aged 65 and above"
. univar older, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
older      91292     17.2      4.3      3.7     14.4     16.8     19.3     54.2
───────────────────────────────────────────────────────────────────────────────

Already in the dataset
. ren A00002_002 pop_dens
. order pop_dens, a(older)
. la var pop_dens "Pop density"
. univar pop_dens, d(1)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
pop_dens   91292    330.7   1101.5      0.0     29.4     68.3    206.1  18565.5
───────────────────────────────────────────────────────────────────────────────

Defined from pop density
. gen nonurban = pop_dens <= 1000
. order nonurban, a(date_proper)
. la var nonurban "Non-urban counties flag"
. la de nonurban 0 "urban" 1 "non-urban"
. la val nonurban nonurban
. fre nonurban if date_proper == date("2020-04-26", "YMD")

nonurban -- Non-urban counties flag
────────────────────┬────────────────────────────────────────────
                    │      Freq.    Percent      Valid       Cum.
────────────────────┼────────────────────────────────────────────
Valid   0 urban     │        142       5.09       5.09       5.09
        1 non-urban │       2649      94.91      94.91     100.00
        Total       │       2791     100.00     100.00
────────────────────┴────────────────────────────────────────────

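One thing worth checking with a flag defined this way: Stata treats missing as larger than any number, so pop_dens <= 1000 evaluates to 0 ("urban") when pop_dens is missing. In this dataset pop_dens has n = 91292, the same count as deaths, so the flag above should not be affected. The sketch below, on made-up values, only illustrates the difference; nonurban_raw is a hypothetical name used for the comparison.

```stata
* Toy data: one missing density
clear
input float pop_dens
  50
 999
1000
2500
   .
end

* Unguarded: the missing row gets 0, i.e. it is classified as urban
gen byte nonurban_raw = pop_dens <= 1000

* Guarded: the missing row stays missing
gen byte nonurban = pop_dens <= 1000 if !missing(pop_dens)

list, noobs
```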
The data contain no records for dates on which a county had 0 cases, so the first date in the database for each county is the date of its first case.
. sort fips date_proper, stable
. by fips: egen date_case1 = min(date_proper)
. format date_case1 %tdCCYY-NN-DD
. gen time_case1 = date_proper - date_case1
. *drop date_case1
. la var date_case1 "Date of case 1"
. la var time_case1 "Days since case 1 in county"

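A small worked example of the same construction, on made-up rows for two hypothetical counties (the dates and case counts are invented; variable names follow the script above):

```stata
* Toy data: each county's first row is its first reported case
clear
input long fips str10 date long cases
1001 "2020-03-24" 1
1001 "2020-03-25" 4
1001 "2020-03-26" 6
1003 "2020-03-14" 1
1003 "2020-03-15" 1
end

* Convert the string date to a Stata daily date
gen date_proper = date(date, "YMD")
format date_proper %tdCCYY-NN-DD

* Earliest date per county, then days elapsed since that date
bysort fips: egen date_case1 = min(date_proper)
format date_case1 %tdCCYY-NN-DD
gen time_case1 = date_proper - date_case1

list, sepby(fips) noobs
```

On the first row of each county time_case1 is 0, and it increases by one per calendar day afterwards.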
Dates of SIP for each state were provided by the data finder.
Five states had it imputed to 0, as specified by the paper.
There was one inconsistency between the paper and the data finder; values from the preprint are used (Wyoming).
    Variable │        Obs        Mean    Std. Dev.         Min         Max
─────────────┼────────────────────────────────────────────────────────────
  sip_effect │     82,264  2020-03-28    4.901681  2020-03-19  2020-04-06

. gen temp = date_proper if cases >= 100
(78,586 missing values generated)
. sort fips date_proper, stable
. by fips: egen date_case100 = min(temp)
(63,598 missing values generated)
. drop temp
. format date_case100 %tdCCYY-NN-DD
.
. gen time_case100 = date_case100 - sip_effect
(64,694 missing values generated)
. *drop date_case100
.
. * this states are 0 as per preprint specs
. replace time_case100 = 0 if inlist(state, "Arkansas", "Iowa", "Nebraska", "North Dakota", "Wyoming")
(7,821 real changes made)
.
. * this state is not mentioned in the preprint but should be 0 according to data finder
. replace time_case100 = 0 if inlist(state, "South Dakota")
(1,207 real changes made)
.
. la var date_case100 "Date of case 100"
. la var time_case100 "Days between case 100 in county and SIP in state"

There is a problem with generating this variable for counties that had not reached 100 cases by the time the analyses were conducted.
The preprint is silent on what was done in such cases. The only two reasonable strategies are to run the analyses without counties with missing information (which would drastically reduce the sample size) or to impute the value to 0. The latter has been applied here.
. distinct fips if mi(time_case100)

       │        Observations
       │      total   distinct
───────┼──────────────────────
  fips │      55666       1860

.
. replace time_case100 = 0 if mi(time_case100) & mi(date_case100)
(55,666 real changes made)

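The construction and the imputation can be traced on toy data. This is a sketch under assumptions: the rows are made up, the string variable sip is a stand-in for the state SIP date, and the imputed flag is a hypothetical addition (not in the script) that keeps the imputed counties identifiable for sensitivity checks.

```stata
* Toy data: county 1001 reaches 100 cases, county 1003 never does
clear
input long fips str10 date long cases str10 sip
1001 "2020-04-01"  80 "2020-03-28"
1001 "2020-04-02" 105 "2020-03-28"
1001 "2020-04-03" 130 "2020-03-28"
1003 "2020-04-01"  10 "2020-03-28"
1003 "2020-04-02"  12 "2020-03-28"
end

gen date_proper = date(date, "YMD")
gen sip_effect  = date(sip, "YMD")
format date_proper sip_effect %tdCCYY-NN-DD

* Keep the date only on days at or above the threshold; the missing()
* guard matters because a missing count would also satisfy cases >= 100
gen temp = date_proper if cases >= 100 & !missing(cases)
bysort fips: egen date_case100 = min(temp)
drop temp
format date_case100 %tdCCYY-NN-DD

* Days between reaching 100 cases and the SIP date
gen time_case100 = date_case100 - sip_effect

* Flag counties that never reached the threshold, then impute 0
gen byte imputed = missing(time_case100) & missing(date_case100)
replace time_case100 = 0 if imputed

list fips date cases date_case100 time_case100 imputed, sepby(fips) noobs
```

For county 1001 the difference is 5 days (2020-04-02 minus 2020-03-28); county 1003 gets 0 with imputed equal to 1, which makes it indistinguishable from a county that reached 100 cases on the SIP date unless the flag is kept.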
Deaths reported by NY Times
. la var deaths "Deaths"
. univar deaths, d(0)

                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
deaths     91292        5       44        0        0        0        1     1962
───────────────────────────────────────────────────────────────────────────────

. univar deaths, d(0) by(nonurban)

-> nonurban=urban
                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
deaths      6678       50      152        0        0        3       25     1962
───────────────────────────────────────────────────────────────────────────────

-> nonurban=non-urban
                                        ────────────── Quantiles ──────────────
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
───────────────────────────────────────────────────────────────────────────────
deaths     84614        2        9        0        0        0        1      327
───────────────────────────────────────────────────────────────────────────────

SCORE recommendation regarding the time frame of the data:
For this replication, SCORE recommends three analyses be performed: one analysis that only uses the dates that have occurred since the original analysis, one analysis that combines all available dates, and a third analysis that only uses dates that were used in the original analysis.
The first and second analyses are the same in this case: both would use data for the latest available date.
This script generates data for the date matching the preprint analyses, which is 2020-04-26.
(88,501 observations deleted)

nonurban -- Non-urban counties flag
────────────────────┬────────────────────────────────────────────
                    │      Freq.    Percent      Valid       Cum.
────────────────────┼────────────────────────────────────────────
Valid   0 urban     │        142       5.09       5.09       5.09
        1 non-urban │       2649      94.91      94.91     100.00
        Total       │       2791     100.00     100.00
────────────────────┴────────────────────────────────────────────

file data\merged_covid_usa_prepared_original.dta saved

Contains data from data\merged_covid_usa_prepared_original.dta
  obs:         2,791
 vars:            18                          23 Nov 2020 11:22
───────────────────────────────────────────────────────────────────────────────
                 storage   display    value
variable name    type      format     label      variable label
───────────────────────────────────────────────────────────────────────────────
date             str8      %9s
nonurban         byte      %9.0g      nonurban   Non-urban counties flag
county           str33     %33s
state            str20     %20s
fips             long      %12.0g
cases            long      %8.0g                 Cases
deaths           int       %8.0g                 Deaths
nonenglish       double    %10.0g                % nonenglish speaking hh
farmwork         float     %9.0g                 % engaged in farmwork
uninsured        float     %9.0g                 % uninsured under 65
poverty          double    %10.0g                % below poverty line
older            double    %10.0g                % aged 65 and above
pop_dens         float     %9.0g                 Pop density
date_case1       int       %td..                 Date of case 1
time_case1       byte      %9.0g                 Days since case 1 in county
sip_effect       int       %td..                 SIP effect start date
date_case100     int       %td..                 Date of case 100
time_case100     byte      %9.0g                 Days between case 100 in county and SIP in state
───────────────────────────────────────────────────────────────────────────────
Sorted by: fips

Spatially contiguous sample of counties, prepared with the 01_spatial-sample.Rmd script.
(1 var, 157 obs)
file data\fips_sample.dta saved

─────────────────────┬─────────────────────────────────────────────────────────
         merge specs │
       matching type │ 1:1
  mv's on match vars │ none
  unmatched obs from │ both
─────────────────────┼─────────────────────────────────────────────────────────
         master file │ data\merged_covid_usa_prepared_original.dta
                 obs │   2791
                vars │     18
          match vars │ fips  (key)
─────────────────────┼─────────────────────────────────────────────────────────
          using file │ data\fips_sample.dta
                 obs │    157
                vars │      1
          match vars │ fips  (key)
─────────────────────┼─────────────────────────────────────────────────────────
         result file │ data\merged_covid_usa_prepared_original.dta
                 obs │   2823
                vars │     20  (including _merge)
─────────────────────┼─────────────────────────────────────────────────────────
              _merge │   2666  obs only in master data                (code==1)
                     │     32  obs only in using data                 (code==2)
                     │    125  obs both in master and using data      (code==3)
─────────────────────┴─────────────────────────────────────────────────────────
(2,698 observations deleted)
file data\merged_covid_usa_prepared_original_sample.dta saved

Contains data from data\merged_covid_usa_prepared_original_sample.dta
  obs:           125
 vars:            18                          23 Nov 2020 11:22
───────────────────────────────────────────────────────────────────────────────
                 storage   display    value
variable name    type      format     label      variable label
───────────────────────────────────────────────────────────────────────────────
date             str8      %9s
nonurban         byte      %9.0g      nonurban   Non-urban counties flag
county           str33     %33s
state            str20     %20s
fips             long      %12.0g
cases            long      %8.0g                 Cases
deaths           int       %8.0g                 Deaths
nonenglish       double    %10.0g                % nonenglish speaking hh
farmwork         float     %9.0g                 % engaged in farmwork
uninsured        float     %9.0g                 % uninsured under 65
poverty          double    %10.0g                % below poverty line
older            double    %10.0g                % aged 65 and above
pop_dens         float     %9.0g                 Pop density
date_case1       int       %td..                 Date of case 1
time_case1       byte      %9.0g                 Days since case 1 in county
sip_effect       int       %td..                 SIP effect start date
date_case100     int       %td..                 Date of case 100
time_case100     byte      %9.0g                 Days between case 100 in county and SIP in state
───────────────────────────────────────────────────────────────────────────────
Sorted by:

file data\fips_original.csv saved
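The sample restriction reported in the merge table above (157 sampled fips codes, of which 125 match the prepared data and 32 do not) can be sketched with Stata's built-in merge command. This is an illustration on made-up data, not necessarily the command used in the script; the tempfile name and all values are hypothetical. Run it as one do-file so the local macro for the temporary file stays in scope.

```stata
* Toy "sample" file: list of fips codes to keep (9999 has no data)
clear
input long fips
1001
1003
9999
end
tempfile sample
save `sample'

* Toy "prepared" data: one row per county
clear
input long fips int deaths
1001 3
1003 0
1005 1
end

* One-to-one merge on the county code
merge 1:1 fips using `sample'

* 1 = only in prepared data, 2 = only in sample list, 3 = in both
tabulate _merge

* Keep counties present in both files
keep if _merge == 3
drop _merge

list, noobs
```

Sampled codes that fall into the "only in using" group are counties that were selected spatially but have no row in the prepared data for the analysis date, so they are worth listing before being dropped.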