Project Tycho dataset documentation
Project Tycho version 2.0
Project Tycho data format version 1.1

Documentation
-------------
Last updated: 2022-07-19
Content:
1. Origin of datasets
2. Dataset format information
3. Suggested data management before analysis
4. User license information
5. Data sources
6. Citation
7. Contact information
8. Appendix

-------------
1. ORIGIN OF DATASETS
The Project Tycho Repository for Global Health Data aims to advance the availability and use of data for improving global health. A Project Tycho dataset includes case counts for a disease condition in a country. Data for Project Tycho datasets can come from various sources and have been pre-processed into the standard Project Tycho data format.

Project Tycho datasets contain information from external sources of disease surveillance data, such as the United States Centers for Disease Control and Prevention or the World Health Organization. These datasets include both previously available data and data that were not public before inclusion in Project Tycho datasets. The Project Tycho team has obtained permission to redistribute data that were not previously public from the agencies that collected and owned the data. All pre-compiled Project Tycho datasets contain count data that are identical to the counts published in the original source. The Project Tycho team has not modified any counts, except to aggregate individual case count data into daily counts when that was the best data available for a disease and location (see Section 8. APPENDIX, below, for more information on the affected sources). The Project Tycho team has curated datasets by adding new variables, such as standard identifiers for reported conditions, locations, and pathogens, and by re-representing reported information in a standard data format.

2. DATASET FORMAT INFORMATION
Project Tycho datasets in version 2.0, data format version 1.1, are formatted according to a standard comma-separated file format with 27 variables. This format can be viewed at https://www.tycho.pitt.edu/dataformat_v1.1 and at https://fairsharing.org/bsg-s000718.
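
As a quick sanity check after download, the number of variables in a file can be compared against the 27 stated above. The following is a minimal sketch using only the Python standard library. The function name read_rows and the placeholder column names (var1 to var27) are illustrative assumptions, not the actual Project Tycho variable names; consult the data format specification for those.

```python
import csv
import io

# Number of variables stated in this documentation for data format version 1.1.
EXPECTED_VARIABLES = 27


def read_rows(handle, expected=EXPECTED_VARIABLES):
    """Read a comma-separated dataset and verify the header width.

    `handle` is any open text file object. Raises ValueError when the
    header does not contain the expected number of variables.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    if len(header) != expected:
        raise ValueError(f"expected {expected} variables, found {len(header)}")
    return list(reader)


# In-memory stand-in for a dataset file. The column names are placeholders,
# NOT the real Project Tycho variable names.
placeholder_header = ",".join(f"var{i}" for i in range(1, 28))
placeholder_row = ",".join(str(i) for i in range(1, 28))
rows = read_rows(io.StringIO(placeholder_header + "\n" + placeholder_row + "\n"))
```

With a real file, the same function could be called as read_rows(open(path, newline="", encoding="utf-8")), where path points to the downloaded dataset.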

3. SUGGESTED DATA MANAGEMENT BEFORE ANALYSIS
Project Tycho datasets contain case counts for a specific condition (e.g., measles) and country (e.g., United States) reported for specific time intervals. In addition to case counts, datasets include information about these counts (attributes), such as the location, age group, subpopulation, diagnostic certainty, place of acquisition, and the source from which we obtained case counts. One dataset can include many time series of case count intervals; e.g., for each combination of attributes, such as measles cases for 0-5 years old, with probable diagnosis, domestic acquisition, reported by the US CDC. Depending on the user purpose, we recommend the following data processing steps before analysis:

	A. Analyze missing data: Project Tycho datasets do not include time intervals for which no case count was reported, so count time series in datasets are often incomplete, and users will need to add the time intervals for which no count value is available. Counts for which the location or another attribute is listed as “unknown” are also excluded. Time intervals for which a case count value of zero was reported are included.
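
One possible way to add the absent intervals is sketched below for a fixed-interval weekly series. It is an illustration under stated assumptions, not a prescribed method: the function name fill_missing_intervals, the dictionary layout, and the example dates are hypothetical. Absent intervals are marked with None rather than zero so that they stay distinguishable from reported zeros.

```python
from datetime import date, timedelta


def fill_missing_intervals(observed, first_start, last_start, step_days=7):
    """Return one (start, end, count) tuple per expected interval.

    `observed` maps an interval start date to its reported count.
    Intervals absent from `observed` get a count of None (not zero):
    a missing row means "no count reported", while a reported zero
    stays 0.
    """
    filled = []
    start = first_start
    while start <= last_start:
        end = start + timedelta(days=step_days - 1)
        filled.append((start, end, observed.get(start)))
        start += timedelta(days=step_days)
    return filled


# Hypothetical weekly series: the week starting Jan 8 was never reported,
# while the week starting Jan 15 was reported with zero cases.
observed = {
    date(2020, 1, 1): 10,
    date(2020, 1, 15): 0,
}
series = fill_missing_intervals(observed, date(2020, 1, 1), date(2020, 1, 15))
# series now holds three intervals; the middle one has count None.
```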
	
	B. Separate cumulative from non-cumulative time interval series: Project Tycho case count time series can be in a cumulative or fixed-interval format. 
	B1. Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates. Cumulative case count time series result from case reporting for “all previous weeks” instead of “the most recent week only”. An example of a cumulative case count time series is:
		i. time interval 1: Jan 1-Jan 7: 10 cases
		ii. time interval 2: Jan 1-Jan 14: 15 cases
		iii. time interval 3: Jan 1-Jan 21: 17 cases
		iv. etc.
	B2. Fixed-interval case count time series consist of mutually exclusive time intervals that all start and end on different dates and all have identical length (day, week, month, or year), for example:
		i. time interval 1: Jan 1-Jan 7: 10 cases
		ii. time interval 2: Jan 8-Jan 14: 7 cases
		iii. time interval 3: Jan 15-Jan 21: 3 cases
		iv. etc.
	B3. A note about cumulative count time series: Where the start date of a cumulative case count time series was not available in the source data, the start date was inferred using the following process: 
		(1) if the cumulative start date is specified in the source data’s documentation, that date is used; 
		(2) if fixed-interval case counts are available for the same period, and start from zero or one, the start date of the first fixed-interval count is used for the cumulative; 
		(3) if the cumulative count time series does not start from zero or one, and the cumulative start date for the location in question is available from another source starting from an earlier time period, covers the same area, and has case counts that do start from zero or one, that start date will be used (for example, we may infer the start date for cumulative data in the Alabama Department of Public Health Website Dashboard, from which we have cumulative case counts starting in March 2020, from the United States Centers for Disease Control and Prevention, COVID-19 Response data, from which we have cumulative case counts for Alabama starting in January 2020).
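
Step B can be sketched as follows, using the example intervals from B1 and B2. This is a minimal illustration, assuming each row can be identified as part of a cumulative series; that is represented here by a hypothetical boolean in each tuple, so check the data format specification for the actual variable. The helper cumulative_to_increments is likewise hypothetical. It differences consecutive cumulative counts to obtain non-overlapping intervals.

```python
from datetime import date, timedelta

# Each row: (period_start, period_end, is_cumulative, count).
# The tuple layout and the boolean flag are assumptions for illustration.
rows = [
    (date(2020, 1, 1), date(2020, 1, 7), True, 10),
    (date(2020, 1, 1), date(2020, 1, 14), True, 15),
    (date(2020, 1, 1), date(2020, 1, 21), True, 17),
    (date(2020, 1, 1), date(2020, 1, 7), False, 10),
    (date(2020, 1, 8), date(2020, 1, 14), False, 7),
    (date(2020, 1, 15), date(2020, 1, 21), False, 3),
]

# Separate the two kinds of series before any analysis.
cumulative = sorted((r for r in rows if r[2]), key=lambda r: r[1])
fixed = sorted((r for r in rows if not r[2]), key=lambda r: r[0])


def cumulative_to_increments(cum_rows):
    """Difference consecutive cumulative counts.

    Assumes all rows share one start date and are sorted by end date.
    Returns (start, end, count) tuples for non-overlapping intervals.
    """
    increments = []
    prev_end, prev_count = None, 0
    for start, end, _flag, count in cum_rows:
        inc_start = start if prev_end is None else prev_end + timedelta(days=1)
        increments.append((inc_start, end, count - prev_count))
        prev_end, prev_count = end, count
    return increments


increments = cumulative_to_increments(cumulative)
# Jan 1-Jan 7: 10, Jan 8-Jan 14: 5, Jan 15-Jan 21: 2
```

Differencing can yield negative values if a source revised its cumulative totals downward, so the resulting increments are worth inspecting before use.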
		
	C. Check geographical locations: All geographic locations at the country and first-order administrative division (admin1) level have been represented at the same geographic level as in the data source, provided an ISO code or codes could be identified. For example, if one data source has specified a location as a country, Project Tycho considers the location as a country for that source, even if for another source, the location is represented as an admin1. As a result, data for some specific locations may be found in more than one country-condition dataset.
	
	D. Check for boolean operators: Boolean operators <AND>, <OR>, and <NOT> are permitted in location fields, demographic fields, and DiagnosisCertainty, if more than one category applies. For example, if case count data is reported for the admin2 labeled “Dukes and Nantucket”, we specify the admin2 as “DUKES COUNTY <OR> NANTUCKET COUNTY”. 
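
Fields containing these operators can be detected and split before grouping or joining on them. The sketch below uses only the Python standard library; the function name split_boolean_field is a hypothetical helper, and how the resulting terms should be combined depends on the analysis.

```python
import re

# Matches the three operators named above, with any surrounding whitespace.
_OPERATOR = re.compile(r"\s*<(AND|OR|NOT)>\s*")


def split_boolean_field(value):
    """Split a field value into (terms, operators).

    A value without operators yields a single term and an empty
    operator list.
    """
    parts = _OPERATOR.split(value.strip())
    # re.split with one capturing group alternates text and operator.
    terms = [p for p in parts[0::2] if p]
    operators = parts[1::2]
    return terms, operators


terms, operators = split_boolean_field("DUKES COUNTY <OR> NANTUCKET COUNTY")
# terms     -> ["DUKES COUNTY", "NANTUCKET COUNTY"]
# operators -> ["OR"]
```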

4. USER LICENSE INFORMATION
Project Tycho datasets are available under a Creative Commons Attribution 4.0 International License (CC BY 4.0); see www.tycho.pitt.edu/license and https://creativecommons.org/licenses/by/4.0/. The CC BY 4.0 license requires that users cite Project Tycho datasets as suggested below (see Section 6. CITATION).

Although each data source might have its own copyright and redistribution information, the Project Tycho team has reviewed each data source used (please see Section 5: DATA SOURCES below for details) and Project Tycho datasets only include data that we have permission to redistribute. If you believe that any data contained in a Project Tycho dataset infringes your copyright or other intellectual property rights, you may notify Project Tycho by contacting us via email at tycho@phdl.pitt.edu.

5. DATA SOURCES
This dataset includes data from the following sources:
	A. COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University: https://github.com/CSSEGISandData/COVID-19
		i. License/rights: Creative Commons Attribution 4.0 International (CC BY 4.0), https://github.com/CSSEGISandData/COVID-19/blob/master/README.md
		ii. Citation/additional information: Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1
	B. European Centre for Disease Prevention and Control Website: https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19
		i. License/rights: https://www.ecdc.europa.eu/en/copyright
	C. World Health Organization COVID-19 Dashboard: https://covid19.who.int/
		i. License/rights: https://covid19.who.int/info?openIndex=2


6. CITATION
MIDAS Coordination Center, Counts of COVID-19 reported in UGANDA: 2020-2021 (version 2.0, April 1, 2018): Project Tycho data release, DOI: 10.25337/T7/ptycho.v2.0/UG.840539006

7. CONTACT INFORMATION
In case of questions or ideas, please contact Project Tycho via email (tycho@phdl.pitt.edu) or via the website (www.tycho.pitt.edu).

8. APPENDIX

The following data sources include individual case count data that was aggregated to daily counts for inclusion in the Project Tycho datasets:

	A. Republic of Philippines Department of Health COVID-19 Website Dashboard (https://doh.gov.ph/covid19tracker)
		i. Individual case count data was aggregated for COVID-19 cases, deaths, and recoveries, both by sex and age and by sex only, at the country, first-order administrative division, second-order administrative division, and city levels for each specific date.
		ii. The dates used for aggregation are: 
			a. For cases: the date a case was publicly announced as a confirmed case
			b. For deaths: the date a patient died
			c. For recoveries: the date a patient recovered

	B. Georgia Department of Public Health Website (https://dph.georgia.gov/covid-19-daily-status-report)
		i. Individual case count data was aggregated for COVID-19 cumulative deaths only, by age, race, and sex at the county level, until 2020-09-30.
		ii. The date used for aggregation of the cumulative deaths is the date of download (data were downloaded daily).

	C. Montana Department of Health & Human Services COVID-19 Website Dashboard (https://montana.maps.arcgis.com/apps/MapSeries/index.html?appid=7c34f3412536439491adcc2103421d4b)
		i. Individual case count data was aggregated for COVID-19 cases only, by age group and sex at the county level.
		ii. The date used for aggregation of the cases is the date of report.

