This Boocock_GenomicEpidemiologyLACOVID2020_BMCReadme.txt file was generated on 2021-12-14 by James Boocock


GENERAL INFORMATION

1. Title of Dataset: Genomic epidemiology of the Los Angeles COVID-19 outbreak and the early history of the B.1.43 strain in the US.

2. Author Information
    A. Principal Investigator Contact Information
        Name: Leonid Kruglyak
        Institution: UCLA
        Address: Department of Human Genetics, David Geffen School of Medicine
        Email: lkruglyak@mednet.ucla.edu

    C. Alternate Contact Information
        Name: James Boocock
        Institution: UCLA
        Address: Department of Human Genetics, David Geffen School of Medicine
        Email: james.boocock@ucla.edu

3. Date of data collection (single date, range, approximate date): 2020

4. Geographic location of data collection: Los Angeles County

5. Information about funding sources that supported the collection of the data: We thank the UCLA David Geffen School of Medicine Dean's Office for their support and Fast Grants, Inc. for funding this work. A generous donation was provided by Jane Semel. This work was supported by funding from the Howard Hughes Medical Institute (to LK) and the Damon Runyon Cancer Research Foundation (DFS-43-20 to YY).


DATA & FILE OVERVIEW

1. File List:

data
├── Duplicate\ Sent\ Genomes_Deidentified.csv
├── all.fasta (Our LA processed V-seq and NEB genomes)
├── all_3_none.fasta
├── all_sequences2.fasta (Our LA genomes that were submitted to GISAID)
├── b_1_43_lockdown.log (BEAST log file for estimating the effective population size before and after LA's lockdown)
├── branch_lengths.json
├── gisaid
│   ├── good_genomes.txt
│   ├── metadata_2020-07-10_23-53.tsv (Metadata for the GISAID sequences used in our analysis)
│   └── sequences_2020-07-10_23-53.fasta (GISAID sequences used in analysis)
├── lineages.csv (Phased B.1.43 genome lineages)
├── lineages_merged.csv (GISAID and LA genome pangolin lineages)
├── long_sample_list_august19th.csv (Sample contamination list)
├── nextstrain_ncov_la_all_b_one_two_timetree.nwk (B.1.43 timetree with LA and other genomes)
├── nextstrain_ncov_la_all_b_one_two_tree.nwk (B.1.43 mutation tree with LA and other genomes)
├── nextstrain_ncov_north-america_usa_los-angeles_hets_curate_three_timetree.nwk (Timetree of all LA genomes along with worldwide samples)
├── nextstrain_ncov_north-america_usa_los-angeles_hets_curate_three_tree.nwk (Mutation tree of all LA genomes along with worldwide samples)
├── nt_muts.json (JSON mutation list for all the LA and other strains)
├── phased.fasta (Phased B.1.43 genomes LA)
├── phased_genomes_two_lineages.fasta (Phased B.1.43 genomes)
├── qc_report.tsv (QC report for our datasets, used to filter genomes)
└── tree_json_b_1_43.json (B.1.43 tree in JSON format)


METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: Refer to the methods section of the paper "Genomic epidemiology of the Los Angeles COVID-19 outbreak and the early history of the B.1.43 strain in the US" for a detailed description of how the data was generated.

2. Methods for processing the data: Refer to the methods section of the paper "Genomic epidemiology of the Los Angeles COVID-19 outbreak and the early history of the B.1.43 strain in the US" for a detailed description of how the data was processed.

3. Instrument- or software-specific information needed to interpret the data: The 'analysis_scripts' folder in our GitHub repository contains the scripts for processing everything in this dataset: https://github.com/theboocock/COVID-NGS2


DATA-SPECIFIC INFORMATION FOR: Duplicate\ Sent\ Genomes_Deidentified.csv

1. Number of variables: 6

2. Number of cases/rows: 66

3. Variable List:
    id, unique identifier, string
    test, type of test, string
    specdate, date specimen collected, date
    spec, type of specimen, string
    date, date test taken, date
    group, duplicate group, string

4. Missing data codes:

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: gisaid/good_genomes.txt

List of GISAID genomes that passed our filter

1. Number of variables: 1

2. Number of cases/rows: 59982

3. Variable List:
    gisaid id, GISAID IDs, string

4. Missing data codes: n

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: gisaid/metadata_2020-07-10_23-53.tsv

1. Number of variables: 26

2. Number of cases/rows: 62,644

3. Variable List:
    strain, strain name, string
    virus, virus name, string
    gisaid_epi_isl, GISAID EPI ISL identifier, string
    genbank_accession, GenBank accession if it exists, string
    date, date genome collected, string
    region, region where genome was collected, string
    country, country where genome was collected, string
    division, division where genome was collected, string
    location, location where genome was collected, string
    region_exposure, region where exposure occurred, string
    country_exposure, country where exposure occurred, string
    division_exposure, division where exposure occurred, string
    segment, sequence segment (usually genome), string
    length, length of sequence, number
    host, host where virus was collected, string
    age, age of host, number
    sex, sex of host, string
    pangolin_lineage, pangolin lineage, string
    GISAID_clade, GISAID clade assignment, string
    originating_lab, lab that collected the specimen, string
    submitting_lab, lab that submitted the specimen, string
    authors, authors of the genome, string
    url, url where genome was downloaded, string
    title, manuscript title, string
    paper_url, url of paper that goes with the genomes, string
    date_submitted, date genomes were submitted to GISAID, string

4. Missing data codes: ?, missing

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: lineages.csv

1. Number of variables: 6

2. Number of cases/rows: 22

3. Variable List:
    taxon, sample name, string
    lineage, pangolin lineage, string
    probability, probability of assignment, number
    pangoLEARN_version, version of pangoLEARN used to assign lineage, string
    status, status of lineage assignment, string
    note, note about sample, string

4. Missing data codes: n

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: lineages_merged.csv

1. Number of variables: 6

2. Number of cases/rows: 22

3. Variable List:
    taxon, sample name, string
    lineage, pangolin lineage, string
    probability, probability of assignment, number
    pangoLEARN_version, version of pangoLEARN used to assign lineage, string
    status, status of lineage assignment, string
    note, note about sample, string

4. Missing data codes: n

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: long_sample_list_august19th.csv

1. Number of variables: 7

2. Number of cases/rows: 15

3. Variable List:
    merged_id, sample id, string
    uid, universal identifier, string
    ct, qPCR ct value, string
    het_sample_count, number of het sites, number
    date_fix, date of collection, date
    lib_type, sequencing library type, string
    p1, sequencing primer 1, string

4. Missing data codes: n

5. Specialized formats or other abbreviations used:


DATA-SPECIFIC INFORMATION FOR: qc_report.tsv

1. Number of variables: 107

2. Number of cases/rows: 260

3. Variable List:
    merged_id, sample id, string
    mapped_human, number of reads mapping to the human genome, number
    mapped_rrna, number of reads mapping to the human rRNA locus, number
    mapped_sars2_dedup, number of reads mapping to the sars2 genome after PCR deduplication, number
    mapped_sars2_dedup, number of reads mapping to the sars2 genome without PCR deduplication, number
    unmapped, number of unmapped reads, number
    total_read_count, total number of reads, number
    sars2_percent_dup, percentage of reads mapping to the sars2 genome after PCR deduplication, number
    sars2_percent_dedup, percentage of reads mapping to the sars2 genome without PCR deduplication, number
    coverage_3x, coverage of the sars2 genome at >= 3x, number
    read_group_name, read group name in bam, string
    uid_lib, unique sample identifier and library type combination, string
    lib_names, library names mapped, string
    fastqs_one, read one fastq location, string
    fastqs_two, read two fastq location, string
    library_id_internal, internal library id, string
    coverage_mean_dedup, mean sars2 genome coverage after PCR deduplication, number
    coverage_mean_no_dedup, mean sars2 genome coverage without PCR deduplication, number
    coverage_5x, coverage of the sars2 genome at >= 5x, number
    taxon, sample name alternative, string
    lineage, pangolin lineage, string
    probability, pangolin assignment probability, number
    pangoLEARN_version, version of pangoLEARN used to assign lineages, string
    status, status of the pangoLEARN analysis, string
    note, note from the pangoLEARN analysis, string
    pangolin_phased, pangolin lineage from the phased data, string
    prob_phased, probability of assignment from the phased data, string
    gisaid_phased, GISAID lineage for the phased genomes, string
    uid, unique sample identifier, string
    lib_name_short, short library name, string
    uid.x, uid duplicate column, string
    sample_type, sample type, string
    virus, virus type, string
    gisaid_epi_isl, GISAID id, string
    genbank_accession, GenBank accession, string
    region, region where genome was collected, string
    division, division where genome was collected, string
    location, location where genome was collected, string
    region_exposure, region where person was exposed, string
    country_exposure, country where person was exposed, string
    division_exposure, division where person was exposed, string
    segment, sequenced segment, string
    length, sequence length, string
    host, virus host, string
    age, host age, number
    sex, host sex, string
    originating_lab, lab where sample originated, string
    submitting_lab, lab where sample was submitted, string
    authors, authors of paper, string
    url, url where samples were found, string
    title, paper title, string
    date_submitted, date sample submitted, string
    library_type, sequencing library type, string
    uid_sample_type, uid combined with the library type, string
    id_library_type, uid combined with the library type, string
    sample_name_fasta, sample name of strain in the fasta files, string

4. Missing data codes: n

5. Specialized formats or other abbreviations used:
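
Because qc_report.tsv is described above as the file used to filter genomes, a short reading example may help. The sketch below is only an illustration and is not taken from the analysis_scripts folder: the function name load_passing_samples, the 0.9 threshold, and the assumption that coverage_3x is stored as a fraction between 0 and 1 are all hypothetical, and the sample rows are made up. Only the column names merged_id and coverage_3x come from the variable list above.

```python
import csv
import io


def load_passing_samples(handle, min_coverage_3x=0.9):
    """Return merged_id values whose coverage_3x is at least the threshold.

    The threshold and the 0-1 scale of coverage_3x are assumptions made
    for this example; check the real file before reusing them.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    passing = []
    for row in reader:
        try:
            coverage = float(row.get("coverage_3x"))
        except (TypeError, ValueError):
            # Missing or non-numeric coverage: skip the sample.
            continue
        if coverage >= min_coverage_3x:
            passing.append(row["merged_id"])
    return passing


# Made-up rows standing in for qc_report.tsv (tab-separated, with a header).
example = io.StringIO(
    "merged_id\tcoverage_3x\tlineage\n"
    "S1\t0.98\tB.1.43\n"
    "S2\t0.40\tB.1\n"
    "S3\t\tB.1.43\n"
)
print(load_passing_samples(example))  # ['S1']
```

To run this against the real file, open it with open("data/qc_report.tsv", newline="") and pass the handle in place of the in-memory example.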