This Bishop2021_README.txt file was generated on 2021-10-14 by Anusha Bishop

GENERAL INFORMATION

1. Title of Dataset: A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning

2. Author Information
	A. Principal Investigator Contact Information
		Name: Norah Saarman
		Institution: Utah State University
		Address: 
		Email: norah.saarman@usu.edu

	B. Associate or Co-investigator Contact Information
		Name: Anusha Bishop
		Institution: UC Berkeley
		Address: 
		Email: anusha.bishop@berkeley.edu

	C. Alternate Contact Information
		Name: 
		Institution: 
		Address: 
		Email: 

3. Date of data collection: 2016 - 2019

4. Geographic location of data collection: Kenya and Tanzania

5. Information about funding sources that supported the collection of the data: see corresponding paper (Bishop et al., 2021).


SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: NA

2. Links to publications that cite or use the data:  https://doi.org/10.1111/eva.13237

3. Links to other publicly accessible locations of the data: NA

4. Links/relationships to ancillary data sets: NA

5. Was data derived from another source? NA

6. Recommended citation for this dataset: 
Bishop, A. P., Amatulli, G., Hyseni, C., Pless, E., Bateta, R., Okeyo, W. A., ... & Saarman, N. P. (2021). A machine learning approach to integrating genetic and ecological data in tsetse flies (Glossina pallidipes) for spatially explicit vector control planning. Evolutionary Applications, 14(7), 1762.

DATA & FILE OVERVIEW

1. File List: 
Bishop2021_HabitatSuitability_Data.csv - contains the data used in the habitat suitability model (i.e. information about the trap locations). Abbreviations: TrapNo (trap number), Lat (latitude), Long (longitude), NumberDays (number of days between StartDateMDY, the date traps were set out, and EndDateMDY, the date flies were collected from traps).

Bishop2021_GenConModel_AllData.csv - contains the data used in the genetic connectivity model. All columns starting with "BIO" are the median values of each bioclimatic variable along straight paths between sites. The "kernel" column contains the median values along straight paths between sites from the kernel density layer. The "pixvals" column contains the geographic distance between sites in units of pixels (1 km resolution). The "Distance" column contains the Cavalli-Sforza and Edwards’ chord (CSE) genetic distances between sites. See methods of the paper (Bishop et al., 2021) for more detail.

Gpd_KenTza_11loci_659indv_genepop.txt - contains the microsatellite genotypes for the 659 individuals used in this study, in GenePop format (https://genepop.curtin.edu.au/).

Gpd_KenTza_11loci_659indv_sample_info.csv - contains information about these 659 individuals (fly and site IDs, coordinates, geographic region, and genetic cluster).
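
For readers unfamiliar with GenePop files, the following is a minimal Python sketch of a reader for the common GenePop text layout: a title line, locus names (one per line or comma-separated), then "Pop" lines each followed by one line per individual of the form "ID , genotype genotype ...". The function name parse_genepop is hypothetical and is not part of this data package. The sketch assumes each genotype code holds two alleles of equal width (e.g. 100102) and that all-zero alleles mark missing data, as in standard GenePop files; it has not been checked against the file in this package.

```python
def parse_genepop(text):
    """Parse GenePop-formatted text into (title, loci, populations).

    populations is a list (one entry per "Pop" block) of lists of
    (individual_id, [(allele1, allele2), ...]) tuples. Alleles are kept
    as strings so that all-zero missing codes stay recognisable.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    loci = []
    populations = []
    in_pops = False
    for ln in lines[1:]:
        if ln.lower() == "pop":
            # A "Pop" line opens a new population block.
            populations.append([])
            in_pops = True
        elif not in_pops:
            # Locus names: one per line, or several separated by commas.
            loci.extend(name.strip() for name in ln.split(",") if name.strip())
        else:
            # The last comma separates the individual ID from its genotypes.
            ind_id, _, genotypes = ln.rpartition(",")
            calls = []
            for code in genotypes.split():
                half = len(code) // 2
                calls.append((code[:half], code[half:]))
            populations[-1].append((ind_id.strip(), calls))
    return title, loci, populations
```

If Gpd_KenTza_11loci_659indv_genepop.txt follows this layout, passing its contents to parse_genepop should give 11 locus names and 659 individuals in total across all populations, as the file name indicates.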

2. Relationship between files, if important: NA

3. Additional related data collected that was not included in the current data package: NA

4. Are there multiple versions of the dataset? no

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: 
see corresponding paper (Bishop et al., 2021)

2. Methods for processing the data: 
see corresponding paper (Bishop et al., 2021)

3. Instrument- or software-specific information needed to interpret the data: 
see corresponding paper (Bishop et al., 2021)

4. Standards and calibration information, if appropriate: 
NA

5. Environmental/experimental conditions: 
NA

6. Describe any quality-assurance procedures performed on the data: 
NA

7. People involved with sample collection, processing, analysis and/or submission: 
See corresponding paper (Bishop et al., 2021)


DATA-SPECIFIC INFORMATION FOR: Bishop2021_HabitatSuitability_Data.csv

1. Number of variables: 16

2. Number of cases/rows: 354

3. Variable List: 
County
Sub.county (Sub County)
Parish
Village
TrapNo (Trap Number)
Lat (Degrees Latitude)
Long (Degrees Longitude)
Elevation (meters)
Male (number of males)
Female (number of females)
Total (total number of flies)
StartDateMDY (start date for trap)
EndDateMDY (end date for trap)
NumberDays (number of days between StartDateMDY, the date traps were set out, and EndDateMDY, the date flies were collected from traps)
EndMonth (month of end date)
EndYear (year of end date)

4. Missing data codes: NA

5. Specialized formats or other abbreviations used: NA
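
The variables above can be combined into a catch rate per trap. The following is a minimal Python sketch, not the procedure used in the paper. The function names load_rows and trap_summaries are hypothetical. It assumes that Male, Female, Total and NumberDays are filled with plain numbers in every row, and that Total is meant to equal Male + Female; rows where that does not hold are flagged rather than corrected.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def trap_summaries(rows):
    """Return (TrapNo, flies per trap per day, counts_consistent) per row.

    The rate is None when NumberDays is zero.
    """
    summaries = []
    for row in rows:
        male = int(row["Male"])
        female = int(row["Female"])
        total = int(row["Total"])
        days = float(row["NumberDays"])
        rate = total / days if days > 0 else None
        summaries.append((row["TrapNo"], rate, male + female == total))
    return summaries
```

Example use: trap_summaries(load_rows("Bishop2021_HabitatSuitability_Data.csv")).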

DATA-SPECIFIC INFORMATION FOR: Bishop2021_genConModel_AllData.csv

1. Number of variables: 25

2. Number of cases/rows: 197

3. Variable List: 
See Table S1 (Bishop et al., 2021)

4. Missing data codes: NA

5. Specialized formats or other abbreviations used: NA
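
As a rough guide to the column layout described in the file list, the following Python sketch groups the columns of this file by role. The function names split_columns and pixels_to_km are hypothetical. The sketch assumes that the "BIO" columns, "kernel" and "pixvals" act as predictors and that "Distance" (CSE genetic distance) is the response; consult Bishop et al. (2021) for how the model was actually specified. The pixel-to-kilometre conversion follows from the stated 1 km pixel resolution.

```python
def split_columns(header):
    """Split column names into (predictors, response).

    Predictors are every column starting with "BIO", plus "kernel" and
    "pixvals" when present. The response is "Distance", or None if absent.
    """
    predictors = [name for name in header if name.startswith("BIO")]
    predictors += [name for name in ("kernel", "pixvals") if name in header]
    response = "Distance" if "Distance" in header else None
    return predictors, response


def pixels_to_km(pixvals, km_per_pixel=1.0):
    """Convert a distance in pixels to kilometres (1 km per pixel by default)."""
    return float(pixvals) * km_per_pixel
```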

DATA-SPECIFIC INFORMATION FOR: Gpd_KenTza_11loci_659indv_sample_info.csv

1. Number of variables: 10

2. Number of cases/rows: 659

3. Variable List: 
indivID (fly ID)
siteID (site ID)
pixelLat (pixel latitude)
pixelLong (pixel longitude)
corLat (degree latitude)
corLong (degree longitude)
centroidLong (centroid longitude)
centroidLat (centroid latitude)
region (geographic region)
cluster (genetic cluster)

See Bishop et al. (2021) for more details.

4. Missing data codes: NA

5. Specialized formats or other abbreviations used: NA
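
A simple check on this file is to tally individuals by site, region or genetic cluster. The following is a minimal Python sketch; the function name count_by is hypothetical, and the rows are assumed to be dicts keyed by the column names listed above (for example, as produced by csv.DictReader).

```python
from collections import Counter


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)
```

For the full file, the counts returned by count_by(rows, "siteID") should sum to 659, the number of rows stated above.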
