This README_0gb5mkm08.txt file was generated on 2022-08-24 by Pengcheng Fu


GENERAL INFORMATION

1. Title of Dataset: Data from: Population genomics reveal deep divergence and strong geographical structuring in the Hengduan Mountains.

2. Author Information
	Corresponding Investigator 
		Name: Dr Pengcheng Fu
		Institution: Luoyang Normal University, China
		Email: fupengc@sina.com


3. Date of data collection: 2018-2021

4. Geographic location of data collection: Hengduan Mountains, China

5. Funding sources that supported the collection of the data: National Natural Science Foundation of China, Award: 31600296; Chinese Scholarship Council

6. Recommended citation for this dataset: Fu P. (2022), Data from: Population genomics reveal deep divergence and strong geographical structuring in the Hengduan Mountains, Dryad, Dataset


DATA & FILE OVERVIEW

1. Description of dataset

We used restriction site-associated DNA sequencing (RAD-seq) to generate 1,907 single nucleotide polymorphisms (SNPs) and 4 kb of plastid sequence in species of the Gentiana hexaphylla complex (Gentianaceae). We performed genetic clustering with spatial and non-spatial models, phylogenetic reconstructions, and ancestral range estimation to address the processes influencing the diversification of G. hexaphylla in the Hengduan Mountains (HM). The SNP data and plastid sequence alignments are provided here.
 

2. File List: 
	File 1 Name: Gentiana_hexaphylla_plastid_sequences.fas
	File 1 Description: Assembled plastid sequence from RAD-seq data

	File 2 Name: Gentiana_hexaphylla_SNP_m3n2M2.vcf
	File 2 Description: SNP data called with m=3, n=2, M=2 in Stacks 2.0

	File 3 Name: Gentiana_hexaphylla_SNP_m3n3M3.vcf
	File 3 Description: SNP data called with m=3, n=3, M=3 in Stacks 2.0

	File 4 Name: Gentiana_hexaphylla_SNP_m3n4M4.vcf
	File 4 Description: SNP data called with m=3, n=4, M=4 in Stacks 2.0

	File 5 Name: G.hexaphylla_plastid_ML_tree.contree
	File 5 Description: Maximum-likelihood (ML) tree file built from plastid data


METHODOLOGICAL INFORMATION

For RAD library construction and sequencing (Miller, Dunham, Amores, Cresko, & Johnson, 2007), each sample was digested with the restriction enzyme EcoRI, followed by ligation of the P1 adapter with T4 ligase. Fragments were pooled, randomly sheared and size-selected to 350–550 bp. A second adapter (P2) was then ligated. The ligation products were purified and PCR-amplified, followed by gel purification and size selection for fragments in the range of 350–550 bp. Paired-end reads of 150 bp were generated on the Illumina NovaSeq 6000 platform (Tianjin, China).

Raw reads were filtered and trimmed with Trimmomatic v0.32 (Bolger, Lohse & Usadel, 2014) with default parameters to remove adaptor sequences and low-quality reads and sites, and then checked for quality with FastQC v0.11.2. We used Stacks v2.0 (Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011; Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013) to identify orthologous loci across individuals. Clean sequences were assembled de novo using denovo_map, with a minimum stack depth of three (m = 3) and a range of allowed mismatches between stacks within and between individuals (M = n = 2, 3 or 4). A locus was retained only if it was present in at least 75% of the individuals in a population (-r 0.75), and SNPs with a minor allele frequency (MAF) below 5% across all individuals were removed (--min-maf 0.05). Using VCFtools v0.1.13 (Danecek et al., 2011), we retained only SNPs genotyped in at least 50% of individuals (--max-missing 0.5). To reduce linkage disequilibrium (LD) among SNPs, sites were thinned in VCFtools so that no two retained SNPs were closer than 100 bp (--thin 100). Heterogeneous loci were filtered out in TASSEL 5 (Bradbury et al., 2007) to exclude SNPs originating from different paralogs.
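
The site-level filters described above can be illustrated with a minimal Python sketch. It is not the pipeline that produced these files (that was Stacks and VCFtools); it only shows the logic of a call-rate threshold, a MAF threshold and distance-based thinning on in-memory records. The function names, the genotype representation and the greedy thinning rule are assumptions made for illustration, and the per-population -r 0.75 filter is not modelled.

```python
from typing import Iterable, List, Optional, Tuple

# A genotype is a pair of allele indices (0 = reference, 1 = alternate),
# or None when the call is missing. Biallelic sites are assumed.
Genotype = Optional[Tuple[int, int]]
Site = Tuple[str, int, List[Genotype]]  # (locus/chromosome, position, genotypes)


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """MAF computed over called alleles only."""
    alleles = [a for g in genotypes if g is not None for a in g]
    if not alleles:
        return 0.0
    alt = sum(alleles) / len(alleles)
    return min(alt, 1.0 - alt)


def filter_sites(
    sites: Iterable[Site],
    min_call_rate: float = 0.5,   # analogous to --max-missing 0.5
    min_maf: float = 0.05,        # analogous to --min-maf 0.05
    thin: int = 100,              # analogous to --thin 100
) -> List[Tuple[str, int]]:
    """Return (locus, position) of retained sites.

    Sites must be sorted by locus and position. Thinning is greedy:
    a site is dropped if it lies closer than `thin` bp to the last
    retained site on the same locus (an assumed simplification).
    """
    kept: List[Tuple[str, int]] = []
    last_chrom, last_pos = None, None
    for chrom, pos, genotypes in sites:
        if call_rate(genotypes) < min_call_rate:
            continue
        if minor_allele_frequency(genotypes) < min_maf:
            continue
        if chrom == last_chrom and pos - last_pos < thin:
            continue
        kept.append((chrom, pos))
        last_chrom, last_pos = chrom, pos
    return kept
```

In this sketch all three filters are applied in a single pass, whereas the workflow described above applied them in separate steps, so results on real data may differ slightly.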

To obtain plastid sequences of each sample, clean reads were assembled using the GetOrganelle pipeline (Jin et al., 2020) with default parameters. We used the published plastome of G. hexaphylla (MG192305) (Sun et al., 2018) as the reference. Sequences were aligned using MAFFT (Katoh, Misawa, Kuma, & Miyata, 2002).


DATA-SPECIFIC INFORMATION FOR: Gentiana_hexaphylla_plastid_sequences.fas

1. Number of variables: 1

2. Number of cases/rows: 64

3. Variable List: 
	DNA sequence 

4. Missing data codes: 
	-

5. Abbreviations used: 
	N/A; not applicable

6. Other relevant information: 
	This is a FASTA file containing DNA sequences of 64 samples.
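
A minimal Python sketch for checking the alignment against the figures given above (64 sequences, "-" as the missing-data code). The function names are hypothetical, and the reader assumes a plain FASTA layout with ">" header lines; it has not been run against the deposited file.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}; wrapped sequences are joined."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def gap_fraction(seq: str) -> float:
    """Fraction of alignment columns coded as '-' in one sequence."""
    return seq.count("-") / len(seq) if seq else 0.0


# Example use (file name taken from the File List in this README):
# with open("Gentiana_hexaphylla_plastid_sequences.fas") as handle:
#     records = read_fasta(handle)
# print(len(records))                        # number of sequences
# print({len(s) for s in records.values()})  # alignment length(s)
```

If the file is a proper alignment, the set of sequence lengths should contain a single value.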



DATA-SPECIFIC INFORMATION FOR: Gentiana_hexaphylla_SNP_m3n2M2.vcf

1. Number of variables: 1988

2. Number of cases/rows: 95

3. Variable List: 
	1988 SNPs called from RAD-seq data

4. Missing data codes: 
	-

5. Abbreviations used: 
	N/A; not applicable

6. Other relevant information: 
	This is a standard VCF file containing SNPs from 95 samples.



DATA-SPECIFIC INFORMATION FOR: Gentiana_hexaphylla_SNP_m3n3M3.vcf

1. Number of variables: 1875

2. Number of cases/rows: 95

3. Variable List: 
	1875 SNPs called from RAD-seq data

4. Missing data codes: 
	-

5. Abbreviations used: 
	N/A; not applicable

6. Other relevant information: 
	This is a standard VCF file containing SNPs from 95 samples.




DATA-SPECIFIC INFORMATION FOR: Gentiana_hexaphylla_SNP_m3n4M4.vcf

1. Number of variables: 1907

2. Number of cases/rows: 95

3. Variable List: 
	1907 SNPs called from RAD-seq data

4. Missing data codes: 
	-

5. Abbreviations used: 
	N/A; not applicable

6. Other relevant information: 
	This is a standard VCF file containing SNPs from 95 samples.
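
The sample and SNP counts listed for the VCF files can be checked with a short Python sketch like the one below. It assumes the usual VCF layout (meta lines starting with "##", a "#CHROM" header with nine fixed columns followed by one column per sample) and treats any genotype containing "." as missing, which is the common VCF convention; if the files use a different missing code, that test would need adjusting. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def summarize_vcf(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (n_samples, n_sites, fraction of missing genotype calls)."""
    n_samples = 0
    n_sites = 0
    n_missing = 0
    n_calls = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM..FORMAT; the rest are sample names.
            n_samples = max(len(fields) - 9, 0)
            continue
        n_sites += 1
        for sample in fields[9:]:
            genotype = sample.split(":")[0]  # GT is the first FORMAT key
            n_calls += 1
            if "." in genotype:
                n_missing += 1
    missing_fraction = n_missing / n_calls if n_calls else 0.0
    return n_samples, n_sites, missing_fraction


# Example use (file name taken from the File List in this README):
# with open("Gentiana_hexaphylla_SNP_m3n4M4.vcf") as handle:
#     print(summarize_vcf(handle))
```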




DATA-SPECIFIC INFORMATION FOR: G.hexaphylla_plastid_ML_tree.contree

1. Number of variables: none

2. Number of cases/rows: none

3. Variable List: 
	none

4. Missing data codes: 
	none

5. Abbreviations used: 
	N/A; not applicable

6. Other relevant information: 
	This is a standard tree file, built from plastid data.
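
A minimal Python sketch for listing the tip labels and node support values in the tree file. It assumes the .contree file holds a single Newick string with support values written as internal node labels; this README does not state the format or the tree-building software, so that is an assumption. The regular expressions do not handle quoted labels, and the function names are hypothetical.

```python
import re
from typing import List


def tip_labels(newick: str) -> List[str]:
    """Labels that directly follow '(' or ',' are taken to be tips."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def support_values(newick: str) -> List[float]:
    """Numeric labels that directly follow ')' are taken to be node supports."""
    matches = re.findall(r"\)\s*(\d+(?:\.\d+)?)\s*[:,);]", newick)
    return [float(value) for value in matches]


# Example use (file name taken from the File List in this README):
# with open("G.hexaphylla_plastid_ML_tree.contree") as handle:
#     tree = handle.read().strip()
# print(len(tip_labels(tree)), "tips")
```

For routine work a dedicated phylogenetics library would be more robust than regular expressions; the sketch is only meant as a quick sanity check.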

REFERENCES
1. Bolger, A. M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120. doi: 10.1093/bioinformatics/btu170
2. Catchen, J. M., Amores, A., Hohenlohe, P., Cresko, W., and Postlethwait, J. H. (2011). Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1, 171–182. doi: 10.1534/g3.111.000240
3. Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A., and Cresko, W. A. (2013). Stacks: an analysis tool set for population genomics. Mol. Ecol., 22, 3124–3140. doi: 10.1111/mec.12354
4. Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011). The variant call format and VCFtools. Bioinformatics, 27, 2156–2158. doi: 10.1093/bioinformatics/btr330
5. Bradbury, P. J., Zhang, Z., Kroon, D. E., Casstevens, T. M., Ramdoss, Y., and Buckler, E. S. (2007). TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23, 2633–2635. doi: 10.1093/bioinformatics/btm308
6. Jin, J. J., Yu, W. B., Yang, J. B., Song, Y., DePamphilis, C. W., Yi, T. S., et al. (2020). GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol., 21, 1–31. doi: 10.1186/s13059-020-02154-5
7. Katoh, K., Misawa, K., Kuma, K. I., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066. doi: 10.1093/nar/gkf436
8. Sun, S. S., Fu, P. C., Zhou, X. J., Cheng, Y. W., Zhang, F. Q., Chen, S. L., et al. (2018). The complete plastome sequences of seven species in Gentiana sect. Kudoa (Gentianaceae): insights into plastid gene loss and molecular evolution. Front. Plant Sci., 9, 493. doi: 10.3389/fpls.2018.00493


