Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)
Description
1. Summary
This deposit contains two reference VCF files for the human genome (hg38/GRCh38). They list common Single Nucleotide Polymorphisms (SNPs) and are specifically designed for use with allele-specific copy number analysis tools, such as FACETS.
The main feature of these files is a uniform SNP density of approximately 1 SNP per kilobase (kb): at most one SNP is kept per 1 kb window, which shortens the snp-pileup pre-processing step and removes the high-density SNP hotspots that can bias segmentation.
2. Rationale
Official SNP files (e.g., from dbSNP) present two challenges for FACETS-like analyses:
- Excessive file size: They contain millions of rare variants, which considerably slows down the pre-processing step (snp-pileup).
- Non-uniform density: SNP 'hotspots' with very high density can introduce bias into segmentation algorithms.
These optimized reference files solve both problems by providing a lightweight, clean, and evenly distributed set of markers.
3. Contents of the Deposit
- facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz: The SNP reference for use with BAM files where chromosomes are named 'chr1', 'chr2', etc.
- facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz.tbi: The Tabix index for the above file.
- facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz: The SNP reference for use with BAM files where chromosomes are named '1', '2', etc.
- facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz.tbi: The Tabix index for the above file.
- uniformize_vcf_density.py: The Python script used to generate these reference files.
- README.md: This information file.
4. Generation Workflow (Transparency)
The generation process for these files is fully reproducible:
- Primary Source: Data was derived from the official dbSNP b157 VCF for the GRCh38/hg38 assembly (GCF_000001405.40.gz), downloaded from the NCBI FTP server.
- Primary Chromosome Filtering: The VCF was first filtered to retain only the primary assembly chromosomes (1-22, X, Y, M), excluding alternate haplotypes and unplaced contigs.
- Initial SNP Filtering: The source VCF was subsequently filtered using bcftools to retain only common (INFO/COMMON=1), bi-allelic SNPs.
- Chromosome Renaming: NCBI-style chromosome names (NC_...) were converted to the two standard nomenclatures (chr and no-chr) using the official GRCh38.p14 assembly report.
- Density Uniformization: The uniformize_vcf_density.py script was run on the filtered files to select the most informative SNP (allele frequency closest to 0.5) within each 1 kb window.
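The density uniformization step can be illustrated with a minimal Python sketch. This is not the uniformize_vcf_density.py script from the deposit. The record layout (chromosome, position, allele frequency), the function name uniformize_density, the window boundaries (positions 1-1000, 1001-2000, ...) and the tie-breaking rule are all assumptions made for illustration, and the sketch does not cover how the allele frequency is read from the dbSNP VCF.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (chromosome, 1-based position, allele frequency).
Snp = Tuple[str, int, float]


def uniformize_density(snps: Iterable[Snp], window_size: int = 1000) -> List[Snp]:
    """Keep, for each chromosome and window, the SNP whose frequency is closest to 0.5."""
    best: Dict[Tuple[str, int], Snp] = {}
    for snp in snps:
        chrom, pos, af = snp
        # Assumed windowing: positions 1..1000 fall in window 0, 1001..2000 in window 1, etc.
        key = (chrom, (pos - 1) // window_size)
        current = best.get(key)
        # Strict '<' keeps the first SNP seen when two are equally close to 0.5.
        if current is None or abs(af - 0.5) < abs(current[2] - 0.5):
            best[key] = snp
    # Dict order follows the first appearance of each window, so sorted input stays sorted.
    return list(best.values())


example = [
    ("chr1", 100, 0.10),
    ("chr1", 900, 0.48),
    ("chr1", 1000, 0.30),
    ("chr1", 1001, 0.05),
    ("chr2", 50, 0.90),
]
selected = uniformize_density(example)
print(selected)
```

In this example the first three SNPs share one window and only the one at position 900 survives, while the SNPs at chr1:1001 and chr2:50 are each alone in their window and are kept regardless of their frequency.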
5. Recommended Usage
- Download the VCF file (and its .tbi index) that matches the chromosome naming style of your BAM files.
- Provide this VCF file as the SNP reference to the snp-pileup tool or a Galaxy wrapper for FACETS.
Example command:
snp-pileup -g -q15 -Q20 facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz pileup.csv.gz normal.bam tumor.bam
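Choosing between the two VCF files depends on how chromosomes are named in the BAM header. The following Python sketch shows one way to decide, assuming the header text has already been obtained (for example from the output of samtools view -H). The function names detect_chrom_style and reference_vcf_for are hypothetical and are not part of FACETS or of this deposit.

```python
from typing import Iterable, List


def detect_chrom_style(header_lines: Iterable[str]) -> str:
    """Return 'chr_style' or 'no_chr_style' from the @SQ lines of a SAM/BAM header."""
    names: List[str] = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # @SQ fields are tab-separated; the SN tag holds the sequence name.
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    if "chr1" in names:
        return "chr_style"
    if "1" in names:
        return "no_chr_style"
    raise ValueError("header contains neither 'chr1' nor '1' as a sequence name")


def reference_vcf_for(header_lines: Iterable[str]) -> str:
    """Name of the reference VCF in this deposit that matches the header's naming style."""
    style = detect_chrom_style(header_lines)
    return f"facets_reference_snps_hg38_uniform_1kb.{style}.vcf.gz"


header = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "@SQ\tSN:chr2\tLN:242193529\n",
]
print(reference_vcf_for(header))
```

The check deliberately looks only for chromosome 1, so a header that uses another naming scheme (such as the NC_... accessions) raises an error instead of silently picking a file.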
6. Authors and Citation
This resource was prepared by drosofff@gmail.com (ARTbio project) in collaboration with the Gemini 2.5 language model (Google). Source data is derived from NCBI dbSNP. If you use these files in your work, please cite this Zenodo deposit.
Files (353.9 MB)
| MD5 checksum | Size |
|---|---|
| d81578727701e8a2024713a39380be3f | 175.5 MB |
| 54e8f2ce5ff468efd059771c9380ac2f | 1.7 MB |
| a7a2123c9fe19a10736d1e5c8ab30952 | 175.0 MB |
| f77f602c4508b9f3e432fec8c788adbd | 1.7 MB |
| a07866a6c483240bfcbb2886999591c9 | 3.1 kB |
| 158fcd25c67db1e57be95df69ba22346 | 3.1 kB |
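A downloaded file can be checked against the MD5 checksum shown for it on the deposit page. The sketch below uses only the Python standard library. The function names md5_of_file and verify_download are hypothetical, and the expected checksum has to be copied from the entry that corresponds to the downloaded file.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large VCF files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True when the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, verify_download("facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz", expected) returns True only if the file on disk matches the checksum passed as expected.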