There is a newer version of the record available.

Published September 29, 2025 | Version v2

Optimized SNP Reference VCFs for FACETS Analysis (hg38/GRCh38)

  • 1. Centre National de la Recherche Scientifique

Description

1. Summary

This deposit contains two reference VCF files for the human genome (hg38/GRCh38). They list common Single Nucleotide Polymorphisms (SNPs) and are specifically designed for use with allele-specific copy number analysis tools, such as FACETS.

The main feature of these files is a uniform SNP density of approximately 1 SNP per kilobase (kb), which is intended to shorten the snp-pileup pre-processing step and to reduce density-related bias in segmentation.

2. Rationale

Official SNP files (e.g., from dbSNP) present two challenges for FACETS-like analyses:

  • Excessive file size: They contain millions of rare variants, which considerably slows down the pre-processing step (snp-pileup).
  • Non-uniform density: SNP 'hotspots' with very high density can introduce bias into segmentation algorithms.

These optimized reference files solve both problems by providing a lightweight, clean, and evenly distributed set of markers.

3. Contents of the Deposit

  • facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz
    • The SNP reference for use with BAM files where chromosomes are named 'chr1', 'chr2', etc.
  • facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz.tbi
    • The Tabix index for the above file.
  • facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz
    • The SNP reference for use with BAM files where chromosomes are named '1', '2', etc.
  • facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz.tbi
    • The Tabix index for the above file.
  • uniformize_vcf_density.py
    • The Python script used to generate these reference files.
  • README.md
    • This information file.

4. Generation Workflow (Transparency)

The generation process for these files is fully reproducible:

  1. Primary Source: Data was derived from the official dbSNP b157 VCF for the GRCh38/hg38 assembly (GCF_000001405.40.gz), downloaded from the NCBI FTP server.
  2. Primary Chromosome Filtering: The VCF was first filtered to retain only the primary assembly chromosomes (1-22, X, Y, M), excluding alternate haplotypes and unplaced contigs.
  3. Initial SNP Filtering: The source VCF was subsequently filtered using bcftools to retain only common (INFO/COMMON=1), bi-allelic SNPs.
  4. Chromosome Renaming: NCBI-style chromosome names (NC_...) were converted to the two standard nomenclatures (chr and no-chr) using the official GRCh38.p14 assembly report.
  5. Density Uniformization: The uniformize_vcf_density.py script was run on the filtered files to select the most informative SNP (allele frequency closest to 0.5) within each 1 kb window.
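The density uniformization in step 5 can be sketched as follows. This is a minimal illustration, not the code of uniformize_vcf_density.py: the function name, the tuple layout, the use of fixed windows aligned to position 1, and the tie-breaking rule (the first SNP seen wins) are all assumptions, and the extraction of allele frequencies from the VCF is left out.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (chromosome, 1-based position, allele frequency).
Snp = Tuple[str, int, float]


def uniformize(snps: Iterable[Snp], window: int = 1000) -> List[Snp]:
    """Keep, for each chromosome and window, the SNP with frequency closest to 0.5."""
    best: Dict[Tuple[str, int], Snp] = {}
    for chrom, pos, af in snps:
        # Window index for 1-based positions: 1..1000 -> 0, 1001..2000 -> 1, ...
        key = (chrom, (pos - 1) // window)
        current = best.get(key)
        # Strict comparison: on a tie, the SNP encountered first is kept.
        if current is None or abs(af - 0.5) < abs(current[2] - 0.5):
            best[key] = (chrom, pos, af)
    # Dicts preserve insertion order, so coordinate-sorted input stays sorted.
    return list(best.values())


if __name__ == "__main__":
    demo = [
        ("chr1", 100, 0.10),
        ("chr1", 900, 0.45),
        ("chr1", 1001, 0.30),
        ("chr2", 50, 0.90),
    ]
    for record in uniformize(demo):
        print(record)
```

Because at most one SNP survives per window, the output density cannot exceed 1 SNP per kb; windows without any common SNP simply stay empty.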

5. Recommended Usage

  1. Download the VCF file (and its .tbi index) that matches the chromosome naming style of your BAM files.
  2. Provide this VCF file as the SNP reference to the snp-pileup tool or a Galaxy wrapper for FACETS.

Example command (snp-pileup takes the output file as a positional argument after the VCF; -g gzip-compresses the output):

snp-pileup -g -q15 -Q20 facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz pileup.csv.gz normal.bam tumor.bam
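Choosing the file that matches the BAM naming style (step 1) can be automated. The sketch below assumes the BAM header has already been dumped to text (for example with samtools view -H) and inspects its @SQ lines; the function names are hypothetical, and testing only for 'chr1' versus '1' is a simplifying assumption.

```python
from typing import Iterable, List

CHR_STYLE = "facets_reference_snps_hg38_uniform_1kb.chr_style.vcf.gz"
NO_CHR_STYLE = "facets_reference_snps_hg38_uniform_1kb.no_chr_style.vcf.gz"


def contigs_from_sam_header(header_text: str) -> List[str]:
    """Collect the SN: values of all @SQ lines in a SAM/BAM text header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def pick_reference_vcf(contigs: Iterable[str]) -> str:
    """Return the reference VCF whose chromosome naming matches the contigs."""
    names = set(contigs)
    if "chr1" in names:
        return CHR_STYLE
    if "1" in names:
        return NO_CHR_STYLE
    raise ValueError("neither 'chr1' nor '1' found among the contig names")


if __name__ == "__main__":
    header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
    print(pick_reference_vcf(contigs_from_sam_header(header)))
```

Raising an error for unrecognized naming is deliberate here: a silent mismatch between the VCF and BAM chromosome names would leave no SNP positions to match.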

6. Authors and Citation

This resource was prepared by drosofff@gmail.com (ARTbio project) in collaboration with the Gemini 2.5 language model (Google). The source data is derived from NCBI dbSNP. If you use these files in your work, please cite this Zenodo deposit.

Files (353.9 MB)

Checksum                                Size
md5:d81578727701e8a2024713a39380be3f    175.5 MB
md5:54e8f2ce5ff468efd059771c9380ac2f    1.7 MB
md5:a7a2123c9fe19a10736d1e5c8ab30952    175.0 MB
md5:f77f602c4508b9f3e432fec8c788adbd    1.7 MB
md5:a07866a6c483240bfcbb2886999591c9    3.1 kB
md5:158fcd25c67db1e57be95df69ba22346    3.1 kB
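A downloaded file can be checked against its listed MD5 checksum before use. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and since the listing here does not show which checksum belongs to which file name, the expected value has to be taken from the record page for the file in question.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the two VCF files of roughly 175 MB each.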