
Dataset Open Access

GRCh38.p13 Reference FASTA (bgzip'd with faidx)

Woon, Mark

This is derived from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz.

All non-primary sequences, meaning every record whose sequence name contains an underscore after the "chr" prefix, have been removed.

It has then been recompressed with bgzip and indexed with samtools:

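# Download the analysis-set FASTA from NCBI and decompress it to genomic.fna
curl -#fSL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz -o genomic.fna.gz
gunzip genomic.fna.gz
# Join each record onto one tab-separated line, drop records whose name contains "_", then restore line breaks
awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' genomic.fna | grep -v "^>chr\S*_" - | tr "\t" "\n" > genomic.short.fna
# Recompress with bgzip, then build the .fai and .gzi indexes
bgzip -c genomic.short.fna > reference.fna.bgz
samtools faidx reference.fna.bgz
# Bundle the FASTA with its indexes (note: -z gzip-compresses the archive even though it is named .tar)
tar -czvf GRCh38_reference_fasta.tar reference.fna.bgz reference.fna.bgz.fai reference.fna.bgz.gzi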
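
To see what the awk/grep/tr filter does, here is a minimal sketch that runs the same pipeline on a toy FASTA. The file names toy.fna and toy.short.fna and the record names and AC: tags are made up for illustration; they only mimic the naming style of the analysis set. The grep pattern uses \S, so this assumes GNU grep.

```shell
# Toy input: two primary-style records and two with an underscore in the name.
# All names and sequences here are hypothetical.
cat > toy.fna <<'EOF'
>chr1  AC:EXAMPLE1
ACGT
ACGT
>chr1_KI270706v1_random  AC:EXAMPLE2
GGGG
>chrUn_KI270302v1  AC:EXAMPLE3
TTTT
>chrM  AC:EXAMPLE4
CCCC
EOF

# 1. awk: print each record on a single line (header, then sequence lines joined by tabs)
# 2. grep: drop any record whose name has an underscore right after the chr prefix
# 3. tr: turn the tabs back into newlines
awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' toy.fna \
  | grep -v "^>chr\S*_" - \
  | tr "\t" "\n" > toy.short.fna

# Show the headers that survived the filter
grep '^>' toy.short.fna
```

Only the chr1 and chrM records remain, with their sequence lines back in the original multi-line layout.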

 

This tar file contains:

  • reference.fna.bgz
  • reference.fna.bgz.fai
  • reference.fna.bgz.gzi
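
As a rough sketch of how the three files might be used together, the helper below unpacks the archive and prints one region as FASTA. The function name fetch_region is hypothetical, and the sketch assumes samtools is on the PATH and that tar is GNU tar or bsdtar, both of which detect the gzip compression when reading with -xf.

```shell
# fetch_region ARCHIVE REGION
# Unpack ARCHIVE into the current directory and print REGION as FASTA.
# fetch_region is an illustrative helper, not part of the dataset.
fetch_region() {
    archive=$1
    region=$2
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    # The archive was written with tar -z, so it is gzip-compressed despite
    # the .tar name; -xf lets tar detect the compression itself.
    tar -xf "$archive" || return 1
    # samtools faidx reads the .fai and .gzi indexes that sit next to
    # reference.fna.bgz, so the sequence is fetched without full decompression.
    samtools faidx reference.fna.bgz "$region"
}

# Example call (not run here):
#   fetch_region GRCh38_reference_fasta.tar chrM:1-60
```

Keeping the .fai and .gzi files beside reference.fna.bgz matters: without them samtools would have to rebuild the indexes before it could serve a region.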

 

Files (882.8 MB)

Name: GRCh38_reference_fasta.tar
Size: 882.8 MB
md5:a4047175ae90e2df36900f039f1cf260
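
The listed md5 can be used to check a download before unpacking it. A minimal sketch, assuming GNU coreutils md5sum is available (macOS ships md5 instead); the function name verify_md5 is hypothetical.

```shell
# verify_md5 FILE EXPECTED
# Succeeds only if the MD5 digest of FILE equals EXPECTED.
verify_md5() {
    # md5sum prints "<digest>  <file>"; keep only the digest
    actual=$(md5sum "$1" | awk '{ print $1 }') || return 1
    [ "$actual" = "$2" ]
}

# Example call against the downloaded archive (not run here):
#   verify_md5 GRCh38_reference_fasta.tar a4047175ae90e2df36900f039f1cf260 && echo "checksum OK"
```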
