Dataset Open Access
Woon, Mark
This reference FASTA is derived from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz.
All non-primary sequences, identified as those whose `chr` sequence name contains an underscore, have been removed.
The result was then recompressed with bgzip and indexed with samtools:
curl -#fSL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz -o genomic.fna.gz
gunzip genomic.fna.gz
awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' genomic.fna | grep -v "^>chr\S*_" - | tr "\t" "\n" > genomic.short.fna
bgzip -c genomic.short.fna > reference.fna.bgz
samtools faidx reference.fna.bgz
tar -czvf GRCh38_reference_fasta.tar reference.fna.bgz reference.fna.bgz.fai reference.fna.bgz.gzi
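
To illustrate what the awk/grep/tr step above does, here is a minimal sketch that wraps the same logic in a function and runs it on a toy FASTA. The function name `filter_primary` and the toy records are assumptions made for this illustration, and `\S` is spelled as the POSIX class `[^[:space:]]` so the example does not depend on GNU grep. The awk stage joins each record onto one tab-separated line, grep drops records whose `chr` name contains an underscore, and tr restores the original line breaks.

```shell
# Read FASTA on stdin, write the filtered FASTA to stdout.
# filter_primary is a hypothetical helper name, not part of the dataset.
filter_primary() {
  # 1) awk: put each record (header plus sequence lines) on a single
  #    line, with tabs where the line breaks used to be.
  # 2) grep -v: drop records whose name starts with "chr" and contains
  #    an underscore before the first whitespace.
  # 3) tr: turn the tabs back into newlines.
  awk '{ if ((NR>1)&&($0~/^>/)) { printf("\n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("\t%s", $0); } }' \
    | grep -v '^>chr[^[:space:]]*_' \
    | tr '\t' '\n'
}

# Toy input: two records without an underscore in the name, two with.
toy_fasta() {
  printf '%s\n' \
    '>chr1 demo' 'ACGT' 'TTGA' \
    '>chr1_KI270706v1_random' 'GGGG' \
    '>chrUn_KI270302v1' 'CCCC' \
    '>chrM' 'AAAA'
}

# Only the chr1 and chrM records should be printed.
toy_fasta | filter_primary
```

The filter keys only on an underscore in the sequence name, so any sequence whose name has no underscore passes through unchanged.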
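GRCh38_reference_fasta.tar contains the bgzip-compressed FASTA (reference.fna.bgz), its samtools faidx index (reference.fna.bgz.fai), and its bgzip block index (reference.fna.bgz.gzi). Because the archive was created with `tar -z`, it is gzip-compressed despite its .tar extension.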
Name | MD5 | Size |
---|---|---|
GRCh38_reference_fasta.tar | a4047175ae90e2df36900f039f1cf260 | 882.8 MB |
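
After downloading, the tarball can be checked against the MD5 checksum listed above before it is unpacked. The following is a minimal sketch, not an official procedure: the helper names `md5_of` and `verify_md5` are hypothetical, the tarball is assumed to be in the current directory, and the example region on `chr1` is chosen only for illustration.

```shell
# Expected checksum, copied from the file listing above.
expected_md5="a4047175ae90e2df36900f039f1cf260"

# Print the MD5 hex digest of a file (hypothetical helper).
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{ print $1 }'   # GNU coreutils
  else
    md5 -q "$1"                        # macOS / BSD
  fi
}

# Succeed only if FILE has the EXPECTED digest (hypothetical helper).
# usage: verify_md5 FILE EXPECTED
verify_md5() {
  [ "$(md5_of "$1")" = "$2" ]
}

# Runs only if the tarball is present in the current directory.
if [ -f GRCh38_reference_fasta.tar ]; then
  if verify_md5 GRCh38_reference_fasta.tar "$expected_md5"; then
    echo "checksum OK"
    # GNU tar and bsdtar detect the gzip compression automatically on read.
    tar -xvf GRCh38_reference_fasta.tar
    # Fetch a small region through the bundled .fai/.gzi indexes.
    samtools faidx reference.fna.bgz chr1:1000001-1000060
  else
    echo "checksum mismatch" >&2
  fi
fi
```

Keeping the comparison in a small function makes it easy to reuse in a download script, and a mismatch leaves the archive unpacked so it can be downloaded again.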
Statistic | All versions | This version |
---|---|---|
Views | 392 | 54 |
Downloads | 799 | 283 |
Data volume | 715.8 GB | 249.8 GB |
Unique views | 345 | 51 |
Unique downloads | 638 | 238 |