Virus+ Sequence Masked Human Reference Genome (hg19)
Description
A version of the human genome (hg19), originally masked for ribosomal, plant, animal, fungal and low-entropy sequences by Brian Bushnell (Bushnell Masked Human Genome), that has been additionally masked for all regions to which viral sequences map.
The following commands were used to generate the additional virus sequence masked reference database:
1) Download all viral RefSeq and Neighbor nucleotide records:
2) Shred the downloaded viral genomes using shred.sh from the bbtools package
shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30
3) Map the shredded virus sequences to the hg19-masked human genome using bbmap.sh from the bbtools package
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz in=virus_shred.fasta.gz outm=map_human_all_viruses.sam minid=0.90
4) Mask the regions to which virus sequences mapped in the hg19-masked human genome using bbmask.sh from the bbtools package
bbmask.sh in=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz out=human_virus_masked.fasta.gz sam=map_human_all_viruses.sam
5) Remove all N's to further reduce file size using seqkit
seqkit -is replace -p "n" -r "" human_virus_masked.fasta.gz > human_virus_masked.fasta_Ns_removed.gz
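To make the parameters in steps 2 and 5 concrete, here is a minimal sketch in plain shell and awk. It does not call bbtools or seqkit. It assumes that shreds start every length minus overlap bases (55 for the values in step 2) and that a trailing piece shorter than minlength is dropped; that is one reading of the shred.sh parameters, not confirmed behavior. The function names shred_windows and strip_n are hypothetical.

```shell
# Illustrative only: shred_windows and strip_n are hypothetical helpers,
# not part of bbtools or seqkit.

# Print the 0-based start and exclusive end of each shred for a sequence
# of SEQ_LEN bases. Assumes windows advance by (length - overlap) and that
# a final piece shorter than minlength is discarded.
# Arguments: SEQ_LEN [length=85] [overlap=30] [minlength=75]
shred_windows() {
  awk -v n="$1" -v len="${2:-85}" -v ov="${3:-30}" -v minlen="${4:-75}" 'BEGIN {
    step = len - ov
    for (start = 0; start < n; start += step) {
      end = (start + len < n) ? start + len : n
      if (end - start >= minlen) print start, end
      if (end == n) break
    }
  }'
}

# Delete every N/n from the sequence lines of a FASTA stream on stdin,
# leaving header lines (starting with ">") untouched.
strip_n() {
  awk '/^>/ { print; next } { gsub(/[Nn]/, ""); print }'
}
```

Under these assumptions, shred_windows 200 prints the windows 0 85, 55 140 and 110 195; the last 35 bases are dropped because they fall below minlength=75.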
Additional References:
- http://seqanswers.com/forums/showthread.php?t=42552 for additional information on the original masking of hg19
- bbtools
- seqkit
- NCBI Virus Genome RefSeq
Files
(889.0 MB)
Name | Size |
---|---|
md5:510890f08aa07fa86c020f3057906bc9 | 889.0 MB |
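The listed MD5 digest can be used to check the download for corruption. The following is a minimal sketch, assuming GNU coreutils md5sum is installed; verify_md5 is a hypothetical helper, and downloaded_file is a placeholder because the file name is not given here.

```shell
# Hypothetical helper: succeed only if FILE's MD5 digest equals EXPECTED.
# Assumes GNU coreutils md5sum is available on the PATH.
# Arguments: FILE EXPECTED
verify_md5() {
  actual=$(md5sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
}

# Usage (downloaded_file is a placeholder for the downloaded archive):
# verify_md5 downloaded_file 510890f08aa07fa86c020f3057906bc9 && echo "checksum OK"
```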