Virus+ Sequence Masked Mouse Reference Genome (GRCm38)
Description
A version of the mouse genome (GRCm38) in which all regions matching known viral sequences have been masked.
See Virus+ Masked Human Genome for a masked human reference database.
The following commands were used to generate the virus-sequence-masked reference database:
1) Download all RefSeq and Neighbor nucleotide records:
2) Shred the downloaded viral genomes using shred.sh from the bbtools package
shred.sh in=refseq_virus_reformated.fasta out=virus_shred.fasta.gz length=85 minlength=75 overlap=30
3) Map the shredded virus sequences to the GRCm38 genome using bbmap.sh from the bbtools package
bbmap.sh ref=GRCm38.fa.gz in=virus_shred.fasta.gz outm=map_mouse_all_viruses.sam minid=0.90
4) Mask the regions of the GRCm38 genome to which virus sequences mapped, using bbmask.sh from the bbtools package
bbmask.sh in=GRCm38.fa.gz out=GRCm38_virus_masked.fasta.gz sam=map_mouse_all_viruses.sam
5) Remove all N's using seqkit to further reduce file size
seqkit -is replace -p "n" -r "" GRCm38_virus_masked.fasta.gz > mouse_virus_masked.fasta_Ns_removed.gz
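To make the effect of step 5 concrete, the sketch below strips N/n characters from the sequence lines of a toy FASTA record using only awk. It is an illustration under stated assumptions, not the command used to build the database: strip_ns is a hypothetical helper, and unlike seqkit it does not re-wrap sequence lines (lines that consisted only of N's are simply dropped).

```shell
#!/usr/bin/env bash
# Illustration only: mimic the effect of removing N's from sequence lines.
# strip_ns is a hypothetical helper, not part of seqkit or bbtools.
strip_ns() {
    awk '
        /^>/ { print; next }      # header lines are passed through unchanged
        {
            gsub(/[Nn]/, "")      # delete masked bases, ignoring case
            if (length($0) > 0)   # skip lines that were entirely N
                print
        }
    ' "$@"
}

# Toy input: one record with masked stretches in the sequence.
printf '>chr_toy\nACGTNNNNACGT\nnnTTGA\n' | strip_ns
```

Because the masked bases are deleted rather than replaced, coordinates in the resulting file no longer correspond to GRCm38 coordinates.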
Files
Checksum | Size |
---|---|
md5:39a273d9ba642c23c6a2f9986b99d203 | 792.4 MB |