Published September 8, 2025 | Version 232
Dataset Open

Mash Sketch of RefSeq Bacterial Reference Genomes

  • 1. Public Health Laboratory, Department of Health and Human Services, State of Utah

Description

The mash reference that can be downloaded from the mash documentation is for RefSeq version 70.

I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now. 

RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist.

The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. four times a year), but it is likely to be less frequent than that.

Don't hesitate to submit an issue if this needs to get updated.

I do have some prior Zenodo repositories (https://zenodo.org/records/10519852, https://zenodo.org/records/7887021, and https://zenodo.org/records/7348463) which hold the same mash sketch reference, but with the RefSeq version in the title. I'd rather have one repository that gets updated than create a new repository each time.

This is how the mash reference file was created:

# Step 1. Download Datasets and Dataformat
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat


# Step 2. Download Mash
wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar

# Step 3. Get a list of all the genomes
# Note: this also changes how some of the names are represented
datasets summary genome taxon bacteria --reference --as-json-lines | \
  dataformat tsv genome --fields accession,organism-name --elide-header | \
  sed 's/\[//g' | \
  sed 's/\]//g' | \
  sed 's/["'\'']//g' | \
  sed 's/endosymbiont of /endosymbiont_of_/g' > \
  ids.txt

# Step 4. Download the reference files and sketch them
# Note: Since this is done in GitHub Actions (GA), I need to keep everything below 30G.
# The best way to do this is to download and process each reference file individually, and then combine it into the whole.
# This obviously does not need to be followed if not under those same limitations.
# ${version} must already be set to the RefSeq release number used in the output file name.
while read line
do
  # Column 1 is the accession; columns 2 and 3 are the genus and species
  id=$(echo $line | awk '{print $1}')
  ge=$(echo $line | awk '{print $2}')
  if [ ! -n "$ge" ] ; then ge="unknown" ; fi
  sp=$(echo $line | awk '{print $3}')
  if [ ! -n "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession $id
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna ${ge}_${sp}_${id}.fasta

  if [ ! -f RefSeqSketches_${version}.msh ]
  then
    # First genome: its sketch becomes the combined file
    mash sketch ${ge}_${sp}_${id}.fasta -o RefSeqSketches_${version}
  else
    # Later genomes: sketch individually, then paste onto the combined file
    mash sketch ${ge}_${sp}_${id}.fasta -o ${ge}_${sp}_${id}
    mv RefSeqSketches_${version}.msh tmp.msh
    mash paste RefSeqSketches_${version} tmp.msh ${ge}_${sp}_${id}.msh
    rm tmp.msh ${ge}_${sp}_${id}.msh
  fi

  # Clean up to stay under the disk limit
  rm ${ge}_${sp}_${id}.fasta
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip
  rm README.md
  rm md5sum.txt
done < ids.txt
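To see what the name cleanup in Step 3 does, and how the Step 4 loop then turns a line into a file name, the same sed expressions can be run on a couple of made-up lines. This is only an illustrative sketch: the helper name clean_names, the accession numbers, and the "Example host" organism are assumptions and are not part of the actual workflow.

```shell
#!/usr/bin/env bash
# clean_names is a hypothetical helper that applies the same sed expressions
# as Step 3 to whatever arrives on stdin.
clean_names() {
  sed 's/\[//g' |
    sed 's/\]//g' |
    sed 's/["'\'']//g' |
    sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Made-up lines in the accession<TAB>organism-name layout of ids.txt.
printf 'GCF_000000001.1\t[Clostridium] scindens\n' | clean_names
printf 'GCF_000000002.1\tendosymbiont of Example host\n' | clean_names

# Split a cleaned line the way the Step 4 loop does: field 1 is the accession,
# field 2 the genus, field 3 the species ("unknown" when a field is empty).
line=$(printf 'GCF_000000001.1\t[Clostridium] scindens\n' | clean_names)
id=$(echo $line | awk '{print $1}')
ge=$(echo $line | awk '{print $2}')
sp=$(echo $line | awk '{print $3}')
echo "${ge:-unknown}_${sp:-unknown}_${id}.fasta"
```

Removing brackets and quotes keeps those characters out of the file names, and joining "endosymbiont of" with underscores keeps the whole phrase in the genus field instead of splitting it across the genus and species fields.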


To use
# download file
wget <insert url for file>

mash dist sample.fasta RefSeqSketches_<version>.msh > mash_results.txt

# These results are unsorted, so many find it useful to sort them.

sort -gk3 mash_results.txt > sorted_mash_results.txt
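Because the third column is the mash distance, the closest reference is the first line after sorting. The sketch below pulls out the best hit and lists every reference under a distance cutoff. The helper names best_hit and hits_below, the sample name sample.fa, and the 0.1 cutoff are illustrative assumptions; the three reference rows reuse values from the example output in this record.

```shell
#!/usr/bin/env bash
TAB=$(printf '\t')

# best_hit FILE: print the reference name and distance of the closest match.
# (hypothetical helper; columns are 1 first input, 2 second input,
#  3 distance, 4 p-value, 5 shared hashes)
best_hit() {
  sort -t "$TAB" -gk3,3 "$1" | head -n 1 | cut -f 2,3
}

# hits_below FILE MAX: print reference and distance for rows with distance <= MAX,
# closest first. (hypothetical helper)
hits_below() {
  awk -F '\t' -v max="$2" '$3 + 0 <= max + 0 { print $2 "\t" $3 }' "$1" |
    sort -t "$TAB" -gk2,2
}

# Small unsorted demo file standing in for mash_results.txt.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  sample.fa Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000 \
  sample.fa Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000 \
  sample.fa Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000 > "$tmp"

best_hit "$tmp"
hits_below "$tmp" 0.1
rm -f "$tmp"
```

The cutoff is only a convenience for shortening the list; what counts as a close enough match depends on the organisms involved.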

      
The sorted results should look like the following:

2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_pyogenes_GCF_900475035.1.fasta	0.0116661	0	643/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_dysgalactiae_GCF_016128095.1.fasta	0.0782587	0	107/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_canis_GCF_900636575.1.fasta	0.132399	2.34894e-153	32/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_agalactiae_GCF_001552035.1.fasta	0.164662	1.32611e-72	16/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_castoreus_GCF_000425025.1.fasta	0.174408	2.34302e-58	13/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_didelphis_GCF_000380005.1.fasta	0.182269	8.30736e-49	11/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_uberis_GCF_900475595.1.fasta	0.186761	5.62934e-44	10/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_iniae_GCF_000831485.1.fasta	0.191731	3.33152e-39	9/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_ictaluri_GCF_000188015.2.fasta	0.197292	1.75608e-34	8/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_phocae_GCF_001302265.1.fasta	0.203604	2.46548e-30	7/1000
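The columns are the two inputs, the mash distance, the p-value, and the number of shared hashes. The distance can be recomputed from the shared-hash fraction with the formula from Ondov et al. (2016), D = -(1/k) ln(2j / (1 + j)), where j is the fraction of shared hashes. The sketch below assumes a k-mer size of 21, which is the mash default and is consistent with the numbers shown above; the helper name mash_distance is hypothetical.

```shell
#!/usr/bin/env bash
# mash_distance SHARED/TOTAL [K]: recompute the mash distance from the
# shared-hash column. K defaults to 21 (an assumption; see the text above).
mash_distance() {
  awk -v frac="$1" -v k="${2:-21}" 'BEGIN {
    split(frac, parts, "/")
    if (parts[1] + 0 == 0) { print 1; exit }                 # nothing shared: maximum distance
    if (parts[1] + 0 == parts[2] + 0) { print 0; exit }      # everything shared: identical sketches
    j = parts[1] / parts[2]                                  # Jaccard estimate
    printf "%.6g\n", -log(2 * j / (1 + j)) / k
  }'
}

mash_distance 643/1000   # first row of the example output
mash_distance 107/1000   # second row of the example output
```

For the first two rows this reproduces the 0.0116661 and 0.0782587 reported in the third column, which is a quick way to confirm the k-mer size a sketch was built with.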
 

Files

Files (149.6 MB)

md5:aaa9132f7775c7482d1f7405ac7f7d2f
149.6 MB
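After downloading, the file can be checked against the MD5 checksum listed above. This is a minimal sketch that assumes GNU md5sum is available; the helper name verify_md5 is hypothetical, and the file name in the usage comment is a placeholder for whatever was actually downloaded.

```shell
#!/usr/bin/env bash
# verify_md5 FILE EXPECTED: succeed only when the MD5 of FILE equals EXPECTED.
# (hypothetical helper; a missing file yields an empty checksum and fails)
verify_md5() {
  local file=$1 expected=$2 actual
  actual=$(md5sum "$file" 2>/dev/null | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Self-contained demo on a throwaway file with a well-known checksum.
demo=$(mktemp)
printf 'hello\n' > "$demo"
verify_md5 "$demo" b1946ac92492d2347c4209f7e8a60b4d && echo "checksum OK"
rm -f "$demo"

# For the sketch file itself (placeholder file name):
# verify_md5 RefSeqSketches_<version>.msh aaa9132f7775c7482d1f7405ac7f7d2f && echo "checksum OK"
```

A mismatch usually means an interrupted download, or that a different version of the record was fetched than the one whose checksum is listed here.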

Additional details

Software

Repository URL
https://github.com/erinyoung/update_mash_dist
Programming language
Shell
Development Status
Active

References

  • Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x. PMID: 27323842; PMCID: PMC4915045.