There is a newer version of the record available.

Published October 8, 2024 | Version 226
Dataset Open

Mash Sketch of RefSeq Bacterial Reference Genomes

  • 1. Public Health Laboratory, Department of Health and Human Services, State of Utah

Description

The mash reference that can be downloaded from [the mash documentation](https://mash.readthedocs.io/en/latest/data.html) is for RefSeq version 70.

I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now. 

RefSeq updates four times a year, and I needed an easy way to create and distribute a mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist.

The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e. 4 times a year), but it is likely to be less frequent than that.

I do have some prior Zenodo repositories (https://zenodo.org/records/10519852 , https://zenodo.org/records/7887021 , and https://zenodo.org/records/7348463 ) which hold the same mash sketch reference, but the RefSeq version is in the title. I'd rather have one repository that gets updated than create a new repository each time.

This is how the mash reference file was created:

# Step 1. Download Datasets and Dataformat

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat

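# Step 2. Download Mash
wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar
# Note: the later steps call datasets, dataformat, and mash by name,
# so the downloaded binaries need to be on the PATH.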

# Step 3. Get a list of all the genomes
# Note: this also changes how some of the names are represented
datasets summary genome taxon bacteria --reference --as-json-lines | \
  dataformat tsv genome --fields accession,organism-name --elide-header | \
  sed 's/\[//g' | \
  sed 's/\]//g' | \
  sed 's/["'\'']//g' | \
  sed 's/endosymbiont of /endosymbiont_of_/g' > \
  ids.txt
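
To see what the name clean-up in Step 3 does to a record, the same sed expressions can be run on a couple of sample lines. This is only a minimal sketch: the accessions and organism names below are made up for illustration, and `normalize_names` is a hypothetical wrapper, not part of the original workflow.

```shell
# Hypothetical helper wrapping the sed clean-up from Step 3.
normalize_names() {
  sed 's/\[//g' | \
  sed 's/\]//g' | \
  sed 's/["'\'']//g' | \
  sed 's/endosymbiont of /endosymbiont_of_/g'
}

# Made-up input lines in the accession<TAB>organism-name layout.
# Square brackets are stripped from the first name.
printf 'GCF_000000000.1\t[Genus] species\n' | normalize_names
# Quotes are stripped and "endosymbiont of " is joined to the host name.
printf "GCF_000000001.1\tendosymbiont of 'Host' sp.\n" | normalize_names
```

Joining "endosymbiont of" to the following word matters later, because Step 4 takes the second and third whitespace-separated fields as genus and species.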

# Step 4. Download the reference files and sketch them
# Note: Since this is done in GitHub Actions (GA), I need to keep everything below 30G.
# The best way to do this is to download and process each reference file individually, and then combine it into the whole.
# This obviously does not need to be followed if not under those same limitations.
# Note: ${version} is the RefSeq release number and must be set before the loop runs.
: "${version:?set version to the RefSeq release number}"
while read -r line
do
  # ids.txt columns: accession, then organism name (genus, species, ...)
  id=$(echo "$line" | awk '{print $1}')
  ge=$(echo "$line" | awk '{print $2}')
  if [ -z "$ge" ] ; then ge="unknown" ; fi
  sp=$(echo "$line" | awk '{print $3}')
  if [ -z "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession "$id"
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna "${ge}_${sp}_${id}.fasta"
  if [ ! -f "RefSeqSketches_${version}.msh" ]
  then
    # first genome: start the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "RefSeqSketches_${version}"
  else
    # later genomes: sketch individually, then paste onto the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "${ge}_${sp}_${id}"
    mv "RefSeqSketches_${version}.msh" tmp.msh
    mash paste "RefSeqSketches_${version}" tmp.msh "${ge}_${sp}_${id}.msh"
    rm tmp.msh "${ge}_${sp}_${id}.msh"
  fi

  # clean up so disk usage stays low between genomes
  rm "${ge}_${sp}_${id}.fasta"
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip
  rm README.md
  rm md5sum.txt
done < ids.txt
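
The naming logic in the Step 4 loop (accession, genus, and species taken from the first three fields, with "unknown" as the fallback) can be tried on its own without downloading anything. The following is a minimal sketch under assumptions: `sketch_names` is a hypothetical helper, it uses `read` field splitting instead of the three awk calls, and the second input line is made up to show the fallback.

```shell
# Hypothetical helper: print the FASTA base name that Step 4 would use
# for each line of ids.txt read from standard input.
sketch_names() {
  # read splits on whitespace; fields beyond the species land in $rest
  while read -r id ge sp rest
  do
    # ${var:-unknown} substitutes "unknown" when the field is empty
    echo "${ge:-unknown}_${sp:-unknown}_${id}"
  done
}

# An accession and name that appear in the example results of this record.
printf 'GCF_900475035.1\tStreptococcus pyogenes\n' | sketch_names
# Made-up line with no species field.
printf 'GCF_000000000.1\tGenus\n' | sketch_names
```

These base names are what end up as the reference identifiers in the mash results, which is why the second column of the output reads like `Streptococcus_pyogenes_GCF_900475035.1.fasta`.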




To use

# download file
wget <insert url for file>

mash dist sample.fasta RefSeqSketches_<version>.msh > mash_results.txt

# These results are unsorted, so many find it useful to sort them.

sort -gk3 mash_results.txt > sorted_mash_results.txt


      
The results should look like the following:

2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_pyogenes_GCF_900475035.1.fasta	0.0116661	0	643/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_dysgalactiae_GCF_016128095.1.fasta	0.0782587	0	107/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_canis_GCF_900636575.1.fasta	0.132399	2.34894e-153	32/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_agalactiae_GCF_001552035.1.fasta	0.164662	1.32611e-72	16/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_castoreus_GCF_000425025.1.fasta	0.174408	2.34302e-58	13/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_didelphis_GCF_000380005.1.fasta	0.182269	8.30736e-49	11/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_uberis_GCF_900475595.1.fasta	0.186761	5.62934e-44	10/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_iniae_GCF_000831485.1.fasta	0.191731	3.33152e-39	9/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_ictaluri_GCF_000188015.2.fasta	0.197292	1.75608e-34	8/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_phocae_GCF_001302265.1.fasta	0.203604	2.46548e-30	7/1000
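
The columns are the sample file, the reference sketch name, the Mash distance, the p-value, and the shared hashes. A minimal sketch of pulling out the closest reference, using three rows copied from the example above; the file name `example_mash_results.txt` and the 0.1 distance cutoff are assumptions chosen only for illustration, not a recommendation.

```shell
# Three rows copied from the example results, written in shuffled order.
sample=2024CK-00429-UT-M03999-240412_contigs.fa
{
  printf '%s\t%s\t%s\t%s\t%s\n' "$sample" Streptococcus_canis_GCF_900636575.1.fasta 0.132399 2.34894e-153 32/1000
  printf '%s\t%s\t%s\t%s\t%s\n' "$sample" Streptococcus_pyogenes_GCF_900475035.1.fasta 0.0116661 0 643/1000
  printf '%s\t%s\t%s\t%s\t%s\n' "$sample" Streptococcus_dysgalactiae_GCF_016128095.1.fasta 0.0782587 0 107/1000
} > example_mash_results.txt

# Sort by distance (column 3), then keep hits under the illustrative cutoff.
sort -gk3 example_mash_results.txt | awk -F'\t' '$3 < 0.1 {print $2, $3}'

# Closest reference only.
sort -gk3 example_mash_results.txt | head -n 1 | cut -f2
```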

          

 

Files (162.7 MB)

md5:6b47e6536e53c9f002e498cb39d0418b
162.7 MB
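
After downloading, the file can be checked against the MD5 listed above. A minimal sketch, assuming GNU `md5sum` is available; `check_md5` is a hypothetical helper and the file name in the usage comment is a placeholder for whatever the downloaded file is called.

```shell
# Hypothetical helper: succeed only if FILE's MD5 matches EXPECTED.
check_md5() {
  file=$1
  expected=$2
  # md5sum prints "<hash>  <file>"; keep only the hash
  actual=$(md5sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Usage (replace the placeholder with the downloaded file):
# check_md5 <downloaded file> 6b47e6536e53c9f002e498cb39d0418b && echo "checksum OK"
```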

Additional details

Software

Repository URL
https://github.com/erinyoung/update_mash_dist
Programming language
Shell
Development Status
Active

References

  • Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x. PMID: 27323842; PMCID: PMC4915045.