Mash Sketch of RefSeq Bacterial Reference Genomes

Young, Erin

doi:10.5281/zenodo.13901153

Published October 8, 2024 | Version 226

Dataset Open


Young, Erin (Contact person)¹

1. Public Health Laboratory, Department of Health and Human Services, State of Utah

The Mash reference sketch that can be downloaded from [the Mash documentation](https://mash.readthedocs.io/en/latest/data.html) is for RefSeq version 70.

I do not inherently have a problem with RefSeq version 70, but RefSeq is well past version 200 now.

RefSeq updates four times a year, and I needed an easy way to create and distribute a Mash sketch file of the representative bacterial/prokaryotic genomes.

This is intended to be a place to hold the mash sketches from https://github.com/erinyoung/update_mash_dist.

The mash sketch file from erinyoung/update_mash_dist requires git lfs to be installed when cloning the repository, which is cumbersome for some users.

The update frequency is intended to mirror that of RefSeq (i.e., four times a year), but is likely to be less frequent than that.

I do have some prior Zenodo repositories (https://zenodo.org/records/10519852, https://zenodo.org/records/7887021, and https://zenodo.org/records/7348463) that hold the same kind of Mash sketch reference, but each has the RefSeq version in its title. I'd rather have one repository that gets updated than create a new repository each time.

This is how the mash reference file was created:

# Step 1. Download Datasets and Dataformat

wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat

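# Step 2. Download Mash
wget https://github.com/marbl/Mash/releases/download/v2.3/mash-Linux64-v2.3.tar
tar -xvf mash-Linux64-v2.3.tar
# Assumption: the tarball unpacks to mash-Linux64-v2.3/ and the tools from
# Step 1 are in the current directory; the later steps call all three by name.
export PATH="$PWD:$PWD/mash-Linux64-v2.3:$PATH"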

# Step 3. Get a list of all the genomes
# Note: this also changes how some of the names are represented
datasets summary genome taxon bacteria --reference --as-json-lines | \
  dataformat tsv genome --fields accession,organism-name --elide-header | \
  sed 's/\[//g' | \
  sed 's/\]//g' | \
  sed 's/["'\'']//g' | \
  sed 's/endosymbiont of /endosymbiont_of_/g' > \
  ids.txt

# Step 4. Download the reference files and sketch them
# Note: Since this is done in GitHub Actions (GA), I need to keep everything below 30G.
# The best way to do this is to download and process each reference file individually, and then paste it onto the combined sketch.
# This obviously does not need to be followed if not under those same limitations.
# ${version} is the RefSeq release number (226 for this record).
version=226
while read -r line
do
  # ids.txt columns: accession, then the organism name split on whitespace
  id=$(echo "$line" | awk '{print $1}')
  ge=$(echo "$line" | awk '{print $2}')
  if [ -z "$ge" ] ; then ge="unknown" ; fi
  sp=$(echo "$line" | awk '{print $3}')
  if [ -z "$sp" ] ; then sp="unknown" ; fi

  datasets download genome accession "$id"
  unzip ncbi_dataset.zip
  cp ncbi_dataset/data/*/*_genomic.fna "${ge}_${sp}_${id}.fasta"
  if [ ! -f "RefSeqSketches_${version}.msh" ]
  then
    # First genome: start the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "RefSeqSketches_${version}"
  else
    # Later genomes: sketch individually, then paste onto the combined sketch
    mash sketch "${ge}_${sp}_${id}.fasta" -o "${ge}_${sp}_${id}"
    mv "RefSeqSketches_${version}.msh" tmp.msh
    mash paste "RefSeqSketches_${version}" tmp.msh "${ge}_${sp}_${id}.msh"
    rm tmp.msh "${ge}_${sp}_${id}.msh"
  fi

  # Remove the per-genome download so disk usage stays low
  rm "${ge}_${sp}_${id}.fasta"
  rm -rf ncbi_dataset/
  rm ncbi_dataset.zip README.md md5sum.txt
done < ids.txt
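
The naming logic in Steps 3 and 4 determines the labels that later appear in Mash results, so it can help to see it in isolation. The following is a minimal sketch, not part of the original workflow: `clean_name` and `fasta_name` are hypothetical helper names, and they only repeat the `sed` and `awk` expressions shown above.

```shell
# Apply the same name clean-up as Step 3 to "accession<TAB>organism" lines on stdin.
clean_name() {
  sed -e 's/\[//g' -e 's/\]//g' -e 's/["'\'']//g' \
      -e 's/endosymbiont of /endosymbiont_of_/g'
}

# Build the per-genome FASTA name used in Step 4 from one cleaned line.
# The body runs in a subshell so the variables do not leak to the caller.
fasta_name() (
  id=$(echo "$1" | awk '{print $1}')
  ge=$(echo "$1" | awk '{print $2}')
  sp=$(echo "$1" | awk '{print $3}')
  [ -n "$ge" ] || ge="unknown"
  [ -n "$sp" ] || sp="unknown"
  echo "${ge}_${sp}_${id}.fasta"
)

# Usage:
# fasta_name "$(printf 'GCF_900475035.1\tStreptococcus pyogenes\n' | clean_name)"
```

Only the first two words of the organism name are kept, which is why strain designations do not appear in the sketch labels.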

To use

# download file
wget <insert url for file>

# compare the sample against every genome in the reference sketch
mash dist sample.fasta RefSeqSketches_<version>.msh > mash_results.txt

# These results are unsorted, so many find it useful to sort them.

sort -gk3 mash_results.txt > sorted_mash_results.txt
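
Often only the closest few references matter. As a rough sketch (the function name `top_hits` is hypothetical, and it assumes the reference labels follow the `Genus_species_GCF_accession.fasta` pattern used in this sketch file), the sort can be combined with a small `awk` filter:

```shell
# Print the n closest references (default 5) from an unsorted mash results file.
# Output columns: organism label, Mash distance, matching hashes.
top_hits() {
  LC_ALL=C sort -gk3 "$1" | awk -F'\t' -v n="${2:-5}" '
    NR <= n {
      name = $2
      sub(/_GC[AF]_.*$/, "", name)   # drop the accession and .fasta suffix
      print name "\t" $3 "\t" $5
    }'
}

# Usage:
# top_hits mash_results.txt 3
```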

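The sorted results should look like the following (tab-separated columns: sample file, RefSeq genome, Mash distance, p-value, and matching hashes):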

2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_pyogenes_GCF_900475035.1.fasta	0.0116661	0	643/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_dysgalactiae_GCF_016128095.1.fasta	0.0782587	0	107/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_canis_GCF_900636575.1.fasta	0.132399	2.34894e-153	32/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_agalactiae_GCF_001552035.1.fasta	0.164662	1.32611e-72	16/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_castoreus_GCF_000425025.1.fasta	0.174408	2.34302e-58	13/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_didelphis_GCF_000380005.1.fasta	0.182269	8.30736e-49	11/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_uberis_GCF_900475595.1.fasta	0.186761	5.62934e-44	10/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_iniae_GCF_000831485.1.fasta	0.191731	3.33152e-39	9/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_ictaluri_GCF_000188015.2.fasta	0.197292	1.75608e-34	8/1000
2024CK-00429-UT-M03999-240412_contigs.fa	Streptococcus_phocae_GCF_001302265.1.fasta	0.203604	2.46548e-30	7/1000
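
The distance and matching-hashes columns are related through the formula in the Mash paper cited below, D = -(1/k) ln(2j / (1 + j)), where j is the fraction of shared hashes. The sketch below recomputes the distance from the last column; it assumes the Mash default k-mer size of 21, which is consistent with the rows above but is not stated in this record, and `mash_distance` is a hypothetical helper name.

```shell
# Recompute the Mash distance from a matching-hashes value such as 643/1000.
# $1: shared/total hashes, $2: k-mer size (assumed default: 21)
mash_distance() {
  awk -v frac="$1" -v k="${2:-21}" 'BEGIN {
    split(frac, h, "/")
    j = h[1] / h[2]                     # Jaccard index estimate
    if (j == 0) { print 1; exit }       # no shared hashes: maximum distance
    if (j == 1) { print 0; exit }       # identical sketches
    printf "%.6g\n", -log(2 * j / (1 + j)) / k
  }'
}

# Usage:
# mash_distance 643/1000   # should be close to the 0.0116661 reported above
```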

Files

Files (162.7 MB)

Name	MD5	Size
RefSeqSketches_226.msh	6b47e6536e53c9f002e498cb39d0418b	162.7 MB
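
Because the file is fairly large, it may be worth checking the download against the MD5 listed for it. A minimal sketch, assuming GNU coreutils `md5sum` is installed (`verify_md5` is a hypothetical helper name):

```shell
# Succeed if the MD5 digest of file $1 equals the expected hex digest $2.
verify_md5() {
  actual=$(md5sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# Usage:
# verify_md5 RefSeqSketches_226.msh 6b47e6536e53c9f002e498cb39d0418b && echo "checksum OK"
```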

Additional details

Obsoletes: Dataset: https://zenodo.org/records/10519852 (URL); Dataset: https://zenodo.org/records/7887021 (URL); Dataset: https://zenodo.org/records/7348463 (URL)

Repository URL: https://github.com/erinyoung/update_mash_dist
Programming language: Shell
Development Status: Active

Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x. PMID: 27323842; PMCID: PMC4915045.

