Published October 31, 2023 | Version 15-6f452
Software Open

soedinglab/MMseqs2: MMseqs2 Release 15-6f452

  • 1. Seoul National University
  • 2. LJK-GINP
  • 3. ELKMO
  • 4. Max Planck Institute
  • 5. Max-Planck institute for biophysical chemistry
  • 6. @common-workflow-language
  • 7. @clemlab, Caltech
  • 8. Sunagawa Lab @ ETH Zürich
  • 9. DOE Joint Genome Institute
  • 10. GIST
  • 11. University of Wisconsin - Madison
  • 12. Southern University of Science and Technology

Description

MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (--prefilter-mode 1). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
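As a rough illustration of how the new mode might be invoked, the sketch below builds an `mmseqs easy-search` command line with `--prefilter-mode 1`. The file names (`query.fasta`, `target.fasta`, `result.m8`, `tmp`) and the helper `build_search_cmd` are hypothetical; only the flag and its value come from the release notes. The helper only assembles the argument list, so it can be inspected without MMseqs2 installed.

```python
import subprocess


def build_search_cmd(query, target, result, tmp_dir, prefilter_mode=1):
    """Return the argv list for an `mmseqs easy-search` call.

    prefilter_mode=1 selects the ungapped prefilter described above.
    All paths are placeholders chosen for this example.
    """
    return [
        "mmseqs", "easy-search",
        query, target, result, tmp_dir,
        "--prefilter-mode", str(prefilter_mode),
    ]


def run_search(cmd):
    """Run the command; requires MMseqs2 on PATH and existing input files."""
    subprocess.run(cmd, check=True)


cmd = build_search_cmd("query.fasta", "target.fasta", "result.m8", "tmp")
print(" ".join(cmd))
```

Calling `run_search(cmd)` would execute the search, assuming the `mmseqs` binary and the input files are present.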

Breaking

  • Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (25688290) Thanks @bbuchfink
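To make the idea of representatives that respect the identity and coverage criteria concrete, here is a toy greedy set-cover clustering in Python. It is not the MMseqs2 implementation: the function `greedy_cluster`, the `hits` input format, and the threshold names `min_seq_id` and `min_cov` are assumptions made for this sketch. The point it illustrates is that every member of a cluster meets both thresholds directly against its representative.

```python
def greedy_cluster(ids, hits, min_seq_id=0.9, min_cov=0.8):
    """Toy greedy set-cover clustering (illustrative, not MMseqs2's code).

    hits maps (a, b) -> (identity, coverage) for aligned sequence pairs.
    Returns a dict: representative -> sorted list of cluster members.
    """
    # Keep only pairs that satisfy both criteria, treated symmetrically.
    neighbours = {i: set() for i in ids}
    for (a, b), (identity, coverage) in hits.items():
        if identity >= min_seq_id and coverage >= min_cov:
            neighbours[a].add(b)
            neighbours[b].add(a)

    unassigned = set(ids)
    clusters = {}
    while unassigned:
        # Pick the sequence that covers the most unassigned neighbours;
        # ties are broken by the smallest id for determinism.
        rep = max(sorted(unassigned),
                  key=lambda i: len(neighbours[i] & unassigned))
        members = (neighbours[rep] & unassigned) | {rep}
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters


hits = {
    ("A", "B"): (0.95, 0.90),
    ("B", "C"): (0.92, 0.85),
    ("C", "D"): (0.50, 0.90),  # fails the identity threshold
}
print(greedy_cluster(["A", "B", "C", "D"], hits))
```

In this example B becomes the representative of A, B, and C because it is the only sequence linked to both others, while D stays a singleton.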

New Features and Enhancements

  • Implement additional prefilter modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9f)
  • Added createclusearchdb and mkrepseqdb modules to build cluster-search databases. These were implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458a, 80f8b0be, 542f3621, ad6dfc66, 91f2a6ac, 8310cd6b, 00190267, 76b7df1e)
  • Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32ec)
  • Rework ungappedprefilter to improve performance and expose additional parameters such as taxon filtering and db-load-mode (8a893050, 800eb094, eb01b5b7, 20d3afc7)
  • Added gappedprefilter module for Smith-Waterman prefiltering, similar to ungappedprefilter (df77d9e6)
  • Reworked pairaln for the ColabFold greedy taxonomy pairing mode (15140153)
  • Implemented experimental module for A3M filtering (167bbd12, 499bb730)
  • Implemented weighted clustering (bd080e60, b36070af, fd1837b6) Thanks @AnnSeidel
  • Precomputed indices without k-mers can be created with --index-subset (314c1f0c, 8fe3bf9b)
  • Add result2neff module to extract Neff scores (4148e093) Thanks @neftlon
  • Add ppos format-output to convertalis for count of positive substitution scores (5edc79bc) Thanks @Dohyun-s
  • Speed-up FASTA parsing in kseq.h with memchr (98406dd7) Thanks @valentynbez @kloetzl
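The ppos output counts positive substitution scores in an alignment. A minimal sketch of that kind of count follows, assuming that gapped columns are skipped and that a score of zero does not count as positive. The `SCORES` table holds only a handful of BLOSUM62 entries for illustration, and `count_positive` is a hypothetical helper, not part of convertalis.

```python
# Small excerpt of BLOSUM62, enough for the example below.
# A real tool would use the full substitution matrix.
SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("L", "I"): 2, ("K", "R"): 2, ("D", "E"): 2,
    ("A", "G"): 0, ("L", "D"): -4,
}


def score(a, b):
    """Look up a symmetric substitution score; raises KeyError if unknown."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def count_positive(query_aln, target_aln):
    """Count aligned residue pairs whose substitution score is > 0."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    positives = 0
    for q, t in zip(query_aln, target_aln):
        if q == "-" or t == "-":
            continue  # gap columns carry no substitution score
        if score(q, t) > 0:
            positives += 1
    return positives


# A-G scores 0 and the gap column is skipped; the other four pairs are positive.
print(count_positive("ALKD-W", "GIRELW"))
```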

Bugfixes

  • Add min and max modes for result2stats (19dce033, 61e77340) Thanks @ClovisG
  • Fixed a segmentation fault in ca3m with the same database (f5f780ac) Thanks @ClovisG
  • Fix crash when some input file sizes are an exact multiple of 4096 in convertalis and gff2db (712f2887) Thanks @RuoshiZhang
  • Fixed issues for GTDB r214 database creation (4b522962) Thanks @apcamargo
  • Fix source number being limited to 16-bit (65k) (1d62fa0c)
  • kseq now correctly handles input sequences larger than 2^31 bytes (07ca4a7c)
  • Fixed unpackdb to work without a .lookup file and added support for writing compressed files (92d8cc37, 570e3eda)
  • createindex --check-compatible now checks the k-mer threshold correctly (bb0a1b35)
  • Fixed excessively long prefilter result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55fa)
  • Corrected handling of multiline checks in createdb (6b938846)
  • Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b3) Thanks @AnnSeidel
  • Fixed logic in reciprocal-best-hit by removing resAB_sort (3bcbdbab) Thanks @StephanieSKim
  • Corrected handling of differently ordered parts of sequence databases in concatdbs (ea17d30f)
  • Fix --single-step-clustering misspelled in cluster warning (fa6c0938) Thanks @valentynbez
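For readers wondering why a 16-bit field caps the source number at roughly 65k, the short sketch below emulates an unsigned 16-bit counter. It assumes silent wrap-around on overflow, which is typical for fixed-width unsigned integers in C and C++; the release notes do not say how the limit actually showed up in MMseqs2, and `store_uint16` is a hypothetical helper.

```python
MAX_UINT16 = 2**16 - 1  # 65535, the largest value a 16-bit unsigned field holds


def store_uint16(n):
    """Emulate storing n in an unsigned 16-bit field that wraps on overflow."""
    return n & 0xFFFF


for n in (MAX_UINT16, MAX_UINT16 + 1, 70000):
    print(n, "->", store_uint16(n))
```

Under this assumption, any value above 65535 no longer round-trips, so two different sources could end up with the same stored number.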

Build and Compatibility Updates

  • Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad6, 3e436173, b341b663, 932d32b1) Thanks @A-N-Other
  • Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d0, 05132de1)
  • Updated regression testing to fix errors in MPI test (21137666)

Developer

  • Introduced base: prefix to enable inheriting subprojects to find shadowed modules (e.g. Foldseek shadows createdb, but can use base:createdb to call the MMseqs2 one) (90aa9133)
  • Exported build architecture in CMake so subprojects can use it (fce06b11)
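The base: prefix can be pictured as a two-level module lookup. The sketch below is a hypothetical Python model of that behaviour, not the actual C++ dispatch code: `BASE_MODULES`, `SUBPROJECT_MODULES`, and `resolve` are names invented for this example.

```python
# Hypothetical module tables: the subproject shadows "createdb".
BASE_MODULES = {
    "createdb": lambda: "base createdb",
    "search": lambda: "base search",
}
SUBPROJECT_MODULES = {
    "createdb": lambda: "subproject createdb",
}

PREFIX = "base:"


def resolve(name):
    """Return the module callable for a command name.

    "base:<name>" always resolves to the base project's module;
    otherwise the subproject's module wins if it shadows the base one.
    """
    if name.startswith(PREFIX):
        return BASE_MODULES[name[len(PREFIX):]]
    if name in SUBPROJECT_MODULES:
        return SUBPROJECT_MODULES[name]
    return BASE_MODULES[name]


print(resolve("createdb")())       # shadowing module
print(resolve("base:createdb")())  # explicitly the base module
print(resolve("search")())         # not shadowed, falls through to base
```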

Files

soedinglab/MMseqs2-15-6f452.zip

Size: 14.1 MB
md5:3ad07b9e4f3502952868ff96b0549f82
