Published October 31, 2023 | Version 15-6f452
Software
Open
soedinglab/MMseqs2: MMseqs2 Release 15-6f452
Creators
- Milot Mirdita
- Martin Steinegger [1]
- larsdriesch
- ClovisG [2]
- Eli Levy Karin [3]
- RuoshiZ
- Annika Jochheim [4]
- Clovis Norroy
- Hans-Georg Sommer
- Florian Breitwieser
- Hayden Hyunjoo Ji
- Johannes Soeding [5]
- Michael R. Crusoe [6]
- Shyam Saladi [7]
- Valentyn Bezshapkin [8]
- Antonio Fernandez-Guerra
- Antônio Camargo [9]
- Benjamin Lee
- Dohyun Kim [10]
- George Young
- Huan Fan [11]
- Luiz Irber
- Mark Wilson
- Sascha Steinbiss
- Silas Kieser
- Stephanie Kim [1]
- Tony E Lewis
- cutecutecat [12]
- neftlon
- 1. Seoul National University
- 2. LJK-GINP
- 3. ELKMO
- 4. Max Planck Institute
- 5. Max-Planck institute for biophysical chemistry
- 6. @common-workflow-language
- 7. @clemlab, Caltech
- 8. Sunagawa Lab @ ETH Zürich
- 9. DOE Joint Genome Institute
- 10. GIST
- 11. University of Wisconsin - Madison
- 12. Southern University of Science and Technology
Description
MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (`--prefilter-mode 1`). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.
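As a rough illustration of how the ungapped-prefiltering mode might be selected from a script, the sketch below assembles a command line in Python. The `--prefilter-mode 1` flag comes from the release notes; the file names are hypothetical, and passing the flag through the `easy-search` workflow is an assumption that should be checked against `mmseqs easy-search -h`.

```python
import shlex


def build_search_command(query, target_db, result, tmp_dir, prefilter_mode=1):
    """Assemble an `mmseqs easy-search` call as an argument list.

    All four paths are placeholders chosen for this example.
    """
    return [
        "mmseqs", "easy-search",
        str(query), str(target_db), str(result), str(tmp_dir),
        "--prefilter-mode", str(prefilter_mode),
    ]


cmd = build_search_command("query.fasta", "targetDB", "result.m8", "tmp")
print(shlex.join(cmd))

# To actually run the search (requires MMseqs2 on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string avoids shell quoting problems when paths contain spaces.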
Breaking
- Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (25688290) Thanks @bbuchfink
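To make the idea of representatives that respect identity and coverage criteria concrete, here is a minimal sketch of generic greedy incremental clustering. It is not the updated algorithm from commit 25688290; the function name, the thresholds, and the precomputed `hits` table are all assumptions made for illustration.

```python
def greedy_cluster(lengths, hits, min_seq_id=0.9, min_cov=0.8):
    """Toy greedy incremental clustering.

    lengths: {name: sequence length}
    hits:    {(a, b): (sequence identity, coverage)}, looked up in either order
    Returns  {representative: [members, including the representative]}.
    """
    def linked(a, b):
        seq_id, cov = hits.get((a, b)) or hits.get((b, a)) or (0.0, 0.0)
        return seq_id >= min_seq_id and cov >= min_cov

    clusters = {}
    # Longest sequences are considered first as candidate representatives.
    for name in sorted(lengths, key=lambda n: (-lengths[n], n)):
        for rep in clusters:
            if linked(rep, name):
                clusters[rep].append(name)
                break
        else:
            # No existing representative satisfies both criteria.
            clusters[name] = [name]
    return clusters


lengths = {"A": 100, "B": 95, "C": 60, "D": 58}
hits = {
    ("A", "B"): (0.95, 0.90),
    ("A", "C"): (0.95, 0.50),  # identical enough, but coverage too low
    ("C", "D"): (0.92, 0.85),
}
print(greedy_cluster(lengths, hits))
```

In this toy input, `C` is not assigned to `A` because the pair fails the coverage threshold, so `C` becomes its own representative and later absorbs `D`.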
New Features and Enhancements
- Implement additional `prefilter` modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9f)
- Added `createclusearchdb` and `mkrepseqdb` modules to build cluster-search databases. This was implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458a, 80f8b0be, 542f3621, ad6dfc66, 91f2a6ac, 8310cd6b, 00190267, 76b7df1e)
- Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32ec)
- Rework `ungappedprefilter` to improve performance and expose additional parameters such as taxon filtering and db-load-mode (8a893050, 800eb094, eb01b5b7, 20d3afc7)
- Added `gappedprefilter` module for Smith-Waterman prefiltering, similar to `ungappedprefilter` (df77d9e6)
- Reworked `pairaln` for the ColabFold greedy taxonomy pairing mode (15140153)
- Implemented experimental module for A3M filtering (167bbd12, 499bb730)
- Implemented weighted clustering (bd080e60, b36070af, fd1837b6) Thanks @AnnSeidel
- Precomputed indices without k-mers can be created with `--index-subset` (314c1f0c, 8fe3bf9b)
- Add `result2neff` module to extract Neff scores (4148e093) Thanks @neftlon
- Add `ppos` format-output to `convertalis` for count of positive substitution scores (5edc79bc) Thanks @Dohyun-s
- Speed-up FASTA parsing in `kseq.h` with memchr (98406dd7) Thanks @valentynbez @kloetzl
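The `ppos` output field is described above as a count of positive substitution scores. A minimal sketch of that idea follows; the score table is a small made-up example, and the exact counting rules used by `convertalis` (for instance, how gaps are treated) are assumptions here.

```python
# Toy symmetric substitution scores; the values are illustrative assumptions.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("K", "K"): 5,
    ("I", "L"): 2, ("K", "R"): 2,
    ("A", "L"): -1, ("D", "K"): -1,
}


def score(a, b):
    """Look up a pair in either order; unknown pairs score 0 in this toy table."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0))


def count_positives(query_aln, target_aln):
    """Count aligned columns with a positive substitution score.

    Both strings are rows of the same pairwise alignment, with '-' for gaps.
    Gap columns are skipped (an assumption of this sketch).
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned rows must have equal length")
    return sum(
        1
        for q, t in zip(query_aln, target_aln)
        if "-" not in (q, t) and score(q, t) > 0
    )


print(count_positives("ALK-A", "AIRKL"))
```

In the example alignment, the columns A/A, L/I, and K/R score positively, the gap column is skipped, and A/L scores negatively.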
Bugfixes
- Add min and max modes for `result2stats` (19dce033, 61e77340) Thanks @ClovisG
- Fixed a segmentation fault in ca3m with the same database (f5f780ac) Thanks @ClovisG
- Fix crash when some input file sizes are an exact multiple of 4096 in `convertalis` and `gff2db` (712f2887) Thanks @RuoshiZhang
- Fixed issues for GTDB r214 database creation (4b522962) Thanks @apcamargo
- Fix source number being limited to 16-bit (65k) (1d62fa0c)
- `kseq` now correctly handles input sequences larger than 2^31 bytes (07ca4a7c)
- Fixed `unpackdb` to work without a `.lookup` file and added support for writing compressed files (92d8cc37, 570e3eda)
- `createindex --check-compatible` now checks the k-mer threshold correctly (bb0a1b35)
- Fixed `prefilter` excessively long result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55fa)
- Corrected handling of multiline checks in `createdb` (6b938846)
- Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b3) Thanks @AnnSeidel
- Fixed logic in reciprocal-best-hit by removing `resAB_sort` (3bcbdbab) Thanks @StephanieSKim
- Corrected handling of differently ordered parts of sequence databases in `concatdbs` (ea17d30f)
- Fix `--single-step-clustering` misspelled in cluster warning (fa6c0938) Thanks @valentynbez
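Two of the fixes above concern integer width limits (a 16-bit source number and inputs beyond 2^31 bytes). The sketch below only demonstrates what such limits look like in general, using fixed-width C types from Python's `ctypes`. Which concrete types MMseqs2 used before and after the fixes is not stated in the notes, so the unsigned 16-bit and signed 32-bit choices are assumptions.

```python
import ctypes


def store_as_uint16(value):
    """Return what remains of `value` after storage in an unsigned 16-bit field."""
    return ctypes.c_uint16(value).value


def store_as_int32(value):
    """Return what remains of `value` after storage in a signed 32-bit field."""
    return ctypes.c_int32(value).value


# A 16-bit field can distinguish 65,536 values (0..65535); larger ones wrap.
print(store_as_uint16(65535), store_as_uint16(65536), store_as_uint16(70000))

# A signed 32-bit length tops out at 2**31 - 1 bytes; 2**31 turns negative.
print(store_as_int32(2**31 - 1), store_as_int32(2**31))
```

`ctypes` integer types perform no overflow checking, which is why the out-of-range values silently wrap in the same way a narrow C field would.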
Build and Compatibility Updates
- Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad6, 3e436173, b341b663, 932d32b1) Thanks @A-N-Other
- Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d0, 05132de1)
- Updated regression testing to fix errors in MPI test (21137666)
Developer
- Introduced `base:` prefix to enable inheriting subprojects to find shadowed modules (e.g. Foldseek shadows `createdb`, but can use `base:createdb` to use the MMseqs2 one) (90aa9133)
- Exported build architecture in CMake so subprojects can use it (fce06b11)
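The shadowing behaviour behind the `base:` prefix can be pictured as a two-level command lookup. The following is a hypothetical sketch of that lookup rule only; the `resolve` function and the dictionaries are invented for illustration and do not reflect how MMseqs2 registers its modules internally.

```python
BASE_PREFIX = "base:"


def resolve(name, base_commands, sub_commands):
    """Pick the implementation for a module name.

    A subproject's module shadows the base module of the same name,
    unless the caller asks for the base one explicitly via the prefix.
    """
    if name.startswith(BASE_PREFIX):
        return base_commands[name[len(BASE_PREFIX):]]
    if name in sub_commands:
        return sub_commands[name]
    return base_commands[name]


# Hypothetical registries: the subproject overrides only `createdb`.
base = {"createdb": lambda: "base createdb", "search": lambda: "base search"}
sub = {"createdb": lambda: "subproject createdb"}

print(resolve("createdb", base, sub)())
print(resolve("base:createdb", base, sub)())
print(resolve("search", base, sub)())
```

Modules that the subproject does not override fall through to the base registry, so the prefix is only needed for shadowed names.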
Files
Name | Size |
---|---|
soedinglab/MMseqs2-15-6f452.zip (md5:3ad07b9e4f3502952868ff96b0549f82) | 14.1 MB |
Additional details
Related works
- Is supplement to Software: https://github.com/soedinglab/MMseqs2/tree/15-6f452 (URL)