grp-bork/gunc: v1.1.0
Description
v1.0.7
Summary
^^^^^^^
This release adds support for two new reference databases (ProGenomes 3, GTDB r214) and
a custom database option. A new gunc check subcommand validates your environment before
submitting a long job, and gunc rescore is introduced as a clearer alias for
gunc summarise. In addition, a test_data database type has been added; it comprises a minimal test set (sample, db, taxonomy) that can be used in CI/CD pipelines.
A warning is now emitted when genomes have low reference representation scores.
Packaging has been modernised to pyproject.toml and the CI pipeline updated.
Features
^^^^^^^^
- Added support for progenomes_3 and gtdb_214 reference databases.
- Added support for a test_data set, a minimal set of data that can be used in CI/CD pipelines.
- Added --custom_genome2taxonomy option to allow use of a custom reference database.
- Diamond version pinned to 2.1.24; enforced at startup with a clear error message. Set GUNC_SKIP_DIAMOND_VERSION_CHECK=1 to bypass.
- Added test_data option to gunc download_db (--db test_data): downloads a minimal diamond database and two test genomes (chimeric and clean) that can be used to verify a GUNC installation end-to-end.
- Added gunc rescore as the preferred name for the summarise subcommand; gunc summarise remains as a backward-compatible alias.
- Added gunc check subcommand to validate environment (tool dependencies, database file, custom genome-to-taxonomy TSV format, output directory write access) without running the pipeline.
- All subcommands (run, plot, merge_checkm, summarise) now log the output file path on completion.
- --file_suffix error message now suggests the correct flag usage when no files are found.
- Fixed metavar="\\b" hack in summarise argparse definitions; replaced with meaningful placeholders (FILE, DIR, FLOAT).
- Documentation: added gunc summarise section with worked example; fixed --file_suffix incorrectly listed as required; fixed --gunc_file help referencing gunc_scores.tsv (actual filename is GUNC.{db}.maxCSS_level.tsv); added --custom_genome2taxonomy file format spec; added output column definitions table; updated DB names to underscore convention throughout.
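
The Diamond version pin and its bypass variable listed above can be illustrated with a minimal sketch. This is not GUNC's actual check_diamond_version() implementation: the helper names parse_diamond_version, diamond_version_ok and read_diamond_version_output are hypothetical, and the sketch assumes that diamond --version prints a line containing a dotted version number. Only the pinned version and the environment variable name come from the release notes.

```python
import os
import re
import subprocess

# Pinned version and bypass variable as named in the release notes.
REQUIRED_DIAMOND_VERSION = "2.1.24"
SKIP_ENV_VAR = "GUNC_SKIP_DIAMOND_VERSION_CHECK"


def parse_diamond_version(version_output):
    """Return the first dotted version number found in the text, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", version_output)
    return match.group(1) if match else None


def diamond_version_ok(version_output, environ=None):
    """Hypothetical helper: True if the pin is met or the check is bypassed."""
    environ = os.environ if environ is None else environ
    if environ.get(SKIP_ENV_VAR) == "1":
        return True
    return parse_diamond_version(version_output) == REQUIRED_DIAMOND_VERSION


def read_diamond_version_output():
    """Run diamond in list form (no shell involved) and return its stdout."""
    result = subprocess.run(
        ["diamond", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Keeping the parsing separate from the subprocess call means the comparison logic can be tested without Diamond installed, for example diamond_version_ok("diamond version 2.0.4", environ={}) evaluates to False.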
Bugfixes
^^^^^^^^
- Fixed summarise subcommand incorrectly marking all genomes as passing GUNC.
- Fixed pass.GUNC column being silently converted to strings in output TSV; summarise now uses proper NaN detection instead of string comparison.
- Fixed summarise not rescoring genomes with boolean False in pass.GUNC; previously only the string "False" was matched, so boolean values (the normal case) were silently skipped.
- Fixed genome identity corruption in split_diamond_output when contig names contain /; now uses rsplit to always extract the genome name from the last path segment.
- Fixed DB detection logic duplicated across three code paths with subtly different ordering; extracted into single detect_db_from_filename() function.
- Fixed prodigal() leaving partial output files on disk when gene calling fails; partial files are now removed so the caller's size check correctly excludes failed genomes.
- Fixed extract_node_data() in visualisation missing colour entries for class and order tax levels, causing KeyError when non-default --tax_levels are used.
- Extracted plot=True path from chim_score() into dedicated get_base_data_for_plotting() function; chim_score() now has a single consistent return type.
- Fixed empty diamond output files not being named correctly when a genome fails to map (thanks to @pamelaferretti).
- Fixed edge case where contamination score was incorrectly calculated when contamination portion was NaN.
- Fixed crash when no genes were called or mapped to the reference database.
- Fixed shell injection risk in get_record_count_in_fasta.
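
The pass.GUNC fixes above come down to how missing values and failures are detected. The following is a minimal sketch of that reasoning in plain Python, not the code used in summarise: the helper names is_missing and failed_gunc are hypothetical, and the sketch assumes a cell may hold a boolean, the string "False", or NaN.

```python
import math


def is_missing(value):
    """True for None or a float NaN; the string "nan" is not treated as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def failed_gunc(value):
    """Hypothetical helper: True when a pass.GUNC cell records a failure."""
    if is_missing(value):
        return False
    if isinstance(value, str):
        # Tolerate values that were stringified by an earlier write.
        return value.strip().lower() == "false"
    return not bool(value)


# A plain string comparison never matches the boolean, which is why
# genomes holding a real False would be skipped by such a check.
old_style_match = (False == "False")
```

Here old_style_match is False, whereas failed_gunc(False) and failed_gunc("False") both return True and failed_gunc(float("nan")) returns False.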
Other
^^^^^
- Removed versioneer; version is now statically set.
- Fixed 8 flake8 errors: import ordering in get_scores.py and visualisation.py, trailing whitespace in gunc.py, spurious f-string prefixes in gunc_database.py.
- Extracted CSS_CHIMERIC_THRESHOLD = 0.45 and TAX_LEVELS as named constants in get_scores.py; replaced all three scattered hardcoded copies of the threshold and tax level list across gunc.py, checkm_merge.py, and visualisation.py.
- Fixed all sys.exit(string) calls in visualisation.py and get_scores.py to use logger.error() + sys.exit(1) consistently with the rest of the codebase; added module-level logger to get_scores.py.
- Fixed add_empty_diamond_output() using print() for progress output; now uses logger.info().
- Fixed check_diamond_version() using shell=True; now uses list-form subprocess call.
- Added guard against empty gunc_output list before pd.concat() in run_gunc() to give a clear error instead of a cryptic ValueError.
- Reference data files renamed to reflect database version (e.g. genome2taxonomy_pg2.1ref.tsv).
- Documentation updated: diamond version, all four database options, --custom_genome2taxonomy flag.
- Migrated packaging from setup.py + setup.cfg + MANIFEST.in + requirements.txt to a single pyproject.toml (PEP 621); fixed package_data paths, license field (GPLv3), dropped universal=1, and added minimum version pins for numpy (>=1.20), scipy (>=1.7), and plotly (>=5.0).
- Replaced all from module import * in test files with explicit named imports; marked network-dependent tests in test_gunc_database.py with @pytest.mark.integration; added conftest.py registering the integration marker.
- Added tests for summarise(), get_scores_using_supplied_cont_cutoff(), read_genome2taxonomy_reference() (all 4 DBs + custom + unknown), split_diamond_output() round-trip, and detect_db_from_filename().
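
Registering the integration marker mentioned above is typically done with pytest's pytest_configure hook. The sketch below shows one way a conftest.py could do this; the marker description text is an assumption and the repository's actual conftest.py may differ.

```python
# conftest.py (sketch)


def pytest_configure(config):
    """Register the custom marker so pytest does not warn that it is unknown."""
    config.addinivalue_line(
        "markers",
        "integration: test needs network access (for example, a database download)",
    )
```

With the marker registered, tests decorated with @pytest.mark.integration can be deselected on machines without network access by running pytest -m "not integration".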
New Contributors
- @pamelaferretti made their first contribution in https://github.com/grp-bork/gunc/pull/53
Full Changelog: https://github.com/grp-bork/gunc/compare/v1.0.6...v1.1.0
Files
| Name | Size |
|---|---|
| grp-bork/gunc-v1.1.0.zip (md5:ce0a688eb74f6afbe15673122fd628f3) | 4.6 MB |
Additional details
Related works
- Is supplement to
- Software: https://github.com/grp-bork/gunc/tree/v1.1.0 (URL)
Software
- Repository URL
- https://github.com/grp-bork/gunc