There is a newer version of the record available.

Published February 12, 2024 | Version v2.1.0
Software Open

CDCgov/phoenix: v2.1.0

  • 1. CDC (work) + personal projects
  • 2. CDC (work) + personal projects @CDCgov @informaticslab
  • 3. Utah Public Health Laboratory

Description

v2.1.0 (02/11/2024)

Full Changelog

Implemented Enhancements:

  • Added handling for "unknown" assemblers in the scaffolds entry point so genomes can be downloaded from NCBI and run through PHoeNIx.
  • For entry points CDC_PHOENIX or PHOENIX you can now use the argument --create_ncbi_sheet to generate partially filled out Excel sheets for uploading to NCBI. You will still need to fill in some lab/sample specific information and review for accuracy, but this should speed up the process. As a reminder, please do not submit raw sequencing data to the CDC HAI-Seq BioProject (531911), which is auto-populated in these sheets, unless you are a state public health laboratory, a CDC partner or have been directed to do so by DHQP. The BioProject accession IDs in these files are specifically designated for domestic HAI bacterial pathogen sequencing data, including from the Antimicrobial Resistance Laboratory Network (AR Lab Network), state public health labs, surveillance programs, and outbreaks. For inquiries about the appropriate BioProject location for your data, please contact HAISeq@cdc.gov.
  • New Terra workflow for combining Phoenix_Summary.tsv, GRiPHin_Summary.tsv and GRiPHin_Summary.xlsx of multiple runs into one file. This workflow will also combine the NCBI Excel sheets created when using --create_ncbi_sheet.
  • software_versions.yml now contains versions for all custom scripts used in the pipeline to streamline its validation process and align it with CLIA requirements, ensuring smoother compliance.
  • MultiQC now contains graphs and data from BBDuk, FastP, Quast and Kraken. BUSCO is also part of MultiQC if the entry point runs it (i.e. CDC_* entries).
  • AMRFinder+ species that are screened for point mutations were updated with Enterobacter asburiae, Vibrio vulnificus and Vibrio parahaemolyticus.
  • A check was added to ensure only SRR numbers are passed to -entry CDC_SRA and SRA.
  • After an extensive review of QC cut-offs, additional warnings and minimum QC cut-offs were added:
    • Minimum PASS/FAIL:
      • > 500 scaffolds
      • FAIry (file integrity check) - see Fixed Bugs section below for details.
    • Warnings:
      • 200-500 scaffolds -> high, but not enough for failure
      • Taxa Quality Checks:
        • FastANI Coverage <90% and Match <95%
        • For entries that run BUSCO: BUSCO <97%
      • Contamination Checks:
        • <70% of reads/weighted scaffolds assigned to top genus hit.
        • Added weighted scaffold to kraken <30% unclassified check (was just on reads before)
        • Added weighted scaffold to kraken only 1 genus >25% of assigned check (was just reads before)
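
The cut-offs listed above can be sketched as a small classifier. This is only an illustration of the thresholds as stated in these notes, not PHoeNIx's own code: the function names and arguments are hypothetical, the 200-500 range is assumed to be inclusive, and the FastANI coverage and match cut-offs are assumed to be two independent warnings.

```python
def scaffold_status(n_scaffolds):
    """Classify a scaffold count using the cut-offs listed above."""
    if n_scaffolds > 500:
        return "FAIL"   # minimum PASS/FAIL cut-off
    if n_scaffolds >= 200:
        return "WARN"   # high, but not enough for failure
    return "PASS"


def qc_warnings(n_scaffolds, fastani_coverage, fastani_match,
                top_genus_pct, busco_pct=None):
    """Return a list of warning strings for one isolate.

    All percentages are given on a 0-100 scale. busco_pct is optional
    because only some entry points run BUSCO.
    """
    warnings = []
    if scaffold_status(n_scaffolds) == "WARN":
        warnings.append("high scaffold count (200-500)")
    if fastani_coverage < 90:
        warnings.append("FastANI coverage <90%")
    if fastani_match < 95:
        warnings.append("FastANI match <95%")
    if busco_pct is not None and busco_pct < 97:
        warnings.append("BUSCO <97%")
    if top_genus_pct < 70:
        warnings.append("<70% assigned to top genus hit")
    return warnings
```

For example, `qc_warnings(300, 85.0, 98.0, 60.0)` would flag the scaffold count, the FastANI coverage and the top genus assignment, while an isolate with more than 500 scaffolds fails outright via `scaffold_status`.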

Output File Changes:

  • The default outdir that phx produces was changed: if the user doesn't pass --outdir, output now goes to phx_output instead of results. This was changed in response to feedback from the compliance program, to avoid confusion regarding the difference between public health results (i.e. summary) and diagnostic results (i.e. report).
  • The phx_output/FAIry folder will contain a *_summaryline_failure.tsv file for any isolate where file corruption was detected.
  • *.tax file had the NCBI assigned taxID added after the : for easy lookup.
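
As a rough sketch of how the new FAIry output could be consumed downstream, the snippet below lists isolates that have a failure file under the output directory. The helper name is hypothetical, and it assumes the part of the file name before `_summaryline_failure.tsv` is the isolate name, which these notes do not state.

```python
from pathlib import Path


def failed_isolates(outdir="phx_output"):
    """List isolate names that have a *_summaryline_failure.tsv file.

    Assumes the file name prefix is the isolate name (an assumption,
    not something confirmed by the release notes).
    """
    suffix = "_summaryline_failure.tsv"
    fairy_dir = Path(outdir) / "FAIry"
    if not fairy_dir.is_dir():
        return []
    return sorted(p.name[:-len(suffix)] for p in fairy_dir.glob("*" + suffix))
```

If a custom --outdir was passed to the pipeline, that path would be given as `outdir` instead of the default.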

Fixed Bugs:

  • Updated tower.yml file to reflect file name changes in v2.0.2. This will enable nf-tower reports to properly show up. commit e1b2b91
  • GRiPHin_Summary.xlsx was highlighting coverage outside 40-100x despite the --coverage setting; highlighting now respects the --coverage flag.
  • Added a fix to handle cases where auto-selection by the mlst script chooses the wrong taxonomy. PHoeNIx will force a rerun in cases where the taxonomy is known but the initial mlst is run against an incorrect scheme. Known instances found so far include: E. coli (Pasteur) being incorrectly identified as Aeromonas and E. coli (Pasteur) being identified as Klebsiella. The scoring in the MLST program was updated and can now cause lower count perfect hits (e.g. 6 of 6 Aeromonas genes at 100%) to be scored higher than novel correct hits (e.g. 7 of 8 at 100%, 1 novel gene).
  • Corrected an issue where, in some cases when an mlst scheme could not be determined, a proper output file was not created.
  • Fixed issue with MLST where certain characters in the filename would cause an array index out of bounds error.
  • Fixed issue where samples that failed SPAdes did not have --coverage parameter respected when generating synopsis file.
  • Fixed -entry CDC_SCAFFOLDS providing incorrect headers (missing BUSCO and BUSCO_DB).
  • Updated FAIry (file integrity check) to catch additional file integrity errors.
    • FAIry detects and reports when:
      • Corrupt fastq files prevent gzip and zcat from completing; a synopsis file is generated when needed.
      • R1/R2 fastqs do not have an equal number of reads.
      • No reads are left after read trimming, or no scaffolds are left after filtering.
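
The first two kinds of check described above can be illustrated with a minimal sketch. This is not FAIry's actual implementation: it uses Python's gzip module in place of gzip and zcat, the function names and returned labels are hypothetical, and it assumes standard four-line FASTQ records.

```python
import gzip
import zlib


def count_fastq_reads(path):
    """Return the read count of a gzipped FASTQ file.

    Returns None if the file cannot be fully decompressed, which is
    treated here as a sign of corruption or truncation.
    """
    lines = 0
    try:
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
    except (OSError, EOFError, zlib.error):
        return None
    return lines // 4  # assumes four lines per FASTQ record


def check_pair(r1_path, r2_path):
    """Classify a read pair as 'corrupt', 'unequal_reads', 'no_reads' or 'ok'."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 is None or n2 is None:
        return "corrupt"
    if n1 != n2:
        return "unequal_reads"
    if n1 == 0:
        return "no_reads"
    return "ok"
```

Reading each file to the end is what exposes truncation, since a gzip stream that stops early only raises an error once the decompressor reaches the missing data.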

Container Updates:

Database Updates:

Files

CDCgov/phoenix-v2.1.0.zip (270 Bytes)
md5:22d72ee8b4bcf54e7f8433bc31c3c5fd

Additional details

Related works

Is supplement to
Software: https://github.com/CDCgov/phoenix/tree/v2.1.0 (URL)