ncov-recombinant v0.4.2 - v0.5.0

Test Summary Package

This report was automatically generated on October 3, 2022.

Authors

Katherine Eaton, National Microbiology Laboratory, PHAC

1. Summary

The ncov-recombinant update from v0.4.2 to v0.5.0 has two major changes. The first is increased flexibility in creating and defining sc2rf modes, which allows sc2rf to run with different parameter sets for breakpoint detection. The second change is a Nextclade upgrade to the sars-cov-2 2022-09-27 dataset, along with validation of all designated recombinants in this dataset (XA to XBC).

Between v0.4.2 and v0.5.0, 47.5% of sequences in the controls-gisaid dataset had different detection results. 16.2% of sequences were newly classified (NA → X*) and represent lineages not present in the v0.4.2 model. 31.3% of sequences had lineage assignment changes as a result of the Nextclade dataset upgrade and manual curation of previously published breakpoints. No positive controls (0%) were dropped between versions (X* → NA), indicating no observed loss in sensitivity.

ncov-recombinant v0.5.0 is a strongly recommended upgrade for monitoring existing recombinants and performing routine surveillance for emerging lineages, given the high proportion of sequences (47.5%) whose detection results changed.

For a comprehensive summary of the methodological changes, please see the release notes for v0.5.0.

2. Purpose

Verify that the update of the ncov-recombinant pipeline from version 0.4.2 to 0.5.0:

  1. Maintains specificity for recombinants trained in previous versions.
  2. Increases sensitivity for newly designated recombinants.

3. Datasets

Controls (controls-gisaid)

This dataset includes SARS-CoV-2 genomes from GISAID that reflect the known diversity of recombinant sequences to date. These include 431 positive controls (recombinants), representing lineages XA to XBC, and 186 negative controls (non-recombinants) selected from the Nextstrain Reference Phylogeny.

In total, 617 control sequences were used as input and a strain list is available here.

4. Procedure

The snakemake pipelines for v0.4.2 and v0.5.0 were run independently on the same dataset (controls-gisaid). Please see the Procedure section in the Supplementary for detailed command-line instructions.

5. Results

Controls (controls-gisaid)

Note: Lineage assignments in v0.5.0 are identical to those in pango-designation and are the expected values.

Figure 1: Comparison of lineage assignments in the controls-gisaid dataset between v0.4.2 and v0.5.0.
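
The categories used in this comparison (new detection, lineage change, dropped, unchanged) can be illustrated with a minimal sketch. It assumes each strain has one lineage call per version, with `None` standing in for NA (not detected as a recombinant). The function name, strain names and calls below are hypothetical and are not taken from the controls-gisaid results or from the pipeline's own code.

```python
from collections import Counter


def classify(old, new):
    """Label one sequence by how its lineage call changed between versions.

    `None` stands for NA, i.e. not detected as a recombinant.
    """
    if old is None and new is None:
        return "negative"
    if old is None:
        return "new detection"  # NA → X*
    if new is None:
        return "dropped"  # X* → NA
    if old != new:
        return "lineage change"
    return "unchanged"


# Hypothetical calls (strain -> lineage), for illustration only.
calls_old = {"s1": "XM", "s2": None, "s3": "XE", "s4": "XN"}
calls_new = {"s1": "XAL", "s2": "XBB", "s3": "XE", "s4": "XAR"}

counts = Counter(classify(calls_old[s], calls_new[s]) for s in calls_old)
percent = {k: round(100 * v / len(calls_old), 1) for k, v in counts.items()}
print(percent)
```

In this toy input, two of four strains change lineage and one is newly detected, so the share of sequences with different detection results is the sum of those two categories, in the same way that 16.2% and 31.3% add up to 47.5% in the summary.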

New Detections

New detections (NA → X*) result from the following changes in v0.5.0:

  1. New parameters for recombination involving clades other than Delta and Omicron: XA, XB, XC.
  2. New parameters for recombination occurring within clades (e.g. BA.2): XBB.
  3. New edge case handling: XN, XP, XAR, XAS, XAZ.
  4. Removed restrictions on the number of breakpoints: XAY, XBA, XBC.
  5. Clade definition updates to include recombination involving 22C (BA.2.12): XAJ.

Lineage Changes

Lineage changes result from the following updates in v0.5.0:

  1. Curation of published breakpoints.

  2. Nextclade dataset updates.

Why were sequences of XAL assigned to XM rather than XM-like in v0.4.2?

XAL is almost identical to XM, with the same hotspot breakpoint (17411:19954) and the same high-confidence parental lineages (BA.1.1*,BA.2*; confidence: 0.994,0.996). Before XAL was designated, ncov-recombinant had no way to detect that sequences of XAL belonged to a distinct cluster from XM. Furthermore, XAL differs from XM by only two mutations (A2865G, G21586T), which is insufficient evidence for ncov-recombinant to call these sequences XM-like (by default, a minimum of three mutations is required). Finally, it is unclear whether XAL emerged from a unique recombination event or is a sublineage within XM. For more information, please see pango-designation issue XAL #757.
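
The "-like" rule described above can be sketched as a simple threshold check. This is a hedged illustration, not the pipeline's implementation: the function name, the constant name and the third mutation in the second call are hypothetical, and only the default minimum of three mutations and the two XAL mutations come from the text.

```python
# Default threshold stated in the text; the constant name is an assumption.
MIN_PRIVATE_MUTATIONS = 3


def label_lineage(closest_lineage, private_mutations,
                  min_private=MIN_PRIVATE_MUTATIONS):
    """Return 'X-like' only when enough mutations separate the sample from X.

    Hypothetical helper for illustration; not ncov-recombinant's own code.
    """
    if len(private_mutations) >= min_private:
        return f"{closest_lineage}-like"
    return closest_lineage


# XAL differs from XM by two mutations, which is below the threshold.
print(label_lineage("XM", ["A2865G", "G21586T"]))

# With a third (made-up) mutation, the same sample would be flagged as XM-like.
print(label_lineage("XM", ["A2865G", "G21586T", "C100T"]))
```

Under these assumptions the first call stays at XM, which matches the v0.4.2 behaviour described for XAL.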

Why were sequences of XAR assigned to XN rather than XN-like in v0.4.2?

XAR and XN are handled as special cases by ncov-recombinant, because their breakpoints lie at the extreme 5’ end of the genome (2834:4183) with few diagnostic alleles from a second parent (BA.1). sc2rf often cannot detect their breakpoints and parents, so before XAR was designated, ncov-recombinant could not differentiate the two lineages. For more information, please see ncov-recombinant issues XN #137, XAR #106, #74, and #90.

Why were sequences of XAP assigned to XZ rather than XZ-like in v0.4.2?

XAP is almost identical to XZ, with the same hotspot breakpoint (26061:26529) and the same parental lineages (BA.2*,BA.1.1*; confidence: 0.999,0.544). See the above discussion on XAL for more information.

Figure 2: Historical timeline of recombinants in the controls-gisaid dataset in v0.4.2.
Figure 3: Historical timeline of recombinants in the controls-gisaid dataset in v0.5.0.
Figure 4: Breakpoint distributions by clade of the controls-gisaid dataset in v0.5.0.

Supplementary

Note: Download the GISAID sequences and metadata listed in the strain list.

Procedure

Version 0.4.2 | 37f40480

Note: A commit hash (37f40480) is used instead of the tag (v0.4.2) in order to include an important bugfix that was introduced between v0.4.2 and v0.4.3.

  1. Download the pipeline.

    git clone --recursive https://github.com/ktmeaton/ncov-recombinant.git 0.4.2
    cd 0.4.2
    git checkout 37f40480
  2. Version control submodules.

    cd sc2rf
    git checkout 2852f05a
    cd ..
  3. Create a version-controlled conda environment.

    mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.4.2
  4. Create profile for controls-gisaid.

    scripts/create_profile.sh --data data/controls-gisaid --hpc
  5. Manually change MIN_LINEAGE_SIZE in scripts/linelist.py to 5.

  6. Run the pipeline.

    scripts/slurm.sh --conda-env ncov-recombinant-0.4.2 --profile my_profiles/controls-gisaid-hpc

Version 0.5.0 | e90f5ac1

  1. Download the pipeline.

    git clone https://github.com/ktmeaton/ncov-recombinant.git 0.5.0
    cd 0.5.0
    git checkout v0.5.0
  2. Create a version-controlled conda environment.

    mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-0.5.0
  3. Run the pipeline.

    scripts/slurm.sh --conda-env ncov-recombinant-0.5.0 --profile my_profiles/controls-gisaid-hpc

Comparison

After the pipelines are complete for each version, run the following to compare lineage assignments.

python3 0.5.0/scripts/compare_positives.py \
  --positives-1 0.4.2/results/controls-gisaid/linelists/positives.tsv \
  --positives-2 0.5.0/results/controls-gisaid/linelists/positives.tsv \
  --ver-1 "v0.4.2" \
  --ver-2 "v0.5.0" \
  --outdir compare/controls-gisaid \
  --node-order alphabetical \
  --min-link-size 1

New Lineages

csvtk cut -t -f "strain" 0.4.2/results/controls-gisaid/linelists/positives.tsv \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - -v 0.5.0/results/controls-gisaid/linelists/positives.tsv \
  | csvtk cut -t -f "strain" \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - 0.4.2/results/controls-gisaid/linelists/linelist.tsv \
  | csvtk pretty -t \
  | less -S

Dropped Lineages

csvtk cut -t -f "strain" 0.5.0/results/controls-gisaid/linelists/positives.tsv \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - -v 0.4.2/results/controls-gisaid/linelists/positives.tsv \
  | csvtk cut -t -f "strain" \
  | tail -n+2 \
  | csvtk grep -t -f "strain" -P - 0.5.0/results/controls-gisaid/linelists/linelist.tsv \
  | csvtk pretty -t \
  | less -S
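
As read here, the two csvtk pipelines above amount to set differences on the strain column of the two positives.tsv files. The sketch below shows the same idea with Python's standard csv module. The in-memory tables and strain names are hypothetical stand-ins for the real files, and `strains` is an illustrative helper rather than part of the pipeline.

```python
import csv
import io


def strains(tsv_text):
    """Return the set of values in the 'strain' column of a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["strain"] for row in reader}


# Toy stand-ins for the two positives.tsv files (hypothetical strain names).
positives_old = "strain\tlineage\nA\tXM\nB\tXE\n"
positives_new = "strain\tlineage\nA\tXAL\nB\tXE\nC\tXBB\n"

old = strains(positives_old)
new = strains(positives_new)

new_detections = sorted(new - old)  # positive in the newer version only
dropped = sorted(old - new)  # positive in the older version only
print(new_detections, dropped)
```

To run this on real output, the strings could be replaced by the contents of each version's results/controls-gisaid/linelists/positives.tsv.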