Published April 28, 2022 | Version 2
Other Open

Guidance document for cluster analysis of whole genome sequence data

  • 1. National Veterinary Institute
  • 1. Istituto Superiore di Sanità
  • 2. French Agency for Food, Environmental and Occupational Health & Safety
  • 3. National Institute for Public Health and the Environment
  • 4. Technical University of Denmark
  • 5. Swedish Food Agency - Livsmedelsverket

Description

The continuous implementation of whole genome sequencing (WGS) by laboratories across the EU has enabled new approaches for European surveillance and cross-country outbreak investigations. Laboratories face many choices when analysing WGS data. Some of these choices affect the end results, while others affect the practical application of the results, for example when data are not comparable or when there are no tools or agreed conventions for communicating the data. This document has been produced in the framework of the Inter-EURLs working group on next generation sequencing (Inter-EURLs WG on NGS). It aims to inform and support National Reference Laboratories (NRLs) in choosing methods for so-called cluster analysis, in which genomes are compared and the results are then visualised to allow interpretation of how closely the genomes are related to each other. The document currently focuses on bacterial pathogens represented by the European Union Reference Laboratories (EURLs) of the WG, as these methods are not yet applied to the same extent for viruses or parasites.


Broadly, the most common comparison approaches can be divided into (i) the single nucleotide polymorphism (SNP) approach, where individual mutations are used as separate phylogenetic markers, and (ii) the gene-by-gene approach, where each variant of a gene is considered an allele. Both approaches are introduced in the next two sections, 2.1 and 2.2, and chapter 3 describes the main differences between them. Both approaches involve several analysis steps, each depending on bioinformatic scripts or software, and every step can affect the end results. These steps may include, for example, read trimming, assembly, read mapping, alignment, variant calling, allele calling and dendrogram/tree production. Both freely available and commercial software solutions exist for these steps. Which tools or software a laboratory chooses will depend heavily on previous experience as well as national and financial preferences. Chapters 4 and 5 provide technical information on each approach and list software, including those used by the EURLs and/or the NRLs of the EURL networks of the WG on NGS, but do not rank or recommend one software over another. An alternative comparison approach is based on estimation of k-mer distances. This is summarised in section 2.3.
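To make the idea of a k-mer distance concrete, the following is a minimal sketch that compares two sequences by the Jaccard distance between their complete k-mer sets. It is an illustration under stated assumptions, not the method described in section 2.3: the function names, the default k and the use of full (unsampled) k-mer sets are all hypothetical choices, and production tools typically work on subsampled sketches of much longer k-mers.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """Jaccard distance between the k-mer sets of two sequences.

    0.0 means identical k-mer content, 1.0 means no shared k-mers.
    The default k=4 is only suitable for toy input (an assumption).
    """
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        # Both sequences are shorter than k, so nothing can be compared.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Two toy sequences that differ only in their last base.
    print(jaccard_distance("ACGTA", "ACGTT", k=4))
```

Because the comparison needs neither a reference genome nor a gene scheme, a distance of this kind can be computed directly on assemblies or reads, at the cost of not identifying which positions or genes differ.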


Users need a solid knowledge of the software and methodology in order to produce correct and comparable results. Further, when setting up the method, the different steps of analysis should be evaluated for each pathogen, sequencing machine and software intended for use. Validation of all steps of the end-to-end WGS workflow is described in the document ‘Guidance document for WGS benchmarking’, also produced by the Inter-EURLs WG on NGS. All deliverables produced by the Inter-EURLs WG on NGS can be reached from the EURL websites.

SNP approach
Analysing WGS data by identifying SNPs that vary among isolates is generally regarded as the method with the highest resolution for relatedness studies. SNPs can be very informative markers when analysed correctly. Several solutions exist for identifying SNPs, including many so-called “SNP pipelines”, which typically combine standalone bioinformatics tools into a workflow that generates a compilation of SNP differences and sometimes also includes phylogenetic visualisation. Experienced bioinformaticians can also build customised SNP pipelines. The most common approach is to determine SNPs by comparing WGS data from isolates to a reference genome. However, there are also approaches that use no reference genome and approaches that use several reference genomes. SNP identification is usually done by mapping the sequence reads to the reference using read-mapping software. Variant calling software is then used to determine the SNPs (relative to the reference), and the variants for each of the isolates are then combined into a format that allows analysis of phylogenetic relatedness. Some approaches can use, or require, assembled genomes instead of sequence reads as input. There are typically some quality filtering steps, which are very important to avoid calling false SNPs. The lack of consensus on how to apply these filtering criteria, together with the multitude of read mappers, aligners, variant callers and tree-producing algorithms, makes SNP analysis difficult to standardise. Analysing a large dataset with SNP analysis can be computationally intensive and may therefore be time consuming, depending on the available computational capacity. A schematic view of the fundamental steps in the SNP approach is presented in Figure 1.
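The final step described above, combining per-isolate variants into a form suitable for relatedness analysis, can be sketched as a pairwise SNP distance matrix. The example below is a simplified illustration, not any specific pipeline: it assumes that read mapping, variant calling and quality filtering have already produced one consensus sequence per isolate, all aligned to the same reference coordinates. The function names and the convention of skipping positions marked `N` or `-` are assumptions made for this sketch.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an ambiguous or missing base
    (any character in `ignore`) are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def snp_matrix(alignment):
    """Return pairwise SNP distances keyed by sorted isolate-name pairs."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


# Toy consensus sequences (hypothetical data, not real isolates).
example_alignment = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACGAACNTAT",
}

if __name__ == "__main__":
    for pair, distance in snp_matrix(example_alignment).items():
        print(pair, distance)
```

How masked positions are handled is one of the filtering decisions that can change the reported distances, which illustrates why differing filtering criteria make results hard to compare between laboratories.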

Gene-by-gene approach
The gene-by-gene approach is essentially a multilocus sequence typing (MLST) analysis scaled up to include up to thousands of genes or parts of genes [1]. This extended MLST is often referred to as core genome (cg) MLST (using a conserved core of target genes found in nearly all strains of a species) or whole genome (wg) MLST (using all genes found in the strains used to create the allele database). For the gene-by-gene approach, instead of a reference genome, the user supplies a gene target list, which is usually called the cg/wgMLST scheme. This is either a list of conserved core genes (cgMLST) or a list of both conserved and accessory genes (wgMLST). The gene-by-gene method usually accepts assembled genomes as input. Analysis is performed by aligning the gene targets (from the cg/wgMLST scheme) to the assembly and extracting the isolate’s allele sequences. An alternative strategy is to skip the assembly step and identify alleles by mapping reads directly to the target genes. When a new allele sequence is identified for a gene, it is assigned an integer identifier, which is incremented by 1 for each new allele. This is referred to as allele calling and can, together with the assembly process, be time consuming depending on the computational capacity. However, once allele calling has been done for an isolate, it does not have to be performed again for that isolate. Thus, if the user wants to add genomes to the analysis at a later stage, allele calling is only needed for the new genomes. The result from a cg/wgMLST run is a table of integers or a dissimilarity matrix, which makes the subsequent cluster analysis computationally trivial. A schematic view of the fundamental steps in the gene-by-gene approach is presented in Figure 1.
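The allele numbering and the resulting integer table can be illustrated with a small sketch. It assumes that the allele sequences have already been extracted from each assembly for every locus in the scheme. The `AlleleDatabase` class, the helper functions and the rule of ignoring loci missing from either profile are hypothetical simplifications, not the behaviour of any particular cg/wgMLST software.

```python
class AlleleDatabase:
    """Toy allele database: maps each locus to its known allele sequences."""

    def __init__(self):
        # locus name -> {allele sequence: integer allele number}
        self._alleles = {}

    def call(self, locus, sequence):
        """Return the allele number, assigning the next integer if new."""
        known = self._alleles.setdefault(locus, {})
        if sequence not in known:
            known[sequence] = len(known) + 1
        return known[sequence]


def allele_profile(db, isolate_alleles):
    """Convert {locus: sequence} for one isolate into {locus: allele number}."""
    return {locus: db.call(locus, seq) for locus, seq in isolate_alleles.items()}


def allele_distance(profile_a, profile_b):
    """Count loci with different allele numbers.

    Loci absent from either profile are ignored (an assumption of this sketch).
    """
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


if __name__ == "__main__":
    db = AlleleDatabase()
    # Hypothetical loci and sequences, for illustration only.
    first = allele_profile(db, {"locusA": "ATGAAA", "locusB": "ATGGGC"})
    second = allele_profile(db, {"locusA": "ATGAAA", "locusB": "ATGGGT"})
    print(first, second, allele_distance(first, second))
```

Once stored, a profile is never recomputed: adding a new isolate only requires calling alleles for that isolate and comparing integers, which is why the clustering step is cheap compared with assembly and allele calling.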

Files

Biorisks EURLs WG on NGS - Del5_Guidelines_cluster_analysis-Segerman-20220428-v2.pdf