Can-Seq: a PCR and DNA sequencing strategy for identifying new alleles of known and candidate genes


	Discussion



	Optimisation of Can-Seq bioinformatics workflow

The Can-Seq bioinformatics workflow ( https://github.com/Carroll-Lab/can_seq) was designed to be as simple and user-friendly as possible, and can be installed and operated on a consumer-grade PC. A single command instigates the workflow, which processes Can-Seq raw read files and GenBank-format reference files, carries out alignments, identifies variant nucleotides above the threshold frequency and the consequent amino acid changes, and outputs these data as both CSV and annotated GenBank files.
To date, we have produced Can-Seq libraries based on up to 23 mutants in a bulk DNA template for up to 32 candidate genes. In view of the potential risk of false positives, we feel we have approached the limit for the number of independent mutants that can be included in a single Can-Seq library. However, the number of candidate genes per Can-Seq library could be expanded further depending on the depth of sequencing. The bioinformatics workflow requires no modification should this be the case, and is independent of the number of candidate genes included and analysed in each Can-Seq library.


	Can-Seq identifies mutants carrying causative rtp mutations in novel genes

Can-Seq identified 63 candidate rtp mutations in homozygous or heterozygous configurations in one of 40 independent rtp mutants analysed (Additional file3: Table S3). In total, 12 rtp mutants lacked a homozygous candidate mutation and 28 rtp mutants carried one or more homozygous candidate mutations (Additional file4: Table S4). However, in the case of EMS#146, the homozygous candidate missense mutation in RDR6 (P1073L) was not the causative mutation (Table1; Fig.3; Additional file7: Table S5). Thus, 13 of the 40 rtp mutants either did not carry a homozygous candidate mutation, or carried a homozygous candidate mutation that was not linked to the causative mutation (i.e. EMS#146; Additional file4: Table S4). Two of the rtp mutants carried candidate mutations in DCL2, which we subsequently showed by complementation tests to be the causative mutations in these mutants ([11]; Table1). These were the first dcl2 mutants to be recovered in a forward genetic screen, and the first demonstration that DCL2 plays a crucial role in systemic RNAi in Arabidopsis [11]. Thus, Can-Seq can not only be used to recover new alleles of known genes for a trait of interest, but also to uncover a role for additional candidate genes not previously implicated in the trait.
The 13 rtp mutants that either did not carry a homozygous candidate mutation, or carried a homozygous candidate mutation that was not linked to the causative mutation, therefore must harbour a causative mutation in a novel gene required for systemic PTGS (Additional file4: Table S4). In future studies, these novel rtp mutations will be identified by using a map-based gene cloning approach, which would initially involve whole genome sequencing of bulk DNA from BC1F2 mutants to map the chromosomal vicinity of the causative mutation in each mutant ([19]; Fig.4). Reverse genetics will then be used to identify the novel gene(s) involved in systemic PTGS (Fig.4).


	Advantages of Can-Seq over exome capture sequencing

Currently, the main alternative to Can-Seq for identifying new alleles of candidate genes is whole genome or exome capture sequencing [8]. However, Can-Seq offers several advantages over exome capture sequencing, including (i) simpler and faster template preparation, (ii) lower costs per independent mutants (Additional file8: Table S6), (iii) greater sequencing accuracy due to greater sequencing depth; Can-Seq delivered at least 60X coverage for each candidate gene compared to 20X coverage by exome sequencing [8], and (iv) less computationally intensive to mine and identify candidate mutations.
Particularly important for small budgets is the advantage of Can-Seq over whole genome or exome capture sequencing in cost per mutant (Additional file8: Table S6). Excluding labour costs, the total cost of using Can-Seq to identify candidate rtp mutations in 40 mutants (including PCR, Illumina HiSeq 2000 sequencing and identifying the mutant carrying each candidate mutation) was conservatively estimated to about US$89 per mutant (Additional file8: Table S6). The cost of whole genome or exome sequencing per mutant is conservatively estimated at ~ US$300 [20]. Thus, the cost of Can-Seq was about three times lower than the cost of whole genome sequencing or exome capture sequencing per mutant.


	Methods



	Growth and EMS mutagenesis of Arabidopsis

Arabidopsis seedlings and plants were grown in long-day conditions (16 h light, 8 h dark) under fluorescent lighting (70–80 μmol/m [2]/s) at 21 °C. Seeds from the transgenic reporter line 10027-3 wild type in a Columbia (Col-0) genetic background were mutagenized with ethyl methanesulfonate (EMS) as previously described [21]. Approximately 30,000 M1 seeds were treated, then germinated on UC mix soil. M2 seeds were collected from batches of 100 M1 plants, then approximately 100 M2 seeds from each batch were sown on UC mix and screened for defects in systemic silencing of GFP at two weeks post-germination [11]. Only one mutant was selected per batch of 100 M1 plants, and so all 40 rtp mutants used in this study arose from independent mutation events.


	Photography and image analysis

Plant photographs were taken with an EOS 600D (Canon) digital camera with an orange filter for GFP visualisation. Blue light illumination was provided by six Dark Reader Hand Lamps (Clare Chemical Research). Images were uniformly adjusted for brightness and contrast with Adobe Bridge CS6.


	High-quality genomic DNA extraction

DNA extractions were performed on leaf tissue from individual mutants or on bulked leaves (each approximately 0.25 cm [2]) from up to 23 individual mutants. High-quality genomic DNA was extracted using the protocol of Carroll et al. [22], but with some modifications. Leaves were crushed using a plastic rod in a 2 mL Eppendorf tube containing liquid nitrogen, followed by the addition of 500 µL of pre-warmed nuclear lysis buffer (NLB; 0.2 M Tris–HCl pH 7.5, 0.05 M EDTA, 2 M NaCl, 2% (w/v) hexadecyltrimethylammonium bromide (CTAB), 0.6% (w/v) sodium sulfite) and 100 µL 5% (w/v) sarkosyl (N-lauryl-sarcosine). The tubes were then sealed and inverted gently 10 times. After incubation at 65 °C for 1 h, including inverting the tubes once every 10 min, 850 µL of phenol:chlorophorm:isoamyl alcohol (25:24:1) was added and mixed by inverting 60 times before centrifuging the tubes at full speed for 5 min. The aqueous phase containing the DNA (350 µL) was transferred to a clean 1.5 mL Eppendorf tube and the DNA was precipitated by the addition of 350 µL of isopropanol. DNA was pelleted by centrifugation at full speed for 5 min and washed twice in 500 µL of 70% ethanol. After removing all ethanol and air-drying in a laminar flow cabinet for 5 min, the DNA was resuspended in 50 µL of Tris–EDTA (10 mM Tris–HCL, 0.1 mM EDTA).


	PCR amplification for preparing PCR products for sequencing

For each candidate gene, PCRs were carried out using Phusion High Fidelity DNA polymerase (NEB). Each reaction contained 0.5 μM of each primer, 200 μM dNTPs, 3% DMSO, 0.4 U of Phusion polymerase, 1 × Phusion HF buffer, and approximately 100 ng of template DNA in a final volume of 20 μL. PCRs were run in heated-lid thermal cyclers, with optimal cycling conditions determined empirically. Typical cycling conditions were 35 cycles including a denaturing step of 98 °C for 15 s, an annealing temperature specific to each primer pair for 30 s, and an elongation step of 72 °C for 1 min per kb of amplicon. PCR products were electrophoresed on 0.7% agarose gels to confirm a single discrete product. PCR products were column purified using a QIAquick kit (Qiagen). PCR products for up to 32 candidate genes were bulked for DNA sequencing, ensuring equal DNA molarity of the PCR products from each candidate gene. Primers used to amplify the 47 known or candidate genes are listed in Additional file1: Table S1.


	Deep sequencing

Library construction and deep sequencing were carried out by the Beijing Genomics Institute (BGI). Libraries were prepared for Illumina sequencing without the further use of PCR. Sequencing on the Illumina HiSeq 2000 platform generated 91 bp paired-end reads, which were provided in the FASTQ format. Approximately 15 million reads were generated for each Can-Seq library.


	Bioinformatics analysis

A workflow for identification of candidate gene mutations and annotation of GenBank format sequence files was developed as a custom Snakemake script ( https://github.com/Carroll-Lab/can_seq). The process was implemented as follows: Trim-galore was used to remove adapter and low-quality portions of each read ([23]; https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). GenBank format reference files for each gene were converted to FASTA format using the SeqIO Biopython module ( https://biopython.org/wiki/Biopython). Alignment of reads to reference sequences was carried out using Bowtie 2 [24], with variant positions identified using SAMtools mpileup output piped to VarScan [25,26]. CSV files produced by VarScan were in turn parsed and used to generate a new annotated version of the original GenBank file for each gene, along with an additional simplified CSV output. Filters in the user-configurable config.json file were set to default values; for a nucleotide variant to be called, it must be a canonical EMS mutation (G→A or C→T), be present in at least 0.75% of reads covering the position, with at least 200 reads aligning to that position, of which at least 30 reads must contain the variant nucleotide and at least five reads of the variant nucleotide must align in both the forward and reverse orientation. Information provided in the output includes the location of the reference and variant nucleotide in the candidate gene, along with the abundance of the variant expressed as a percentage of total alignments at that nucleotide of the candidate gene.


	Mutation validation and zygosity assay

For the validation of each nucleotide variant, cleaved amplified polymorphic (CAPS) or derived cleaved amplified polymorphic (dCAPS) allele-specific PCR assays were carried out on genomic DNA extracted from individual mutant plants using Taq polymerase [16–18]. Each PCR reaction contained 0.5 μM of each primer, 200 μM dNTPs, 0.5 μM MgCl2, 0.25 U of Taq polymerase and approximately 100 ng of template DNA in a final volume of 10 μL in 1× PCR buffer (100 mM Tris pH 8.3, 500 mM HCl, 15 mM MgCl2 and 0.01% gelatin). Typical cycling conditions were 35 cycles including a denaturing step of 94 °C for 15 s, an annealing step of 60 °C for 15 s, and an elongation step of 72 °C for 3 min. After PCR amplification, PCR products were digested using the appropriate restriction enzyme for 2–3 h. Specific primers and restriction enzymes to detect each candidate mutation are listed in Additional file3: Table S3. The zygosity of each candidate mutation was based on genotyping at least six individual plants of the respective rtp mutant line.


	Complementation tests

Complementation tests were performed by crossing independent rtp mutants containing homozygous candidate mutations in the same gene (see Table1 for the crosses performed). Crosses were made using the protocol described by Weigel and Glazebrook [21]. The GFP silencing phenotype of F1 progeny was scored, and verification of the cross was carried out using the PCR zygosity assay described above. Each mutant used in complementation tests was also backcrossed to the 10027-3 wild type, and BC1F1 and/or BC1F2 progeny was scored for GFP expression and inheritance of the rtp phenotype. Genetic mapping of some candidate mutations was conducted on homozygous BC1F2 mutant plants using PCR zygosity tests described above.
