Published June 5, 2023 | Version v1
Software Open

Supplementary materials to: Nano-Strainer: a workflow for identification of single-copy nuclear loci for plant systematic studies, using target capture kits and Oxford Nanopore long reads

  • 1. University of Regensburg

Description

In the paper associated with this dataset, a workflow is presented that enables the identification of single- or low-copy nuclear molecular markers for a plant group of interest by mining data from a small, representative target capture experiment performed with a commercial probe kit and Oxford Nanopore long-read sequencing. The proposed pipeline first assesses the sequence variability contained in the data from targeted loci and assigns reads to their respective genes via a combined BLAST/clustering procedure. Cluster consensus sequences are then examined against four pre-defined criteria presumed to indicate the absence of paralogy. This is done by calculating four specialized indices; loci are ranked according to their performance in these indices, and top-scoring loci are considered putatively single- or low-copy. The approach can be applied to any probe set. Because it relies on long reads, the contribution also provides template workflows for processing Nanopore-based target capture data. Identified loci can be used for NGS amplicon sequencing. To detect paralogy that may remain in these data, which might occur in groups with rampant paralogy, the long-read assembly tool CANU is employed. The presented workflow can be useful for researchers dealing with phylogenetic histories in plants that are shaped by reticulation or polyploidization.
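The ranking step described above can be illustrated with a minimal sketch. It is not the scoring used in the paper: the description does not name the four indices or say how they are combined, so the sketch assumes four numeric index values per locus, assumes that higher values are better, and combines them by mean rank. The function name, locus names and values are hypothetical.

```python
from statistics import mean


def rank_loci(scores, top_n=2):
    """Rank loci by their mean rank across all indices.

    scores: dict mapping a locus name to a tuple of index values.
    Assumption (not from the paper): a higher index value is better.
    Returns the top_n locus names and a dict of mean ranks.
    """
    loci = list(scores)
    n_indices = len(next(iter(scores.values())))
    ranks = {locus: [] for locus in loci}
    for i in range(n_indices):
        # Rank 1 goes to the locus with the best value for index i.
        ordered = sorted(loci, key=lambda locus: scores[locus][i], reverse=True)
        for rank, locus in enumerate(ordered, start=1):
            ranks[locus].append(rank)
    mean_ranks = {locus: mean(r) for locus, r in ranks.items()}
    # A lower mean rank is better; ties are broken by locus name.
    ranked = sorted(loci, key=lambda locus: (mean_ranks[locus], locus))
    return ranked[:top_n], mean_ranks


# Hypothetical index values for three loci (illustration only).
example = {
    "locus_A": (0.9, 0.8, 0.7, 0.9),
    "locus_B": (0.5, 0.9, 0.6, 0.4),
    "locus_C": (0.2, 0.1, 0.3, 0.2),
}
top, mean_ranks = rank_loci(example, top_n=2)
print(top)
```

Mean rank is only one possible aggregation; a weighted sum or a per-index threshold would serve equally well as an illustration.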

The present dataset contains several documents supplementing the original paper. Its most important element is a detailed description (accompanied by two graphical workflow figures) of all methods employed in the study, suitable for reproducing both the steps of the workflow and the wet-lab work. The workflow employs a collection of BASH, Python and R scripts, which is available here together with a detailed account of command-line use in Linux. Reference sequences for the identified markers and sequence alignments derived from the amplicon sequencing are also included.

Notes

To open the files included with this submission, the following programs are required: Microsoft Office or an open-source alternative for viewing DOCX and XLSX files, a PDF viewer, a standard sequence alignment program, and a file archive unpacking tool capable of handling ZIP files and gzipped tar archives.

Funding provided by: Deutsche Forschungsgemeinschaft
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001659
Award Number: OB 155/13-1

Files

Files (83.7 kB)

md5:b511e03ce0f1f7146346753f4c27db57
83.7 kB
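The MD5 checksum listed above can be used to confirm that a downloaded copy of the file is intact. The following is a minimal sketch using only the Python standard library; the file name in the usage comment is a placeholder, since the record does not state the name of the archive.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "b511e03ce0f1f7146346753f4c27db57"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


# Usage (hypothetical file name, replace with the downloaded file):
# matches_record("downloaded_supplement.zip")
```

A mismatch usually points to an incomplete or corrupted download, in which case the file should be fetched again before unpacking.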

Additional details

Related works

Is source of
10.5061/dryad.2fqz612tm (DOI)