Creating, curating and evaluating a mitogenomic reference database to improve regional species identification using environmental DNA

Species detection using eDNA is revolutionizing global capacity to monitor biodiversity. However, the lack of regional, vouchered, genomic sequence information—especially sequence information that includes intraspecific variation—creates a bottleneck for management agencies wanting to harness the complete power of eDNA to monitor taxa and implement eDNA analyses. eDNA studies depend upon regional databases of mitogenomic sequence information to evaluate the effectiveness of such data to detect and identify taxa. We created the Oregon Biodiversity Genome Project to create a database of complete, nearly error‐free mitogenomic sequences for all of Oregon's fishes. We have successfully assembled the complete mitogenomes of 313 specimens of freshwater, anadromous and estuarine fishes representing 24 families, 55 genera and 129 species and lineages. Comparative analyses of these sequences illustrate that many regions of the mitogenome are taxonomically informative, that the short (~150 bp) mitochondrial ‘barcode’ regions typically used for eDNA assays do not consistently diagnose for species and that complete single or multiple genes of the mitogenome are preferable for identifying Oregon's fishes. This project provides a blueprint for other researchers to follow as they build regional databases, illustrates the taxonomic value and limits of complete mitogenomic sequences and offers clues as to how current eDNA assays and environmental genomics methods of the future can best leverage this information.


| INTRODUC TI ON
The use of ambient genetic material-environmental DNA (eDNA)-to detect and identify metazoans in soil, air, marine environments and freshwater habitats is transforming biodiversity monitoring (Andersen et al., 2012;Clare et al., 2022;Deiner et al., 2016;Drummond et al., 2015;Hauck et al., 2019;Lim et al., 2016;Lynggaard et al., 2022;Port et al., 2016;Valentini et al., 2016;Yamamoto et al., 2017).Monitoring methods using eDNA require comprehensive reference databases of sequence information for target and nontarget species in the clade of interest.The oft-cited lack of comprehensive, reliably vouchered sequence information for many species (Bohmann et al., 2014;Collins et al., 2013;Cordier et al., 2021;Porter & Hajibabaei, 2018;Schnell et al., 2010) exposes the need to build reference databases using standardized sample collection, data and specimen curation and data-sharing protocols (Goldberg et al., 2016).This contribution documents and details the process of creating such a database for Oregon's freshwater fishes and provides a roadmap for others to follow in similar endeavours.
Molecular taxonomists have recommended microgenomic methods (metabarcoding, barcoding and single-species detection) for decades to work around limitations with morphologybased identification (Hebert, Cywinska, et al., 2003;Hebert, Ratnasingham, & de Waard, 2003), but molecular species detection methods do have some drawbacks.They rely on diagnostic sequence information from prototypical specimens to ensure correct identification of genetic material found in environmental samples, and that sequence information is not always available.Additionally, the gene-and taxon-specific primers needed to amplify barcode regions introduce a key source of error and bias in PCR amplification (Yang et al., 2021) because they select certain DNA sequences over others (Figure 1a; Deiner, Bik, et al., 2017).This primer bias not only allows researchers to sort metazoan targets from microorganismal backgrounds but also leads to unwanted loss of information when targeted populations and species amplify variably.Even minor binding biases among target sequences can affect PCR amplification substantially (Nichols et al., 2018;Piñol et al., 2015;Stadhouders et al., 2010), preventing reliable measurements of species presence and/or relative abundance (Yang et al., 2021).Additionally, if species have not diverged at the locus targeted by a primer set, the assay will neither diagnose those taxa nor properly assess their presence or abundance (Figure 1a).Incomplete mitogenomic sequence information prevents in silico verification that primers will bind to species' DNA or that the amplicon will correctly diagnose species.
In addition, missing sequence data and improper taxonomic assignments hinder accurate species identification when querying eDNA metabarcoding results.
Comprehensive databases of error-free, taxonomically verified, full mitogenomic data can solve issues related to unreliable genetic data and greatly improve the accuracy of novel environmental genomics methods that involve sequencing all the DNA in an environmental sample.Such approaches are known as 'shotgun sequencing', 'ecogenomics' or 'community genomics' (Béjà, 2004;Bragg & Tyson, 2014;Taberlet et al., 2012).Researchers focusing on animals typically target areas within the mitochondrial genome (mitogenome) for eDNA applications because mitochondrial DNA frequently diagnoses taxa accurately (Hebert, Cywinska, et al., 2003), resists environmental degradation (Foran, 2006) and is more easily recovered from degraded samples than lower copy nuclear DNA (Hartmann et al., 2011).Once isolated and F I G U R E 1 Primer issues and mapping shotgun reads.(a) The problems with primers.A sequence alignment of PCR products produced in silico illustrates why amplicons produced by some primer pairs, like the ones depicted here, might fail to detect or identify species.(b) Mapping shotgun reads.Samples that are shotgun-sequenced will produce reads from the entire mitogenome.This graphic depicts how, with whole mitogenomic information, it is possible to make use of all the data produced from shotgun sequencing.The length of barcodes is limited due to technological constraints and can only harness a small (maximum 300-500 bp using Illumina next generation sequencing) fragment of genetic information from the mitogenome while far more genetic information can be accessed by genome skimming from shotgun sequencing, where all the DNA that is present in a sample is sequenced.sequenced, whole mitogenomes can be used to assign taxonomic identifications in mitochondrial metagenomics (Crampton-Platt et al., 2016), multilocus metabarcoding (Arulandhu et al., 2017;Curd et al., 2019)-where multiple barcode markers are used to identify taxa in a sample-and 'ultra-barcoding' (Kane et al., 2012) also known as 'super-barcoding' (Li et al., 2015) where much longer barcodes or entire organelles are targeted.Mitogenomic approaches like these can help overcome key challenges with metabarcoding such as primer mismatches, which lead to taxonomic dropout (Cristescu & Hebert, 2018;Piñol et al., 2015;Sipos et al., 2010), reduced quantitative information (Bru et al., 2008;Wu et al., 2009) and incomplete taxonomic resolution (Piñol et al., 2015).For example, Tang et al. (2015) demonstrated that mapping shotgun-sequenced data to complete mitogenomes improved identification and quantitation of species in bee mock communities.Mitochondrial metagenomics can also harness more complete mitogenome sequences to infer phylogenetic relationships from bulk samples (Andújar et al., 2015;Crampton-Platt et al., 2015).
While advancements in sequencing technology have made it feasible to generate the voluminous data on which environmental mitogenomics depend, the lack of well-curated mitogenomic sequence databases presents a bottleneck (Arulandhu et al., 2017).
Environmental mitogenomics depend on these databases because they allow matching of any mtDNA fragment to complete, taxonomically verified mitogenomes (Figure 1b; Crampton-Platt et al., 2016;Deiner, Renshaw, et al., 2017).When such databases exist, any recovered fragment can improve inference about abundance (Braukmann et al., 2019) and yield valuable information on species presence, which together with high spatial resolution can provide distributional information.
Existing genetic information in public reference databases can facilitate assay design, but issues with data collection make them potentially unreliable.GenBank® (Benson, 1996;Clark et al., 2016) cannot fill the need for curated reference databases because GenBank's sequence data are not uniformly linked to taxonomically verified vouchers.The lack of vouchers weakens the link between DNA sequence and taxonomic identity and prevents independent verification (Meiklejohn et al., 2019).In addition, GenBank does not always screen for contamination or check for errors in protein-coding sequences.Quality-checking at GenBank has improved (Leray et al., 2019), but the sequence data it holds can still be draft quality, contain errors and be incorrectly assigned taxonomically (Meiklejohn et al., 2019), particularly at the species or subspecies level (Locatelli et al., 2020).Identification errors may particularly plague speciose invertebrate groups (Leray et al., 2020).RefSeq (O'Leary et al., 2016), GenBank's curated and well-annotated sequence dataset, solves some of these problems by incorporating additional rounds of error-checking and provides information for the entire mitogenome.However, it is not comprehensive-when this study began, RefSeq contained sequence data for just 44% of Oregon's freshwater fish species and these data were largely derived from specimens collected outside the state.
The Barcode of Life Data System (BOLD) provides an alternative to GenBank with more rigorous voucher requirements, consistent use of validation via translation to detect pseudogenes, contaminant checking and other tools to identify data anomalies and low-quality records (Ratnasingham & Hebert, 2007).However, BOLD skews heavily to information from cytochrome c oxidase I (COI) due to BOLD's initial development around a single >500 bp COI barcode region.As of this writing, COI sequences represent 80.8% of the chordate data available and 82.4% for ray-finned fishes.Although remarkably diagnostic for many species, COI markers often fail to discern recently diverged sister species pairs and may fail to amplify certain taxa due to poorly conserved primer binding regions (Deagle et al., 2014).In some cases, other mitochondrial regions contain better conserved sites across taxa for primer placement and do a comparable job of distinguishing species.For example, Miya et al. (2015) found that two conserved regions flanking a hypervariable segment of the 12S mitochondrial gene provided more suitable fish metabarcoding primers.
Reference databases also require denser intraspecific sampling to identify loci that best diagnose species or geographically structured variation within species.Without multiple sequences for each species, researchers cannot test primer binding specificity and species diagnosability in silico.Here again, available reference sequence databases fall short.GenBank's curated RefSeq database, for example, is nonredundant by design (NCBI, 2021), with each species associated with only one complete mitogenome.
Overall, the data gaps associated with GenBank and BOLD introduce uncertainty and potential error into the eDNA assay design process, making sole reliance on these resources for sequence data problematic.
Regional databases have the benefit of being able to control error-checking, curate full mitogenomic data, provide sequence information for genes and intergenic regions and identify and resolve taxonomic/genetic inconsistencies through resampling and revalidation (Astrin et al., 2013;Buckner et al., 2021;de Santana et al., 2021).Ideally, management-quality eDNA biodiversity surveys would extend the 'BOLD model' to create curated reference databases of complete mitogenome sequences tied to vouchered specimens collected throughout discrete regions.The need to expand the range of species with full mitogenomic sequence information has been elucidated by Langlois et al. (2021).The authors call for full mitochondrial genome sequences for multiple examples per species so that robust, comprehensive sequence alignments can support the development of assays that avoid cross-binding of primers to nontarget taxa or nonbinding of primers to target DNA (Langlois et al., 2021).In addition to improving qPCR primer design, databases of full mitogenomic sequence information provide the data needed to correctly assign metabarcoding sequence queries to taxa regardless of the primer binding site in the mitogenome.
Finally, whole mitochondrial genomes also permit the exploration of numerous markers for both quantitative and conventional PCR (Farrington et al., 2015).Fortunately, it is now practical and affordable for a small consortium to sequence and assemble hundreds of mitogenomes using a single Illumina Novaseq sequencing lane.This means that little impedes the development of the curated mitogenomic reference sequence databases needed to prepare for PCR-free mitogenomics and to develop, test and query single-species and metabarcoding eDNA assays.
Here, we provide a roadmap for constructing such a curated mitogenomic reference library and additionally evaluate the resulting data to determine the effectiveness of subsections of the mitogenome and the organelle in its entirety to identify species.This effort was motivated by the Oregon Biodiversity Genome Project (OBGP; www.obgp.org), a multi-institution collaboration between scientists and managers at Oregon State University, the Oregon Department of Fish and Wildlife and the United States Forest Service.The OBGP aims to develop a regional genetic reference database to facilitate statewide eDNA monitoring programmes for Oregon's resident freshwater fishes.The specific goals of the OBGP, as outlined in our roadmap (Figure 2), are to: (1) use sterile laboratory methods to collect 10 georeferenced fullbodied vouchers of each freshwater fish species from dispersed watersheds in Oregon; (2) archive and link voucher specimens, tissues and metadata for taxonomic verification and revision; (3) sequence full mitogenomes from multiple specimens per species; and (4) make all curated data publicly available.While biodiversity and geographic complexity differ from region to region, this study provides a realistic sense of the effort needed to construct a database covering ~150 species spread across ~250,000 km 2 .By curating this reference database of full mitogenomes, we created the taxonomic reference information needed to identify freshwater, anadromous and estuarine fish species found in Oregon and bordering states by any mitogenome-based single-species eDNA or metabarcoding assay and set the stage for future PCR-free environmental mitogenomics methods.Our approach also analyses the efficacy of various regions of the mitogenome to identify species and provides pipelines that can guide other organizations as they develop reference sequence databases for their taxa and regions of interest.

| Voucher specimen and tissue collection
The study area initially encompassed the state of Oregon-the region of interest for our eDNA monitoring programme-and expanded to a few sites in northern California and Washington State (Figure 3).To strategize sample collection, we examined historical location records in fish collections, such as the Oregon State Ichthyology Collection, and conferred with local biologists to identify current distributions.For cases where we knew or suspected that deeply divergent evolutionary lineages existed in the present concept of a species, we aimed to include representatives of all lineages.We ultimately identified 146 native and non-native freshwater fish species and lineages that are currently found in Oregon and strategized collections to span watersheds throughout the state (Appendix S1).
To facilitate consistent sampling, we provided sampling kits (Box S1 in Appendix S2) to collectors that contained a 500-mL Nalgene bottle filled with 10% formalin, a 2.0 mL cryotube filled with 95% EtOH, a sterile scalpel, scissors and tweezers, a bleach wipe, latex gloves, a detailed sampling protocol to ensure consistent tissue sampling and data collection (Box S2 in Appendix S2) and a field notes sheet (Box S3 in Appendix S2) for metadata collection.
Collectors anaesthetised and euthanized all fish specimens before tissue collection by immersion in an aqueous solution of Tricaine mesylate (MS-222; 400 mg MS-222, 400 mg sodium bicarbonate, 1 L water).We instructed all partners to collect a minimum of ~0.5 cm 3 of tissue from each specimen, which was then placed in 95% EtOH.
Euthanized fish were placed in 10% formalin as vouchers, thereby ensuring preservation of diagnostic features.The cost outlay for kit components to collect 1500 individuals totalled US$16,185.20 (Appendix S6).

| Taxonomic verification, accession and cataloguing
Fish biologists provisionally identified specimens in the field before Oregon State Ichthyology Collection taxonomists verified or refined those identifications by morphological examination and reference to published keys (Markle & Tomelleri, 2016;Wydoski & Whitney, 2003).
The Oregon State Ichthyology Collection is in the process of accessioning and cataloguing all vouchers and tissues.During that process, the curators input the metadata associated with each specimen and collection event into a relational database, and full-bodied vouchers are transferred from formalin to isopropyl alcohol for permanent storage in a dedicated collection facility that complies with modern fire and earthquake safety codes.Tissues are stored in 2.0 mL cryotubes in −80°C freezers.Total expenditures on storage supplies such as jars, lids and preservation fluid came to US$8624.21 (Appendix S6).

| Mitogenome assembly
To capture geographic genetic variation of each resident species across its distribution in Oregon, we sequenced the first collected representative of each species and subsequently sequenced specimens collected from separate watersheds, when possible.
We stored gzipped FASTq sequencing files on 2 × 1TB hard drives and performed mitogenome assemblies on 4 × 2.30 GHz 16-core processors using 512GB RAM (total hardware cost $12,368.33;Appendix S6).Mitochondrial genomes were assembled de novo F I G U R E 2 Creating a reference sequence database for monitoring with eDNA.A one-page blueprint designed to provide researchers with the basic steps to follow to create a regional database of full mitogenome sequences for target taxa.from raw paired reads using SPAdes assembler (versions 3.12.0-3.15.3;Bankevich et al., 2012) and getOrganelle 1.6.2 or 1.7.5 (Jin et al., 2020) upon release.We annotated all mitochondrial sequences using a combination of MITOS 2 WebServer (Donath et al., 2019) and Geneious 10.2.6 using annotations from identical or closely related species.

| Mitogenome variability
To analyse intra-and interspecies mitogenome variability, assembled mitogenomes from each species were aligned with MUSCLE (Edgar, 2004) in Geneious 10.2.6 using default parameters.After reciprocal rounds of morphological examination and molecular clustering, we aligned sequences of species from within the same family and then aligned these family clusters to create a master alignment of all sequences.To identify taxonomically diagnostic regions for efficient eDNA assay development, we used the R package SPIDER (Brown et al., 2012) to perform a sliding window analysis on the master alignment and locate areas with the highest density of taxonomically diagnostic nucleotides (TDN)-defined as locations where a nucleotide is fixed within species and different or unaligned in all other species.
We determined mean taxonomically diagnostic nucleotides per 150base window shifted at 20-base intervals (TDN/w 150 i 20 ) by calculating the mean of all TDN/w 150 i 20 counts across a particular region.We also identified regions containing TDN 'spikes', where max(TDN/w 150 i 20 ) >2 * mean(TDN/w 150 i 20 ).To identify species and genes with high variability, we used identity matrices generated in Geneious and plotted variability with heat maps, parallel coordinate plots and radar charts using the R packages Superheat (Barter & Yu, 2018), GGally (Schloerke et al., 2021) and fmsb (Nakazawa, 2021), respectively.Gene regions <690 base pairs in length-ATP6, ATP8, NAD3, NAD4l, NAD6 and all tRNA genes-were not included in our analyses of individual genes.
We grouped described subspecies with their full parent species for the purpose of calculating mean percent identities.Calculation of intraspecies, intrafamily/interspecies and interfamily/interspecies mitogenome identities and the proportional relationships among these identities required mitogenomes of multiple specimens for each species within

| Utility of different mitogenomic regions for species identification
We evaluated the relative success of subregions of the mitogenome to identify fish species by first extracting subsets from the alignment of 313 mitogenomes.In Geneious 10.2.6, we created miFish (Miya et al., 2015) and Teleo (Valentini et al., 2016) amplicons using Primer3 (Untergasser et al., 2012) and extracted 12S, 16S, NAD2, CO1, NAD4, NAD5 and D-loop regions from the alignment using annotations as guidelines.We examined the entire mitogenome along with regions spanning the 12S, trnV and 16S genes and the NAD4, trnH, trnS1, trnL1 and NAD5 genes.We performed BLASTn queries of these regions against a local database created from our 313 mitogenomes.We parsed the results from this BLASTn query to determine the effectiveness of the different regions to successfully identify species.

| Taxonomic verification
After clustering sequences based on maximum-likelihood (ML) inference and examining voucher specimens in the laboratory, we refined or corrected 31 field identifications (9.9%; Appendix S3; Catostomus bondi yet lacked all the diagnostic features of this species and was assumed to be a hybrid with Catostomus columbianus.
OBGP-2016-007 was a transforming lamprey microphthalmia with caudal fin pigmentation suggesting Entosphenus lethophagus, and a catfish specimen OBGP-2018-046 that had some of the diagnostic features of both Ameiurus natalis and A. nebulosus but clustered with A. nebulosus was assumed to be a hybrid.
Mitogenome sizes ranged from 16,098 to 17,185 bp in length (mean 16,590).All but 17 assembled mitogenomes had error-free contigs when measured with k = 31 using Merqury.Assembled mitogenomes with errors had QVs between 40.7507 and 57.0952 (Appendix S3) indicating errors in the range of 1 in ~10,000 bp to 1 in ~1,000,000 bp, respectively.Read mapping showed anomalous coverage in intergenic regions of 36 assemblies.These anomalies were generally located in areas with nucleotide repeats and manifested as dips or spikes in coverage.Dips were likely due to sequencing and/or assembly errors, and we assumed spikes were the result of nuclear mitochondrial DNA.These anomalies were not sufficient to exclude mitogenomes from downstream analyses (Appendix S3).

| Mitogenome variability
The sliding window analysis of our alignment of 313 complete mitogenomes revealed that mean taxonomically diagnostic nucleotides per 150-base window shifted at 20-base intervals (TDN/ w 150 i 20 ) in analysed gene regions were as follows: COI (7.257), CytB (9.451), NAD1 (13.381),NAD5 (17.092),NAD4 (18.226),NAD2 (20.065), 12S (20.814) and 16S (25.726) (Figure 4b).The highest concentrations of TDNs occurred in the D-loop and the intergenic region between the NAD2 and COI genes (Figure 4a).We used heat maps (Figure 6a; Figure S10 in Appendix S4; raw data in Appendix S5) to illustrate the degrees of similarity between families in different gene regions, although all genes have diverged sufficiently to diagnose familial lineages.These results showed that sequence identity among families in the COI gene exceeded that of other coding and noncoding gene regions (Figure 6a) suggesting that COI is the mitogenomic region in which families differ the least, a conclusion that concords with the results from the sliding window analysis.NAD2, NAD5 and 16S contrasted the most among families and species in overall percent identity.Despite different degrees of intrafamily identity in the mitochondrial regions we examined, our analysis suggested that divergence in all mitochondrial genes and the D-loop is sufficient to identify taxa at the family level (Figure 6a).
Zooming in to species differences within families, after in-field identification errors were corrected, all species resembled members of their own species in mtDNA sequences more than they resembled members of other species within the same family, as expected (Table S1 in Appendix S4).Species within the same family differed most in the percent identity of the D-loop (0.855), followed by the NAD2 (mean 0.885), NAD5 (0.896) and NAD4 (0.896) genes.The 12S (0.965) and 16S (0.957) genes were the least differentiable regions (Figure 7).Intraspecies mean per cent identities for the 12S and 16S rRNA genes and all coding genes >690 bp ranged from 98.259% to 99.975% and from 92.428% to 98.249% for the D-loop (Table S2 in Appendix S4).This comparison illustrates that the D-loop is less conserved within species than are any of the other gene regions (Figure 8).The most conserved genes were 12S, 16S and COX2, with the lowest mean values found in the NAD2 gene nevertheless still exceeding 99% identity (Table S2 in Appendix S4).Radar charts of mean percent intraspecies, intrafamily interspecies and interfamily interspecies identity (Figure 8) illustrated that different genes are more conserved among species within certain families than others.For example, species in Catostomidae varied little in sequences from rRNA and all three COX genes, while the 12S and 16S genes were fairly conserved among salmonid and cottid species.Non-rRNA regions in Salmonidae and Cottidae, and all gene regions and the D-loop in Cyprinidae, Centrarchidae and Ictaluridae contained diverged interspecies sequences.Full mitogenomes were highly conserved within species (mean 99.493% identity) and had sufficient divergence among species in the same family to suggest TA B L E 1 Assembled Mitogenome Taxa Counts: OBGP specimens with assembled mitogenomes are grouped according to taxonomic designation with counts for each taxonomic level (see Table S3 in Appendix S4 for table with individual specimen IDs).
they would be diagnostic at the species level for Oregon fishes (Figure 9).

| Utility of mitogenomic regions for species identification
Plots of parsed results from BLASTn queries suggest that the full mitogenome, concatenated gene regions from the mitogenome, and the NAD5 gene are superior to miFish and Teleo metabarcoding primers and most individual mitochondrial genes for producing first hits that match specimens to species and to described subspecies (Figure 10).Nevertheless, queries using every gene region and

| A blueprint for constructing mitogenomic databases
We demonstrate a robust, affordable, feasible blueprint for con- We have catalogued the collection strategy and wet and dry laboratory pipelines we used for our bottom-up development of an eDNA biodiversity reference collection and sequence database and provide an easy-to-follow roadmap (Figure 2).This bottom-up approach harnesses the expertise, knowledge and resources of researchers and managers within their region and taxa of interest, an essential strategy as these individuals possess the intimate knowledge of species, taxonomy and geography needed to plan expeditions, carry out collections and identify specimens.There are many factors to consider when constructing a genomic database.Contamination must be prevented during tissue sampling, and voucher archiving is key for taxonomic verification (Rimet et al., 2021).Adequate tissue volumes must be harvested and properly preserved (Nagy, 2010) to derive sufficient volumes of high-quality DNA for sequencing requirements.Additionally, if sample processing cannot be outsourced, access to equipment, faand the expertise to perform DNA extractions, library preparation and genetic sequencing is necessary.The choice of extraction method requires consideration as DNA quality can differ depending on taxa (Panova et al., 2016) and high-molecular-weight DNA is required for long-read sequencing methods (Trigodet et al., 2022).
We have detailed the extraction and sequencing protocols used for this project (Appendix S8), but trials should be performed to optimize collection, extraction and sequencing strategies for deriving a desired dataset for a particular taxonomic group.The workflow also depends on taxonomic expertise to identify specimens within difficult families, namely those featuring many morphologically similar species or undescribed cryptic species.In addition, access to the infrastructure and archival capacity of a natural history collection is vital because voucher specimens must be catalogued properly and preserved in perpetuity for the science to be verifiable and repeatable (Astrin et al., 2013;Buckner et al., 2021;Prendini et al., 2002).
Finally, the use of an appropriate repository-in our case the Oregon State University Ichthyology Collection-that links DNA extracts, tissues, specimens and genomic data permits the re-evaluation of specimen assignment when taxonomies are revised, ensuring that the collection remains valid and current.For example, a 2023 assessment of the Rhinichthys osculus species complex (Moyle et al., 2023) revised the taxonomic designations of our dace specimens during final review of our manuscript.This highlights the point that a collection of this nature is not static and needs maintenance.It requires collaborators with a robust understanding of local species along with the expertise to perform morphological evaluations and update classifications as taxonomies are revised.In addition, any taxonomic or other changes to specimen metadata need to be reflected in public repositories.
Researchers seeking to construct reference libraries for nonpiscine taxa should first understand the structure and makeup of the mitochondrial and nuclear genomes of those groups prior to curating mitogenomic sequences.They should ensure that the proposed wet and dry laboratory pipelines can successfully resolve mitogenomes in those groups.Fish mitogenomes contain fewer repeats, insertions and deletions than those of other vertebrates (Formenti et al., 2021), which can cause problems in the sequencing and assembly pipeline (Tørresen et al., 2019).The structural simplicity of many fish mitochondrial genomes enables mitogenome recovery while sequencing at shallow depths.Given 2 × 150 bp sequencing (L), 65 specimens (S) and ~ 300 million reads (N) per lane and a hypothetical 2200 Mb of DNA (G) per specimen, coverage will be <1× for the nuclear genome (LN/GS; Sims et al., 2014) but many multiples of this for the mitogenome as there are many mitochondria per cell.Between 30× and 50× coverage is recommended to reconstruct mitogenomes de novo when using short-read sequencers, so increased sequencing depth may be needed for taxa with bulky, complex genomic material, such as certain salamander species.Alternatively, a combination of long-read and short-read sequencing may readily resolve complete mitogenomes in these circumstances (Formenti et al., 2021).
An essential concern for a sequence database of this type is to examples per species, but the ratio of intra-to interspecific identity was otherwise low (Figure 7), suggesting that, for most of Oregon's fishes, only a few per species were needed.Unfortunately, we were unable to collect multiple individuals for all taxa and were therefore unable to assess intraspecific variation for all species.To properly capture intraspecific variation, one could collect multiple examples of each taxon across its distribution, repeatedly measure the diminution of new polymorphisms and use rarefaction curves to strategize collection.Nevertheless, databases containing full mitogenomic data that do not completely capture intraspecific variation or include all target taxa are still useful because queries with multiple nucleotide mismatches spread across the mitogenome suggest a species has been detected that is not represented in the database.

| Taxonomically informative genes for fishes
Our reference sequence database provides a valuable genetic resource for analysing mitochondrial genetic variability among Oregon's freshwater fishes and gauging capacity to identify species in eDNA assays.Our analyses illustrate that mitochondrial sequences at every level, from individual genes to the entire mitogenome, are sufficiently conserved within species to provide reliable identifications.However, not all mtDNA regions are equally good at distinguishing between closely related species and certain complete genes and concatenated gene regions (such as the NAD regions) are generally superior (Figure 10).Sequences must have diverged sufficiently among taxa to discern them, so longer regions and faster evolving sections of the mitogenome are preferable for diagnosing recently separated lineages.
Sections of the COI gene have been recommended for barcoding animals (Hebert, Cywinska, et al., 2003) and metabarcoding eukaryotes (Meusnier et al., 2008) or metazoans (Andújar et al., 2018), but they may not be optimal for identifying certain taxa, especially when only short stretches of DNA are retrievable.
For certain taxonomic groups, COI sequences do not consistently cluster or associate sequences with their assigned species when barcoding (Waugh, 2007) or metabarcoding (Collins et al., 2019).
Previous analyses by Hebert, Cywinska, et al. (2003) and Hebert, Ratnasingham, and de Waard (2003) examined the use of COI for species identification in Lepidoptera by quantifying sequence divergence, using NJ analyses and multidimensional scaling to assign sequences to species.They extrapolated the general suitability of their COI barcode to diagnose animal species from their success in these taxonomically narrow trials.Though their arguments in favour of the COI as the core of a global bioidentification system for animals were logical, they were also speculative (Hebert, Ratnasingham, & de Waard, 2003).They did not assess the comparative merits of the COI over other mitochondrial genes and explicitly stated the need to validate the diagnosability of the COI gene for different taxonomic groups (Hebert, Cywinska, et al., 2003).
This has been done for the COI barcode for a variety of taxonomic groups over the intervening decades (invertebrates: (Cywinska et al., 2006;Sheffield et al., 2009;M. R. Young et al., 2019); fish: (Zemlak et al., 2009); birds: (Hebert et al., 2004;Kerr et al., 2009); amphibians: (Smith et al., 2008); mammals: (Francis et al., 2010)) with results based on sequence divergence and NJ clustering analyses suggesting that, for arthropods and vertebrates, this barcode is useful for parsing these groups taxonomically.It is unclear, however, to what degree the COI is taxonomically diagnostic to the species level for all metazoans.Relatively low percent identity between multiple species does not necessarily equate to species-level diagnosticity (Figure S9 in Appendix S4).A more complete evaluation of the comparative diagnosability of different parts of the mitogenome is therefore needed for a broad range of taxa.Here, we demonstrate that for Oregon's freshwater fishes, regions other than the COI have the potential to better identify species in certain taxa.
A variety of factors contribute to how successfully a region parses the independently evolving lineages we typically call species.Areas with relatively high interspecies genetic distance and concentrations of taxonomically diagnostic nucleotides within families are likely candidates for diagnosing to the species level, whether barcoding, metabarcoding or performing single-species detection.To 'capture' these regions for single-species qPCR assays, the goal is to detect eDNA from the target species and no other species, so areas with high intrafamily distance, high intraspecies identity and high mean concentrations of TDNs are likely the best candidates.Unlike single-species qPCR assays, metabarcoding primers need to capture eDNA from a broad range of taxadifferent families, orders, classes or even phyla-so there need to be shared regions (typically between 18 and 27 bases long) that can permit primer binding and avoid species dropout.Essentially, a 'Goldilocks' zone is needed for metabarcoding: a region with sufficient genetic divergence to distinguish between species, but not so divergent that conserved flanking regions are unavailable for primer binding.For this reason, the hairpin-loop structure of rRNA regions makes them appropriate for metabarcoding and explains why the most referenced fish metabarcoding primers are found in the 12S region (Miya et al., 2015) despite this region's high withinfamily interspecies percent identity relative to other regions.To guide the discovery of these areas in future studies, our analysis suggests that these regions contain TDN 'spikes' in otherwise conserved stretches of the mitogenome.
For Oregon's freshwater fishes, we found multiple gene regions and the D-loop had high interspecies genetic distance and concentrations of TDNs within families, suggesting there are numerous alternatives to the COI gene for species identification using eDNA.For single-species analyses, the results were mixed depending on the family examined.Salmonidae, Cyprinidae, Cottidae and Petromyzontidae had the highest mean TDN/w 150 i 20 in the D-loop (Table 2; Figure 5), although this region also had low relative intraspecies identity compared with other mitochondrial regions (Figure 8).Regions with a combination of all three important factors-high intraspecies identity, high intrafamily distance and high mean TDN/w 150 i 20 -were the NAD2 gene for Salmonidae, Centrarchidae, Cottidae Petromyzontidae, the NAD5 gene for Cyprinidae, and the NAD4 gene for Catostomidae and Ictaluridae (Table 2; Figure 7; Table S2 in Appendix S4).The 12S and 16S genes contain TDN 'spikes' ( To ultimately be of use for metabarcoding, a region needs to diagnose species effectively and consistently and be sequenceable using contemporary technologies.Our comparative BLAST analysis provides insight into how consistently various regions of the mitogenome diagnose species or subspecies (Figure 10).The results suggest that, when performing BLASTn sequence queries-the standard query used for eDNA metabarcoding results-multiple genes or the full mitogenome are generally more successful at F I G U R E 6 (a) Percent identity heatmaps: Family level.This is a graphical representation of interfamily mitogenome percent identity.
Yellow colours indicate greater dissimilarity, while 100% identity is represented in dark purple; higher contrast therefore indicates greater distance in identity between families.Y-and X-axes are identical with each block on an axis representing one family in the following order from top to bottom and right to left: Poeciliidae, Fundulidae, Pleuronectidae, Oxudercidae, Embiotocidae, Centrarchidae, Percidae, Pholidae, Gasterosteidae, Sebastidae, Cottidae, Percopsidae, Osmeridae, Umbridae, Esocidae, Salmonidae, Ictaluridae, Cyprinidae, Catostomidae, Cobitidae, Clupeidae, Atherinopsidae, Acipenseridae and Petromyzontidae.Raw numbers can be referenced in Appendix S5.A specieslevel heatmap is available in Figure S9 in Appendix S4.(b) Schematic View of an Identity Matrix: An alignment of sequences is compared in a pairwise fashion to determine the distance between one sequence and all other sequences in the alignment.The resulting matrix is symmetric along the diagonal with the central diagonal-where a sequence is compared with itself-remaining blank.Data can be grouped, and mean percent identities can be calculated as depicted here.
(a) (b) identification than most individual genes and mini-barcodes.The exception is the NAD5 gene, which produced the most targetspecific results along with queries using the full mitogenome and concatenated NAD4/NAD5 or 12S/16S regions.This suggests that the NAD5 gene and certain concatenated genes within the mitogenome may be useful for single-species assays and with future PCR-free approaches.However, the low interspecies per cent identity of the NAD genes (Figure 8) may prevent location of nondegenerate multispecies metabarcoding primers that capture short enough to be sequenced using Illumina technologies.In addition, entire single and concatenated genes are too long to be sequenced using next-generation methods.For example, the ~650 bp COI region published by Hebert, Ratnasingham, and de Waard (2003) exceeds the length feasible for Illumina throughput sequencing (Meusnier et al., 2008), so subsections need to be used for metabarcoding.However, capturing F I G U R E 7 Within-family relationships in mean percent identity: Parallel coordinate plots.These plots illustrate the relationship between intraspecies and interspecies genetic distance within the six plotted families at 11 regions (10 gene regions and the D-loop).At each of these 11 mitogenomic regions, mean intraspecies percent identity is calculated and then mean percent identity is calculated between that species and all other species within a given family.For each species at each region, the proportional relationship between these two means is plotted.A higher proportional value indicates higher genetic similarity between species within a family, so identifying species within these families may be more difficult using eDNA assays.Species within families exhibiting greater genetic distance here-Cyprinidae and Centrarchidae-should therefore be easier to identify using shorter regions of the mitogenome than species within families with higher genetic similarity (e.g.Cottidae and Catostomidae).See Table S1 in Appendix S4 for proportional identity figures.
sufficiently short regions of the COI for Illumina sequencing often requires highly degenerate primers (Collins et al., 2019;Deagle et al., 2014).Because of their wide range of optimal primer melting temperatures, degenerate primers may differentially amplify target species, of particular concern when targeting extraorganismal eDNA (Hajibabaei et al., 2019).The 16S gene may present a potential alternative to 12S mini-barcode regions for eDNA metabarcoding of fishes because it identifies targets to species or subspecies relatively well (Figure 10) and has high interspecies percent identity (Figure 8) and TDN 'spikes' (Table 2).Additional studies evaluating subsections of the 16S gene for this purpose would be informative.
In cases where no single barcode can separate all targeted species, several viable options exist.For single-species and metabarcoding assays that rely on short barcode regions, it may be necessary to use multiple regions for metabarcoding or perhaps a diagnostic region in the nuclear genome, such as the ITS1 gene, to discern closely related congeners (Dysthe et al., 2018).Having more extensive mitogenomic data is necessary for multilocus metabarcoding, a strategy that can increase specificity and sensitivity in single-species eDNA assays (Brys et al., 2023).In addition, full mitogenomes contain significant within-family interspecies variability that can be harnessed with PCR-free approaches to improve species identification over single genes, even among F I G U R E 8 Percent identity in a subset of families at regions within the mitogenome.Within each family, mean intraspecies percent identities are calculated and plotted for each gene and the D-loop (pale green), along with interspecies intrafamily percent identities (dark green), and interspecies interfamily percent identities (orange).Interfamily calculations are computed between species from all families, not just the families depicted here.Values on radar chart axes span from 40% identity at the innermost ring to 100% identity at the outermost ring.Genes are arranged in the order in which they occur in the circular mitogenome.Petromyzontidae is not included in plots due to highly skewed interspecies/interfamily identity.To identify species in eDNA assays, intraspecies identity should ideally be high and interspecies identity should be low.All families exhibit high intraspecies identity across all regions of the entire mitogenome with the lowest intraspecies identity found in the D-loop.Between families, there is sufficient interspecies variation to distinguish species from different families and identify sequences to the genus level.Within Cottidae and Catostomidae, there is relatively high interspecies identity, suggesting that species within these families may be difficult to identify.Within Salmonidae, there is high interspecies identity at the 12S (rrnS) and 16S (rrnL) gene regions suggesting these regions may not be ideal for identifying salmonid species.
relatively conserved taxa.The benefits of using full mitogenomes or a combination of strategically valuable mitochondrial genes derived from full mitogenomes to discern Oregon's resident fish species are illustrated by our analysis of in silico BLASTn queries (Figure 10).An advantage of genome skimming from shotgun sequencing (also 'whole-genome shotgun sequencing' or WGS)-where all genetic material from a specimen is sequenced at shallow depth (Trevisan et al., 2019), as undertaken in this project (Appendix S8)-is that the resulting raw sequence data includes high-copy-number nuclear markers such as 18S rRNA at sufficient coverage for potential assembly and use to discern species should mitochondrial markers fail.Indeed, genome skimming produces a plethora of useful data that can be utilized for specimen identification in a variety of applications (Bohmann et al., 2020).

| Biological challenges in distinguishing species
It is important to note that the failure to correctly identify the presence of a species using eDNA may be the biological reality rather than a fault of the method.Species that often confound taxonomists and are difficult to distinguish morphologically also appear to be difficult to resolve genetically; not even full, reliable, mitogenomic sequences may be able to resolve such cases.For example, hybridization and organelle introgression from secondary contact can obscure the relationships of different species in an environment (Dowling et al., 2016;Forsythe et al., 2020).Because it is inherited matrilineally, mitogenomic information on its own cannot distinguish hybrid species, and nuclear genetic information may be needed to untangle the genetic complexities of introgression resulting from secondary contact and hybridization-the likely culprit behind difficulties identifying certain catostomid species (Dowling et al., 2016).Difficulties with cottid identification may result from insufficiently diverged lineages or widely variable morphology (Rowsey & Egge, 2017), and the tendency of lamprey taxa to rapidly derive nonparasitic 'satellite species' from parasitic species in sympatry (Salewski, 2003;Vladykov & Kott, 1979) often leaves insufficient time for genetic divergence to accrue (April et al., 2011;Brownstein & Near, 2023).Understanding these shortcomings is essential when undertaking biodiversity assessments using eDNA, and even though the method may not be perfect, the ability to assess a broad variety of taxa using noninvasively collected environmental samples is transformative.
F I G U R E 9 Whole mitogenome percent identity in a subset of families.Within each family, mean intraspecies percent identities are calculated and plotted for the whole mitogenome (pale green), along with interspecies intrafamily percent identities (dark green), and interspecies interfamily percent identities (orange).
Interfamily calculations are computed between species from all families, not just the families depicted here.Values on radar chart axes span from 60% identity at the innermost ring to 100% identity at the outermost ring.Petromyzontidae is not included in this plot due to highly skewed interspecies/interfamily identity.
To identify species in mitochondrial metagenomic assays, intraspecies identity should ideally be high and interspecies identity should be low.All families exhibit high intraspecies identity across the entire mitogenome.All families exhibit very low intrafamily interspecies identity and relatively low within-family interspecies identity suggesting the entire mitogenome should successfully identify species from all families.

| Mitogenomes in a nuclear future
Despite the challenges involved with a project of this scale and the limitations of mitogenomes to genetically resolve hybrids and certain closely related species, full mitogenomic data provide both a useful genetic reference for species identification and the genetic information needed to develop primers for single-species and metabarcoding assays.It also furnishes researchers with the data needed to move away from microgenomics such as barcoding or metabarcoding and into capture enrichment (Wilcox et al., 2018) or PCR-free environmental genomics, which solve the problems associated with PCR amplification biases (Piñol et al., 2015).Such methods make accurate quantification of relative species abundance in a sample a real possibility (Yang et al., 2021).Full mitogenomic data also frequently diagnose individuals to species more reliably than do shorter sequences (Figure 10).Compiling regional mitogenome databases expands the global repository of available genetic data and builds local capacity to query metabarcoding reads and develop or select the qPCR and metabarcoding primers that best support eDNA-based monitoring of biodiversity.
These large-scale, expensive and top-down efforts will create nuclear genome reference databases for many species in the coming decades, but their global focus is unlikely to provide comprehensive genetic information for any specific geographic region or taxonomic group of interest.For example, as of the time of this writing the VGP has produced 110 nuclear assemblies and published data for 125 mitogenomes (Formenti et al., 2021).This is an invaluable contribution but lacks redundant sequence information.An exception to these F I G U R E 1 0 Mitogenome regions and PCR amplicons queried against local OBGP database: Analysis of All Highest E-Value BLAST Hits: Regions of the mitogenome were extracted from the alignment of 313 OBGP sequences, and all sequence regions were blasted against a local database created from the same alignment.Single genes queried were COI, NAD2, NAD4, NAD5, 16S (rrnL) and 12S (rrnS) regions.
The entire region spanning from the beginning of the NAD4 to the end of the NAD5 (nad4nad5) genes and from the beginning of the 12S to the end of the16S (rrnSrrnL) genes was also queried along with the entire mitogenome (mitome) as well as the amplicons produced by both miFish and Teleo primer pairs.All hits with the highest E-Value measured in BLAST were analysed to determine the proportion of species having the target species or subspecies as the first hit.Species or subspecies with multiple specimens represented in the alignment were evaluated to see whether the target species or subspecies was the first hit for some or all representatives of that species.For the full mitogenome, the target species was the first hit for all specimens assigned to the species level and the target subspecies was the first hit for most subspecies that had specimens assigned to the subspecies level.For concatenated regions and the NAD5 gene, the results were similar to the full mitogenome, although the correct species was not always the first hit for all representatives of a species.The remaining regions were less likely to have the target as the first hit with miFish and Teleo amplicons having fewer first target hits than other, longer regions.This illustrates that a reliance on single short amplicons produced with PCR limits the capacity to identify species in an eDNA sample.
large-scale projects is the California Conservation Genomics Project (Shaffer et al., 2022), a regional effort to create nuclear reference genomes and multiple resequenced nuclear genomes for numerous species within the state of California.This presents an exemplar for researchers to pursue similar efforts, and if the affordability and accuracy promised by PacBio for its new generation of long-read sequencers becomes a reality, projects of this scale will effectively become democratized.While we should strive to create reference databases of full nuclear genomes for all organisms on the planet, nuclear DNA may not turn up reliably in extraorganismal eDNA from aggregating, nonspawning metazoans (Jensen et al., 2021;Olson et al., 2012), making nuclear reference sequence databases less useful for eDNA applications.There will therefore always be a need for mitogenomic reference data, and much can still be gleaned from complete mitogenomes on their own (Blair, 2023).

| CON CLUS ION
We hope these protocols and insights into mitogenomic variability will encourage researchers around the globe to follow suit and develop their own regional databases and archives of voucher speci-

F
Map of study area and sampling sites.Each orange circle represents a single sampling location.An interactive map can be viewed here.a family.Seven families satisfied these requirements: Catostomidae, Centrarchidae, Cottidae, Cyprinidae, Ictaluridae, Petromyzontidae and Salmonidae.
Mitogenome data generated for this project have been deposited in GenBank under the Oregon Biodiversity Genome Project BioProject.GenBank accession numbers and sequence data are included in Supplemental Information (Appendices S3 and S7 contain mitogenome FASTas).Sequence data are also available at www.obgp.org/downl oads.As of the time of this writing, linked voucher, tissue and DNA extract accessioning into the Oregon State Ichthyology Collection was ongoing.The voucher specimens accessioned and catalogued to date are available and searchable via webportal at https://webpo rtal.specifyclo ud.org/osich thyol ogy/, which the collection's curators updated periodically.Further details of voucher specimen and tissue collection, taxonomic verification, accession and cataloguing, mitogenome assembly and utility of different mitogenomic regions for species identification are available in Appendix S8.Information regarding DNA extraction and sequencing is solely available in Appendix S8. 3 | RE SULTS 3.1 | Voucher specimen and tissue collection We collected 625 specimens representing 129 fish species or distinct lineages within species complexes.These specimens originated from more than 240 localities in Oregon.Twelve additional tissue samples of four species were acquired from natural history collections, bringing the total number of species represented in the database to 132.Of these 132 species, 119 represent the original 146 fish species identified by Oregon Department of Fish and Wildlife as native or naturalized in Oregon.The remaining 13 species belong to 11 coastal estuarine species not included in our initial freshwater collection plan, plus one species endemic to western Washington (Olympic Mudminnow, Novumbra hubbsi) and one newly identified lineage of Paiute Sculpin (Cottus beldingii ssp.) from the John Day River Basin in central Oregon.
n = 303) resolved as a single mitochondrial contig with an overlapping splice point.The remaining assemblies were derived from either: (a) multiple contigs with overlapping splice points (n = 3); (b) a single contig with a nonoverlapping splice point in an intergenic area with mononucleotide C repeats (n = 6); or (c) multiple contigs from different SPAdes runs with overlapping splice points (n = 1).
structing mitogenomic databases.The workflow begins with the collection of reference specimens and progresses through taxonomic verification, accessioning of specimens, tissues and DNA, mitogenome assembly and open-source provisioning of complete mitogenomes.Such databases can help to refine the taxonomy of understudied or difficult groups, guide the discovery and delineation of cryptic species or distinct population segments and facilitate the transition to eDNA-based monitoring of aquatic biodiversity.Though beginning such a project appears daunting, the steps for carrying out a similar endeavour are straightforward: (1) using historical collection data and local knowledge, determine all focal species and their distributions, (2) break up the region of interest into manageable subregions for sampling, (3) create a sampling plan to collect 3-10 individuals per species/lineages of interest and begin the sampling effort using accepted standards for metadata collection(Rimet et al., 2021), acquiring tissues from vouchered specimens in natural history collections whenever possible and (4) sequence and assemble specimens as they accumulate, measuring intraspecies sequence variability to inform continued collection.

F I G U R E 4
Sliding window analysis: A 150-base-wide window is placed at the beginning of an alignment of 313 mitogenome sequences and shifted right at 20 base intervals.At each window position, the number of taxonomically diagnostic nucleotides (TDN)-where a nucleotide is shared within a species but is either different or unaligned with other species-is counted.Areas with high concentrations of TDNs are likely to be superior for species identification in eDNA assays.(a) Full mitogenome: Gene regions and the D-loop are shaded.Areas with clusters of TDNs are distributed across the entire mitogenome and the highest TDN concentrations are in noncoding regions suggesting the whole genome is of potential use.(b) Individual genes: These plots zoom in on a subset of individual genes to focus on the number of TDNs within 8 barcode regions.The 16S region has the highest taxonomically diagnostic nucleotides per 150-base window shifted at 20base intervals.Mean TDN/w 150 i 20 in each region:COI, 7.257; CytB, 9.451; NAD1, 13.381; NAD5, 17.092; NAD4, 18.226; NAD2, 20.065;  12S, 20.814; 16S, 25.726.Sliding window analysis, by family, full mitogenome.The full alignment of 313 specimens is separated into individual families and a window 150 bp in length is placed along the length of an alignment of each family alignment and moved at 20-bp intervals.At each position, the number of diagnostic nucleotides-where a base

| 1893 DZIEDZIC
et al.Because projects of this scale require moderate financial support and substantial human effort, we strongly recommend assessing available resources before launching a new endeavour.We were able to complete this project on a relatively low budget (estimated cost without labour ~$250 mitogenome; Appendix S6) because individuals donated considerable amounts of their time collecting throughout the state of Oregon and because collaborating institutions provided genetic laboratory facilities and sequencing at reduced costs.
mens.Widely ranging mitogenomic databases would expand eDNA monitoring potential to more regions.Repositories of vouchered samples and full mitogenomic information as described herein not only provide the genetic information needed to use eDNA effectively for biodiversity studies (de Santana et al., 2021) but also can support investigations of taxonomy, population structure, landscape genetics and multilocus metabarcoding.Coupled with highmolecular-weight DNA extraction for nuclear genome sequencing, a project of this scope grows the global repository of sequence information in anticipation of an environmental genomics future and lays the groundwork to compile all the available genetic information for freshwater fishes or other taxa in a region of interest.