CRITTERBASE - Quality Management

Data quality controls are major components within CRITTERBASE and ensure that the imported data meet a high quality standard. There are basic quality components, such as the data model itself, and several other routines that flag mistakes through a number of logical checks before, during and after data import.

The various CRITTERBASE quality control components outlined here prevent data errors that may corrupt subsequent analyses. All CRITTERBASE quality component activities are logged by a component called ActionLog and are stored in the database.

 

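Table 1: Main components of CRITTERBASE quality control

| Component     | Basic | Pre-Import | Import  | Maintenance |
|---------------|-------|------------|---------|-------------|
| Data model    | X     | X          | X       | X           |
| MyWoRMS       | X     | X          | X       | X           |
| TaxonWizard   | X     |            | X (TW1) | X (TW2)     |
| ExcelTamer    |       | X          |         |             |
| TaxaPolishing |       | X          |         |             |
| BioDjinn      |       |            | X       |             |
| GeoCruncher   |       |            | X       |             |
| BiotaControl  |       |            |         | X           |
| ActionLog     |       |            | X       | X           |

Note: TW1 = create taxa list, TW2 = update taxa list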

 

 

1.    Basic quality components

1.1. CRITTERBASE data model

The data model (Figure 1) requires that all data are ingested in a certain format. This represents the central core of CRITTERBASE's quality control. The model includes (1) "metadata", i.e., cruise, station, sample, subsample and subset, and (2) "sampled data", i.e., information about biota, individual measures, sediment, etc. Within the data model, the latter are attached to the metadata framework. In addition, the model contains the category of "lookup data", i.e., inventory information (e.g., "ship", "positioning system", "gear") used multiple times across different data sets. To avoid redundant storage and reduce the likelihood of input errors (e.g., typing errors), the lookup data are stored in "lookup tables" separate from the metadata and sampled data.

 


Figure 1: Schematic data model

Note: Primary and foreign keys are defined according to database structures. The primary key consists of one or more columns whose data contained within are used to uniquely identify each row in the table (e.g., primary key "Cruise" in table "Cruise"). The foreign key is a set of one or more columns in a table (e.g., foreign key "Cruise" in table "Station") that refers to the primary key (here: "Cruise") in another table (here: "Cruise").

 

 

1.2. MyWoRMS

The MyWoRMS component in CRITTERBASE synchronises taxon information with WoRMS (World Register of Marine Species; www.marinespecies.org). MyWoRMS creates a "taxon object" for each taxon imported into CRITTERBASE: the AphiaID entered for the taxon is used to query WoRMS, and all information that WoRMS provides about this taxon is stored in the "taxon object", together with the date of query and storage (see details below). If the same AphiaID is re-imported into CRITTERBASE within 30 days (via another data set containing this AphiaID), the "taxon object" is still considered valid and WoRMS is not queried again. Otherwise, it is updated by synchronisation with WoRMS. MyWoRMS directly displays the age of all taxon entries, which can be updated manually at any time. All taxa are successively updated (from most recent to earliest) during a process called de-ageing. Taxon information in previously imported data sets will still reflect changes detected by MyWoRMS (see Maintenance - BiotaControl).
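The 30-day validity rule can be sketched as follows. This is a minimal illustration, not CRITTERBASE code: the cache layout, the field name `queried_at` and the function name are assumptions, and whether an entry that is exactly 30 days old still counts as valid is not stated in the text.

```python
from datetime import datetime, timedelta

# Validity window stated in the text.
MAX_AGE = timedelta(days=30)


def needs_worms_query(cache: dict, aphia_id: int, now: datetime) -> bool:
    """Return True if WoRMS must be queried for this AphiaID.

    `cache` is a hypothetical stand-in for MyWoRMS: it maps an AphiaID to a
    dict holding at least the time of the last WoRMS query ("queried_at").
    """
    entry = cache.get(aphia_id)
    if entry is None:
        # No taxon object yet: it has to be created from WoRMS.
        return True
    # Assumption: an entry aged exactly 30 days is still treated as valid.
    return now - entry["queried_at"] > MAX_AGE
```

A taxon object queried ten days ago is reused, whereas one queried two months ago (or one that does not exist yet) triggers a new WoRMS query.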

 

The most important pieces of information stored in the "taxon object" are:

·      AphiaID used for query in WoRMS

·      valid AphiaID

·      valid scientific taxon name

·      valid taxon descriptor (= valid authority)

·      taxonomic rank (e.g., species, genus, family)

·      status (i.e., accepted or unaccepted), and

·      reason for non-acceptance (e.g., duplicate).

 

Only the first parameter "AphiaID" used for the query in WoRMS originates from the data import file, while all other fields are populated with information obtained from WoRMS. During a query, WoRMS itself will return an unaccepted taxon if the imported AphiaID pointed to an unaccepted taxon. The MyWoRMS algorithm in CRITTERBASE ("taxon class") was extended to always trace the unaccepted taxon to the currently accepted taxon (as registered in WoRMS). The path from the unaccepted taxon to the currently accepted one is stored in the "taxon object" together with the AphiaID of the accepted taxon (= accepted AphiaID), the corresponding scientific taxon name (= accepted scientific name) and taxon descriptor (= accepted authority).   

 

A major advantage of MyWoRMS is that a query for a taxon already stored in MyWoRMS requires no network time, i.e., queries for about 5000 taxa run in a few seconds, whereas the same queries to WoRMS may take up to half an hour. Another advantage of the local CRITTERBASE component is that MyWoRMS checks each "taxon object" for the following possible problems:

(1)  Taxon status

The following parameters are used to identify the taxon status of a particular "taxon object": (i) imported AphiaID, (ii) valid AphiaID, (iii) accepted AphiaID, (iv) status and (v) unaccepted reason. In addition to "accepted" and "unaccepted" taken from WoRMS, in CRITTERBASE a status can also be "in limbo", which defines a taxon that is on its way from accepted to unaccepted but has not yet arrived (or may never arrive) there.

The identification of the taxon status is realised using the following logical matrix:

 

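Table 2: Logical matrix to identify the taxon status

| Taxon status | AphiaID = ValidAphiaID = AcceptedAphiaID |     | WoRMS status     |     | Unaccepted reason |
|--------------|------------------------------------------|-----|------------------|-----|-------------------|
| accepted     | true                                     | and | = "accepted"     | and | *                 |
| unaccepted   | false                                    | or  | = "unaccepted"   | and | *                 |
| limbo        | true                                     | and | != "accepted" ¹  | and | *                 |

* = unaccepted reason is disregarded when identifying taxon status.

¹ Here WoRMS reports a status such as nomen dubium or taxon inquirendum.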
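The logical matrix can be read as a small decision function. The sketch below is an illustration under assumptions: the function and parameter names are hypothetical, and because the "unaccepted" and "limbo" rows overlap when all three AphiaIDs match but WoRMS reports "unaccepted", the "unaccepted" row is evaluated first here.

```python
def taxon_status(aphia_id, valid_aphia_id, accepted_aphia_id, worms_status):
    """Classify a taxon object as "accepted", "unaccepted" or "limbo".

    The unaccepted reason is ignored, as stated below the matrix.
    """
    ids_match = aphia_id == valid_aphia_id == accepted_aphia_id

    # Row "accepted": all IDs identical AND WoRMS status is "accepted".
    if ids_match and worms_status == "accepted":
        return "accepted"
    # Row "unaccepted": IDs differ OR WoRMS status is "unaccepted".
    # Assumption: this row takes precedence over "limbo" where both match.
    if not ids_match or worms_status == "unaccepted":
        return "unaccepted"
    # Row "limbo": IDs identical, but WoRMS reports some other status
    # (e.g. "nomen dubium" or "taxon inquirendum").
    return "limbo"
```

For example, a taxon whose three AphiaIDs are identical but whose WoRMS status is "taxon inquirendum" is classified as "limbo".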

 

(2)  Status problems

MyWoRMS detects whether (i) multiple statuses are associated with a "taxon object", (ii) the taxon is accepted but no accepted scientific taxon name is recorded in WoRMS, or (iii) the taxon is accepted but WoRMS still lists an unaccepted reason entry.

In addition, MyWoRMS investigates whether a change in taxon status from "accepted" to "unaccepted" or "in limbo" has an impact on any unaccepted taxa previously traced to the formerly accepted taxon (with its accepted AphiaID). For example, if taxon t1 is unaccepted, MyWoRMS follows the valid AphiaIDs until an accepted taxon tn is found (then the AphiaID of tn is the accepted AphiaID for t1). However, if tn now changes its status to unaccepted, the chain of AphiaIDs becomes invalid, because it no longer terminates with a valid AphiaID.
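Tracing an unaccepted taxon t1 to its accepted taxon tn might look like the sketch below. It assumes a hypothetical in-memory mapping from AphiaID to a record with the fields `status` and `valid_aphia_id`; the actual "taxon class" is not shown in this document.

```python
def trace_to_accepted(taxa: dict, start_id: int):
    """Follow valid AphiaIDs from start_id until an accepted taxon is found.

    Returns (path, accepted_id). accepted_id is None if the chain does not
    end at an accepted taxon, i.e. the chain is invalid.
    """
    path = [start_id]
    current = start_id
    while True:
        record = taxa.get(current)
        if record is None:
            # Unknown AphiaID: the chain cannot be completed.
            return path, None
        if record["status"] == "accepted":
            return path, current
        next_id = record["valid_aphia_id"]
        if next_id in path:
            # The chain points back to a taxon already visited (including
            # itself) without reaching an accepted taxon.
            return path, None
        path.append(next_id)
        current = next_id
```

If tn later changes its status to unaccepted, the same call returns `None` as the accepted AphiaID, which is the situation MyWoRMS flags for every taxon previously traced to tn.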

(3)  Scientific taxon name

MyWoRMS checks whether (i) the scientific taxon name is missing, (ii) the taxon is accepted but no accepted scientific name is given, or (iii) the taxon is accepted (or in limbo) but the scientific name, the valid scientific name and the accepted scientific name differ. Case (i) can occur when the scientific name of a particular taxon changes in WoRMS: if a taxon status is changed from accepted to unaccepted, WoRMS deletes the corresponding (now invalid) scientific name for this AphiaID, leaving the record without a scientific name (i.e., a blank field) but with a reference to the new accepted taxon.

 

These checks run automatically for every single "taxon object" created in MyWoRMS when a data set is imported into CRITTERBASE. The full query log is provided, including the different types of problems that can occur in the import process. The user is then prompted to perform possible further actions on the taxa (via the import data file).

 

Because MyWoRMS can be backed up via snapshots and restored on other systems, it can be replicated on several computers fairly easily (within the 30-day update period), which avoids otherwise cost- and time-intensive rebuilds.

 

1.3. TaxonWizard

TaxonWizard builds a comprehensive taxa list across the respective target data sets stored in CRITTERBASE. This list is used by any external analytical routines (implemented in R or Python) that process CRITTERBASE data, such as the computation of biodiversity indices or secondary production measures. The basis for the taxa list is the TaxBase list. Taxa can enter the TaxBase list in only three ways: (1) direct import of a TaxBase list, (2) import via the "biota sheet" (i.e., the sheet in the Excel template where the scientific taxon names and/or AphiaIDs are given) during data import, or (3) manual entry. Each of these three import paths is logged, ensuring that no taxon can enter the TaxBase list untracked.

In addition to the TaxBase list, CRITTERBASE offers the option of independent expert knowledge lists created by the user (i.e., the information does not come from MyWoRMS). For example, the user/expert can enter the AphiaIDs of taxa that are defined as colonial organisms into the TaxColony list. In the TaxPrivate list, the user can also assign an alternate scientific taxon name to a corresponding AphiaID (from MyWoRMS), i.e., an unofficial taxon name (e.g., from a historical record) that does not exist in MyWoRMS as either an accepted or an unaccepted taxon name.

 

TaxonWizard takes all AphiaIDs from the TaxBase list and any expert knowledge lists, if present. For each AphiaID, TaxonWizard queries the respective "taxon object" in MyWoRMS and creates the corresponding final entries in the taxa list (with all information stored to the "taxon object"; see details above in MyWoRMS).

TaxonWizard checks the final taxa list using the same routines as MyWoRMS. The check that MyWoRMS performs is purely "taxon object" based, i.e., each taxon is tested independently. The TaxonWizard taxa list, in contrast, provides context between taxa (i.e., which taxa should be conflated), so TaxonWizard can also check consistency between taxa by performing the following quality tests:

(1)  Check for taxa with different AphiaIDs but the same scientific taxon name. As this is considered a legitimate case (e.g., there may be several unaccepted taxa in an AphiaID cascade that have the same scientific name), TaxonWizard only records these instances but does not create an error or warning message.

(2)  Check for taxa with different AphiaIDs but the same accepted scientific taxon name. As this case should not occur, it is considered an error and a corresponding warning message is generated.
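Both consistency tests amount to grouping the taxa list by a name field and looking for names shared by more than one AphiaID. The following sketch assumes that each taxa list entry is a dict with the hypothetical keys `aphia_id`, `scientific_name` and `accepted_scientific_name`; it follows the wording of the two tests literally and is not TaxonWizard's implementation.

```python
from collections import defaultdict


def shared_names(entries, name_field):
    """Map each name to the sorted AphiaIDs that share it (two or more)."""
    ids_by_name = defaultdict(set)
    for entry in entries:
        name = entry.get(name_field)
        if name:
            ids_by_name[name].add(entry["aphia_id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


def check_taxa_list(entries):
    """Return (records, warnings) for tests (1) and (2)."""
    # Test (1): same scientific name is legitimate, so it is only recorded.
    records = [
        f"recorded: '{name}' is used by AphiaIDs {ids}"
        for name, ids in shared_names(entries, "scientific_name").items()
    ]
    # Test (2): same accepted scientific name should not occur -> warning.
    warnings = [
        f"warning: accepted name '{name}' is shared by AphiaIDs {ids}"
        for name, ids in shared_names(entries, "accepted_scientific_name").items()
    ]
    return records, warnings
```

Keeping the grouping in one helper means both tests differ only in the field they inspect and in how the result is reported.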

 

2.    Pre-import quality components

CRITTERBASE features a number of components that ensure data-quality control prior to data import.

2.1. ExcelTamer

CRITTERBASE allows data input through the import of Excel files, which users create from Excel templates designed for this purpose. If an Excel import file deviates from the template, e.g., with regard to sheet and/or column names, ExcelTamer restores the correct format to ensure an error-free data import. It does so by mapping unknown sheet and column names to the correct template names using an alias list. Missing sheets and columns (for which no alias exists) are also added if this is necessary and possible without losing any information from the original import file.
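The alias mechanism can be illustrated with a simple lookup. The template columns and aliases below are invented for the example (the real template and alias list are not given in this document), and the sketch covers column names only.

```python
# Hypothetical template columns and alias list, for illustration only.
TEMPLATE_COLUMNS = ["AphiaID", "ScientificName", "Number"]
ALIASES = {
    "aphia id": "AphiaID",
    "taxon": "ScientificName",
    "scientific name": "ScientificName",
    "count": "Number",
}


def tame_columns(columns):
    """Map column names to template names.

    Returns (renamed, missing, unknown):
      renamed - the input columns with aliases replaced by template names
      missing - template columns absent from the file (candidates to add)
      unknown - columns that match neither the template nor an alias
    """
    canonical = {name.lower(): name for name in TEMPLATE_COLUMNS}
    renamed, unknown = [], []
    for column in columns:
        key = column.strip().lower()
        if key in canonical:
            renamed.append(canonical[key])
        elif key in ALIASES:
            renamed.append(ALIASES[key])
        else:
            # Kept unchanged so that no information from the file is lost.
            renamed.append(column)
            unknown.append(column)
    missing = [name for name in TEMPLATE_COLUMNS if name not in renamed]
    return renamed, missing, unknown
```

With these assumed lists, `tame_columns(["Aphia ID", "taxon", "Comment"])` renames the first two columns, reports "Number" as missing and "Comment" as unknown.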

2.2. TaxaPolishing

TaxaPolishing checks and completes missing taxa information in the import file by querying MyWoRMS for (a) AphiaIDs that are missing for given scientific taxon names or (b) scientific taxon names that are missing for given AphiaIDs. If the respective AphiaID or taxon name is not yet stored in MyWoRMS, or the entry in MyWoRMS is outdated (older than 30 days), a new/current "taxon object" is created by querying WoRMS and stored in MyWoRMS (see details in Basic Components - MyWoRMS).

Searching for a missing AphiaID using the scientific taxon name can be done by (1) the "normal search" procedure, where only results matching the search term exactly are returned (resulting in one hit per search); or (2) the "fuzzy search", which queries WoRMS for approximate matches, meaning results will be returned even if typing errors in the search request had occurred (multiple hits returned if necessary). The normal search process is preset; a fuzzy search must be selected separately by the user.
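The difference between the two search modes can be shown on a local list of names. Note the substitution: in CRITTERBASE the fuzzy search is answered by WoRMS, whose matching algorithm is not described here, so Python's standard `difflib` is used purely as a stand-in. The function names, the cutoff and the hit limit are assumptions.

```python
import difflib


def normal_search(name, known_names):
    """Exact match only: at most one hit for a list of unique names."""
    return [candidate for candidate in known_names if candidate == name]


def fuzzy_search(name, known_names, max_hits=5, cutoff=0.8):
    """Approximate match: may return several hits, best match first.

    difflib is a stand-in for the approximate matching done by WoRMS.
    """
    return difflib.get_close_matches(name, known_names, n=max_hits, cutoff=cutoff)
```

A misspelt name such as "Abra alva" yields no hit in the normal search, while the fuzzy search can still propose "Abra alba" for the user to confirm.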

Searching for a missing scientific name by an AphiaID (given in the input sheet) returns at most one hit: exactly one taxon name is found for a given AphiaID, or no result at all if the given AphiaID does not exist in WoRMS.

All hits for missing taxonomic information are automatically provided in a final log, structured according to the different classes of problems encountered by TaxaPolishing.

 

3.    Import quality components

CRITTERBASE also performs logic checks on data scenarios during data set import. Different data scenarios arise from different combinations of biota and sampling area inputs (Table 3). Biotic data can be provided in various units: (i) numbers (i.e., counts), (ii) abundances (i.e., densities) or (iii) presence/absence. Similarly, information on the sampling area may be available in different forms: (i) actually sampled area, (ii) reference area (fictional/defined area), (iii) sampling area calculated from gear information or (iv) unknown area. Depending on the combination of biotic and sampling area inputs, the possible data product types are ABCD, A, B, C or D (Table 3). A data scenario of type A allows for the assessment of biomass or abundance (i.e., density). Type B only allows for the determination of the number of species per sampling unit. Type C allows the user to assess the number of species or number of individuals per sample, whereas type D only allows for the determination of the number of species per sample. The compound type ABCD supports all types of biodiversity analysis.

3.1. BioDjinn

BioDjinn checks the "data scenario" during data import (Table 3) and inspects the data to ensure entries abide by the scenario's logic. A violation of a logic rule triggers an error message and prevents the import of data into CRITTERBASE. For example, if Scenario 1b is detected, BioDjinn checks whether a reference area is given for abundance values. If this is not the case, a warning message draws attention to the missing reference area. If no information on the reference area is available for the data set, the data have to be amended accordingly and imported under Scenario 4 (presence/absence data). BioDjinn does not convert data automatically, so that the user has the opportunity to provide missing data, if available. If the imported data set contains further information (e.g., number), then Scenario 0c applies instead of Scenario 1b.

If a data set contains information that does not lead to another scenario and thus violates the logic rules (e.g., it encompasses both presence/absence and abundance data), the data cannot be imported and an error message is generated.

Furthermore, BioDjinn checks whether number and abundance values are >= 0, flagging accidentally entered negative values. Floating-point numbers are only accepted as values for "number" if subsample information is given. For presence/absence data, only 1 or 0 are accepted values; otherwise an error message is displayed.
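These value checks can be sketched as three small validators. The function names and message texts are hypothetical, and "floating-point number" is interpreted here as a value with a fractional part, which is an assumption about how BioDjinn treats values such as 3.0.

```python
def check_number(value, has_subsample):
    """Return an error message for an invalid "number" value, else None."""
    if value < 0:
        return "error: number must be >= 0"
    # Assumption: only values with a fractional part count as floating-point.
    if value != int(value) and not has_subsample:
        return "error: non-integer number requires subsample information"
    return None


def check_abundance(value):
    """Return an error message for an invalid abundance value, else None."""
    if value < 0:
        return "error: abundance must be >= 0"
    return None


def check_presence_absence(value):
    """Return an error message unless the value is 0 or 1."""
    if value not in (0, 1):
        return "error: presence/absence must be 0 or 1"
    return None
```

Returning a message instead of raising an exception lets a caller collect all problems of a data set before rejecting the import.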


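Table 3: Data Scenario Matrix

| Scenario | Number N [-] | Abundance A [N/RA] | Presence/Absence PA [0,1] | Sampled area SA [m²] | Reference area RA [m²] | Area calculated by gear GSA [m²] | Replicate Rep [-] | Rules (R) and formulas (F) (for definitions see below table) | Product |
|----------|--------------|--------------------|---------------------------|----------------------|------------------------|----------------------------------|-------------------|--------------------------------------------------------------|---------|
| 0a | + |   |   | (+) |   | (+) | 1 | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used | ABCD |
| 0b | + |   |   | (+) |   | (+) | n | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used; if necessary, consider F3 for scientific analyses | ABCD |
| 0c | + | + |   |   | + | (+) |   | Rule R5 may apply; Formula F4 applies; Formula F1 or F2 may have to be used | ABCD |
| 1a |   | + |   |   | + | + |   | Formula F5 must be applied | ABCD |
| 1b |   | + |   |   | + |   |   |   | A |
| 2  | + |   |   |   |   |   |   |   | C |
| 3  |   |   | + | (+) |   | (+) |   | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used | B |
| 4  |   |   | + |   |   |   |   |   | D |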

General rules:

(R1) SA and/or GSA must be given.

(R2) If SA is given, it takes precedence over GSA.

(R3) If SA is not given, GSA must be calculated (see Formulas F1 and F2).

(R4) If replicates per sample are given, SA or GSA represents the sampled area per replicate.

(R5) If GSA is given, calculated SA (see Formula F4) is tested against GSA to check if SA and GSA are equal. The test is performed for all gears except trawls.

General formulas:

(F1) GSA (for grab samples) = grab length * grab width

(F2) GSA (for trawl samples) = trawl gear width * sampling distance; sampling distance = towed speed * towed time or sampling distance is defined by latitude/longitude coordinates of sampling start and end (these formulas are generally valid for all towed gears)

(F3) Total SA/sample = n rep * SA; total GSA/sample = n rep * GSA

(F4) SA = (N*RA)/A

(F5) N = (GSA*A)/RA
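The rules and formulas translate directly into a few helper functions. The sketch below is illustrative only: the function names are hypothetical, only the speed * time variant of F2 is shown, all lengths are assumed to be in metres (so the speed is in m/s and the time in s), and the tolerance used for the R5 comparison is an assumption.

```python
import math


def gsa_grab(length_m, width_m):
    """F1: area covered by a grab."""
    return length_m * width_m


def gsa_trawl(gear_width_m, speed_m_per_s, time_s):
    """F2: gear width * sampling distance, with distance = speed * time."""
    return gear_width_m * speed_m_per_s * time_s


def effective_area(sa, gsa):
    """R1-R3: SA takes precedence over GSA; at least one must be given."""
    if sa is not None:
        return sa
    if gsa is not None:
        return gsa
    raise ValueError("R1 violated: neither SA nor GSA is given")


def total_area(n_rep, area_per_replicate):
    """F3: total sampled area per sample from the area per replicate (R4)."""
    return n_rep * area_per_replicate


def sa_from_number(n, ra, a):
    """F4: SA = (N * RA) / A."""
    return n * ra / a


def number_from_abundance(gsa, a, ra):
    """F5: N = (GSA * A) / RA."""
    return gsa * a / ra


def r5_consistent(n, ra, a, gsa, gear, rel_tol=1e-6):
    """R5: compare SA calculated via F4 with GSA; not applied to trawls."""
    if gear == "trawl":
        return True
    return math.isclose(sa_from_number(n, ra, a), gsa, rel_tol=rel_tol)
```

As a worked example, 20 individuals with an abundance of 200 per 1 m² reference area give SA = (20 * 1) / 200 = 0.1 m² by F4, and inserting GSA = 0.1 m² into F5 returns N = 20 again.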


3.2. GeoCruncher

GeoCruncher checks the geographical coordinate format used during data import (e.g., degrees° minutes′ seconds″, degrees° decimal minutes), and, if necessary, automatically converts them to the decimal degree format that is used consistently throughout CRITTERBASE. If conversion fails, the data import process is aborted.
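A coordinate conversion of this kind can be sketched with regular expressions. The accepted notations below (hemisphere letter at the end, ASCII or typographic minute and second marks) are assumptions for the example; the exact formats GeoCruncher recognises are not listed here. A failed conversion raises an error, mirroring the aborted import.

```python
import re

# Accept typographic and ASCII marks for minutes and seconds (assumption).
MINUTE_MARK = "[′']"
SECOND_MARK = '[″"]'

DMS = re.compile(
    r"^\s*(\d+)°\s*(\d+)" + MINUTE_MARK
    + r"\s*(\d+(?:\.\d+)?)" + SECOND_MARK + r"\s*([NSEW])\s*$"
)
DDM = re.compile(r"^\s*(\d+)°\s*(\d+(?:\.\d+)?)" + MINUTE_MARK + r"\s*([NSEW])\s*$")
DECIMAL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*$")


def to_decimal_degrees(text: str) -> float:
    """Convert a coordinate string to decimal degrees (S and W are negative)."""
    match = DMS.match(text)
    if match:
        degrees, minutes, seconds, hemisphere = match.groups()
        value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    else:
        match = DDM.match(text)
        if match:
            degrees, minutes, hemisphere = match.groups()
            value = int(degrees) + float(minutes) / 60
        else:
            match = DECIMAL.match(text)
            if match is None:
                raise ValueError(f"unrecognised coordinate format: {text!r}")
            # Already in decimal degrees; the sign carries the hemisphere.
            return float(match.group(1))
    return -value if hemisphere in "SW" else value
```

Under these assumptions, both "54°30′00″N" and "54°30.0′N" convert to 54.5, and "7°15′S" converts to -7.25.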

3.3. Further quality components

Subset

The "sampled data" (i.e., biota, individual measures, sediments) are always primarily associated with their "subset" metadata ID to ensure the identification of multiple studies of the same sample (e.g., with regard to different taxonomic groups collected from the same grab or trawl sample). The uniqueness of a data set is checked using the parameter tuple of (i) taxonomic coverage (i.e., the targeted community fraction of the study, e.g., macrozoobenthos) and (ii) taxonomic resolution (i.e., the taxonomic level of identification).

Expansion of existing metadata entries

Before a "cruise" stored in CRITTERBASE is extended by a further "station" of the (supposedly) same "cruise" from a later data ingestion, the metadata entries of the two cruises (e.g., scientific leader, ship) are checked for consistency. If metadata entries differ, CRITTERBASE generates an error message, a warning or just an information entry, depending on the default settings used.

Uniqueness of lookup data

CRITTERBASE checks the uniqueness of the following quadruple condition per subset: (i) AphiaID, (ii) life stage, (iii) sieve, and (iv) specification. One AphiaID may be named more than once in a subset only if it differs in its combination with the three other attributes. Accidental false multiple entries per subset (e.g., the same taxon of the same life stage) are thus avoided.
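The quadruple condition corresponds to a uniqueness check on a composite key. A minimal sketch, assuming the biota rows of one subset are available as dicts with the hypothetical keys `aphia_id`, `life_stage`, `sieve` and `specification`:

```python
from collections import Counter


def duplicate_biota_keys(rows):
    """Return the (AphiaID, life stage, sieve, specification) combinations
    that occur more than once within one subset."""
    keys = [
        (row["aphia_id"], row["life_stage"], row["sieve"], row["specification"])
        for row in rows
    ]
    return [key for key, count in Counter(keys).items() if count > 1]
```

The same AphiaID listed once as "adult" and once as "juvenile" passes the check, whereas two rows that agree in all four attributes are reported.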

 

4.    Quality components for maintenance

4.1. BiotaControl

BiotaControl checks the biotic data (i.e., taxa) on demand (i.e., user input is required) to keep them up-to-date with current "taxon objects" in MyWoRMS and detects any emerging inconsistencies, e.g., regarding the AphiaID, the scientific taxon name or the taxon status.