CRITTERBASE - Quality Management
Data quality controls are major components of
CRITTERBASE and ensure that the imported data meet a high quality standard. They
comprise basic quality components, such as the data model itself, and several routines
that flag mistakes through a number of logical checks before, during and after data
import.
The CRITTERBASE quality control components outlined here prevent data errors that
could corrupt subsequent analyses. All activities of the quality components are logged
by a component called ActionLog and are stored in the database.
Table 1: Main components of CRITTERBASE quality control
| Component | Basic | Pre-Import | Import | Maintenance |
|---|---|---|---|---|
| Data model | X | X | X | X |
| MyWoRMS | X | X | X | X |
| TaxonWizard | X | | X (TW1) | X (TW2) |
| ExcelTamer | | X | | |
| TaxaPolishing | | X | | |
| BioDjinn | | | X | |
| GeoCruncher | | | X | |
| BiotaControl | | | | X |
| ActionLog | | | X | X |
Note: TW1 = create taxa list, TW2 = update taxa list
1. Basic quality components
1.1. CRITTERBASE data model
The data model (Figure 1) requires that all data are ingested in a defined format and
thus forms the core of CRITTERBASE's quality control. The model includes (1)
"metadata", i.e., cruise, station, sample, subsample and subset, and (2) "sampled
data", i.e., information about biota, individual measures, sediment, etc. Within the
data model, the latter are attached to the metadata framework. In addition, the model
contains the category of "lookup data", i.e., inventory information (e.g.,
"ship", "positioning system", "gear") used
multiple times across different data sets. To avoid redundant storage and
reduce the likelihood of input errors (e.g., typing errors), the lookup data
are stored in "lookup tables" separate from the metadata and sampled
data.
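The following minimal sketch shows how a metadata hierarchy and a lookup table of this kind can be enforced with primary and foreign keys. It is an illustration only: the table and column names are hypothetical and heavily simplified, and SQLite is used merely because it ships with Python, not because CRITTERBASE is known to use it.

```python
import sqlite3

# Hypothetical, simplified schema; names are not taken from CRITTERBASE itself.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
    CREATE TABLE ship (name TEXT PRIMARY KEY);           -- lookup table
    CREATE TABLE cruise (
        cruise TEXT PRIMARY KEY,
        ship   TEXT NOT NULL REFERENCES ship(name)       -- lookup reference
    );
    CREATE TABLE station (
        station TEXT NOT NULL,
        cruise  TEXT NOT NULL REFERENCES cruise(cruise), -- metadata hierarchy
        PRIMARY KEY (cruise, station)
    );
""")
conn.execute("INSERT INTO ship VALUES ('Example Ship')")
conn.execute("INSERT INTO cruise VALUES ('C1', 'Example Ship')")
conn.execute("INSERT INTO station VALUES ('S1', 'C1')")


def try_insert(sql):
    """Return True if the statement is accepted, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A station that refers to an unknown cruise is rejected.
orphan_ok = try_insert("INSERT INTO station VALUES ('S2', 'C9')")
# A misspelled ship name does not exist in the lookup table and is rejected too.
typo_ok = try_insert("INSERT INTO cruise VALUES ('C2', 'Exmaple Ship')")
```

In this sketch the database itself refuses rows that break the hierarchy or contain a lookup value with a typing error, which is the effect the lookup tables are meant to achieve.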
Figure 1: Schematic data model
Note: Primary and foreign keys are defined according
to database structures. A primary key consists of one or more columns whose
values uniquely identify each row in a table
(e.g., primary key "Cruise" in table "Cruise"). A foreign
key is a set of one or more columns in a table (e.g., foreign key "Cruise"
in table "Station") that refers to the primary key (here:
"Cruise") of another table (here: "Cruise").
1.2. MyWoRMS
The MyWoRMS component in CRITTERBASE synchronises taxon information with WoRMS (World
Register of Marine Species; www.marinespecies.org). MyWoRMS creates a "taxon object" for
each taxon imported into CRITTERBASE: the AphiaID entered for the taxon is used to query
WoRMS, and all information that WoRMS returns for this AphiaID is stored in the "taxon
object", together with the date of query and storage (see details below). If the same
AphiaID is re-imported into CRITTERBASE within 30 days (via another data set containing
this AphiaID), the "taxon object" is still considered valid and WoRMS is
not queried again. Otherwise, it is updated by synchronisation with WoRMS. MyWoRMS
directly displays the age of all taxon entries, which can be updated manually at any
time. All taxa are successively
updated (from most recent to earliest) during a process defined as de-ageing.
Taxon information of data sets that have been imported previously will still
reflect changes detected by MyWoRMS (see
Maintenance - BiotaControl).
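The 30-day validity rule can be sketched as a small cache lookup. This is a minimal illustration, not the MyWoRMS implementation: the cache layout and the `fetch` callable, which stands in for the actual WoRMS query, are assumptions.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # validity period stated in the text


def get_taxon(aphia_id, cache, fetch, today):
    """Return the stored taxon record, refreshing it if missing or older than 30 days.

    `cache` maps AphiaID -> (record, query_date); `fetch` is a placeholder for
    the WoRMS query. Both are hypothetical.
    """
    entry = cache.get(aphia_id)
    if entry is not None and today - entry[1] <= MAX_AGE:
        return entry[0]               # still valid: WoRMS is not queried again
    record = fetch(aphia_id)          # missing or outdated: synchronise
    cache[aphia_id] = (record, today)
    return record
```

Passing `today` explicitly keeps the function deterministic, which makes the ageing behaviour easy to test.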
The most important items of information stored in the "taxon object" are:
· AphiaID used for query in WoRMS
· valid AphiaID
· valid scientific taxon name
· valid taxon descriptor (= valid authority)
· taxonomic rank (e.g., species, genus, family)
· status (i.e., accepted or unaccepted), and
· reason for non-acceptance (e.g., duplicate).
Only the first
parameter "AphiaID" used for the query in WoRMS originates from the data
import file, while all other fields are populated with information obtained
from WoRMS. During a query, WoRMS
itself will return an unaccepted taxon if the imported AphiaID pointed to an
unaccepted taxon. The MyWoRMS algorithm
in CRITTERBASE ("taxon class") was extended to always trace the
unaccepted taxon to the currently accepted taxon (as registered in WoRMS). The
path from the unaccepted taxon to the currently accepted one is stored in the
"taxon object" together with the AphiaID of the accepted taxon (=
accepted AphiaID), the
corresponding scientific taxon name (= accepted scientific
name) and taxon descriptor (= accepted authority).
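The tracing step described above can be sketched as follows. The record layout (a dictionary with `status` and `valid_aphia_id`) and the step limit are assumptions made for illustration; the actual "taxon class" may work differently.

```python
def trace_to_accepted(aphia_id, records, max_steps=50):
    """Follow valid AphiaIDs from a taxon to the currently accepted taxon.

    `records` maps AphiaID -> {"status": ..., "valid_aphia_id": ...}, a
    hypothetical stand-in for the stored "taxon objects".
    Returns (accepted AphiaID or None, path of visited AphiaIDs).
    """
    path = [aphia_id]
    current = aphia_id
    for _ in range(max_steps):
        rec = records.get(current)
        if rec is None:
            return None, path          # chain broken: AphiaID not known
        if rec["status"] == "accepted":
            return current, path       # chain ends at an accepted taxon
        nxt = rec["valid_aphia_id"]
        if nxt is None or nxt in path:
            return None, path          # dead end or circular reference
        path.append(nxt)
        current = nxt
    return None, path
```

Returning the path together with the accepted AphiaID mirrors the idea of storing the route from the unaccepted to the accepted taxon; a result of `None` marks a chain that no longer ends at an accepted taxon.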
A major advantage of MyWoRMS is that a query for a taxon already
stored in MyWoRMS requires no network
time, i.e., queries for about 5000 taxa run in a few seconds, whereas the same
queries to WoRMS may take up to half an hour. Another advantage of the local
CRITTERBASE component is that MyWoRMS checks each "taxon object" for
the following possible problems:
(1) Taxon status
The following parameters are used to
identify the taxon status of a particular "taxon object": (i) imported AphiaID, (ii) valid
AphiaID, (iii) accepted AphiaID, (iv) status and (v) unaccepted reason. In addition to "accepted"
and "unaccepted" taken from WoRMS, in CRITTERBASE a status can also
be "in limbo", which defines a taxon that is on its way from accepted
to unaccepted but has not yet arrived (or may never arrive) there.
The identification of the taxon status
is realised
using the following logical matrix:
Table 2: Logical matrix to identify the taxon status
| Taxon status | AphiaID = ValidAphiaID = AcceptedAphiaID | | Status (from WoRMS) | | Unaccepted reason |
|---|---|---|---|---|---|
| accepted | true | and | = "accepted" | and | * |
| unaccepted | false | or | = "unaccepted" | and | * |
| limbo | true | and | != "accepted" (1) | and | * |
* = unaccepted reason is disregarded when identifying taxon status.
(1) Here WoRMS gives information such as nomen dubium or taxon inquirendum.
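Read as code, the logical matrix might look like the sketch below. Function and parameter names are hypothetical. Because the "unaccepted" and "limbo" rows overlap when all AphiaIDs are equal and WoRMS reports "unaccepted", the sketch assumes that "unaccepted" takes precedence; this ordering is an interpretation, not something the table states.

```python
def taxon_status(aphia_id, valid_id, accepted_id, worms_status):
    """Classify a taxon as 'accepted', 'unaccepted' or 'in limbo' (cf. Table 2).

    The unaccepted reason is disregarded, as noted below the table.
    """
    ids_equal = aphia_id == valid_id == accepted_id
    if ids_equal and worms_status == "accepted":
        return "accepted"
    # Assumption: this row is evaluated before the limbo row.
    if not ids_equal or worms_status == "unaccepted":
        return "unaccepted"
    # IDs are equal, but WoRMS reports e.g. "nomen dubium" or "taxon inquirendum".
    return "in limbo"
```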
(2) Status problems
MyWoRMS detects whether (i) multiple statuses are associated with a "taxon
object", (ii) the taxon is accepted but no accepted scientific taxon
name is recorded in WoRMS, or (iii) the taxon is accepted but WoRMS still lists
an unaccepted reason entry.
In addition, MyWoRMS investigates whether a change in taxon status from "accepted"
to "unaccepted" or "in limbo" has an impact on any unaccepted
taxa previously traced to the formerly accepted taxon (with its accepted AphiaID).
For example, if taxon t1 is unaccepted, MyWoRMS
follows the valid AphiaIDs until an accepted taxon tn is found (then the
AphiaID of tn is the accepted AphiaID for t1). However, if tn now changes its
status to unaccepted, the chain of AphiaIDs becomes invalid, because it no
longer terminates with a valid AphiaID.
(3) Scientific taxon name
MyWoRMS checks whether (i) the scientific taxon name is missing, (ii) the taxon is
accepted but no accepted scientific name is given, or (iii) the taxon is accepted (or
in limbo) but the scientific name, the valid scientific name and the accepted
scientific name differ. Case (i) can occur when the scientific name of a particular
taxon changes in WoRMS: if a taxon status is changed from accepted to unaccepted, WoRMS
deletes the corresponding (now invalid) scientific name for this AphiaID, leaving the
record without a scientific name (i.e., blank field) but with a reference to
the new accepted taxon.
These checks run automatically for every "taxon object" created in MyWoRMS when a data
set is imported into CRITTERBASE. The full query log is provided, including the
different types of problems that can occur in the import process. The user is then
prompted to take any further action needed on the taxa (via the import data file).
Because MyWoRMS can be backed up via snapshots and restored on other systems, it can be
replicated on several computers fairly easily (within the 30-day update period),
avoiding rebuilds that would otherwise be costly and time-consuming.
1.3. TaxonWizard
TaxonWizard builds a comprehensive taxa list across
the respective target data sets stored in CRITTERBASE, which is used by any
external analytical routines (implemented in R or Python) that process CRITTERBASE
data, such as the computation of biodiversity indices or secondary production measures.
The basis for the taxa list is the TaxBase list. Taxa can enter the TaxBase list in
only three ways: (1) direct import of a TaxBase list, (2) import
via the "biota sheet" (i.e., the sheet in the Excel template where
the scientific taxa names and/or AphiaIDs are given) during data import, or (3)
manual entry. Each of these three import paths is logged, ensuring that no
taxon can enter the TaxBase list untracked.
In addition to the TaxBase list, CRITTERBASE offers the option of independent expert
knowledge lists created by the user (i.e., the information does not come from MyWoRMS).
For example, the user/expert can enter the AphiaIDs of taxa that are defined as colonial
organisms into the TaxColony list. In the TaxPrivate list, the user can also assign an
alternative scientific taxon name to an AphiaID (from MyWoRMS), i.e., an unofficial
taxon name (e.g., from a historical record) that does not exist in MyWoRMS
as either an accepted or an unaccepted taxon name.
TaxonWizard takes all AphiaIDs from the TaxBase list and from any expert knowledge
lists, if present. For each AphiaID, TaxonWizard queries the
respective "taxon object" in MyWoRMS
and creates the corresponding final entry in the taxa list (with all information
stored in the "taxon object"; see details above in MyWoRMS).
TaxonWizard checks the final taxa list using the same
routines as MyWoRMS. The check that MyWoRMS performs is purely "taxon
object" based, i.e., each taxon is tested independently. The TaxonWizard taxa list, in
contrast, provides context between taxa (i.e., which taxa should be conflated), so
TaxonWizard can also check consistency
between taxa by performing the following quality tests:
(1) Check for taxa with different AphiaIDs but the same
scientific taxon name. As this is considered a legitimate case (e.g., there may
be several unaccepted taxa in an AphiaID cascade that have the same scientific
name), TaxonWizard only records these
instances but does not create an error or warning message.
(2) Check for taxa with different AphiaIDs but the same
accepted scientific taxon name. As this case should not occur, it is considered
an error and a corresponding warning message is generated.
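Both tests amount to grouping the taxa list by a name field and looking for names shared by more than one AphiaID. The sketch below follows the literal wording of the two checks; the dictionary keys are hypothetical, and whether check (2) compares imported or accepted AphiaIDs in practice is not specified here, so the id field is left as a parameter.

```python
from collections import defaultdict


def shared_names(taxa, name_key, id_key="aphia_id"):
    """Return {name: sorted ids} for every name carried by more than one id.

    `taxa` is a list of dicts; the key names are illustrative assumptions.
    """
    groups = defaultdict(set)
    for taxon in taxa:
        groups[taxon[name_key]].add(taxon[id_key])
    return {name: sorted(ids) for name, ids in groups.items() if len(ids) > 1}


def check_taxa_list(taxa):
    """Run both consistency tests; returns (recorded notes, warnings)."""
    notes = shared_names(taxa, "name")               # test (1): only recorded
    warnings = shared_names(taxa, "accepted_name")   # test (2): warning message
    return notes, warnings
```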
2. Pre-import quality components
CRITTERBASE features a number of
components that ensure data-quality control prior to data import.
2.1. ExcelTamer
CRITTERBASE allows data input through the import of Excel files, which users create
from Excel templates designed for this purpose. If an Excel import file deviates from
the template, e.g., with regard to sheet and/or column names, ExcelTamer
restores the correct format to ensure an error-free data
import. It does so by mapping unknown sheet and column names to the correct template
names using an alias list. Missing sheet and column names (for which no alias
exists) are also added if this is necessary and possible without losing any information
from the original ingest file.
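The alias mechanism can be illustrated with a small mapping function. This is a sketch under assumptions: the template names, the alias list and the case-insensitive matching are all hypothetical and not taken from ExcelTamer.

```python
def tame_columns(found, template, aliases):
    """Map column names found in an import sheet onto the template names.

    found:    column names as they appear in the import file
    template: correct column names, in template order
    aliases:  {alias: template name}, a hypothetical alias list
    Returns (mapping, added, unknown).
    """
    lookup = {name.lower(): name for name in template}
    lookup.update({alias.lower(): target for alias, target in aliases.items()})
    mapping, unknown = {}, []
    for col in found:
        target = lookup.get(col.strip().lower())
        if target is None:
            unknown.append(col)        # no alias: left for the user to resolve
        else:
            mapping[col] = target
    # Template columns that no found column maps to would have to be added.
    added = [name for name in template if name not in mapping.values()]
    return mapping, added, unknown
```

Columns in `unknown` are reported rather than dropped, in line with the requirement that no information from the original file may be lost.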
2.2. TaxaPolishing
TaxaPolishing checks and completes missing taxa
information in the import file by querying MyWoRMS for (a) AphiaIDs missing for given
scientific taxon names or (b) scientific taxon names missing for given AphiaIDs. If the
respective AphiaID or taxon name is not yet stored in MyWoRMS or the stored entry is
outdated (older than 30 days), a new/current "taxon object"
is created by querying WoRMS and stored in MyWoRMS (see details in Basic Components - MyWoRMS).
Searching for a missing AphiaID using
the scientific taxon name can be done by (1) the "normal search"
procedure, where only results matching the search term exactly are returned (resulting
in one hit per search); or (2) the "fuzzy search", which queries
WoRMS for approximate matches, meaning that results are returned even if the search
term contains typing errors (multiple hits are returned if necessary).
The normal search is the default; a fuzzy search must be selected separately
by the user.
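The difference between the two search modes can be demonstrated locally. Note that this sketch does not call WoRMS and does not reproduce its matching algorithm: it uses `difflib` from the Python standard library as a stand-in for approximate matching, and the name list and cutoff are illustrative assumptions.

```python
import difflib


def normal_search(term, names):
    """Exact matching: a name is returned only if it equals the search term."""
    return [name for name in names if name == term]


def fuzzy_search(term, names, cutoff=0.8):
    """Approximate matching: may return several similar names, best match first."""
    return difflib.get_close_matches(term, names, n=5, cutoff=cutoff)


names = ["Abra alba", "Abra nitida", "Nucula nitidosa"]
exact_hits = normal_search("Abra abla", names)   # typing error: no exact hit
fuzzy_hits = fuzzy_search("Abra abla", names)    # the close match is still found
```

Keeping the exact search as the default and the fuzzy search as an explicit choice means that approximate hits are never accepted without the user noticing.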
Searching for a missing scientific
name by an AphiaID (given in the input sheet) returns at most one hit: exactly one
taxon name is found for a given AphiaID, or no result at all if the given
AphiaID does not exist in WoRMS.
All hits for missing taxonomic information
are automatically provided in a final log, structured according to the
different classes of problems encountered by TaxaPolishing.
3. Import quality components
CRITTERBASE also performs logic checks on data scenarios during data set import.
Different data scenarios result from different combinations of biota and sampling
area inputs (Table 3). Biotic data can be provided in various units: (i) numbers (i.e., counts), (ii) abundances (i.e.,
densities) or (iii) presence/absence. Similarly, information on the sampling area
may be available in different forms: (i) actually
sampled area, (ii) reference area (fictional/defined area), (iii) sampling area
calculated from gear information or (iv) unknown area. Depending on the combination of
biotic and sample area inputs, the possible data product types are ABCD, A, B,
C or D (Table 3). Product type A allows for the assessment of biomass
or abundance (i.e., density). Type B only allows for the determination of the
number of species per sampling unit. Type C allows the user to assess the
number of species or number of individuals per sample, whereas type D only
allows for the determination of the number of species per sample. The compound
type ABCD supports all types of biodiversity analysis.
3.1. BioDjinn
BioDjinn checks the "data scenario" during
data import (Table 3) and inspects the data to ensure entries abide by the scenario's
logic. A violation of a logic rule triggers an error message and prevents the
import of data into CRITTERBASE. For example, if Scenario 1b is detected, BioDjinn
checks whether a reference area is given for abundance values. If this is not
the case, a warning message draws attention to the missing reference area. If
no information on the reference area is available for the data set, the data have
to be amended accordingly and imported under Scenario 4 (presence/absence data).
BioDjinn does not convert data automatically, to give the user the
opportunity to provide missing data, if available. If the imported data set
contains further information (e.g., number), then Scenario 0c applies instead
of Scenario 1b.
If a data set contains information that does not lead to another scenario and thus
violates the logic rules (e.g., it encompasses both presence/absence and
abundance data), the data cannot be imported and an error message is generated.
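How a scenario could be derived from the kinds of input present is sketched below, following the rows of Table 3. The function, its boolean flags and its error handling are assumptions; in particular, a missing reference area raises an exception here, whereas BioDjinn is described as issuing a warning so that the user can supply the missing data.

```python
# Product type per scenario, as listed in Table 3.
PRODUCT = {"0a": "ABCD", "0b": "ABCD", "0c": "ABCD", "1a": "ABCD",
           "1b": "A", "2": "C", "3": "B", "4": "D"}


def detect_scenario(n=False, a=False, pa=False, sa=False, ra=False,
                    gsa=False, replicates=1):
    """Return the Table 3 scenario for a combination of available inputs.

    Flags state whether number (n), abundance (a), presence/absence (pa),
    sampled area (sa), reference area (ra) or gear area (gsa) is given.
    """
    area = sa or gsa                      # rule R1: SA and/or GSA is given
    if pa and (n or a):
        raise ValueError("presence/absence cannot be combined with number or abundance")
    if n and a:
        if not ra:
            raise ValueError("abundance requires a reference area")
        return "0c"
    if n:
        if not area:
            return "2"
        return "0a" if replicates == 1 else "0b"
    if a:
        if not ra:
            raise ValueError("abundance requires a reference area")
        return "1a" if gsa else "1b"
    if pa:
        return "3" if area else "4"
    raise ValueError("no biotic data given")
```

For example, `PRODUCT[detect_scenario(a=True, ra=True)]` yields product type "A" (Scenario 1b).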
Furthermore, BioDjinn checks whether number and abundance values are >= 0,
flagging negative values entered by mistake. Floating-point numbers are
only accepted as values for "number" if subsample information is
given. For presence/absence data, only 1 or 0 are accepted values; otherwise an
error message is displayed.
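These value checks can be expressed as a small validation function. The sketch below is illustrative: the parameter names and messages are hypothetical, and "floating-point number" is interpreted as a value with a fractional part, so that 3.0 counts as a whole number.

```python
def check_values(number=None, abundance=None, presence=None, has_subsample=False):
    """Return a list of error messages for one biota record (empty if valid)."""
    errors = []
    if number is not None:
        if number < 0:
            errors.append("number must be >= 0")
        # Assumption: only values with a fractional part count as floating point.
        elif float(number) != int(number) and not has_subsample:
            errors.append("fractional number requires subsample information")
    if abundance is not None and abundance < 0:
        errors.append("abundance must be >= 0")
    if presence is not None and presence not in (0, 1):
        errors.append("presence/absence must be 0 or 1")
    return errors
```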
Table 3: Data Scenario Matrix
| Scenario | Number N [-] | Abundance A [N/RA] | Presence/Absence PA [0,1] | Sampled area SA [m2] | Reference area RA [m2] | Area calculated by gear GSA [m2] | Replicate Rep [-] | Rules (R) and formulas (F) (for definitions see below table) | Product |
|---|---|---|---|---|---|---|---|---|---|
| 0a | + | | | (+) | | (+) | 1 | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used | ABCD |
| 0b | + | | | (+) | | (+) | n | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used; if necessary, consider F3 for scientific analyses | ABCD |
| 0c | + | + | | | + | (+) | | Rule R5 may apply; Formula F4 applies; Formula F1 or F2 may have to be used | ABCD |
| 1a | | + | | | + | + | | Formula F5 must be applied | ABCD |
| 1b | | + | | | + | | | | A |
| 2 | + | | | | | | | | C |
| 3 | | | + | (+) | | (+) | | Rules R1, R2 and R3 apply; Formula F1 or F2 may have to be used | B |
| 4 | | | + | | | | | | D |
General rules:
(R1) SA and/or GSA must be given.
(R2) If SA is given, it takes precedence over GSA.
(R3) If SA is not given, GSA must be calculated (see Formulas F1 and F2).
(R4) If replicates per sample are given, SA or GSA represents the sampled area per replicate.
(R5) If GSA is given, the calculated SA (see Formula F4) is tested against GSA to check
whether SA and GSA are equal. The test is performed for all gears except trawls.
General formulas:
(F1) GSA (for grab samples) = grab length * grab width
(F2) GSA (for trawl samples) = trawl gear width * sampling distance, where
sampling distance = towing speed * towing time, or the sampling distance is defined by
the latitude/longitude coordinates of sampling start and end (these formulas are
generally valid for all towed gears)
(F3) Total SA/sample = n rep * SA; total GSA/sample = n rep * GSA
(F4) SA = (N*RA)/A
(F5) N = (GSA*A)/RA
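The rules and formulas translate directly into a few helper functions. The sketch below is a plain transcription under assumptions: function names are hypothetical, and the units (metres, metres per second, seconds) are chosen for illustration so that areas come out in m2.

```python
def gsa_grab(length_m, width_m):
    """F1: gear-derived sampling area of a grab."""
    return length_m * width_m


def gsa_trawl(gear_width_m, speed_m_per_s, time_s):
    """F2: gear width times sampling distance (towing speed * towing time)."""
    return gear_width_m * speed_m_per_s * time_s


def effective_area(sa=None, gsa=None):
    """R1-R3: SA takes precedence over GSA; at least one must be given."""
    if sa is not None:
        return sa
    if gsa is not None:
        return gsa
    raise ValueError("SA and/or GSA must be given (R1)")


def total_area(area, n_rep):
    """F3: total area per sample from the area per replicate (R4)."""
    return n_rep * area


def sa_from_counts(n, ra, a):
    """F4: SA = (N * RA) / A."""
    return n * ra / a


def n_from_abundance(gsa, a, ra):
    """F5: N = (GSA * A) / RA."""
    return gsa * a / ra
```

As a worked example, a grab of 0.5 m by 0.2 m gives GSA = 0.1 m2; with an abundance of 250 individuals per 1 m2 reference area, F5 yields N = 0.1 * 250 / 1 = 25, and inserting N = 25 back into F4 returns SA = 0.1 m2.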
3.2. GeoCruncher
GeoCruncher checks the geographical coordinate format used during data import (e.g.,
degrees° minutes′ seconds″, degrees° decimal minutes) and, if necessary, automatically
converts the coordinates to the decimal degree format that is used consistently
throughout CRITTERBASE.
If conversion fails, the data import process is aborted.
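A conversion of this kind can be sketched with a regular expression. The accepted notations, the optional hemisphere letter and the function name are assumptions; GeoCruncher may recognise other formats.

```python
import re

# Degrees, optional minutes (′ or '), optional seconds (″ or "), optional N/S/E/W.
_COORD = re.compile(r"""
    \s*(-?\d+(?:\.\d+)?)\s*°?\s*          # degrees (integer or decimal)
    (?:(\d+(?:\.\d+)?)\s*[′']\s*)?        # optional minutes
    (?:(\d+(?:\.\d+)?)\s*[″"]\s*)?        # optional seconds
    ([NSEW])?\s*                          # optional hemisphere letter
""", re.VERBOSE)


def to_decimal_degrees(text):
    """Convert a coordinate string to decimal degrees; raise ValueError on failure."""
    match = _COORD.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate format: {text!r}")
    deg, minutes, seconds, hemisphere = match.groups()
    minutes = float(minutes or 0)
    seconds = float(seconds or 0)
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes/seconds out of range: {text!r}")
    value = abs(float(deg)) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal degrees.
    if deg.startswith("-") or hemisphere in ("S", "W"):
        value = -value
    return value
```

Raising an exception for anything that cannot be parsed corresponds to aborting the import when conversion fails.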
3.3. Further quality components
Subset
The "sampled data" (i.e.,
biota, individual measures, sediments) are always primarily associated with their "subset" metadata ID to ensure the
identification of multiple studies of the same sample (e.g., with regard to
different taxonomic groups collected from the same grab or trawl sample). The
uniqueness of a data set is checked using the parameter tuple of (i) taxonomic coverage (i.e., the targeted community fraction
of the study, e.g., macrozoobenthos) and (ii) taxonomic resolution (i.e., the
taxonomic level of identification).
Expansion of existing metadata entries
Before a "cruise" stored
in CRITTERBASE is to be extended by a further "station" of the (supposedly)
same "cruise" from a further data ingestion, the metadata entries of
the two cruises (e.g., scientific leader, ship) are checked for consistency. If
metadata entries differ, CRITTERBASE generates an error message, a warning or
just an information entry, depending on the default settings used.
Uniqueness of lookup data
CRITTERBASE checks the uniqueness of the following quadruple per subset: (i) AphiaID, (ii) life stage, (iii)
sieve, and (iv) specification. One AphiaID may be named more than
once in a subset only if it differs in its combination with the three other
attributes. Accidental duplicate entries per subset (e.g., the
same taxon of the same life stage) are thus avoided.
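The uniqueness condition can be checked by counting the attribute combinations per subset. The field names in this minimal sketch are hypothetical.

```python
from collections import Counter


def duplicate_entries(rows):
    """Return the (subset, AphiaID, life stage, sieve, specification) keys
    that occur more than once in `rows` (a list of dicts)."""
    counts = Counter(
        (r["subset"], r["aphia_id"], r["life_stage"], r["sieve"], r["specification"])
        for r in rows
    )
    return [key for key, n in counts.items() if n > 1]
```

An empty result means that every AphiaID in a subset is distinguished by at least one of the three other attributes.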
4. Quality components for maintenance
4.1. BiotaControl
BiotaControl checks the biotic data (i.e., taxa)
on demand (i.e., user input is required) to keep them up-to-date with current "taxon
objects" in MyWoRMS and detects any
emerging inconsistencies, e.g., regarding the AphiaID, the scientific taxon
name or the taxon status.