AusTraits data compilation - a curated plant trait database for the Australian flora

This document describes the structure of the AusTraits compilation, corresponding to Version 6.0.0 of the dataset. Note that the information below is drawn from the file returned by
system.file("data", "schema.yml", package = "traits.build").
For details on access, structure and usage please visit https://doi.org/10.5281/zenodo.3568417
The compiled AusTraits database has the following main components:
austraits
├── traits
├── locations
├── contexts
├── methods
├── excluded_data
├── taxonomic_updates
├── taxa
├── contributors
├── sources
├── definitions
├── schema
├── metadata
└── build_info
These elements include all the data and contextual information submitted with each contributed dataset. Each component is defined as follows:
traits
Description: A table containing measurements of traits.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
| observation_id | A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id’s are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id. |
| trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
| value | The measured value of a trait, location property or context property. |
| unit | Units of the sampled trait value after aligning with AusTraits standards. |
| entity_type | A categorical variable specifying the entity corresponding to the trait values recorded. |
| value_type | A categorical variable describing the statistical nature of the trait value recorded. |
| basis_of_value | A categorical variable describing how the trait value was obtained. |
| replicates | Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) replicate value should be .na. |
| basis_of_record | A categorical variable specifying from which kind of specimen traits were recorded. |
| life_stage | A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile. |
| population_id | A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id /location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category). |
| individual_id | A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time. |
| repeat_measurements_id | A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve. |
| temporal_context_id | A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table. |
| source_id | For datasets that are compilations, an identifier for the original data source. |
| location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
| entity_context_id | A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects). |
| plot_context_id | A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table. |
| treatment_context_id | A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table. |
| collection_date | Date the sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03. |
| measurement_remarks | Brief comments or notes accompanying the trait measurement. |
| method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
| method_context_id | A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table. |
| original_name | Name given to taxon in the original data supplied by the authors. |
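
The collection_date formats described in the table above can be checked mechanically. The following is a minimal Python sketch, not part of AusTraits or traits.build: the function name parse_collection_date is hypothetical, and the check covers only the shape of the string (it does not reject an impossible month or day).

```python
import re
from typing import Optional, Tuple

# One date at yyyy, yyyy-mm or yyyy-mm-dd resolution.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"
# A single date, or a start/end range separated by "/".
_PATTERN = re.compile(rf"^({_DATE})(?:/({_DATE}))?$")


def parse_collection_date(text: str) -> Tuple[str, Optional[str]]:
    """Return (start, end); end is None when a single date is given."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised collection_date: {text!r}")
    return match.group(1), match.group(2)


print(parse_collection_date("2010-10/2011-03"))  # ('2010-10', '2011-03')
print(parse_collection_date("2015"))             # ('2015', None)
```

Returning the start and end as strings keeps the original resolution visible, which would be lost if every value were coerced to a full date.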
locations
Description: A table containing observations of location/site characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id and location_name.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
| location_name | The location name. |
| location_property | The location characteristic being recorded. The name should include units of measurement, e.g. MAT (C). Ideally we have at least the following variables for each location: longitude (deg), latitude (deg), description. |
| value | The measured value of a location property. |
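
Because the locations table is stored in long format (one row per location property), it usually needs to be pivoted before it can be attached to trait records. The following Python sketch illustrates one way to do this with plain dictionaries. The rows, site name and coordinates are invented for illustration, the function names are hypothetical, and the join uses dataset_id and location_id, the two columns that both tables list.

```python
from collections import defaultdict

# Hypothetical rows in the long format of the locations table;
# the site name and coordinates are made up for illustration.
locations = [
    {"dataset_id": "Falster_2005", "location_id": 1, "location_name": "Site A",
     "location_property": "latitude (deg)", "value": "-33.5"},
    {"dataset_id": "Falster_2005", "location_id": 1, "location_name": "Site A",
     "location_property": "longitude (deg)", "value": "151.0"},
    {"dataset_id": "Falster_2005", "location_id": 1, "location_name": "Site A",
     "location_property": "description", "value": "open woodland"},
]


def locations_wide(rows):
    """Pivot long rows into one dict of properties per (dataset_id, location_id)."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["dataset_id"], row["location_id"])
        wide[key]["location_name"] = row["location_name"]
        wide[key][row["location_property"]] = row["value"]
    return dict(wide)


def attach_location(trait_row, wide):
    """Return a copy of a trait row with its location properties merged in."""
    key = (trait_row["dataset_id"], trait_row["location_id"])
    return {**trait_row, **wide.get(key, {})}


# Hypothetical trait record.
trait = {"dataset_id": "Falster_2005", "location_id": 1,
         "trait_name": "leaf_area", "value": "120"}
merged = attach_location(trait, locations_wide(locations))
print(merged["location_name"], merged["latitude (deg)"])  # Site A -33.5
```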
contexts
Description: A table containing observations of contextual characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, link_id, and link_vals.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| context_property | The contextual characteristic being recorded. If applicable, name should include units of measurement, e.g. CO2 concentration (ppm). |
| category | The category of context property, with options being plot, treatment, individual_context, temporal and method. |
| value | The measured value of a context property. |
| description | Description of a specific context property value. |
| link_id | Variable indicating which identifier column in the traits table contains the specified link_vals. |
| link_vals | Unique integer identifiers that link between identifier columns in the traits table and the contextual properties/values in the contexts table. |
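
The link_id and link_vals columns work as an indirect join: link_id names a column in the traits table, and link_vals lists the identifier values in that column to which the context row applies. The Python sketch below shows the lookup under stated assumptions: the rows are invented, the function name contexts_for is hypothetical, and link_vals is assumed to be stored as a comma-separated string.

```python
# Hypothetical rows from the contexts table. Storing link_vals as a
# comma-separated string is an assumption made for this sketch.
contexts = [
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "400", "description": "ambient",
     "link_id": "treatment_context_id", "link_vals": "1"},
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "700", "description": "elevated",
     "link_id": "treatment_context_id", "link_vals": "2, 3"},
]


def contexts_for(trait_row, context_rows):
    """Return the context rows that apply to a single trait record."""
    matches = []
    for ctx in context_rows:
        if ctx["dataset_id"] != trait_row["dataset_id"]:
            continue
        linked_ids = {v.strip() for v in ctx["link_vals"].split(",")}
        # link_id names the traits column whose value must appear in link_vals.
        if str(trait_row.get(ctx["link_id"], "")) in linked_ids:
            matches.append(ctx)
    return matches


# Hypothetical trait record measured under treatment 2.
trait = {"dataset_id": "Falster_2005", "treatment_context_id": 2,
         "plot_context_id": None}
print([c["value"] for c in contexts_for(trait, contexts)])  # ['700']
```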
methods
Description: A table containing details on methods with which data were collected, including time frame and source. Cross referencing with the traits table is possible using combinations of the variables dataset_id and trait_name.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
| methods | A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data. |
| method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
| description | A 1-2 sentence description of the purpose of the study. |
| sampling_strategy | A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait. |
| source_primary_key | Citation key for the primary source in sources. The key is typically formatted as Surname_year. |
| source_primary_citation | Citation for the primary source. This detail is generated from the primary source in the metadata. |
| source_secondary_key | Citation key for the secondary source in sources. The key is typically formatted as Surname_year. |
| source_secondary_citation | Citations for the secondary source. This detail is generated from the secondary source in the metadata. |
| source_original_dataset_key | Citation key for the original dataset_id in sources; for compilations. The key is typically formatted as Surname_year. |
| source_original_dataset_citation | Citations for the original dataset_id in sources; for compilations. This detail is generated from the original source in the metadata. |
| data_collectors | The person (people) leading data collection for this study. |
| assistants | Names of additional people who played a more minor role in data collection for the study. |
| dataset_curators | Names of AusTraits team member(s) who contacted the data collectors and added the study to the AusTraits repository. |
excluded_data
Description: A table of data that did not pass quality tests and so were excluded from the master dataset. The structure is identical to that of the traits table, with an extra column called error indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.
Content:
| key | value |
|---|---|
| error | Indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value. |
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
| observation_id | A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id’s are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id. |
| trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
| value | The measured value of a trait. |
| unit | Units of the sampled trait value after aligning with AusTraits standards. |
| entity_type | A categorical variable specifying the entity corresponding to the trait values recorded. |
| value_type | A categorical variable describing the statistical nature of the trait value recorded. |
| basis_of_value | A categorical variable describing how the trait value was obtained. |
| replicates | Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) replicate value should be .na. |
| basis_of_record | A categorical variable specifying from which kind of specimen traits were recorded. |
| life_stage | A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile. |
| population_id | A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id /location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category). |
| individual_id | A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time. |
| repeat_measurements_id | A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve. |
| temporal_context_id | A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table. |
| source_id | For datasets that are compilations, an identifier for the original data source. |
| location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
| entity_context_id | A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects). |
| plot_context_id | A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table. |
| treatment_context_id | A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table. |
| collection_date | Date the sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03. |
| measurement_remarks | Brief comments or notes accompanying the trait measurement. |
| method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
| method_context_id | A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table. |
| original_name | Name given to taxon in the original data supplied by the authors. |
taxonomic_updates
Description: A table of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing the originally submitted taxon name against the taxonomic names listed in the taxonomic reference files, best placed in a subfolder in the config folder. Cross referencing with the traits table is possible using combinations of the variables dataset_id and taxon_name.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| original_name | Name given to taxon in the original data supplied by the authors. |
| aligned_name | The taxon name without authorship after implementing automated syntax standardisation and spelling changes as well as manually encoded syntax alignments for this taxon in the metadata file for the corresponding dataset_id. This name has not yet been matched to the currently accepted (botanical) or valid (zoological) taxon name in cases where there are taxonomic synonyms, isonyms, orthographic variants, etc. |
| taxonomic_resolution | The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves. |
| taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
| aligned_name_taxon_id | An identifier for the aligned name before it is updated to the currently accepted name usage. This may be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
| aligned_name_taxonomic_status | The status of the use of the aligned_name as a label for a taxon. Requires taxonomic opinion to define the scope of a taxon. Rules of priority are then used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert’s opinion. It must be linked to a specific taxonomic reference that defines the concept. |
taxa
Description: A table containing details on taxa associated with information in traits. Whenever possible, this information is sourced from curated taxon lists that include identifiers for each taxon. The information compiled in this table is released under a CC-BY3 license. Cross referencing between the two dataframes is possible using the variable taxon_name.
Content:
| key | value |
|---|---|
| taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
| taxonomic_dataset | Name of the taxonomy (tree) that contains this concept, e.g. APC or AusMoss. |
| taxon_rank | The taxonomic rank of the most specific name in the scientific name. |
| trinomial | The infraspecific taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution. |
| binomial | The species-level taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution. |
| genus | Genus of the taxon without authorship. |
| family | Family of the taxon. |
| taxon_distribution | Known distribution of the taxon, by Australian state. |
| establishment_means | Statement about whether an organism or organisms have been introduced to a given place and time through the direct or indirect activity of modern humans. |
| taxonomic_status | The status of the use of the scientificName as a label for the taxon in regard to the ‘accepted (or valid) taxonomy’. The assigned taxonomic status must be linked to a specific taxonomic reference that defines the concept. |
| taxon_id | An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
| taxon_id_genus | An identifier for the set of taxon information (data associated with the taxon class) for the genus associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
| taxon_id_family | An identifier for the set of taxon information (data associated with the taxon class) for the family associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
| scientific_name | The full scientific name, with authorship and date information if known. |
| scientific_name_id | An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
contributors
Description: A table of people contributing to each study.
Content:
| key | value |
|---|---|
| dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
| last_name | Last name of the data collector. |
| given_name | Given names of the data collector. |
| ORCID | ORCID of the data collector. |
| affiliation | Last known institution or affiliation. |
| additional_role | Additional roles of data collector, mostly contact person. |
sources
Description: BibTeX entries for all primary and secondary sources in the compilation.
definitions
Description: A copy of the definitions for all tables and terms. Information included here was used to process data and generate any documentation for the study.
schema
Description: A copy of the schema for all tables and terms. Information included here was used to process data and generate any documentation for the study.
metadata
Description: Metadata associated with the dataset, including title, creators, license, subject, funding sources.
build_info
Description: A description of the computing environment used to create this version of the dataset, including version number, git commit and R session_info.
The core organising unit behind AusTraits is the dataset_id. Records are organised as coming from a particular study, defined by the dataset_id. Our preferred format for dataset_id is the surname of the first author of any corresponding publication, followed by the year, as surname_year, e.g. Falster_2005. Wherever there are multiple studies with the same id, we add a suffix _2, _3 etc., e.g. Falster_2005, Falster_2005_2.
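
The naming rule can be expressed as a short function. This is a Python sketch for illustration only; make_dataset_id is a hypothetical helper, not a function from the AusTraits workflow, and it assumes the set of ids already in use is known.

```python
def make_dataset_id(surname: str, year: int, existing: set) -> str:
    """Build a surname_year id, adding _2, _3, ... if the id is already taken."""
    base = f"{surname}_{year}"
    if base not in existing:
        return base
    suffix = 2
    while f"{base}_{suffix}" in existing:
        suffix += 1
    return f"{base}_{suffix}"


print(make_dataset_id("Falster", 2005, set()))             # Falster_2005
print(make_dataset_id("Falster", 2005, {"Falster_2005"}))  # Falster_2005_2
```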
As well as a dataset_id, each trait measurement has 10 additional identifiers: observation_id, population_id, individual_id, temporal_id, source_id, location_id, entity_context_id, plot_id, treatment_id, and method_id.
All except source_id are simply integer identifiers that link groups of measurements and are automatically generated through the AusTraits workflow (individual_id can be assigned in the metadata file or automatically generated).
To expand on the definitions provided above,
observation_id links measurements made on the same
entity (individual, population, or species) at a single point in
time.
population_id indicates entities that share a common location_id, plot_id, and treatment_id. It is used to align measurements and observation_id’s for individuals versus populations (i.e. distinct entity_types) that share a common population_id. It is numbered sequentially within a dataset.
individual_id indicates a unique organism. It is numbered sequentially within a dataset by population. Multiple observations on the same organism across time (with distinct observation_id values) share a common individual_id.
temporal_id indicates a distinct point in time and
is used only if there are repeat measurements on a population or
individual across time. The identifier links to context properties
(& their associated information) in the contexts table
for context properties of type temporal.
source_id is applied if not all data within a single
dataset (dataset_id) is from the same source, such as when
a dataset represents a compilation for a meta-analysis.
location_id links to a distinct
location_name and associated
location_properties in the location
table.
entity_context_id links to information in the contexts table for context properties (& associated values/descriptions) with category entity_context. Entity contexts include organism sex, organism caste and any other features of an entity that need to be documented.
plot_id links to information in the
contexts table for context properties (& associated
values/descriptions) with category plot.
Plot contexts include both blocks/plots within an
experimental design as well as any stratified variation within a
location that needs to be documented (e.g. slope position).
treatment_id links to information in the contexts table for context properties (& associated values/descriptions) with category treatment. Treatment contexts are experimental manipulations applied to groups of individuals.
method_id links to information in the contexts table for context properties (& associated values/descriptions) with category method. A method context indicates that the same trait was measured on or across individuals using different methods.
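
Since observation_id groups all measurements made on the same entity at a single point in time, it can be used to reshape the long traits table into one record per observation. The Python sketch below illustrates the idea. The rows and values are invented, traits_wide is a hypothetical helper, and the sketch assumes each trait occurs at most once per observation (repeat measurements or multiple methods would overwrite one another here).

```python
# Hypothetical rows from the traits table (values are invented).
traits = [
    {"dataset_id": "Falster_2005", "observation_id": 1,
     "taxon_name": "Acacia longifolia", "trait_name": "leaf_area", "value": "120"},
    {"dataset_id": "Falster_2005", "observation_id": 1,
     "taxon_name": "Acacia longifolia", "trait_name": "plant_height", "value": "2.5"},
    {"dataset_id": "Falster_2005", "observation_id": 2,
     "taxon_name": "Acacia longifolia", "trait_name": "leaf_area", "value": "135"},
]


def traits_wide(rows):
    """Collect trait values into one record per (dataset_id, observation_id)."""
    wide = {}
    for row in rows:
        key = (row["dataset_id"], row["observation_id"])
        record = wide.setdefault(key, {"taxon_name": row["taxon_name"]})
        record[row["trait_name"]] = row["value"]
    return wide


for key, record in traits_wide(traits).items():
    print(key, record)
```

The key combines dataset_id with observation_id because the identifiers are only unique within a dataset.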
Each record in the table of trait data has an associated
value, value_type, and
basis_of_value.
Value type: A trait’s value_type is either numeric or categorical.
- For traits with numerical values, the recorded value has been converted into standardised units and the AusTraits workflow has confirmed the value can be converted into a number and lies within the allowable range.
- For categorical variables, records have been aligned through substitutions to values listed as allowable values (terms) in a trait’s definition.
  * we use _ for multi-word terms, e.g. semi_deciduous
  * we use a space for situations where two values co-occur for the same entity. For instance, a flora might indicate that a plant species can be either annual or biennial, in which case the trait is scored as annual biennial.
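
Because spaces separate co-occurring terms and underscores join the words within a term, a categorical value can be split safely on whitespace. A minimal Python sketch, with hypothetical helper names:

```python
def split_terms(value: str) -> list:
    """Split a categorical trait value into its co-occurring terms."""
    return value.split()


def display_label(term: str) -> str:
    """Turn a multi-word term such as semi_deciduous into readable text."""
    return term.replace("_", " ")


print(split_terms("annual biennial"))                             # ['annual', 'biennial']
print([display_label(t) for t in split_terms("semi_deciduous")])  # ['semi deciduous']
```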
Each trait measurement has an associated value_type,
which is a categorical variable describing the statistical nature of the
trait value recorded. Possible values are:
| key | value |
|---|---|
| raw | Value recorded for an entity. |
| minimum | Value is the minimum of values recorded for an entity. |
| mean | Value is the mean of values recorded for an entity. |
| median | Value is the median of values recorded for an entity. |
| maximum | Value is the maximum of values recorded for an entity. |
| mode | Value is the mode of values recorded for an entity. This is the appropriate value type for a categorical trait value. |
| range | Value is a range of values recorded for an entity. |
| bin | Value for an entity falls within specified limits. |
| unknown | Not currently known. |
Each trait measurement also has an associated
basis_of_value, which is a categorical variable describing
how the trait value was obtained. Possible values are:
| key | value |
|---|---|
| measurement | Value is the result of a measurement(s) made on a specimen(s). |
| expert_score | Value has been estimated by an expert based on their knowledge of the entity. |
| model_derived | Value is derived from a statistical model, for example via gap-filling. |
| unknown | Not currently known. |
AusTraits does not include intra-individual observations made at a
single point in time. When multiple measurements per individual are
submitted to AusTraits, we take the mean of the values and record the
value_type as mean and indicate under replicates the number
of measurements made.
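
The averaging rule can be sketched as follows. This is an illustrative Python example rather than the AusTraits workflow itself: collapse_measurements is a hypothetical name, and treating a single measurement as a raw value with one replicate is an assumption based on the field definitions above.

```python
from statistics import mean


def collapse_measurements(values):
    """Collapse repeat measurements on one individual at one point in time."""
    if not values:
        raise ValueError("at least one measurement is required")
    if len(values) == 1:
        # A single measurement is kept as a raw value with one replicate.
        return {"value": values[0], "value_type": "raw", "replicates": 1}
    # Several measurements are replaced by their mean; replicates records how many.
    return {"value": mean(values), "value_type": "mean", "replicates": len(values)}


print(collapse_measurements([2.0, 4.0]))
# {'value': 3.0, 'value_type': 'mean', 'replicates': 2}
```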
Version 6.0.0 of AusTraits contains records for 33494 different taxa.
We have aligned taxa with known taxonomic units in the Australian Plant Census
(APC) and/or the Australian Plant Names Index
(APNI). Of the 33494 taxa included, 33319 are aligned with known
taxa.
The traits table reports both the original and the
updated taxon name alongside each trait record.
The table taxa lists all taxa in the database, including
additional information about the taxa.
The table taxonomic_updates provides details on all
taxonomic name changes implemented in aligning with APC and APNI.
For each dataset in the compilation there is the option to list primary and secondary citations as well as, for compilations, original citations. The primary citation is the original study in which data were collected, while the secondary citation is a subsequent study where data were compiled or re-analysed. These references are included in two places: in the methods table, as citation keys and formatted citations, and in sources, as BibTeX entries.
Following is a list of traits included in this version.
[...] vine. Types of climbers (scrambling, twining) are captured under the trait plant_climbing_mechanism. (Synonyms, vine)
[...] liana. Woody climbers generally use hooks, tendrils, and/or adventitious roots to climb; the climbing mechanisms used by a taxon are captured under the trait plant_climbing_mechanism. (Synonyms, liana)
[...] herbs and also have the term tufted mapped to the trait stem_growth_habit.
[...] shrub is complex, as there are many single-stemmed shrubs within Australia and many taxa that are described in the taxonomic literature as a shrub or small tree.
[...] plant_climbing_mechanism.
[...] cold stratification or warm stratification.
[...] resprouters, while those where fewer than 30% of plants resprout are designated as fire killed. Species with an intermediate response have a mixed fire response, and are coded as fire_killed resprouts.
[...] basal buds rather than basal stem buds or lignotuber.
[...] storage_organ. (Synonyms, rootstock, lignotuberous resprouter)
[...] true wood (secondary xylem) and taxa that do not produce secondary xylem (i.e. monocots and ferns) but have thick, stiff, robust lignified stems.
[...] fuel bed refers to the accumulated plant litter [ENVO:01000628] on the ground [SWEET:realmSoil/Ground], as this is the fuel [CHEBI:33292] that is burnt in a ground fire [ENVO:01000786]. The density of the fuel bed, the plant litter accumulated on the ground. It is calculated as the ratio of the mass of the fuel bed to the fuel bed volume. This trait should be measured on plant litter from a single taxon.
[...] empty space) and is calculated as 1 minus the ratio [PATO:0001470] of total xylem vessel [PO:0025417] and tracheid [PO:0000301] lumen [PO:0025117] cross-sectional [NCIT:C63795] area [PATO:0001323] to stem [PO:0009047] cross-sectional area. The fraction of a stem cross-section not comprised of lumen (the empty space in xylem vessels and tracheids) and calculated as one minus the ratio of the total cross-sectional area comprised of xylem vessels and tracheids (lumen fraction) to the total stem cross-sectional area.
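
The two calculations described in the last two definitions (fuel bed density as mass over volume, and the non-lumen fraction as one minus the lumen fraction) can be written out directly. This Python sketch is for illustration only; the function names are hypothetical and no particular units are implied beyond both inputs of each function being mutually consistent.

```python
def fuel_bed_density(mass: float, volume: float) -> float:
    """Ratio of the mass of the fuel bed to the fuel bed volume."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return mass / volume


def non_lumen_fraction(lumen_area: float, stem_area: float) -> float:
    """One minus the ratio of total lumen area to stem cross-sectional area."""
    if stem_area <= 0:
        raise ValueError("stem_area must be positive")
    return 1.0 - lumen_area / stem_area


print(fuel_bed_density(10.0, 4.0))    # 2.5
print(non_lumen_fraction(0.25, 1.0))  # 0.75
```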