

# Named Entity Corpus for Occupational Substance Exposure Assessment

This corpus consists of selected sections (i.e., *Abstract, Methods* and *Results*) of scientific research articles concerning occupational exposure to two types of substance: diesel exhaust (51 articles) and respirable crystalline silica (RCS) (50 articles). The article sections have been annotated by experts in the field with six categories of named entities (NEs) relevant to the assessment of occupational substance exposures, particularly in the context of Job Exposure Matrices (JEMs).



## Named Entity Categories

The table below provides details and examples of the six categories of NEs that have been annotated in the corpus.  

| **Category**                    | **Definition**                                                 | **Examples**                                                 |
| ------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------ |
| **SubstanceOrExposureMeasured** | Measured substance,  chemical or pollutant                     | *respirable quartz dust;  elemental carbon*                  |
| **OccupationJobTitle**          | Job/occupation of subject(s) of exposure studies.              | *carpenters; concrete  workers; operators in the refinery*   |
| **IndustryWorkplace**           | Workplace OR industry involved in the sampling series.         | *mining operations;  diesel factory; four-lane motorway*     |
| **JobTaskactivity**             | Physical activity/action forming part of workers' daily duties | *welding; concrete  pouring; mechanical mowing of weeds*     |
| **OHMeasurementDevice**         | Device/apparatus used to measure workplace exposure levels     | *IOM samplers; Higgins Dewell cyclones;  Dräger stain tubes* |
| **SampleTypePersonal**          | Phrases denoting that samples represent  personal exposures.   | *personal measurements; personal  breathing zone sample*     |



## Corpus Statistics

The table below summarises the annotations in the corpus. The columns are as follows:

- **Total annotations** - Total number of spans annotated in the indicated category.
- **Unique spans** - Number of distinct spans annotated in the indicated category, after converting to lower case.
- **Unique span frequency** - Average number of times that each unique span in the indicated category was annotated, i.e., total annotations divided by unique spans.

| **Category**                    | **Total Annotations**      | **Unique Spans** | **Unique Span Frequency** |
| ------------------------------- | -------------------------- | ---------------- | ------------------------- |
| **SubstanceOrExposureMeasured** | 7620                       | 810              | 9.41                      |
| **OccupationJobTitle**          | 2159                       | 644              | 3.35                      |
| **IndustryWorkplace**           | 2764                       | 927              | 2.98                      |
| **JobTaskactivity**             | 1582                       | 973              | 1.63                      |
| **OHMeasurementDevice**         | 896                        | 412              | 2.17                      |
| **SampleTypePersonal**          | 517                        | 115              | 4.50                      |
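
The three statistics can be reproduced from any list of annotated spans. The following is a minimal sketch, not the authors' own script: the function name `span_statistics` and the toy spans are made up for illustration, and only the lower-casing step and the division are taken from the definitions above.

```python
from collections import Counter


def span_statistics(spans):
    """Return (total annotations, unique spans, unique span frequency).

    Spans are lower-cased before counting, following the definition of
    "unique spans" given above.
    """
    counts = Counter(span.lower() for span in spans)
    total = sum(counts.values())
    unique = len(counts)
    frequency = round(total / unique, 2) if unique else 0.0
    return total, unique, frequency


# Toy, made-up spans for illustration only (not taken from the corpus).
example_spans = ["Diesel exhaust", "diesel exhaust", "elemental carbon"]
print(span_statistics(example_spans))  # (3, 2, 1.5)

# The same division applied to the first table row: 7620 / 810.
print(round(7620 / 810, 2))  # 9.41
```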



## Corpus Format

The corpus is available in two different formats:

- **brat standoff format** - The text for each article is stored in a separate file; the corresponding NE annotations are stored in separate files from the document text. The format is fully described [here](https://brat.nlplab.org/standoff.html). 
- **JSON** - The complete corpus is stored in a single file. The file includes the text for each article, metadata regarding the source of the article, and the NE annotations.

Each article is assigned a unique ID, which is used consistently across both formats.
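
As an illustration of the brat standoff format, the sketch below parses text-bound annotation lines (those whose ID starts with `T`) from the content of an *.ann* file. It is a hedged example rather than an official loader: the `Entity` type, the `parse_ann` function and the sample text and offsets are invented for illustration. Only the line layout (ID, tab, category and offsets, tab, covered text) follows the brat standoff documentation.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str      # e.g. "T1"
    category: str    # e.g. "OHMeasurementDevice"
    offsets: list    # list of (start, end) character offsets
    text: str        # text covered by the annotation


def parse_ann(ann_content):
    """Parse the text-bound annotations from the content of a brat .ann file."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, etc.
        ann_id, type_and_offsets, text = line.split("\t", 2)
        category, offset_str = type_and_offsets.split(" ", 1)
        # Discontinuous spans are written as "start end;start end".
        offsets = [tuple(int(n) for n in fragment.split())
                   for fragment in offset_str.split(";")]
        entities.append(Entity(ann_id, category, offsets, text))
    return entities


# Made-up sentence and annotations, for illustration only.
sample_txt = "elemental carbon was measured using IOM samplers"
sample_ann = (
    "T1\tSubstanceOrExposureMeasured 0 16\telemental carbon\n"
    "T2\tOHMeasurementDevice 36 48\tIOM samplers\n"
)

for entity in parse_ann(sample_ann):
    start, end = entity.offsets[0]
    print(entity.category, repr(sample_txt[start:end]))
```

The offsets index into the matching *.txt* file, so slicing the document text with them should return the annotated span.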

## Structure of the Downloadable Corpus 

The structure of the files and directories provided in the download is as follows:

- **README.MD** - This README file

- **Occupation-substance-exposure-annotation-guidelines.pdf** - Guidelines used by the annotators to perform the NE annotation

- **article-metadata** - Directory containing TSV files that store metadata associated with each article included in the annotated corpus. The information is split between two files (*diesel-exhaust.tsv* and *rcs.tsv*), according to the main category of substance exposure described in the article. The first row of each file contains the column names, while each subsequent row contains the metadata associated with a specific document. The columns are as follows:

  - *Article ID* - A unique ID for the article
  - *PMID* - If the article is indexed in the [PubMed](https://pubmed.ncbi.nlm.nih.gov/) database of biomedical and life sciences literature, then the PMID (PubMed Unique Identifier) is provided.
  - *DOI* - The [DOI](https://www.doi.org/) (Digital Object Identifier) for the article is provided if available.
  - *URL* - If a DOI is not available for the article, then a URL is provided 

- **brat** - Directory containing the annotated corpus in brat standoff format. For each article, the text from all annotated sections of the document is stored in a text file (*.txt*) whose base name matches the *Document ID* provided in *Corpus-sources.xlsx*. The corresponding annotations are in a file with the same base name and the suffix *.ann*. The files are organised into two subdirectories, in the same way as *Corpus-sources.xlsx*, according to the main type of substance exposure described in the article, i.e.:

  - *diesel-exhaust*
  - *RCS*

- **json** - Directory containing the annotated corpus in JSON format and associated information. 

  - **substance_exposure_corpus.json** - File containing the complete corpus in JSON format. The file contains an array of JSON objects, each representing an annotated article. The information provided for each document is formally defined in *substance_exposure.json.schema*. Briefly, it includes the following:

    - *Article metadata* - Attributes encode the same metadata for the article as described above for *Corpus-sources.xlsx*, i.e., *Article ID*; *PMID* (optional); *DOI* or *URL*
    - *Article text* - Text from the selected sections of the article that have been annotated (i.e., *Abstract, Methods* and *Results*)
    - *Sentences* - Spans of the individual sentences within the article text
    - *Annotations* - Spans and categories of the manually-added NE annotations

  - **substance_exposure.json.schema** - A [JSON Schema](https://json-schema.org/) that formally defines the structure of  *substance_exposure_corpus.json*

  - **schema_doc** - A directory containing an HTML documentation file for the JSON schema (*substance_exposure_schema.html*) and the associated files needed to format the HTML file correctly. The documentation file aims to make the schema easier to understand. It was generated using [JSON Schema for Humans](https://coveooss.github.io/json-schema-for-humans/#/).

  - **data_splits** - We have split the corpus in different ways to carry out experiments to fine-tune and evaluate different machine learning models. In each case, the split was carried out at the sentence level. This directory contains files that define these splits. Each file lists the sentence IDs (as defined in *substance_exposure_corpus.json*, one ID per line) of the sentences in one of the data splits used in our experiments. This is intended to facilitate reproduction of our results and comparison with results obtained by alternative models. There are two subdirectories:

    - **train-valid-test** - Directory containing files that define a 3-way split of all sentences in the corpus
      - **train.txt** - Training set (80% of corpus sentences)
      - **valid.txt** - Validation set (10% of corpus sentences)
      - **test.txt** - Test set (10% of corpus sentences)

    - **cross-validation** - Directory containing files that define 10 splits of the data to facilitate 10-fold cross-validation of models. We maintain the same test set as the one used in the *train-valid-test* directory. The remainder of the data was randomly split into 10 equal-sized folds, which were used to create 10 different training sets (each combining 9 of the 10 folds) and 10 different validation sets (the remaining fold in each case). There are 10 subdirectories (numbered 1 - 10), each with a *train.txt* and a *valid.txt* containing the IDs of the sentences in each split. The idea is that the different splits can be used to optimise models, and the performance of the final model can then be evaluated on the held-out test set (*test.txt* is also included in this directory, for convenience).
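
The split files described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `read_ids` and `load_cross_validation` are hypothetical, and the code assumes that the fold subdirectories are literally named `1` to `10` and that *test.txt* sits directly inside the *cross-validation* directory, as the description suggests.

```python
from pathlib import Path


def read_ids(path):
    """Read sentence IDs from a split file (one ID per line, blanks skipped)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_cross_validation(cv_dir, n_folds=10):
    """Return ({fold number: (train IDs, valid IDs)}, test IDs).

    Assumes one subdirectory per fold, named "1" .. str(n_folds), each
    holding train.txt and valid.txt, plus a shared test.txt in cv_dir.
    """
    cv_dir = Path(cv_dir)
    folds = {}
    for k in range(1, n_folds + 1):
        fold_dir = cv_dir / str(k)
        folds[k] = (read_ids(fold_dir / "train.txt"),
                    read_ids(fold_dir / "valid.txt"))
    test_ids = read_ids(cv_dir / "test.txt")
    return folds, test_ids
```

A call might look like `folds, test_ids = load_cross_validation("json/data_splits/cross-validation")`, where the path is an assumption based on the directory layout listed above. The returned IDs can then be used to select the matching sentences from *substance_exposure_corpus.json*.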

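The TSV files in the *article-metadata* directory can be read with Python's standard `csv` module. The sketch below is illustrative only: it assumes that the header row uses exactly the column names listed above (*Article ID*, *PMID*, *DOI*, *URL*) and that DOIs are stored without the resolver prefix. The helper names and the sample rows are made up.

```python
import csv
import io


def read_metadata(tsv_file):
    """Read a metadata TSV (header row first) into a list of dicts."""
    return list(csv.DictReader(tsv_file, delimiter="\t"))


def article_link(row):
    """Prefer the DOI; fall back to the URL when no DOI is given."""
    doi = (row.get("DOI") or "").strip()
    if doi:
        # Leave the value alone if it is already a full link.
        return doi if doi.startswith("http") else "https://doi.org/" + doi
    return (row.get("URL") or "").strip()


# Made-up rows for illustration; real files would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    "Article ID\tPMID\tDOI\tURL\n"
    "A1\t\t10.1000/example\t\n"
    "A2\t\t\thttps://example.org/a2\n"
)
rows = read_metadata(sample)
for row in rows:
    print(row["Article ID"], article_link(row))
```
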
    
## Licence  

The annotation guidelines and the corpus annotations are licensed under a [Creative Commons Attribution (CC BY) licence](https://creativecommons.org/licenses/by/4.0/). If you use either of these resources, please attribute the [National Centre for Text Mining (NaCTeM)](https://www.nactem.ac.uk/), School of Computer Science, University of Manchester, UK, and cite the following article:

Thompson, P., Ananiadou, S., Basinas, I., Brinchmann, B. C., Cramer, C., Galea, K. S., Ge, C., Georgiadis, P., Kirkeleit, J., Kuijpers, E., Nguyen, N., Nuñez, R., Schlünssen, V., Stokholm, Z. A., Taher, E. A., Tinnerberg, H., Van Tongeren, M. and Xie, Q. (2024). [Supporting the working life exposome: annotating occupational exposure for enhanced literature search](https://doi.org/10.1371/journal.pone.0307844). PLoS ONE 19(8): e0307844



