Extraction of clinical phenotypes for Alzheimer disease dementia from clinical notes using natural language processing

Oh, Inez; Schindler, Suzanne; Ghoshal, Nupur; Lai, Albert; Payne, Philip; Gupta, Aditi

doi:10.5061/dryad.0vt4b8h3g

Published February 9, 2023 | Version v1

Dataset Open

1. Washington University in St. Louis School of Medicine

Objectives

There is much interest in utilizing clinical data for developing prediction models for Alzheimer disease (AD) risk, progression, and outcomes. Existing studies have mostly utilized curated research registries, image analysis, and structured Electronic Health Record (EHR) data. However, much critical information resides in relatively inaccessible unstructured clinical notes within the EHR.

Materials and Methods

We developed a natural language processing (NLP)-based pipeline to extract AD-related clinical phenotypes, documenting strategies for success and assessing the utility of mining unstructured clinical notes. We evaluated the pipeline against gold-standard manual annotations performed by two clinical dementia experts for AD-related clinical phenotypes including medical comorbidities, biomarkers, neurobehavioral test scores, behavioral indicators of cognitive decline, family history, and neuroimaging findings.

Results

Documentation rates for each phenotype differed between the structured and unstructured EHR. Inter-annotator agreement was high (Cohen's kappa = 0.72–1) and positively correlated with the NLP-based phenotype extraction pipeline's performance (average F1-score = 0.65–0.99) for each phenotype.
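
As a reminder of how the two reported metrics are defined, the following is a minimal sketch that computes Cohen's kappa between two annotators and the F1-score of a pipeline against one annotator's labels. It is not the authors' evaluation code: the function names and the toy binary labels (1 = phenotype documented, 0 = not documented) are made up for illustration and do not come from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def f1_score(gold, predicted, positive=1):
    """F1-score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for eight notes (hypothetical, not from the study).
annotator_a = [1, 1, 0, 1, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 1, 0, 1, 1, 0]
pipeline_output = [1, 1, 0, 0, 0, 0, 1, 1]

print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.2f}")
print(f"F1    = {f1_score(annotator_a, pipeline_output):.2f}")
```

Kappa discounts the agreement two annotators would reach by chance alone, which is why it can be noticeably lower than raw percent agreement when one label dominates.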

Discussion

We developed an automated NLP-based pipeline to extract informative phenotypes that may improve the performance of eventual machine-learning predictive models for AD. In the process, we examined documentation practices for each phenotype relevant to the care of AD patients and identified factors for success.

Conclusion

The success of our NLP-based phenotype extraction pipeline depended on domain-specific knowledge and on focusing on a specific clinical domain rather than maximizing generalizability.

Notes

Data preprocessing steps were performed using the Python Pandas and striprtf (version 0.0.10) packages.

Linguamatics I2E query files (*.i2qy) and Extraction and Search Language (EASL) code for each NLP module are available on the Linguamatics Community webpage (https://community.linguamatics.com/queries), which requires a free account. Linguamatics I2E software is required to open the query files (*.i2qy) directly, but the logic underlying the NLP modules can be understood from the EASL code.

Funding provided by: Centene Corporation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100020153
Award Number: P19-00559

Files (11.9 kB)

Name	Size
AD_cohort_summary.xlsx (md5:34517a978c97b050d0a65ee38457da16)	9.8 kB
README.md (md5:2d8b2e6a4a112720ea7efa5375b67760)	2.1 kB

Additional details

Is derived from: 10.5281/zenodo.7616180 (DOI)
