Published February 22, 2022 | Version v1
Presentation Open

Automating data-cleaning and documentation of extracted data using interactive R-markdown notebooks

  • 1. University of Washington

Description

At the Institute for Health Metrics and Evaluation, we conduct ~40 systematic reviews each year. Our general process is search > screen > extract > analyze, but we found we need an intervening step: cleaning extracted data before analysis. The problem arises from a feature of our workflow: one person extracts the data, while another analyzes it. Clean-up falls through the gap when we hand off the data. Analysts must then spend time cleaning, even though the extractor is far more familiar with the dataset. To work faster with fewer errors, we developed a stepwise cleaning checklist, then wrote code modules to fix common problems. But juggling Excel, R, and a checklist still takes time and attention. To streamline further, we are developing a systematic solution: an interactive R-markdown notebook that takes in parameters of the specific extraction dataset, cleans and validates the data, and returns a new cleaned dataset. We are testing it with a recent systematic review dataset of ~2800 observations from >150 sources. This semi-automated interactive code has benefits beyond valid, upload-ready analysis data. First, a flexible, parameterized template makes the work faster and easily repeatable. Second, the code can reproducibly generate documentation of the cleaning done, the extraction history, or other reports on data, parameters, and results. And critically, an interactive notebook makes sophisticated coding accessible to data extractors, who tend to have less coding experience than research analysts.
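The parameterized clean-and-validate pattern described above can be illustrated with a minimal sketch. This is not the presenters' notebook, which is written in R-markdown and is not shown in this record; it is a Python analogue under stated assumptions. The names `CleaningParams` and `clean_extraction`, the column names `source_id` and `mean`, and the specific checks (required fields, numeric parsing, range bounds) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningParams:
    """Hypothetical per-dataset parameters (not taken from the presentation)."""
    required: list            # columns that must be non-empty
    numeric: list             # columns that must parse as numbers
    bounds: dict = field(default_factory=dict)  # column -> (min, max)


def clean_extraction(rows, params):
    """Return (cleaned_rows, log); the input rows are left unmodified."""
    cleaned, log = [], []
    for i, row in enumerate(rows, start=1):
        # Cleaning step: trim stray whitespace left over from manual entry.
        new = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        problems = []
        # Validation step 1: required fields must be present and non-empty.
        for col in params.required:
            if new.get(col) in (None, ""):
                problems.append(f"missing {col}")
        # Validation step 2: numeric fields must parse and fall within bounds.
        for col in params.numeric:
            value = new.get(col)
            if value in (None, ""):
                continue
            try:
                new[col] = float(value)
            except (TypeError, ValueError):
                problems.append(f"non-numeric {col}: {value!r}")
                continue
            lo, hi = params.bounds.get(col, (float("-inf"), float("inf")))
            if not lo <= new[col] <= hi:
                problems.append(f"{col} out of range: {new[col]}")
        if problems:
            # The log doubles as documentation of what was rejected and why.
            log.append(f"row {i}: " + "; ".join(problems))
        else:
            cleaned.append(new)
    return cleaned, log


# Illustrative data only; these rows are invented for the example.
params = CleaningParams(required=["source_id", "mean"], numeric=["mean"],
                        bounds={"mean": (0.0, 1.0)})
rows = [
    {"source_id": " S1 ", "mean": "0.25"},
    {"source_id": "S2", "mean": "n/a"},
    {"source_id": "", "mean": "0.5"},
    {"source_id": "S4", "mean": "1.7"},
]
cleaned, log = clean_extraction(rows, params)
print(cleaned)
for line in log:
    print(line)
```

Keeping the dataset-specific settings in one parameter object, separate from the cleaning logic, is what would let the same template be rerun on a different extraction. Returning a log alongside the cleaned rows mirrors the idea of producing documentation of the cleaning as a by-product.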

Files

Steph Zimsen.mp4 (1.3 GB)
md5:62a3068972517f4ae124536937bc3eb8

Additional details

Related works

Is derived from
Presentation: https://youtu.be/H64Bw6FvnMw