Published August 15, 2017 | Version v1
Journal article Open

Towards a comprehensive workflow for biodiversity data in R

  • 1. Department of Civil and Environmental Engineering, The Technion – Israel Institute of Technology, Haifa, Israel
  • 2. Florida Museum of Natural History, Gainesville, United States of America
  • 3. Indian Institute of Technology (IIT) - BHU, Varanasi, India
  • 4. Informatics Institute of Technology, Colombo, Sri Lanka

Description

A growing number of scientists use R for their data analyses, but the proficiency required to manage biodiversity data in R is considerably rarer. Because users need to retrieve, manage and assess high-volume data with an inherently complex structure (the Darwin Core standard, DwC), various R packages dealing with biodiversity data, and specifically with data cleaning, have been published. Though numerous new procedures are now available, implementing them requires users to invest a great deal of effort in exploring and learning each R package. For the common user, this task can be daunting. To truly facilitate data cleaning in R, there is an urgent need for a package that fully integrates the functionality of existing packages, enhances it, and simplifies its use. Furthermore, it is also necessary to identify and develop crucial functionality that is still missing.

We are attempting to address these issues through two projects under Google Summer of Code (GSoC), an international annual program that matches students with open source organizations to develop code during their summer break. The first project addresses the integration challenge by developing a taxonomic cleaning workflow, standardizing various spatial and temporal data quality checks, and enhancing different data retrieval and data management techniques. The second project aims to advance new and exciting features, such as a HashMap-like flagging system in R, an innovative set of DwC summary tables, and new techniques for outlier analysis.

The products of these projects lay down crucial infrastructure for data quality assessment in R. This is a work in progress and needs further input. By developing a comprehensive framework for handling biodiversity data, we can fully harness the synergistic quality of R and, hopefully, supply more holistic and agile solutions for the user.
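
To make the idea of a HashMap-like flagging system more concrete, the following is a minimal base R sketch, not the projects' actual implementation. It assumes an R environment can serve as the hash map, keyed by record identifier. The column names follow Darwin Core terms, while the sample records, the flag names and the `add_flag` helper are hypothetical and introduced only for illustration.

```r
# Hypothetical occurrence records; column names follow Darwin Core terms.
occ <- data.frame(
  occurrenceID     = c("a1", "a2", "a3", "a4"),
  scientificName   = c("Puma concolor", "Puma concolor", NA, "Puma concolor"),
  decimalLatitude  = c(29.65, 95.00, 10.20, -12.50),
  decimalLongitude = c(-82.32, 35.10, 190.00, 45.30),
  eventDate        = c("2015-06-01", "2016-13-01", "2014-11-12", "2013-01-15"),
  stringsAsFactors = FALSE
)

# An environment behaves like a mutable hash map:
# key = occurrenceID, value = character vector of flag names.
flags <- new.env(hash = TRUE)

# Hypothetical helper: attach a flag to each given record ID,
# keeping any flags already stored for that ID.
add_flag <- function(store, ids, flag) {
  for (id in ids) {
    current <- if (exists(id, envir = store, inherits = FALSE)) {
      get(id, envir = store)
    } else {
      character(0)
    }
    assign(id, union(current, flag), envir = store)
  }
  invisible(store)
}

# Spatial check: coordinates must be present and within valid ranges.
bad_coords <- with(occ,
  is.na(decimalLatitude) | is.na(decimalLongitude) |
    abs(decimalLatitude) > 90 | abs(decimalLongitude) > 180
)
add_flag(flags, occ$occurrenceID[bad_coords], "coordinates_out_of_range")

# Temporal check: dates that cannot be parsed as YYYY-MM-DD become NA.
parsed <- as.Date(occ$eventDate, format = "%Y-%m-%d")
add_flag(flags, occ$occurrenceID[is.na(parsed)], "invalid_event_date")

# Taxonomic check: a scientific name must be present.
add_flag(flags, occ$occurrenceID[is.na(occ$scientificName)],
         "missing_scientific_name")

# Collect the flagged records into a named list and a small summary table.
flag_list <- mget(sort(ls(flags)), envir = flags)
flag_summary <- data.frame(
  occurrenceID = names(flag_list),
  n_flags      = lengths(flag_list),
  row.names    = NULL
)
```

Keeping the flags in a separate keyed store, rather than adding columns to the data, leaves the original records untouched and lets each check add its flags independently of the others.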

Files

BISS_article_20311.pdf
