==============================
Structural audit of aggregated beetle data
Robert Mesibov, 31 August 2016; robert.mesibov@gmail.com
==============================

A great deal of preliminary data cleaning was required for a 2014 study of mine on Australian insect publications (http://dx.doi.org/10.3897/BDJ.2.e1160; https://zenodo.org/record/10481). I spent much of my cleaning time on taxonomic and bibliographic details, but a surprisingly large amount of effort was devoted to "structural" problems in the data. These were problems that didn't require specialist entomological knowledge to identify and correct. They included character encoding errors, formatting errors and duplicated records.

Curious to see how often such errors appeared in aggregated biodiversity data, I selected three publicly available beetle datasets and audited them. I chose beetles because I am not a beetle specialist (I study millipedes), so I would not be distracted by nomenclatural or taxonomic errors!

There is some overlap between errors in data table structure and errors in data content. For simplicity's sake I only looked for six broad categories of structural problems: broken records, misuse of fields, duplicate records, disagreements between fields, truncated records and character encoding errors. The audit was done on the command line using a BASH shell and GNU text-processing tools, as detailed in the accompanying logs.
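
As an illustration of the approach (the commands actually used are given in the logs), a first look at a tab-separated data table, here called "beetles.tsv" purely for the sake of example, might be:

    wc -l beetles.tsv                            # number of lines (header + records)
    head -1 beetles.tsv | tr '\t' '\n' | cat -n  # list the field names, numbered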

The raw versions of two of the three beetle datasets accompany this report in its Zenodo directory, together with command-by-command logs of the audits so that my results can be replicated. The Catalogue of Life data is archived from my file "col0", as described in the CoL log. Here I summarise those results:

- I checked 467979 beetle occurrence records from the Atlas of Living Australia (ALA), 71842 beetle name and reference records from the Australian Faunal Directory (AFD) and 272610 beetle name records from the Catalogue of Life (CoL).

- The most common problems I found were misuse of fields, character encoding issues and duplicated records. ALA had broken records and some disagreements between fields, while CoL and possibly ALA had a few apparently truncated records. A "score card" might be:

                      ALA    AFD    CoL
broken records        yes    no     no
misuse of fields      yes    yes    yes
duplicate records     yes    yes    yes
field disagreements   yes    no     no
truncated records     yes?   no     yes
encoding errors       yes    yes    yes

- Broken records are single records spread across two or more lines in a data table. Data from a broken record will appear at least twice in a field summary or analysis, with only one of the data items actually belonging in that field. Broken records are often (perhaps nearly always) caused by unrecognised line feeds or carriage returns within data items. They need to be fixed by "stitching together" the pieces before any further work on the data table is done.
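
One quick way to flag candidate broken records, sketched here for a hypothetical tab-separated "beetles.tsv" rather than copied from the logs, is to compare each line's field count with the header's:

    awk -F"\t" 'NR==1 {n=NF} NF!=n {print "line " NR ": " NF " fields (expected " n ")"}' beetles.tsv

Lines with too few fields are the usual sign of a record split by an embedded line break.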

- Fields can be misused in two major ways. The correct data item might appear in the wrong field (field shifting), and a field might contain an inappropriate data item (a failure of data validation, or "domain schizophrenia").
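
One way to see both kinds of misuse, again only as a sketch with an invented field number, is to tally the unique entries in a single field and scan the tally for items that obviously belong somewhere else:

    cut -f12 beetles.tsv | sort | uniq -c | sort -rn | less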

- Because each record in a data table usually contains a unique ID string, there are unlikely to be exact, character-for-character duplicated records. For this reason a search for duplicates can be flexible, i.e. the definition of "duplicate" can be based on important likely uses for the record. The duplications I searched for were tailored to the individual datasets and (I think!) reasonably chosen (qv).
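
A flexible search of this kind can ignore the unique ID field and compare only the fields that matter for a particular use. As a sketch only, with field numbers invented for illustration:

    cut -f3,5,9 beetles.tsv | sort | uniq -cd | sort -rn    # repeated combinations of fields 3, 5 and 9, with counts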

- Within a single record, the data items in two different fields might disagree. A fictitious example is a record which contains the name Aus bus Jones, 1944 in the "scientific name" field and the year 1934 in the authorship "year" field.
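
Disagreements like this can be flagged with awk. A minimal sketch, assuming (hypothetically) that the scientific name is in field 3 and the authorship year in field 4 of a tab-separated "names.tsv":

    awk -F"\t" 'match($3, /[0-9][0-9][0-9][0-9]/) {
        name_year = substr($3, RSTART, RLENGTH)
        if (name_year != $4) print "line " NR ": name has " name_year ", year field has " $4
    }' names.tsv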

- Truncated records are particularly hard to spot because the truncation might be a data entry error. In the data item "Aus bus Jones, 194", did the data enterer simply leave off the last digit of the year, or did software (at some stage) truncate the string at 17 characters?
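
One clue to software truncation is a pile-up of data items at a single maximum length. A sketch, again with an invented field number and filename:

    awk -F"\t" '{print length($5)}' names.tsv | sort -n | uniq -c | tail
    # a suspiciously large count at the maximum length suggests truncation rather than data entry slips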

- Character encoding issues are probably responsible for most of the cleaning work to be done, but their origin is usually untraceable. I suspect that many data compilers use Microsoft Excel to prepare data tables, and Excel (even Excel 2016) still defaults to windows-1252 when saving plain-text files, replacing many UTF-8 characters with "?".
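
Two quick checks for encoding damage, offered as a sketch rather than as the commands actually used, are to test whether a file is valid UTF-8 and to look for suspect characters such as the replacement character or a "?" embedded in a word:

    iconv -f UTF-8 -t UTF-8 beetles.tsv > /dev/null   # reports the first invalid byte sequence, if any
    grep -c $'\xef\xbf\xbd' beetles.tsv               # lines containing the UTF-8 replacement character
    grep -c '[A-Za-z]?[A-Za-z]' beetles.tsv           # lines with "?" wedged between letters, e.g. "M?ller"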

More comments on errors are in the individual dataset logs.