Frictionless Data and Data Packages

Jo Barratt
Open Knowledge International
London, UK
jo.barratt@okfn.org

Rufus Pollock
Paris, France


Paul Walsh
Open Knowledge International
Tel Aviv, Israel
paul.walsh@okfn.org

Dan Fowler


Serah Rono
Open Knowledge International
Tallinn, Estonia
serah.rono@okfn.org


There is significant friction in the acquisition, sharing, and reuse of research data. It is estimated that eighty percent of data analysis is invested in the cleaning and mapping of data. This friction prevents researchers who are not well versed in data processing techniques from reusing the ever-increasing amount of research data available on the web and within scientific data repositories. Frictionless Data is an ongoing project at Open Knowledge International focused on removing this friction in a variety of circumstances. We are doing this by developing a set of tools, specifications, and best practices for describing, publishing, and validating data. The heart of this project is the “Data Package”, a containerization format for data based on existing practices for publishing open-source software.

  1. Introduction

Data-driven research is fundamental to scientific inquiry. In recent years, we’ve seen significant growth of interest around the role of data in research, both in terms of producing new tools and methodologies to support data-driven research processes, and in producing new ways to share and reuse research outputs. The promise of reproducible research is dependent to a large extent on the ability to share and reuse data that supports research claims and, perhaps more implicitly, to share and reuse the workflows via which data was acquired, processed, and analysed.

  2. The Problem

The problem is that, fundamentally, there is significant friction in working with data. With Frictionless Data, we focus specifically on reducing friction around discoverability, structure, standardization, and tooling, and more generally on the technicalities of preparing, validating, and sharing data in ways that both enhance existing workflows and enable new ones, with the express goal of minimizing the gap between data and insight. We do this by creating specifications and software that are primarily informed by reuse (of existing formats and standards), conceptual minimalism, and platform-agnostic interoperability.

  3. Value Proposition

The core value proposition of Frictionless Data is the “containerization” of data via the Data Package and Tabular Data Package specifications. These specifications provide a lightweight, JSON-serialized format for declaring metadata and schematic information about a given dataset, and serve as a glue for data and tooling interoperability.
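
For illustration, a minimal Tabular Data Package descriptor (datapackage.json) for a hypothetical dataset might look like the following; the dataset name, file path, and field names are invented for the example:

    {
      "name": "co2-monthly",
      "title": "Hypothetical monthly CO2 measurements",
      "profile": "tabular-data-package",
      "resources": [
        {
          "name": "co2-monthly",
          "path": "data/co2-monthly.csv",
          "profile": "tabular-data-resource",
          "schema": {
            "fields": [
              {"name": "date", "type": "date"},
              {"name": "co2_ppm", "type": "number"}
            ]
          }
        }
      ]
    }

The descriptor carries dataset-level metadata (name, title) alongside a column-level Table Schema, and it is this schema that downstream tools rely on for validation and type casting.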

A key aspect of this specification is that it aligns with researchers’ usual tools (e.g. Excel and CSV) and will require few or no changes to existing data and data structures.
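
As a brief sketch of what this looks like in practice, assuming the Python datapackage library and the hypothetical “co2-monthly” descriptor above, the researcher’s CSV file is left exactly as it was produced:

    # Minimal sketch, assuming the Python datapackage library and the
    # hypothetical "co2-monthly" descriptor; the underlying CSV is unchanged.
    from datapackage import Package

    package = Package('datapackage.json')
    resource = package.get_resource('co2-monthly')

    # Rows are returned as dicts with values cast according to the
    # Table Schema (dates as date objects, numbers as numeric types).
    for row in resource.read(keyed=True):
        print(row['date'], row['co2_ppm'])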

Having been under development for several years, the specification is now mature, reaching v1.0 in 2017. A key distinguishing feature of our approach is a strong focus on users and, relatedly, a commitment to providing the most minimal, simplest solutions. This principle of simplicity is based on our long experience building tools and working with researchers and governments: it is essential that any new proposal disrupt existing approaches and processes as little as possible, build as much as possible on existing knowledge and work, and deliver as much immediate value as possible. The approach proposed here delivers on all of these requirements.

  4. Approach

In summary, here are some key principles of our approach that we think, together, are distinctive:

- Reuse: build on formats and standards that researchers already use, such as CSV and JSON, rather than inventing new ones.
- Simplicity and conceptual minimalism: keep the specifications as small as possible so that they are easy to implement and easy to learn.
- Minimal disruption: require few or no changes to existing data, data structures, and workflows.
- Platform-agnostic interoperability: act as glue between data and tooling, from research data repositories to the researcher's desktop.

  5. Conclusion

Having started work on this concept through the process of developing CKAN, OpenSpending, and other data-intensive civic technology projects, we believe a decentralized, open standard for publishing tabular research data, building on existing formats like CSV and JSON, is a substantial contribution. Our experience so far points to an unmet need for exactly this kind of approach in the research data ecosystem. Over the last year we have seen a very positive reaction; our work has been driven by the needs of tool-makers (e.g. keeping the standards as simple as possible to make them easy to implement) while we learn as much as we can about the needs of working researchers.

It is sometimes said that data standards are like toothbrushes: a good idea, but no one wants to use anyone else’s. Many standards and best practices for sharing research data already exist. The greatest problems occur where standards compete directly, however, and in the case of metadata standards this is rarer than it might first appear.

From our work with various communities, including researchers, developers, and others, it is clear that the Data Package specifications are useful not just for researchers, but for anyone who works with tabular data.

Through this approach, we expect broad-based improvements in data quality as well as increased reuse of data. By providing an enabling environment for tools that create and consume well-packaged data, we can empower researchers to do more with less, allowing modular, automated data import and validation services to be integrated into research data repositories. We suggest that data quality can thereby be made “visible” by enabling better quality control and by providing standardized visualization options through tools like GoodTables.io and datahub.io.
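
As a sketch of the kind of automated validation this enables, a repository could check an uploaded table before accepting it. This assumes the goodtables Python library; the file name is hypothetical:

    # Minimal sketch of automated tabular validation, assuming the
    # goodtables Python library; "submission.csv" is a hypothetical upload.
    from goodtables import validate

    report = validate('submission.csv')

    # The report summarizes structural and content checks on the table.
    if report['valid']:
        print('Table passed validation')
    else:
        print('Found {} error(s)'.format(report['error-count']))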

By providing a simple metadata format that also describes tabular data at a columnar level, we hope to enable data transport integrations across a diversity of platforms, from the cloud (Google’s BigQuery and Amazon’s AWS) all the way to the researcher’s desktop.
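
As one illustrative sketch of such an integration, assuming the Python datapackage library together with its SQL storage plugin and SQLAlchemy (the connection string and file names are hypothetical), a packaged dataset could be loaded into a relational database with typed columns derived from the Table Schema:

    # Illustrative sketch only: assumes the datapackage library with its
    # SQL storage plugin (tableschema-sql) and SQLAlchemy installed.
    from datapackage import Package
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///research.db')  # hypothetical target database
    package = Package('datapackage.json')

    # Save each resource in the package as a database table, using the
    # Table Schema to create typed columns.
    package.save(storage='sql', engine=engine)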