Frictionless Data and Data Packages

Jo Barratt
Open Knowledge International
London, UK
jo.barratt@okfn.org

Rufus Pollock
Paris, France


Paul Walsh
Open Knowledge International
Tel Aviv, Israel
paul.walsh@okfn.org

Dan Fowler


Serah Rono
Open Knowledge International
Tallinn, Estonia
serah.rono@okfn.org


There is significant friction in the acquisition, sharing, and reuse of research data. It is estimated that eighty percent of data analysis is invested in the cleaning and mapping of data. This friction prevents researchers who are not well versed in data processing techniques from reusing the ever-increasing amount of research data available on the web and within scientific data repositories. Frictionless Data is an ongoing project at Open Knowledge International focused on removing this friction in a variety of circumstances. We are doing this by developing a set of tools, specifications, and best practices for describing, publishing, and validating data. The heart of this project is the “Data Package”, a containerization format for data based on existing practices for publishing open-source software.

  1. Introduction

Data-driven research is fundamental to scientific inquiry. In recent years, we’ve seen significant growth of interest around the role of data in research, both in terms of producing new tools and methodologies to support data-driven research processes, and in producing new ways to share and reuse research outputs. The promise of reproducible research is dependent to a large extent on the ability to share and reuse data that supports research claims and, perhaps more implicitly, to share and reuse the workflows via which data was acquired, processed, and analysed.

  2. The Problem

The problem is that, fundamentally, there is significant friction in working with data. With Frictionless Data, we focus specifically on reducing friction around discoverability, structure, standardization, and tooling, and more generally on the technicalities of preparing, validating, and sharing data in ways that both enhance existing workflows and enable new ones, with the express goal of minimizing the gap between data and insight. We do this by creating specifications and software that are primarily informed by reuse (of existing formats and standards), conceptual minimalism, and platform-agnostic interoperability.

  3. Value Proposition

The core value proposition of Frictionless Data is the “containerization” of data via the Data Package and Tabular Data Package specifications. These specifications provide a lightweight, JSON-serialized format for declaring metadata and schematic information about a given dataset, and serve as a glue for data and tooling interoperability.
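
For illustration, a minimal Tabular Data Package descriptor (datapackage.json) for a hypothetical dataset might look like the following; the dataset name, file path, and field names are invented for the example:

    {
      "name": "co2-monthly",
      "title": "Hypothetical monthly CO2 measurements",
      "profile": "tabular-data-package",
      "resources": [
        {
          "name": "co2-monthly",
          "path": "data/co2-monthly.csv",
          "profile": "tabular-data-resource",
          "schema": {
            "fields": [
              {"name": "date", "type": "date"},
              {"name": "co2_ppm", "type": "number"}
            ]
          }
        }
      ]
    }

The descriptor carries dataset-level metadata (name, title) alongside a column-level Table Schema, and it is this schema that downstream tools rely on for validation and type casting.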

A key aspect of this specification is that it aligns with researchers’ usual tools (e.g. Excel and CSV) and will require few or no changes to existing data and data structures.
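
As a brief sketch of what this looks like in practice, assuming the Python datapackage library and the hypothetical “co2-monthly” descriptor above, the researcher’s CSV file is left exactly as it was produced:

    # Minimal sketch, assuming the Python datapackage library and the
    # hypothetical "co2-monthly" descriptor; the underlying CSV is unchanged.
    from datapackage import Package

    package = Package('datapackage.json')
    resource = package.get_resource('co2-monthly')

    # Rows are returned as dicts with values cast according to the
    # Table Schema (dates as date objects, numbers as numeric types).
    for row in resource.read(keyed=True):
        print(row['date'], row['co2_ppm'])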

Having been under development for several years, the specification is now mature, reaching v1.0 in 2017. A key distinguishing feature of our approach is a strong focus on users and, relatedly, a commitment to providing the most minimal, simplest solutions. This principle of simplicity is based on our long experience building tools and working with researchers and governments: it is essential that any new proposal disrupt existing approaches and processes as little as possible, build as much as possible on existing knowledge and work, and deliver as much immediate value as possible. The approach proposed here delivers on all of these requirements.

  4. Approach

In summary, here are some key principles of our approach that we think, together, are distinctive:

- Reuse: build on formats and standards that researchers already use, such as CSV and JSON, rather than inventing new ones.
- Simplicity and conceptual minimalism: keep the specifications as small as possible so that they are easy to implement and easy to learn.
- Minimal disruption: require few or no changes to existing data, data structures, and workflows.
- Platform-agnostic interoperability: act as glue between data and tooling, from research data repositories to the researcher's desktop.

  5. Conclusion

Having started work on this concept through the process of developing CKAN, OpenSpending, and other data-intensive civic technology projects, we believe a decentralized, open standard for publishing tabular research data, building on existing formats like CSV and JSON, is a substantial contribution. Our experience so far points to an unmet need for exactly this kind of approach in the research data ecosystem. Over the last year we have seen a very positive reaction; our work has been driven by the needs of tool-makers (e.g. keeping the standards as simple as possible to make them easy to implement) while we learn as much as we can about the needs of working researchers.

It is sometimes said that data standards are like toothbrushes: a good idea, but no one wants to use anyone else’s. Many standards and best practices for sharing research data already exist. The greatest problems occur where standards compete directly, however, and in the case of metadata standards this is rarer than it might first appear.

From our work with various communities, including researchers, developers, and others, it is clear that the Data Package specifications are useful not just for researchers, but for anyone who works with tabular data.

Through this approach, we expect broad-based improvements in data quality as well as increased reuse of data. By providing an enabling environment for tools that create and consume well-packaged data, we can empower researchers to do more with less, allowing modular, automated data import and validation services to be integrated into research data repositories. We suggest that data quality can thereby be made “visible” by enabling better quality control and by providing standardized visualization options through tools like GoodTables.io and datahub.io.
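
As a sketch of the kind of automated validation this enables, a repository could check an uploaded table before accepting it. This assumes the goodtables Python library; the file name is hypothetical:

    # Minimal sketch of automated tabular validation, assuming the
    # goodtables Python library; "submission.csv" is a hypothetical upload.
    from goodtables import validate

    report = validate('submission.csv')

    # The report summarizes structural and content checks on the table.
    if report['valid']:
        print('Table passed validation')
    else:
        print('Found {} error(s)'.format(report['error-count']))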

By providing a simple metadata format that also describes tabular data at a columnar level, we hope to enable data transport integrations across a diversity of platforms, from the cloud (Google’s BigQuery and Amazon’s AWS) all the way to the researcher’s desktop.
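
As one illustrative sketch of such an integration, assuming the Python datapackage library together with its SQL storage plugin and SQLAlchemy (the connection string and file names are hypothetical), a packaged dataset could be loaded into a relational database with typed columns derived from the Table Schema:

    # Illustrative sketch only: assumes the datapackage library with its
    # SQL storage plugin (tableschema-sql) and SQLAlchemy installed.
    from datapackage import Package
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///research.db')  # hypothetical target database
    package = Package('datapackage.json')

    # Save each resource in the package as a database table, using the
    # Table Schema to create typed columns.
    package.save(storage='sql', engine=engine)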