Published May 13, 2019 | Version v1
Presentation Open

Semi-Supervised Machine Learning with Word Embeddings for Price Statistics

  • 1. Office for National Statistics
  • 2. Edward
  • 3. Tanya

Description

The public sector has access to a growing number of big data sources, and there is a focus on exploiting their potential to transform services and impact policy making. Sir Charles Bean’s Independent Review of UK Economic Statistics (2016) challenged the UK’s Office for National Statistics (ONS) to use alternative data sources (e.g. web-scraped or administrative data). Machine Learning (ML) provides a wide array of methods that can extract insights from these large new data sources. Here, we discuss our approach to tackling two common issues with such data. First, such data often contains unstructured text from which it is difficult to extract meaningful information. Second, many ML methods are supervised: they need to be trained on a large and representative labelled dataset to work effectively, and such datasets can be challenging and time-consuming to create. To solve these problems, we have developed a multi-step process to label data. First, we use text vectorisation methods, such as Word2Vec and fastText, to create numerical representations of the text. We then use these representations in the label propagation algorithm to spread a small number of labels across a much larger dataset. Using the resulting labelled dataset, we can train an ML classifier. We will present results showing this method applied to classifying web-scraped clothing data according to COICOP definitions for the creation of price statistics.
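The three steps described above (vectorise, propagate labels, train a classifier) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the hand-written two-dimensional `WORD_VECTORS` stand in for trained Word2Vec or fastText embeddings, the product descriptions and the class names `garments` and `footwear` are hypothetical, the propagation is a basic clamped variant on a dense RBF similarity graph, and a nearest-centroid rule stands in for whichever classifier is trained on the propagated labels.

```python
import math

# ASSUMPTION: toy 2-d vectors standing in for trained Word2Vec/fastText embeddings.
WORD_VECTORS = {
    "shirt": (1.0, 0.1), "blouse": (0.9, 0.2), "top": (0.8, 0.1),
    "boot": (0.1, 1.0), "trainer": (0.2, 0.9), "shoe": (0.1, 0.8),
    "cotton": (0.6, 0.3), "leather": (0.3, 0.7),
}

def embed(text):
    """Step 1: represent a description as the mean of its known word vectors."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))

def rbf(a, b, gamma=5.0):
    """Similarity between two vectors (gamma is an arbitrary illustrative value)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def propagate_labels(vectors, labels, classes, n_iter=50):
    """Step 2: spread the few known labels (None = unlabelled) over the graph."""
    n, k = len(vectors), len(classes)
    weights = [[rbf(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    trans = [[w / sum(row) for w in row] for row in weights]  # row-normalised

    def clamp(dist):
        # Labelled items keep a one-hot distribution after every iteration.
        return [
            [1.0 if labels[i] == c else 0.0 for c in classes]
            if labels[i] is not None else dist[i]
            for i in range(n)
        ]

    dist = clamp([[1.0 / k] * k for _ in range(n)])
    for _ in range(n_iter):
        dist = [
            [sum(trans[i][j] * dist[j][c] for j in range(n)) for c in range(k)]
            for i in range(n)
        ]
        dist = clamp(dist)
    return [classes[max(range(k), key=lambda c: row[c])] for row in dist]

def train_centroids(vectors, labels):
    """Step 3: fit a nearest-centroid classifier on the propagated labels."""
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    return {
        label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
        for label, vecs in groups.items()
    }

def classify(text, centroids):
    """Assign a new description to the class with the closest centroid."""
    vec = embed(text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

# ASSUMPTION: hypothetical descriptions; only the first two carry a label.
descriptions = ["cotton shirt", "leather boot", "blouse top",
                "cotton top", "trainer shoe", "leather shoe"]
seed_labels = ["garments", "footwear", None, None, None, None]

vectors = [embed(d) for d in descriptions]
propagated = propagate_labels(vectors, seed_labels, ["garments", "footwear"])
centroids = train_centroids(vectors, propagated)
print(propagated)
print(classify("leather trainer", centroids))
```

In practice the toy vectors would be replaced by embeddings trained on the scraped text, and the dense similarity matrix by a sparse nearest-neighbour graph, since the dense version grows quadratically with the number of items.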
