Published September 29, 2023 | Version v0.1.0
Software Open

Zoomerjoin: Superlatively Fast Fuzzy-Joins

Authors/Creators

  • 1. Yale University

Description

Researchers often have to link large datasets without access to a unique identifying key, or on the basis of a field that contains misspellings, small errors, or other inconsistencies. In these cases, "fuzzy" matching techniques are used, which are resilient to minor corruptions in the fields meant to identify observations across datasets. Most popular methods compare every possible pair of records across the two datasets, incurring a computational cost proportional to the product of the row counts, O(mn), where m and n are the numbers of rows in the two datasets. As such, these methods do not scale to large datasets.
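To make the O(mn) cost concrete, here is a minimal sketch of an all-pairs fuzzy join. It is written in Python purely for illustration and is not taken from any particular matching package; the function names, the use of Jaccard similarity on character bigrams, and the 0.6 threshold are all assumptions chosen for the example.

```python
from itertools import product


def shingles(s, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = s.lower()
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_fuzzy_join(left, right, threshold=0.6):
    """Compare every left record with every right record.

    Returns the matched pairs and the number of comparisons made,
    which is always len(left) * len(right).
    """
    comparisons = 0
    matches = []
    for x, y in product(left, right):
        comparisons += 1
        if jaccard(shingles(x), shingles(y)) >= threshold:
            matches.append((x, y))
    return matches, comparisons


left = ["Yale University", "Harvard University"]
right = ["Yale Universty", "Harvard Univeristy", "MIT"]
matches, comparisons = naive_fuzzy_join(left, right)
print(matches)
print(comparisons)  # 2 * 3 = 6 comparisons
```

With two and three rows the loop makes six comparisons; with a million rows on each side it would make 10^12, which is why this approach stops being practical.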

Zoomerjoin is an R package that empowers users to fuzzily join massive datasets with millions of rows in seconds or minutes. Backed by two performant, multithreaded Locality-Sensitive Hashing algorithms, zoomerjoin saves time by not comparing distant pairs of observations and typically runs in linear, O(m+n), time. The algorithmic details are technical but the results are transformational: for the distance metrics it supports, zoomerjoin takes seconds or minutes to join datasets that would have taken other matching packages hours or years.
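The general idea behind locality-sensitive hashing can be sketched as follows. This is a simplified, single-threaded Python illustration of MinHash with banding, one common LSH scheme for Jaccard similarity; it is not zoomerjoin's implementation or API, and every name and parameter here (`lsh_candidate_pairs`, `n_bands`, `band_width`, the bigram shingles) is an assumption made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(s, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = s.lower()
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, n_hashes):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(seeded_hash(seed, t) for t in tokens) for seed in range(n_hashes))


def band_keys(signature, n_bands, band_width):
    """Split a signature into bands; each band becomes one bucket key."""
    for b in range(n_bands):
        yield (b, signature[b * band_width:(b + 1) * band_width])


def lsh_candidate_pairs(left, right, n_bands=20, band_width=2):
    """Return index pairs (i, j) whose records share at least one bucket.

    Each record is hashed once, so the hashing work grows with
    len(left) + len(right) rather than with their product.
    """
    n_hashes = n_bands * band_width
    buckets = defaultdict(list)
    for i, s in enumerate(left):
        sig = minhash_signature(shingles(s), n_hashes)
        for key in band_keys(sig, n_bands, band_width):
            buckets[key].append(i)

    candidates = set()
    for j, s in enumerate(right):
        sig = minhash_signature(shingles(s), n_hashes)
        for key in band_keys(sig, n_bands, band_width):
            for i in buckets.get(key, ()):
                candidates.add((i, j))
    return candidates


left = ["Yale University", "Harvard University", "MIT"]
right = ["Yale University", "Yale Universty"]
print(sorted(lsh_candidate_pairs(left, right)))
```

Only the candidate pairs would then be checked with an exact similarity measure. Distant pairs rarely share a bucket, so most of them are never compared at all; the trade-off is that the method is probabilistic, and the number and width of bands control how likely a true match is to be missed.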

Files

Files (27.2 MB)

md5:b962af77a3686b83cd98cfd26acb4383 (27.2 MB)