Zoomerjoin: Superlatively Fast Fuzzy-Joins
Description
Researchers often have to link large datasets that lack a unique identifying key, or whose linking field contains misspellings, small errors, or other inconsistencies. In these cases, "fuzzy" matching techniques are employed, which are resilient to minor corruptions in the fields meant to identify observations across datasets. Most popular methods compare every possible pair of records across the two datasets, incurring a computational cost proportional to the product of their row counts, O(mn). As such, these methods do not scale to large datasets.
Zoomerjoin is an R package that empowers users to fuzzily join massive datasets with millions of rows in seconds or minutes. Backed by two performant, multithreaded Locality-Sensitive Hash algorithms, zoomerjoin saves time by not comparing distant pairs of observations and typically runs in linear (O(m+n)) time. The algorithmic details are technical but the results are transformational: for the distance metrics it supports, zoomerjoin takes seconds or minutes to join datasets that would have taken other matching packages hours or years.
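To make the idea of skipping distant pairs concrete, here is a minimal Python sketch of MinHash with banding, one well-known locality-sensitive hashing scheme for Jaccard similarity. It is an illustration only, not zoomerjoin's implementation or API: the description does not name the hashing schemes the package uses, and every function name and parameter below (`shingles`, `minhash_signature`, `lsh_join`, `n_bands`, `band_width`, `threshold`) is hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, width=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + width] for i in range(max(len(text) - width + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def _hash(seed, gram):
    """Deterministic 64-bit hash of an n-gram, salted by the seed."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(grams, n_hashes):
    """One minimum per hash function; similar sets tend to share minima."""
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(n_hashes))


def lsh_join(left, right, n_bands=20, band_width=2, threshold=0.5):
    """Illustrative fuzzy join: only pairs sharing an LSH bucket are compared."""
    n_hashes = n_bands * band_width
    left_sets = [shingles(s) for s in left]
    right_sets = [shingles(s) for s in right]

    # Index the right-hand records: one bucket per (band, band signature).
    buckets = defaultdict(list)
    for j, grams in enumerate(right_sets):
        sig = minhash_signature(grams, n_hashes)
        for b in range(n_bands):
            key = (b, sig[b * band_width:(b + 1) * band_width])
            buckets[key].append(j)

    # Probe with the left-hand records; colliding pairs become candidates.
    candidates = set()
    for i, grams in enumerate(left_sets):
        sig = minhash_signature(grams, n_hashes)
        for b in range(n_bands):
            key = (b, sig[b * band_width:(b + 1) * band_width])
            for j in buckets.get(key, ()):
                candidates.add((i, j))

    # Verify candidates exactly, so the threshold is enforced on real similarity.
    return sorted(
        (left[i], right[j])
        for i, j in candidates
        if jaccard(left_sets[i], right_sets[j]) >= threshold
    )
```

Calling `lsh_join(["Acme Corporation", "Initech"], ["ACME Corporation", "Umbrella Group"])` pairs the two spellings of "Acme Corporation" and leaves "Initech" unmatched. Each record is hashed a fixed number of times, so the hashing work grows with m + n; the exact Jaccard check runs only on the pairs that collide in some bucket, which is how the all-pairs O(mn) comparison is avoided when most records are dissimilar.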
Files
| Size | Checksum |
|---|---|
| 27.2 MB | md5:b962af77a3686b83cd98cfd26acb4383 |