Evaluating the Effect of Deduplication Thresholds and Feature Weighting on Match Quality
Description
Deduplication is a fundamental problem in data integration and record linkage, where the goal is to identify records that refer to the same real-world entity despite differences in formatting, wording, or completeness. This paper studies how threshold choice and auxiliary feature weighting affect match quality in tabular deduplication. Using the Abt-Buy benchmark dataset, we evaluate threshold-based matching with name similarity alone and compare it against weighted combinations of name, description, and price similarity. Our results show that threshold selection has a substantial impact on precision, recall, and F1, with poorly chosen thresholds leading to either excessive false matches or missed duplicates. We also find that additional features do not automatically improve performance. A naive multi-feature hybrid underperformed the name-only baseline, while a tuned name-plus-price model achieved the best results by improving recall with almost no loss of precision. A simple logistic regression comparison also underperformed the tuned threshold model, but its learned coefficients reinforced the same feature-level conclusion by assigning positive weight to name and price similarity and negative weight to description similarity. Together, these results show that deduplication quality depends not only on threshold choice, but also on whether added features provide reliable information beyond the primary matching field.
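The threshold-based matching and feature weighting described above can be illustrated with a minimal sketch. It is not the code from the paper or its repository. The similarity measures (`difflib.SequenceMatcher` for text, relative difference for price), the function names, the record fields, and the toy scores are all assumptions made for illustration. The abstract does not state which similarity functions or weights were actually used.

```python
from difflib import SequenceMatcher


def text_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_sim(p, q) -> float:
    """Relative price closeness in [0, 1]; 0.0 when a price is missing."""
    if p is None or q is None or max(p, q) <= 0:
        return 0.0
    return 1.0 - abs(p - q) / max(p, q)


def score(a: dict, b: dict, weights: dict) -> float:
    """Weighted average of per-feature similarities for one record pair.

    `weights` maps a feature name ("name", "description", "price") to a
    non-negative weight; features left out of `weights` are ignored.
    """
    sims = {
        "name": text_sim(a["name"], b["name"]),
        "description": text_sim(a["description"], b["description"]),
        "price": price_sim(a["price"], b["price"]),
    }
    total = sum(weights.values())
    return sum(w * sims[f] for f, w in weights.items()) / total


def precision_recall_f1(scores, labels, threshold):
    """Predict a match when score >= threshold, then compare to the labels."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy pair scores and ground-truth labels (hypothetical, not Abt-Buy data).
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_labels = [True, False, True, False]

# Sweep a few thresholds to see the precision/recall trade-off.
for t in (0.1, 0.5, 0.95):
    print(t, precision_recall_f1(toy_scores, toy_labels, t))
```

In this sketch, a name-only baseline corresponds to `weights={"name": 1.0}`, and a name-plus-price variant to something like `weights={"name": 0.8, "price": 0.2}`. These weights are illustrative values, not the tuned ones reported in the paper.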
Files
| Name | Size | Checksum |
|---|---|---|
| deduplication_paper.pdf | 270.3 kB | md5:691f03a177be91a14a1a0d404b6f5d18 |
Additional details
Software
- Repository URL: https://github.com/KelvinLinBU/dedup-threshold-study
- Programming language: Python
References
- I. P. Fellegi and A. B. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 1969.
- W. E. Winkler. Overview of Record Linkage and Current Research Directions. U.S. Census Bureau, 2006.
- H. Köpcke, A. Thor, and E. Rahm. Evaluation of Entity Resolution Approaches on Real-World Match Problems. PVLDB, 2010.
- S. Mudgal et al. Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD, 2018.
- Y. Li et al. Deep Entity Matching with Pre-Trained Language Models. PVLDB, 2021.
- DeepMatcher Dataset Documentation. Abt-Buy Benchmark Dataset Description. Available at: https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md