Evaluating the Effect of Deduplication Thresholds and Feature Weighting on Match Quality

Lin, Kelvin

doi:10.5281/zenodo.19795601

Published April 26, 2026 | Version 1

Publication Open

Lin, Kelvin (Researcher)

Deduplication is a fundamental problem in data integration and record linkage, where the goal is to identify records that refer to the same real-world entity despite differences in formatting, wording, or completeness. This paper studies how threshold choice and auxiliary feature weighting affect match quality in tabular deduplication. Using the Abt-Buy benchmark dataset, we evaluate threshold-based matching with name similarity alone and compare it against weighted combinations of name, description, and price similarity. Our results show that threshold selection has a substantial impact on precision, recall, and F1, with poorly chosen thresholds leading to either excessive false matches or missed duplicates. We also find that additional features do not automatically improve performance. A naive multi-feature hybrid underperformed the name-only baseline, while a tuned name-plus-price model achieved the best results by improving recall with almost no loss of precision. A simple logistic regression comparison also underperformed the tuned threshold model, but its learned coefficients reinforced the same feature-level conclusion by assigning positive weight to name and price similarity and negative weight to description similarity. Together, these results show that deduplication quality depends not only on threshold choice, but also on whether added features provide reliable information beyond the primary matching field.
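The threshold-and-weighting setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the similarity functions (`difflib.SequenceMatcher` for names, relative difference for prices), the weights `w_name=0.8` and `w_price=0.2`, the record field names, and the toy pairs are all assumptions chosen for readability. The linked repository may use different measures and values.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_similarity(p1, p2) -> float:
    """1.0 for equal prices, shrinking with relative difference.

    Assumption: a missing or non-positive price contributes 0.0.
    """
    if p1 is None or p2 is None or max(p1, p2) <= 0:
        return 0.0
    return 1.0 - abs(p1 - p2) / max(p1, p2)


def weighted_score(pair, w_name=0.8, w_price=0.2) -> float:
    """Weighted combination of name and price similarity (hypothetical weights)."""
    left, right = pair
    return (w_name * name_similarity(left["name"], right["name"])
            + w_price * price_similarity(left.get("price"), right.get("price")))


def precision_recall_f1(scores, labels, threshold):
    """Declare a match when score >= threshold, then compare against gold labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def sweep_thresholds(scores, labels, thresholds):
    """Map each candidate threshold to its (precision, recall, F1) triple."""
    return {t: precision_recall_f1(scores, labels, t) for t in thresholds}


if __name__ == "__main__":
    # Toy candidate pairs, invented for illustration only (not Abt-Buy records).
    pairs = [
        ({"name": "Sony Bravia 40in LCD TV", "price": 799.0},
         {"name": "Sony BRAVIA 40-inch LCD Television", "price": 789.0}),
        ({"name": "Canon PowerShot A590", "price": 149.0},
         {"name": "Nikon Coolpix L18", "price": 129.0}),
    ]
    labels = [True, False]
    scores = [weighted_score(p) for p in pairs]
    for t, (p, r, f1) in sweep_thresholds(scores, labels, [0.3, 0.5, 0.7]).items():
        print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Sweeping the threshold makes the trade-off in the abstract visible: a low threshold admits false matches and lowers precision, while a high threshold misses true duplicates and lowers recall. Scoring a missing price as 0.0 is one design choice among several. Renormalizing the weights over the fields that are present would be another.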

Files (270.3 kB)

Name	Size
deduplication_paper.pdf (md5:691f03a177be91a14a1a0d404b6f5d18)	270.3 kB

Additional details

Repository URL: https://github.com/KelvinLinBU/dedup-threshold-study
Programming language: Python

References

I. P. Fellegi and A. B. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 1969.
W. E. Winkler. Overview of Record Linkage and Current Research Directions. U.S. Census Bureau, 2006.
H. Köpcke, A. Thor, and E. Rahm. Evaluation of Entity Resolution Approaches on Real-World Match Problems. PVLDB, 2010.
S. Mudgal et al. Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD, 2018.
Y. Li et al. Deep Entity Matching with Pre-Trained Language Models. PVLDB, 2021.
DeepMatcher Dataset Documentation. Abt-Buy Benchmark Dataset Description. Available at: https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

	All versions	This version
Views	68	68
Downloads	42	42
Data volume	15.9 MB	15.9 MB
