Machine Learning-Based Name Matching: A Logistic Regression Perspective
Description
In this study, we conducted experiments to investigate the use of logistic regression in developing a name matching system. The primary objective was to create a system capable of identifying potential matches between a query and the names in a given dataset. To achieve this, we used two established string-similarity measures, Levenshtein distance and fuzzywuzzy similarity, to assess how closely names resemble one another.
We first preprocessed the dataset by calculating the Levenshtein distance and the fuzzywuzzy percentage between each name and the query. These features were appended to the dataset as additional columns. We then applied a logistic regression model that had previously been trained on a labeled dataset.
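The feature step described above can be sketched as follows. This is a minimal illustration rather than the study's own code: it uses only the Python standard library, with a hand-written Levenshtein routine and `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy percentage. The function names `levenshtein`, `similarity_percent`, and `add_features`, and the `lev` and `sim` column names, are assumptions made for this example.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_percent(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy percentage)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def add_features(names, query):
    """Return one row per name with the two features as extra columns."""
    return [
        {"name": n, "lev": levenshtein(n, query), "sim": similarity_percent(n, query)}
        for n in names
    ]


rows = add_features(["Jon Smith", "Jane Doe"], "John Smith")
```

Each row carries the original name plus its two features, so it can be passed directly to a scoring step.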
To evaluate the performance of the model, we used it to predict, for each entry in the dataset, the likelihood that the name matches the query. These predictions were added to the dataset as a new column. Finally, we sorted the dataset in descending order of predicted likelihood so that the most probable matches appear first.
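The scoring and ranking step can be sketched in a few lines. The coefficients below are hypothetical placeholders, not the values learned in the study, and the feature rows are made-up illustrative inputs. A real system would load the trained model's intercept and weights instead. The names `match_probability`, `rank_matches`, `lev`, and `sim` are likewise assumptions for this example.

```python
import math

# Hypothetical coefficients (assumption): stand-ins for a trained model's parameters.
INTERCEPT = -4.0
W_LEV = -0.5   # a larger edit distance lowers the match probability
W_SIM = 0.08   # a higher similarity percentage raises it


def match_probability(lev: int, sim: int) -> float:
    """Logistic regression prediction: sigmoid of a linear combination of features."""
    z = INTERCEPT + W_LEV * lev + W_SIM * sim
    return 1.0 / (1.0 + math.exp(-z))


def rank_matches(rows):
    """Add a 'prediction' column and sort in descending order of probability."""
    scored = [dict(r, prediction=match_probability(r["lev"], r["sim"])) for r in rows]
    return sorted(scored, key=lambda r: r["prediction"], reverse=True)


# Illustrative feature rows, not taken from the study's dataset.
rows = [
    {"name": "Jon Smith", "lev": 1, "sim": 95},
    {"name": "Jane Doe", "lev": 9, "sim": 22},
    {"name": "John Smith", "lev": 0, "sim": 100},
]
ranked = rank_matches(rows)
for r in ranked:
    print(f'{r["name"]}: {r["prediction"]:.3f}')
```

Sorting on the predicted probability rather than on a single raw feature lets the model weigh the two similarity measures against each other when ordering candidates.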
The developed name matching system provides a scalable and efficient approach, enabling users to input a query and obtain a ranked list of potential name matches. To further assess the accuracy and efficacy of the system, it is possible to compare the predicted matches with known ground truth data.
The results obtained from our study demonstrate the effectiveness of the name matching system in identifying potential matches based on the computed features and the trained logistic regression model. The system holds significant value in various applications, including data integration, record linkage, and identity verification.
Files
Machine_Learning_Based_Name_Matching__A_Logistic_Regression_Perspective .pdf
Size: 395.3 kB
md5:0da89df11d085489a3aac244951999b6