Published May 4, 2020 | Version 1
Dataset Open

Datasets from Approximate equality of character strings and its application to record linkage in metadata of scientific publications thesis

Contributors

Supervisor:

Description

The datasets were produced as part of my master's thesis. The thesis (written in Czech) explores the application of approximate string matching to record linkage of scientific publications. It provides an introduction to record matching and describes five commonly used string comparison measures: Levenshtein distance, Jaro distance, Jaro-Winkler distance, cosine distance and the Jaccard coefficient. These measures are applied to publication metadata from V3S, the current research information system of the Czech Technical University in Prague. Based on the findings, the optimal threshold for each measure is determined with respect to the F1, F2 and F3 measures.
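As a rough illustration of the kind of computation described above, the following Python sketch implements two of the five measures (Levenshtein distance and a Jaccard coefficient over character bigrams) together with an F-beta score and a simple threshold search. It is not the code used in the thesis: the function names, the choice of bigrams, and the toy labelled pairs are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient over sets of character n-grams (bigrams assumed here)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to yield n-grams are treated as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def best_threshold(scored_pairs, beta: float = 1.0):
    """Return (threshold, score) maximising F-beta.

    scored_pairs is a list of (similarity, is_true_match) tuples; a pair is
    predicted to be a match when its similarity is >= the threshold.
    """
    best = (0.0, -1.0)
    for threshold in sorted({s for s, _ in scored_pairs}):
        tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
        fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
        fn = sum(1 for s, m in scored_pairs if s < threshold and m)
        score = f_beta(tp, fp, fn, beta)
        if score > best[1]:
            best = (threshold, score)
    return best


# Hypothetical labelled pairs: (similarity score, manually validated match?)
toy_pairs = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
threshold, score = best_threshold(toy_pairs, beta=1.0)
```

Raising `beta` to 2 or 3 shifts the preferred threshold toward higher recall, which is presumably why the thesis reports thresholds for F1, F2 and F3 separately.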

Thesis citation:
DOBIÁŠOVSKÝ, Jan. Approximate equality of character strings and its application to record linkage in metadata of scientific publications [online]. Praha, 2020 [cit. 2020-05-04]. Master's thesis. Charles University. Faculty of Arts. Institute of Information Studies and Librarianship.

 

Files

manual_validation.zip

Files (1.1 GB)

Checksum Size
md5:fca3dcd9ca875daa21ae6c820e40ddd7 100.6 kB
md5:2696e3b8fa79a463ca666dfd79cbf8ff 3.3 MB
md5:3563ee302555ee5309b10edef7b7a9c8 1.1 GB
md5:70fe993a16a8691e2df3ea94a08b9cd4 6.3 kB
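
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following Python sketch shows one way to do this with the standard library. The helper names are illustrative, and the file path in the usage comment is hypothetical, since this listing does not say which checksum belongs to which file.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for large archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (hypothetical path; pair each file with its own checksum):
# matches_listing("downloads/some_file.zip", "md5:3563ee302555ee5309b10edef7b7a9c8")
```

MD5 is adequate here as an integrity check against transfer errors; it should not be relied on as protection against deliberate tampering.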