# README.txt

The datasets were produced in my thesis project. The thesis (written in Czech) explores the application of approximate string matching in the record linkage of scientific publications. It provides an introduction to record matching and describes five commonly used string distance metrics (Levenshtein, Jaro, Jaro-Winkler and cosine distances, and the Jaccard coefficient). These metrics are applied to publication metadata from V3S, the current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds with respect to the F1, F2 and F3 measures are determined for each metric.

Thesis citation:
DOBIÁŠOVSKÝ, Jan. Approximate equality of character strings and its application to record linkage in metadata of scientific publications [online]. Praha, 2020 [cit. 2020-05-04]. Master's thesis. Charles University. Faculty of Arts. Institute of Information Studies and Librarianship.

Source code used to create this data is available at: https://github.com/jdobiasovsky/metric-test

If you have any further questions or need help getting the dataset to work, let me know: honza.dobiasovsky@gmail.com

################################################################################

There are three datasets. Each file is named using the following pattern: [metric_used][year/years_of_publication].

Metrics used are:
lv - Levenshtein distance
jw - Jaro-Winkler distance (p=0.1)
jaro - Jaro distance
jaccard3 - Jaccard coefficient for q-grams of size 3
jaccard4 - Jaccard coefficient for q-grams of size 4
cosine3 - cosine distance for q-grams of size 3
cosine4 - cosine distance for q-grams of size 4

Results were produced for different subsets based on year of publication and are split accordingly in the data. If a year range is given, all publications published within that range were used.
Files are in the .csv format (the first row contains column names) and are UTF-8 encoded.

1) raw_string_distances.7z [7zip archive of 483 CSV files; total uncompressed size 13.6 GB]

Contains the raw calculated string distances for each pair of compared publication titles. Blocking by publication year is used with a tolerance of +-1 year. The files contain only pairs where a DOI was present on both sides. The stringdist R library (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf) was used to calculate the distances.

Dataset contains the following columns:
X1 - internal identifier of the pair, used for control during the testing phase
YEAR,ID,DOI - information about the compared pair: year of publication, internal unique identifier of the publication, and DOI. Information for both documents in the pair is provided with the suffixes 1 and 2. YEAR1 is used in the file name.
TITLE - string distance of the publication titles; values lie in the interval [0, 1], where 0 signifies identical strings.

The metrics were calculated as follows:
lv - stringdist(a, b, method = "lv")/max_string_length # raw Levenshtein distance was normalized to the [0, 1] range
jaro - stringdist(a, b, method = "jw", p = 0)
jw - stringdist(a, b, method = "jw", p = 0.1)
jaccard3 - stringdist(a, b, method = "jaccard", q = 3)
jaccard4 - stringdist(a, b, method = "jaccard", q = 4)
cosine3 - stringdist(a, b, method = "cosine", q = 3)
cosine4 - stringdist(a, b, method = "cosine", q = 4)

Some files may contain only the column names if no pairs were found.

2) pair_comparison_outcomes.zip [ZIP archive of 28 CSV files; total uncompressed size 15.5 MB]

Summarization of the cases from raw_string_distances.7z. Each file contains 6001 rows with threshold values ranging from 0 to 0.6 in steps of 0.0001.
Based on whether the DOIs matched and on the threshold, each pair is classified as a true positive, false positive or false negative as follows:
If the DOIs match and the string distance is lower than or equal to the threshold, the pair is classified as a true positive.
If the DOIs do not match but the string distance is lower than or equal to the threshold, the pair is classified as a false positive.
If the DOIs match but the string distance is greater than the threshold, the pair is classified as a false negative.

Each row represents the result of record matching over the entire input file if the given threshold were applied. Each results file in this section has the same name as the raw_string_distances file(s) used as its input. Files were processed in three different year ranges: 1950-2018, 2009-2018, 2016-2018. If the filename contains “_fbeta”, the F2 and F3 measures were calculated in addition (columns named F2measure and F3measure).

Dataset contains the following columns:
Threshold - threshold that was applied
Precision - precision calculated at the given threshold
Recall - recall calculated at the given threshold
Fmeasure - F-measure at the given threshold
TP - number of true positives: pairs where the distance was within the threshold and the DOIs matched
FP - number of false positives: pairs where the distance was within the threshold but the DOIs did not match
FN - number of false negatives: pairs where the distance was above the threshold but the DOIs matched

3) manual_validation.zip [ZIP archive of 7 CSV files; total uncompressed size 325 KB]

The optimal threshold (the one with the highest Fmeasure) was determined for each metric. A random sample of 100 document pairs that lacked a DOI on one or both sides of the comparison and whose distance was within this optimal threshold was selected and manually validated to determine the most common errors.
Dataset contains the following columns:
X1 - internal identifier of the pair, used for control during the testing phase
ID,YEAR,DOI - publication pair information. Information for both documents in the pair is provided with the suffixes 1 and 2.
TITLE_DISTANCE - string distance of the publication titles
CONTROL - decision whether the match is a true positive (TP) or a false positive (FP)
TITLE,AUTHORS,DOC_TYPE_CODE,CONFERENCE_NAME - additional metadata used to determine the outcome. Information for both documents in the pair is provided with the suffixes 1 and 2.

Due to copyright restrictions, it is not possible to publish the input dataset for this research.
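
Example: the true positive / false positive / false negative rule described for pair_comparison_outcomes.zip can be sketched in a few lines of Python. This is only an illustration, not the code used in the thesis (that code is in the repository linked above). The names classify_pairs, f_beta and Outcome are hypothetical, the sample pairs are made up, DOIs are assumed to match when the strings are exactly equal, and precision, recall and F-measure are assumed to be 0 when their denominator is 0.

```python
from typing import Iterable, NamedTuple, Tuple


class Outcome(NamedTuple):
    threshold: float
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float


def classify_pairs(pairs: Iterable[Tuple[float, str, str]], threshold: float) -> Outcome:
    """Count TP/FP/FN for (distance, doi1, doi2) tuples at one threshold."""
    tp = fp = fn = 0
    for distance, doi1, doi2 in pairs:
        same_doi = doi1 == doi2          # assumption: exact string equality
        within = distance <= threshold   # "lower than or equal to the threshold"
        if same_doi and within:
            tp += 1
        elif within:
            fp += 1
        elif same_doi:
            fn += 1
        # different DOIs and distance above threshold: true negative, not counted
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumption: 0 when undefined
    return Outcome(threshold, tp, fp, fn, precision, recall)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta measure; beta = 1, 2, 3 gives the F1, F2 and F3 measures."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


# Made-up sample: (title distance, DOI of document 1, DOI of document 2)
pairs = [
    (0.00, "10.1/a", "10.1/a"),
    (0.05, "10.1/b", "10.1/b"),
    (0.08, "10.1/c", "10.1/d"),
    (0.30, "10.1/e", "10.1/e"),
    (0.90, "10.1/f", "10.1/g"),
]

# Threshold grid from 0 to 0.6 in steps of 0.0001 (6001 values)
thresholds = [i / 10000 for i in range(6001)]

result = classify_pairs(pairs, 0.1)
print(result.tp, result.fp, result.fn)
print(f_beta(result.precision, result.recall, beta=2))
```

Calling classify_pairs once per value in thresholds would give one row per threshold, analogous to the rows of the files in pair_comparison_outcomes.zip. Building the grid from integers avoids accumulating floating-point error when stepping by 0.0001.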