Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical Information Retrieval

Galke, Lukas Paul Achatius; Saleh, Ahmed; Scherp, Ansgar

doi:10.5281/zenodo.1143963

Published September 29, 2017 | Version v1

Conference paper Open

1. ZBW -- Leibniz Information Centre for Economics

We assess the suitability of word embeddings for practical information retrieval scenarios.
We assume that users issue short ad-hoc queries and that we return the first twenty documents
retrieved after applying a Boolean matching operation between the query and the documents. We
compare the performance of several techniques that use word embeddings in the retrieval model
to compute the similarity between the query and the documents, namely word centroid similarity,
paragraph vectors, Word Mover’s distance, and our novel inverse document frequency (IDF)
re-weighted word centroid similarity. We evaluate performance using the ranking metrics mean
average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally,
we inspect the retrieval models’ sensitivity to document length by using either only the title or the
full text of the documents for the retrieval task. We conclude that word centroid similarity is the best
competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word
frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed
cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even
outperforms it in the news domain, by a relative margin of 15%.
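
The retrieval setup described above (Boolean matching, then ranking by cosine similarity of IDF re-weighted word centroids, returning the top twenty documents) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the smoothed IDF formula, the choice of "at least one shared term" as the Boolean match, applying IDF weights to the query as well as to the documents, and the toy two-dimensional embeddings are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF per term (assumed formula: log(N / df) + 1)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def weighted_centroid(tokens, embeddings, idf):
    """Sum the word vectors, each weighted by term frequency times IDF."""
    dim = len(next(iter(embeddings.values())))
    centroid = [0.0] * dim
    for term, tf in Counter(tokens).items():
        if term not in embeddings:  # skip out-of-vocabulary terms
            continue
        weight = tf * idf.get(term, 1.0)  # unseen terms get weight 1.0 (assumption)
        for i, x in enumerate(embeddings[term]):
            centroid[i] += weight * x
    return centroid


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, docs, embeddings, k=20):
    """Keep documents sharing a query term, then rank them by centroid cosine."""
    idf = idf_weights(docs)
    q = weighted_centroid(query, embeddings, idf)
    scored = [
        (i, cosine(q, weighted_centroid(doc, embeddings, idf)))
        for i, doc in enumerate(docs)
        if set(query) & set(doc)  # Boolean match: at least one shared term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings and tokenized documents, for illustration only.
embeddings = {
    "stock": [1.0, 0.0],
    "market": [0.9, 0.1],
    "football": [0.0, 1.0],
    "news": [0.5, 0.5],
}
docs = [
    ["stock", "market", "news"],
    ["football", "news"],
    ["stock", "football"],
]
ranked = rank(["stock", "market"], docs, embeddings)
```

Because cosine similarity ignores vector length, the weighted sum does not need to be divided by the total weight. In this toy collection the second document shares no term with the query, so the Boolean step filters it out before any similarity is computed.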

Files

Name	Size	Checksum
INF17_as_submitted.pdf	303.5 kB	md5:33b0994ee3c890bd642465b0dfcaa4dd

Additional details

Funding: European Commission, MOVING – Training towards a society of data-savvy information professionals to enable open leadership innovation (693092)
