Bilingual Document Alignment with Latent Semantic Indexing

doi:10.5281/zenodo.834343

Published August 7, 2016 | Version v1

Conference paper | Open access

Germann, Ulrich¹

1. University of Edinburgh

We apply cross-lingual Latent Semantic Indexing to the Bilingual Document Alignment Task at WMT16. Reduced-rank singular value decomposition of a bilingual term-document matrix derived from known English/French page pairs in the training data allows us to map monolingual documents into a joint semantic space. Two variants of cosine similarity between the vectors that place each document into the joint semantic space are combined with a measure of string similarity between corresponding URLs to produce 1:1 alignments of English/French web pages in a variety of domains. The system achieves a recall of ca. 88% if no in-domain data is used for building the latent semantic model, and 93% if such data is included.

Analysing the system’s errors on the training data, we argue that evaluating aligners based on exact URL matches underestimates their true performance, and we propose an alternative that accounts for duplicates and near-duplicates in the underlying data.
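
The mapping into a joint semantic space described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy page pairs, the raw term counts, the fold-in scaling by inverse singular values, and the greedy 1:1 matching are all assumptions made for illustration, and the URL string similarity component is left out. The function names (`fit_lsi`, `project`, `align`) are hypothetical.

```python
import numpy as np

def tokens(text, lang):
    # Prefix tokens with a language tag so English "train" and French "train" stay distinct.
    return [f"{lang}:{w}" for w in text.lower().split()]

def fit_lsi(pairs, rank):
    # One column per known page pair; English and French terms share that column.
    docs = [tokens(en, "en") + tokens(fr, "fr") for en, fr in pairs]
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[index[t], j] += 1.0  # raw counts (assumption; no term weighting)
    # Reduced-rank SVD: keep only the `rank` largest singular triplets.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return index, U[:, :rank], s[:rank]

def project(text, lang, index, U_k, s_k):
    # Fold a monolingual document into the joint space; unseen terms are ignored.
    v = np.zeros(len(index))
    for t in tokens(text, lang):
        if t in index:
            v[index[t]] += 1.0
    return (U_k.T @ v) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def align(en_docs, fr_docs, model):
    # Greedy best-first 1:1 matching on cosine similarity (assumed strategy).
    index, U_k, s_k = model
    E = [project(d, "en", index, U_k, s_k) for d in en_docs]
    F = [project(d, "fr", index, U_k, s_k) for d in fr_docs]
    scored = sorted(
        ((cosine(e, f), i, j) for i, e in enumerate(E) for j, f in enumerate(F)),
        reverse=True,
    )
    used_en, used_fr, out = set(), set(), []
    for _, i, j in scored:
        if i not in used_en and j not in used_fr:
            used_en.add(i)
            used_fr.add(j)
            out.append((i, j))
    return sorted(out)

# Toy training pairs (invented for illustration): two topics, two pairs each.
TRAIN_PAIRS = [
    ("cat pet", "chat animal"),
    ("cat food", "chat nourriture"),
    ("train ticket", "train billet"),
    ("train station", "gare train"),
]
model = fit_lsi(TRAIN_PAIRS, rank=2)
print(align(["cat food", "train station"], ["gare billet", "chat animal"], model))
```

In this sketch a French page such as "gare billet" shares no surface terms with the English page "train station", yet both land on the same latent dimension because their terms co-occurred in the training pairs. A real system would need term weighting, a much larger rank, and a more careful assignment step.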

Files (436.6 kB)

Name	Checksum	Size
W16-2368.pdf	md5:32eaf219ddfabceff323a948ae5a0d83	436.6 kB

Additional details

Funding
SUMMA – Scalable Understanding of Multilingual Media (grant 688139), European Commission
MMT – MMT will deliver a language independent commercial online translation service based on a new open-source machine translation distributed architecture (grant 645487), European Commission
