Language comparison via network topology

Škrlj, Blaž; Pollak, Senja

doi:10.1007/978-3-030-31372-2_10

Published October 18, 2019 | Version v1

Conference paper (open access)

1. Jožef Stefan Institute

Modeling relations between languages can offer insight into language characteristics and uncover similarities and differences between languages. Automated methods applied to large textual corpora open opportunities for novel statistical studies of language development over time, as well as for improving cross-lingual natural language processing techniques. In this work, we first propose how to represent textual data as a directed, weighted network using the text2net algorithm. We next explore how various fast network-topological metrics, such as network community structure, can be used for cross-lingual comparisons. In our experiments, we employ eight different network topology metrics and show empirically, on a parallel corpus, how the methods can be used for modeling the relations between nine selected languages. We demonstrate that the proposed method scales to large corpora consisting of hundreds of thousands of aligned sentences on an off-the-shelf laptop. We observe that some properties, such as communities, capture known differences between the languages, while others can be seen as novel opportunities for linguistic studies.
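The general idea of turning text into a directed, weighted network and then comparing languages through topology metrics can be illustrated with a minimal sketch. This is not the text2net algorithm from the paper, whose details are not given on this page. It assumes, purely for illustration, that nodes are lowercased whitespace tokens and that an edge links each token to the one that follows it, weighted by how often the pair occurs. The function names `build_token_network` and `basic_topology`, the toy sentences, and the three metrics shown are all hypothetical and are not the eight metrics used in the paper.

```python
from collections import Counter


def build_token_network(sentences):
    """Build a directed, weighted network from an iterable of sentences.

    Assumption (not from the paper): an edge (a, b) means token b directly
    follows token a; its weight counts how often that happens.
    """
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for source, target in zip(tokens, tokens[1:]):
            if source != target:  # skip self-loops to keep density well defined
                edges[(source, target)] += 1
    return edges


def basic_topology(edges):
    """Compute a few simple topology statistics for a directed network."""
    nodes = {token for pair in edges for token in pair}
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        # Fraction of possible directed edges (without self-loops) present.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        # Average number of distinct successors per node.
        "mean_out_degree": m / n if n else 0.0,
    }


# Toy aligned sentences (invented for this sketch, not from the paper's corpus).
corpus = {
    "en": ["the cat sat", "the cat ran"],
    "de": ["die katze sass", "die katze lief"],
}

for language, sentences in corpus.items():
    print(language, basic_topology(build_token_network(sentences)))
```

Each language yields one vector of metric values, so languages can then be compared by comparing these vectors, for example with a distance measure or clustering. A plain `Counter` keyed by token pairs keeps the sketch dependency-free; a graph library would be the more natural choice for metrics such as community structure.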

Files

Files (798.8 kB)

Name	MD5 checksum	Size
Škrlj_language_comparison.pdf	70930bd6711108fd9742cbef09569191	798.8 kB

Additional details

Funding

European Commission
EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant agreement No. 825153)

	All versions	This version
Views	83	83
Downloads	86	86
Data volume	68.7 MB	68.7 MB
