Published October 30, 2018 | Version v1
Conference paper Open

DistLODStats: Distributed Computation of RDF Dataset Statistics

Description

Over the last years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without having a priori statistical information about its internal structure and coverage. In fact, there are already a number of tools that offer such statistics, providing basic information about RDF datasets and vocabularies. However, those usually show severe deficiencies in terms of performance once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software component for statistical calculations of large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The criteria are extensible beyond the 32 default criteria, and the component is integrated into the larger SANSA framework and employed in at least four major usage scenarios beyond the SANSA community.
 

Files

iswc_distlodstats.pdf (268.2 kB)
md5:c90d23a31b82ef4f75345c0315395ff5

Additional details

Funding

QROWD – Because Big Data Integration is Humanly Possible 732194
European Commission
BigDataOcean – Exploiting Oceans of Data for Maritime Applications 732310
European Commission
WDAqua – Answering Questions using Web Data 642795
European Commission
BigDataEurope – Integrating Big Data, Software and Communities for Addressing Europe’s Societal Challenges 644564
European Commission