DistLODStats: Distributed Computation of RDF Dataset Statistics
Description
Over the last years, the Semantic Web has been growing steadily. Today, we
count more than 10,000 datasets made available online following Semantic
Web standards. Nevertheless, many applications, such as data integration,
search, and interlinking, may not take full advantage of the data without
having a priori statistical information about its internal structure and
coverage. In fact, there are already a number of tools that offer such
statistics, providing basic information about RDF datasets and vocabularies.
However, those usually show severe deficiencies in terms of performance once
the dataset size grows beyond the capabilities of a single machine. In this
paper, we introduce a software component for statistical calculations of
large RDF datasets, which scales out to clusters of machines. More
specifically, we describe the first distributed in-memory approach for
computing 32 different statistical criteria for RDF datasets using Apache
Spark. The preliminary results show that our distributed approach improves
upon a previous centralized approach we compare against and provides
approximately linear horizontal scale-up. The criteria are extensible beyond
the 32 default criteria, and the component is integrated into the larger
SANSA framework and employed in at least four major usage scenarios beyond
the SANSA community.
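To make the idea of a distributed statistical criterion concrete, the following is a minimal sketch in plain Python, not the DistLODStats or SANSA API. It assumes a criterion can be expressed as a per-partition count followed by a merge of partial results, which is the general shape a cluster framework such as Apache Spark parallelizes. The toy triples, the two example criteria (property usage and class usage), and all function names are hypothetical illustrations.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy data: (subject, predicate, object) triples split into
# two "partitions", standing in for chunks held by different workers.
partitions = [
    [
        ("ex:alice", "rdf:type", "foaf:Person"),
        ("ex:alice", "foaf:knows", "ex:bob"),
    ],
    [
        ("ex:bob", "rdf:type", "foaf:Person"),
        ("ex:bob", "foaf:name", '"Bob"'),
        ("ex:acme", "rdf:type", "foaf:Organization"),
    ],
]


def property_usage(partition):
    """Count how often each predicate occurs within one partition."""
    return Counter(predicate for _, predicate, _ in partition)


def class_usage(partition):
    """Keep only rdf:type triples and count their objects (the classes)."""
    return Counter(obj for _, pred, obj in partition if pred == "rdf:type")


def merge(left, right):
    """Combine two partial results; Counter addition sums per-key counts."""
    return left + right


# Each partition is processed independently, then partial counts are merged.
property_counts = reduce(merge, map(property_usage, partitions), Counter())
class_counts = reduce(merge, map(class_usage, partitions), Counter())

print(property_counts.most_common())
print(class_counts.most_common())
```

Because the merge step is associative and commutative, the partial counts can be combined in any order, which is what allows this kind of computation to be spread across machines.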
Files
Name | Size | Checksum |
---|---|---|
iswc_distlodstats.pdf | 268.2 kB | md5:c90d23a31b82ef4f75345c0315395ff5 |
Additional details
Related works
- Is documented by
- https://link.springer.com/chapter/10.1007/978-3-030-00668-6_13 (URL)
Funding
- QROWD – QROWD - Because Big Data Integration is Humanly Possible 732194
- European Commission
- BigDataOcean – BigDataOcean - Exploiting Ocean's of Data for Maritime Applications 732310
- European Commission
- WDAqua – Answering Questions using Web Data 642795
- European Commission
- BigDataEurope – Integrating Big Data, Software and Communities for Addressing Europe’s Societal Challenges 644564
- European Commission