This data set contains all 49 million non-singleton equivalence classes obtained by computing the transitive closure of the more than 556 million owl:sameAs statements in the sameAs.cc data set, which were extracted from the LOD Cloud in the 2015 LOD Laundromat crawl.
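To illustrate what "non-singleton equivalence classes resulting from the transitive closure" means, the following is a minimal sketch that groups terms connected by identity links using a union-find structure. It is not the procedure used to build this data set; the function name `same_as_closure` and the `ex:` terms are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def same_as_closure(pairs):
    """Return the non-singleton equivalence classes induced by (a, b) identity pairs."""
    parent = {}

    def find(x):
        # Register unseen terms as their own root, then walk up to the root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two classes

    classes = defaultdict(set)
    for term in list(parent):
        classes[find(term)].add(term)
    # Terms linked only to themselves form singleton classes and are dropped.
    return [members for members in classes.values() if len(members) > 1]


# Hypothetical toy links: a-b and b-c chain into one class, d-e forms another,
# and the reflexive link f-f yields a singleton that is discarded.
toy_pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e"), ("ex:f", "ex:f")]
toy_classes = same_as_closure(toy_pairs)
```

Because the links are treated as symmetric and transitive, `ex:a` and `ex:c` end up in the same class even though no statement links them directly.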
We represent these non-singleton equivalence classes using two CSV files:
1. id2terms.csv: contains in the first column the equivalence class identifier (a randomly generated number) and in the remaining columns all IRIs belonging to this equivalence class, which in theory should all refer to the same real-world entity. The following example shows one row of this file, where "42467584" in the first column is the ID of the equivalence class and the four other columns are the IRIs that are identical after transitive closure:
42467584 <http://nl.dbpedia.org/resource/Cnodocentron_trilineatum> <http://sv.dbpedia.org/resource/Cnodocentron_trilineatum> <http://vi.dbpedia.org/resource/Cnodocentron_trilineatum> <http://www.wikidata.org/entity/Q2304468>
2. terms2id.csv: contains two columns that map each IRI involved in an owl:sameAs link in the sameAs.cc data set to the equivalence class it belongs to. In the following, we present an example of one row in this file:
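As a rough illustration of how the two layouts relate, the sketch below parses the id2terms.csv example row shown above and derives the IRI-to-class mapping that terms2id.csv represents. It assumes that fields are separated by whitespace and that IRIs are wrapped in angle brackets, as in that example row; the actual delimiter in the distributed files may differ. The helper names `parse_id2terms_row` and `invert` are hypothetical.

```python
import re

# Assumption: each IRI is enclosed in angle brackets, as in the example row.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def parse_id2terms_row(line):
    """Split one id2terms row into (class_id, [iri, ...])."""
    class_id, rest = line.strip().split(None, 1)
    return class_id, IRI_PATTERN.findall(rest)


def invert(rows):
    """Build the IRI -> class ID mapping from parsed id2terms rows."""
    term_to_id = {}
    for class_id, terms in rows:
        for term in terms:
            term_to_id[term] = class_id
    return term_to_id


row = (
    "42467584 <http://nl.dbpedia.org/resource/Cnodocentron_trilineatum> "
    "<http://sv.dbpedia.org/resource/Cnodocentron_trilineatum> "
    "<http://vi.dbpedia.org/resource/Cnodocentron_trilineatum> "
    "<http://www.wikidata.org/entity/Q2304468>"
)
class_id, terms = parse_id2terms_row(row)
term_to_id = invert([(class_id, terms)])
```

The class ID is kept as a string here so that no assumption is made about its numeric range.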
In addition to the closure of all owl:sameAs links (available in the archive closure_all.zip), this data set contains two additional closures, each consisting of two CSV files with the same structure as presented above. These two additional closures are the following:
- closure_099.zip represents the closure of all owl:sameAs links in the sameAs.cc data set after discarding around 1 million probably erroneous owl:sameAs links (those with an error degree > 0.99). The error degree is computed from the community structure of the owl:sameAs network, following the approach of [Raad et al., 2018].
- closure_04.zip represents the closure of all owl:sameAs links in the sameAs.cc data set after discarding around 150 million owl:sameAs links (those with an error degree > 0.4). The evaluation conducted in [Raad et al., 2018] shows that the 400 million owl:sameAs links with an error degree <= 0.4 have a higher probability of being correct than the other links.
The availability of these three closures allows Linked Data practitioners, for the first time, to control in practice the trade-off between (a) using more identity links, possibly not all correct, and thereby benefiting from more contextual information from the LOD Cloud, and (b) using a smaller subset of higher-quality identity links, which limits the risk of propagating erroneous identity links and information through the application of owl:sameAs semantics, i.e., transitivity, symmetry, reflexivity, and property sharing.
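The effect of the error-degree cut-off on the resulting classes can be sketched on toy data. In the example below, the links, their error degrees, and the function names are all hypothetical; only the two cut-offs (0.99 and 0.4) and the archive names come from the description above. The sketch keeps the links whose error degree does not exceed a cut-off and then recomputes the equivalence classes.

```python
def equivalence_classes(pairs):
    """Non-singleton classes induced by (a, b) pairs, via a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return [g for g in groups.values() if len(g) > 1]


def closure_for(links, max_error):
    """Closure over the links whose error degree is <= max_error."""
    kept = [(s, o) for s, o, err in links if err <= max_error]
    return equivalence_classes(kept)


# Hypothetical links as (subject, object, error degree).
toy_links = [
    ("ex:a", "ex:b", 0.05),
    ("ex:b", "ex:c", 0.30),
    ("ex:c", "ex:d", 0.95),
    ("ex:d", "ex:e", 0.10),
    ("ex:e", "ex:f", 1.00),
]

# Cut-offs mirroring the three archives; "all" keeps every link.
cutoffs = {
    "closure_all.zip": float("inf"),
    "closure_099.zip": 0.99,
    "closure_04.zip": 0.4,
}
class_sizes = {
    name: sorted(len(c) for c in closure_for(toy_links, cutoff))
    for name, cutoff in cutoffs.items()
}
```

On this toy input, lowering the cut-off first removes one term from the single large class and then splits it into two smaller classes, which mirrors the trade-off between coverage and link quality described above.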