Web Data Commons (December 2020) Property and Datatype Usage Dataset
Creators
Description
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (December 2020) based on the Common Crawl September 2020 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v1.0.0, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.
Dataset Properties
- Size: 0.6 GiB compressed, 11.0 GiB uncompressed, 53 641 457 rows plus 1 header line, determined using: gunzip -c measurements.csv.gz | wc -l
- Parsing Failures: The scanner failed to parse 45 307 472 triples (~0.1 %) of the source dataset (containing 37 971 812 425 triples). The main reasons for failures were malformed IRIs and illegal character encodings.
- Content:
  - CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
  - FILE_URL: The URL of the Web Data Commons file that has been measured.
  - MEASUREMENT: The applied measurement with specific conditions, one of:
    - UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
    - UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
    - UsedAsDatatype: The total number of literals with the datatype.
    - UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
    - ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
    - ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
    - ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
    - ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
    - ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
    - ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
    - ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
    - ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
    - ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, xsd:integer, xsd:decimal, xsd:float, and xsd:double.
    Note: xsd:double values in embedded JSON-LD were normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.
  - PROPERTY: The property that has been measured.
  - DATATYPE: The datatype that has been measured.
  - QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.
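To make the notation-based measurements more concrete, the following Python sketch approximates some of the conditions with regular expressions and checks whether a numeric lexical is exactly representable as a double. It is an illustration only and not the scanner's implementation: the patterns, the names `NOTATIONS`, `matching_notations` and `is_imprecise_in_double`, the reading of "decimal notation" as requiring a decimal point, and the absence of any whitespace handling are all assumptions.

```python
import re
from decimal import Decimal

# Hypothetical patterns that approximate the lexical forms named in the
# measurement list; they are assumptions, not taken from the scanner.
_MANTISSA = r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)"
NOTATIONS = {
    "ValidIntegerNotation": re.compile(r"[+-]?[0-9]+"),
    # Assumption: "decimal notation" means a decimal point is present.
    "ValidDecimalNotation": re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)"),
    "ValidExponentialNotation": re.compile(_MANTISSA + r"[eE][+-]?[0-9]+"),
    "ValidInfOrNaNNotation": re.compile(r"[+-]?INF|NaN"),
    "ValidTrueOrFalseNotation": re.compile(r"true|false"),
    "ValidZeroOrOneNotation": re.compile(r"[01]"),
}


def matching_notations(lexical):
    """Return the names of all sketched notations that the lexical fully matches."""
    return {name for name, pattern in NOTATIONS.items() if pattern.fullmatch(lexical)}


def is_imprecise_in_double(lexical):
    """Return True if a numeric lexical has no exact IEEE 754 double value.

    The exact decimal value of the lexical is compared with the exact decimal
    expansion of the nearest double. Only call this for lexicals in integer,
    decimal or exponential notation.
    """
    return Decimal(lexical) != Decimal(float(lexical))
```

Under these assumptions a lexical may match more than one notation (for example, 0 is both an integer and a zero-or-one lexical); whether the dataset counts such lexicals under several measurements is not stated above.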
Preview
"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/height","http://www.w3.org/2001/XMLSchema#string","11"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://www.w3.org/ns/csvw#value","http://www.w3.org/2001/XMLSchema#string","27"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/saturatedFatContent","http://www.w3.org/2001/XMLSchema#string","1"
…
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMaxValue","http://www.w3.org/2001/XMLSchema#float","2"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMinValue","http://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://rdfs.org/sioc/ns#num_replies","http://www.w3.org/2001/XMLSchema#integer","1062"
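Because each row holds a count per file, property and datatype, most analyses start by summing QUANTITY over files. The following Python sketch shows one way to do that with the standard library; the function names are hypothetical, and UTF-8 encoding of the CSV is an assumption.

```python
import csv
import gzip
from collections import Counter


def sum_quantities(lines):
    """Sum QUANTITY per (MEASUREMENT, DATATYPE) over an iterable of CSV lines.

    The first line must be the header row shown in the preview.
    """
    totals = Counter()
    for row in csv.DictReader(lines):
        totals[(row["MEASUREMENT"], row["DATATYPE"])] += int(row["QUANTITY"])
    return totals


def sum_quantities_in_file(path):
    """Stream a gzip-compressed CSV file without decompressing it to disk."""
    # Assumption: the file is UTF-8 encoded.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return sum_quantities(handle)
```

For example, `sum_quantities_in_file("measurements.csv.gz")` would stream the whole file row by row, so the 11.0 GiB of uncompressed data never has to fit in memory at once.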
Note: The data contain malformed IRIs, like "xsd:dateTime" (probably meant to be "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
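When aggregating by datatype, such values can be mapped back to the IRI that was probably intended. The Python sketch below does this for the "xsd:" prefix only; the function name and the choice to rewrite only this one prefix are assumptions, and the rewrite is a guess about the publisher's intent rather than a correction guaranteed to be right.

```python
XSD_NAMESPACE = "http://www.w3.org/2001/XMLSchema#"


def expand_xsd_prefix(datatype):
    """Expand an unresolved "xsd:" prefix to the XML Schema namespace.

    Values that do not start with "xsd:" are returned unchanged.
    """
    prefix = "xsd:"
    # Skip values such as "xsd://..." that look like a scheme rather than a prefix.
    if datatype.startswith(prefix) and not datatype.startswith(prefix + "//"):
        return XSD_NAMESPACE + datatype[len(prefix):]
    return datatype
```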
Files
- Size: 599.7 MB
- Checksum: md5:234d3b4b77a68e4b5074e71bf394affa
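The MD5 checksum listed above can be used to verify a download. The following Python sketch computes the checksum of a local file in chunks; the function name and the chunk size are assumptions chosen for illustration.

```python
import hashlib

# Checksum as listed in the Files section.
EXPECTED_MD5 = "234d3b4b77a68e4b5074e71bf394affa"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if `md5_of_file(path) == EXPECTED_MD5`, where `path` points to the downloaded file.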
Additional details
Related works
- Continues
  - Dataset: 10.5281/zenodo.6359894 (DOI)
- Is compiled by
  - Software: 10.5281/zenodo.6258887 (DOI)
- Is continued by
  - Dataset: 10.5281/zenodo.6337660 (DOI)
- Is derived from
  - Dataset: http://webdatacommons.org/structureddata/2020-12/ (URL)
- Is required by
  - Software: 10.5281/zenodo.6264286 (DOI)