Published May 10, 2022 | Version 1.0.0
Dataset Open

Web Data Commons (November 2017) Property and Datatype Usage Dataset

Description

This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2017) based on the Common Crawl November 2017 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

Dataset Properties

  • Size: 16.2 MiB compressed, 327.9 MiB uncompressed, 1 509 385 rows plus 1 header line (determined using gunzip -c measurements.csv.gz | wc -l)
  • Parsing Failures: The scanner failed to parse 6 024 240 triples (~0.11 %) of the source dataset (containing 5 252 606 731 triples).
  • Content:
    • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
    • FILE_URL: The URL of the Web Data Commons file that has been measured.
    • MEASUREMENT: The applied measurement with specific conditions, one of:
      • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
      • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
      • UsedAsDatatype: The total number of literals with the datatype.
      • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
      • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
      • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
      • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
      • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
      • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
      • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
      • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
      • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
      • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, xsd:integer, xsd:decimal, xsd:float, and xsd:double.
      Note: The lexical representation of xsd:double values in embedded JSON-LD was normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measurements.
    • PROPERTY: The property that has been measured.
    • DATATYPE: The datatype that has been measured.
    • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.
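
The two Unprecise… measurements can be illustrated with a short Python sketch. It rests on one plausible reading of the condition, namely that a lexical counts when the number it denotes cannot be stored exactly as a 64-bit double or a 32-bit float. This is an assumption made for illustration and not the scanner's implementation; the function names are hypothetical, and only finite numeric lexicals are handled.

```python
import struct
from decimal import Decimal


def exactly_representable_in_double(lexical: str) -> bool:
    """True if the number denoted by the lexical is exactly a 64-bit double.

    Assumption: 'unprecise' means the nearest double differs from the
    denoted number. Only finite numeric lexicals are supported.
    """
    # Decimal(float) converts the binary value exactly, without rounding.
    return Decimal(lexical) == Decimal(float(lexical))


def exactly_representable_in_float(lexical: str) -> bool:
    """True if the number denoted by the lexical is exactly a 32-bit float."""
    try:
        # Round-trip through IEEE 754 single precision.
        as_float32 = struct.unpack("f", struct.pack("f", float(lexical)))[0]
    except OverflowError:
        # The value is finite but too large for single precision.
        return False
    return Decimal(lexical) == Decimal(as_float32)
```

Under this reading, 0.5 is precise in both types, 0.1 is precise in neither, and 16777217 (2^24 + 1) is precise as a double but not as a float.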

Preview

"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://search.yahoo.com/searchmonkey/product/identifier","http://search.yahoo.com/searchmonkey-datatype/use/sku","5"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/dc/elements/1.1/title","http://www.w3.org/2001/XMLSchema#string","1"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://opengraphprotocol.org/schema/longitude","http://www.w3.org/2001/XMLSchema#string","2290"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://dbpedia.org/ontology/Work/runtime","http://www.w3.org/2001/XMLSchema#string","2"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://schema.org/number","http://www.w3.org/2001/XMLSchema#string","1"
[…]
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-embedded-jsonld.nq-00609.gz","ValidZeroOrOneNotation","http://schema.org/worstRating","http://www.w3.org/2001/XMLSchema#integer","38"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-embedded-jsonld.nq-00609.gz","ValidZeroOrOneNotation","http://schema.org/price","http://www.w3.org/2001/XMLSchema#integer","1"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-embedded-jsonld.nq-00609.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","3"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-embedded-jsonld.nq-00609.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","1"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-embedded-jsonld.nq-00609.gz","ValidZeroOrOneNotation","http://schema.org/pageEnd","http://www.w3.org/2001/XMLSchema#integer","4"
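
Because QUANTITY is counted per file, property and datatype, most analyses start by summing it across files. The following Python sketch shows one way to do that with the standard library. The helper names are hypothetical, the UTF-8 encoding is an assumption, and the sample consists of the first three preview rows.

```python
import csv
import gzip
import io
from collections import Counter


def sum_quantities(rows):
    """Sum QUANTITY per (MEASUREMENT, DATATYPE) over all files and properties."""
    totals = Counter()
    for row in rows:
        totals[(row["MEASUREMENT"], row["DATATYPE"])] += int(row["QUANTITY"])
    return totals


def read_measurements(path):
    """Stream rows of the gzip-compressed CSV as dictionaries (UTF-8 assumed)."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


# The header line and the first three data rows from the preview.
SAMPLE = '''\
"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://search.yahoo.com/searchmonkey/product/identifier","http://search.yahoo.com/searchmonkey-datatype/use/sku","5"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://purl.org/dc/elements/1.1/title","http://www.w3.org/2001/XMLSchema#string","1"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2017-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","http://opengraphprotocol.org/schema/longitude","http://www.w3.org/2001/XMLSchema#string","2290"
'''

sample_totals = sum_quantities(csv.DictReader(io.StringIO(SAMPLE)))
```

For the full dataset, sum_quantities(read_measurements("measurements.csv.gz")) would aggregate all rows while reading the file as a stream.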

Note: The data contain malformed IRIs, such as "xsd:dateTime" (probably intended to be "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions on the source websites.
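
Rows with such IRIs can be flagged before analysis. The following Python sketch uses a deliberately simple heuristic that treats every value not starting with http:// or https:// as suspicious. This is an assumption for illustration: it also flags legitimate IRIs with other schemes (for example urn:), and the function name is hypothetical.

```python
def looks_like_unexpanded_prefix(iri: str) -> bool:
    """Heuristic: flag values that are not absolute HTTP(S) IRIs.

    Assumption: well-formed PROPERTY and DATATYPE values in this dataset
    start with http:// or https://. Values such as "xsd:dateTime" then
    stand out as prefixed names that were never expanded.
    """
    return not iri.startswith(("http://", "https://"))


def split_by_iri_quality(rows):
    """Split row dictionaries into (clean, suspicious) by their DATATYPE and PROPERTY."""
    clean, suspicious = [], []
    for row in rows:
        flagged = (looks_like_unexpanded_prefix(row["DATATYPE"])
                   or looks_like_unexpanded_prefix(row["PROPERTY"]))
        (suspicious if flagged else clean).append(row)
    return clean, suspicious
```

Keeping the suspicious rows in a separate list rather than dropping them makes it possible to report how much of the data is affected.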

Reproduce

To reproduce this dataset, check out the RDF Property and Datatype Usage Scanner v2.1.1 and execute:

mvn clean package
java -jar target/Scanner.jar --category html-rdfa --list http://webdatacommons.org/structureddata/2017-12/files/rdfa.list November2017
java -jar target/Scanner.jar --category html-embedded-jsonld --list http://webdatacommons.org/structureddata/2017-12/files/html-embedded-jsonld.list November2017
./measure.sh November2017
# Wait until the scan has completed. This will take a few days
java -jar target/Scanner.jar --results ./November2017/measurements.csv.gz November2017

Files

Files (17.0 MB)

md5:09f39cbb4806019f491e6012ce5b6fc9 (17.0 MB)

Additional details

Related works

Is compiled by
Software: 10.5281/zenodo.6338129 (DOI)
Is continued by
Dataset: 10.5281/zenodo.6477443 (DOI)
Is derived from
Dataset: http://webdatacommons.org/structureddata/2017-12/ (URL)