Wikary: A Dataset of N-ary Wikipedia Tables Matched to Qualified Wikidata Statements
Description
Created for the SemTab 2022 Datasets Track challenge.
General
Column names used in both files
`lang` - language and Wikipedia version used
`pageTitle` - page title
`tableIndex` - index of the table's position on the page
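Because these three columns appear in both files, they can serve as a composite key for relating match rows to their table. The following is a minimal sketch under that assumption; the sample rows, the helper names `table_key` and `group_matches_by_table`, and the in-memory list-of-dicts representation are all hypothetical, since the on-disk file format is not described here.

```python
from collections import defaultdict

# Columns documented as shared by the tables file and the matches file.
KEY_COLUMNS = ("lang", "pageTitle", "tableIndex")


def table_key(row):
    """Build a composite key from the shared columns (assumed to identify a table)."""
    return tuple(row[column] for column in KEY_COLUMNS)


def group_matches_by_table(matches):
    """Group match rows by the table they refer to."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[table_key(match)].append(match)
    return dict(grouped)


# Hypothetical rows, for illustration only (not taken from the dataset).
tables = [
    {"lang": "en", "pageTitle": "Example page", "tableIndex": 0,
     "tableCaption": "Example caption"},
]
matches = [
    {"lang": "en", "pageTitle": "Example page", "tableIndex": 0, "rowIndex": 2},
    {"lang": "en", "pageTitle": "Example page", "tableIndex": 0, "rowIndex": 5},
]

grouped = group_matches_by_table(matches)
rows_for_first_table = [
    match["rowIndex"] for match in grouped.get(table_key(tables[0]), [])
]
print(rows_for_first_table)
```

When loading the real files, it may be necessary to make sure `tableIndex` has the same type (for example, integer) in both files before comparing keys.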
Tables file
Column names used only in the tables file
`pageEntity` - Wikidata entity associated with the page
`sectionTitle` - the title of the section where the table is located
`tableCaption` - caption of the table
`headers` - headers of the table
`HTML` - HTML of the table
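The `HTML` column can be turned back into rows of cell text with a standard HTML parser. The sketch below uses only Python's standard library; the class name `TableCellParser`, the helper `table_rows`, and the sample markup are hypothetical, and the code deliberately ignores nested tables and `rowspan`/`colspan`, which real Wikipedia tables may use.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> is also delivered here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table as a list of rows, each a list of cell texts."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup, for illustration only (not taken from the dataset).
sample_html = (
    "<table>"
    "<tr><th>Year</th><th>Team</th></tr>"
    '<tr><td>2001</td><td><a href="/wiki/Example_FC">Example FC</a></td></tr>'
    "</table>"
)
rows = table_rows(sample_html)
print(rows)
```

Whether the header row counts toward `rowIndex` in the matches file is not stated here, so that should be checked against the data before aligning parsed rows with matches.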
Matches file
Column names used only in the matches file
`rowIndex` - index of the row in the given table where the match was found
`wikidata_ids` - Wikidata entities in the row, including `pageEntity`
`entities_index` - indexes of the cells in which the Wikidata entities were found; -1 denotes the Wikidata entity associated with the page, and -9 denotes a cell that contains a date
`entities_anchor` - anchor text of cells where Wikidata entities were found
`entities_cell_text` - cell text of cells where Wikidata entities were found
`subject` - subject of Wikidata statement
`property` - property of Wikidata statement
`object` - object of Wikidata statement
`property_qualifier` - property qualifier of Wikidata statement
`qualifier_value` - qualifier value of Wikidata statement
`id_match` - 1 means that the row contains a *Wikidata identifier match*, 0 means no match
`year_match` - 1 means that the row contains a *Year cell match*, 0 means no match
`year_part_match` - 1 means that the row contains a *Within cell year match*, 0 means no match
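The columns above can be read together as one qualified Wikidata statement plus information about how it was matched. The following is a minimal sketch, assuming each match row has already been loaded into a Python dict and that `entities_index` has been parsed into a list of integers; the helper names and the placeholder values are hypothetical and do not come from the dataset.

```python
# Sentinel values documented for `entities_index`.
PAGE_ENTITY_INDEX = -1  # entity associated with the page
DATE_CELL_INDEX = -9    # cell that contains a date

MATCH_FLAGS = ("id_match", "year_match", "year_part_match")


def describe_index(index):
    """Translate one `entities_index` value into a readable label."""
    if index == PAGE_ENTITY_INDEX:
        return "page entity"
    if index == DATE_CELL_INDEX:
        return "date cell"
    return f"cell {index}"


def matched_by(row):
    """Return the names of the match flags that are set to 1 for this row."""
    return [flag for flag in MATCH_FLAGS if int(row[flag]) == 1]


def qualified_statement(row):
    """Collect the statement columns of a match row into one dict."""
    return {
        "subject": row["subject"],
        "property": row["property"],
        "object": row["object"],
        "qualifier": (row["property_qualifier"], row["qualifier_value"]),
    }


# Hypothetical match row with placeholder values, for illustration only.
match_row = {
    "entities_index": [-1, 1, -9],
    "subject": "SUBJECT_ID",
    "property": "PROPERTY_ID",
    "object": "OBJECT_ID",
    "property_qualifier": "QUALIFIER_PROPERTY_ID",
    "qualifier_value": "QUALIFIER_VALUE",
    "id_match": 1,
    "year_match": 0,
    "year_part_match": 1,
}

labels = [describe_index(i) for i in match_row["entities_index"]]
flags = matched_by(match_row)
statement = qualified_statement(match_row)
print(labels, flags, statement["qualifier"])
```

If the real files store list-valued columns such as `entities_index` as strings, they would need to be parsed first; that step is omitted because the serialization is not described here.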