Published March 14, 2024 | Version 1
Dataset Open

Wikidata Lemmatization Dataset

  • 1. FactGrid
  • 2. University of California, Berkeley

Description

The Wikidata Lemmatization Dataset was collected using the following SPARQL query: https://w.wiki/9TwH

Languages included in the dataset:

  • Akkadian : AKK (Q35518)
  • Arabic : AR (Q13955)
  • Czech : CS (Q9056)
  • German : DE (Q188)
  • English : EN (Q1860)
  • French : FR (Q150)
  • Hebrew : HE (Q9288)
  • Hittite : HIT (Q35668)
  • Italian : IT (Q652)
  • Russian : RU (Q7737)
  • Sumerian : SUX (Q36790)
  • Turkish : TR (Q256)

The choice of languages reflects a collection of primary and secondary source documents that we have digitized (OCR) and use as references for the FactGrid Cuneiform project. The resulting lexemes for each language are shared as CSV files. Each file name references the language code, its Wikidata Q-id, the number of lexemes at that date, and the date of access (MM_YYYY).
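The file-naming convention can be unpacked programmatically. The following is a minimal Python sketch, assuming every file follows the same pattern as AKK_Q35518_6582_03_2024.csv (language code, Q-id, lexeme count, month, year, separated by underscores). The names `LexemeFileName` and `parse_file_name` are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
import re
from typing import NamedTuple


class LexemeFileName(NamedTuple):
    """Components encoded in a dataset file name (hypothetical helper type)."""
    language_code: str
    qid: str
    lexeme_count: int
    month: int
    year: int


# Assumed pattern: CODE_Qid_count_MM_YYYY.csv, inferred from the one
# file name shown on the record page.
_FILE_NAME = re.compile(
    r"(?P<code>[A-Z]+)_(?P<qid>Q\d+)_(?P<count>\d+)_(?P<month>\d{2})_(?P<year>\d{4})\.csv"
)


def parse_file_name(name: str) -> LexemeFileName:
    """Split a file name into its language, Q-id, count, and access date."""
    match = _FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return LexemeFileName(
        language_code=match["code"],
        qid=match["qid"],
        lexeme_count=int(match["count"]),
        month=int(match["month"]),
        year=int(match["year"]),
    )


info = parse_file_name("AKK_Q35518_6582_03_2024.csv")
print(info.language_code, info.qid, info.lexeme_count, info.month, info.year)
```

Raising on an unexpected name, rather than guessing, keeps files from a later version with a different naming scheme from being silently misread.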

The format of each CSV includes the following fields:

  1. lexeme : the Wikidata lexeme id (L-id)
  2. lexemeLabel : the label assigned to the lexeme in Wikidata
  3. lexical_category : the Wikidata Q-item for the part of speech
  4. lexical_categoryLabel : the label assigned to the lexical category (e.g. noun, verb, adjective, etc.)
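As an illustration of how these fields might be consumed, the sketch below counts lexemes per part of speech using Python's standard csv module. It assumes each CSV is UTF-8 encoded and begins with a header row carrying exactly the four field names listed above; neither assumption is confirmed by the record page. The function name `count_by_category` is hypothetical.

```python
import csv
from collections import Counter
from typing import TextIO


def count_by_category(stream: TextIO) -> Counter:
    """Count rows per lexical_categoryLabel in an open CSV text stream.

    Assumes the first row is a header containing 'lexical_categoryLabel'.
    """
    reader = csv.DictReader(stream)
    return Counter(row["lexical_categoryLabel"] for row in reader)


if __name__ == "__main__":
    # File name as shown on the record page; adjust the path as needed.
    path = "AKK_Q35518_6582_03_2024.csv"
    with open(path, encoding="utf-8", newline="") as handle:
        for label, total in count_by_category(handle).most_common():
            print(f"{label}: {total}")
```

Taking an open stream instead of a path keeps the function usable with in-memory data as well as with downloaded files.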

This dataset will be updated periodically using standard version control.

Files

AKK_Q35518_6582_03_2024.csv

Files (59.9 MB)

MD5 checksum                              Size
md5:744bcf70b6032d5a9fe76c7b3cbbe437      597.1 kB
md5:b3ae717b7e141d6f8a184fa0e94365f2      166.1 kB
md5:b624d65a2b11f53a9a53c08e4bddd1a3      2.5 MB
md5:243bb56363ad70b9c1ddfdc9bf6ae48f      20.3 MB
md5:86d409fe0cfce388a85f8857f928a592      6.9 MB
md5:ddd647f4253feef9fb3e65383061601e      5.1 MB
md5:67cedfe9ccc09283e2264db0f8eb5815      1.8 MB
md5:dfdf59e111eabf38401b630a87ce3371      4.9 MB
md5:3dc5cfb4934126d72cf390906c077381      2.3 kB
md5:7c08b8c293c6222b6eaea610cb433c84      5.8 MB
md5:6c25a0b80d768aab9a898665b575b089      10.3 MB
md5:ed0ab281f589bc156aa2212ebf846eb9      1.1 MB
md5:6e82ae8e380b1baacc5597dbdac0c535      306.8 kB
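
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module. The helper names `md5_hex` and `matches_checksum` are hypothetical, and the listing does not show which checksum belongs to which file, so the expected value must be taken from the record page entry for the file in question.

```python
import hashlib
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(stream: BinaryIO, expected: str) -> bool:
    """Compare a stream's MD5 with an expected value such as 'md5:<hex>'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected_hex = expected.removeprefix("md5:").strip().lower()
    return md5_hex(stream) == expected_hex
```

A downloaded file would be opened in binary mode, for example `open(path, "rb")`, and passed to `matches_checksum` together with the checksum string copied from the listing. Note that `str.removeprefix` requires Python 3.9 or later.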

Additional details

Dates

Collected
2024-03-14
Short link to SPARQL query: https://w.wiki/9Twe