Published May 28, 2021 | Version 1.0.0
Dataset Open

Hachidaishu part of speech dataset

  • 1. Tokyo Institute of Technology
  • 2. Osaka University

Description

This dataset contains part-of-speech information for the Hachidaishu, the first eight imperial anthologies of Japanese poetry.

Data format

Example:  #1 Kokinshu

10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる は/係助/は き/カ変-用:来:く/き に/完-用:ぬ:ぬ/に けり/過-終:けり:けり/けり 一とせ/名/ひととせ を/*助/を こそ/名/こぞ と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む ことし/名/ことし と/格助/と や/係助/や いは/ハ四-未:言ふ:いふ/いは ん/推-終体:む:む/む

Each line is one poem. Tokens are separated by spaces, and each token consists of POS elements separated by slashes.

  •  The first column ("10001") combines two elements: the first digit is the anthology ID and the remaining digits are the poem ID. Anthology IDs: 1 = Kokinshu, 2 = Gosenshu, 3 = Shuishu, 4 = Goshuishu, 5 = Kin'yoshu, 6 = Shikashu, 7 = Senzaishu, 8 = Shinkokinshu.
  •  The poem ID is the same as in the database "Nijuichidaishu."
  •  The second and following columns give the information for each token.
  •  Tokens without conjugation, such as nouns and particles: text/POS/reading.
  •  Tokens with conjugation, such as verbs and adjectives: text/POS:lemma-kanji:lemma-reading/reading.
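The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `Token`, `parse_token`, and `parse_line` are hypothetical, and it assumes that every token has exactly three slash-separated fields and that a conjugating token carries exactly three colon-separated parts in its middle field. The poem ID is kept as a string so that leading zeros are preserved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Anthology IDs as listed in the format description.
ANTHOLOGIES = {
    1: "Kokinshu", 2: "Gosenshu", 3: "Shuishu", 4: "Goshuishu",
    5: "Kin'yoshu", 6: "Shikashu", 7: "Senzaishu", 8: "Shinkokinshu",
}


@dataclass
class Token:
    text: str
    pos: str
    reading: str
    lemma_kanji: Optional[str] = None    # only set for conjugating tokens
    lemma_reading: Optional[str] = None  # only set for conjugating tokens


def parse_token(raw: str) -> Token:
    """Parse 'text/POS/reading' or 'text/POS:lemma-kanji:lemma-reading/reading'."""
    # Assumption: exactly three slash-separated fields; otherwise ValueError.
    text, middle, reading = raw.split("/")
    parts = middle.split(":")
    if len(parts) == 3:
        pos, lemma_kanji, lemma_reading = parts
        return Token(text, pos, reading, lemma_kanji, lemma_reading)
    # No lemma information: treat the whole middle field as the POS tag.
    return Token(text, middle, reading)


def parse_line(line: str) -> Tuple[int, str, List[Token]]:
    """Split one poem line into (anthology ID, poem ID, tokens)."""
    fields = line.split()
    key = fields[0]
    anthology_id = int(key[0])  # first digit
    poem_id = key[1:]           # remaining digits, kept as a string
    return anthology_id, poem_id, [parse_token(f) for f in fields[1:]]


# First seven tokens of the example line shown above.
sample = "10001 年/名/とし の/格助/の 内/名/うち に/格助/に 春/名/はる は/係助/は き/カ変-用:来:く/き"
anthology_id, poem_id, tokens = parse_line(sample)
print(ANTHOLOGIES[anthology_id], poem_id, len(tokens))
print(tokens[-1])
```

Raising an error on tokens that do not split into three fields is a deliberate choice in this sketch: it surfaces lines that deviate from the documented format instead of silently mis-parsing them.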

Files

hachidaishu-pos.txt (4.2 MB)

md5:585dc98f23348b331f68f58bf63440b2
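A downloaded copy can be checked against the MD5 checksum listed for the file. The sketch below uses Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are hypothetical, and the local path `hachidaishu-pos.txt` in the usage comment assumes the file was saved under its original name in the current directory.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "585dc98f23348b331f68f58bf63440b2"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (assumes the file is in the current directory):
# print(verify("hachidaishu-pos.txt"))
```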

Additional details

References

  • Hilofumi Yamamoto. POS tagger for Classical Japanese Poems, The Study of Japanese Linguistics, The Society of Japanese Linguistics, Vol. 3, No. 3, pp. 33-39, July 2007.
  • Hilofumi Yamamoto. Thesaurus for the Hachidaishu (ca. 905-1205) with the classification codes based on semantic principles, The Study of Japanese Linguistics, The Society of Japanese Linguistics, Vol. 5, No. 1, pp. 46-52, Jan. 2009.