README - Token-level Metadata of the ChildPoeDE Corpus
(Lehmann, Heumann, Kuijpers, Lauer & Lüdtke, 2023)


Word_Id*
	The word's unique identifier in childPoeDE. All word ids start with "w_" followed by consecutive numbers, e.g. w_000001.	

Poem_Id*	
	The poem's unique identifier in childPoeDE. All poem ids start with "p_" followed by consecutive numbers, e.g. p_00001.
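	The two id formats can be sketched in Python as below. The zero-padded widths (six digits for words, five for poems) are inferred from the examples w_000001 and p_00001 and are an assumption; the helper names are hypothetical and not part of poemtool.py.

```python
import re

# Widths are an assumption read off the examples w_000001 and p_00001.
WORD_ID = re.compile(r"w_\d{6}")
POEM_ID = re.compile(r"p_\d{5}")


def make_word_id(n: int) -> str:
    """Format a running word number as a word id (hypothetical helper)."""
    return f"w_{n:06d}"


def make_poem_id(n: int) -> str:
    """Format a running poem number as a poem id (hypothetical helper)."""
    return f"p_{n:05d}"


def is_valid_id(value: str) -> bool:
    """True if the value looks like a word id or a poem id."""
    return bool(WORD_ID.fullmatch(value) or POEM_ID.fullmatch(value))
```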

Title_Txt_File*	
	The title of the poem's txt file excluding the id. Identical to the poem's title.
	If a poem has no title, the file name consists of the first few words of the poem.
	For poems with identical titles, the first letter of the author's surname is included, e.g. Abendlied_B.txt and Abendlied_C.txt.
	File names do not contain ä, ö or ü.

Title_Poem*	
	The title of the poem.
	"Kein Titel" for poems without title.	

Has_Title*	
	Data on whether the poem has a title or not.
	0: no title
	1: with title

Special_Layout*	
	Data on whether the poem includes at least one tab character.
	Tab characters can be used as a proxy for measuring deviations from the standard poem layout (all lines left-aligned).
	0: no tab characters
	1: at least one tab character

Has_Punct*	
	Data on whether one or more of the following punctuation marks appear in the poem: \.,;:!\?–\-\*\(\)\[\]\{\}·…„“
	0: no punctuation marks
	1: at least one punctuation mark from the list
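	The list above reads like the body of a regular-expression character class, with the backslashes acting as escapes. Under that assumption, the flag could be reproduced roughly as follows; the function name is hypothetical, and this is a sketch rather than the code used in poemtool.py.

```python
import re

# Character class built from the list above (assumed to be regex syntax).
PUNCT_PATTERN = re.compile(r"[\.,;:!\?–\-\*\(\)\[\]\{\}·…„“]")


def has_punct(poem_text: str) -> int:
    """Return 1 if any listed punctuation mark occurs in the text, else 0."""
    return 1 if PUNCT_PATTERN.search(poem_text) else 0
```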

Has_Uppercase*
	Data on whether the poem contains uppercase letters.
	0: no uppercase letters
	1: at least one uppercase letter

Has_Lowercase*	
	Data on whether the poem contains lowercase letters.
	0: no lowercase letters
	1: at least one lowercase letter

Has_Titlecase*
	Data on whether the poem contains words in title case.
	0: no words in title case
	1: at least one word in title case	

Has_Sentence_Like_Structure*	
	Data on whether the poem is (most likely) structured in sentences.
	The flag rests on the assumption that a poem combining punctuation marks with words in lowercase, uppercase and title case could be structured in sentences.
	0: one or more of the variables Has_Punct, Has_Uppercase, Has_Lowercase and Has_Titlecase is 0.
	1: all of the variables Has_Punct, Has_Uppercase, Has_Lowercase and Has_Titlecase are 1.
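	The rule above is a logical AND over the four flags. A minimal sketch, with a hypothetical function name and the flags passed as integers:

```python
def has_sentence_like_structure(has_punct: int, has_uppercase: int,
                                has_lowercase: int, has_titlecase: int) -> int:
    """Return 1 only if all four flags are 1, otherwise 0."""
    flags = (has_punct, has_uppercase, has_lowercase, has_titlecase)
    return int(all(flag == 1 for flag in flags))
```

	When the values come straight from the csv file they are strings, so they would need to be converted with int() first.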

Stanza_Nr*
	The number of the stanza to which the word belongs. If the word is part of the title, "Stanza_Nr" is 0.

Line_Nr*
	The number of the line to which the word belongs. If the word is part of the title, "Line_Nr" is 0.	

Line_Length_Words*	
	Length of the line containing the word (measured in words).

Word_Typ_Tagger**	
	Part-of-speech tag assigned by TreeTagger (detailed POS tags). For words not included in TreeTagger's standard dictionary, part-of-speech tags were inserted manually.

Word_Typ_Tagger_Content_Function**	
	Data on whether the word is a content word or a function word.
	c: content word
	f: function word
	(The word types coded as content words can be reconstructed by cross-tabulating the columns Word_Typ_Tagger and Word_Typ_Tagger_Content_Function.)
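	One way to reconstruct which detailed tags are coded as content or function words is to collect the tags seen for each class. The sketch below assumes the rows have already been read into dictionaries keyed by column name; the sample rows are made up for illustration and do not come from the corpus.

```python
from collections import defaultdict


def tags_by_class(rows):
    """Map each class ('c' or 'f') to the set of detailed POS tags coded that way."""
    result = defaultdict(set)
    for row in rows:
        word_class = row["Word_Typ_Tagger_Content_Function"]
        result[word_class].add(row["Word_Typ_Tagger"])
    return dict(result)


# Made-up sample rows, for illustration only.
rows = [
    {"Word_Typ_Tagger": "NN", "Word_Typ_Tagger_Content_Function": "c"},
    {"Word_Typ_Tagger": "ART", "Word_Typ_Tagger_Content_Function": "f"},
    {"Word_Typ_Tagger": "VVFIN", "Word_Typ_Tagger_Content_Function": "c"},
    {"Word_Typ_Tagger": "NN", "Word_Typ_Tagger_Content_Function": "c"},
]
mapping = tags_by_class(rows)
```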

Word_Typ_Tagger_Rough**	
	Broader POS tags, grouped into the categories: noun, adjective/adverb, verb, CARD, FM, function, ITJ, NGONO (onomatopoeic words)

Onomatopoeia***
	Data on whether the word is an onomatopoeic word.	
	0: no
	1: yes 

Word_Nr_In_Poem*	
	The position of the word within the poem (running word number).

Word_Nr_In_Stanza*	
	The position of the word within the stanza (running word number).

Word_Nr_In_Line*	
	The position of the word within the line (running word number).

Word_Length*	
	Length of the word (number of characters).

Word_In_Title*	
	Data on whether the word is part of the title.
	0: not in title
	1: word in title

Sonority_Score*	
	Sonority score of the word, calculated as the average sonority of the characters within the word (cf. Jacobs, 2017; Stenneken et al., 2005). Maximum: 7, Minimum: 1.
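	The averaging step can be sketched as below. The character-to-sonority table is NOT given in this README: TOY_SCALE contains hypothetical values chosen only to make the example run, and the handling of case and of characters missing from the table (lowercased, skipped) is likewise an assumption.

```python
def sonority_score(word, scale):
    """Average the sonority values of the characters found in `scale`.

    Characters without an entry are skipped (an assumption); returns None
    if no character of the word has a sonority value.
    """
    values = [scale[ch] for ch in word.lower() if ch in scale]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical values for illustration only, not the scale used for the corpus.
TOY_SCALE = {"a": 7, "l": 4, "t": 1}
```

	With this toy table, "Tal" averages the values 1, 7 and 4.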

*	Data created with "poemtool.py"
**	Data created with TreeTagger or derived from TreeTagger data (TreeTagger output was manually corrected)
***	Manually added data

Note:
The output of the Python script "poemtool.py" is the basis for the token-level metadata file. However, this script does not determine POS tags and lemmas.
The token-level metadata file combines the output of poemtool.py with additional data generated with TreeTagger and some manually added data (e.g. Onomatopoeia).

Format: csv, delimiter: |
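
A minimal sketch of reading a file in this format with Python's standard csv module. It assumes the file has a header row using the column names listed above; the sample data is made up and stands in for the real file.

```python
import csv
import io


def read_token_metadata(handle):
    """Return the rows of a pipe-delimited metadata file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="|"))


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "Word_Id|Poem_Id|Stanza_Nr|Line_Nr|Word_In_Title\n"
    "w_000001|p_00001|0|0|1\n"
    "w_000002|p_00001|1|1|0\n"
)
rows = read_token_metadata(sample)

# All values are read as strings; keep only the words outside the title.
body_words = [row for row in rows if row["Word_In_Title"] == "0"]
```

For a file on disk, the handle would typically be opened with open(path, newline="", encoding="utf-8"); the UTF-8 encoding is an assumption.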