The dative dataset of World Englishes
Description
This dataset is distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Use it for research purposes only.
The dataset contains 13,171 variable double-object and prepositional datives extracted from the International Corpus of English (ICE) series and the Corpus of Global Web-based English (GloWbE), sampled from nine national varieties of English:
- British English
- Canadian English
- New Zealand English
- Irish English
- Hong Kong English
- Philippine English
- Singapore English
- Indian English
- Jamaican English
The dataframe contains the following columns:
1 TokenID: Unique identifier for the individual token
2 Variety: The variety from which the token is taken
3 Nativity: Native or non-native variety of English (L1 vs. L2)
4 Corpus: The corpus from which the token stems
5 Subcorpus: Combination of Corpus and Variety
6 FileID: ID of the corpus file in which the token was found. Format: VARIETY:FILENAME
7 TextID: ID of the corpus text in which the token was found. Individual files in ICE can have multiple texts. Format: VARIETY:FILENAME:TEXTNUMBER
8 LineID: ID of the line in the text in which the token sentence was found. Format: VARIETY:FILENAME:TEXTNUMBER:LINENUMBER
9 SpeakerID: ID of the speaker of the sentence. Speakers in spoken texts are indicated with capital letters. Authors of written texts have ID ‘A’. Format: VARIETY:FILENAME:TEXTNUMBER:SPEAKERID
10 UnitMarker: Unit marker of the utterance in the text. Format: UTTERANCENUMBER:TEXTNUMBER:SPEAKERID
11 GenreFine: 14-level distinction: The 12-level ICE sub-register in which the token was found and the two levels in GloWbE (blog vs. general). Levels: See ICE documentation
12 GenreCoarse: 5-level distinction: The 4-level ICE register in which the token was found and GloWbE (online = 1 level). Levels: See ICE documentation
13 Mode: The mode (‘spoken’, ‘written’) of the token.
14 Register: The 4-level Register along two axes – spoken vs. written / informal vs. formal
15 PriorContextPlain: The plain text version of the 100 words preceding the dative token.
16 PriorContextTag: The POS-tagged version of the 100 words preceding the dative token.
17 SentencePlain: The plain text version of the sentence containing the dative token.
18 SentenceTag: The POS-tagged version of the sentence containing the dative token.
19 WholeConstructionPlain: The plain text version of the VP containing the dative token (i.e. verb + object + object).
20 WholeConstructionTag: The POS-tagged version of the VP containing the dative token.
21 Verb: The lemma of the verbal head (e.g. ‘give’ in ‘gave it some thought’)
22 VerbForm: The verb form of the verbal head (e.g. ‘gave’ in ‘gave it some thought’)
23 RecipientShort: The short plain text version of the recipient without hesitations or repetitions
24 ThemeShort: The short plain text version of the theme without hesitations or repetitions
25 RecipientLong: The long plain text version of the recipient with hesitations or repetitions
26 ThemeLong: The long plain text version of the theme with hesitations or repetitions
27 RecHeadPlain: The plain text version of the recipient head
28 RecHeadTag: The POS-tagged version of the recipient head
29 RecHeadLemma: The lemma of the recipient head
30 ThemeHeadPlain: The plain text version of the theme head
31 ThemeHeadTag: The POS-tagged version of the theme head
32 ThemeHeadLemma: The lemma of the theme head
33 VerbThemeLemma: Combination of the verb lemma and the theme head. Format: VERB_THEME
34 VerbSense: Semantics of the verb based on the whole construction combined with the verb lemma. Format: VERB.VERBSEMANTICS
35 VerbSemantics: 5-level distinction of verb semantics (‘a’, ‘t’, ‘p’, ‘f’, ‘c’).
36 Resp: The variant order. Levels: ‘do’ (=ditransitive), ‘pd’ (=prepositional)
37 RecAnimacy: 6-level distinction of recipient animacy following previous research: human (a1) > animal (a2) > collective (c) > locative (l) > temporal (t) > inanimate (i)
38 ThemeAnimacy: 6-level distinction of theme animacy following previous research: human (a1) > animal (a2) > collective (c) > locative (l) > temporal (t) > inanimate (i)
39 RecWordLth: Length of recipient NP in words
40 RecLetterLth: Length of recipient NP in orthographic characters
41 ThemeWordLth: Length of theme NP in words
42 ThemeLetterLth: Length of theme NP in orthographic characters
43 RecComplexity: 15-level distinction of recipient complexity indicating type and number of posthead dependents, restricted to the ICE components (the GloWbE components make a simplified distinction between ‘simple’ and ‘complex’). Levels: ‘s’ = simple (no postmodifications), ‘co’ = coordinated, ‘ge’ = general extender, ‘gn’ = genitive, ‘postad’ = postmodifying adverbial/adjective, ‘pp’ = modifying prepositional phrase, ‘appnom’ = nominal apposition, ‘rc’ = relative clause, ‘cp’ = complement clause, ‘advc’ = adverbial clause, ‘nonfin’ = nonfinite clause, ‘tpp’ = two nominal posthead dependents, ‘tvp’ = two posthead dependents involving at least one VP, ‘mpp’ = more than two nominal posthead dependents, ‘mvp’ = more than two posthead dependents involving at least one VP
44 ThemeComplexity: 15-level distinction of theme complexity indicating type and number of posthead dependents, restricted to the ICE components (the GloWbE components make a simplified distinction between ‘simple’ and ‘complex’). Levels: ‘s’ = simple (no postmodifications), ‘co’ = coordinated, ‘ge’ = general extender, ‘gn’ = genitive, ‘postad’ = postmodifying adverbial/adjective, ‘pp’ = modifying prepositional phrase, ‘appnom’ = nominal apposition, ‘rc’ = relative clause, ‘cp’ = complement clause, ‘advc’ = adverbial clause, ‘nonfin’ = nonfinite clause, ‘tpp’ = two nominal posthead dependents, ‘tvp’ = two posthead dependents involving at least one VP, ‘mpp’ = more than two nominal posthead dependents, ‘mvp’ = more than two posthead dependents involving at least one VP
45 RecNPExprType: Syntactic category of the recipient NP. Levels: ‘dem’ = bare demonstrative; ‘nc’ = common noun; ‘np’ = proper noun; ‘pprn’ = personal pronoun; ‘iprn’ = impersonal pronoun; ‘rprn’ = reflexive pronoun; ‘vp’ = gerund (-ing) NP; ‘wh’ = NP headed by wh- word
46 ThemeNPExprType: Syntactic category of the theme NP. Levels: ‘dem’ = bare demonstrative; ‘nc’ = common noun; ‘np’ = proper noun; ‘pprn’ = personal pronoun; ‘iprn’ = impersonal pronoun; ‘rprn’ = reflexive pronoun; ‘vp’ = gerund (-ing) NP; ‘wh’ = NP headed by wh- word
47 RecGivenness: Givenness of the recipient NP. Levels: ‘given’, ‘new’
48 ThemeGivenness: Givenness of the theme NP. Levels: ‘given’, ‘new’
49 RecDefiniteness: Definiteness of the recipient NP. Levels: ‘def’, ‘indef’
50 ThemeDefiniteness: Definiteness of the theme NP. Levels: ‘def’, ‘indef’
51 RecBinComplexity: Binary predictor of recipient complexity indicating following postmodifications after the head noun. Levels: ‘simple’, ‘complex’
52 ThemeBinComplexity: Binary predictor of theme complexity indicating following postmodifications after the head noun. Levels: ‘simple’, ‘complex’
53 RecPerson: Person of recipient. Levels: ‘local’, ‘non-local’
54 ThemeConcreteness: Concreteness of theme based on verb semantics. Levels: ‘concrete’, ‘non-concrete’
55 TypeTokenRatio: Type-token ratio of the 100-word context surrounding the token
56 RecHeadFreq: Frequency of recipient head lemma in GloWbE
57 ThemeHeadFreq: Frequency of theme head lemma in GloWbE
58 RecThematicity: Normalized frequency of recipient head lemma in its text (per 2000 words)
59 ThemeThematicity: Normalized frequency of theme head lemma in its text (per 2000 words)
60 PrimeType: The response type of the preceding dative token, if any. Levels: ‘do’, ‘pd’, ‘NA’
61 Persistence: Indicates whether the preceding dative token, if any, is the same variant as the current token. Levels: ‘none’, ‘yes’, ‘no’
62 SameUtterance: Indicates whether the preceding dative token occurred in the same utterance or not. Necessary for manual coding of persistence.
63 DistanceToPrevious: Number of utterances between current and preceding dative token. ‘None’ if no preceding dative token.
64 RecPron: Binary factor of recipient pronominality. Levels: ‘pron’, ‘non-pron’
65 ThemePron: Binary factor of theme pronominality. Levels: ‘pron’, ‘non-pron’
66 RecBinAnimacy: Binary factor of recipient animacy. Levels of RecAnimacy conflated to: ‘animate’, ‘inanimate’
67 ThemeBinAnimacy: Binary factor of theme animacy. Levels of ThemeAnimacy conflated to: ‘animate’, ‘inanimate’
68 logRecLetterLth: Natural logarithm of recipient length in orthographic characters
69 logThemeLetterLth: Natural logarithm of theme length in orthographic characters
70 WeightRatio: Ratio of object lengths: Recipient length in characters divided by theme length in characters
71 logWeightRatio: Natural logarithm of weight ratio
72 PrimeTypePruned: The variant of the preceding dative token within the previous 10 utterances. Levels: ‘none’, ‘do’, ‘pd’
73 NumDistanceToPrevious: Numeric distance to previous token (for calculations in R)
74 PersistencePruned: Indicates whether the preceding token within the previous 10 utterances is the same as the current token. Levels: ‘none’, ‘yes’, ‘no’
75-82 z.__________: Numeric predictor centered around the mean and scaled by two standard deviations.
83 Variety.Sum: Column used for sum coding in modeling process
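The colon-delimited ID formats in columns 6 to 9 nest inside one another, so a LineID can be split back into its parts and a TextID rebuilt from it. The following is a minimal Python sketch of that idea, not part of the dataset's own tooling. The helper names `parse_line_id` and `text_id` and the example value are hypothetical, and the sketch assumes that file names contain no colon.

```python
from typing import NamedTuple


class LineRef(NamedTuple):
    """Parts of a LineID (VARIETY:FILENAME:TEXTNUMBER:LINENUMBER)."""
    variety: str
    filename: str
    text_number: str
    line_number: str


def parse_line_id(line_id: str) -> LineRef:
    """Split a LineID into its four colon-separated fields."""
    parts = line_id.split(":")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 colon-separated fields, got {len(parts)}: {line_id!r}"
        )
    return LineRef(*parts)


def text_id(ref: LineRef) -> str:
    """Rebuild the TextID (VARIETY:FILENAME:TEXTNUMBER) from a parsed LineID."""
    return ":".join(ref[:3])


# Hypothetical value, made up for illustration; real IDs may look different.
ref = parse_line_id("GB:S1A-001:1:42")
print(ref.filename)
print(text_id(ref))
```

Keeping the fields as strings avoids guessing whether text and line numbers are always numeric in the data.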
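The derived numeric columns (WeightRatio, logWeightRatio and the z.-prefixed predictors) follow directly from their definitions above. The sketch below illustrates the arithmetic in Python on made-up lengths. It is an illustration under assumptions, not the code used to build the dataset: the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the description does not say which estimator was used.

```python
import math
import statistics


def weight_ratio(rec_letters: int, theme_letters: int) -> float:
    """Recipient length in characters divided by theme length in characters."""
    return rec_letters / theme_letters


def standardize_2sd(values):
    """Center values on their mean and scale by two standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - mean) / (2 * sd) for v in values]


# Made-up (recipient, theme) lengths in characters, for illustration only.
lengths = [(2, 12), (10, 5), (3, 30), (8, 8)]

# Natural logarithm of the weight ratio, as in logWeightRatio.
log_ratios = [math.log(weight_ratio(rec, theme)) for rec, theme in lengths]
z_scores = standardize_2sd(log_ratios)

for log_ratio, z in zip(log_ratios, z_scores):
    print(f"{log_ratio:.3f} {z:.3f}")
```

After this transformation the values have mean 0 and standard deviation 0.5, which is the usual motivation for dividing by two standard deviations rather than one.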
Files
license_CC-BY-NC_for research only.txt
Total size: 36.9 MB
Checksum | Size |
---|---|
md5:102cd102cbc948f4342b393ce2781c51 | 19.4 kB |
md5:1d81144e92b61960a4f31232190279e4 | 36.9 MB |
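The md5 values listed for the two files can be used to check that a download is complete and unaltered. A minimal sketch using Python's standard `hashlib` module follows. The function names and the local path in the usage comment are hypothetical, since the file names are not shown in the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage, with a hypothetical local path to the larger downloaded file:
# matches_checksum("downloads/dataset_file",
#                  "md5:1d81144e92b61960a4f31232190279e4")
```

A `False` result most likely means the download was interrupted or the wrong file was checked.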
Additional details
Related works
- Is documented by
- Thesis: 10.5281/zenodo.4022349 (DOI)