Published February 21, 2018 | Version v1
Dataset Open

The dative dataset of World Englishes

  • 1. KU Leuven

Description

This dataset is distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. It may be used for research purposes only.

The dataset contains 13,171 variable double-object and prepositional datives extracted from the International Corpus of English (ICE) series and the Corpus of Global Web-based English (GloWbE), sampled from nine national varieties of English:

  • British English
  • Canadian English
  • New Zealand English
  • Irish English
  • Hong Kong English
  • Philippine English
  • Singapore English
  • Indian English
  • Jamaican English

 

The dataframe contains the following columns:

1 TokenID: Unique identifier for the individual token

2 Variety: The variety from which the token is taken

3 Nativity: Native or non-native variety of English (L1 vs. L2)

4 Corpus: The corpus from which the token stems

5 Subcorpus: Combination of Corpus and Variety

6 FileID: ID of the corpus file in which the token was found. Format: VARIETY:FILENAME

7 TextID: ID of the corpus text in which the token was found. Individual files in ICE can have multiple texts. Format: VARIETY:FILENAME:TEXTNUMBER

8 LineID: ID of the line in the text in which the token sentence was found. Format: VARIETY:FILENAME:TEXTNUMBER:LINENUMBER

9 SpeakerID: ID of the speaker of the sentence. Speakers in spoken texts are indicated with capital letters. Authors of written texts have ID ‘A’. Format: VARIETY:FILENAME:TEXTNUMBER:SPEAKERID

10 UnitMarker: Unit marker of the utterance in the text. Format: UTTERANCENUMBER:TEXTNUMBER:SPEAKERID
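
The colon-separated ID formats in columns 6 to 9 nest inside one another, so a LineID can be split back into its parts and the shorter IDs rebuilt from it. The following is a minimal sketch of that idea, not code shipped with the dataset: the names LineRef, parse_line_id, file_id and text_id are hypothetical, the sample value is invented for illustration, and the sketch assumes that no component itself contains a colon.

```python
from typing import NamedTuple


class LineRef(NamedTuple):
    """Components of a LineID (format VARIETY:FILENAME:TEXTNUMBER:LINENUMBER)."""
    variety: str
    filename: str
    text_number: str
    line_number: str


def parse_line_id(line_id: str) -> LineRef:
    """Split a LineID into its four colon-separated parts.

    Assumption: no individual part contains a colon.
    """
    parts = line_id.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}: {line_id!r}")
    return LineRef(*parts)


def file_id(ref: LineRef) -> str:
    """Rebuild the FileID (VARIETY:FILENAME)."""
    return f"{ref.variety}:{ref.filename}"


def text_id(ref: LineRef) -> str:
    """Rebuild the TextID (VARIETY:FILENAME:TEXTNUMBER)."""
    return f"{ref.variety}:{ref.filename}:{ref.text_number}"


# Invented example value; real IDs in the dataset may look different.
example = parse_line_id("GB:S1A-001:1:42")
```

Keeping the parts as strings avoids guessing whether text and line numbers are always purely numeric.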

11 GenreFine: 14-level distinction: The 12-level ICE sub-register in which the token was found and the two levels in GloWbE (blog vs. general). Levels: See ICE documentation

12 GenreCoarse: 5-level distinction: The 4-level ICE register in which the token was found and GloWbE (online = 1 level). Levels: See ICE documentation

13 Mode: The mode (‘spoken’, ‘written’) of the token.

14 Register: The 4-level Register along two axes – spoken vs. written / informal vs. formal

15 PriorContextPlain: The plain text version of the 100 words preceding the dative token.

16 PriorContextTag: The POS-tagged version of the 100 words preceding the dative token.

17 SentencePlain: The plain text version of the sentence containing the dative token.

18 SentenceTag: The POS-tagged version of the sentence containing the dative token.

19 WholeConstructionPlain: The plain text version of the VP containing the dative token (i.e. verb + object + object).

20 WholeConstructionTag: The POS-tagged version of the VP containing the dative token.

21 Verb: The lemma of the verbal head (‘give’ in ‘gave it some thought’)

22 VerbForm: The verb form of the verbal head (‘gave’ in ‘gave it some thought’)

23 RecipientShort: The short plain text version of the recipient without hesitations or repetitions

24 ThemeShort: The short plain text version of the theme without hesitations or repetitions

25 RecipientLong: The long plain text version of the recipient with hesitations or repetitions

26 ThemeLong: The long plain text version of the theme with hesitations or repetitions

27 RecHeadPlain: The plain text version of the recipient head

28 RecHeadTag: The POS-tagged version of the recipient head

29 RecHeadLemma: The lemma of the recipient head

30 ThemeHeadPlain: The plain text version of the theme head

31 ThemeHeadTag: The POS-tagged version of the theme head

32 ThemeHeadLemma: The lemma of the theme head

33 VerbThemeLemma: Combination of the verb lemma and the theme head. Format: VERB_THEME

34 VerbSense: Semantics of the verb based on the whole construction combined with the verb lemma. Format: VERB.VERBSEMANTICS

35 VerbSemantics: 5-level distinction of verb semantics (‘a’, ‘t’, ‘p’, ‘f’, ‘c’).

36 Resp: The variant order. Levels: ‘do’ (=ditransitive), ‘pd’ (=prepositional)

37 RecAnimacy: 6-level distinction of recipient animacy following previous research: human (a1) > animal (a2) > collective (c) > locative (l) > temporal (t) > inanimate (i)

38 ThemeAnimacy: 6-level distinction of theme animacy following previous research: human (a1) > animal (a2) > collective (c) > locative (l) > temporal (t) > inanimate (i)

39 RecWordLth: Length of recipient NP in words

40 RecLetterLth: Length of recipient NP in orthographic characters

41 ThemeWordLth: Length of theme NP in words

42 ThemeLetterLth: Length of theme NP in orthographic characters

43 RecComplexity: 15-level distinction of recipient complexity indicating type and number of posthead dependents, restricted to the ICE components. (The GloWbE components make only a simplified distinction between ‘simple’ and ‘complex’.) Levels: ‘s’ = simple (no postmodifications), ‘co’ = coordinated, ‘ge’ = general extender, ‘gn’ = genitive, ‘postad’ = postmodifying adverbial/adjective, ‘pp’ = modifying prepositional phrase, ‘appnom’ = nominal apposition, ‘rc’ = relative clause, ‘cp’ = complement clause, ‘advc’ = adverbial clause, ‘nonfin’ = nonfinite clause, ‘tpp’ = two nominal posthead dependents, ‘tvp’ = two posthead dependents involving at least one VP, ‘mpp’ = more than two nominal posthead dependents, ‘mvp’ = more than two posthead dependents involving at least one VP

44 ThemeComplexity: 15-level distinction of theme complexity indicating type and number of posthead dependents, restricted to the ICE components. (The GloWbE components make only a simplified distinction between ‘simple’ and ‘complex’.) Levels: ‘s’ = simple (no postmodifications), ‘co’ = coordinated, ‘ge’ = general extender, ‘gn’ = genitive, ‘postad’ = postmodifying adverbial/adjective, ‘pp’ = modifying prepositional phrase, ‘appnom’ = nominal apposition, ‘rc’ = relative clause, ‘cp’ = complement clause, ‘advc’ = adverbial clause, ‘nonfin’ = nonfinite clause, ‘tpp’ = two nominal posthead dependents, ‘tvp’ = two posthead dependents involving at least one VP, ‘mpp’ = more than two nominal posthead dependents, ‘mvp’ = more than two posthead dependents involving at least one VP

45 RecNPExprType: Syntactic category of the recipient NP. Levels: ‘dem’ = bare demonstrative; ‘nc’ = common noun; ‘np’ = proper noun; ‘pprn’ = personal pronoun; ‘iprn’ = impersonal pronoun; ‘rprn’ = reflexive pronoun; ‘vp’ = gerund (-ing) NP; ‘wh’ = NP headed by wh- word

46 ThemeNPExprType: Syntactic category of the theme NP. Levels: ‘dem’ = bare demonstrative; ‘nc’ = common noun; ‘np’ = proper noun; ‘pprn’ = personal pronoun; ‘iprn’ = impersonal pronoun; ‘rprn’ = reflexive pronoun; ‘vp’ = gerund (-ing) NP; ‘wh’ = NP headed by wh- word

47 RecGivenness: Givenness of the recipient NP. Levels: ‘given’, ‘new’

48 ThemeGivenness: Givenness of the theme NP. Levels: ‘given’, ‘new’

49 RecDefiniteness: Definiteness of the recipient NP. Levels: ‘def’, ‘indef’

50 ThemeDefiniteness: Definiteness of the theme NP. Levels: ‘def’, ‘indef’

51 RecBinComplexity: Binary predictor of recipient complexity indicating whether postmodifications follow the head noun. Levels: ‘simple’, ‘complex’

52 ThemeBinComplexity: Binary predictor of theme complexity indicating whether postmodifications follow the head noun. Levels: ‘simple’, ‘complex’

53 RecPerson: Person of recipient. Levels: ‘local’, ‘non-local’

54 ThemeConcreteness: Concreteness of theme based on verb semantics. Levels: ‘concrete’, ‘non-concrete’

55 TypeTokenRatio: Type-token ratio of the 100-word context surrounding the token

56 RecHeadFreq: Frequency of recipient head lemma in GloWbE

57 ThemeHeadFreq: Frequency of theme head lemma in GloWbE

58 RecThematicity: Normalized frequency of recipient head lemma in its text (per 2000 words)

59 ThemeThematicity: Normalized frequency of theme head lemma in its text (per 2000 words)

60 PrimeType: The response type of the preceding dative token, if any. Levels: ‘do’, ‘pd’, ‘NA’

61 Persistence: Indicates whether the preceding dative token, if any, has the same variant as the current token. Levels: ‘none’, ‘yes’, ‘no’

62 SameUtterance: Indicates whether the preceding dative token occurred in the same utterance or not. Necessary for manual coding of persistence.

63 DistanceToPrevious: Number of utterances between current and preceding dative token. ‘None’ if no preceding dative token.

64 RecPron: Binary factor of recipient pronominality. Levels: ‘pron’, ‘non-pron’

65 ThemePron: Binary factor of theme pronominality. Levels: ‘pron’, ‘non-pron’

66 RecBinAnimacy: Binary factor of recipient animacy. Levels of RecAnimacy conflated to: ‘animate’, ‘inanimate’

67 ThemeBinAnimacy: Binary factor of theme animacy. Levels of ThemeAnimacy conflated to: ‘animate’, ‘inanimate’

68 logRecLetterLth: Natural logarithm of recipient length in orthographic characters

69 logThemeLetterLth: Natural logarithm of theme length in orthographic characters

70 WeightRatio: Ratio of object lengths: Recipient length in characters divided by theme length in characters

71 logWeightRatio: Natural logarithm of weight ratio
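
Columns 68 to 71 are all derived from the two character-length columns (40 and 42). As a rough sketch of how they relate, and not the authors' own code, the function below recomputes them from a pair of lengths. The function name weight_columns and the example lengths are hypothetical; only the column names and the definitions (natural logarithm, recipient length divided by theme length) come from the descriptions above.

```python
import math


def weight_columns(rec_letter_lth: int, theme_letter_lth: int) -> dict:
    """Recompute columns 68-71 from RecLetterLth and ThemeLetterLth."""
    if rec_letter_lth <= 0 or theme_letter_lth <= 0:
        raise ValueError("lengths must be positive for the logarithm to be defined")
    ratio = rec_letter_lth / theme_letter_lth  # recipient divided by theme
    return {
        "logRecLetterLth": math.log(rec_letter_lth),      # natural logarithm
        "logThemeLetterLth": math.log(theme_letter_lth),  # natural logarithm
        "WeightRatio": ratio,
        "logWeightRatio": math.log(ratio),
    }


# Illustrative lengths only, not taken from the dataset.
cols = weight_columns(2, 12)
```

Because the logarithm of a quotient is the difference of the logarithms, logWeightRatio equals logRecLetterLth minus logThemeLetterLth. Negative values therefore mark tokens whose recipient is shorter than the theme.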

72 PrimeTypePruned: The variant of the preceding dative token within the previous 10 utterances. Levels: ‘none’, ‘do’, ‘pd’

73 NumDistanceToPrevious: Numeric distance to previous token (for calculations in R)

74 PersistencePruned: Indicates whether the preceding token within the previous 10 utterances is the same as the current token. Levels: ‘none’, ‘yes’, ‘no’
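
The two pruned columns (72 and 74) restrict priming information to a window of 10 utterances. The sketch below shows one way such a pruning rule could be expressed. It is an illustration under stated assumptions, not the procedure actually used to build the dataset: the function names are hypothetical, a missing prime or distance is represented as Python None, and the window is assumed to include a distance of exactly 10.

```python
from typing import Optional

MAX_DISTANCE = 10  # window size named in the column descriptions


def prune_prime(prime_type: Optional[str], distance: Optional[float]) -> str:
    """Return 'do' or 'pd' if a prime lies within the window, else 'none'.

    Assumption: a distance of exactly MAX_DISTANCE still counts as inside.
    """
    if prime_type is None or distance is None or distance > MAX_DISTANCE:
        return "none"
    return prime_type


def prune_persistence(resp: str, prime_type: Optional[str],
                      distance: Optional[float]) -> str:
    """Return 'yes' if the in-window prime matches the current variant (Resp),
    'no' if it differs, and 'none' if there is no prime inside the window."""
    pruned = prune_prime(prime_type, distance)
    if pruned == "none":
        return "none"
    return "yes" if pruned == resp else "no"
```

For example, under these assumptions a ‘pd’ prime 11 utterances back is pruned to ‘none’, while a ‘do’ prime 3 utterances before a ‘do’ token yields ‘yes’.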

75-82 z.__________: Numeric predictor centered around the mean and scaled by two standard deviations.

83 Variety.Sum: Column used for sum coding in modeling process
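
The z.-prefixed columns (75-82) are described as centered around the mean and scaled by two standard deviations. A minimal sketch of that transformation follows. It assumes the sample standard deviation (n - 1 in the denominator), which the description does not specify, and the function name z_two_sd and the example values are hypothetical.

```python
from statistics import mean, stdev


def z_two_sd(values):
    """Center values on their mean and divide by two standard deviations.

    Assumption: the sample standard deviation is used.
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - m) / (2 * s) for v in values]


# Illustrative values only, not taken from the dataset.
scaled = z_two_sd([1.0, 2.0, 3.0, 4.0, 5.0])
```

After this transformation a column has mean 0 and standard deviation 0.5, which is commonly done so that coefficients of numeric predictors are on a scale roughly comparable to those of binary predictors.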

Notes

Research Foundation Flanders (Belgium) grant no. G.0C59.13N

Files

license_CC-BY-NC_for research only.txt

Files (36.9 MB)

Name Size
md5:102cd102cbc948f4342b393ce2781c51 19.4 kB
md5:1d81144e92b61960a4f31232190279e4 36.9 MB

Additional details

Related works

Is documented by
Thesis: 10.5281/zenodo.4022349 (DOI)