Overview

collogetr currently has one function (viz. colloc_leipzig()) to retrieve window-span collocates for a set of word forms (viz. the node word) from the (Indonesian) Leipzig Corpora. There are two functions to process the output of colloc_leipzig() into tabular formats that serve as input for measuring the association between the collocates and the node, as in Stefanowitsch and Gries’ (2003) collostructional/collocational analysis (see also Gries, 2015; Stefanowitsch, 2013; Stefanowitsch & Gries, 2009). These functions are assoc_prepare() and assoc_prepare_dca(). The former generates input for Simple Collexeme/Collocational Analysis, which is computed using collex_fye(), while the latter uses the output of assoc_prepare() to generate input for Distinctive Collexeme/Collocate Analysis (Gries & Stefanowitsch, 2004; Hilpert, 2006), which is computed using collex_fye_dca(). collogetr is built on top of the core packages in the tidyverse.

Installation

Install collogetr from GitHub with devtools:

library(devtools)
install_github("gederajeg/collogetr")

Usage

Load collogetr

library(collogetr)

Package data

The package has three data sets for demonstration. The most important one is demo_corpus_leipzig, whose documentation can be accessed via ?demo_corpus_leipzig. Another is stopwords, a list of Indonesian stopwords that can be filtered out when performing the collocational measure. The last one is leipzig_corpus_path, a character vector of full paths to the Leipzig Corpus files on my computer.

Accepted inputs

colloc_leipzig() accepts two types of corpus-input data:

  1. A named list whose elements are character vectors of the sentences in each Leipzig Corpus file, represented by demo_corpus_leipzig; its format is shown below.
  2. Full paths to the Leipzig Corpus plain-text files, as in leipzig_corpus_path.
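
A quick way to peek at the structure of the first option is shown below; this is plain base R on the bundled data, so the exact corpus names and sentences shown will depend on the files included in demo_corpus_leipzig.

library(collogetr)

# demo_corpus_leipzig is a named list; each element is a character vector
# of sentences from one Leipzig corpus file
names(demo_corpus_leipzig)[1:3]   # names of the first three corpus files
head(demo_corpus_leipzig[[1]], 2) # first two sentences of the first file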

In terms of the input strings for the pattern argument, colloc_leipzig() accepts three scenarios:

  1. Plain string representing a whole word form, such as "memberikan" ‘to give’

  2. Regex of a whole word, such as "^memberikan$" ‘to give’

  3. Regex of a whole word with word boundary character (\\b), such as "\\bmemberikan\\b".

All three forms will be used to match the exact word form of the search pattern after the corpus file is tokenised into individual words. That is, input patterns following scenario 1 or 3 are turned into the exact search pattern of scenario 2 (i.e., with the beginning- and end-of-string anchors, hence "^...$"), so users can directly supply the input pattern in the form of scenario 2 for the pattern argument. If more than one word is to be searched, put them into a character vector (e.g., c("^memberi$", "^membawa$")).

Demo

Retrieving the collocates

The code below shows how one may retrieve the collocates for the Indonesian verb mengatakan ‘to say sth.’. The function colloc_leipzig() prints progress messages for each stage to the console. It generates warning(s) when a search pattern or node word is not found in a given corpus file or in all loaded corpus files.
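
A minimal call might look like the following sketch. The name of the corpus-input argument (here assumed to be leipzig_corpus_list) should be verified against ?colloc_leipzig; pattern, window, span, and save_interim are the arguments discussed below.

# a sketch; leipzig_corpus_list is an assumed argument name for the
# named-list corpus input -- check ?colloc_leipzig
out <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                      pattern = "mengatakan",
                      window = "r",
                      span = 1L,
                      save_interim = FALSE)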

In the example above, the collocates are restricted to those occurring one word (i.e. span = 1L) to the right (window = "r") of mengatakan ‘to say’. The "r" character in window stands for right-side collocates ("l" for left-side collocates and "b" for both right- and left-side collocates). The span argument requires an integer (i.e., a whole number) to indicate the range of words covered in the specified window. The pattern argument requires one or more exact word forms; if there is more than one, put them into a character vector (e.g., c("mengatakan", "menjanjikan")).

Setting save_interim to FALSE means that no output is saved to the computer; everything is returned to the console (i.e., stored in the out object). If save_interim = TRUE, the function saves the outputs to files on the computer. colloc_leipzig() specifies default file names for the outputs via these arguments: (i) freqlist_output_file, (ii) colloc_output_file, (iii) corpussize_output_file, and (iv) search_pattern_output_file. It is recommended that the output filenames be stored as a character vector. See Example “(2)” in the documentation of colloc_leipzig() for a call with save_interim = TRUE.
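
A sketch of such a call is given below; the output-file arguments are those listed above, the file names are purely illustrative, and the corpus-input argument name is again an assumption.

# a sketch with illustrative file names
out <- colloc_leipzig(leipzig_corpus_list = demo_corpus_leipzig,
                      pattern = "mengatakan",
                      window = "r",
                      span = 1L,
                      save_interim = TRUE,
                      freqlist_output_file = "out_1_freqlist.txt",
                      colloc_output_file = "out_2_collocates.txt",
                      corpussize_output_file = "out_3_corpus_size.txt",
                      search_pattern_output_file = "out_4_search_pattern.txt")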

Exploring the output of colloc_leipzig().

The output of colloc_leipzig() is a list of 4 elements:

  1. colloc_df; a table/tibble of raw collocates data with columns for:
    1. corpus names
    2. sentence id in which the collocates and the node word(s) are found
    3. the collocates (column w)
    4. the span information (e.g., "r1" for one-word, right-side collocates)
    5. the node word
    6. the text/sentence match in which the collocates and the node are found
  2. freqlist_df; a table/tibble of word-frequency list in the loaded corpus
  3. corpussize_df; a table/tibble of total word-tokens in the loaded corpus
  4. pattern; a character vector of the search pattern/node
str(out)
#> List of 4
#>  $ colloc_df    :Classes 'tbl_df', 'tbl' and 'data.frame':   151 obs. of  6 variables:
#>   ..$ corpus_names: chr [1:151] "ind_mixed_2012_1M" "ind_mixed_2012_1M" "ind_mixed_2012_1M" "ind_news_2008_300K" ...
#>   ..$ sent_id     : int [1:151] 185 191 215 1 93 96 122 130 136 158 ...
#>   ..$ w           : chr [1:151] "kalau" "ia" "bahwa" "rupiah" ...
#>   ..$ span        : chr [1:151] "r1" "r1" "r1" "r1" ...
#>   ..$ node        : chr [1:151] "mengatakan" "mengatakan" "mengatakan" "mengatakan" ...
#>   ..$ sent_match  : chr [1:151] "705166 Beberapa kawan mengatakan kalau voting dilakukan secara tertutup satu orang satu suara dan tidak ada kes"| __truncated__ "870266 Pak haji mengatakan, ia sebenarnya menginginkan seorang menantu yang bisa mengajarkan caranya menggunaka"| __truncated__ "256689 Catatan: sebelum bagian ini Edwin Louis Cole mengatakan bahwa Allah memberikan firman kepada Martin Luth"| __truncated__ "270199 Ia mengatakan, rupiah makin terpuruk sulit dipertahankan, karena faktor negatif internal sangat kuat men"| __truncated__ ...
#>  $ freqlist_df  :Classes 'tbl_df', 'tbl' and 'data.frame':   30093 obs. of  3 variables:
#>   ..$ corpus_names: chr [1:30093] "ind_mixed_2012_1M" "ind_mixed_2012_1M" "ind_mixed_2012_1M" "ind_mixed_2012_1M" ...
#>   ..$ w           : chr [1:30093] "yang" "dan" "di" "dengan" ...
#>   ..$ n           : int [1:30093] 128 93 59 53 50 45 37 32 31 28 ...
#>  $ corpussize_df:Classes 'tbl_df', 'tbl' and 'data.frame':   15 obs. of  2 variables:
#>   ..$ corpus_names: chr [1:15] "ind_mixed_2012_1M" "ind_news_2008_300K" "ind_news_2009_300K" "ind_news_2010_300K" ...
#>   ..$ size        : int [1:15] 3676 4663 4740 4904 4690 4881 4018 3854 3831 3827 ...
#>  $ pattern      : chr "mengatakan"

The freqlist_df and corpussize_df are important for computing the collocational strength between the search pattern and its collocates.

Preparing input data for Simple Collexeme/Collocational Analysis (SCA).

First we need to call assoc_prepare() to generate the input data for SCA. The demo illustrates this with the in-console output of colloc_leipzig(). See Example “2.2” in the documentation of assoc_prepare() for handling saved outputs (?assoc_prepare).
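
A minimal sketch of this step is shown below; the argument names (colloc_out for the colloc_leipzig() output and stopword_list for the stopwords) are assumptions to be checked against ?assoc_prepare.

# a sketch; argument names are assumptions -- see ?assoc_prepare
assoc_tb <- assoc_prepare(colloc_out = out,
                          stopword_list = collogetr::stopwords)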

Inspect the output of assoc_prepare():
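
For example, the first few rows can be shown with head():

head(assoc_tb)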

The assoc_prepare() and collex_fye() functions are designed following tidy-data principles so that the association/collocation measure is performed in a row-wise fashion, benefiting from the combination of a nested column for the input data (using tidyr::nest(); cf. Wickham & Grolemund, 2017, p. 409) and purrr’s map_* functions. assoc_prepare() also calculates the expected co-occurrence frequencies between the collocates/collexemes and the node word/construction.

The data column in assoc_tb above is a list of nested tibbles/tables. Each element contains the data required for performing the association measure for the corresponding collocate in column w (Gries, 2013, 2015; Stefanowitsch & Gries, 2003, 2009). This nested column can be inspected as follows (here for the first row, namely the word pihaknya ‘the party’).
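
One way to pull out that nested tibble is to subset the list-column directly; this sketch assumes, as described above, that the collocate column is named w and the nested column is named data.

# nested input data for the collocate "pihaknya" (the first row of assoc_tb)
pihaknya_data <- dplyr::filter(assoc_tb, w == "pihaknya")$data[[1]]
pihaknya_data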

Column a indicates the co-occurrence frequency between the node word and the collocate in column w, while a_exp indicates the expected co-occurrence frequency between them. n_w_in_corp represents the total token/occurrence frequency of a given collocate, and n_pattern stores the total token/occurrence frequency of the node word in the corpus. Columns b, c, and d are required for the association measure, which is essentially based on a 2-by-2 crosstabulation table. The assoc column indicates whether the value in a is higher than that in a_exp, thus indicating attraction or positive association between the node word and the collocate. The reverse, repulsion or negative association, holds when the value in a is lower than that in a_exp.
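
As an illustration of how such a 2-by-2 table and the expected frequency are typically derived, the sketch below uses made-up frequencies and the standard contingency-table formula; it mirrors the column names but is not necessarily the exact internal code of assoc_prepare().

# illustrative (made-up) frequencies for a single collocate
a           <- 14     # collocate co-occurring with the node (column a)
n_w_in_corp <- 210    # total frequency of the collocate in the corpus
n_pattern   <- 151    # total frequency of the node word in the corpus
corpus_size <- 63000  # total word tokens in the corpus

b <- n_pattern - a               # node word with other collocates
c <- n_w_in_corp - a             # collocate with other node words
d <- corpus_size - (a + b + c)   # all remaining words

a_exp <- (n_w_in_corp * n_pattern) / corpus_size  # expected co-occurrence
a > a_exp  # TRUE suggests attraction; FALSE suggests repulsion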

Simple Collexeme/Collocates Analysis (SCA)

As in Collostructional Analysis (Stefanowitsch & Gries, 2003), collex_fye() uses the one-tailed Fisher-Yates Exact test whose p-value is log-transformed to the base of 10 to indicate the collostruction strength between the collocates and the node word (Gries, Hampe, & Schönefeld, 2005). collex_fye() simultaneously performs two uni-directional measures of Delta P (Gries, 2013, 2015, p. 524). One of these shows the extent to which the presence of the node word cues the presence of the collocates/collexemes; the other determines the extent to which the collocates/collexemes cue the presence of the node word.
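
For a single collocate, the underlying computation can be illustrated in base R as follows. This is a sketch of the general procedure, not collex_fye()’s internal code, and it reuses the illustrative a, b, c, and d values from the block above.

# 2-by-2 crosstabulation: rows = node vs. other words,
# columns = collocate vs. other words
crosstab <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)

# one-tailed Fisher-Yates Exact test; collostruction strength is the
# negative base-10 logarithm of the p-value
p_fye   <- fisher.test(crosstab, alternative = "greater")$p.value
collstr <- -log10(p_fye)

# the two uni-directional Delta P measures
dp_node_to_collocate <- a / (a + b) - c / (c + d)  # node word cues collocate
dp_collocate_to_node <- a / (a + c) - b / (b + d)  # collocate cues node word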

Here is the code to perform the SCA using collex_fye():
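
A minimal sketch, assuming collex_fye() takes the assoc_prepare() output as its first argument; check ?collex_fye for the full set of arguments.

# a sketch; see ?collex_fye for further arguments
am_fye <- collex_fye(assoc_tb)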

Now we can retrieve the top-10 collocates most strongly attracted to mengatakan ‘to say sth.’. The association strength is shown in the collstr column, which stands for collostruction strength. The higher the value, the stronger the association.
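
One way to do this with dplyr, assuming the am_fye object from the sketch above:

library(dplyr)

# ten collocates with the highest collostruction strength
am_fye %>%
  arrange(desc(collstr)) %>%
  head(10)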

Column a contains the co-occurrence frequency of the collocates (w) with the node as its R1 collocates in the demo corpus. p_fye shows the one-tailed p-value of the Fisher-Yates Exact test.

Distinctive Collexeme/Collocate Analysis (DCA)

The idea of distinctive collexemes/collocates is to contrast two functionally/semantically similar constructions or words in terms of the collocates that are (significantly) more frequent with one of the two contrasted constructions/words (see Gries & Stefanowitsch, 2004; Hilpert, 2006). colloc_leipzig() can be used to retrieve the collocates of two functionally/semantically similar words by supplying a character vector of the two words to the pattern argument.

The following example uses one of the Leipzig corpus files (not included in the package but freely downloadable from the Leipzig Corpora webpage), namely "ind_mixed_2012_1M-sentences". The aim is to contrast the collocational preferences of two deadjectival transitive verbs based on the root kuat ‘strong’, framed within two causative morphological schemas: one with per-+ADJ and the other with ADJ+-kan. Theoretically, the per- schema indicates that the direct object of the verb is caused to have more of the characteristic indicated by the adjectival root, while the -kan schema indicates that the direct object is caused to have the characteristic indicated by the root (which it did not previously have). The focus here is on the R1 collocates of the verbs (i.e. one word immediately to the right of the verbs in the sentences).
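
A sketch of such a call is given below. The path argument name (leipzig_path), the placeholder file path, and the exact verb forms (memperkuat for the per- schema alongside menguatkan) are assumptions to be checked against ?colloc_leipzig and the corpus.

# a sketch; leipzig_path is an assumed argument name and the file path
# is a placeholder for the downloaded "ind_mixed_2012_1M-sentences" file
out_dca <- colloc_leipzig(leipzig_path = "/path/to/ind_mixed_2012_1M-sentences.txt",
                          pattern = c("memperkuat", "menguatkan"),
                          window = "r",
                          span = 1L,
                          save_interim = FALSE)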

Then we prepare the output into the format required by collex_fye_dca() for performing DCA.
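
A sketch of this step, following the pipeline described in the Overview (assoc_prepare() first, then assoc_prepare_dca()); the argument names are again assumptions.

# a sketch; argument names are assumptions -- see the package documentation
assoc_tb_dca <- assoc_prepare(colloc_out = out_dca,
                              stopword_list = collogetr::stopwords)
dca_tb <- assoc_prepare_dca(assoc_tb_dca)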

Compute the DCA for the two verbs and view a snippet of the results.
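
A minimal sketch, assuming the dca_tb object prepared above:

dca_res <- collex_fye_dca(dca_tb)
head(dca_res)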

The package also includes a function called dca_top_collex() to retrieve the top-n distinctive collocates for one of the two contrasted words. The dist_for argument can be given either the name of the contrasted word as a character vector, or the character ID of the construction/word: (i) dist_for = "a" (or "A") for the construction/word appearing in the second column of the output of collex_fye_dca(), or (ii) dist_for = "b" (or "B") for the construction/word appearing in the third column.

The code below retrieves the distinctive collocates for menguatkan ‘to strengthen’ (i.e., Construction B).
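
A sketch of the call; dist_for = "b" follows the description above, while the top_n argument name is an assumption to be checked against ?dca_top_collex.

# "b" refers to the construction/word in the third column of dca_res
# (here menguatkan); top_n is an assumed argument name
dca_top_collex(dca_res, dist_for = "b", top_n = 10)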

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS  10.14.3              
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2019-03-17                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.4.0)
#>  backports     1.1.2   2017-12-13 [1] CRAN (R 3.5.0)
#>  callr         3.1.1   2018-12-21 [1] CRAN (R 3.5.0)
#>  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
#>  collogetr   * 1.1.3   2019-03-17 [1] local         
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.4.1)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.15  2018-01-28 [1] CRAN (R 3.5.0)
#>  dplyr         0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#>  evaluate      0.11    2018-07-17 [1] CRAN (R 3.5.0)
#>  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.0)
#>  fs            1.2.3   2018-06-08 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  hms           0.4.2   2018-03-10 [1] CRAN (R 3.4.4)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.4.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.4.0)
#>  pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.1   2018-12-05 [1] CRAN (R 3.5.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr         0.3.0   2019-01-27 [1] CRAN (R 3.5.2)
#>  R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  readr         1.3.1   2018-12-21 [1] CRAN (R 3.5.0)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.4.3)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
#>  testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.0)
#>  tibble        2.0.1   2019-01-12 [1] CRAN (R 3.5.2)
#>  tidyr         0.8.3   2019-03-01 [1] CRAN (R 3.5.2)
#>  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.4.4)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#> 
#> [1] /Users/Primahadi/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

References

Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–166. doi:[10.1075/ijcl.18.1.09gri](https://doi.org/10.1075/ijcl.18.1.09gri)

Gries, S. T. (2015). More (old and new) misunderstandings of collostructional analysis: On Schmid and Küchenhoff (2013). Cognitive Linguistics, 26(3), 505–536. doi:[10.1515/cog-2014-0092](https://doi.org/10.1515/cog-2014-0092)

Gries, S. T., & Stefanowitsch, A. (2004). Extending collostructional analysis: A corpus-based perspective on ’alternations’. International Journal of Corpus Linguistics, 9(1), 97–129.

Gries, S. T., Hampe, B., & Schönefeld, D. (2005). Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics, 16(4), 635–676.

Hilpert, M. (2006). Distinctive collexeme analysis and diachrony. Corpus Linguistics and Linguistic Theory, 2(2), 243–256.

Stefanowitsch, A. (2013). Collostructional analysis. In T. Hoffmann & G. Trousdale (Eds.), The Oxford handbook of Construction Grammar (pp. 290–306). Oxford: Oxford University Press. doi:[10.1093/oxfordhb/9780195396683.013.0016](https://doi.org/10.1093/oxfordhb/9780195396683.013.0016)

Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.

Stefanowitsch, A., & Gries, S. T. (2009). Corpora and grammar. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 933–951). Berlin: Mouton de Gruyter.

Wickham, H., & Grolemund, G. (2017). R for Data Science. Canada: O’Reilly. Retrieved from http://r4ds.had.co.nz/