This function groups the elements of a string vector (character or string variable) according to their pairwise distance ('similarity'). The more similar two string elements are, the more likely they are to be combined into one group.
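To make the notion of string distance concrete, here is a minimal sketch that uses base R's `adist()` (generalized Levenshtein distance) on three of the example strings. It assumes that `method = "lv"` refers to Levenshtein distance; the threshold of 2 mirrors the default `maxdist`. It only lists close pairs and is not the grouping algorithm of `group_str()` itself.

```r
# Illustration only: pairwise edit distances with base R (utils::adist).
# Assumption: the "lv" method corresponds to Levenshtein distance.
words <- c("Hello", "Helo", "Hole")

d <- adist(words)                  # 3 x 3 matrix of edit distances
dimnames(d) <- list(words, words)  # label rows and columns for readability
print(d)

# Pairs whose distance does not exceed the threshold (upper triangle only,
# so each pair is listed once and the diagonal is skipped)
threshold <- 2
close_pairs <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)

pairs <- data.frame(
  a    = words[close_pairs[, "row"]],
  b    = words[close_pairs[, "col"]],
  dist = d[close_pairs]
)
print(pairs)
```

Here "Hello"/"Helo" are 1 edit apart and "Helo"/"Hole" are 2 edits apart, while "Hello"/"Hole" are 3 apart. In the documented example output, "Hole" nevertheless stays in its own group at the default settings, so the function evidently applies more than a plain pairwise threshold.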
group_str(strings, maxdist = 2, method = "lv", strict = FALSE, trim.whitespace = TRUE, remove.empty = TRUE, verbose = FALSE)
Argument | Description |
---|---|
strings | Character vector with string elements. |
maxdist | Maximum distance allowed between two string elements for them to be treated as similar or equal. |
method | Method for distance calculation. The default is |
strict | Logical; if |
trim.whitespace | Logical; if |
remove.empty | Logical; if |
verbose | Logical; if |
A character vector where similar string elements (values) are recoded into a new, single value. The return value has the same length as strings, i.e. grouped elements appear multiple times, so the count for each grouped string is still available (see 'Examples').
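Because the result has the same length as the input, each original element can be paired with its group label by position. The sketch below does not call `group_str()`; the `newstring` values are copied from the documented example output, and everything else is base R.

```r
# Input vector from the documented example
oldstring <- c("Hello", "Helo", "Hole", "Apple", "Ape",
               "New", "Old", "System", "Systemic")

# Grouped result, copied from the documented output of group_str(oldstring)
newstring <- c("Hello, Helo", "Hello, Helo", "Hole", "Ape, Apple",
               "Ape, Apple", "New", "Old", "System, Systemic",
               "System, Systemic")

# Same length, so positions line up one-to-one
stopifnot(length(newstring) == length(oldstring))

# Which original strings ended up in which group
groups <- split(oldstring, newstring)
print(groups)

# Group sizes, i.e. the counts that table() reports
print(lengths(groups))
```

This positional correspondence is what allows the grouped vector to be stored as a new column next to the original variable in a data frame.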
oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)

# see result
newstring
#> [1] "Hello, Helo"      "Hello, Helo"      "Hole"             "Ape, Apple"
#> [5] "Ape, Apple"       "New"              "Old"              "System, Systemic"
#> [9] "System, Systemic"

# count for each group
table(newstring)
#> newstring
#>       Ape, Apple      Hello, Helo             Hole              New
#>                2                2                1                1
#>              Old System, Systemic
#>                1                2

# print table to compare original and grouped string
frq(oldstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=5.00  sd=2.74
#>
#>       val frq raw.prc valid.prc cum.prc
#>       Ape   1   11.11     11.11   11.11
#>     Apple   1   11.11     11.11   22.22
#>     Hello   1   11.11     11.11   33.33
#>      Helo   1   11.11     11.11   44.44
#>      Hole   1   11.11     11.11   55.56
#>       New   1   11.11     11.11   66.67
#>       Old   1   11.11     11.11   77.78
#>    System   1   11.11     11.11   88.89
#>  Systemic   1   11.11     11.11  100.00
#>      <NA>   0    0.00        NA      NA

frq(newstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=3.33  sd=2.00
#>
#>               val frq raw.prc valid.prc cum.prc
#>        Ape, Apple   2   22.22     22.22   22.22
#>       Hello, Helo   2   22.22     22.22   44.44
#>              Hole   1   11.11     11.11   55.56
#>               New   1   11.11     11.11   66.67
#>               Old   1   11.11     11.11   77.78
#>  System, Systemic   2   22.22     22.22  100.00
#>              <NA>   0    0.00        NA      NA

# larger groups
newstring <- group_str(oldstring, maxdist = 3)
frq(oldstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=5.00  sd=2.74
#>
#>       val frq raw.prc valid.prc cum.prc
#>       Ape   1   11.11     11.11   11.11
#>     Apple   1   11.11     11.11   22.22
#>     Hello   1   11.11     11.11   33.33
#>      Helo   1   11.11     11.11   44.44
#>      Hole   1   11.11     11.11   55.56
#>       New   1   11.11     11.11   66.67
#>       Old   1   11.11     11.11   77.78
#>    System   1   11.11     11.11   88.89
#>  Systemic   1   11.11     11.11  100.00
#>      <NA>   0    0.00        NA      NA

frq(newstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=2.44  sd=1.13
#>
#>                val frq raw.prc valid.prc cum.prc
#>         Ape, Apple   2   22.22     22.22   22.22
#>  Hello, Helo, Hole   3   33.33     33.33   55.56
#>           New, Old   2   22.22     22.22   77.78
#>   System, Systemic   2   22.22     22.22  100.00
#>               <NA>   0    0.00        NA      NA

# be more strict with matching pairs
newstring <- group_str(oldstring, maxdist = 3, strict = TRUE)
frq(oldstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=5.00  sd=2.74
#>
#>       val frq raw.prc valid.prc cum.prc
#>       Ape   1   11.11     11.11   11.11
#>     Apple   1   11.11     11.11   22.22
#>     Hello   1   11.11     11.11   33.33
#>      Helo   1   11.11     11.11   44.44
#>      Hole   1   11.11     11.11   55.56
#>       New   1   11.11     11.11   66.67
#>       Old   1   11.11     11.11   77.78
#>    System   1   11.11     11.11   88.89
#>  Systemic   1   11.11     11.11  100.00
#>      <NA>   0    0.00        NA      NA

frq(newstring)
#>
#> # x <character>
#> # total N=9  valid N=9  mean=2.89  sd=1.54
#>
#>               val frq raw.prc valid.prc cum.prc
#>        Ape, Apple   2   22.22     22.22   22.22
#>       Hello, Helo   2   22.22     22.22   44.44
#>         Hole, Old   2   22.22     22.22   66.67
#>               New   1   11.11     11.11   77.78
#>  System, Systemic   2   22.22     22.22  100.00
#>              <NA>   0    0.00        NA      NA