A new release of the stringi package is available on CRAN
(please wait a few days for Windows and OS X binary builds).
stringi is a package providing (but definitely not limited to) equivalents
of nearly all the character string processing functions known
from base R. While developing the package, we kept
high performance and portability of its facilities in mind.
stringi's user interface is inspired by
and consistent with that of Hadley Wickham's great stringr package.
Quoting its README, stringr (and “hence” stringi):
- Processes factors and characters in the same way,
- Gives functions consistent names and arguments,
- Simplifies string operations by eliminating options that you don't need 95% of the time,
- Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs,
- Completes R's string handling functions with useful functions from other programming languages.
While base R as well as stringr
functions are great for simple
text processing tasks, dealing with more complex ones
(such as natural language processing) may be a bit problematic.
First of all, some time ago we mentioned in
our blog post
that regex search may provide different outputs on different platforms.
For example, Polish letters such as ą, ę, ś etc. are correctly
captured with [[:alpha:]]
by the (default) ERE engine on Linux (native encoding=UTF-8),
while on Windows the results are quite surprising.
(A year ago my students initially got
very bad marks for a Polish text processing
task just because they had written their R scripts on Windows while I
ran them on Linux.) :-)
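As a small illustration of the portability point, here is a sketch that extracts words made of Polish letters with stringi's ICU-based regex engine. The sample sentence and the use of the Unicode property `\p{L}` (any letter) are choices made for this example, not taken from the original post; the string is written with `\u` escapes so that the script does not depend on the encoding of the source file.

```r
library(stringi)

# "zażółć gęślą jaźń" written with Unicode escapes (illustrative sample text)
x <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# \p{L} matches any Unicode letter, regardless of the native encoding
words <- stri_extract_all_regex(x, "\\p{L}+")[[1]]
words
stri_length(words)  # number of code points in each extracted word
```

Since the matching is done by ICU rather than by the system's regex library, the same three words should be returned on Linux, Windows, and OS X.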
Secondly, natural language processing relies on a set of very complex,
locale-specific rules. However, the rules available
(via e.g. glibc)
in base R string functions may sometimes give incorrect results.
For example, when we convert the German ß
(eszett/double small s)
character to upper case, we would expect SS
in the result rather than:
toupper("groß") # GROSS? No...
## [1] "GROß"
Moreover, let's assume that we are asked to sort a character
vector according to the rules specific to the Slovak language.
Here, quite interestingly, the word hladný
(hungry)
can be found in a dictionary before the word chladný
(cold).
Of course, as not everyone works in a Slovak locale,
we don't expect to obtain a proper order immediately:
sort(c("hladný", "chladný"))
## [1] "chladný" "hladný"
In order to obtain a proper order, we should temporarily switch to a Slovak “environment”:
oldlocale <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "sk_SK")
## [1] "sk_SK"
sort(c("hladný", "chladný"))
## [1] "hladný" "chladný"
Sys.setlocale("LC_COLLATE", oldlocale)
## [1] "pl_PL.UTF-8"
This code works on my Linux, but is not portable. It's because:
Slovak_Slovakia.1250
is appropriate. And so on.
stringi facilities
In order to overcome such problems, we decided to reimplement each string processing function from scratch (purely in C++, of course). Internationalization and globalization support, as well as many string processing facilities (like regex searching), is provided by IBM's well-known and established ICU4C library (refer to ICU's website for more details).
Here is a very general list of the most important
features available in the current version of stringi
:
and many more.
Here's a bunch of examples.
library(stringi)
NA handling:
stri_length(c("aaa", NA, ""))
## [1] 3 NA 0
stri_replace_all_fixed(c("aba", "bab"), c("a", "b"), c("c", "d")) # 1-1-1 and 2-2-2
## [1] "cbc" "dad"
stri_replace_all_fixed(c("aba", "bab"), "a", "c") # 1-1-1 and 2-1-1
## [1] "cbc" "bcb"
stri_replace_all_fixed("aba", c("a", "b"), "c") # 1-1-1 and 1-2-1
## [1] "cbc" "aca"
stri_replace_all_fixed("aba", "a", c("c", "d")) # 1-1-1 and 1-1-2
## [1] "cbc" "dbd"
(all the functions are vectorized w.r.t most of their arguments)
stri_sort(c("hladný", "chladný"), opts = stri_opts_collator(locale = "sk_SK"))
## [1] "hladný" "chladný"
stri_trans_toupper("Groß")
## [1] "GROSS"
In our upcoming blog posts we will present some exciting
features of stringi
.
They are definitely worth discussing separately! Stay tuned.
And some benchmarks.
set.seed(123L)
library(microbenchmark)
x <- stri_rand_strings(1e+05, 10) # 100000 random ASCII 'words' of length 10 each
head(x, 5)
## [1] "HmPsw2WtYS" "xSgZ6tF2Kx" "tgdzehXaH9" "xtgn1TlDJE" "8PPM98ESGr"
microbenchmark(sort(x), stri_sort(x))
## Unit: milliseconds
## expr min lq median uq max neval
## sort(x) 1050.4 1062.8 1076.1 1110 1176.6 100
## stri_sort(x) 234.2 239.7 243.5 250 303.7 100
microbenchmark(paste(x, collapse = ", "), stri_paste(x, collapse = ", "))
## Unit: milliseconds
## expr min lq median uq max neval
## paste(x, collapse = ", ") 45.21 45.70 46.64 53.15 244.28 100
## stri_paste(x, collapse = ", ") 10.14 10.44 10.70 16.36 18.71 100
set.seed(123L)
y <- stri_rand_strings(10000, 10, "[ACGT]") # 10000 random 'genomes' of length 10
head(y, 5)
## [1] "CTCTTAGTGC" "TCGGATAACT" "TGGTGGGGCA" "TTGTACTACA" "ACCCAAACCT"
microbenchmark(grepl("ACCA", y), grepl("ACCA", y, fixed = TRUE), grepl("ACCA",
y, perl = TRUE), stri_detect_fixed(y, "ACCA"), stri_detect_regex(y, "ACCA"))
## Unit: microseconds
## expr min lq median uq max neval
## grepl("ACCA", y) 4928.0 4968.9 4987.0 5008.9 12723.2 100
## grepl("ACCA", y, fixed = TRUE) 899.0 906.9 912.0 919.2 2441.2 100
## grepl("ACCA", y, perl = TRUE) 2145.7 2155.5 2162.8 2174.6 9707.1 100
## stri_detect_fixed(y, "ACCA") 514.9 523.0 532.2 558.6 893.4 100
## stri_detect_regex(y, "ACCA") 3720.2 3750.8 3805.6 3891.6 7411.8 100
microbenchmark(substr(y, 2, 4), stri_sub(y, 2, 4))
## Unit: microseconds
## expr min lq median uq max neval
## substr(y, 2, 4) 908.8 915.4 920.3 945.4 3640 100
## stri_sub(y, 2, 4) 924.4 945.4 955.4 1007.5 2476 100
As a rule of thumb: stringi
functions should often be
faster than the R ones for long ASCII and UTF-8 strings.
They often have poorer performance for short 8-bit encoded ones.
For more information check out the stringi
package website
and its on-line documentation.
For bug reports and feature requests visit our GitHub profile.
In future versions of stringi we plan to include:
123 -> one hundred twenty three);
Any comments and suggestions are warmly welcome.
Have fun!
Notable changes since the previous CRAN release (v0.1-25):
[IMPORTANT CHANGE] stri_cmp* no longer accept opts_collator=NA.
From now on, stri_cmp_eq, stri_cmp_neq, and the new operators
%===%, %!==%, %stri===%, and %stri!==% are locale-independent operations
based on code point comparisons. New functions stri_cmp_equiv
and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%,
and %stri!=%) test for canonical equivalence.
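The difference between code point comparison and canonical equivalence can be sketched as follows. The two spellings of hladný below (a precomposed ý versus y followed by a combining acute accent) are an example chosen for illustration.

```r
library(stringi)

precomposed <- "hladn\u00fd"   # ý as a single code point (U+00FD)
decomposed  <- "hladny\u0301"  # 'y' followed by COMBINING ACUTE ACCENT (U+0301)

# Code point comparison: the two strings consist of different code points
cp_equal <- stri_cmp_eq(precomposed, decomposed)

# Canonical equivalence: both strings represent the same text
canon_equal <- stri_cmp_equiv(precomposed, decomposed)

c(cp_equal = cp_equal, canon_equal = canon_equal)
```

Here the code point comparison reports a difference, while the equivalence test treats the two strings as the same.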
[IMPORTANT CHANGE] stri_*_fixed search functions now perform
a locale-independent exact (bytewise, of course after conversion to UTF-8)
pattern search. All the Collator-based, locale-dependent search routines
are now available via stri_*_coll. The reason for this is that
ICU USearch currently has very poor performance, and in many search tasks
exact pattern matching is in fact sufficient.
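A minimal sketch contrasting the two search families, assuming a recent stringi release. The sample strings and the collator strength setting are choices made for this example: with `strength = 2` the collator ignores case differences but still distinguishes accents.

```r
library(stringi)

text    <- "HLADN\u00dd"  # "HLADNÝ"
pattern <- "hladn\u00fd"  # "hladný"

# Exact, locale-independent search: the case differs, so there is no match
fixed_hit <- stri_detect_fixed(text, pattern)

# Collator-based search: with strength = 2, case differences are ignored
coll_hit <- stri_detect_coll(text, pattern,
                             opts_collator = stri_opts_collator(strength = 2))

c(fixed = fixed_hit, coll = coll_hit)
```

The collator-based variant is the one to reach for when language-aware matching matters; the fixed variant is the faster choice when an exact match suffices.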
[IMPORTANT CHANGE] stri_enc_nf* and stri_enc_isnf* function families
have been renamed to stri_trans_nf* and stri_trans_isnf*, respectively.
This is because they deal with text transformation, and not with character
encoding. Moreover, all such operations may also be performed by
ICU's Transliterator (see below).
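To show what the renamed normalization functions do, here is a short sketch using the letter ą as an illustrative input (this example is not from the original post).

```r
library(stringi)

composed <- "\u0105"  # ą as a single code point

# NFD splits the letter into 'a' + COMBINING OGONEK (U+0328)
decomposed <- stri_trans_nfd(composed)
stri_length(decomposed)       # two code points after decomposition

# NFC composes the sequence back into a single code point
recomposed <- stri_trans_nfc(decomposed)

stri_trans_isnfc(decomposed)  # is the decomposed form already in NFC?
```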
[IMPORTANT CHANGE] stri_*_charclass
search functions now
rely solely on ICU's UnicodeSet patterns. All previously accepted
charclass identifiers became invalid. However, new patterns
should now be more familiar to the users (they are regex-like).
Moreover, we observe a very nice performance gain.
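A brief sketch of the regex-like UnicodeSet patterns, assuming a recent stringi release. The input string and the two patterns (`\p{N}` for numeric characters and the range `[a-z]`) are examples picked for illustration.

```r
library(stringi)

x <- "stringi 0.2-3"

# Unicode general category: count numeric characters
n_digits <- stri_count_charclass(x, "\\p{N}")

# Regex-like range: count lowercase ASCII letters
n_lower <- stri_count_charclass(x, "[a-z]")

c(digits = n_digits, lower = n_lower)
```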
[IMPORTANT CHANGE] stri_sort now does not include NAs
in output vectors by default, for compatibility with sort()
.
Moreover, currently none of the input vector's attributes are preserved.
[NEW FUNCTIONS] stri_trans_general and stri_trans_list give access
to ICU's Transliterator, which may be used to perform very general
text transforms.
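As one possible use of the Transliterator, the sketch below strips diacritics from Polish text with the ICU transform identifier "Latin-ASCII". The sample sentence is an illustration, not part of the original announcement.

```r
library(stringi)

# "zażółć gęślą jaźń" written with Unicode escapes (illustrative sample text)
x <- "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"

# Convert Latin letters with diacritics to their closest ASCII counterparts
ascii <- stri_trans_general(x, "Latin-ASCII")
ascii

# stri_trans_list() returns the identifiers of all available transforms
head(stri_trans_list())
```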
[NEW FUNCTIONS] stri_split_boundaries utilizes ICU's BreakIterator
to split strings at specific text boundaries. Moreover,
stri_locate_boundaries indicates positions of these boundaries.
[NEW FUNCTION] stri_extract_words
uses ICU's BreakIterator to
extract all words from a text. Additionally, stri_locate_words
locates start and end positions of words in a text.
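A minimal word extraction sketch; the sentence is an arbitrary example.

```r
library(stringi)

sentence <- "The quick brown fox."

# Words are found with ICU's BreakIterator; punctuation is not returned
words <- stri_extract_words(sentence)[[1]]
words
```

Note that the result of `stri_extract_words` is a list with one character vector per input string, hence the `[[1]]`.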
[NEW FUNCTIONS] stri_pad, stri_pad_left, stri_pad_right, stri_pad_both
pad a string with a specific code point.
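A short padding sketch. Arguments are passed positionally (string, target width, padding character), since the argument names may differ between stringi versions; the inputs are illustrative.

```r
library(stringi)

ids <- c("1", "22", "333")

# Pad on the left with zeros up to 3 code points
left <- stri_pad_left(ids, 3, "0")

# Center a string by padding on both sides
both <- stri_pad_both("ab", 6, "*")

left
both
```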
[NEW FUNCTION] stri_wrap breaks paragraphs of text into lines.
Two algorithms (greedy and minimal-raggedness) are available.
[NEW FUNCTION] stri_unique
extracts unique elements from
a character vector.
[NEW FUNCTIONS] stri_duplicated and stri_duplicated_any
determine duplicate elements in a character vector.
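The uniqueness helpers can be sketched on a tiny illustrative vector:

```r
library(stringi)

x <- c("a", "b", "a")

uniq    <- stri_unique(x)          # unique elements, in order of appearance
dup     <- stri_duplicated(x)      # TRUE for elements that repeat an earlier one
dup_idx <- stri_duplicated_any(x)  # index of the first duplicate, 0 if there is none

uniq
dup
dup_idx
```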
[NEW FUNCTION] stri_replace_na
replaces NA
s in a character vector
with a given string, useful for emulating e.g. R's paste()
behavior.
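To illustrate the point about `paste()`: stringi propagates missing values, whereas base R turns them into the string "NA". The vector below is an illustrative input.

```r
library(stringi)

x <- c("a", NA_character_, "b")

# stringi keeps missing values missing
strict <- stri_paste(x, "!")

# Replacing NAs first reproduces the behavior of base R's paste0()
lenient <- stri_paste(stri_replace_na(x, "NA"), "!")

strict
lenient
paste0(x, "!")
```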
[NEW FUNCTION] stri_rand_shuffle
generates a random permutation
of code points in a string.
[NEW FUNCTION] stri_rand_strings
generates random strings.
[NEW FUNCTIONS] New functions and binary operators for string comparison:
stri_cmp_eq
, stri_cmp_neq
, stri_cmp_lt
, stri_cmp_le
, stri_cmp_gt
,
stri_cmp_ge
, %==%
, %!=%
, %<%
, %<=%
, %>%
, %>=%
.
[NEW FUNCTION] stri_enc_mark
reads declared encodings of character strings
as seen by stringi
.
[NEW FUNCTION] stri_enc_tonative(str)
is an alias to
stri_encode(str, NULL, NULL)
.
[NEW FEATURE] stri_order
and stri_sort
now have an additional argument
na_last
(defaults to TRUE
and NA
, respectively).
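A quick sketch of the `na_last` argument on an illustrative vector:

```r
library(stringi)

x <- c("b", NA, "a")

dropped <- stri_sort(x)                  # NAs are removed by default, as in sort()
kept    <- stri_sort(x, na_last = TRUE)  # NAs are kept and placed at the end
ord     <- stri_order(x)                 # ordering permutation, NAs last by default

dropped
kept
ord
```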
[NEW FEATURE] stri_replace_all_charclass now has a merge argument
(defaults to FALSE for backward-compatibility). It may be used
to e.g. replace sequences of white spaces with a single space.
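The whitespace use case mentioned above might look like this; the input string is an illustration.

```r
library(stringi)

x <- "a  b \t c"

# With merge = TRUE, each run of consecutive white spaces
# is replaced by a single replacement string
squeezed <- stri_replace_all_charclass(x, "\\p{WHITE_SPACE}", " ", merge = TRUE)
squeezed
```

With the default `merge = FALSE`, every white space character would be replaced separately, so runs of spaces would keep their length.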
[NEW FEATURE] stri_enc_toutf8
now has a new validate
arg (defaults
to FALSE
for backward-compatibility). It may be used in a (rare) case
in which a user wants to fix an invalid UTF-8 byte sequence.
stri_length (among others) now detects invalid UTF-8 byte sequences.
[NEW FEATURE] All binary operators %???%
now also have aliases %stri???%
.
stri_*_fixed
now use a tweaked Knuth-Morris-Pratt search algorithm,
which improves the search performance drastically.
Significant performance improvements in stri_join
, stri_flatten
,
stri_cmp
, stri_trans_to*
, and others.
Refer to NEWS for a complete list of changes, new features and bug fixes.