stringi 0.2-2 (2014-04-19) Compatibility Tables: Character Encodings

Introduction

As we know, a character vector in R is stored in the computer's memory much like a list of raw vectors, each terminated by a 00 byte. Each string has to be properly "decoded" so that textual information can be read from such a byte stream. This "decoding scheme" is simply called a character encoding.

In other words, data in the computer's memory are just bytes (small integer values). An encoding is a way to represent characters with such numbers: it is a semantic "key" to understanding a given byte sequence. For example, in ISO-8859-2 (Central European), the value 177 represents the Polish “a with ogonek”, and in ISO-8859-1 (Western European), the same value means the “plus-minus” sign. Thus, a character encoding is a translation scheme.
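The ISO-8859-2 versus ISO-8859-1 example can be checked with a minimal base R sketch. It assumes that the platform's iconv() recognizes the names "latin2", "latin1", and "UTF-8"; the variable names are illustrative only.

```r
# One byte with the value 177, wrapped in a string with no declared encoding.
b <- rawToChar(as.raw(177))

# Decode the same byte under two different encodings,
# then inspect the resulting Unicode code points.
as_latin2 <- utf8ToInt(iconv(b, "latin2", "UTF-8"))
as_latin1 <- utf8ToInt(iconv(b, "latin1", "UTF-8"))

as_latin2  # 261, i.e. U+0105 (a with ogonek)
as_latin1  # 177, i.e. U+00B1 (plus-minus sign)
```

The byte never changes; only the "key" used to interpret it does.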

Below you will find a list of functions that deal with character encodings. Mostly, you will use them when reading or writing a text file. Functions in stringi process each string internally in Unicode, which is a superset of all character representation schemes. This is why, while working with stringi, you will often follow (sometimes without even being aware of it) this workflow: READ FILE -> CONVERT TO UTF-8 -> PROCESS -> CONVERT BACK TO DESIRED ENCODING -> WRITE FILE.
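The workflow above might look as follows in a minimal base R sketch. The temporary files, the two-byte latin2 input, and the "reverse the characters" processing step are all hypothetical; iconv() stands in for the conversion steps, and stri_encode() could presumably play the same role.

```r
# Hypothetical input: a temporary file holding two latin2-encoded bytes.
path_in  <- tempfile(fileext = ".txt")
path_out <- tempfile(fileext = ".txt")
writeBin(as.raw(c(177, 182)), path_in)

# READ FILE: fetch the raw bytes without interpreting them.
bytes <- readBin(path_in, what = "raw", n = file.size(path_in))

# CONVERT TO UTF-8: declare the source encoding explicitly.
text <- iconv(rawToChar(bytes), from = "latin2", to = "UTF-8")

# PROCESS: here, simply reverse the order of the code points.
processed <- intToUtf8(rev(utf8ToInt(text)))

# CONVERT BACK TO DESIRED ENCODING: ask for raw bytes in latin2.
out <- iconv(processed, from = "UTF-8", to = "latin2", toRaw = TRUE)[[1]]

# WRITE FILE, then read it back to inspect the result.
writeBin(out, path_out)
result <- readBin(path_out, what = "raw", n = file.size(path_out))
result
```

Working on code points in the middle step is what keeps the reversal from splitting a multi-byte character.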

TODO: add stri_enc_toutf8

Conversion to Raw Vectors

Basic Functionality

base
charToRaw() – single string to a raw vector only
charToRaw("aA1")
## [1] 61 41 31
stringr
(none)
stringi
stri_encode() with argument to_raw=TRUE is vectorized over the first argument; it returns a list of raw vectors.
stri_encode("aA1", "", "", to_raw=TRUE)[[1]]
## [1] 61 41 31
stri_encode(c("aA1", " "), "", "", to_raw=TRUE)
## [[1]]
## [1] 61 41 31
## 
## [[2]]
## [1] 20
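Raw vectors print in hexadecimal, which is why "aA1" shows up as 61 41 31 here but as 97 65 49 in the integer-based sections. A small base R sketch of the relationship, together with a base analogue of the vectorized call (variable names are illustrative):

```r
bytes <- charToRaw("aA1")
print(bytes)                  # printed in hexadecimal

# The same bytes as decimal integers.
codes <- as.integer(bytes)
codes

# Base R analogue of the vectorized conversion: one raw vector per string.
raw_list <- lapply(c("aA1", " "), charToRaw)
raw_list
```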

Performance comparison

test1 <- "abcdefghijklmnopqrstuvwxyz"
microbenchmark(charToRaw(test1), stri_encode(test1, "", "", to_raw=TRUE)[[1]])
## Unit: microseconds
##                                            expr    min     lq  median      uq    max neval
##                                charToRaw(test1)  1.018  1.575  2.2325  2.5135 12.558   100
##  stri_encode(test1, "", "", to_raw = TRUE)[[1]] 11.782 12.847 13.4095 13.9705 65.901   100
test2 <- rep(c("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0123456789"), 10)
microbenchmark(lapply(test2, charToRaw), stri_encode(test2, "", "", to_raw=TRUE))
## Unit: microseconds
##                                       expr    min      lq  median      uq     max neval
##                   lapply(test2, charToRaw) 27.857 32.1365 35.0335 36.5075  89.783   100
##  stri_encode(test2, "", "", to_raw = TRUE) 20.489 21.7905 22.7170 23.9065 100.140   100

Conversion from Raw Vectors

Basic Functionality

base
rawToChar() – single raw vector to a single string only
rawToChar(as.raw(c(97, 65, 49)))
## [1] "aA1"
stringr
(none)
stringi
stri_encode() also accepts a raw vector or a list of raw vectors as its first argument; by default, i.e. when to_raw=FALSE, the result is a character vector.
stri_encode(as.raw(c(97, 65, 49)), "")
## [1] "61" "41" "31"
stri_encode(list(as.raw(c(97, 65, 49)),
   as.raw(32)), "")
## [1] "aA1" " "
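For comparison, a base R sketch of the list-of-raw-vectors case. Since rawToChar() handles one raw vector at a time, the loop has to be written explicitly; the variable names are illustrative.

```r
raws <- list(as.raw(c(97, 65, 49)), as.raw(32))

# Apply rawToChar() to each element and collect a character vector.
strings <- vapply(raws, rawToChar, character(1))
strings

# Converting back yields the original raw vectors.
round_trip <- lapply(strings, charToRaw)
```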

Performance comparison

test1 <- as.raw(97:122)
microbenchmark(rawToChar(test1), stri_encode(test1, ""))
## Unit: nanoseconds
##                    expr   min      lq  median      uq    max neval
##        rawToChar(test1)   892  1332.5  1534.0  1815.5   9120   100
##  stri_encode(test1, "") 18978 19507.5 19920.5 20382.0 138394   100
test2 <- rep(list(as.raw(97:122), as.raw(65:90), as.raw(48:57)), 10)
microbenchmark(lapply(test2, rawToChar), stri_encode(test2, ""))
## Unit: microseconds
##                      expr    min     lq median     uq    max neval
##  lapply(test2, rawToChar) 39.063 43.325 45.785 47.876 93.074   100
##    stri_encode(test2, "") 19.470 20.394 20.841 21.780 56.470   100

Conversion to Integer Vectors (i.e. UTF-32)

Basic Functionality

base
utf8ToInt() – single string in UTF-8 to an integer vector only
utf8ToInt(enc2utf8("aA1"))
## [1] 97 65 49
stringr
(none)
stringi
stri_enc_toutf32() accepts a character vector on input and returns a list of integer vectors; as in all other functions from our package, native and UTF-8 encodings are handled properly.
stri_enc_toutf32("aA1")[[1]]
## [1] 97 65 49
stri_enc_toutf32(c("aA1", " "))
## [[1]]
## [1] 97 65 49
## 
## [[2]]
## [1] 32
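For ASCII input the integer codes coincide with the byte values, which can hide the difference between code points and bytes. A short base R sketch with a non-ASCII character (the choice of U+0105 is illustrative):

```r
s <- enc2utf8("\u0105")   # Polish "a with ogonek"

# One Unicode code point...
code_points <- utf8ToInt(s)
code_points               # 261

# ...but two bytes in its UTF-8 representation.
utf8_bytes <- as.integer(charToRaw(s))
utf8_bytes                # 196 133
```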

Performance comparison

test1 <- enc2utf8("abcdefghijklmnopqrstuvwxyz")
microbenchmark(utf8ToInt(test1), stri_enc_toutf32(test1)[[1]])
## Unit: nanoseconds
##                          expr  min     lq median     uq   max neval
##              utf8ToInt(test1)  667  727.0  906.5 1043.0  3818   100
##  stri_enc_toutf32(test1)[[1]] 2839 2995.5 3123.5 3247.5 42244   100
test2 <- enc2utf8(rep(c("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0123456789"), 10))
microbenchmark(lapply(test2, utf8ToInt), stri_enc_toutf32(test2))
## Unit: microseconds
##                      expr    min      lq  median      uq    max neval
##  lapply(test2, utf8ToInt) 32.316 35.0260 36.7355 37.9725 62.752   100
##   stri_enc_toutf32(test2)  7.575  8.5445  9.0055  9.6810 63.978   100

Conversion from Integer Vectors (i.e. UTF-32)

Basic Functionality

base
intToUtf8() – single integer vector to a single string only
intToUtf8(c(97L, 65L, 49L))
## [1] "aA1"
stringr
(none)
stringi
stri_enc_fromutf32() accepts a single integer vector or a list of integer vectors as its argument; the result is a UTF-8-encoded character vector.
stri_enc_fromutf32(c(97L, 65L, 49L))
## [1] "aA1"
stri_enc_fromutf32(list(c(97L, 65L, 49L), 32L))
## [1] "aA1" " "
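A base R sketch of the list case, for comparison. intToUtf8() takes one integer vector per call, so the iteration is explicit; its multiple argument splits the result into single characters instead. Variable names are illustrative.

```r
codes <- list(c(97L, 65L, 49L), 32L)

# One string per integer vector.
strings <- vapply(codes, intToUtf8, character(1))
strings

# One string per code point.
chars <- intToUtf8(c(97L, 65L, 49L), multiple = TRUE)
chars
```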

Performance comparison

test1 <- 97:122
microbenchmark(intToUtf8(test1), stri_enc_fromutf32(test1))
## Unit: microseconds
##                       expr   min     lq median     uq    max neval
##           intToUtf8(test1) 1.320 1.4495 1.5295 1.6165  6.733   100
##  stri_enc_fromutf32(test1) 2.331 2.3695 2.4615 2.5525 31.403   100
test2 <- rep(list(97:122, 65:90, 48:57), 10)
microbenchmark(lapply(test2, intToUtf8), stri_enc_fromutf32(test2))
## Unit: microseconds
##                       expr    min      lq  median     uq    max neval
##   lapply(test2, intToUtf8) 50.286 52.9190 56.1345 57.717 84.865   100
##  stri_enc_fromutf32(test2)  8.274  8.8315  9.2000  9.448 30.225   100

List of Supported Encodings

Basic Functionality

base
iconvlist() – returns a character vector with supported encoding names (as well as their aliases).

Note that, as the R manual states, the names are rarely supported across all platforms.

sample(iconvlist(), 4) # a sample of supported encodings
## [1] "CSIBM9448"        "IBM1143"          "ISO-IR-8-1"       "ISO_8859-14:1998"
length(iconvlist()) # count; Fedora Linux 19 x86_64
## [1] 1168
stringr
(none)
stringi
stri_enc_list() called with its first argument set to TRUE provides a character vector with all supported encodings and their aliases in many different forms.

By default, however, a list of character vectors is returned. Each list element contains the aliases of a given encoding.

Please note that, apart from recognizing the listed names, ICU tries to normalize encoding specifiers, e.g. "utf8" is a valid specifier for "UTF-8".

Each encoding should be supported across all platforms, although the exact set depends on the version of the ICU library used.

By the way, stri_enc_info() returns detailed information on a given encoding specifier.

sample(stri_enc_list(TRUE), 4)
## [1] "windows-1256"             "windows-1250"             "ibm-1235"                 "ibm-949_VASCII_VSUB_VPUA"
length(stri_enc_list(TRUE)) # includes aliases
## [1] 1198
length(stri_enc_list()) # true number of supported encodings
## [1] 229
str(stri_enc_info("cp1250"))
## List of 13
##  $ Name.friendly: chr "windows-1250"
##  $ Name.ICU     : chr "ibm-5346_P100-1998"
##  $ Name.UTR22   : chr "ibm-5346_P100-1998"
##  $ Name.IBM     : chr "ibm-5346"
##  $ Name.WINDOWS : chr "windows-1250"
##  $ Name.JAVA    : chr "windows-1250"
##  $ Name.IANA    : chr "windows-1250"
##  $ Name.MIME    : chr NA
##  $ ASCII.subset : logi TRUE
##  $ Unicode.1to1 : logi TRUE
##  $ CharSize.8bit: logi TRUE
##  $ CharSize.min : int 1
##  $ CharSize.max : int 1

Convert Strings Between Encodings

Basic Functionality

base
iconv() – converts a character vector between two given encodings. A from or to argument equal to "" denotes the default (native) encoding used by the R session.
utf8ToInt(
   iconv(rawToChar(as.raw(c(177, 182))), "latin2", "utf-8")
)
## [1] 261 347
stringr
(none)
stringi
stri_encode() provides functionality very similar to that of iconv().

Note that the currently used default encoding may be obtained by calling stri_enc_get() and changed at any time with a call to stri_enc_set(). This is not dangerous, as almost every function in stringi returns UTF-8-encoded strings.

stri_encode() and iconv() differ in their treatment of unsupported characters. If an incorrect code point is found on input, stri_encode() replaces it with the default substitute character for the target encoding and generates a warning. iconv(), in turn, silently returns NA by default.
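The iconv() side of this difference can be sketched in base R with a character that the target encoding cannot represent. This is a sketch under assumptions: the behaviour shown is what a typical glibc-based iconv does, and other iconv implementations may substitute or transliterate instead of returning NA.

```r
x <- "\u0105"   # "a with ogonek", which latin1 cannot represent

# Default behaviour: the whole string becomes NA, without a warning.
default_result <- iconv(x, from = "UTF-8", to = "latin1")
is.na(default_result)

# With sub = "byte", each non-convertible byte is replaced by a hex escape
# and the string is kept.
with_sub <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")
is.na(with_sub)
```

The sub argument is the closest base R counterpart to the substitution that stri_encode() performs by default.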

stri_enc_toutf32(
   stri_encode(rawToChar(as.raw(c(177, 182))), "latin2", "utf-8")
)[[1]]
## [1] 261 347

Performance comparison

test1 <- as.raw(128:255)
microbenchmark(iconv(test1, "latin2", "utf8"), stri_encode(test1, "latin2", "utf8"))
## Unit: microseconds
##                                  expr    min      lq  median      uq     max neval
##        iconv(test1, "latin2", "utf8") 53.614 54.9265 55.8775 57.6275 113.434   100
##  stri_encode(test1, "latin2", "utf8")  7.470  8.1910  9.3175 10.4570  81.452   100

Unicode Normalization

TODO: this is text transform

Basic Functionality

base
(none)
stringr
(none)
stringi
stri_trans_isnfc(), stri_trans_isnfkc(), stri_trans_isnfd(), stri_trans_isnfkd(), stri_trans_isnfkc_casefold() check whether given UTF-8-encoded strings are properly normalized.

Moreover, stri_trans_nfc(), stri_trans_nfkc(), stri_trans_nfd(), stri_trans_nfkd(), stri_trans_nfkc_casefold() perform the desired normalization.
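Why normalization matters can be illustrated in base R alone, without performing any normalization. The sketch below builds the same visible character in two ways, precomposed and as a base letter followed by a combining mark; the variable names are illustrative.

```r
precomposed <- "\u0105"    # single code point: a with ogonek
decomposed  <- "a\u0328"   # letter a followed by a combining ogonek

cp_pre <- utf8ToInt(precomposed)
cp_dec <- utf8ToInt(decomposed)
cp_pre                     # 261
cp_dec                     # 97 808

# The strings render alike but are not identical as code point sequences.
same <- identical(precomposed, decomposed)
same                       # FALSE
```

Normalizing both strings to a common form before comparing them is what the stri_trans_nf*() functions listed above are for.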

Automatic Encoding Detection

Basic Functionality

base
(none)
stringr
(none)
stringi
stri_enc_detect() and stri_enc_detect2() provide two experimental facilities for automatic encoding detection. The first one uses ICU's native algorithm and the second one provides our own implementation for locale-dependent guessing.

TODO: stri_enc_detect2() - choose best match from a given list of guesses.

Moreover, the functions stri_enc_isascii(), stri_enc_isutf8(), stri_enc_isutf16be(), stri_enc_isutf16le(), stri_enc_isutf32be(), stri_enc_isutf32le() check whether given byte sequences form a valid character sequence in a given encoding.
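The idea behind such validity checks can be sketched in base R. The helper is_ascii_bytes() is hypothetical and written only for this illustration; validUTF8() is assumed to be available, which requires a newer R release (3.3.0 or later) than the one contemporary with this document.

```r
# Hypothetical helper: a byte sequence is ASCII if every byte is below 128.
is_ascii_bytes <- function(bytes) all(as.integer(bytes) < 128L)

ascii_bytes  <- charToRaw("aA1")
latin2_bytes <- as.raw(c(177, 182))

ascii_ok  <- is_ascii_bytes(ascii_bytes)    # TRUE
ascii_bad <- is_ascii_bytes(latin2_bytes)   # FALSE

# The two latin2 bytes do not form a valid UTF-8 sequence; ASCII text does.
utf8_bad <- validUTF8(rawToChar(latin2_bytes))   # FALSE
utf8_ok  <- validUTF8("aA1")                     # TRUE
```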