As we know, character vectors in R resemble, in the computer's memory, lists of raw vectors (each terminated by a 00 byte). Each string has to be properly "decoded" so that textual information may be read from such a byte stream. This "decoding scheme" is simply called a character encoding.
In other words, data in the computer's memory are just bytes (small integer values). An encoding is a way to represent characters with such numbers; it is a semantic "key" to understanding a given byte sequence. For example, in ISO-8859-2 (Central European), the value 177 represents the Polish “a with ogonek”, and in ISO-8859-1 (Western European), the same value means the “plus-minus” sign. Thus, a character encoding is a translation scheme.
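The byte-177 example can be reproduced with a minimal base R sketch. It assumes that the platform's iconv recognises the names "latin1" and "latin2" (common aliases for ISO-8859-1 and ISO-8859-2); the variable names are illustrative only.

```r
# One byte with value 177; what it "means" depends on the assumed encoding.
b <- rawToChar(as.raw(177))

# Read as ISO-8859-2, byte 177 is "a with ogonek" (code point U+0105).
cp_latin2 <- utf8ToInt(iconv(b, "latin2", "UTF-8"))

# Read as ISO-8859-1, the same byte is the plus-minus sign (U+00B1).
cp_latin1 <- utf8ToInt(iconv(b, "latin1", "UTF-8"))

print(c(latin2 = cp_latin2, latin1 = cp_latin1))
```

The two code points differ (261 versus 177) even though the underlying byte is identical, which is exactly why the encoding has to be known before a byte stream can be read as text.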
Below you will find a list of functions that deal with character encodings. Mostly, you will use them when reading or writing a text file. Functions in stringi process each string internally in Unicode, which is a superset of all character representation schemes. This is why, when working with stringi, you will often follow (sometimes without being aware of it) this workflow: READ FILE -> CONVERT TO UTF-8 -> PROCESS -> CONVERT BACK TO DESIRED ENCODING -> WRITE FILE.
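The workflow can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses base R's iconv() as a stand-in for the conversion steps (stri_encode() could presumably be swapped in), the file names come from tempfile() and are hypothetical, and the "latin2" name must be known to the platform's iconv.

```r
# Hypothetical input and output files; tempfile() keeps the sketch self-contained.
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeBin(as.raw(c(177, 182)), src)   # two ISO-8859-2 bytes standing in for a real file

# READ FILE: fetch the raw bytes without interpreting them.
bytes <- readBin(src, what = "raw", n = file.size(src))

# CONVERT TO UTF-8: state the source encoding explicitly.
txt <- iconv(rawToChar(bytes), from = "latin2", to = "UTF-8")

# PROCESS: any string operation; here the text is simply duplicated.
txt <- paste0(txt, txt)

# CONVERT BACK TO DESIRED ENCODING, then WRITE FILE as raw bytes.
out <- charToRaw(iconv(txt, from = "UTF-8", to = "latin2"))
writeBin(out, dst)
```

Working on raw bytes at both ends keeps R from guessing the file's encoding; the only places where an encoding is named are the two explicit conversion calls.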
TODO: add stri_enc_toutf8
charToRaw() – single string to a raw vector only
charToRaw("aA1")
## [1] 61 41 31
stri_encode() with argument to_raw=TRUE is vectorized over the first argument; it returns a list of raw vectors.
stri_encode("aA1", "", "", to_raw=TRUE)[[1]]
## [1] 61 41 31
stri_encode(c("aA1", " "), "", "", to_raw=TRUE)
## [[1]]
## [1] 61 41 31
##
## [[2]]
## [1] 20
test1 <- "abcdefghijklmnopqrstuvwxyz"
microbenchmark(charToRaw(test1), stri_encode(test1, "", "", to_raw=TRUE)[[1]])
## Unit: microseconds
##                                            expr    min     lq  median      uq    max neval
##                                charToRaw(test1)  1.018  1.575  2.2325  2.5135 12.558   100
##  stri_encode(test1, "", "", to_raw = TRUE)[[1]] 11.782 12.847 13.4095 13.9705 65.901   100
test2 <- rep(c("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0123456789"), 10)
microbenchmark(lapply(test2, charToRaw), stri_encode(test2, "", "", to_raw=TRUE))
## Unit: microseconds
##                                       expr    min      lq  median      uq     max neval
##                   lapply(test2, charToRaw) 27.857 32.1365 35.0335 36.5075  89.783   100
##  stri_encode(test2, "", "", to_raw = TRUE) 20.489 21.7905 22.7170 23.9065 100.140   100
rawToChar() – single raw vector to a single string only
rawToChar(as.raw(c(97, 65, 49)))
## [1] "aA1"
stri_encode() also accepts a raw vector or a list of raw vectors as its first argument; by default, i.e. when to_raw=FALSE, the result is a character vector.
stri_encode(as.raw(c(97, 65, 49)), "")
## [1] "61" "41" "31"
stri_encode(list(as.raw(c(97, 65, 49)), as.raw(32)), "")
## [1] "aA1" " "
test1 <- as.raw(97:122)
microbenchmark(rawToChar(test1), stri_encode(test1, ""))
## Unit: nanoseconds
##                    expr   min      lq  median      uq    max neval
##        rawToChar(test1)   892  1332.5  1534.0  1815.5   9120   100
##  stri_encode(test1, "") 18978 19507.5 19920.5 20382.0 138394   100
test2 <- rep(list(as.raw(97:122), as.raw(65:90), as.raw(48:57)), 10)
microbenchmark(lapply(test2, rawToChar), stri_encode(test2, ""))
## Unit: microseconds
##                      expr    min     lq median     uq    max neval
##  lapply(test2, rawToChar) 39.063 43.325 45.785 47.876 93.074   100
##    stri_encode(test2, "") 19.470 20.394 20.841 21.780 56.470   100
utf8ToInt() – single string in UTF-8 to an integer vector only
utf8ToInt(enc2utf8("aA1"))
## [1] 97 65 49
stri_enc_toutf32() accepts a character vector as input and returns a list of integer vectors; as in all other functions from our package, native and UTF-8 encodings are handled properly.
stri_enc_toutf32("aA1")[[1]]
## [1] 97 65 49
stri_enc_toutf32(c("aA1", " "))
## [[1]]
## [1] 97 65 49
##
## [[2]]
## [1] 32
test1 <- enc2utf8("abcdefghijklmnopqrstuvwxyz")
microbenchmark(utf8ToInt(test1), stri_enc_toutf32(test1)[[1]])
## Unit: nanoseconds
##                          expr  min     lq median     uq   max neval
##              utf8ToInt(test1)  667  727.0  906.5 1043.0  3818   100
##  stri_enc_toutf32(test1)[[1]] 2839 2995.5 3123.5 3247.5 42244   100
test2 <- enc2utf8(rep(c("abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "0123456789"), 10))
microbenchmark(lapply(test2, utf8ToInt), stri_enc_toutf32(test2))
## Unit: microseconds
##                      expr    min      lq  median      uq    max neval
##  lapply(test2, utf8ToInt) 32.316 35.0260 36.7355 37.9725 62.752   100
##   stri_enc_toutf32(test2)  7.575  8.5445  9.0055  9.6810 63.978   100
intToUtf8() – single integer vector to a single string only
intToUtf8(c(97L, 65L, 49L))
## [1] "aA1"
stri_enc_fromutf32() accepts a single integer vector or a list of integer vectors as its argument; the result is a UTF-8-encoded character vector.
stri_enc_fromutf32(c(97L, 65L, 49L))
## [1] "aA1"
stri_enc_fromutf32(list(c(97L, 65L, 49L), 32L))
## [1] "aA1" " "
test1 <- 97:122
microbenchmark(intToUtf8(test1), stri_enc_fromutf32(test1))
## Unit: microseconds
##                       expr   min     lq median     uq    max neval
##           intToUtf8(test1) 1.320 1.4495 1.5295 1.6165  6.733   100
##  stri_enc_fromutf32(test1) 2.331 2.3695 2.4615 2.5525 31.403   100
test2 <- rep(list(97:122, 65:90, 48:57), 10)
microbenchmark(lapply(test2, intToUtf8), stri_enc_fromutf32(test2))
## Unit: microseconds
##                       expr    min      lq  median     uq    max neval
##   lapply(test2, intToUtf8) 50.286 52.9190 56.1345 57.717 84.865   100
##  stri_enc_fromutf32(test2)  8.274  8.8315  9.2000  9.448 30.225   100
iconvlist() – returns a character vector with supported encoding names (as well as their aliases).
Note that, as the R manual states, the names are rarely valid across all platforms.
sample(iconvlist(), 4) # a sample of supported encodings
## [1] "CSIBM9448" "IBM1143" "ISO-IR-8-1" "ISO_8859-14:1998"
length(iconvlist()) # count; Fedora Linux 19 x86_64
## [1] 1168
stri_enc_list() called with its argument set to TRUE provides a character vector with all supported encodings and their aliases in many different forms.
By default, however, a list of character vectors is returned. Each list element contains the aliases for the given encoding.
Please note that, apart from the listed names, ICU tries to normalize encoding specifiers, e.g. "utf8" is a valid specifier for "UTF-8".
Depending on the version of the ICU library used, each encoding should be supported across all platforms.
By the way, stri_enc_info() returns detailed information on a given encoding specifier.
sample(stri_enc_list(TRUE), 4)
## [1] "windows-1256" "windows-1250" "ibm-1235" "ibm-949_VASCII_VSUB_VPUA"
length(stri_enc_list(TRUE)) # includes aliases
## [1] 1198
length(stri_enc_list()) # true number of supported encodings
## [1] 229
str(stri_enc_info("cp1250"))
## List of 13
##  $ Name.friendly: chr "windows-1250"
##  $ Name.ICU     : chr "ibm-5346_P100-1998"
##  $ Name.UTR22   : chr "ibm-5346_P100-1998"
##  $ Name.IBM     : chr "ibm-5346"
##  $ Name.WINDOWS : chr "windows-1250"
##  $ Name.JAVA    : chr "windows-1250"
##  $ Name.IANA    : chr "windows-1250"
##  $ Name.MIME    : chr NA
##  $ ASCII.subset : logi TRUE
##  $ Unicode.1to1 : logi TRUE
##  $ CharSize.8bit: logi TRUE
##  $ CharSize.min : int 1
##  $ CharSize.max : int 1
iconv() – converts a character vector between two given encodings. An argument from or to equal to "" denotes the default (native) encoding used by the R session.
utf8ToInt( iconv(rawToChar(as.raw(c(177, 182))), "latin2", "utf-8") )
## [1] 261 347
stri_encode() provides functionality very similar to that of iconv().
Note that the default encoding currently in use may be obtained by calling stri_enc_get() and changed at any time with a call to stri_enc_set().
This is not dangerous, as almost every function in stringi returns UTF-8-encoded strings.
stri_encode() and iconv() differ in the treatment of unsupported characters. If an incorrect code point is found on input, stri_encode() replaces it by the default (for that target encoding) substitute character and generates a warning. iconv(), in turn, by default silently returns NA.
stri_enc_toutf32( stri_encode(rawToChar(as.raw(c(177, 182))), "latin2", "utf-8") )[[1]]
## [1] 261 347
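How iconv() handles a character that the target encoding cannot represent can be checked in base R alone. The sketch below assumes a platform whose iconv implementation follows the behaviour described in the R manual (some implementations transliterate instead); the variable names are illustrative.

```r
# "a with ogonek" (U+0105) has no ASCII representation.
x <- "\u0105"

# Default sub = NA: the whole string silently becomes NA.
res_default <- iconv(x, from = "UTF-8", to = "ASCII")

# sub = "byte": each non-convertible byte is kept as a <xx> hex escape instead.
res_sub <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

print(is.na(res_default))
print(res_sub)
```

Because the default gives no warning, it is worth checking conversion results with anyNA() when the input may contain characters outside the target encoding.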
test1 <- as.raw(128:255)
microbenchmark(iconv(test1, "latin2", "utf8"), stri_encode(test1, "latin2", "utf8"))
## Unit: microseconds
##                                  expr    min      lq  median      uq     max neval
##        iconv(test1, "latin2", "utf8") 53.614 54.9265 55.8775 57.6275 113.434   100
##  stri_encode(test1, "latin2", "utf8")  7.470  8.1910  9.3175 10.4570  81.452   100
stri_trans_isnfc(), stri_trans_isnfkc(), stri_trans_isnfd(), stri_trans_isnfkd(), stri_trans_isnfkc_casefold() check whether given UTF-8-encoded strings are properly normalized.
Moreover, stri_trans_nfc(), stri_trans_nfkc(), stri_trans_nfd(), stri_trans_nfkd(), stri_trans_nfkc_casefold() perform the desired normalization.
stri_enc_detect() and stri_enc_detect2() provide two experimental facilities for automatic encoding detection. The first one uses ICU's native algorithm and the second one provides our own implementation for locale-dependent guessing.
TO DO: stri_enc_detect2() - choose best match from a given list of guesses.
Moreover, the functions stri_enc_isascii(), stri_enc_isutf8(), stri_enc_isutf16le(), stri_enc_isutf16be(), stri_enc_isutf32le(), stri_enc_isutf32be() check whether given byte sequences form a valid character sequence in a given encoding.
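To illustrate what such a validity check involves, here is a rough base R sketch. It does not call the stringi functions: validUTF8() (available in base R since 3.3.0) stands in for the UTF-8 check, and is_ascii() is a hypothetical helper written only for this example.

```r
good <- "aA1"
bad  <- rawToChar(as.raw(c(177, 182)))   # ISO-8859-2 bytes, not well-formed UTF-8

# validUTF8() reports whether each string's bytes form valid UTF-8.
utf8_ok <- validUTF8(c(good, bad))

# Hand-rolled ASCII check: every byte must be below 128.
is_ascii <- function(s) all(as.integer(charToRaw(s)) < 128L)
ascii_ok <- c(is_ascii(good), is_ascii(bad))

print(utf8_ok)
print(ascii_ok)
```

Checks of this kind only say that a byte sequence is valid in an encoding, not that it was actually written in that encoding; every ASCII string, for instance, is also valid UTF-8.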