This function replaces specific values of variables with NA. set_na_if() is a scoped variant of set_na(), where values are replaced with NA only in those variables for which predicate returns TRUE.

set_na(x, ..., na, drop.levels = TRUE, as.tag = FALSE)

set_na_if(x, predicate, na, drop.levels = TRUE, as.tag = FALSE)

Arguments

x

A vector or data frame.

...

Optional, unquoted names of variables that should be selected for further processing. Required if x is a data frame (and not a vector) and only selected variables from x should be processed. You may also use operators like : or tidyselect's select_helpers. See 'Examples' or the package vignette.

na

Numeric vector of values that should be replaced with NA, or a character vector if values of factors or character vectors should be replaced. For labelled vectors, na may also be the name of a value label; in this case, the values associated with that label in each vector are replaced with NA. na can also be a named vector. If as.tag = FALSE, values are replaced only in those variables indicated by the names of na (see 'Examples').

drop.levels

Logical, if TRUE, factor levels of values that have been replaced with NA are dropped. See 'Examples'.

as.tag

Logical, if TRUE, values in x will be replaced by tagged_na, else by usual NA values. Use a named vector to assign the value label to the tagged NA value (see 'Examples').

predicate

A predicate function to be applied to the columns. The variables for which predicate returns TRUE are selected.
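
To illustrate how a predicate scopes the replacement, here is a minimal base R sketch. It is not the sjmisc implementation: replace_where() is a hypothetical helper written only for this illustration, and it ignores labels, factor levels and tagged NA values.

```r
# Base R sketch of predicate-scoped replacement; NOT the sjmisc implementation.
# `replace_where()` is a hypothetical helper used only for illustration.
replace_where <- function(df, predicate, na) {
  # evaluate the predicate once per column
  selected <- vapply(df, predicate, logical(1))
  # replace the listed values only in the selected columns
  df[selected] <- lapply(df[selected], function(v) {
    v[v %in% na] <- NA
    v
  })
  df
}

df <- data.frame(
  num = c(1, 2, 3, 2),
  chr = c("1", "2", "3", "2"),
  stringsAsFactors = FALSE
)

out <- replace_where(df, is.numeric, na = 2)
out$num  # the value 2 is replaced in the numeric column
out$chr  # the character column is left untouched
```

With the package itself, the corresponding call would presumably be set_na_if(df, is.numeric, na = 2), following the usage shown above.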

Value

x, with all values in na replaced by NA. If x is a data frame, the complete data frame x is returned, with NA set only in the variables specified in ...; if ... is not specified, the replacement applies to all variables in the data frame.

Details

set_na() replaces all values defined in na with a related NA or tagged NA value (see tagged_na). Tagged NAs work exactly like regular R missing values except that they store one additional byte of information: a tag, which is usually a letter ("a" to "z") or a digit character ("0" to "9").
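
The behaviour of tagged NA values can be inspected directly with the haven package, which the examples below also load. The following is a small sketch that assumes haven is installed; it uses only tagged_na(), na_tag() and is_tagged_na() and does not call set_na().

```r
library(haven)  # provides tagged_na(), na_tag() and is_tagged_na()

# one regular value, two tagged NAs and one regular NA
x <- c(1, tagged_na("a"), tagged_na("z"), NA)

is.na(x)              # tagged NAs count as missing, like a regular NA
na_tag(x)             # the stored tag, or NA where no tag is present
is_tagged_na(x)       # TRUE only for tagged NAs
is_tagged_na(x, "a")  # TRUE only for NAs carrying the tag "a"
```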

Different NA values for different variables

If na is a named vector and as.tag = FALSE, the names indicate variable names, and the associated values indicate those values that should be replaced by NA in the related variable. For instance, set_na(x, na = c(v1 = 4, v2 = 3)) would replace all 4 in v1 with NA and all 3 in v2 with NA.

If na is a named list and as.tag = FALSE, several values can be replaced by NA separately for each variable. For example, set_na(x, na = list(v1 = c(1, 4), v2 = 5:7)) would replace all 1 and 4 in v1 with NA and all 5 to 7 in v2 with NA.
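
The two forms described in this section can be pictured with a short base R sketch. It is not the sjmisc implementation: replace_by_name() is a hypothetical helper, and it only mimics the as.tag = FALSE case for plain numeric columns.

```r
# Base R sketch of per-variable replacement; NOT the sjmisc implementation.
# `replace_by_name()` is a hypothetical helper used only for illustration.
replace_by_name <- function(df, na) {
  for (var in intersect(names(na), names(df))) {
    # `na[[var]]` is a single value for a named vector and may hold
    # several values for a named list
    df[[var]][df[[var]] %in% na[[var]]] <- NA
  }
  df
}

x <- data.frame(v1 = c(1, 4, 2, 4), v2 = c(3, 5, 6, 7))

# named vector: one value per variable
a <- replace_by_name(x, na = c(v1 = 4, v2 = 3))

# named list: several values per variable
b <- replace_by_name(x, na = list(v1 = c(1, 4), v2 = 5:7))
```

In this sketch, names in na that do not match a column of x are silently ignored; whether set_na() does the same is not stated here.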

See also 'Details' in get_na.

Note

Labels of values that are replaced with NA and are no longer used are removed from x; other value and variable label attributes are preserved. For more details on labelled data, see the vignette 'Labelled Data and the sjlabelled-Package'.
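
The following base R sketch shows what replacing a value by its label name and dropping the unused label amounts to. It is not the sjmisc implementation; it assumes that value labels are stored as a named vector in a "labels" attribute, as the example output on this page suggests.

```r
# Base R sketch; NOT the sjmisc implementation. Assumes value labels live in
# a "labels" attribute holding a named vector.
x <- c(1, 3, 4, 2, 4)
attr(x, "labels") <- c("Refused" = 3, "No answer" = 4)

value_labels <- attr(x, "labels")
unused <- names(value_labels) == "No answer"

# replace every value that carries the label "No answer" with NA
x[x %in% value_labels[unused]] <- NA

# remove the label that is no longer used; "Refused" is preserved
attr(x, "labels") <- value_labels[!unused]

x
```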

See also

replace_na to replace NA values with specific values, rec for general recoding of variables and recode_to for re-shifting value ranges. See get_na to get the values of missing values in labelled vectors.

Examples

# create random variable
dummy <- sample(1:8, 100, replace = TRUE)
# show value distribution
table(dummy)
#> dummy
#>  1  2  3  4  5  6  7  8 
#> 15 12 12 13 12 10 15 11 
# set value 1 and 8 as missings
dummy <- set_na(dummy, na = c(1, 8))
# show value distribution, including missings
table(dummy, useNA = "always")
#> dummy
#>    2    3    4    5    6    7 <NA> 
#>   12   12   13   12   10   15   26 
# add named vector as further missing value
set_na(dummy, na = c("Refused" = 5), as.tag = TRUE)
#>  [1] NA  3  7 NA  7  4 NA NA NA  4 NA  6  4 NA  3 NA  3  7  6 NA  6 NA  7 NA  6
#> [26]  7  6 NA  7  4  2  3 NA NA NA  6  7 NA  2 NA  4 NA  4 NA  2 NA  3  7  7  7
#> [51] NA  4  6 NA  2  2  2  4  6 NA NA  7  3  6  2 NA  3  2  3  2  3  2 NA  4 NA
#> [76] NA NA  7  7  3  4 NA  7  4 NA NA  6 NA  7 NA NA NA  4  2  2  4  3  3 NA NA
#> attr(,"labels")
#> Refused 
#>      NA 
# see different missing types
library(haven)
library(sjlabelled)
print_tagged_na(set_na(dummy, na = c("Refused" = 5), as.tag = TRUE))
#>  [1]    NA     3     7 NA(5)     7     4    NA NA(5)    NA     4    NA     6
#> [13]     4    NA     3 NA(5)     3     7     6    NA     6    NA     7    NA
#> [25]     6     7     6    NA     7     4     2     3 NA(5)    NA    NA     6
#> [37]     7    NA     2 NA(5)     4 NA(5)     4    NA     2    NA     3     7
#> [49]     7     7    NA     4     6    NA     2     2     2     4     6 NA(5)
#> [61] NA(5)     7     3     6     2    NA     3     2     3     2     3     2
#> [73]    NA     4    NA    NA    NA     7     7     3     4 NA(5)     7     4
#> [85]    NA    NA     6    NA     7 NA(5) NA(5)    NA     4     2     2     4
#> [97]     3     3    NA NA(5)
# create sample data frame
dummy <- data.frame(var1 = sample(1:8, 100, replace = TRUE),
                    var2 = sample(1:10, 100, replace = TRUE),
                    var3 = sample(1:6, 100, replace = TRUE))
# set value 2 and 4 as missings
dummy %>% set_na(na = c(2, 4)) %>% head()
#>   var1 var2 var3
#> 1    8    7    1
#> 2    3    5    1
#> 3    3   NA   NA
#> 4   NA   NA   NA
#> 5    5    9    5
#> 6   NA    8    1
dummy %>% set_na(na = c(2, 4), as.tag = TRUE) %>% get_na()
#> $var1
#>  2  4 
#> NA NA 
#> 
#> $var2
#>  2  4 
#> NA NA 
#> 
#> $var3
#>  2  4 
#> NA NA 
#> 
dummy %>% set_na(na = c(2, 4), as.tag = TRUE) %>% get_values()
#> $var1
#> [1] "NA(2)" "NA(4)"
#> 
#> $var2
#> [1] "NA(2)" "NA(4)"
#> 
#> $var3
#> [1] "NA(2)" "NA(4)"
#> 
data(efc)
dummy <- data.frame(
  var1 = efc$c82cop1,
  var2 = efc$c83cop2,
  var3 = efc$c84cop3
)
# check original distribution of categories
lapply(dummy, table, useNA = "always")
#> $var1
#> 
#>    1    2    3    4 <NA> 
#>    3   97  591  210    7 
#> 
#> $var2
#> 
#>    1    2    3    4 <NA> 
#>  186  547  130   39    6 
#> 
#> $var3
#> 
#>    1    2    3    4 <NA> 
#>  516  252   82   52    6 
#> 
# set 3 to NA for two variables
lapply(set_na(dummy, var1, var3, na = 3), table, useNA = "always")
#> $var1
#> 
#>    1    2    4 <NA> 
#>    3   97  210  598 
#> 
#> $var2
#> 
#>    1    2    3    4 <NA> 
#>  186  547  130   39    6 
#> 
#> $var3
#> 
#>    1    2    4 <NA> 
#>  516  252   52   88 
#> 
# if 'na' is a named vector *and* 'as.tag = FALSE', different NA-values
# can be specified for each variable
set.seed(1)
dummy <- data.frame(
  var1 = sample(1:8, 10, replace = TRUE),
  var2 = sample(1:10, 10, replace = TRUE),
  var3 = sample(1:6, 10, replace = TRUE)
)
dummy
#>    var1 var2 var3
#> 1     3    3    6
#> 2     3    2    2
#> 3     5    7    4
#> 4     8    4    1
#> 5     2    8    2
#> 6     8    5    3
#> 7     8    8    1
#> 8     6   10    3
#> 9     6    4    6
#> 10    1    8    3
# Replace "3" in var1 with NA, "5" in var2 and "6" in var3
set_na(dummy, na = c(var1 = 3, var2 = 5, var3 = 6))
#>    var1 var2 var3
#> 1    NA    3   NA
#> 2    NA    2    2
#> 3     5    7    4
#> 4     8    4    1
#> 5     2    8    2
#> 6     8   NA    3
#> 7     8    8    1
#> 8     6   10    3
#> 9     6    4   NA
#> 10    1    8    3
# if 'na' is a named list *and* 'as.tag = FALSE', for each
# variable different multiple NA-values can be specified
set_na(dummy, na = list(var1 = 1:3, var2 = c(7, 8), var3 = 6))
#>    var1 var2 var3
#> 1    NA    3   NA
#> 2    NA    2    2
#> 3     5   NA    4
#> 4     8    4    1
#> 5    NA   NA    2
#> 6     8    5    3
#> 7     8   NA    1
#> 8     6   10    3
#> 9     6    4   NA
#> 10   NA   NA    3
# drop unused factor levels when being set to NA
x <- factor(c("a", "b", "c"))
x
#> [1] a b c
#> Levels: a b c
set_na(x, na = "b", as.tag = TRUE)
#> [1] a    <NA> c   
#> attr(,"labels")
#>  b 
#> NA 
#> Levels: a c
set_na(x, na = "b", drop.levels = FALSE, as.tag = TRUE)
#> [1] a    <NA> c   
#> attr(,"labels")
#>  b 
#> NA 
#> Levels: a b c
# set_na() can also set values to NA by giving the value label of the
# value that should be replaced with NA. This is particularly helpful
# if a certain category should be set as NA but is assigned different
# values across variables
x1 <- sample(1:4, 20, replace = TRUE)
x2 <- sample(1:7, 20, replace = TRUE)
x1 <- set_labels(x1, labels = c("Refused" = 3, "No answer" = 4))
x2 <- set_labels(x2, labels = c("Refused" = 6, "No answer" = 7))
tmp <- data.frame(x1, x2)
get_labels(tmp)
#> $x1
#> [1] "Refused"   "No answer"
#> 
#> $x2
#> [1] "Refused"   "No answer"
#> 
table(tmp, useNA = "always")
#>       x2
#> x1     1 2 3 4 5 6 7 <NA>
#>   1    0 1 0 2 0 0 0    0
#>   2    0 0 1 2 0 1 0    0
#>   3    2 0 2 0 2 0 2    0
#>   4    1 1 1 1 0 0 1    0
#>   <NA> 0 0 0 0 0 0 0    0
get_labels(set_na(tmp, na = "No answer"))
#> $x1
#> [1] "Refused"
#> 
#> $x2
#> [1] "Refused"
#> 
table(set_na(tmp, na = "No answer"), useNA = "always")
#>       x2
#> x1     1 2 3 4 5 6 <NA>
#>   1    0 1 0 2 0 0    0
#>   2    0 0 1 2 0 1    0
#>   3    2 0 2 0 2 0    2
#>   <NA> 1 1 1 1 0 0    1
# show values
tmp
#>    x1 x2
#> 1   2  4
#> 2   3  7
#> 3   2  4
#> 4   1  2
#> 5   4  1
#> 6   3  1
#> 7   4  3
#> 8   1  4
#> 9   3  5
#> 10  2  3
#> 11  4  7
#> 12  3  3
#> 13  4  4
#> 14  3  3
#> 15  3  5
#> 16  4  2
#> 17  1  4
#> 18  2  6
#> 19  3  1
#> 20  3  7
set_na(tmp, na = c("Refused", "No answer"))
#>    x1 x2
#> 1   2  4
#> 2  NA NA
#> 3   2  4
#> 4   1  2
#> 5  NA  1
#> 6  NA  1
#> 7  NA  3
#> 8   1  4
#> 9  NA  5
#> 10  2  3
#> 11 NA NA
#> 12 NA  3
#> 13 NA  4
#> 14 NA  3
#> 15 NA  5
#> 16 NA  2
#> 17  1  4
#> 18  2 NA
#> 19 NA  1
#> 20 NA NA