get_count_vector.Rd
The return value is an integer vector. Its length is the number of unique tokens in the corpus, which equals the number of unique ids. The counts are ordered by token id: ids are zero-based, so element i of the vector holds the count for id i - 1.
get_count_vector(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))
Argument | Description |
---|---|
corpus | a CWB corpus |
p_attribute | a positional attribute |
registry | registry directory |
an integer vector
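The layout of the return value can be illustrated with a small base R sketch. It does not call RcppCWB and is not how the package computes the counts; the id stream `ids`, the lexicon `tokens`, and the helper `count_for_id()` are hypothetical names made up for illustration. The only point shown is that zero-based ids map to one-based positions in the count vector.

```r
# Hypothetical toy data: a token stream encoded as zero-based ids,
# and a lexicon in which id i belongs to tokens[i + 1].
ids <- c(0L, 1L, 2L, 0L, 3L, 0L, 1L)
tokens <- c("the", "oil", "price", "rose")

# tabulate() counts positive integers only, so shift the ids by one.
# The result has one entry per unique id, ordered by id.
counts <- tabulate(ids + 1L, nbins = length(tokens))

# Hypothetical helper: look up the count for a zero-based id.
count_for_id <- function(counts, id) counts[id + 1L]

counts                    # counts for ids 0, 1, 2, 3
count_for_id(counts, 0L)  # count of "the"
```

The same offset is why the example below builds `token_id` as `0:(length(y) - 1)` before looking up the token strings.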
registry <- if (!check_pkg_registry_files()) use_tmp_registry() else get_pkg_registry()

y <- get_count_vector(
  corpus = "REUTERS", p_attribute = "word",
  registry = registry
)
df <- data.frame(token_id = 0:(length(y) - 1), count = y)
df[["token"]] <- cl_id2str(
  "REUTERS", p_attribute = "word",
  id = df[["token_id"]], registry = registry
)
df <- df[, c("token", "token_id", "count")] # reorder columns
df <- df[order(df[["count"]], decreasing = TRUE), ]
head(df)
#>    token token_id count
#> 32   the       31   206
#> 30    to       29   134
#> 38    of       37    97
#> 36    in       35    84
#> 16   oil       15    78
#> 41   and       40    77