This vignette provides a basic overview of quanteda’s features and capabilities. For additional vignettes, see the articles at quanteda.io.
An R package for managing and analyzing text.
quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, performing the most common natural language processing tasks, such as tokenizing, stemming, or forming ngrams, simply and quickly. quanteda’s functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.
Built on the text processing functions in the stringi package, which is in turn built on a C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set, following conversion internally to UTF-8.
quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the data.table package for indexing large documents efficiently, and the Matrix package for sparse matrix objects. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)
quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what constitutes the documents and the features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs (as sketched below), or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
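For instance, here is a minimal sketch of redefining documents as sentences, using corpus_reshape() (available in recent versions of quanteda):
require(quanteda)
# split the built-in UK manifesto texts into sentence-level documents
sentCorp <- corpus_reshape(corpus(data_char_ukimmig2010), to = "sentences")
ndoc(sentCorp)  # many more documents than the original 9 texts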
quanteda provides tools for getting texts into a corpus object, for working with a corpus, for extracting features from a corpus, and for analyzing the resulting document-feature matrix created when features are extracted from a corpus.
Additional features of quanteda include:
the ability to explore texts using key-words-in-context;
fast computation of a variety of readability indexes;
fast computation of a variety of lexical diversity measures, as sketched after this list;
quick computation of word or document association measures, for clustering or to compute similarity scores for other purposes; and
a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document.
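As a minimal sketch of two of these descriptive tools, using the quanteda functions textstat_readability() and textstat_lexdiv():
# Flesch reading ease, one score per document
textstat_readability(data_char_ukimmig2010, measure = "Flesch")
# type-token ratio, a simple lexical diversity measure, computed on a dfm
textstat_lexdiv(dfm(data_char_ukimmig2010), measure = "TTR")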
Additional features are planned for future versions of quanteda.
quanteda is hardly unique in providing facilities for working with text – the excellent tm package already provides many of the features we have described. quanteda is designed to complement those packages, as well as to simplify the implementation of the text-to-analysis workflow. quanteda corpus structures are simpler objects than their equivalents in tm, as are the document-feature matrix objects from quanteda, compared to the sparse matrix implementation found in tm. However, there is no need to choose only one package, since we provide translator functions from one matrix or corpus object to the other in quanteda, as illustrated below.
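For example, a minimal sketch of converting a quanteda dfm for use with tm, via the convert() function (assuming the tm package is installed):
ukDfm <- dfm(data_char_ukimmig2010)
tmDtm <- convert(ukDfm, to = "tm")  # yields a tm DocumentTermMatrix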
Once constructed, a quanteda “dfm” can be easily passed to other text-analysis packages for additional analysis, such as:
topic models (including converters for direct use with the topicmodels, LDA, and stm packages)
document scaling using quanteda’s own functions for the “wordfish” and “Wordscores” models, and a sparse method for correspondence analysis
document classification methods, using (for example) Naive Bayes, k-nearest neighbour, or Support Vector Machines
more sophisticated machine learning through a variety of other packages that take matrix or matrix-like inputs.
graphical analysis, including word clouds and strip plots for selected themes or words.
quanteda can be installed through a normal installation of the package from CRAN; for the GitHub (development) version, see the installation instructions at https://github.com/kbenoit/quanteda.
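For example:
# release version from CRAN
install.packages("quanteda")
# development version from GitHub (assumes the devtools package is installed)
# devtools::install_github("kbenoit/quanteda")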
require(quanteda)
## Loading required package: quanteda
## quanteda version 0.9.9000
## Using 4 of 8 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda has a simple and powerful companion package for loading texts: readtext. The main function in this package, readtext(), takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function, to create a quanteda corpus object. readtext() works on:
plain text (.txt) files;
comma-separated-value (.csv) files;
XML formatted data; and
JSON data, including JSON from the Twitter stream.
) files;The corpus constructor command corpus()
works directly on:
VCorpus
corpus object from the tm package.The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexbility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R.
If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called data_char_ukimmig2010).
myCorpus <- corpus(data_char_ukimmig2010) # build a new corpus from the texts
summary(myCorpus)
## Corpus consisting of 9 documents.
##
## Text Types Tokens Sentences
## BNP 1125 3280 88
## Coalition 142 260 4
## Conservative 251 499 15
## Greens 322 679 21
## Labour 298 683 29
## LibDem 251 483 14
## PC 77 114 5
## SNP 88 134 4
## UKIP 346 723 27
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Thu Aug 10 12:43:43 2017
## Notes:
If we wanted, we could add some document-level variables – what quanteda calls docvars – to this corpus. We can do this using R’s names() function to get the names of the character vector data_char_ukimmig2010, and assign this to a document variable (docvar).
docvars(myCorpus, "Party") <- names(data_char_ukimmig2010)
docvars(myCorpus, "Year") <- 2010
summary(myCorpus)
## Corpus consisting of 9 documents.
##
## Text Types Tokens Sentences Party Year
## BNP 1125 3280 88 BNP 2010
## Coalition 142 260 4 Coalition 2010
## Conservative 251 499 15 Conservative 2010
## Greens 322 679 21 Greens 2010
## Labour 298 683 29 Labour 2010
## LibDem 251 483 14 LibDem 2010
## PC 77 114 5 PC 2010
## SNP 88 134 4 SNP 2010
## UKIP 346 723 27 UKIP 2010
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Thu Aug 10 12:43:43 2017
## Notes:
If we wanted to tag each document with additional meta-data not considered a document variable of interest for analysis, but rather something that we need to know as an attribute of the document, we could also add those to our corpus.
metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource") <- paste("data_char_ukimmig2010", 1:ndoc(myCorpus), sep = "_")
summary(myCorpus, showmeta = TRUE)
## Corpus consisting of 9 documents.
##
## Text Types Tokens Sentences Party Year _language
## BNP 1125 3280 88 BNP 2010 english
## Coalition 142 260 4 Coalition 2010 english
## Conservative 251 499 15 Conservative 2010 english
## Greens 322 679 21 Greens 2010 english
## Labour 298 683 29 Labour 2010 english
## LibDem 251 483 14 LibDem 2010 english
## PC 77 114 5 PC 2010 english
## SNP 88 134 4 SNP 2010 english
## UKIP 346 723 27 UKIP 2010 english
## _docsource
## data_char_ukimmig2010_1
## data_char_ukimmig2010_2
## data_char_ukimmig2010_3
## data_char_ukimmig2010_4
## data_char_ukimmig2010_5
## data_char_ukimmig2010_6
## data_char_ukimmig2010_7
## data_char_ukimmig2010_8
## data_char_ukimmig2010_9
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Thu Aug 10 12:43:43 2017
## Notes:
The last command, metadoc, allows you to define your own document meta-data fields. Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
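For example, for the nine-document corpus created above:
ndoc(myCorpus)
## [1] 9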
require(readtext)
# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
myCorpusTwitter <- corpus(mytf1)
summary(myCorpusTwitter, 5)
# generic json - needs a textfield specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
textfield = "text")
summary(corpus(mytf2), 5)
# text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)
# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)
# multiple text files with docvars from filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt",
docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)
# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml",
textfield = "COMMON")
summary(corpus(mytf6), 5)
# csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural),
docvars(data_corpus_inaugural)),
file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.
A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation were required, such as analyzing a reading ease index – can be performed on the same corpus.
To extract texts from a corpus, we use an extractor, called texts().
texts(data_corpus_inaugural)[2]
## 1793-Washington
## "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "
To summarize the texts from a corpus, we can call a summary() method defined for a corpus.
summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents.
##
## Text Types Tokens Sentences year debate
## 2010_BUDGET_01_Brian_Lenihan_FF 1953 8641 374 2010 BUDGET
## 2010_BUDGET_02_Richard_Bruton_FG 1040 4446 217 2010 BUDGET
## 2010_BUDGET_03_Joan_Burton_LAB 1624 6393 307 2010 BUDGET
## 2010_BUDGET_04_Arthur_Morgan_SF 1595 7107 343 2010 BUDGET
## 2010_BUDGET_05_Brian_Cowen_FF 1629 6599 250 2010 BUDGET
## 2010_BUDGET_06_Enda_Kenny_FG 1148 4232 153 2010 BUDGET
## 2010_BUDGET_07_Kieran_ODonnell_FG 678 2297 133 2010 BUDGET
## 2010_BUDGET_08_Eamon_Gilmore_LAB 1181 4177 201 2010 BUDGET
## 2010_BUDGET_09_Michael_Higgins_LAB 488 1286 44 2010 BUDGET
## 2010_BUDGET_10_Ruairi_Quinn_LAB 439 1284 59 2010 BUDGET
## 2010_BUDGET_11_John_Gormley_Green 401 1030 49 2010 BUDGET
## 2010_BUDGET_12_Eamon_Ryan_Green 510 1643 90 2010 BUDGET
## 2010_BUDGET_13_Ciaran_Cuffe_Green 442 1240 45 2010 BUDGET
## 2010_BUDGET_14_Caoimhghin_OCaolain_SF 1188 4044 176 2010 BUDGET
## number foren name party
## 01 Brian Lenihan FF
## 02 Richard Bruton FG
## 03 Joan Burton LAB
## 04 Arthur Morgan SF
## 05 Brian Cowen FF
## 06 Enda Kenny FG
## 07 Kieran ODonnell FG
## 08 Eamon Gilmore LAB
## 09 Michael Higgins LAB
## 10 Ruairi Quinn LAB
## 11 John Gormley Green
## 12 Eamon Ryan Green
## 13 Ciaran Cuffe Green
## 14 Caoimhghin OCaolain SF
##
## Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:
We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:
tokenInfo <- summary(data_corpus_inaugural)
## Corpus consisting of 58 documents.
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1538 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2578 37 1797 Adams John
## 1801-Jefferson 717 1927 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2381 45 1805 Jefferson Thomas
## 1809-Madison 535 1263 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3680 121 1817 Monroe James
## 1821-Monroe 1259 4886 129 1821 Monroe James
## 1825-Adams 1003 3152 74 1825 Adams John Quincy
## 1829-Jackson 517 1210 25 1829 Jackson Andrew
## 1833-Jackson 499 1269 29 1833 Jackson Andrew
## 1837-VanBuren 1315 4165 95 1837 Van Buren Martin
## 1841-Harrison 1896 9144 210 1841 Harrison William Henry
## 1845-Polk 1334 5193 153 1845 Polk James Knox
## 1849-Taylor 496 1179 22 1849 Taylor Zachary
## 1853-Pierce 1165 3641 104 1853 Pierce Franklin
## 1857-Buchanan 945 3086 89 1857 Buchanan James
## 1861-Lincoln 1075 4006 135 1861 Lincoln Abraham
## 1865-Lincoln 360 776 26 1865 Lincoln Abraham
## 1869-Grant 485 1235 40 1869 Grant Ulysses S.
## 1873-Grant 552 1475 43 1873 Grant Ulysses S.
## 1877-Hayes 831 2716 59 1877 Hayes Rutherford B.
## 1881-Garfield 1021 3212 111 1881 Garfield James A.
## 1885-Cleveland 676 1820 44 1885 Cleveland Grover
## 1889-Harrison 1352 4722 157 1889 Harrison Benjamin
## 1893-Cleveland 821 2125 58 1893 Cleveland Grover
## 1897-McKinley 1232 4361 130 1897 McKinley William
## 1901-McKinley 854 2437 100 1901 McKinley William
## 1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore
## 1909-Taft 1437 5822 159 1909 Taft William Howard
## 1913-Wilson 658 1882 68 1913 Wilson Woodrow
## 1917-Wilson 549 1656 59 1917 Wilson Woodrow
## 1921-Harding 1169 3721 148 1921 Harding Warren G.
## 1925-Coolidge 1220 4440 196 1925 Coolidge Calvin
## 1929-Hoover 1090 3865 158 1929 Hoover Herbert
## 1933-Roosevelt 743 2062 85 1933 Roosevelt Franklin D.
## 1937-Roosevelt 725 1997 96 1937 Roosevelt Franklin D.
## 1941-Roosevelt 526 1544 68 1941 Roosevelt Franklin D.
## 1945-Roosevelt 275 647 26 1945 Roosevelt Franklin D.
## 1949-Truman 781 2513 116 1949 Truman Harry S.
## 1953-Eisenhower 900 2757 119 1953 Eisenhower Dwight D.
## 1957-Eisenhower 621 1931 92 1957 Eisenhower Dwight D.
## 1961-Kennedy 566 1566 52 1961 Kennedy John F.
## 1965-Johnson 568 1723 93 1965 Johnson Lyndon Baines
## 1969-Nixon 743 2437 103 1969 Nixon Richard Milhous
## 1973-Nixon 544 2012 68 1973 Nixon Richard Milhous
## 1977-Carter 527 1376 52 1977 Carter Jimmy
## 1981-Reagan 902 2790 128 1981 Reagan Ronald
## 1985-Reagan 925 2921 123 1985 Reagan Ronald
## 1989-Bush 795 2681 141 1989 Bush George
## 1993-Clinton 642 1833 81 1993 Clinton Bill
## 1997-Clinton 773 2449 111 1997 Clinton Bill
## 2001-Bush 621 1808 97 2001 Bush George W.
## 2005-Bush 773 2319 100 2005 Bush George W.
## 2009-Obama 938 2711 110 2009 Obama Barack
## 2013-Obama 814 2317 88 2013 Obama Barack
## 2017-Trump 582 1660 88 2017 Trump Donald J.
##
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes: http://www.presidency.ucsb.edu/inaugurals.php
if (require(ggplot2))
ggplot(data=tokenInfo, aes(x = Year, y = Tokens, group = 1)) + geom_line() + geom_point() +
scale_x_discrete(labels = c(seq(1789,2012,12)), breaks = seq(1789,2012,12) )
## Loading required package: ggplot2
# Longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens), ]
## Text Types Tokens Sentences Year President
## 1841-Harrison 1841-Harrison 1896 9144 210 1841 Harrison
## FirstName
## 1841-Harrison William Henry
The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.
library(quanteda)
mycorpus1 <- corpus(data_corpus_inaugural[1:5], note = "First five inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[1:5], note = "First five
## inaug speeches."): Argument note not used.
mycorpus2 <- corpus(data_corpus_inaugural[53:58], note = "Last six inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[53:58], note = "Last six
## inaug speeches."): Argument note not used.
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
## Corpus consisting of 11 documents.
##
## Text Types Tokens Sentences
## 1789-Washington 625 1538 23
## 1793-Washington 96 147 4
## 1797-Adams 826 2578 37
## 1801-Jefferson 717 1927 41
## 1805-Jefferson 804 2381 45
## 1997-Clinton 773 2449 111
## 2001-Bush 621 1808 97
## 2005-Bush 773 2319 100
## 2009-Obama 938 2711 110
## 2013-Obama 814 2317 88
## 2017-Trump 582 1660 88
##
## Source: Combination of corpuses mycorpus1 and mycorpus2
## Created: Thu Aug 10 12:43:44 2017
## Notes:
The corpus_subset() function is defined for corpus objects, allowing a new corpus to be extracted based on logical conditions applied to docvars:
summary(corpus_subset(data_corpus_inaugural, Year > 1990))
## Corpus consisting of 7 documents.
##
## Text Types Tokens Sentences Year President FirstName
## 1993-Clinton 642 1833 81 1993 Clinton Bill
## 1997-Clinton 773 2449 111 1997 Clinton Bill
## 2001-Bush 621 1808 97 2001 Bush George W.
## 2005-Bush 773 2319 100 2005 Bush George W.
## 2009-Obama 938 2711 110 2009 Obama Barack
## 2013-Obama 814 2317 88 2013 Obama Barack
## 2017-Trump 582 1660 88 2017 Trump Donald J.
##
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes: http://www.presidency.ucsb.edu/inaugurals.php
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
## Corpus consisting of 2 documents.
##
## Text Types Tokens Sentences Year President FirstName
## 1797-Adams 826 2578 37 1797 Adams John
## 1825-Adams 1003 3152 74 1825 Adams John Quincy
##
## Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes: http://www.presidency.ucsb.edu/inaugurals.php
The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:
options(width = 200)
kwic(data_corpus_inaugural, "terror")
##
## [1797-Adams, 1325] fraud or violence, by | terror | , intrigue, or venality
## [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror | which paralyzes needed efforts to
## [1941-Roosevelt, 287] seemed frozen by a fatalistic | terror | , we proved that this
## [1961-Kennedy, 866] alter that uncertain balance of | terror | that stays the hand of
## [1981-Reagan, 813] freeing all Americans from the | terror | of runaway living costs.
## [1997-Clinton, 1055] They fuel the fanaticism of | terror | . And they torment the
## [1997-Clinton, 1655] maintain a strong defense against | terror | and destruction. Our children
## [2009-Obama, 1632] advance their aims by inducing | terror | and slaughtering innocents, we
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
##
## [1797-Adams, 1325] fraud or violence, by | terror | , intrigue, or venality
## [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror | which paralyzes needed efforts to
## [1941-Roosevelt, 287] seemed frozen by a fatalistic | terror | , we proved that this
## [1961-Kennedy, 866] alter that uncertain balance of | terror | that stays the hand of
## [1961-Kennedy, 990] of science instead of its | terrors | . Together let us explore
## [1981-Reagan, 813] freeing all Americans from the | terror | of runaway living costs.
## [1981-Reagan, 2196] understood by those who practice | terrorism | and prey upon their neighbors
## [1997-Clinton, 1055] They fuel the fanaticism of | terror | . And they torment the
## [1997-Clinton, 1655] maintain a strong defense against | terror | and destruction. Our children
## [2009-Obama, 1632] advance their aims by inducing | terror | and slaughtering innocents, we
## [2017-Trump, 1117] civilized world against radical Islamic | terrorism | , which we will eradicate
kwic(data_corpus_inaugural, "communist*")
##
## [1949-Truman, 834] the actions resulting from the | Communist | philosophy are a threat to
## [1961-Kennedy, 519] -- not because the | Communists | may be doing it,
In the above summary, Year and President are variables associated with each document. We can access such variables with the docvars() function.
# inspect the document-level variables
head(docvars(data_corpus_inaugural))
## Year President FirstName
## 1789-Washington 1789 Washington George
## 1793-Washington 1793 Washington George
## 1797-Adams 1797 Adams John
## 1801-Jefferson 1801 Jefferson Thomas
## 1805-Jefferson 1805 Jefferson Thomas
## 1809-Madison 1809 Madison James
# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
## $source
## [1] "Gerhard Peters and John T. Woolley. The American Presidency Project."
##
## $notes
## [1] "http://www.presidency.ucsb.edu/inaugurals.php"
##
## $created
## [1] "Tue Jun 13 14:51:47 2017"
More corpora are available from the quantedaData package.
In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix. “dfm” is short for document-feature matrix, and always refers to documents in rows and “features” as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them “features” rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.
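As a small illustration, here is a minimal sketch of forming bigram features with tokens_ngrams() (available in recent versions of quanteda):
# form bigrams such as "insurgents_killed", "killed_in", ...
tokens_ngrams(tokens("insurgents killed in ongoing fighting"), n = 2)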
To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document. tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.
txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
## tokens from 2 documents.
## text1 :
## [1] "This" "is" "$" "10" "in" "999" "different" "ways" "," "up" "and" "down" ";" "left" "and" "right"
## [17] "!"
##
## text2 :
## [1] "@kenbenoit" "working" ":" "on" "#quanteda" "2day" "4ever" "," "http" ":" "/"
## [12] "/" "textasdata.com" "?" "page" "=" "123" "."
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
## [1] "This" "is" "in" "different" "ways" "up" "and" "down" "left" "and" "right"
##
## text2 :
## [1] "@kenbenoit" "working" "on" "#quanteda" "2day" "4ever" "http" "textasdata.com" "page"
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
## [1] "This" "is" "10" "in" "999" "different" "ways" "up" "and" "down" "left" "and" "right"
##
## text2 :
## [1] "@kenbenoit" "working" "on" "#quanteda" "2day" "4ever" "http" "textasdata.com" "page" "123"
tokens(txt, remove_numbers = TRUE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "This" "is" "$" "in" "different" "ways" "," "up" "and" "down" ";" "left" "and" "right" "!"
##
## text2 :
## [1] "@kenbenoit" "working" ":" "on" "#quanteda" "2day" "4ever" "," "http" ":" "/"
## [12] "/" "textasdata.com" "?" "page" "=" "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "This" "is" "$" "10" "in" "999" "different" "ways" "," "up" "and" "down" ";" "left" "and" "right"
## [17] "!"
##
## text2 :
## [1] "@kenbenoit" "working" ":" "on" "#quanteda" "2day" "4ever" "," "http" ":" "/"
## [12] "/" "textasdata.com" "?" "page" "=" "123" "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "This" " " "is" " " "$" "10" " " "in" " " "999" " " "different" " " "ways" "," "\n"
## [17] " " "up" " " "and" " " "down" ";" " " "left" " " "and" " " "right" "!"
##
## text2 :
## [1] "@kenbenoit" " " "working" ":" " " "on" " " "#quanteda" " " "2day" "\t"
## [12] "4ever" "," " " "http" ":" "/" "/" "textasdata.com" "?" "page" "="
## [23] "123" "."
We also have the option to tokenize characters:
tokens("Great website: http://textasdata.com?page=123.", what = "character")
## tokens from 1 document.
## text1 :
## [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."
tokens("Great website: http://textasdata.com?page=123.", what = "character",
remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."
and sentences:
# sentence level
tokens(c("Kurt Vongeut said; only assholes use semi-colons.",
"Today is Thursday in Canberra: It is yesterday in London.",
"En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"),
what = "sentence")
## tokens from 3 documents.
## text1 :
## [1] "Kurt Vongeut said; only assholes use semi-colons."
##
## text2 :
## [1] "Today is Thursday in Canberra: It is yesterday in London."
##
## text3 :
## [1] "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"
Tokenizing texts is an intermediate option, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function, called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as toLower() – a separate function for lower-casing texts – and removes punctuation. All of the options to tokens() can be passed to dfm(), however.
myCorpus <- corpus_subset(data_corpus_inaugural, Year > 1990)
# make a dfm
myDfm <- dfm(myCorpus)
myDfm[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (0% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
## features
## docs my fellow citizens , today
## 1993-Clinton 7 5 2 139 10
## 1997-Clinton 6 7 7 131 5
## 2001-Bush 3 1 9 110 2
## 2005-Bush 2 3 6 120 3
## 2009-Obama 2 1 1 130 6
## 2013-Obama 3 3 6 99 4
## 2017-Trump 1 1 4 96 4
Other options for a dfm() include removing stopwords, and stemming the tokens.
# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (17.1% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
## features
## docs fellow citizen today celebr mysteri
## 1993-Clinton 5 2 10 4 1
## 1997-Clinton 7 8 6 1 0
## 2001-Bush 1 10 2 0 0
## 2005-Bush 3 7 3 2 0
## 2009-Obama 1 1 6 2 0
## 2013-Obama 3 8 6 1 0
## 2017-Trump 1 4 5 3 1
The option remove provides a list of tokens to be ignored. Most users will supply a list of pre-defined “stop words”, defined for numerous languages, accessed through the stopwords() function:
head(stopwords("english"), 20)
## [1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves" "you" "your" "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
head(stopwords("russian"), 10)
## [1] "и" "в" "во" "не" "что" "он" "на" "я" "с" "со"
head(stopwords("arabic"), 10)
## [1] "فى" "في" "كل" "لم" "لن" "له" "من" "هو" "هي" "قوة"
The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View() function. Calling plot() on a dfm will display a wordcloud using the wordcloud package.
mydfm <- dfm(data_char_ukimmig2010, remove = stopwords("english"), remove_punct = TRUE)
mydfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).
To access a list of the most frequently occurring features, we can use topfeatures():
topfeatures(mydfm, 20) # 20 top words
## immigration british people asylum britain uk system population country new immigrants ensure shall citizenship social national
## 66 37 35 29 28 27 27 21 20 19 17 17 17 16 14 14
## bnp illegal work percent
## 13 13 13 12
Plotting a word cloud is done using textplot_wordcloud(), for a dfm class object. This function passes arguments through to wordcloud() from the wordcloud package, and can prettify the plot using the same arguments:
set.seed(100)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
rot.per = .25,
colors = RColorBrewer::brewer.pal(8,"Dark2"))
Often, we are interested in analyzing how texts differ according to substantive factors that may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents which share the same value for a document variable when creating a dfm:
byPartyDfm <- dfm(data_corpus_irishbudget2010, groups = "party", remove = stopwords("english"), remove_punct = TRUE)
We can sort this dfm, and inspect it:
dfm_sort(byPartyDfm)[, 1:10]
## Document-feature matrix of: 5 documents, 10 features (0% sparse).
## 5 x 10 sparse Matrix of class "dfmSparse"
## features
## docs people budget government public minister tax economy pay jobs billion
## FF 23 44 47 65 11 60 37 41 41 32
## FG 78 71 61 47 62 11 20 29 17 21
## LAB 69 66 36 32 54 47 37 24 20 34
## SF 81 53 73 31 39 34 50 24 27 29
## Green 15 26 19 4 4 11 16 4 15 3
Note that the most frequently occurring feature is “will”, a word usually on English stop lists, but one that is not included in quanteda’s built-in English stopword list.
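If we wanted to remove “will” as well, a minimal sketch is simply to append it to the vector of removed tokens when constructing the grouped dfm:
byPartyDfm2 <- dfm(data_corpus_irishbudget2010, groups = "party",
                   remove = c(stopwords("english"), "will"), remove_punct = TRUE)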
For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.
For example, let’s look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. From the original corpus, we select Presidents since Clinton:
recentCorpus <- corpus_subset(data_corpus_inaugural, Year > 1991)
Now we define a demonstration dictionary:
myDict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"),
economy = c("jobs", "business", "grow", "work")))
We can use the dictionary when making the dfm:
byPresMat <- dfm(recentCorpus, dictionary = myDict)
byPresMat
## Document-feature matrix of: 7 documents, 2 features (14.3% sparse).
## 7 x 2 sparse Matrix of class "dfmSparse"
## features
## docs terror economy
## 1993-Clinton 0 8
## 1997-Clinton 1 8
## 2001-Bush 0 4
## 2005-Bush 1 6
## 2009-Obama 1 10
## 2013-Obama 1 6
## 2017-Trump 1 5
The constructor function dictionary() also works with two common “foreign” dictionary formats: the LIWC and Provalis Research’s Wordstat formats. For instance, we can load the LIWC dictionary and apply it to the Presidential inaugural speech corpus:
liwcdict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2001_English.dic",
format = "LIWC")
liwcdfm <- dfm(data_corpus_inaugural[52:58], dictionary = liwcdict, verbose = FALSE)
liwcdfm[, 1:10]
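We can also compute similarities between documents, using textstat_simil(). Here we compare the two Obama inaugural addresses to all other post-1980 addresses, by cosine similarity: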
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980),
remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
obamaSimil <- textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"),
margin = "documents", method = "cosine")
obamaSimil
## 2009-Obama 2013-Obama
## 2009-Obama 1.0000000 0.6815711
## 2013-Obama 0.6815711 1.0000000
## 1981-Reagan 0.6229949 0.6376412
## 1985-Reagan 0.6434472 0.6629428
## 1989-Bush 0.6253944 0.5784290
## 1993-Clinton 0.6280946 0.6265428
## 1997-Clinton 0.6593018 0.6466030
## 2001-Bush 0.6018113 0.6193608
## 2005-Bush 0.5266249 0.5867178
## 2017-Trump 0.5192075 0.5160104
# dotchart(as.list(obamaSimil)$"2009-Obama", xlab = "Cosine similarity")
We can use these distances to plot a dendrogram, clustering presidents:
data(data_corpus_SOTU, package = "quantedaData")
presDfm <- dfm(corpus_subset(data_corpus_SOTU, Date > as.Date("1980-01-01")),
stem = TRUE, remove_punct = TRUE,
remove = stopwords("english"))
presDfm <- dfm_trim(presDfm, min_count = 5, min_docfreq = 3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- textstat_dist(dfm_weight(presDfm, "relfreq"))
# hierarchical clustering of the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")
(try it!)
We can also look at term similarities:
sim <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", margin = "features")
lapply(as.list(sim), head, 10)
## $fair
## economi begin jefferson author faith call struggl best creat courag
## 0.9080252 0.9075951 0.8981462 0.8944272 0.8866586 0.8608285 0.8451543 0.8366600 0.8347300 0.8326664
##
## $health
## shape generat wrong common knowledg planet task demand eye defin
## 0.9045340 0.8971180 0.8944272 0.8888889 0.8888889 0.8819171 0.8728716 0.8666667 0.8660254 0.8642416
##
## $terror
## potenti adversari commonplac miracl racial bounti martin dream polit guarante
## 0.9036961 0.9036961 0.8944272 0.8944272 0.8944272 0.8944272 0.8944272 0.8624394 0.8500000 0.8485281
We have a lot of development work to do on the textmodel() function, but here is a demonstration of unsupervised document scaling using the “wordfish” model:
# fit a wordfish scaling model to the speeches from the 2010 Irish budget debate
ieDfm <- dfm(data_corpus_irishbudget2010)
textmodel(ieDfm, model = "wordfish", dir = c(2, 1))
## Fitted wordfish model:
## Call:
## textmodel_wordfish.dfm(x = x, dir = ..1)
##
## Estimated document positions:
##
## Documents theta SE lower upper
## 1 2010_BUDGET_01_Brian_Lenihan_FF 1.8209414 0.02032315 1.78110806 1.86077480
## 2 2010_BUDGET_02_Richard_Bruton_FG -0.5932871 0.02818870 -0.64853690 -0.53803720
## 3 2010_BUDGET_03_Joan_Burton_LAB -1.1136878 0.01540286 -1.14387741 -1.08349818
## 4 2010_BUDGET_04_Arthur_Morgan_SF -0.1219203 0.02846354 -0.17770888 -0.06613179
## 5 2010_BUDGET_05_Brian_Cowen_FF 1.7724143 0.02364059 1.72607877 1.81874987
## 6 2010_BUDGET_06_Enda_Kenny_FG -0.7145831 0.02650301 -0.76652898 -0.66263717
## 7 2010_BUDGET_07_Kieran_ODonnell_FG -0.4844865 0.04171530 -0.56624848 -0.40272450
## 8 2010_BUDGET_08_Eamon_Gilmore_LAB -0.5616612 0.02967423 -0.61982268 -0.50349971
## 9 2010_BUDGET_09_Michael_Higgins_LAB -0.9703175 0.03850625 -1.04578972 -0.89484520
## 10 2010_BUDGET_10_Ruairi_Quinn_LAB -0.9589295 0.03892458 -1.03522168 -0.88263733
## 11 2010_BUDGET_11_John_Gormley_Green 1.1807223 0.07221386 1.03918317 1.32226151
## 12 2010_BUDGET_12_Eamon_Ryan_Green 0.1866512 0.06294145 0.06328595 0.31001643
## 13 2010_BUDGET_13_Ciaran_Cuffe_Green 0.7422014 0.07245394 0.60019164 0.88421107
## 14 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1840577 0.03666326 -0.25591771 -0.11219775
##
## Estimated feature scores: showing first 30 beta-hats for features
##
## when i presented the supplementary budget to this house last april ,
## -0.09929396 0.38792877 0.39870512 0.25585084 1.11577625 0.09906187 0.36998571 0.30684327 0.19897567 0.28962672 -0.09535080 0.34525935
## said we could work our way through period of severe economic distress
## -0.71939992 0.47983038 -0.52985450 0.58218329 0.74364373 0.33602301 0.65973997 0.55612703 0.33922983 1.27902260 0.47857924 1.84448008
## . today can report that notwithstanding
## 0.27343280 0.17410070 0.36369449 0.69166796 0.08824170 1.84448008
quanteda makes it very easy to fit topic models as well, e.g.:
quantdfm <- dfm(data_corpus_irishbudget2010,
remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"))
quantdfm <- dfm_trim(quantdfm, min_count = 4, max_docfreq = 10, verbose = TRUE)
## Removing features occurring:
## - fewer than 4 times: 3,427
## - in more than 10 documents: 72
## Total features removed: 3,499 (73.5%).
quantdfm
## Document-feature matrix of: 14 documents, 1,263 features (64.5% sparse).
if (require(topicmodels)) {
myLDAfit20 <- LDA(convert(quantdfm, to = "topicmodels"), k = 20)
get_terms(myLDAfit20, 5)
}
## Loading required package: topicmodels
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16
## [1,] "hit" "measures" "failed" "levy" "welfare" "million" "kind" "fianna" "payment" "child" "taoiseach" "million" "welfare" "million" "fianna" "fianna"
## [2,] "hospital" "spending" "strategy" "million" "workers" "local" "imagination" "fáil" "measures" "benefit" "employees" "welfare" "system" "investment" "fáil" "taxation"
## [3,] "family" "increases" "needed" "carbon" "million" "rate" "policies" "national" "aware" "day" "rate" "even" "government's" "scheme" "side" "high"
## [4,] "system" "recent" "ministers" "change" "child" "measures" "wit" "irish" "child" "bank" "referred" "measures" "policies" "review" "level" "system"
## [5,] "allowance" "reduced" "system" "welfare" "fianna" "level" "create" "support" "day" "society" "debate" "investment" "child" "spending" "third" "fáil"
## Topic 17 Topic 18 Topic 19 Topic 20
## [1,] "level" "support" "society" "taoiseach"
## [2,] "million" "back" "equal" "fine"
## [3,] "reduction" "continue" "enterprising" "gael"
## [4,] "measures" "investment" "sense" "may"
## [5,] "create" "million" "nation" "irish"