Segment corpus text(s) or a character vector into tokens, sentences, paragraphs, or other sections. corpus_segment works on a corpus object and char_segment on a character vector; both allow the delimiters to be user-defined. This is useful for breaking the texts of a corpus into smaller documents based on sentences, or based on a user-defined "tag" pattern. See Details.

corpus_segment(x, what = c("sentences", "paragraphs", "tokens", "tags",
  "other"), delimiter = NULL, valuetype = c("regex", "fixed", "glob"),
  omit_empty = TRUE, use_docvars = TRUE, ...)

char_segment(x, what = c("sentences", "paragraphs", "tokens", "tags",
  "other"), delimiter = NULL, valuetype = c("regex", "fixed", "glob"),
  omit_empty = TRUE, use_docvars = TRUE, ...)

Arguments

x

character or corpus object whose texts will be segmented

what

unit of segmentation. Current options are "sentences" (default), "paragraphs", "tokens", "tags", and "other".

Segmenting on "other" allows segmentation of a text on any user-defined value, and must be accompanied by the delimiter argument. Segmenting on "tags" performs the same function but preserves the tags as a document variable in the segmented corpus.

delimiter

delimiter pattern for segmentation, interpreted according to valuetype (a regex by default); only relevant for what = "paragraphs" (where the default is two newlines), "tags" (where the default is a tag preceded by two pound or "hash" signs ##), and "other".

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

omit_empty

if TRUE, empty texts are removed

use_docvars

(for corpus objects only) if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented corpus.

...

provides additional arguments passed to tokens, if what = "tokens" is used

Value

corpus_segment returns a corpus of segmented texts, with a tag docvar if what = "tags". char_segment returns a character vector of segmented texts.

Details

Tokens are delimited by separators. For tokens and sentences, these are determined by the tokenizer behaviour in tokens. For paragraphs, the default is two newlines, although this could be changed to a single newline by changing the value of delimiter to "\\n{1}", which is the R version of the regex for one newline character. (You might need this if the document was created in a word processor, for instance, and the lines were wrapped in the window rather than being hard-wrapped with a newline character.)
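The difference between the two paragraph delimiters can be illustrated with a minimal base R sketch. It uses strsplit() rather than char_segment(), so it shows only how the two regular expressions behave and is not quanteda's own implementation; the sample text txt is invented for illustration.

```r
# Base R illustration only: char_segment() is not called here.
# Hypothetical sample text: two soft-wrapped lines, a blank line, then a new paragraph.
txt <- "First paragraph, line one.\nFirst paragraph, line two.\n\nSecond paragraph."

# Default paragraph delimiter: two consecutive newlines
two_newlines <- strsplit(txt, "\\n{2}")[[1]]

# Alternative delimiter: every single newline starts a new segment
one_newline <- strsplit(txt, "\\n{1}")[[1]]

length(two_newlines)  # 2 segments
length(one_newline)   # 4 segments, one of which is the empty string
```

With the single-newline pattern, the blank line yields an empty segment in this sketch. The omit_empty argument described above is documented as removing empty texts; whether quanteda produces an empty text in exactly this situation is an assumption, not something this sketch verifies.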

Note

Does not currently record document segments if segmenting a multi-text corpus into smaller units. For this, use corpus_reshape instead.

Using delimiters

One of the most common uses for corpus_segment is to partition a corpus into sub-documents using tags. By default, a tag is any word that begins with a double "hash" sign and is followed by whitespace. This can be modified, but be careful to keep the syntax for the trailing word boundary (\\b). The default values for delimiter are, according to what:

paragraphs

"\\n{2}", a regular expression meaning two newlines. If you wish to define a paragraph as a single newline, change the 2 to a 1.

tags

"##\\w+\\b", a regular expression meaning two "hash" characters followed by any number of word characters followed by a word boundary (a whitespace or the end of the text).

other

No default; user must supply one.

tokens, sentences

Delimiters do not apply to these, and a warning will be issued if you attempt to supply one.

Delimiters may be defined for other valuetypes, but these may produce unexpected results; for example, a "glob" expression has no way to define word boundaries.
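To see what the default tag pattern matches, here is a minimal base R sketch. It applies the regular expression "##\\w+\\b" with gregexpr() and strsplit() instead of calling corpus_segment(), so it demonstrates the pattern only and makes no claim about how quanteda implements tag segmentation. The sample text is a shortened version of the one used in the Examples; perl = TRUE is chosen here as an assumption to make the \\w and \\b handling explicit.

```r
# Base R illustration only: corpus_segment() is not called here.
txt <- "##INTRO This is the introduction. ##DOC1 This is the first document."
tag_pattern <- "##\\w+\\b"

# Extract every substring that matches the tag pattern
tags <- regmatches(txt, gregexpr(tag_pattern, txt, perl = TRUE))[[1]]

# Split the text on the tags, drop the empty piece before the first tag,
# and trim the surrounding whitespace
pieces <- strsplit(txt, tag_pattern, perl = TRUE)[[1]]
pieces <- trimws(pieces[nzchar(pieces)])

tags
pieces
```

In this sketch, tags holds the two tag strings and pieces holds the text that follows each tag, which mirrors the pairing of the tag docvar with the segmented texts described under Value.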

See also

corpus_reshape, tokens

Examples

## segmenting a corpus
testCorpus <- corpus(c("##INTRO This is the introduction. ##DOC1 This is the first document. Second sentence in Doc 1. ##DOC3 Third document starts here. End of third document.",
                       "##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
# add a docvar
testCorpus[["serialno"]] <- paste0("textSerial", 1:ndoc(testCorpus))
testCorpusSeg <- corpus_segment(testCorpus, "tags")
summary(testCorpusSeg)
#> Corpus consisting of 6 documents.
#>
#>     Text Types Tokens Sentences    serialno      tag
#>  text1.1     5      5         1 textSerial1  ##INTRO
#>  text1.2    11     12         2 textSerial1   ##DOC1
#>  text1.3     8     10         2 textSerial1   ##DOC3
#>  text2.1     1      1         1 textSerial2  ##INTRO
#>  text2.2     3      3         1 textSerial2 ##NUMBER
#>  text2.3     2      2         1 textSerial2 ##NUMBER
#>
#> Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/reference/* on x86_64 by kbenoit
#> Created: Thu Aug 10 12:42:42 2017
#> Notes: corpus_segment.corpus(testCorpus, "tags")
#>
texts(testCorpusSeg)
#> text1.1
#> "This is the introduction."
#> text1.2
#> "This is the first document. Second sentence in Doc 1."
#> text1.3
#> "Third document starts here. End of third document."
#> text2.1
#> "Document"
#> text2.2
#> "Two starts before"
#> text2.3
#> "Three."
# segment a corpus into sentences
segmentedCorpus <- corpus_segment(corpus(data_char_ukimmig2010), "sentences")
summary(segmentedCorpus)
#> Corpus consisting of 207 documents, showing 100 documents.
#>
#>            Text Types Tokens Sentences
#>           BNP.1   145    257         1
#>           BNP.2    18     18         1
#>           BNP.3    26     29         1
#>           BNP.4    30     38         1
#>           BNP.5   103    191         1
#>           BNP.6    25     29         1
#>           BNP.7    58     90         1
#>           BNP.8    40     48         1
#>           BNP.9    72    106         1
#>          BNP.10    40     69         1
#>          BNP.11    38     47         1
#>          BNP.12    42     54         1
#>          BNP.13    26     31         1
#>          BNP.14    29     31         1
#>          BNP.15    30     34         1
#>          BNP.16     9      9         1
#>          BNP.17    10     10         1
#>          BNP.18    42     67         1
#>          BNP.19    50     63         1
#>          BNP.20    41     52         1
#>          BNP.21   103    169         1
#>          BNP.22    35     40         1
#>          BNP.23    25     29         1
#>          BNP.24    45     54         1
#>          BNP.25    25     27         1
#>          BNP.26    18     19         1
#>          BNP.27    36     40         1
#>          BNP.28    29     32         1
#>          BNP.29    28     30         1
#>          BNP.30    21     22         1
#>          BNP.31    28     30         1
#>          BNP.32    29     31         1
#>          BNP.33    18     21         1
#>          BNP.34    14     14         1
#>          BNP.35    26     29         1
#>          BNP.36    30     35         1
#>          BNP.37    14     14         1
#>          BNP.38    31     34         1
#>          BNP.39     2      2         1
#>          BNP.40    16     19         1
#>          BNP.41    23     24         1
#>          BNP.42    17     18         1
#>          BNP.43    18     21         1
#>          BNP.44    26     31         1
#>          BNP.45    23     25         1
#>          BNP.46     2      2         1
#>          BNP.47    24     27         1
#>          BNP.48    20     21         1
#>          BNP.49    19     23         1
#>          BNP.50    26     32         1
#>          BNP.51    30     35         1
#>          BNP.52    25     28         1
#>          BNP.53    29     32         1
#>          BNP.54    24     26         1
#>          BNP.55     2      2         1
#>          BNP.56    39     45         1
#>          BNP.57    25     30         1
#>          BNP.58    23     24         1
#>          BNP.59     2      2         1
#>          BNP.60    24     32         1
#>          BNP.61    12     15         1
#>          BNP.62    31     35         1
#>          BNP.63     2      2         1
#>          BNP.64    20     22         1
#>          BNP.65    18     21         1
#>          BNP.66     2      2         1
#>          BNP.67    32     34         1
#>          BNP.68     2      2         1
#>          BNP.69    16     18         1
#>          BNP.70    18     21         1
#>          BNP.71     2      2         1
#>          BNP.72    83    143         1
#>          BNP.73    38     49         1
#>          BNP.74    46     60         1
#>          BNP.75    32     34         1
#>          BNP.76    25     26         1
#>          BNP.77    35     41         1
#>          BNP.78    23     26         1
#>          BNP.79    32     41         1
#>          BNP.80    34     41         1
#>          BNP.81    21     22         1
#>          BNP.82    22     23         1
#>          BNP.83    20     21         1
#>          BNP.84    18     22         1
#>          BNP.85    31     39         1
#>          BNP.86    25     26         1
#>          BNP.87    14     15         1
#>          BNP.88    30     33         1
#>     Coalition.1     2      2         1
#>     Coalition.2    26     29         1
#>     Coalition.3    71    117         1
#>     Coalition.4    72    112         1
#>  Conservative.1     9      9         1
#>  Conservative.2    25     30         1
#>  Conservative.3    12     12         1
#>  Conservative.4    24     26         1
#>  Conservative.5    22     28         1
#>  Conservative.6    58     76         1
#>  Conservative.7    34     37         1
#>  Conservative.8    16     16         1
#>
#> Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/reference/* on x86_64 by kbenoit
#> Created: Thu Aug 10 12:42:42 2017
#> Notes: corpus_segment.corpus(corpus(data_char_ukimmig2010), "sentences")
#>
## segmenting a character object
# same as tokens()
identical(as.character(tokens(data_char_ukimmig2010)),
          as.character(char_segment(data_char_ukimmig2010, what = "tokens")))
#> [1] TRUE
# segment into paragraphs
char_segment(data_char_ukimmig2010[3:4], "paragraphs")
#> Conservative.1
#> "Attract the brightest and best to our country."
#> Conservative.2
#> "Immigration has enriched our nation over the years and we want to attract the brightest and the best people who can make a real difference to our economic growth. But immigration today is too high and needs to be reduced. We do not need to attract people to do jobs that could be carried out by British citizens, given the right training and support. So we will take steps to take net migration back to the levels of the 1990s - tens of thousands a year, not hundreds of thousands."
#> Conservative.3
#> "To help achieve this goal, we will introduce a number of measures, such as:"
#> Conservative.4
#> "- Setting an annual limit on the number of non-EU economic migrants admitted into the UK to live and work;"
#> Conservative.5
#> "- Limiting access only to those who will bring the most value to the British economy; and,"
#> Conservative.6
#> "- Applying transitional controls as a matter of course in the future for all new EU member States."
#> Conservative.7
#> "In addition, we will promote integration into British society, as we believe that everyone coming to this country must be ready to embrace our core values and become a part of their local community. So there will be an English language test for anyone coming here to get married."
#> Conservative.8
#> "We want to encourage students to come to our universities and colleges, but our student visa system has become the biggest weakness in our border controls. A Conservative government will strengthen the system of granting student visas so that it is less open to abuse. We want to make it easier for reputable universities and colleges to accept applications, while putting extra scrutiny on new institutions looking to accept foreign students or existing institutions not registered with Companies house. In addition, we will:"
#> Conservative.9
#> "- Insist foreign students at new or unregistered institutions pay a bond in order to study in this country, to be repaid after the student has left the country at the end of their studies;"
#> Conservative.10
#> "- Ensure foreign students can prove that they have the financial means to support themselves in the UK; and,"
#> Conservative.11
#> "- Require that students must usually leave the country and reapply if they want to switch to another course or apply for a work permit."
#> Conservative.12
#> "Extremists, serious criminals and others find our borders far too easy to penetrate. That is why we will create a dedicated border Police force, as part of a refocused Serious Organised Crime agency, to enhance national security, improve immigration controls, and crack down on the trafficking of people, weapons and drugs. We will work with police forces to strengthen arrangements to deal with serious crime and other cross-boundary policing challenges, and extend collaboration between forces to deliver better value for money."
#> Greens.1
#> "Immigration."
#> Greens.2
#> "Migration is a fact of life. People have always moved from one country to another, and as a practical matter the ability to control borders without oppressive measures is more limited than most politicians like to pretend. Much of our language, culture and way of life have been enriched by successive new arrivals over two thousand years. It is not just a matter of immigration: over 5 million British Citizens benefit from other countries' liberal immigration policies by living abroad."
#> Greens.3
#> "The causes of a person moving to the UK are complex. For the person concerned, there may be escape from persecution and improved economic prospects, but also separation from home, friends and family. For the country of origin, there may be the loss of skilled workers, especially health professionals, but also the receipt of remittances from the immigrant, and many migrants return with improved skills. For the community that receives the immigrant there may be the benefits of getting done jobs that no one in that community wants to or can do, more taxes being paid and the creation of a more cosmopolitan atmosphere. But there may also be costs in terms of unwelcome competition for jobs, pressure on housing and other resources and longer-term pressures on overall population."
#> Greens.4
#> "In deciding policy on immigration it is important that all these points are considered and balanced against each other. We must accept too our legal and moral obligations to give sanctuary to those fleeing persecution, and the principle of free movement throughout the European Union. Against this background our policy is as follows:"
#> Greens.5
#> "- Where we are limiting numbers, our priority must be to meet our obligations to refugees and those seeking sanctuary, including the increasing numbers of people displaced by environmental change, above the needs of our economy."
#> Greens.6
#> "- Our immigration policies must be fair and non-discriminatory, respect the integrity of families and be applied promptly and effectively."
#> Greens.7
#> "- Our international policies should every- where seek to reduce the economic, political and environmental factors that force people to migrate. Emigration should be a positive choice, not the outcome of desperation. In particular, free movement within the EU is a fact. We should press for EU policies that make all parts of the EU an attractive place to live."
#> Greens.8
#> "- We reject the use of immigration as a political issue to mask problems such as a lack of high-quality social housing. The proper solution is to provide enough social housing, as we propose elsewhere in this manifesto."
#> Greens.9
#> "- We should not tolerate the long-term presence of large numbers of people whose immigration status is not defined. Such immigrants are vulnerable to exploitation by unscrupulous employers and others, under- mining national terms and conditions of employment. We would open up ways for existing illegal migrants who have been here for three years to become legal. In particular, a legal status must be provided for people who have not succeeded in their claim for humanitarian protection but who cannot be returned to their country of origin due to the political situation there."
#> Greens.10
#> "- We would review the asylum procedures to ensure that destitution plays no role in the asylum process by allowing those seeking sanctuary to work."
#> Greens.11
#> "- We would review the Nationality, Immigration and Asylum Act 2002, particularly with regard to issues of access to legal advice, childcare and levels of subsistence allowance."
#> Greens.12
#> "- Those who have been trafficked should not be subject to summary deportation. They should receive a temporary right to stay and have the same right to apply to remain as others seeking to migrate."
#> Greens.13
#> "- Those seeking sanctuary should not be detained, and in particular the administrative detention of children is unacceptable and should cease immediately."
# segment a text into sentences
segmentedChar <- char_segment(data_char_ukimmig2010, "sentences")
segmentedChar[3]
#> BNP.3
#> "In the absence of urgent action, we, the indigenous British people, will be reduced to minority status in our own ancestral homeland within two generations."