qdapRegex

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, person tags, phone numbers, times, and zip codes.

The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi. Instead, it provides access to canned, common regular expression patterns that can be used within qdapRegex, with R's own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.
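As a minimal sketch of using a canned pattern outside the rm_* functions, the snippet below pulls the hash tag pattern out of the default dictionaries with grab() and hands it to base R and to stringi. The "@rm_hash" lookup name and the `tweets` vector are illustrative assumptions; perl = TRUE is passed on the assumption that canned patterns may rely on PCRE features such as lookarounds.

```r
library(qdapRegex)

# Look up a canned pattern by name; the "@" prefix is assumed to mark a
# dictionary lookup rather than a literal regular expression.
hash_pattern <- grab("@rm_hash")

# Illustrative input (first element borrowed from the examples in this README).
tweets <- c("@hadley I like #rstats for #ggplot2 work.", "No tags here.")

# Base R: gregexpr() finds every match, regmatches() pulls the text out.
base_hits <- regmatches(tweets, gregexpr(hash_pattern, tweets, perl = TRUE))

# stringi: the same pattern passed to the ICU regex engine.
stringi_hits <- stringi::stri_extract_all_regex(tweets, hash_pattern)

base_hits
stringi_hits
```

Both calls return a list with one element per input string. The two engines differ in how they report "no match" (an empty character vector versus NA), so downstream code should not assume the results are interchangeable.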

The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region-specific regular expression dictionaries that contain the same fields as the regex_usa data set, or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues
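To make the dictionary idea concrete, here is a small sketch that inspects the default dictionary. It assumes regex_usa is a named list with one pattern per entry and that rm_zip() looks its pattern up under the name "rm_zip" by default; the sample sentence is made up for illustration.

```r
library(qdapRegex)

# Assumption: regex_usa is a named list, one canned pattern per name.
head(names(regex_usa))

# Inspect a single entry, here the zip code pattern.
zip_pattern <- regex_usa[["rm_zip"]]
zip_pattern

# rm_zip() is assumed to use this entry when no other pattern is supplied.
zips <- rm_zip("Mail it to Buffalo, NY 14217 today.", extract = TRUE)
zips
```

A regional dictionary would, under the same assumption, be another named list using the same entry names with patterns adapted to local formats.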

Educational

The qdapRegex package serves a dual purpose: it is both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way that I learned guitar by trying to mimic Eric Clapton rather than by learning scales and theory, some folks may enjoy learning regular expressions through a more pragmatic, experiential approach. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposure to information in a purposeful context results in acquired knowledge.

The following regular expressions sites were very helpful to my own regular expression education:

  1. Regular-Expression.info
  2. Rex Egg
  3. Regular Expressions as used in R
  4. Debuggex (Visualizing Regex)

Being able to discuss and ask questions is also important to learning, in this case learning regular expressions. I have found the following forums extremely helpful for learning about regular expressions:

  1. Talk Stats + Posting Guidelines
  2. stackoverflow + Posting Guidelines

Installation

To download the development version of qdapRegex:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/qdapRegex")

Help

Contact

You are welcome to:

Examples

The following examples demonstrate some of the functionality of qdapRegex.

library(qdapRegex)

Extract Citations

w <- c("Hello World (V. Raptor, 1986) bye",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)

rm_citation(w, extract=TRUE)
## [[1]]
## [1] "V. Raptor, 1986"
## 
## [[2]]
## [1] "Rinker, 2014"
## 
## [[3]]
## [1] "The R Core Team (2014)"
## 
## [[4]]
## [1] "Bunn (2005)"
## 
## [[5]]
## [1] "Baer, 2005"
## 
## [[6]]
## [1] "Wickham's (in press)"
## 
## [[7]]
## [1] "Rinker's (n.d.)"
## 
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
## 
## [[9]]
## [1] "Uwe Ligges (2007)"

Extract Twitter Hash Tags, Name Tags, & URLs

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
        http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
        presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

rm_hash(x, extract=TRUE)
## [[1]]
## [1] "#rstats"  "#ggplot2"
## 
## [[2]]
## [1] "#magrittr" "#pipeR"    "#rstats"  
## 
## [[3]]
## [1] "#user2014"
rm_tag(x, extract=TRUE)
## [[1]]
## [1] "@hadley"
## 
## [[2]]
## [1] "@timelyportfolio"
## 
## [[3]]
## [1] "@ramnath_vaidya"
rm_url(x, extract=TRUE)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] "http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html"
## 
## [[3]]
## [1] "http://ramnathv.github.io/user2014-rcharts/#1"

Extract Bracketed Text

y <- c("I love chicken [unintelligible]!", 
    "Me too! (laughter) It's so good.[interrupting]",
    "Yep it's awesome {reading}.", "Agreed. {is so much fun}")

rm_bracket(y, extract=TRUE)
## [[1]]
## [1] "unintelligible"
## 
## [[2]]
## [1] "laughter"     "interrupting"
## 
## [[3]]
## [1] "reading"
## 
## [[4]]
## [1] "is so much fun"
rm_curly(y, extract=TRUE)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "reading"
## 
## [[4]]
## [1] "is so much fun"
rm_round(y, extract=TRUE)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] "laughter"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA
rm_square(y, extract=TRUE)
## [[1]]
## [1] "unintelligible"
## 
## [[2]]
## [1] "interrupting"
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA

Extract Numbers

z <- c("-2 is an integer.  -4.3 and 3.33 are not.",
    "123,456 is a lot more than -.2",
    "hello world -.q")
rm_number(z)
## [1] "is an integer. and are not." "is a lot more than"         
## [3] "hello world -.q"
rm_number(z, extract=TRUE)
## [[1]]
## [1] "-2"   "-4.3" "3.33"
## 
## [[2]]
## [1] "123,456" "-.2"    
## 
## [[3]]
## [1] NA

Help topics

Removing/Extracting/Replacing

Functions for removing/extracting/replacing text using regular expressions.

  • rm_default
    Remove/Replace/Extract Template
  • rm_
    Remove/Replace/Extract Function Generator
  • rm_abbreviation
    Remove/Replace/Extract Abbreviations
  • rm_between(rm_between_multiple)
    Remove/Replace/Extract Strings Between 2 Markers
  • rm_bracket(rm_angle, rm_bracket_multiple, rm_curly, rm_round, rm_square)
    Remove/Replace/Extract Brackets
  • rm_caps
    Remove/Replace/Extract All Caps
  • rm_caps_phrase
    Remove/Replace/Extract All Caps Phrases
  • rm_citation
    Remove/Replace/Extract Citations
  • rm_citation_tex
    Remove/Replace/Extract LaTeX Citations
  • rm_city_state
    Remove/Replace/Extract City & State
  • rm_city_state_zip
    Remove/Replace/Extract City, State, & Zip
  • rm_date
    Remove/Replace/Extract Dates
  • rm_dollar
    Remove/Replace/Extract Dollars
  • rm_email
    Remove/Replace/Extract Email Addresses
  • rm_emoticon
    Remove/Replace/Extract Emoticons
  • rm_endmark
    Remove/Replace/Extract Endmarks
  • rm_hash
    Remove/Replace/Extract Hash Tags
  • rm_nchar_words
    Remove/Replace/Extract N Letter Words
  • rm_non_ascii
    Remove/Replace/Extract Non-ASCII
  • rm_number
    Remove/Replace/Extract Numbers
  • rm_percent
    Remove/Replace/Extract Percentages
  • rm_phone
    Remove/Replace/Extract Phone Numbers
  • rm_postal_code
    Remove/Replace/Extract Postal Codes
  • rm_repeated_characters
    Remove/Replace/Extract Words With Repeating Characters
  • rm_repeated_phrases
    Remove/Replace/Extract Repeating Phrases
  • rm_repeated_words
    Remove/Replace/Extract Repeating Words
  • rm_tag
    Remove/Replace/Extract Person Tags
  • rm_time
    Remove/Replace/Extract Time
  • rm_title_name
    Remove/Replace/Extract Title + Person Name
  • rm_url(rm_twitter_url)
    Remove/Replace/Extract URLs
  • rm_white(rm_white_bracket, rm_white_colon, rm_white_comma, rm_white_endmark, rm_white_lead, rm_white_lead_trail, rm_white_multiple, rm_white_punctuation, rm_white_trail)
    Remove/Replace/Extract White Space
  • rm_zip
    Remove/Replace/Extract Zip Codes
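
The "Remove/Replace/Extract" wording in the list above corresponds to three ways of calling the same function. The sketch below uses rm_hash() as a representative; the `replacement` argument and the "<HASH>" placeholder are assumptions made for illustration, while `extract = TRUE` is the form used in the examples earlier in this README.

```r
library(qdapRegex)

x <- "@hadley I like #rstats for #ggplot2 work."

# Remove: the default behavior strips every match from the text.
removed <- rm_hash(x)

# Extract: return the matches themselves, one list element per input string.
extracted <- rm_hash(x, extract = TRUE)

# Replace: substitute each match with a placeholder of your choosing.
replaced <- rm_hash(x, replacement = "<HASH>")

removed
extracted
replaced
```

The other rm_* functions are expected to follow the same calling convention, so switching the kind of text being targeted should only require changing the function name.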

Testing

Functions for testing regular expressions.

  • is.regex
    Test Regular Expression Validity
  • validate
    Regex Validation Function Generator
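
As a brief illustration of the testing helpers, the sketch below checks a couple of candidate patterns with is.regex() before they are used. The candidate patterns are made up for the example, and vapply() is used on the assumption that is.regex() should be called on one pattern at a time.

```r
library(qdapRegex)

# Hypothetical patterns: one well formed, one with an unclosed group.
candidates <- c(
    valid      = "\\d{3}-\\d{4}",
    unbalanced = "(\\d+"
)

# Check each pattern individually; the result is a named logical vector.
checks <- vapply(candidates, is.regex, logical(1))
checks

# Keep only the patterns that compile.
usable <- candidates[checks]
usable
```

Screening patterns this way is useful when they come from user input or a configuration file, where a malformed expression would otherwise raise an error deep inside a later call.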

Educational

Functions used within qdapRegex that are intended for education around regular expressions.

  • cheat
    A Cheat Sheet of Common Regex Task Chunks
  • grab
    Grab Regular Expressions from Dictionaries
  • explain
    Visualize Regular Expressions

qdapRegex Tools

Other functions used within qdapRegex that are not specific to removing/extracting/replacing text with regular expressions.

  • bind
    Add Left/Right Character(s) Boundaries
  • grab
    Grab Regular Expressions from Dictionaries
  • group
    Group Regular Expressions
  • escape
    Escape Strings From Parsing
  • pastex(%+%, %|%)
    Paste Regular Expressions
  • S
    Use C-style String Formatting Commands
  • TC(L, U)
    Upper/Lower/Title Case
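
The helpers above can be combined to assemble a pattern piece by piece. The following is a sketch only: it assumes pastex() joins its arguments with "|" by default, that group() wraps its input in parentheses, and that bind() accepts a `left` argument that is reused on the right-hand side. The words and test strings are invented for the example.

```r
library(qdapRegex)

# Join two alternatives into a single alternation.
animals <- pastex("cat", "dog")

# Wrap the alternation in a group, then add word boundaries on both sides.
whole_word <- bind(group(animals), left = "\\b")
whole_word

# Use the assembled pattern with base R.
hits <- grepl(whole_word, c("my dog barks", "hotdog stand"), perl = TRUE)
hits
```

Grouping before binding matters here: without the parentheses, the word boundaries would attach only to the first and last alternatives rather than to the alternation as a whole.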

Regular Expression Dictionaries

Data sets with canned regular expressions.

  • regex_cheat
    A dataset containing the regex chunk name, the regex string, and a
  • regex_usa
    Canned Regular Expressions (United States of America)
  • regex_supplement
    Supplemental Canned Regular Expressions

Other

  • group_or
    Group Wrap and `or` Concatenate Elements
  • print.explain
    Prints an explain object
  • print.regexr
    Prints a regexr Object
  • qdapRegex(package-qdapRegex, qdapRegex-package)
    qdapRegex: Regular Expression Removal, Extraction, & Replacement Tools for the

Dependencies

  • Depends:
  • Imports: stringi
  • Suggests: testthat
  • Extends:

Author

Tyler W. Rinker