Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special...
Transcript of Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special...
![Page 1: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/1.jpg)
Regular Expressions 1STAT 133
Gaston Sanchez
Department of Statistics, UC–Berkeley
gastonsanchez.com
github.com/gastonstat/stat133
Course web: gastonsanchez.com/stat133
![Page 2: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/2.jpg)
Motivation
2
![Page 3: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/3.jpg)
Motivation
I So far we have seen some basic and intermediate functionsfor handling and working with text in R
I These functions are very useful functions and they allowsus to do many interesting things.
I However, if we truly want to unleash the power of stringsmanipulation, we need to talk about regular expressions.
3
![Page 4: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/4.jpg)
Motivation
## name gender height weight born
## 1 Luke male 1.72m 77gr 19BBY
## 2 Leia Female 1.50m 49gr 19BBY
## 3 R2-D2 male 0.96m 32gr 33BBY
## 4 C-3PO MALE 1.67m 75gr 112BBY
It is not uncommon to have datasets that need some cleaningand preprocessing
4
![Page 5: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/5.jpg)
Motivation
## name gender height weight born
## 1 Luke male 1.72m 77gr 19BBY
## 2 Leia Female 1.50m 49gr 19BBY
## 3 R2-D2 male 0.96m 32gr 33BBY
## 4 C-3PO MALE 1.67m 75gr 112BBY
It is not uncommon to have datasets that need some cleaningand preprocessing
4
![Page 6: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/6.jpg)
Motivation
Some common tasks
I finding pieces of text or characters that meet a certainpattern
I extracting pieces of text in non-standard formats
I transforming text into a uniform format
I resolving inconsistencies
I substituting certain characters
I splitting a piece of text into various parts
5
![Page 7: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/7.jpg)
Regular Expressions
6
![Page 8: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/8.jpg)
Regular Expressions
I A regular expression is a special text string for describinga certain amount of text.
I This “certain amount of text” receives the formal name ofpattern.
I A regular expression is a pattern that describes a set ofstrings.
I It is common to abbreviate the term “regular expression”as regex
7
![Page 9: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/9.jpg)
Regular Expressions
Simply put, working with regular expressionsis nothing more than pattern matching
8
![Page 10: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/10.jpg)
Regular Expressions
I Regex patterns consist of a combination of alphanumericcharacters as well as special characters.
– e.g. [a-zA-Z0-9 .]*
I A regex pattern can be as simple as a single character
I But it can also be formed by several characters with a morecomplex structure
9
![Page 11: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/11.jpg)
Basic Concepts
Regular expressions are constructed from 3 things:
I Literal characters are matched only by the character itself
I A character class is matched by any single member of thespecified class
I Modifiers operate on literal characters, character classes,or combinations of the two.
10
![Page 12: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/12.jpg)
Literal Characters
Consider the following text:
The quick brown fox jumps over the lazy dog
A basic regex can be something as simple as fox. Thecharacters fox match the word “fox” in the sentence above.
11
![Page 13: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/13.jpg)
Special Characters
Consider this other text:
One. Two. Three. And four* and Five!
Not all characters are matched literally. There are somecharacters that have a special meaning in regular expressions: .or * are some of these special characters.
12
![Page 14: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/14.jpg)
About Regex
I “Regular Expressions” is not a full programming language
I Regex have a special syntax and instructions that you mustlearn
I Regular expressions are supported in a variety of forms onalmost every computing platform
I R has functions and packages designed to work withregular expressions
13
![Page 15: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/15.jpg)
Regex Tasks
Common operations
I Identify a match to a pattern
I Locate a pattern match (positions)
I Replace a matched pattern
I Extract a matched pattern
14
![Page 16: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/16.jpg)
Identify a Match
15
![Page 17: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/17.jpg)
Identify Matches
Functions for identifying match to a pattern:
I grep(..., value = FALSE)
I grepl()
I str detect() ("stringr")
16
![Page 18: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/18.jpg)
grep(..., value = FALSE)
# some text
text <- c("one word", "a sentence", "three two one")
# pattern
pat <- "one"
# default usage
grep(pat, text)
## [1] 1 3
17
![Page 19: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/19.jpg)
grepl()
I grepl() is very similar to grep()
I The difference resides in that the output are not numericindices, but logical
I You can think of grepl() as grep-logical
18
![Page 20: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/20.jpg)
grepl()
# some text
text <- c("one word", "a sentence", "three two one")
# pattern
pat <- "one"
# default usage
grepl(pat, text)
## [1] TRUE FALSE TRUE
19
![Page 21: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/21.jpg)
str detect() in "stringr"
# some text
text <- c("one word", "a sentence", "three two one")
# pattern
pat <- "one"
# default usage
str_detect(text, pat)
## [1] TRUE FALSE TRUE
20
![Page 22: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/22.jpg)
Locate a Match
21
![Page 23: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/23.jpg)
Locate Matches
Functions for locating match to a pattern:
I regexpr()
I gregexpr()
I str locate() ("stringr")
I str locate all() ("stringr")
22
![Page 24: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/24.jpg)
regexpr()
I regexpr() is used to find exactly where the pattern isfound in a given string
I it tells us which elements of the text vector actuallycontain the regex pattern, and
I identifies the position of the substring that is matched bythe regex pattern
23
![Page 25: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/25.jpg)
regexpr()
# some text
text <- c("one word", "a sentence", "three two one")
# default usage
regexpr(pattern = "one", text)
## [1] 1 -1 11
## attr(,"match.length")
## [1] 3 -1 3
## attr(,"useBytes")
## [1] TRUE
24
![Page 26: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/26.jpg)
gregexpr()
I gregexpr() does practically the same thing as regexpr()
I identifies where a pattern is within a string vector, bysearching each element separately
I The only difference is that gregexpr() has an output inthe form of a list
25
![Page 27: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/27.jpg)
gregexpr()
# some text
text <- c("one word", "a sentence", "three two one")
# default usage
gregexpr(pattern = "one", text)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 11
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE
26
![Page 28: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/28.jpg)
str locate() in "stringr"
I str locate() locates the position of the first occurrenceof a pattern in a string
I it tells us which elements of the text vector actuallycontain the regex pattern, and
I identifies the position of the substring that is matched bythe regex pattern
27
![Page 29: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/29.jpg)
str locate() in "stringr"
# some text
text <- c("one word", "a sentence", "three two one")
# default usage
str_locate(text, pattern = "one")
## start end
## [1,] 1 3
## [2,] NA NA
## [3,] 11 13
28
![Page 30: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/30.jpg)
str locate all() in "stringr"
I str locate all() locates the position of ALL theoccurrences of a pattern in a string
I it tells us which elements of the text vector actuallycontain the regex pattern, and
I the output is in the form of a list with as many elements asthe number of elements in the examined vector
29
![Page 31: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/31.jpg)
str locate all() in "stringr"
# some text
text <- c("one word", "a sentence", "one three two one")
# default usage
str_locate_all(text, pattern = "one")
## [[1]]
## start end
## [1,] 1 3
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 1 3
## [2,] 15 17
30
![Page 32: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/32.jpg)
Extract a Match
31
![Page 33: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/33.jpg)
Extract Matches
Functions for extracting positions of matched pattern:
I grep(..., value = TRUE)
I str extract() ("stringr")
I str extract all() ("stringr")
32
![Page 34: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/34.jpg)
Extraction with grep(..., value = TRUE)
I grep(..., value = TRUE) allows us to do basicextraction
I Actually, it will extract the entire element that matches thepattern
I Sometimes this is not exactly what we want to do (it’sbetter to use functions of "stringr")
33
![Page 35: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/35.jpg)
grep(..., value = TRUE)
# some text
text <- c("one word", "a sentence", "one three two one")
# extract first one
grep(pattern = "one", text, value = TRUE)
## [1] "one word" "one three two one"
34
![Page 36: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/36.jpg)
Pattern extraction with str extract()
I str extract() extracts the first occurrence of thematched pattern
I It will only extract the matched pattern
I If no pattern is matched, then a missing value is returned
35
![Page 37: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/37.jpg)
str extract() in "stringr"
# some text
text <- c("one word", "a sentence", "one three two one")
# extract first one
str_extract(text, pattern = "one")
## [1] "one" NA "one"
36
![Page 38: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/38.jpg)
Pattern extraction with str extract all()
I str extract all() extracts ALL the occurrences of thematched pattern
I It will only extract the matched pattern
I If no pattern is matched, then a character vector of lengthzero is returned
I the output is in list format
37
![Page 39: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/39.jpg)
str extract all() in "stringr"
# some text
text <- c("one word", "a sentence", "one three two one")
# extract all 'one's
str_extract_all(text, pattern = "one")
## [[1]]
## [1] "one"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "one" "one"
38
![Page 40: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/40.jpg)
Replace a Match
39
![Page 41: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/41.jpg)
Replace Matches
Functions for replacing a matched pattern:
I sub()
I gsub()
I str replace() ("stringr")
I str replace all() ("stringr")
40
![Page 42: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/42.jpg)
Replace first occurrence with sub()
About sub()
I sub() replaces the first occurrence of a pattern in a giventext
I if there is more than one occurrence of the pattern in eachelement of a string vector, only the first one will bereplaced.
41
![Page 43: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/43.jpg)
Replacing with sub()
# string
Rstring <- c("The R Foundation",
"for Statistical Computing",
"R is FREE software",
"R is a collaborative project")
# substitute first 'R' with 'RR'
sub("R", "RR", Rstring)
## [1] "The RR Foundation" "for Statistical Computing"
## [3] "RR is FREE software" "RR is a collaborative project"
42
![Page 44: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/44.jpg)
Replace all occurrences with gsub()
To replace not only the first pattern occurrence, but all of theoccurrences we should use gsub() (think of it as generalsubstitution)
43
![Page 45: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/45.jpg)
Replacing with gsub()
# string
Rstring <- c("The R Foundation",
"for Statistical Computing",
"R is FREE software",
"R is a collaborative project")
# substitute all 'R' with 'RR'
gsub("R", "RR", Rstring)
## [1] "The RR Foundation" "for Statistical Computing"
## [3] "RR is FRREE software" "RR is a collaborative project"
44
![Page 46: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/46.jpg)
Replacing with str replace()
# string
Rstring <- c("The R Foundation",
"for Statistical Computing",
"R is FREE software",
"R is a collaborative project")
# substitute first 'R' with 'RR'
str_replace(Rstring, "R", "RR")
## [1] "The RR Foundation" "for Statistical Computing"
## [3] "RR is FREE software" "RR is a collaborative project"
45
![Page 47: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/47.jpg)
Replacing with str replace all()
# string
Rstring <- c("The R Foundation",
"for Statistical Computing",
"R is FREE software",
"R is a collaborative project")
# substitute first 'R' with 'RR'
str_replace_all(Rstring, "R", "RR")
## [1] "The RR Foundation" "for Statistical Computing"
## [3] "RR is FRREE software" "RR is a collaborative project"
46
![Page 48: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/48.jpg)
Split a string
47
![Page 49: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/49.jpg)
Split a string
Another common task is splitting a string based on a pattern.The idea is to split the elements of a character vector intosubstrings according to regex matches.
48
![Page 50: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/50.jpg)
Split a string
Functions for extracting positions of matched pattern:
I strsplit()
I str split() ("stringr")
49
![Page 51: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/51.jpg)
Split a string
# a sentence
sentence <- c("The quick brown fox")
# split into words
strsplit(sentence, " ")
## [[1]]
## [1] "The" "quick" "brown" "fox"
50
![Page 52: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/52.jpg)
Split a string
# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")
# split each number into its portions
strsplit(tels, "-")
## [[1]]
## [1] "510" "548" "2238"
##
## [[2]]
## [1] "707" "231" "2440"
##
## [[3]]
## [1] "650" "752" "1300"
51
![Page 53: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/53.jpg)
Split a string
# telephone numbers
tels <- c("510-548-2238", "707-231-2440", "650-752-1300")
# split each number into its portions
str_split(tels, "-")
## [[1]]
## [1] "510" "548" "2238"
##
## [[2]]
## [1] "707" "231" "2440"
##
## [[3]]
## [1] "650" "752" "1300"
52
![Page 54: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/54.jpg)
Special Characters
53
![Page 55: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/55.jpg)
Metacharacters
I The simplest form of regular expressions are those thatmatch a single character
I Most characters, including all letters and digits, are regularexpressions that match themselves
I For example, the pattern "1" matches the number 1
I The pattern "blu" matches the set of letters “blu”.
I However, there are some special characters that don’tmatch themselves
I These special characters are known as metacharacters.
54
![Page 56: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/56.jpg)
Metacharacters
The metacharacters in Extended Regular Expressions are:
. \ | ( ) [ { $ * + ?
To use a metacharacter symbol as a literal character, we needto escape them
55
![Page 57: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/57.jpg)
Metacharacters and how to escape them in RMetacharacter Literal meaning Escape in R
. the period or dot \\.$ the dollar sign \\$* the asterisk or star \\*+ the plus sign \\+? the question mark \\?| the vertical bar or pipe symbol \\|\ the backslash \\\\^ the caret \\^[ the opening bracket \\[] the closing bracket \\]{ the opening brace \\{} the closing brace \\}( the opening parenthesis \\() the closing parenthesis \\)
56
![Page 58: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/58.jpg)
Character Classes
I A character class or character set is a list of charactersenclosed by square brackets [ ]
I Character sets are used to match only one of severalcharacters.
I e.g. the regex character class [aA] matches any lower caseletter a or any upper case letter A.
I character classes including the caret ^ at the beginning ofthe list indicates that the regular expression matches anycharacter NOT in the list
I A dash symbol - (not at the beginning) is used to indicateranges: e.g. [0-9]
57
![Page 59: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/59.jpg)
Character Classes
# some string
transport <- c("car", "bike", "plane", "boat")
# look for 'e' or 'i'
grep(pattern = "[ei]", transport, value = TRUE)
## [1] "bike" "plane"
58
![Page 60: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/60.jpg)
Character Classes
Some (Regex) Character ClassesAnchor Description[aeiou] match any one lower case vowel[AEIOU] match any one upper case vowel
[0123456789] match any digit[0-9] match any digit (same as previous class)[a-z] match any lower case ASCII letter[A-Z] match any upper case ASCII letter
[a-zA-Z0-9] match any of the above classes[^aeiou] match anything other than a lowercase vowel[^0-9] match anything other than a digit
59
![Page 61: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/61.jpg)
POSIX Classes
Closely related to the regex character classes we have what isknown as POSIX character classes. In R, POSIX characterclasses are represented with expressions inside double brackets[[ ]].
60
![Page 62: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/62.jpg)
POSIX Classes
POSIX Character Classes in RClass Description
[[:lower:]] Lower-case letters[[:upper:]] Upper-case letters[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])[[:blank:]] Blank characters: space and tab[[:cntrl:]] Control characters[[:punct:]] Punctuation characters: ! ” # % & ’ ( ) * + , - . / : ;[[:space:]] Space characters: tab, newline, vertical tab, form feed,
carriage return, and space[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f[[:print:]] Printable characters ([[:alpha:]], [[:punct:]] and space)[[:graph:]] Graphical characters ([[:alpha:]] and [[:punct:]])
61
![Page 63: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/63.jpg)
Special Sequences
Sequences define, no surprinsingly, sequences of characterswhich can match.
There are shorthand versions (or anchors) for commonly usedsequences
62
![Page 64: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/64.jpg)
Anchor Sequences in RAnchor Description\\d match a digit character\\D match a non-digit character\\s match a space character\\S match a non-space character\\w match a word character\\W match a non-word character\\b match a word boundary\\B match a non-(word boundary)\\h match a horizontal space\\H match a non-horizontal space\\v match a vertical space\\V match a non-vertical space
63
![Page 65: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/65.jpg)
Quantifiers
I Another important set of regex elements are the so-calledquantifiers.
I These are used when we want to match a certain numberof characters that meet certain criteria.
I Quantifiers specify how many instances of a character,group, or character class must be present in the input for amatch to be found.
64
![Page 66: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/66.jpg)
Quantifiers
Quantifier Description? The preceding item is optional and will be
matched at most once* The preceding item will be matched zero
or more times+ The preceding item will be matched
one or more times{n} The preceding item is matched exactly n times{n,} The preceding item is matched n or more times{n,m} The preceding item is matched at least n times,
but not more than m times
65
![Page 67: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/67.jpg)
Positions
Character Description^ matches the start of the string$ matches the end of the string. matches any single character| “OR” operator, matches patterns
on either side of the |
66
![Page 68: Regular Expressions 1is nothing more than pattern matching 8. ... characters as well as special characters. {e.g. [a-zA-Z0-9 .]* I A regex pattern can be as simple as a single character](https://reader036.fdocuments.in/reader036/viewer/2022071019/5fd34f930f80fc6015179d2b/html5/thumbnails/68.jpg)
Regular Expressions in R
I There are two main aspects that we need to consider aboutregular expressions in R
I One has to do with the functions designed for regexpattern matching
I The other aspect has to do with the way regex patternsare expressed in R
67