06: Regular Expressions
STA35B: Statistical Data Science 2
Data we will use
str(babynames::babynames) # tibble: year/sex/name/number/proportion vars
tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
$ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
$ sex : chr [1:1924665] "F" "F" "F" "F" ...
$ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
$ n : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
$ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
str(stringr::fruit) # vector: 80 fruits
chr [1:80] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry" ...
str(stringr::words) # vector: 980 common English words
chr [1:980] "a" "able" "about" "absolute" "accept" "account" "achieve" ...
str(stringr::sentences) # vector: 720 short sentences
chr [1:720] "The birch canoe slid on the smooth planks." ...
We will use str_view(string, pattern = NULL) a lot
The pattern argument is interpreted as a regular expression (regex)
fruit |> str_view('berry') # shows only the fruits whose name contains 'berry'; each match is wrapped in <>
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
?str_view # pulls up documentation
Character classes: []
Enable you to match from a set of characters (similar idea to %in%)
E.g., [abcd] matches any single character that is 'a', 'b', 'c', or 'd'
# any word containing 'x' with a vowel on each side
words |> str_view('[aeiou]x[aeiou]')
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
# any word containing 'x' immediately preceded by a vowel
words |> str_view('[aeiou]x')
[108] │ b<ox>
[284] │ <ex>act
[285] │ <ex>ample
[286] │ <ex>cept
[287] │ <ex>cuse
[288] │ <ex>ercise
[289] │ <ex>ist
[290] │ <ex>pect
[291] │ <ex>pense
[292] │ <ex>perience
[293] │ <ex>plain
...
Character classes: invert
Can invert with ^ at the start of the class: [^abcd] matches any single character except 'a', 'b', 'c', 'd'
# any word containing 'y' with a non-vowel character on each side
words |> str_view('[^aeiou]y[^aeiou]')
[836] │ <sys>tem
[901] │ <typ>e
# any word containing 'y' with a vowel immediately before it and a non-vowel immediately after it
words |> str_view('[aeiou]y[^aeiou]')
[35] │ alw<ays>
[510] │ m<ayb>e
Character classes: alternation
Alternation | picks between alternative patterns
# fruits containing 'apple', 'melon', or 'nut'
fruit |> str_view('apple|melon|nut')
[1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
# fruits containing a doubled vowel
fruit |> str_view('aa|ee|ii|oo|uu')
[9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n
str_detect(character_vector, pattern)
returns a logical vector: TRUE for each element of the vector that the pattern matches
c('a', 'b', 'c', 'd', 'e', 'f') |> str_detect('[aeiou]')
[1] TRUE FALSE FALSE FALSE TRUE FALSE
Since it returns a logical vector, it can be used with filter()
# most popular names containing an 'x'
babynames |>
  filter(str_detect(name, 'x')) |> # keeps the rows whose name contains an 'x'
  count(name, wt = n, sort = TRUE) # wt = n: computes sum(n) for each name
# A tibble: 974 × 2
name n
<chr> <int>
1 Alexander 665492
2 Alexis 399551
3 Alex 278705
4 Alexandra 232223
5 Max 148787
6 Alexa 123032
7 Maxine 112261
8 Alexandria 97679
...
str_detect(character_vector, pattern)
Can also use str_detect() in conjunction with group_by(), summarize(), etc.
sum() returns the number of strings that match the pattern
mean() returns the proportion of strings that match the pattern
E.g. proportion of names per year that contain an 'x'
babynames |>
  group_by(year) |>
  summarize(prop_x = mean(str_detect(name, 'x'))) |>
  arrange(desc(prop_x))
# A tibble: 138 × 2
year prop_x
<dbl> <dbl>
1 2016 0.0163
2 2017 0.0159
3 2015 0.0154
4 2014 0.0146
5 2013 0.0145
6 2012 0.0136
7 2011 0.0130
8 2010 0.0126
...
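Why sum() and mean() work on the result of str_detect() may be easier to see on a tiny example. The sketch below assumes only that stringr is installed; the vector toy_names is made up for illustration and is not part of babynames.

```r
library(stringr)

# Hypothetical toy vector of names (not taken from babynames)
toy_names <- c('Alex', 'Max', 'Mary', 'Anna')

# Logical vector: does each name contain an 'x'?
has_x <- str_detect(toy_names, 'x')   # TRUE TRUE FALSE FALSE

# TRUE counts as 1 and FALSE as 0 in arithmetic
n_with_x    <- sum(has_x)    # number of names with an 'x': 2
prop_with_x <- mean(has_x)   # proportion of names with an 'x': 0.5
```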
str_count(): Counting matches
str_count() tells how many matches there are in a string
c('apple', 'banana', 'currant', 'eggplant', 'pear') |> str_count('p')
c('apple', 'banana', 'currant', 'eggplant', 'pear') |> str_count('an')
Regex matches never overlap: each search starts after the end of the previous match
'abababa' |> str_count('aba')
'abababa' |> str_view('aba')
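A short sketch of what these calls return, assuming only that stringr is installed; the variable names are introduced here for illustration. It shows that 'abababa' contains just two non-overlapping matches of 'aba', even though 'aba' begins at three different positions.

```r
library(stringr)

fruits_small <- c('apple', 'banana', 'currant', 'eggplant', 'pear')

# Number of matches of each pattern per string
counts_p  <- str_count(fruits_small, 'p')    # 2 0 0 1 1
counts_an <- str_count(fruits_small, 'an')   # 0 2 1 1 0

# Matches never overlap: after matching characters 1-3 the search
# resumes at character 4, so only the matches at 1-3 and 5-7 are found
count_aba <- str_count('abababa', 'aba')     # 2
```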
str_count(): Counting vowels and consonants in baby names
Can use str_count() with mutate().
# Compute number of vowels/consonants in baby names
babynames |>
  count(name) |> # compute number of occurrences per unique name
  mutate(vowels = str_count(name, '[aeiou]'),
         consonants = str_count(name, '[^aeiou]'))
# A tibble: 97,310 × 4
name n vowels consonants
<chr> <int> <int> <int>
1 Aaban 10 2 3
2 Aabha 5 2 3
3 Aabid 2 2 3
4 Aabir 1 2 3
5 Aabriella 5 4 5
6 Aada 1 2 2
7 Aadam 26 2 3
8 Aadan 11 2 3
...
Pattern matching is case sensitive, so the leading 'A' isn’t counted as a vowel (it is counted as a consonant instead). Ways around this:
Add upper-case vowels to character class: name |> str_count('[aeiouAEIOU]')
# Compute number of vowels/consonants in baby names
babynames |>
  count(name) |> # compute number of occurrences per unique name
  mutate(vowels = str_count(name, '[aeiouAEIOU]'),
         consonants = str_count(name, '[^aeiouAEIOU]'))
# A tibble: 97,310 × 4
name n vowels consonants
<chr> <int> <int> <int>
1 Aaban 10 3 2
2 Aabha 5 3 2
3 Aabid 2 3 2
4 Aabir 1 3 2
5 Aabriella 5 5 4
6 Aada 1 3 1
7 Aadam 26 3 2
8 Aadan 11 3 2
...
Convert the names to lower case: str_to_lower(name) |> str_count('[aeiou]')
babynames |>
  count(name) |>
  mutate(name = str_to_lower(name),
         vowels = str_count(name, '[aeiou]'),
         consonants = str_count(name, '[^aeiou]'))
# A tibble: 97,310 × 4
name n vowels consonants
<chr> <int> <int> <int>
1 aaban 10 3 2
2 aabha 5 3 2
3 aabid 2 3 2
4 aabir 1 3 2
5 aabriella 5 5 4
6 aada 1 3 1
7 aadam 26 3 2
8 aadan 11 3 2
...
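A third option, not shown on the slides above, is to ask the regex engine to ignore case by wrapping the pattern in stringr's regex() helper with ignore_case = TRUE. The sketch below is a minimal illustration on a made-up vector toy_names, not a rerun of the babynames example.

```r
library(stringr)

# Hypothetical toy vector of names
toy_names <- c('Aaban', 'Aada', 'Eve')

# ignore_case = TRUE lets [aeiou] match upper-case vowels as well
vowels_ic <- str_count(toy_names, regex('[aeiou]', ignore_case = TRUE))   # 3 3 2

# Same counts as listing both cases explicitly in the class
vowels_both <- str_count(toy_names, '[aeiouAEIOU]')
```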
str_replace(): Replace and remove values
(x <- c('apple', 'pear', 'banana'))
[1] "apple" "pear" "banana"
x |> str_replace('[aeiou]', '-') # replaces the first match
[1] "-pple" "p-ar" "b-nana"
x |> str_replace_all('[aeiou]', '-') # replaces all matches
[1] "-ppl-" "p--r" "b-n-n-"
You can remove matches by replacing them with the empty string, or with str_remove() / str_remove_all():
x |> str_replace('[aeiou]', '') # `replacement = ''`
x |> str_remove('[aeiou]')
x |> str_replace_all('[aeiou]', '') # `replacement = ''`
x |> str_remove_all('[aeiou]')
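The equivalence between removing and replacing with an empty string can be checked directly. This is a small sketch assuming stringr is installed; the result names are introduced here only for illustration.

```r
library(stringr)

x <- c('apple', 'pear', 'banana')

removed_first <- str_remove(x, '[aeiou]')       # "pple" "par"  "bnana"
removed_all   <- str_remove_all(x, '[aeiou]')   # "ppl"  "pr"   "bnn"

# Removing is the same as replacing with the empty string
same_first <- identical(removed_first, str_replace(x, '[aeiou]', ''))
same_all   <- identical(removed_all, str_replace_all(x, '[aeiou]', ''))
```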
Ranges of characters
Task: replace every letter between 'a' and 'u' in a string with 'z'.
Could spell out manually all letters between 'a' and 'u'. Yuck.
Instead, you can use the character class operator [] together with a hyphen -.
The above example:
(x <- c('happy', 'ab', 'zap'))
# replace each letter between a and u with z
x |> str_replace_all('[a-u]', 'z')
An example with digits:
(x <- c('code9202', 'apple2850', '0352'))
[1] "code9202" "apple2850" "0352"
# replace each digit between 0 and 5 with z
x |> str_replace_all('[0-5]', 'z')
[1] "code9zzz" "applez8zz" "zzzz"
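For reference, here is a sketch of both range examples with their results, assuming stringr is installed. Note that 'y' lies outside a-u and is left alone, and that the 'z' in 'zap' is not matched either (it simply already is a 'z').

```r
library(stringr)

# Letters in the range a-u become 'z'; 'y' and 'z' are outside the range
letters_out <- str_replace_all(c('happy', 'ab', 'zap'), '[a-u]', 'z')
# "zzzzy" "zz" "zzz"

# Digits in the range 0-5 become 'z'; 8 and 9 are outside the range
digits_out <- str_replace_all(c('code9202', 'apple2850', '0352'), '[0-5]', 'z')
# "code9zzz" "applez8zz" "zzzz"
```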
Character classes and ranges are very useful in conjunction with the quantifiers ?, *, +
# Find all words with at least three consecutive vowels.
words |> str_view('[aeiou][aeiou][aeiou]+')
[79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
[915] │ var<iou>s
Useful for parsing strings which are partitioned by letters/numbers
(name_score <- c('Mary_92', 'Pat_35', 'Will_85'))
[1] "Mary_92" "Pat_35" "Will_85"
name_score |>
  str_replace('[a-zA-Z]+', 'John') |> # replace the first run of letters (the name) with John
  str_replace('[0-9]+', '100') # replace the first run of digits (the score) with 100
[1] "John_100" "John_100" "John_100"
Escaping
The characters '.', '?', '+', '*' (among others) have special meaning in regex, so to match them literally you must escape them
In regex, a \ in front of a character denotes an escape
To write a literal \ inside an R string it must itself be escaped, so the regex \. is written as the string '\\.':
c('abc', 'a.c', 'bef') |> str_view('a\\.c')
c('a*rdvark', '*pple', 'm*n') |> str_view('\\*')
[1] │ a<*>rdvark
[2] │ <*>pple
[3] │ m<*>n
Recall that to represent a backslash in a string, you need to escape it:
To match a literal backslash, the regex must be \\ (an escape in front of a backslash).
Each of those two backslashes must in turn be doubled inside an R string, giving four: '\\\\'.
'a\\b' |> str_view('\\\\')
'mary.elizabeth' |> str_replace('\.', '-')
# Error: '\.' is an unrecognized escape in character string (<input>:1:33)
'mary.elizabeth' |> str_replace('\\.', '-')
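The two layers of escaping (R string, then regex) can be inspected directly. The sketch below assumes stringr is installed and R 4.0 or later for the raw-string syntax r"(...)"; raw strings and fixed() are alternatives not covered in the text above, shown only for comparison.

```r
library(stringr)

# The R string '\\.' holds two characters: a backslash and a dot
n_chars <- nchar('\\.')                      # 2

# A raw string (R >= 4.0) avoids the doubling: same two characters
same_pattern <- identical('\\.', r"(\.)")    # TRUE

# Escaped dot: only the literal '.' is replaced
dashed <- str_replace('mary.elizabeth', '\\.', '-')            # "mary-elizabeth"

# fixed() treats the pattern as a literal string, so no escape is needed
dashed_fixed <- str_replace('mary.elizabeth', fixed('.'), '-')

# Unescaped '.' matches any character, so the first character is replaced
dashed_wrong <- str_replace('mary.elizabeth', '.', '-')        # "-ary.elizabeth"
```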
Anchors
By default, a regex will match any part of a string.
To match only at the beginning or end, you need to anchor the pattern:
fruit |> str_view('^a') # `^` indicates 'starts with'
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
fruit |> str_view('a$') # `$` indicates 'ends with'
[4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
Example: in every fruit name that starts with 'a', replace that leading 'a' with an 'o'
fruit |> str_replace('^a', 'o')
[1] "opple" "opricot" "ovocado"
[4] "banana" "bell pepper" "bilberry"
[7] "blackberry" "blackcurrant" "blood orange"
[10] "blueberry" "boysenberry" "breadfruit"
[13] "canary melon" "cantaloupe" "cherimoya"
[16] "cherry" "chili pepper" "clementine"
[19] "cloudberry" "coconut" "cranberry"
[22] "cucumber" "currant" "damson"
[25] "date" "dragonfruit" "durian"
[28] "eggplant" "elderberry" "feijoa"
[31] "fig" "goji berry" "gooseberry"
...
To match only the full string, not a substring, anchor the pattern with both ^ and $:
fruit |> str_view('apple')
[1] │ <apple>
[62] │ pine<apple>
fruit |> str_view('^apple$')
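The effect of each anchor can be compared side by side on a small vector. This is a minimal sketch assuming stringr is installed; the vector toy is made up for illustration, and str_subset() is used to return the matching strings themselves.

```r
library(stringr)

# Hypothetical toy vector
toy <- c('apple', 'pineapple', 'apple pie')

anywhere <- str_subset(toy, 'apple')     # all three strings
starts   <- str_subset(toy, '^apple')    # "apple" "apple pie"
ends     <- str_subset(toy, 'apple$')    # "apple" "pineapple"
whole    <- str_subset(toy, '^apple$')   # "apple"
```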
Anchors: boundaries of words
You can match the beginning or end of a word using \b
This works by treating letters, digits, and the underscore as ‘word’ characters, and everything else as ‘non-word’ characters
x <- c('summary(x)', 'summarize(df)', 'rowsum(x)', 'sum(x)')
x |> str_view('sum')
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)
x |> str_view('\\bsum\\b')
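A sketch of what the word-boundary pattern selects, assuming stringr is installed. Only 'sum(x)' has a boundary on both sides of 'sum': in 'summary' and 'summarize' it is followed by a letter, and in 'rowsum' it is preceded by one.

```r
library(stringr)

x <- c('summary(x)', 'summarize(df)', 'rowsum(x)', 'sum(x)')

# TRUE only where 'sum' stands alone as a word
whole_word <- str_detect(x, '\\bsum\\b')    # FALSE FALSE FALSE TRUE

# The matching strings themselves
only_sum <- str_subset(x, '\\bsum\\b')      # "sum(x)"
```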
Character sets
We have built sets with []: e.g. [ac] matches a single character that is 'a' or 'c'.
We have used - to denote ranges, e.g. [a-z] lowercase letters, [0-9] digits.
Some others:
(x <- 'abcd ABCD 12345 -!@#%.')
[1] "abcd ABCD 12345 -!@#%."
\d matches any digit;
\D matches anything that isn’t a digit.
x |> str_view('\\d+')
[1] │ abcd ABCD <12345> -!@#%.
x |> str_view('\\D+')
[1] │ <abcd ABCD >12345< -!@#%.>
\s matches any whitespace (e.g., space, tab, newline);
\S matches anything that isn’t whitespace.
x |> str_view('\\s+')
[1] │ abcd< >ABCD< >12345< >-!@#%.
x |> str_view('\\S+')
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
\w matches any “word” character, i.e. letters, digits, and the underscore;
\W matches any “non-word” character.
x |> str_view('\\w+')
[1] │ <abcd> <ABCD> <12345> -!@#%.
x |> str_view('\\W+')
[1] │ abcd< >ABCD< >12345< -!@#%.>
Remember: to represent \ in a string, need double backslash.
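The shorthand classes can also be used with other stringr functions. The sketch below, assuming stringr is installed, uses str_extract_all() and str_count() on the same string x to pull out or count what each class matches; the result names are introduced here for illustration.

```r
library(stringr)

x <- 'abcd ABCD 12345 -!@#%.'

# Runs of digits and runs of word characters
digit_runs <- str_extract_all(x, '\\d+')[[1]]   # "12345"
word_runs  <- str_extract_all(x, '\\w+')[[1]]   # "abcd" "ABCD" "12345"

# Counting single characters of each kind
n_space   <- str_count(x, '\\s')   # 3 spaces
n_nonword <- str_count(x, '\\W')   # 3 spaces + 6 punctuation characters = 9
```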
Quantifiers
We already discussed ? (0 or 1 match), + (1+ matches), * (0+ matches)
colou?r: matches both the American ('color') and British ('colour') spellings
\d+: matches 1+ digits
\s?: optionally matches a single whitespace character (0 or 1)
Can specify exact number of matches:
{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
Words with 3 or more consecutive vowels:
words |> str_view('[aeiou]{3,}')
[79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
...
Words with between 4 and 6 consecutive consonants:
words |> str_view('[^aeiou]{4,6}')
[45] │ a<pply>
[198] │ cou<ntry>
[424] │ indu<stry>
[830] │ su<pply>
[836] │ <syst>em
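To see how the quantifiers differ, it can help to run them on very small inputs. This is a sketch assuming stringr is installed; the test strings are made up. Quantifiers are greedy by default, so each one takes as many repetitions as it is allowed.

```r
library(stringr)

# ? makes the 'u' optional: both spellings match, a doubled 'u' does not
spellings <- str_detect(c('color', 'colour', 'colouur'), '^colou?r$')   # TRUE TRUE FALSE

# \s? takes at most one whitespace character, \s* takes all of them
opt_space <- str_extract('a   b', 'a\\s?')   # "a "
any_space <- str_extract('a   b', 'a\\s*')   # "a   "

# Counted repetition on a run of four 'a's
exactly_2  <- str_extract('aaaa', 'a{2}')    # "aa"
at_least_2 <- str_extract('aaaa', 'a{2,}')   # "aaaa"
one_to_3   <- str_extract('aaaa', 'a{1,3}')  # "aaa"
```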
Order of operations in regex: precedence
It is not immediately clear in which order the operators of a regex are applied.
ab+: is this 'a' followed by 1+ 'b', or 'ab' repeated 1+ times? (The first.)
^a|b$: does this match exactly the string 'a' or the string 'b', or any string starting with 'a' or ending with 'b'? (The second.)
Generally: quantifiers (?, +, *) have high precedence, alternation | has low precedence.
Order of operations in regex: parentheses
You can also introduce parentheses to be more explicit about what you want.
words |> str_view('a(b+)') # same as `ab+`
[2] │ <ab>le
[3] │ <ab>out
[4] │ <ab>solute
[62] │ avail<ab>le
[66] │ b<ab>y
[452] │ l<ab>our
[648] │ prob<ab>le
[837] │ t<ab>le
words |> str_view('(^a)|(b$)') # same as `^a|b$`
[1] │ <a>
[2] │ <a>ble
[3] │ <a>bout
[4] │ <a>bsolute
[5] │ <a>ccept
[6] │ <a>ccount
[7] │ <a>chieve
[8] │ <a>cross
[9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
...
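Parentheses can also change the meaning instead of just restating the default. The sketch below, assuming stringr is installed and using made-up test strings, contrasts each default reading with the other grouping.

```r
library(stringr)

# ab+ is a(b+): only the 'b' is repeated
default_plus <- str_extract('ababab', 'ab+')     # "ab"
# (ab)+ repeats the whole pair
grouped_plus <- str_extract('ababab', '(ab)+')   # "ababab"

# Hypothetical toy vector
toy <- c('apple', 'crab', 'a', 'b', 'pear')

# ^a|b$ is (^a)|(b$): starts with 'a' OR ends with 'b'
default_alt <- str_detect(toy, '^a|b$')     # TRUE TRUE TRUE TRUE FALSE
# ^(a|b)$ matches only the exact strings 'a' and 'b'
grouped_alt <- str_detect(toy, '^(a|b)$')   # FALSE FALSE TRUE TRUE FALSE
```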
Order of operations in regex: parentheses 2
With parentheses, you can back-reference what a group matched: \1 refers to the text matched by the first pair of parentheses, \2 to the second, etc.
E.g. find all fruits which have a repeated pair of letters.
Pair of characters = '(..)' (each . matches any character); back-ref: '\\1'
fruit |> str_view('(..)\\1')
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
Words that start and end with same pair of letters:
# 'starts with' a pair: ^(..)
# 'ends with': need to end regex with \\1$
# to allow any chars between, put .* in middle
words |> str_view('^(..).*\\1$')
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>
Words that are repetitions of the same pair of letters:
c('haha', 'miumiu') |> str_view('^(..)+\\1$')
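A compact sketch of the first two back-reference patterns on small inputs, assuming stringr is installed. The vectors are made up for illustration, apart from 'church' and 'decide', which appear in the output above.

```r
library(stringr)

# (..)\1 : some pair of characters immediately followed by itself
repeated <- str_extract(c('banana', 'coconut', 'apple'), '(..)\\1')
# "anan" "coco" NA   ('apple' has no doubled pair)

# ^(..).*\1$ : the string starts and ends with the same pair
bookends <- str_detect(c('church', 'decide', 'chair'), '^(..).*\\1$')
# TRUE TRUE FALSE
```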
Order of operations in regex: parentheses 3
Can back-reference in str_replace(). E.g. swap the 2nd and 3rd words in each sentence
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
...
sentences |> str_replace('(\\w+) (\\w+) (\\w+)', '\\1 \\3 \\2')
[1] "The canoe birch slid on the smooth planks."
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."
...
(\\w+): matches 1+ ‘word characters’ (letters, digits, underscore) and captures them as a group
The spaces between the (\\w+) groups ensure we are looking for sequences of the form: word-chars, space, word-chars, space, word-chars
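The same idea applies to the name_score vector from earlier: capture the name and the score as two groups, then reorder them in the replacement. This is a sketch assuming stringr is installed; str_match() is used here, as an addition to the text above, to show what each group captured.

```r
library(stringr)

name_score <- c('Mary_92', 'Pat_35', 'Will_85')

# Group 1 = letters before the underscore, group 2 = digits after it
swapped <- str_replace(name_score, '([a-zA-Z]+)_([0-9]+)', '\\2_\\1')
# "92_Mary" "35_Pat" "85_Will"

# str_match() returns a matrix: column 1 is the full match,
# columns 2 and 3 are the first and second captured groups
parts  <- str_match(name_score, '([a-zA-Z]+)_([0-9]+)')
names_ <- parts[, 2]   # "Mary" "Pat" "Will"
scores <- parts[, 3]   # "92" "35" "85"
```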
Examples
Words that start with 'y':
words |> str_view('^y')
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
...
Words that don’t start with 'y':
words |> str_view('^[^y]')
[1] │ <a>
[2] │ <a>ble
[3] │ <a>bout
[4] │ <a>bsolute
[5] │ <a>ccept
...
Ends with a vowel-vowel-consonant triplet:
words |> str_view('[aeiou]{2}[^aeiou]$')
[3] │ ab<out>
[11] │ act<ual>
[19] │ aftern<oon>
[20] │ ag<ain>
[26] │ <air>
...
Has 7 or more consecutive lowercase letters:
words |> str_view('[a-z]{7,}')
[4] │ <absolute>
[6] │ <account>
[7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
...
Boolean operations
We have seen how ^ inside [] negates the set, e.g. to find words with no vowels:
words |> str_view('^[^aeiou]+$')
[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
[952] │ <why>
There’s no ‘and’ operator built into regex, which can complicate certain tasks.
Another way: use str_detect() to get a logical vector indicating the presence of vowels, then negate it:
words[!str_detect(words, '[aeiou]')]
[1] "by" "dry" "fly" "mrs" "try" "why"
Boolean operations: examples
Find all words containing both an 'a' and a 'b'
# Trickier with a single regex: need 'a' before 'b', or 'b' before 'a'
words |> str_view('a.*b|b.*a')
[2] │ <ab>le
[3] │ <ab>out
[4] │ <ab>solute
[62] │ <availab>le
...
Easier with str_detect() and &:
words[
  str_detect(words, 'a') &
  str_detect(words, 'b')
]
[1] "able" "about" "absolute" "available" "baby" "back"
[7] "bad" "bag" "balance" "ball" "bank" "bar"
[13] "base" "basis" "bear" "beat" "beauty" "because"
...
Find all words containing 'a', 'e', 'i', and 'o'.
A single-regex solution would be very complex (one alternative for each ordering of the four letters).
Much easier using str_detect() and &:
words[
  str_detect(words, 'a') &
  str_detect(words, 'e') &
  str_detect(words, 'i') &
  str_detect(words, 'o')
]
[1] "appropriate" "associate" "organize" "relation"
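The regex-only and logical-vector approaches can be checked against each other on a small vector. This is a sketch assuming a reasonably recent stringr (the negate argument of str_subset() is used as an alternative to ! and is not part of the text above); the vector toy is made up for illustration.

```r
library(stringr)

# Hypothetical toy vector
toy <- c('able', 'baby', 'cat', 'bed', 'dry')

# 'and' via a single regex vs. via two str_detect() calls combined with &
both_regex <- str_subset(toy, 'a.*b|b.*a')                       # "able" "baby"
both_and   <- toy[str_detect(toy, 'a') & str_detect(toy, 'b')]

# 'not' via a negated character class vs. via negating the match
no_vowel_regex  <- str_subset(toy, '^[^aeiou]+$')                # "dry"
no_vowel_negate <- str_subset(toy, '[aeiou]', negate = TRUE)
```

Both pairs agree here; the versions built from several simple patterns tend to be easier to read and to extend than one large regex.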