06: Regular Expressions

STA35B: Statistical Data Science 2

Akira Horiguchi

Data we will use

str(babynames::babynames)  # tibble: year/sex/name/number/proportion vars
tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
 $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
 $ sex : chr [1:1924665] "F" "F" "F" "F" ...
 $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
 $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
 $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
str(stringr::fruit)  # vector: 80 fruits
 chr [1:80] "apple" "apricot" "avocado" "banana" "bell pepper" "bilberry" ...
str(stringr::words)  # vector: 980 common English words
 chr [1:980] "a" "able" "about" "absolute" "accept" "account" "achieve" ...
str(stringr::sentences)  # vector: 720 short sentences
 chr [1:720] "The birch canoe slid on the smooth planks." ...

We will use str_view(string, pattern = NULL) a lot

  • pattern is interpreted as a regular expression (regex)
fruit |> str_view('berry')  # returns only the fruits whose name contains the pattern 'berry'
 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
?str_view  # pulls up documentation

Literal characters, metacharacters

  • Letters and digits match themselves exactly; these are literal characters.
  • Punctuation characters (\, ., +, *, etc) often have special regex meanings.
  • Such metacharacters enable a wide array of behaviors; we will explore a few.

Metacharacter: . will match any character

  • 'a.' matches any string which contains 'a' followed by another character.
c('a', 'ab', 'ae', 'bd', 'ea', 'eab') |> str_view('a.')
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>
  • Or all fruits which have an ‘a’, then any 3 characters (a space counts too), then an ‘e’:
fruit |> str_view('a...e')
 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
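
One caveat worth checking: in stringr's default regex engine, . matches any single character except a newline. A minimal sketch on a made-up vector (the strings and variable names are illustrative, not from the slides):

```r
library(stringr)

# Hypothetical test strings; the third has a newline between 'a' and 'c'
x <- c('a-c', 'a c', 'a\nc', 'ac')

# `.` needs exactly one character, and by default that character cannot be '\n'
dot_default <- str_detect(x, 'a.c')

# regex(dotall = TRUE) lets `.` match '\n' as well
dot_all <- str_detect(x, regex('a.c', dotall = TRUE))

print(dot_default)
print(dot_all)
```

The last string, 'ac', fails in both cases because . must consume one character.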

Metacharacter: quantifying repetitions with ?, +, and *

(x <- c('mz', 'mb', 'ma', 'mab', 'mabb', 'mabbb', 'mabab', 'macb'))
[1] "mz"    "mb"    "ma"    "mab"   "mabb"  "mabbb" "mabab" "macb" 

Match zero or more times.

x |> str_view('ab*')  # use `*`
[3] │ m<a>
[4] │ m<ab>
[5] │ m<abb>
[6] │ m<abbb>
[7] │ m<ab><ab>
[8] │ m<a>cb

'ab*' matches an 'a', followed by zero or more 'b's.

Match zero or one time.

x |> str_view('ab?')  # use `?`
[3] │ m<a>
[4] │ m<ab>
[5] │ m<ab>b
[6] │ m<ab>bb
[7] │ m<ab><ab>
[8] │ m<a>cb

'ab?' matches an 'a', optionally followed by a single 'b'.

Match one or more times.

x |> str_view('ab+')  # use `+`
[4] │ m<ab>
[5] │ m<abb>
[6] │ m<abbb>
[7] │ m<ab><ab>

'ab+' matches an 'a', followed by one or more 'b's.

Character classes: []

Enable you to match from a set of characters (similar idea to %in%)

  • E.g., [abcd] matches any single character that is 'a', 'b', 'c', or 'd'
# any word containing 'x' surrounded by vowels
words |> str_view('[aeiou]x[aeoiu]') 
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
# any word containing 'x' whose immediate-left char is a vowel
words |> str_view('[aeiou]x') 
[108] │ b<ox>
[284] │ <ex>act
[285] │ <ex>ample
[286] │ <ex>cept
[287] │ <ex>cuse
[288] │ <ex>ercise
[289] │ <ex>ist
[290] │ <ex>pect
[291] │ <ex>pense
[292] │ <ex>perience
[293] │ <ex>plain
...

Character classes: invert

  • Can invert by using ^: [^abcd] matches any single character except 'a', 'b', 'c', 'd'
# any word containing 'y' surrounded by consonants
words |> str_view('[^aeiou]y[^aeiou]')  
[836] │ <sys>tem
[901] │ <typ>e
# any word containing 'y' whose immediate-left char is a vowel and whose immediate-right char is a consonant
words |> str_view('[aeiou]y[^aeiou]')  
 [35] │ alw<ays>
[510] │ m<ayb>e
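
An inverted class such as [^aeiou] matches any character that is not a lowercase vowel, which includes spaces and punctuation, not only consonants. A small sketch on a hypothetical vector:

```r
library(stringr)

# Made-up strings for illustration
x <- c('gym', 'day', 'yes', 'my cat')

# 'y' with a non-vowel on both sides
# 'my cat' matches because the space after 'y' is also a non-vowel
non_vowel_both <- str_detect(x, '[^aeiou]y[^aeiou]')

# 'y' whose immediate-left character is a vowel
vowel_left <- str_detect(x, '[aeiou]y')

print(non_vowel_both)
print(vowel_left)
```

'yes' fails the first pattern because there is no character to the left of 'y' for [^aeiou] to consume.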

Alternation: |

alternation | picks between alternative patterns

# fruits containing 'apple', 'melon', or 'nut'
fruit |> str_view('apple|melon|nut')
 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
# fruits containing a doubled vowel
fruit |> str_view('aa|ee|ii|oo|uu')
 [9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n

str_detect(character_vector, pattern)

returns a logical vector: TRUE if pattern matches element of vector

c('a', 'b', 'c', 'd', 'e', 'f') |> str_detect('[aeiou]')
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
  • Since it returns a logical vector, can be used with filter()
str(babynames)
tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
 $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
 $ sex : chr [1:1924665] "F" "F" "F" "F" ...
 $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
 $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
 $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
# most popular names containing an `'x'`
babynames |> 
  filter(str_detect(name, 'x')) |>  # retains rows with the names containing an `'x'`
  count(name, wt = n, sort = TRUE)  # wt: computes sum(n) for each group
# A tibble: 974 × 2
   name            n
   <chr>       <int>
 1 Alexander  665492
 2 Alexis     399551
 3 Alex       278705
 4 Alexandra  232223
 5 Max        148787
 6 Alexa      123032
 7 Maxine     112261
 8 Alexandria  97679
...

str_detect(character_vector, pattern)

  • Can also use str_detect() in conjunction with group_by(), summarize(), etc.
    • sum() will return number of strings which have pattern
    • mean() will return proportion of strings which have pattern
  • E.g. proportion of names per year that have an 'x'
babynames |> 
  group_by(year) |>
  summarize(prop_x = mean(str_detect(name, 'x'))) |>
  arrange(desc(prop_x))
# A tibble: 138 × 2
    year prop_x
   <dbl>  <dbl>
 1  2016 0.0163
 2  2017 0.0159
 3  2015 0.0154
 4  2014 0.0146
 5  2013 0.0145
 6  2012 0.0136
 7  2011 0.0130
 8  2010 0.0126
...
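
The sum() and mean() idea can be checked by hand on a tiny vector. This is only a sketch; names_demo is a made-up vector, not part of babynames:

```r
library(stringr)

# Hypothetical names; note that 'Xavier' has an upper-case 'X' only
names_demo <- c('Alex', 'Max', 'Mary', 'Xavier')

has_x <- str_detect(names_demo, 'x')  # logical vector, case sensitive

n_with_x <- sum(has_x)      # TRUE counts as 1, so this is the number of matches
prop_with_x <- mean(has_x)  # and this is the proportion of matches

print(has_x)
print(n_with_x)
print(prop_with_x)
```

Because matching is case sensitive, 'Xavier' is not counted here.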

str_count(): Counting matches

  • str_count() tells how many matches there are in a string
c('apple', 'banana', 'currant', 'eggplant', 'pear') |> str_count('p')
[1] 2 0 0 1 1
c('apple', 'banana', 'currant', 'eggplant', 'pear') |> str_count('an')
[1] 0 2 1 1 0
  • Regex matches never overlap - always start after the end of previous match
'abababa' |> str_count('aba')
[1] 2
'abababa' |> str_view('aba')
[1] │ <aba>b<aba>

str_count(): Counting vowels and consonants in baby names

Can use str_count() with mutate.

# Compute number of vowels/consonants in baby names
babynames |>
  count(name) |>  # compute number of occurrences per unique name
  mutate(vowels = str_count(name, '[aeiou]'),
         consonants = str_count(name, '[^aeiou]'))
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      2          3
 2 Aabha         5      2          3
 3 Aabid         2      2          3
 4 Aabir         1      2          3
 5 Aabriella     5      4          5
 6 Aada          1      2          2
 7 Aadam        26      2          3
 8 Aadan        11      2          3
...

Pattern matching is case sensitive, so the leading 'A' isn’t counted as a vowel (it is counted as a consonant instead). Ways around this:

  • Add upper-case vowels to character class: name |> str_count('[aeiouAEIOU]')
# Compute number of vowels/consonants in baby names
babynames |>
  count(name) |>  # compute number of occurrences per unique name
  mutate(vowels = str_count(name, '[aeiouAEIOU]'),
         consonants = str_count(name, '[^aeiouAEIOU]'))
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      3          2
 2 Aabha         5      3          2
 3 Aabid         2      3          2
 4 Aabir         1      3          2
 5 Aabriella     5      5          4
 6 Aada          1      3          1
 7 Aadam        26      3          2
 8 Aadan        11      3          2
...
  • Convert the names to lower case: str_to_lower(name) |> str_count('[aeiou]')
babynames |> 
  count(name) |> 
  mutate(name = str_to_lower(name),
         vowels = str_count(name, '[aeiou]'),
         consonants = str_count(name, '[^aeiou]'))
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 aaban        10      3          2
 2 aabha         5      3          2
 3 aabid         2      3          2
 4 aabir         1      3          2
 5 aabriella     5      5          4
 6 aada          1      3          1
 7 aadam        26      3          2
 8 aadan        11      3          2
...
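
A third possibility, sketched here as an alternative to the two approaches above: wrap the pattern in stringr's regex() helper with ignore_case = TRUE. The vector nm is a made-up sample, and the consonant count assumes the names contain letters only:

```r
library(stringr)

# Two names taken from the output above, used as a small sample
nm <- c('Aaban', 'Aada')

# Case-insensitive vowel count: 'A' and 'a' both match
vowels <- str_count(nm, regex('[aeiou]', ignore_case = TRUE))

# Assumption: names are purely alphabetic, so the rest are consonants
consonants <- str_length(nm) - vowels

print(vowels)
print(consonants)
```

This keeps the original capitalization of name in the output, unlike str_to_lower().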

str_replace(): Replace and remove values

(x <- c('apple', 'pear', 'banana'))
[1] "apple"  "pear"   "banana"
x |> str_replace('[aeiou]', '-')  # replaces first match
[1] "-pple"  "p-ar"   "b-nana"
x |> str_replace_all('[aeiou]', '-')  # replace all matches
[1] "-ppl-"  "p--r"   "b-n-n-"

You can remove patterns by…

x |> str_replace('[aeiou]', '')  # `replacement = ''`
[1] "pple"  "par"   "bnana"
x |> str_remove('[aeiou]')
[1] "pple"  "par"   "bnana"
x |> str_replace_all('[aeiou]', '')  # `replacement = ''`
[1] "ppl" "pr"  "bnn"
x |> str_remove_all('[aeiou]')
[1] "ppl" "pr"  "bnn"

Ranges of characters

Task: if a string has a letter between 'a' and 'u', replace it with 'z'.

  • Could spell out manually all letters between 'a' and 'u'. Yuck.
  • Instead, you can use the character class operator [] together with hyphen -.

The above example:

(x <- c('happy', 'ab', 'zap'))
[1] "happy" "ab"    "zap"  
# replace each letter between a and u with z
x |> str_replace_all('[a-u]', 'z')
[1] "zzzzy" "zz"    "zzz"  

An example with numbers:

(x <- c('code9202', 'apple2850', '0352'))
[1] "code9202"  "apple2850" "0352"     
# replace each digit between 0 and 5 with z
x |> str_replace_all('[0-5]', 'z') 
[1] "code9zzz"  "applez8zz" "zzzz"     

Very useful to use ranges in conjunction with ?, *, +

# Find all words with at least three consecutive vowels.
words |> str_view('[aeiou][aeiou][aeiou]+')
 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
[915] │ var<iou>s

Useful for parsing strings which are partitioned by letters/numbers

(name_score <- c('Mary_92', 'Pat_35', 'Will_85'))
[1] "Mary_92" "Pat_35"  "Will_85"
name_score |> 
  str_replace('[a-zA-Z]+', 'John') |>  # replace the name (first run of letters) in each string with John
  str_replace('[0-9]+', '100')  # replace the score (first run of digits) in each string with 100
[1] "John_100" "John_100" "John_100"
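
The same ranges can pull the pieces out instead of overwriting them. A minimal sketch using str_extract(), which returns the first match in each string; the variable names nm and score are illustrative:

```r
library(stringr)

name_score <- c('Mary_92', 'Pat_35', 'Will_85')

# First run of letters in each string
nm <- str_extract(name_score, '[a-zA-Z]+')

# First run of digits; str_extract() returns character, so convert explicitly
score <- as.integer(str_extract(name_score, '[0-9]+'))

print(nm)
print(score)
```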

Extracting variables

separate_wider_regex(): split one string column into several columns using regex.

df <- tribble(
  ~str,
  '<Sheryl>-F_34',
  '<Kisha>-F_45',
  '<Pat>-X_33',
  '<Sharon>-F_38',
  '<Penny>-F_58',
  '<Justin>-M_41',
  '<Patricia>-F_84',
)
  • To extract data, construct sequence of regex that match each piece.
  • If you want contents of that piece to appear in output, give it a name.
df |> separate_wider_regex(
    str,
    patterns = c(
      '<',
      name = '[A-Za-z]+',
      '>-',
      gender = '.',
      '_',
      age = '[0-9]+'))
# A tibble: 7 × 3
  name     gender age  
  <chr>    <chr>  <chr>
1 Sheryl   F      34   
2 Kisha    F      45   
3 Pat      X      33   
4 Sharon   F      38   
5 Penny    F      58   
6 Justin   M      41   
7 Patricia F      84   
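
Note that every extracted column comes back as character, including age. A short sketch of the same call on a two-row subset followed by an explicit conversion; it assumes tidyr 1.3.0 or later (where separate_wider_regex() is available), and df_small and out are hypothetical names:

```r
library(tidyr)
library(tibble)

# Two rows copied from the example data
df_small <- tibble(str = c('<Pat>-X_33', '<Justin>-M_41'))

out <- df_small |>
  separate_wider_regex(
    str,
    patterns = c(
      '<',
      name = '[A-Za-z]+',   # named pieces become columns
      '>-',
      gender = '.',
      '_',
      age = '[0-9]+'))

# The age column is character after splitting; convert it for numeric work
out$age <- as.integer(out$age)

print(out)
```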

Escaping

  • Since the characters '.', '?', '+', '*' have special meaning in regex, we must escape them to match literal instances of these characters
  • In regex, we require a \ in front of characters to denote an escape
  • To create a string with a literal \, we must use an escape, so double \\:
c('abc', 'a.c', 'bef') |> str_view('a\\.c')
[2] │ <a.c>
c('a*rdvark', '*pple', 'm*n') |> str_view('\\*')
[1] │ a<*>rdvark
[2] │ <*>pple
[3] │ m<*>n
  • Recall that to represent backslash in a string, need to escape:
str_view('a\\b')
[1] │ a\b
  • To match for a backslash, need to create a string which has an escape in front of a backslash.
  • The regex escape itself requires a double backslash in the string, and the literal \ being matched requires another double backslash: four backslashes in total.
'a\\b' |> str_view('\\\\')
[1] │ a<\>b
'mary.elizabeth' |> str_replace('\.', '-')
# Error: '\.' is an unrecognized escape in character string (<input>:1:33)
'mary.elizabeth' |> str_replace('\\.', '-')
[1] "mary-elizabeth"
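
If the doubled backslashes get hard to read, there are a few other ways to match a literal dot. The sketch below compares them; the variable names are illustrative, and the raw-string form assumes R 4.0 or later:

```r
library(stringr)

x <- 'mary.elizabeth'

via_escape <- str_replace(x, '\\.', '-')       # escaped dot, as above
via_class  <- str_replace(x, '[.]', '-')       # inside [] the dot is literal
via_fixed  <- str_replace(x, fixed('.'), '-')  # fixed(): no regex interpretation at all
via_raw    <- str_replace(x, r'(\.)', '-')     # raw string: backslash written once

# writeLines() shows the regex that the engine actually receives
writeLines('\\.')

print(c(via_escape, via_class, via_fixed, via_raw))
```

All four calls give the same result; which one reads best is largely a matter of taste.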

Anchors

  • By default: regex will match any part of a string.
  • To match only at beginning or end, you need to anchor:
fruit |> str_view('^a')  # `^` indicates 'starts with'
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
fruit |> str_view('a$')  # `$` indicates 'ends with'
 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
  • Example: for every fruit name which starts with 'a', replace that leading 'a' with an 'o'
fruit |> str_replace('^a', 'o')
 [1] "opple"             "opricot"           "ovocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
...
  • To match only the full string, not substrings, anchor it with both ^ and $:
fruit |> str_view('apple')
 [1] │ <apple>
[62] │ pine<apple>
fruit |> str_view('^apple$')
[1] │ <apple>

Anchors: boundaries of words

  • You can specify the beginning or end of the word using \b
    • This works by treating letters, digits, and the underscore as ‘word’ characters, and everything else as ‘non-word’ characters
x <- c('summary(x)', 'summarize(df)', 'rowsum(x)', 'sum(x)')
x |> str_view('sum')
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)
x |> str_view('\\bsum\\b')
[4] │ <sum>(x)
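
One detail that can surprise: the underscore counts as a word character, so there is no boundary between 'sum' and '_'. A quick sketch on a hypothetical vector:

```r
library(stringr)

# Made-up strings for illustration
x <- c('sum(x)', 'sum x', 'sum_x', 'rowsum')

# \b needs a word character on one side and a non-word character (or the
# start/end of the string) on the other
whole_word <- str_detect(x, '\\bsum\\b')

print(whole_word)
```

'sum_x' fails because '_' is a word character, and 'rowsum' fails because 'w' sits directly before 'sum'.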

Character sets

  • We have built sets with []: e.g. [ac] matches if any character is 'a' or 'c'.
  • We have used - to denote ranges, e.g. [a-z] lowercase letters, [0-9] numbers.

Some others:

(x <- 'abcd ABCD 12345 -!@#%.')
[1] "abcd ABCD 12345 -!@#%."
  • \d matches any digit;
  • \D matches anything that isn’t a digit.
x |> str_view('\\d+')
[1] │ abcd ABCD <12345> -!@#%.
x |> str_view('\\D+')
[1] │ <abcd ABCD >12345< -!@#%.>
  • \s matches any whitespace (e.g., space, tab, newline);
  • \S matches anything that isn’t whitespace.
x |> str_view('\\s+')
[1] │ abcd< >ABCD< >12345< >-!@#%.
x |> str_view('\\S+')
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
  • \w matches any “word” character, i.e. letters, digits, and the underscore;
  • \W matches any “non-word” character.
x |> str_view('\\w+')
[1] │ <abcd> <ABCD> <12345> -!@#%.
x |> str_view('\\W+')
[1] │ abcd< >ABCD< >12345< -!@#%.>

Remember: to represent \ in a string, need double backslash.
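
These shorthand classes pair naturally with str_extract_all(), which returns every match rather than just the first. A minimal sketch; the input strings are invented for illustration:

```r
library(stringr)

x <- 'id_7 costs 12 dollars'

# str_extract_all() returns a list with one element per input string
digits <- str_extract_all(x, '\\d+')[[1]]  # every run of digits
tokens <- str_extract_all(x, '\\w+')[[1]]  # every run of word characters

# Collapse any run of whitespace (spaces, tabs) to a single space
squeezed <- str_replace_all('a  b \t c', '\\s+', ' ')

print(digits)
print(tokens)
print(squeezed)
```

Note that 'id_7' stays in one piece, since the underscore is matched by \w.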

Quantifiers

We already discussed ? (0 or 1 match), + (1+ matches), * (0+ matches)

  • colou?r: matches both the American spelling 'color' and the British spelling 'colour'
  • \d+: matches 1+ digits
  • \s?: matches 0 or 1 whitespace character

Can specify exact number of matches:

  • {n} matches exactly n times.
  • {n,} matches at least n times.
  • {n,m} matches between n and m times.
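
A small sketch of how these quantifiers behave on toy inputs (the strings and variable names are made up). Quantifiers are greedy by default, so they take as many repetitions as the bounds allow:

```r
library(stringr)

# ? makes the 'u' optional, but allows at most one
spelling <- str_detect(c('color', 'colour', 'colouur'), 'colou?r')

# Brace quantifiers on a run of five 'a's
exactly_two <- str_extract('aaaaa', 'a{2}')    # exactly 2
two_or_more <- str_extract('aaaaa', 'a{2,}')   # at least 2, greedy
two_to_three <- str_extract('aaaaa', 'a{2,3}') # between 2 and 3, greedy

print(spelling)
print(c(exactly_two, two_or_more, two_to_three))
```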

Words with >= 3 consecutive vowels?

words |> str_view('[aeiou]{3,}')
 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
...

Words with between 4 and 6 consecutive consonants:

words |> str_view('[^aeiou]{4,6}')
 [45] │ a<pply>
[198] │ cou<ntry>
[424] │ indu<stry>
[830] │ su<pply>
[836] │ <syst>em

Order of operations in regex: precedence

Not immediately clear in which order R processes different operators.

  • ab+: is this 'a' and then 1+ 'b', or is it 'ab' repeated 1+ times? (1st case)
  • ^a|b$: match exactly the string 'a' or the string 'b', OR: a string starting with 'a' or a string ending with 'b'? (2nd case)

Generally: quantifiers (?+*) have high precedence, alternation | low.

Order of operations in regex: parentheses

You can also introduce parentheses to be more explicit about what you want.

words |> str_view('a(b+)') # same as `ab+`
  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ avail<ab>le
 [66] │ b<ab>y
[452] │ l<ab>our
[648] │ prob<ab>le
[837] │ t<ab>le
words |> str_view('(^a)|(b$)') # same as `^a|b$`
 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
 [6] │ <a>ccount
 [7] │ <a>chieve
 [8] │ <a>cross
 [9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
...
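
Parentheses can also change the meaning rather than just restate the default. The sketch below contrasts the default grouping with an explicit one on made-up inputs:

```r
library(stringr)

x <- c('a', 'ab', 'ba', 'b', 'xb', 'ax')

# Default precedence: (starts with 'a') OR (ends with 'b')
loose <- str_detect(x, '^a|b$')

# Explicit grouping: the whole string is exactly 'a' or exactly 'b'
tight <- str_detect(x, '^(a|b)$')

# Quantifiers bind to the single item before them unless grouped
b_repeated  <- str_extract('abab', 'ab+')    # 'a' then one or more 'b'
ab_repeated <- str_extract('abab', '(ab)+')  # 'ab' repeated one or more times

print(loose)
print(tight)
print(c(b_repeated, ab_repeated))
```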

Order of operations in regex: parentheses 2

With parentheses, you can back-reference matches that appeared in parentheses, using \1 for a match in the first parentheses, \2 for a match in the second, etc.

  • e.g. find all fruits which have a repeated pair of letters.
  • Pair of letters = '(..)'; back-ref: '\\1'
fruit |> str_view('(..)\\1')
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
  • Words that start and end with same pair of letters:
# 'starts with' a pair: ^(..)
# 'ends with': need to end regex with \\1$
# to allow any chars between, put .* in middle
words |> str_view('^(..).*\\1$')
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>
  • Words that are repetitions of the same pair of letters:
c('haha', 'miumiu') |> str_view('^(..)\\1+$')
[1] │ <haha>

Order of operations in regex: parentheses 3

Can back-reference in str_replace(). E.g. swap 2nd and 3rd words in sentence

sentences
  [1] "The birch canoe slid on the smooth planks."               
  [2] "Glue the sheet to the dark blue background."              
  [3] "It's easy to tell the depth of a well."                   
...
sentences |> str_replace('(\\w+) (\\w+) (\\w+)', '\\1 \\3 \\2')
  [1] "The canoe birch slid on the smooth planks."               
  [2] "Glue sheet the to the dark blue background."              
  [3] "It's to easy tell the depth of a well."                   
...
  • (\\w+): matches 1+ ‘word characters’ (letters, digits, underscore)
  • Spacing between (\\w+) ensures we are looking for sequences of the form: word-chars, space, word-chars, space, word-chars
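
The same back-reference trick works for reordering fields. A minimal sketch that turns 'Last, First' into 'First Last'; the names are invented for illustration:

```r
library(stringr)

# Hypothetical 'Last, First' strings
people <- c('Doe, Jane', 'Smith, Sam')

# Group 1 = last name, group 2 = first name; swap them in the replacement
flipped <- str_replace(people, '(\\w+), (\\w+)', '\\2 \\1')

print(flipped)
```

The comma and space in the pattern sit outside the groups, so they are dropped from the result.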

Examples

Words that start with 'y':

words |> str_view('^y')
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
...

Words that don’t start with 'y':

words |> str_view('^[^y]')
 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
...

Ends with a vowel-vowel-consonant triplet:

words |> str_view('[aeiou]{2}[^aeiou]$')
  [3] │ ab<out>
 [11] │ act<ual>
 [19] │ aftern<oon>
 [20] │ ag<ain>
 [26] │ <air>
...

Has 7 or more letters:

words |> str_view('[a-z]{7,}')
 [4] │ <absolute>
 [6] │ <account>
 [7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
...

Boolean operations

We have seen how ^ inside [] negates the set, e.g. to find words with no vowels:

words |> str_view('^[^aeiou]+$')
[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
[952] │ <why>
  • There’s no ‘and’ operator built into regex, which can complicate certain tasks.
  • Another way: return logical vector indicating presence of vowels, then negate:
words[!str_detect(words, '[aeiou]')]
[1] "by"  "dry" "fly" "mrs" "try" "why"
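
Instead of writing ! in front, str_detect() and str_subset() also accept negate = TRUE. A short sketch on a made-up vector:

```r
library(stringr)

# Hypothetical sample; 'cat' is the only string containing a vowel
x <- c('by', 'dry', 'cat', 'mrs')

# TRUE where the pattern does NOT match
no_vowel <- str_detect(x, '[aeiou]', negate = TRUE)

# str_subset() returns the strings themselves
kept <- str_subset(x, '[aeiou]', negate = TRUE)

print(no_vowel)
print(kept)
```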

Boolean operations: examples

Find all words containing both an 'a' and a 'b'

# Trickier in standard regex
words |> str_view('a.*b|b.*a')
  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ <availab>le
...
  • Easier with str_detect() and &:
words[
    str_detect(words, 'a') & 
    str_detect(words, 'b')
]
 [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
 [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
[13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
...
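
The two approaches select the same strings, which can be confirmed on a small made-up vector:

```r
library(stringr)

x <- c('able', 'baby', 'bed', 'cat')

# Two simple patterns combined with the logical AND operator
both_and <- str_detect(x, 'a') & str_detect(x, 'b')

# One regex covering both orders: 'a' before 'b', or 'b' before 'a'
both_regex <- str_detect(x, 'a.*b|b.*a')

print(both_and)
print(identical(both_and, both_regex))
```

The single-regex version needs one alternative per ordering, which is why it grows quickly as more letters are required.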

Find all words containing 'a', 'e', 'i', and 'o'.

  • Standard regex solution would be very complex.
  • Much easier using str_detect() and &:
words[
  str_detect(words, 'a') &
  str_detect(words, 'e') &
  str_detect(words, 'i') &
  str_detect(words, 'o')
]
[1] "appropriate" "associate"   "organize"    "relation"