05: Transformations of strings

STA35B: Statistical Data Science 2

Akira Horiguchi

We’ll focus on transforming strings.

library(tidyverse)
library(babynames)

We’ll primarily work with stringr, which has functions that start with str_.

Creating strings

Creating strings from scratch

  • Can create strings using either ' or " - single or double quotes
  • If you want quotes within your string, use ' on outside and " on inside (or reverse)
string1 <- "example of a string"
string2 <- 'this string has a "quote" inside of it'
  • In RStudio, if you highlight text and then press ' or ", it puts quotes around it
  • If you forget to close a quote, the console prints + and waits for you to finish the string
    • This can be confusing: every new line just produces another +. Press Esc to cancel and start over.
> "This is a string without a closing quote
+ 
+ 
+ more text

Escapes

  • To include a literal single or double quote in a string, use \ to escape it.
  • This is what R is implicitly doing when you put quotes inside of strings.
(string2 <- 'this string has a "quote" inside of it')
[1] "this string has a \"quote\" inside of it"
(string3 <- "this string has a \"quote\" inside of it")
[1] "this string has a \"quote\" inside of it"
string2 == string3
[1] TRUE
  • Another special character you need to escape: \, using \\.
x <- c('\'', "\"", "\\")
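To see the difference between the printed representation and the actual contents, a minimal sketch using only base R (it reuses the vector x from above):

```r
# The vector from the slide: a single quote, a double quote, a backslash
x <- c('\'', "\"", "\\")

print(x)       # printed representation: escapes are shown
writeLines(x)  # raw contents: what the strings actually contain
nchar(x)       # each element is exactly one character long
```

The escape character is part of how the string is typed and printed, not part of the string itself.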

There are other special characters (next slide).

Other special characters

  • In addition to \", \', \\, there is:
  • \n: new line
  • \t: tab
  • \u or \U: unicode characters
  • Base R function writeLines() writes the raw contents of a string, similar to stringr::str_view()
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
[1] "one\ntwo" "one\ttwo" "µ"        "😄"      
writeLines(x)
one
two
one two
µ
😄

Examples

Create a string…

… with value He said "That's amazing!"

x <- 'He said "That\'s amazing!"'
x
[1] "He said \"That's amazing!\""
writeLines(x)
He said "That's amazing!"
y <- "He said \"That's amazing!\""
y
[1] "He said \"That's amazing!\""
writeLines(y)
He said "That's amazing!"

… with value \a\b\c

x <- "\\a\\b\\c"
x
[1] "\\a\\b\\c"
writeLines(x)
\a\b\c

Creating strings from data

  • We’ll now go over ways to create new strings from tibbles.
  • There are many functions that work well with dplyr
    • str_c()
    • str_glue()
    • str_flatten()

str_c(): create strings by concatenation

  • Concatenates any number of vectors and returns a character vector
( str_c("x", "y") )
[1] "xy"
( str_c("x", "y", "z") )
[1] "xyz"
( str_c("Hello ", c("Bug", "Dog")))  # element-wise operation
[1] "Hello Bug" "Hello Dog"
  • Similar to paste0() in base R, but friendlier for dplyr: it obeys the tidyverse rules for recycling and propagating missing values.

Compare str_c() vs paste0()

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))
df
# A tibble: 4 × 1
  nm   
  <chr>
1 Bug  
2 Cat  
3 Ant  
4 <NA> 
df %>% 
  mutate(gr=str_c("Hi ", nm, "!"))
# A tibble: 4 × 2
  nm    gr     
  <chr> <chr>  
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  <NA>   
df %>% 
  mutate(gr=paste0("Hi ", nm, "!"))
# A tibble: 4 × 2
  nm    gr     
  <chr> <chr>  
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  Hi NA! 
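If you want str_c() to show something else in place of a missing value, one option is to replace the NA first with dplyr::coalesce(). A small sketch (the fallback text "you" is just an illustrative choice):

```r
library(stringr)
library(dplyr)

nm <- c("Bug", NA)

a <- str_c("Hi ", nm, "!")                    # NA propagates
b <- paste0("Hi ", nm, "!")                   # NA becomes the text "NA"
d <- str_c("Hi ", coalesce(nm, "you"), "!")   # NA replaced before concatenating

a
b
d
```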

str_glue(): create strings with {}

  • If you’re mixing strings with variables, lots of "s make it hard to read
  • str_glue(): anything inside of {} will be evaluated like it doesn’t have quotes:
  • similar to Python’s f strings
df %>%  # matches behavior of `paste0()`
  mutate(gr = str_glue("Hi {nm}!"))
# A tibble: 4 × 2
  nm    gr     
  <chr> <glue> 
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  Hi NA! 
  • By default, str_glue() turns NA into the literal string "NA", which is inconsistent with str_c(). Setting .na = NULL propagates NA instead, matching str_c().
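Anything inside {} is evaluated as R code, so expressions work as well as plain variable names. A minimal sketch (the variables animal and n_fruit are made up for illustration):

```r
library(stringr)

# Hypothetical values, not from the slides
animal  <- "Bug"
n_fruit <- 3

# Both a variable and an arithmetic expression are evaluated inside {}
msg <- str_glue("{animal} has {n_fruit} fruits, or {n_fruit * 2} halves.")
msg
```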

df %>%  # matches behavior of `str_c()`
  mutate(gr = str_glue("Hi {nm}!", .na=NULL))
# A tibble: 4 × 2
  nm    gr     
  <chr> <glue> 
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  <NA>   

df %>%  # For literal `{` or `}`, use double `{` or `}`
  mutate(gr = str_glue("Hi {{{nm}}}!", .na=NULL))
# A tibble: 4 × 2
  nm    gr       
  <chr> <glue>   
1 Bug   Hi {Bug}!
2 Cat   Hi {Cat}!
3 Ant   Hi {Ant}!
4 <NA>  <NA>     

str_flatten(): create strings from a vector

  • If operating over vectors, str_c() and str_glue() return vectors of the same length. This is useful for mutate().
  • If we instead want to e.g. concatenate all strings in a group (summarize()), we can use str_flatten()
( str_flatten(c("x", "y", "z")) )
[1] "xyz"
( str_flatten(c("x", "y", "z"), ", "))
[1] "x, y, z"
( str_flatten(c("x", "y", "z"), ", ", last = ", and ") )
[1] "x, y, and z"

str_flatten(): create strings from a vector

  • Allows for easy computation of gluing together strings per group:
df
# A tibble: 5 × 2
  name  fruit    
  <chr> <chr>    
1 Cat   banana   
2 Cat   apple    
3 Bug   nectarine
4 Ant   orange   
5 Ant   papaya   
df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", "))
# A tibble: 3 × 2
  name  fruits        
  <chr> <chr>         
1 Ant   orange, papaya
2 Bug   nectarine     
3 Cat   banana, apple 
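The slide prints df without showing how it was built. A sketch that reproduces the example, assuming the tibble was created with tribble() from the values printed above:

```r
library(tidyverse)

# Reconstructed from the printed tibble; the construction itself is an assumption
df <- tribble(
  ~name, ~fruit,
  "Cat", "banana",
  "Cat", "apple",
  "Bug", "nectarine",
  "Ant", "orange",
  "Ant", "papaya"
)

# One row per name, with that group's fruits glued into a single string
out <- df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", "))
out
```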

Extracting data from strings

Individual characters in a string: str_length()

str_length(): returns number of characters in the string

str_length(c("a", "R for data science", NA))
[1]  1 18 NA
str_length(c(1, 44))
[1] 1 2
str_length(c(TRUE, FALSE))
[1] 4 5

Individual characters in a string: str_sub()

str_sub(string, start, end): subsets string from start index to end index.

x <- c("Apple", "Banana", "Orange", NA)
str_sub(x, 1, 1)
[1] "A" "B" "O" NA 
str_sub(x, 1, 4)
[1] "Appl" "Bana" "Oran" NA    
str_sub(x, 2, 4)
[1] "ppl" "ana" "ran" NA   

start and end can be negative: -1 is last char, -2 second to last, etc.

str_sub(x, -3, -1)
[1] "ple" "ana" "nge" NA   
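Because str_sub() is vectorized, it works inside mutate(). A short sketch pulling out the first and last letter of each string (the tibble fruits is made up for illustration); it also shows that str_sub() does not fail when a string is shorter than the requested range:

```r
library(tidyverse)

fruits <- tibble(fruit = c("Apple", "Banana", "Orange"))

out <- fruits %>%
  mutate(
    first = str_sub(fruit, 1, 1),    # first character
    last  = str_sub(fruit, -1, -1)   # last character
  )
out

# A too-short string: str_sub() returns as much as it can
short <- str_sub("a", 1, 5)
short
```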

Extracting data from strings

We’ll focus on four useful tidyr functions for extracting data from strings:

df |> separate_longer_delim(col, delim)
df |> separate_longer_position(col, width)
df |> separate_wider_delim(col, delim, names)
df |> separate_wider_position(col, widths)
  • _longer splits each string into several rows, making df longer
  • _wider splits each string into several columns, making df wider
  • _delim splits up a string with a delimiter like ", " or " "
  • _position splits at specified widths of the string, like c(3,5,2)

Examples of each will be shown in the next few slides.

separate_longer_delim()

  • _longer splits each string into several rows, making df longer
  • _delim splits up a string with a delimiter like ", " or " "
df1
# A tibble: 3 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Bug   F      
3 Ant   C,D    
df1 %>% 
  separate_longer_delim(grades, delim = ",")
# A tibble: 7 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Bug   F     
6 Ant   C     
7 Ant   D     
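To run this yourself, df1 has to be defined first. A sketch that rebuilds it from the printed values (the tribble() construction is an assumption) and then counts grades per student once the data is in long form:

```r
library(tidyverse)

# Reconstructed from the printed tibble
df1 <- tribble(
  ~name, ~grades,
  "Cat", "A,B,B,A",
  "Bug", "F",
  "Ant", "C,D"
)

# One row per grade
long <- df1 %>%
  separate_longer_delim(grades, delim = ",")

# With one grade per row, ordinary dplyr verbs apply
per_student <- long %>% count(name)
per_student
```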

df1 %>%   # wrong delimiter
  separate_longer_delim(grades, delim = ".")
# A tibble: 3 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Bug   F      
3 Ant   C,D    

df1 %>%   # empty delimiter
  separate_longer_delim(grades, delim = "")
Warning in stri_split_fixed(string, pattern, n = n, simplify = simplify, :
empty search patterns are not supported
# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   <NA>  
2 Bug   <NA>  
3 Ant   <NA>  

separate_longer_position()

Less common, but sometimes each character in a value records a value itself

  • e.g. if you record all grades for each student in a single continuous string:
df2
# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Bug   F     
3 Ant   CD    
df2 %>% 
  separate_longer_position(grades, width = 1)
# A tibble: 7 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Bug   F     
6 Ant   C     
7 Ant   D     

df2 %>% 
  separate_longer_position(grades, width = 2)
# A tibble: 4 × 2
  name  grades
  <chr> <chr> 
1 Cat   AB    
2 Cat   BA    
3 Bug   F     
4 Ant   CD    

df2 %>% 
  separate_longer_position(grades, width = 3)
# A tibble: 4 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABB   
2 Cat   A     
3 Bug   F     
4 Ant   CD    

df2 %>% 
  separate_longer_position(grades, width = 4)
# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Bug   F     
3 Ant   CD    
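The effect of width can be summarized in one go. A sketch that rebuilds df2 from the printed values (the construction is an assumption) and counts the resulting rows for widths 1 through 4:

```r
library(tidyverse)

# Reconstructed from the printed tibble
df2 <- tribble(
  ~name, ~grades,
  "Cat", "ABBA",
  "Bug", "F",
  "Ant", "CD"
)

# Number of rows produced for each width
rows_by_width <- sapply(1:4, function(w) {
  nrow(separate_longer_position(df2, grades, width = w))
})
rows_by_width
```

Each string is cut into chunks of at most width characters, so a larger width gives fewer (or the same number of) rows.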

separate_longer_...()

Compare delim vs position based on different formatting:

df3
# A tibble: 2 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Ant   C,D    
df3 %>% separate_longer_delim(grades, delim = ",")
# A tibble: 6 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Ant   C     
6 Ant   D     
df4
# A tibble: 2 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Ant   CD    
df4 %>% separate_longer_position(grades, width = 1)
# A tibble: 6 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Ant   C     
6 Ant   D     

separate_wider_delim()

Separates into columns (wider)

Consider the following tibble (note that the name df4 is reused for new data here):

df4
# A tibble: 2 × 1
  x         
  <chr>     
1 a10.1.2022
2 b10.2.2011
  • x has a code, edition number, and year, separated by "."

To separate, supply delimiter and names of new columns

df4 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
# A tibble: 2 × 3
  code  edition year 
  <chr> <chr>   <chr>
1 a10   1       2022 
2 b10   2       2011 
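Two details are worth noting in the call above: delim is treated as a fixed string by default (so "." means a literal dot, not a regular-expression wildcard), and the new columns are all character. A sketch that rebuilds the tibble from the printed values and converts year afterwards (the as.integer() step is an illustrative addition, not from the slides):

```r
library(tidyverse)

# Reconstructed from the printed tibble
df4 <- tibble(x = c("a10.1.2022", "b10.2.2011"))

out <- df4 %>%
  separate_wider_delim(
    x,
    delim = ".",                          # literal dot
    names = c("code", "edition", "year")
  ) %>%
  mutate(year = as.integer(year))         # pieces come back as <chr>
out
```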

To remove output columns, supply NA for name of those columns

df4 |>
  separate_wider_delim(
    x, 
    delim = ".",
    names = c("code", NA, "year")
  )
# A tibble: 2 × 2
  code  year 
  <chr> <chr>
1 a10   2022 
2 b10   2011 

df4 |>
  separate_wider_delim(
    x, 
    delim = ".",
    names = c("code", NA, NA)
  )
# A tibble: 2 × 1
  code 
  <chr>
1 a10  
2 b10  

separate_wider_position()

Separates into columns (wider)

Consider the following tibble:

df5
# A tibble: 3 × 1
  x       
  <chr>   
1 202215TX
2 202122LA
3 202325CA
  • x has
    • year (first 4 chars)
    • age (next 2 chars)
    • state (next 2 chars)

To separate, supply named integer vector:

  • name = name of new column,
  • value = number of characters
df5 %>% 
  separate_wider_position(
    x, 
    widths = c(year=4, age=2, state=2)
  )
# A tibble: 3 × 3
  year  age   state
  <chr> <chr> <chr>
1 2022  15    TX   
2 2021  22    LA   
3 2023  25    CA   
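The widths need to add up to the length of each string, so it can help to check that before splitting. A sketch that rebuilds df5 from the printed values (the construction is an assumption):

```r
library(tidyverse)

# Reconstructed from the printed tibble
df5 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

widths <- c(year = 4, age = 2, state = 2)

# Sanity check: every string is exactly as long as the widths require
all_match <- all(str_length(df5$x) == sum(widths))
all_match

out <- df5 %>%
  separate_wider_position(x, widths = widths)
out
```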

To remove output columns, leave the corresponding widths unnamed in the vector.

df5 %>% 
  separate_wider_position(
    x, 
    widths = c(year=4, 2, state=2)
  )
# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA   

Alternatively, just use select(-name):

df5 %>% 
  separate_wider_position(
    x,
    widths = c(year=4, age=2, state=2)
  ) %>%
  select(-age)
# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA   

Diagnosing widening problems

separate_wider_delim() requires a fixed and known set of columns

  • Problem if some rows don’t have expected number of pieces.
  • too_few and too_many args of separate_wider_delim() can help here.
df
# A tibble: 4 × 1
  u    
  <chr>
1 1-1-1
2 1-3  
3 1-3-2
4 1    
df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `u`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.

Let’s try its suggestion to use debug:

Diagnosing widening problems: too_few

df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.
# A tibble: 4 × 7
  x     y     z     u     u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr> <lgl>    <int> <chr>      
1 1     1     1     1-1-1 TRUE         3 ""         
2 1     3     <NA>  1-3   FALSE        2 ""         
3 1     3     2     1-3-2 TRUE         3 ""         
4 1     <NA>  <NA>  1     FALSE        1 ""         

Three columns get added:

  1. u_ok: whether each input was split successfully (FALSE marks the rows that failed)
  2. u_pieces: how many pieces were found
  3. u_remainder: the leftover text; empty when there are too few pieces, but useful when there are too many.

Using debug will typically reveal a problem with delimiter strategy

  • suggests need to preprocess the tibble

Diagnosing widening problems: too_few

Can fill in missing pieces with NAs by either…

…putting NAs at tail end

df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_start'
  )
# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     3     <NA> 
3 1     3     2    
4 1     <NA>  <NA> 

…putting NAs at front end

df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_end'
  )
# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 <NA>  1     3    
3 1     3     2    
4 <NA>  <NA>  1    
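Before choosing a too_few strategy, you can count the pieces yourself. A sketch that rebuilds the tibble from the printed values (the construction is an assumption) and counts delimiters with str_count(); the result should agree with the u_pieces column from debug mode:

```r
library(tidyverse)

# Reconstructed from the printed tibble
df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# Number of pieces = number of delimiters + 1
pieces <- df %>%
  mutate(n_pieces = str_count(u, "-") + 1)
pieces

# Pad short rows with NA at the end
padded <- df %>%
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start"
  )
padded
```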

Diagnosing widening problems: too_many

Same principles apply for too many pieces.

df
# A tibble: 4 × 1
  u        
  <chr>    
1 1-1-1    
2 1-1-2    
3 1-3-5-6  
4 1-3-5-7-9
df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
# Error in `separate_wider_delim()`:
# ! Expected 3 pieces in each element of `u`.
# ! 2 values were too long.
# ℹ Use `too_many = "debug"` to diagnose the problem.
# ℹ Use `too_many = "drop"/"merge"` to silence this message.

Diagnosing widening problems: too_many

Debugging shows purpose of u_remainder column:

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'debug'
  )
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.
# A tibble: 4 × 7
  x     y     z     u         u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr>     <lgl>    <int> <chr>      
1 1     1     1     1-1-1     TRUE         3 ""         
2 1     1     2     1-1-2     TRUE         3 ""         
3 1     3     5     1-3-5-6   FALSE        4 "-6"       
4 1     3     5     1-3-5-7-9 FALSE        5 "-7-9"     

Diagnosing widening problems: too_many

To handle too many pieces, you can either…

“merge” the extras into a single column

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'merge'
  )
# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5-6  
4 1     3     5-7-9

or “drop” the extras

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'drop'
  )
# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5    
4 1     3     5
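too_few and too_many can be set in the same call, which is handy when a column has both kinds of problem. A sketch on a small made-up tibble (the data in mixed is hypothetical and combines rows from the two examples above):

```r
library(tidyverse)

# Hypothetical data: one good row, one too short, one too long
mixed <- tibble(u = c("1-1-1", "1-3", "1-3-5-6"))

out <- mixed %>%
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few  = "align_start",  # pad short rows with NA at the end
    too_many = "drop"          # discard extra pieces
  )
out
```

Both options silence the error without telling you which rows were affected, so running with "debug" first is the safer way to see what is being padded or dropped.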