05: Transformations of strings

STA35B: Statistical Data Science 2

Akira Horiguchi

We’ll focus on transforming strings.

library(tidyverse)
library(babynames)

We’ll primarily work with stringr, which has functions that start with str_.

Creating strings

Creating strings from scratch

Can create strings using either ' or " - single or double quotes
If you want quotes within your string, use ' on outside and " on inside (or reverse)

string1 <- "example of a string"
string2 <- 'this string has a "quote" inside of it'

In RStudio, if you highlight text and then press ' or ", it puts quotes around it
If you forget to close a quote, the console prints + and waits for you to complete the string
- This can look like a confusing, never-ending error in the console; press Escape to cancel and try again.

> "This is a string without a closing quote
+ 
+ 
+ more text

Escapes

To include a literal single or double quote in a string, use \ to escape it.
R applies the same escaping implicitly when it prints a string that contains quotes.

(string2 <- 'this string has a "quote" inside of it')

[1] "this string has a \"quote\" inside of it"

(string3 <- "this string has a \"quote\" inside of it")

[1] "this string has a \"quote\" inside of it"

string2 == string3

[1] TRUE

Another special character you need to escape: \, using \\.

x <- c('\'', "\"", "\\")
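A quick check, as a minimal sketch: the printed form of a string shows the escapes, but each element of x holds a single character. Comparing the printed vector with writeLines() and str_length() makes the difference visible.

```r
library(stringr)

x <- c('\'', "\"", "\\")

x               # printed representation, shown with escapes
writeLines(x)   # actual contents: a single quote, a double quote, a backslash
str_length(x)   # each element is exactly one character long
```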

There are other special characters (next slide).

Other special characters

In addition to \", \', \\, there is:
\n: new line
\t: tab
\u or \U: unicode characters
Base R function writeLines() writes the raw contents of a string, similar to stringr::str_view()

x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x

[1] "one\ntwo" "one\ttwo" "µ"        "😄"

writeLines(x)

one
two
one two
µ
😄

Examples

Create a string…

… with value He said "That's amazing!"

x <- 'He said "That\'s amazing!"'
x

[1] "He said \"That's amazing!\""

writeLines(x)

He said "That's amazing!"

y <- "He said \"That's amazing!\""
y

[1] "He said \"That's amazing!\""

writeLines(y)

He said "That's amazing!"

… with value \a\b\c

x <- "\\a\\b\\c"
x

[1] "\\a\\b\\c"

writeLines(x)

\a\b\c
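As an aside that goes slightly beyond the examples above: R 4.0.0 and later also accept raw strings, written r"(...)", in which backslashes and quotes need no escaping. A minimal sketch, assuming R >= 4.0.0, showing that both spellings give the same value (the names x_escaped, x_raw, and y_raw are illustrative):

```r
# Raw string syntax requires R >= 4.0.0.
x_escaped <- "\\a\\b\\c"
x_raw     <- r"(\a\b\c)"
identical(x_escaped, x_raw)   # both hold the same characters

# Quotes inside a raw string need no backslash either.
y_raw <- r"(He said "That's amazing!")"
writeLines(y_raw)
```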

Creating strings from data

We’ll now go over ways to create new strings from tibbles.
There are many functions that work well with dplyr
- str_c()
- str_glue()
- str_flatten()

`str_c()`: create strings by concatenation

Concatenates any number of vectors and returns a character vector

( str_c("x", "y") )

[1] "xy"

( str_c("x", "y", "z") )

[1] "xyz"

( str_c("Hello ", c("Bug", "Dog")))  # element-wise operation

[1] "Hello Bug" "Hello Dog"

Similar to paste0() in base R, but friendlier for dplyr: it obeys tidyverse rules for recycling and propagating missing values.

Compare `str_c()` vs `paste0()`

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))
df

# A tibble: 4 × 1
  nm   
  <chr>
1 Bug  
2 Cat  
3 Ant  
4 <NA>

df %>% 
  mutate(gr=str_c("Hi ", nm, "!"))

# A tibble: 4 × 2
  nm    gr     
  <chr> <chr>  
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  <NA>

df %>% 
  mutate(gr=paste0("Hi ", nm, "!"))

# A tibble: 4 × 2
  nm    gr     
  <chr> <chr>  
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  Hi NA!
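If a missing value should be replaced rather than propagated, one option is to fill it in before concatenating. A minimal sketch using dplyr::coalesce(); the fallback text "you" and the name greeted are illustrative choices, not from the slides.

```r
library(dplyr)
library(stringr)

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))

# coalesce() swaps each NA for the fallback before str_c() sees it,
# so the greeting is never NA and never contains the literal text "NA".
greeted <- df %>%
  mutate(gr = str_c("Hi ", coalesce(nm, "you"), "!"))
greeted
```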

`str_glue()`: create strings with `{}`

If you’re mixing strings with variables, lots of "s make it hard to read
str_glue(): anything inside {} is evaluated as R code, as if it were outside the quotes
Similar to Python’s f-strings

df %>%  # matches behavior of `paste0()`
  mutate(gr = str_glue("Hi {nm}!"))

# A tibble: 4 × 2
  nm    gr     
  <chr> <glue> 
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  Hi NA!

By default, str_glue() turns NA into the literal string "NA", which is inconsistent with str_c(). Setting .na = NULL makes missing values propagate, matching str_c().

df %>%  # matches behavior of `str_c()`
  mutate(gr = str_glue("Hi {nm}!", .na=NULL))

# A tibble: 4 × 2
  nm    gr     
  <chr> <glue> 
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  <NA>

df %>%  # For literal `{` or `}`, use double `{` or `}`
  mutate(gr = str_glue("Hi {{{nm}}}!", .na=NULL))

# A tibble: 4 × 2
  nm    gr       
  <chr> <glue>   
1 Bug   Hi {Bug}!
2 Cat   Hi {Cat}!
3 Ant   Hi {Ant}!
4 <NA>  <NA>
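Because the contents of {} are evaluated, they can be any expression, not just a column name. A minimal sketch; the message text and the name glued are made up for illustration.

```r
library(dplyr)
library(stringr)

df <- tibble(nm = c("Bug", "Cat", "Ant"))

# Each {} is evaluated row-wise inside mutate(), so function calls work too.
glued <- df %>%
  mutate(gr = str_glue("{nm} has {str_length(nm)} letters"))
glued
```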

`str_flatten()`: create strings from a vector

If operating over vectors, str_c() and str_glue() return vectors of the same length. This is useful for mutate().
If we instead want to e.g. concatenate all strings in a group (summarize()), we can use str_flatten()

( str_flatten(c("x", "y", "z")) )

[1] "xyz"

( str_flatten(c("x", "y", "z"), ", "))

[1] "x, y, z"

( str_flatten(c("x", "y", "z"), ", ", last = ", and ") )

[1] "x, y, and z"
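Two related behaviours are worth checking, shown here as a hedged sketch: str_c() with collapse gives the same single string, and a single NA makes the flattened result NA unless it is dropped. The na.rm argument is assumed to be available (stringr 1.5.0 or later).

```r
library(stringr)

x <- c("x", "y", "z")
str_flatten(x, ", ")         # one string
str_c(x, collapse = ", ")    # same result via str_c()

# One missing value makes the whole result missing...
y <- c("x", NA, "z")
str_flatten(y, ", ")
# ...unless missing values are removed first (assumes stringr >= 1.5.0).
str_flatten(y, ", ", na.rm = TRUE)
```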

`str_flatten()`: create strings from a vector

Allows for easy computation of gluing together strings per group:

df

# A tibble: 5 × 2
  name  fruit    
  <chr> <chr>    
1 Cat   banana   
2 Cat   apple    
3 Bug   nectarine
4 Ant   orange   
5 Ant   papaya

df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", "))

# A tibble: 3 × 2
  name  fruits        
  <chr> <chr>         
1 Ant   orange, papaya
2 Bug   nectarine     
3 Cat   banana, apple
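The slide prints df without showing how it was built. A self-contained sketch, with the tibble reconstructed from the printed output above and `last` added to show that it also works per group (the name fruits_by_name is illustrative):

```r
library(dplyr)
library(stringr)
library(tibble)

# Reconstructed from the printed tibble; not the original construction code.
df <- tribble(
  ~name, ~fruit,
  "Cat", "banana",
  "Cat", "apple",
  "Bug", "nectarine",
  "Ant", "orange",
  "Ant", "papaya"
)

# One row per name; groups with a single fruit are returned unchanged.
fruits_by_name <- df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", ", last = " and "))
fruits_by_name
```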

Extracting data from strings

Individual characters in a string: `str_length()`

str_length(): returns number of characters in the string

str_length(c("a", "R for data science", NA))

[1]  1 18 NA

str_length(c(1, 44))

[1] 1 2

str_length(c(TRUE, FALSE))

[1] 4 5

Individual characters in a string: `str_sub()`

str_sub(string, start, end): subsets string from start index to end index (both inclusive).

x <- c("Apple", "Banana", "Orange", NA)
str_sub(x, 1, 1)

[1] "A" "B" "O" NA

str_sub(x, 1, 4)

[1] "Appl" "Bana" "Oran" NA

str_sub(x, 2, 4)

[1] "ppl" "ana" "ran" NA

start and end can be negative: -1 is last char, -2 second to last, etc.

str_sub(x, -3, -1)

[1] "ple" "ana" "nge" NA
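Since str_sub() is vectorized, it fits naturally inside mutate(). A minimal sketch that pulls out the first and last character of each string; the tibble fruits and the column names first and last are illustrative.

```r
library(dplyr)
library(stringr)

fruits <- tibble(x = c("Apple", "Banana", "Orange", NA))

ends <- fruits %>%
  mutate(
    first = str_sub(x, 1, 1),    # first character
    last  = str_sub(x, -1, -1)   # last character
  )
ends

# str_sub() does not fail on short strings; it returns as much as it can.
str_sub("a", 1, 5)
```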

Extracting data from strings

We’ll focus on four useful tidyr functions for extracting data from strings:

df |> separate_longer_delim(col, delim)
df |> separate_longer_position(col, width)
df |> separate_wider_delim(col, delim, names)
df |> separate_wider_position(col, widths)

_longer creates new rows / collapses columns to make df longer
_wider creates new columns / collapses rows to make df wider

_delim splits up a string with a delimiter like ", " or " "
_position splits at specified widths of the string, like c(3,5,2)

Examples of each will be shown in the next few slides.

`separate_longer_delim()`

_longer creates new rows / collapses columns to make df longer
_delim splits up a string with a delimiter like ", " or " "

df1

# A tibble: 3 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Bug   F      
3 Ant   C,D

df1 %>% 
  separate_longer_delim(grades, delim = ",")

# A tibble: 7 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Bug   F     
6 Ant   C     
7 Ant   D
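For a runnable version, df1 can be rebuilt from the printed output. A sketch under that assumption; once there is one grade per row, ordinary dplyr verbs such as count() apply (the name long is illustrative).

```r
library(dplyr)
library(tidyr)

# Reconstructed from the printed tibble; not the original construction code.
df1 <- tibble(
  name   = c("Cat", "Bug", "Ant"),
  grades = c("A,B,B,A", "F", "C,D")
)

long <- df1 %>%
  separate_longer_delim(grades, delim = ",")

# With one grade per row, tallying grades per student is a one-liner.
long %>% count(name, grades)
```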

If the delimiter never occurs in the strings, nothing is split:

df1 %>%   # wrong delimiter
  separate_longer_delim(grades, delim = ".")

# A tibble: 3 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Bug   F      
3 Ant   C,D

An empty delimiter is not supported; it triggers a warning and every value becomes NA:

df1 %>%   # empty delimiter
  separate_longer_delim(grades, delim = "")

Warning in stri_split_fixed(string, pattern, n = n, simplify = simplify, :
empty search patterns are not supported

# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   <NA>  
2 Bug   <NA>  
3 Ant   <NA>

`separate_longer_position()`

Less common, but sometimes each character in a string is itself a value

e.g. if you record all grades for each student in a single continuous string:

df2

# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Bug   F     
3 Ant   CD

df2 %>% 
  separate_longer_position(grades, width = 1)

# A tibble: 7 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Bug   F     
6 Ant   C     
7 Ant   D

With width = 2, each new row holds up to two characters:

df2 %>% 
  separate_longer_position(grades, width = 2)

# A tibble: 4 × 2
  name  grades
  <chr> <chr> 
1 Cat   AB    
2 Cat   BA    
3 Bug   F     
4 Ant   CD

With width = 3, the final piece of a string can be shorter than the width:

df2 %>% 
  separate_longer_position(grades, width = 3)

# A tibble: 4 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABB   
2 Cat   A     
3 Bug   F     
4 Ant   CD

With width = 4, no string is longer than the width, so nothing is split:

df2 %>% 
  separate_longer_position(grades, width = 4)

# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Bug   F     
3 Ant   CD
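The effect of `width` can be summarized by counting output rows for each value. A minimal sketch, with df2 reconstructed from the printed output; rows_by_width is an illustrative name.

```r
library(tibble)
library(tidyr)

# Reconstructed from the printed tibble; not the original construction code.
df2 <- tibble(
  name   = c("Cat", "Bug", "Ant"),
  grades = c("ABBA", "F", "CD")
)

# Number of rows produced for width = 1, 2, 3, 4.
rows_by_width <- sapply(1:4, function(w) {
  nrow(separate_longer_position(df2, grades, width = w))
})
rows_by_width
```

The counts should match the four printed tibbles for widths 1 through 4.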

`separate_longer_...()`

Compare delim vs position based on different formatting:

df3

# A tibble: 2 × 2
  name  grades 
  <chr> <chr>  
1 Cat   A,B,B,A
2 Ant   C,D

df3 %>% separate_longer_delim(grades, delim = ",")

# A tibble: 6 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Ant   C     
6 Ant   D

df4

# A tibble: 2 × 2
  name  grades
  <chr> <chr> 
1 Cat   ABBA  
2 Ant   CD

df4 %>% separate_longer_position(grades, width = 1)

# A tibble: 6 × 2
  name  grades
  <chr> <chr> 
1 Cat   A     
2 Cat   B     
3 Cat   B     
4 Cat   A     
5 Ant   C     
6 Ant   D

`separate_wider_delim()`

Separates into columns (wider)

Consider the following tibble:

df4

# A tibble: 2 × 1
  x         
  <chr>     
1 a10.1.2022
2 b10.2.2011

x has a code, edition number, and year, separated by "."

To separate, supply delimiter and names of new columns

df4 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )

# A tibble: 2 × 3
  code  edition year 
  <chr> <chr>   <chr>
1 a10   1       2022 
2 b10   2       2011

To remove output columns, supply NA for name of those columns

df4 |>
  separate_wider_delim(
    x, 
    delim = ".",
    names = c("code", NA, "year")
  )

# A tibble: 2 × 2
  code  year 
  <chr> <chr>
1 a10   2022 
2 b10   2011

Supplying NA for several names drops each of those columns:

df4 |>
  separate_wider_delim(
    x, 
    delim = ".",
    names = c("code", NA, NA)
  )

# A tibble: 2 × 1
  code 
  <chr>
1 a10  
2 b10
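Note that every new column is a character vector, even when it looks numeric. A hedged sketch of one way to convert afterwards, using across() with as.integer(); the tibble is reconstructed from the printed output and the names codes and parsed are illustrative.

```r
library(dplyr)
library(tidyr)

# Reconstructed from the printed tibble; not the original construction code.
codes <- tibble(x = c("a10.1.2022", "b10.2.2011"))

parsed <- codes |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  ) |>
  # The pieces arrive as <chr>; convert the numeric-looking ones.
  mutate(across(c(edition, year), as.integer))
parsed
```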

`separate_wider_position()`

Separates into columns (wider)

Consider the following tibble:

df5

# A tibble: 3 × 1
  x       
  <chr>   
1 202215TX
2 202122LA
3 202325CA

x has
- year (first 4 chars)
- age (next 2 chars)
- state (next 2 chars)

To separate, supply named integer vector:

name = name of new column,
value = number of characters

df5 %>% 
  separate_wider_position(
    x, 
    widths = c(year=4, age=2, state=2)
  )

# A tibble: 3 × 3
  year  age   state
  <chr> <chr> <chr>
1 2022  15    TX   
2 2021  22    LA   
3 2023  25    CA

To remove output columns, omit those names in the named vector.

df5 %>% 
  separate_wider_position(
    x, 
    widths = c(year=4, 2, state=2)
  )

# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA

Alternatively, just use select(-name):

df5 |> 
  separate_wider_position(
    x,
    widths = c(year=4, age=2, state=2)
  ) |>
  select(-age)

# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA
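Position-based splitting only makes sense when every string has the expected length, so it can help to check that first. A minimal sketch, with df5 reconstructed from the printed output; the names widths and parts are illustrative.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Reconstructed from the printed tibble; not the original construction code.
df5 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
widths <- c(year = 4, age = 2, state = 2)

# Sanity check: every string should be exactly sum(widths) characters long.
all(str_length(df5$x) == sum(widths))

parts <- df5 %>%
  separate_wider_position(x, widths = widths)
parts
```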

Diagnosing widening problems

separate_wider_delim() requires a fixed and known set of columns

Problem if some rows don’t have expected number of pieces.
too_few and too_many args of separate_wider_delim() can help here.

df

# A tibble: 4 × 1
  u    
  <chr>
1 1-1-1
2 1-3  
3 1-3-2
4 1

df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `u`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.

Let’s try its suggestion to use debug:

Diagnosing widening problems: `too_few`

df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )

Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.

# A tibble: 4 × 7
  x     y     z     u     u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr> <lgl>    <int> <chr>      
1 1     1     1     1-1-1 TRUE         3 ""         
2 1     3     <NA>  1-3   FALSE        2 ""         
3 1     3     2     1-3-2 TRUE         3 ""         
4 1     <NA>  <NA>  1     FALSE        1 ""

Three columns get added:

u_ok: whether each input split into the expected number of pieces (FALSE marks the failures)
u_pieces: how many pieces were found
u_remainder: not useful when there are too few pieces, but we will see it is useful when there are too many.

Using debug will typically reveal a problem with the delimiter strategy, or suggest that the tibble needs preprocessing before separating.
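With many rows, the debug columns are most useful for filtering down to the inputs that failed. A minimal sketch, with the tibble reconstructed from the printed output; the names debugged and problems are illustrative.

```r
library(dplyr)
library(tidyr)

# Reconstructed from the printed tibble; not the original construction code.
df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# Debug mode emits a warning by design, so it is silenced here.
debugged <- suppressWarnings(
  separate_wider_delim(
    df,
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
)

# Keep only the rows that did not split into three pieces.
problems <- debugged |> filter(!u_ok)
problems
```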

Diagnosing widening problems: `too_few`

Can fill in missing pieces with NAs by either…

…putting NAs at tail end

df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_start'
  )

# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     3     <NA> 
3 1     3     2    
4 1     <NA>  <NA>

…putting NAs at front end

df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_end'
  )

# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 <NA>  1     3    
3 1     3     2    
4 <NA>  <NA>  1

Diagnosing widening problems: `too_many`

Same principles apply for too many pieces.

df

# A tibble: 4 × 1
  u        
  <chr>    
1 1-1-1    
2 1-1-2    
3 1-3-5-6  
4 1-3-5-7-9

df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
# Error in `separate_wider_delim()`:
# ! Expected 3 pieces in each element of `u`.
# ! 2 values were too long.
# ℹ Use `too_many = "debug"` to diagnose the problem.
# ℹ Use `too_many = "drop"/"merge"` to silence this message.

Debugging shows the purpose of the u_remainder column:

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'debug'
  )

Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.

# A tibble: 4 × 7
  x     y     z     u         u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr>     <lgl>    <int> <chr>      
1 1     1     1     1-1-1     TRUE         3 ""         
2 1     1     2     1-1-2     TRUE         3 ""         
3 1     3     5     1-3-5-6   FALSE        4 "-6"       
4 1     3     5     1-3-5-7-9 FALSE        5 "-7-9"

Diagnosing widening problems: `too_many`

To handle too many pieces, you can either…

“merge” the extras into a single column

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'merge'
  )

# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5-6  
4 1     3     5-7-9

or “drop” the extras

df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = 'drop'
  )

# A tibble: 4 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     5    
4 1     3     5
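When a column contains both kinds of problem, too_few and too_many can be set in the same call. A minimal sketch on a small made-up tibble (the values and the name both are illustrative, not from the slides):

```r
library(tibble)
library(tidyr)

# Hypothetical input mixing a complete row, a short row, and a long row.
df <- tibble(u = c("1-1-1", "1-3", "1-3-5-6"))

both <- df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start",   # pad short rows with NA at the end
    too_many = "merge"         # fold extra pieces into the last column
  )
both
```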
