STA35B: Statistical Data Science 2
We’ll focus on transforming strings.
We’ll primarily work with stringr, which has functions that start with str_.
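Create strings using ' or " (single or double quotes).
To include a quote inside a string, use ' on the outside and " on the inside (or the reverse).
In RStudio, if you select text and type ' or ", it puts quotes around it.
If you forget to close a quote, the console shows + and waits for you to complete the string.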
To include a literal quote of the same kind inside a string, use \ to escape it.
[1] "this string has a \"quote\" inside of it"
[1] "this string has a \"quote\" inside of it"
[1] TRUE
To include a literal \, escape it using \\. There are other special characters (next slide).
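The output above can be reproduced with a short sketch like the following. The variable names `s1`, `s2`, and `backslash` are illustrative and not taken from the slides.

```r
# Two ways to write the same string: escape the inner double quotes,
# or switch the outer quotes to single quotes.
s1 <- "this string has a \"quote\" inside of it"
s2 <- 'this string has a "quote" inside of it'

s1         # printing shows the escapes
s2         # prints identically, because the stored characters are the same
s1 == s2   # TRUE

# A literal backslash must itself be escaped.
backslash <- "\\"
nchar(backslash)   # 1: two characters typed, one character stored
```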
Besides \", \', and \\, there is:
\n: new line
\t: tab
\u or \U: unicode characters
writeLines() writes the text itself (with escapes resolved), similar to stringr::str_view()
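A small sketch of how these escapes behave. The string `x` is a made-up example, and `\u00b5` is the Unicode code point for the micro sign.

```r
# One string containing a newline, a tab, and a unicode escape
x <- "first line\n\tindented second line with a micro sign: \u00b5"

print(x)        # shows the string as R source, with \n and \t still escaped
writeLines(x)   # shows the text the string actually contains
```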
Create a string with value He said "That's amazing!"
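One possible answer, as a sketch: use double quotes on the outside and escape the inner double quotes, so the apostrophe needs no escaping.

```r
# Double quotes outside, so the inner double quotes are escaped
x <- "He said \"That's amazing!\""

writeLines(x)
#> He said "That's amazing!"
```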
Three stringr functions for creating strings, useful with dplyr:
str_c()
str_glue()
str_flatten()
str_c(): create strings by concatenation
[1] "xy"
[1] "xyz"
[1] "Hello Bug" "Hello Dog"
Similar to paste0() in base R, but friendlier for dplyr: it obeys tidyverse rules for recycling and propagating missing values.
str_c() vs paste0()
# A tibble: 4 × 1
nm
<chr>
1 Bug
2 Cat
3 Ant
4 <NA>
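A sketch of the difference, using the same `nm` column as the tibble above. The column names `with_str_c` and `with_paste0` are illustrative.

```r
library(dplyr)
library(stringr)

# Same names as in the tibble above, including one missing value
df <- tibble(nm = c("Bug", "Cat", "Ant", NA))

greetings <- df |>
  mutate(
    with_str_c  = str_c("Hi ", nm, "!"),   # missing name -> NA
    with_paste0 = paste0("Hi ", nm, "!")   # missing name -> "Hi NA!"
  )
greetings

# Recycling: a zero-length input gives a zero-length result with str_c(),
# whereas paste0() still returns "x"
str_c("x", character())
paste0("x", character())
```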
str_glue(): create strings with {}
With str_c(), lots of "s make it hard to read.
str_glue(): anything inside of {} will be evaluated like it doesn’t have quotes:
# A tibble: 4 × 2
nm gr
<chr> <glue>
1 Bug Hi Bug!
2 Cat Hi Cat!
3 Ant Hi Ant!
4 <NA> Hi NA!
The default behavior for NA is to copy over the literal "NA"; this is inconsistent with str_c(). If you set .na = NULL, it matches the behavior of str_c().
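A sketch comparing the two behaviors on the same `nm` column. The column names `gr_default` and `gr_missing` are illustrative; `.na` is passed through `str_glue()` to `glue::glue()`.

```r
library(dplyr)
library(stringr)

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))

glued <- df |>
  mutate(
    gr_default = str_glue("Hi {nm}!"),              # missing value -> "Hi NA!"
    gr_missing = str_glue("Hi {nm}!", .na = NULL)   # missing value stays NA
  )
glued
```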
str_flatten(): create strings from a vector
str_c() and str_glue() return vectors of the same length as their inputs. This is useful for mutate().
To collapse a vector into a single string (e.g. inside summarize()), we can use str_flatten().
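A minimal sketch of `str_flatten()` on its own and inside `summarize()`. The tibble `fruit_df` is hypothetical data made up for illustration.

```r
library(dplyr)
library(stringr)

str_flatten(c("x", "y", "z"))         # one string: "xyz"
str_flatten(c("x", "y", "z"), ", ")   # one string: "x, y, z"

# Hypothetical data: one row per (name, fruit) pair
fruit_df <- tibble(
  name  = c("Carmen", "Carmen", "Marvin", "Terence", "Terence", "Terence"),
  fruit = c("banana", "apple", "nectarine", "cantaloupe", "papaya", "mandarin")
)

# One row per name, with that person's fruits collapsed into one string
fruit_summary <- fruit_df |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", "))
fruit_summary
```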
str_length(): returns the number of characters in the string
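For example (the input strings here are illustrative):

```r
library(stringr)

# Vectorised over the input; a missing string has a missing length
lens <- str_length(c("a", "R for data science", NA))
lens
#> [1]  1 18 NA
```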
str_sub()
str_sub(string, start, end): subsets string from the start index to the end index (both inclusive).
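A short sketch with a made-up vector `x`. Negative indices count from the end of the string.

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

first3 <- str_sub(x, 1, 3)     # first three characters
last3  <- str_sub(x, -3, -1)   # last three characters
first3
#> [1] "App" "Ban" "Pea"
last3
#> [1] "ple" "ana" "ear"

# Asking for more characters than exist returns as much as possible
str_sub("a", 1, 5)
#> [1] "a"
```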
We’ll focus on four useful tidyr functions for extracting data from strings: separate_longer_delim(), separate_longer_position(), separate_wider_delim(), and separate_wider_position(). Each name combines two suffixes:
_longer: creates new rows / collapses columns to make df longer
_wider: creates new columns / collapses rows to make df wider
_delim: splits up a string with a delimiter like ", " or " "
_position: splits at specified widths of the string, like c(3,5,2)
Examples of each will be shown in the next few slides.
separate_longer_delim()
_longer: creates new rows / collapses columns to make df longer
_delim: splits up a string with a delimiter like ", " or " "
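A minimal sketch, assuming a hypothetical tibble `df1` whose column `x` holds comma-separated values:

```r
library(tidyr)

# Hypothetical data: a varying number of comma-separated pieces per row
df1 <- tibble(x = c("a,b,c", "d,e", "f"))

# Each piece becomes its own row
longer <- df1 |>
  separate_longer_delim(x, delim = ",")
longer
```

The three input rows become six output rows, one per piece.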
separate_longer_position()
Less common, but sometimes each character in a string records a value of its own.
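A minimal sketch, assuming a hypothetical tibble `df2` in which every character of `x` is a separate observation:

```r
library(tidyr)

# Hypothetical data: each digit is its own recorded value
df2 <- tibble(x = c("1211", "131", "21"))

# Split every string into pieces of width 1, one piece per row
by_char <- df2 |>
  separate_longer_position(x, width = 1)
by_char
```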
separate_longer_...()
Compare delim vs position based on different formatting:
separate_wider_delim()
Separates into columns (wider)
Consider the following tibble:
x has a code, edition number, and year, separated by "."
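A sketch of this split, assuming a hypothetical tibble `df3` whose values follow the code.edition.year pattern described above. The data values and the output column names are illustrative.

```r
library(tidyr)

# Hypothetical data in the form <code>.<edition>.<year>
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))

# delim is treated as a fixed string, so "." needs no escaping here
wider <- df3 |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
wider
```

Putting `NA` in `names`, e.g. `names = c("code", NA, "year")`, omits that piece from the output.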
separate_wider_position()
Separates into columns (wider)
Consider the following tibble:
x
has
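A minimal sketch of `separate_wider_position()`, assuming a hypothetical tibble `df4` in which each value packs a 4-character year, a 2-character age, and a 2-character state. The data and the column names are made up for illustration.

```r
library(tidyr)

# Hypothetical fixed-width data: year (4), age (2), state (2)
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

# widths is a named vector: names become columns, values are widths
fixed <- df4 |>
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
fixed
```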
separate_wider_delim() requires a fixed and known set of columns. What if some rows have a different number of pieces?
The too_few and too_many args of separate_wider_delim() can help here.
df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `u`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
Let’s try its suggestion to use too_few = "debug":
too_few
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.
# A tibble: 4 × 7
x y z u u_ok u_pieces u_remainder
<chr> <chr> <chr> <chr> <lgl> <int> <chr>
1 1 1 1 1-1-1 TRUE 3 ""
2 1 3 <NA> 1-3 FALSE 2 ""
3 1 3 2 1-3-2 TRUE 3 ""
4 1 <NA> <NA> 1 FALSE 1 ""
Three columns get added:
u_ok: whether each input split as expected (FALSE marks the inputs that failed)
u_pieces: how many pieces were found
u_remainder: isn’t useful when there are too few pieces, but we will see it is useful when there are too many.
Using debug will typically reveal a problem with the delimiter strategy.
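The debug table above can be reproduced with a sketch like this. The values of `u` are read off the `u` column of that output; the name `debugged` is illustrative.

```r
library(tidyr)

# Values taken from the u column of the debug output above
df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# Emits a warning that debug mode added u_ok, u_pieces, u_remainder
debugged <- df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
debugged
```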
too_few
Can fill in missing pieces with NAs by either…
…putting NAs at tail end (too_few = "align_start")
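A sketch of both alignment options named in the error message, on the same values of `u` as the debug output. The names `at_start` and `at_end` are illustrative.

```r
library(tidyr)

df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# align_start: pieces fill x, y, z from the left, so NAs land at the tail
at_start <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_start")

# align_end: pieces fill from the right, so NAs land at the front
at_end <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_end")

at_start
at_end
```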
too_many
Same principles apply for too many pieces.
too_many
Debugging shows the purpose of the u_remainder column:
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
`u_remainder`.
# A tibble: 4 × 7
x y z u u_ok u_pieces u_remainder
<chr> <chr> <chr> <chr> <lgl> <int> <chr>
1 1 1 1 1-1-1 TRUE 3 ""
2 1 1 2 1-1-2 TRUE 3 ""
3 1 3 5 1-3-5-6 FALSE 4 "-6"
4 1 3 5 1-3-5-7-9 FALSE 5 "-7-9"
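A sketch that reproduces this table. The values of `u` are read off the `u` column of the output; the name `debugged_many` is illustrative.

```r
library(tidyr)

# Values taken from the u column of the debug output above
df <- tibble(u = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-5-7-9"))

# u_remainder holds whatever is left after the first three pieces
debugged_many <- df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "debug"
  )
debugged_many
```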
too_many
To handle too many pieces, you can either…
“merge” the extras into a single column
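A sketch of the two non-debug settings that `too_many` accepts, `"drop"` and `"merge"`, on the same values of `u` as the debug output. The names `dropped` and `merged` are illustrative.

```r
library(tidyr)

df <- tibble(u = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-5-7-9"))

# drop: silently discard any pieces beyond the third
dropped <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_many = "drop")

# merge: keep the extras by folding them into the last column
merged <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_many = "merge")

dropped
merged
```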
University of California, Davis · STA35B · Spring 2025