STA35B: Statistical Data Science 2
We’ll focus on transforming strings.
We’ll primarily work with stringr, which has functions that start with str_.
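Create strings using either ' or " (single or double quotes).
To include a quote inside a string, use ' on the outside and " on the inside (or the reverse).
If you forget to close a ' or ", R shows the + continuation prompt and waits for you to complete the string.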
To include a literal quote inside a string, use \ to escape it.
[1] "this string has a \"quote\" inside of it"
[1] "this string has a \"quote\" inside of it"
[1] TRUE
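The code behind the output above is not preserved in these slides. A minimal sketch that would print the same three results, assuming the variable names x and y (they are illustrative, not from the original):

```r
# Two ways to write the same string: escape the inner double quotes,
# or wrap the whole string in single quotes instead.
x <- "this string has a \"quote\" inside of it"
y <- 'this string has a "quote" inside of it'

x        # printed with the escapes shown
y        # prints identically
x == y   # TRUE: both hold the same characters
```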
To include a literal backslash \, escape it using \\.
There are other special characters (next slide).
In addition to \", \', \\, there is:
\n: new line
\t: tab
\u or \U: unicode characters
writeLines() writes text, similar to stringr::str_view()
Create a string with value He said "That's amazing!"
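As a sketch of how these escapes behave, the block below prints a string with a tab, a newline, and a unicode escape, then builds the exercise string in one possible way. The variable names s and said are assumptions made for illustration.

```r
library(stringr)

# \t is a tab, \n a new line, \u00b5 the micro sign.
s <- "one\ttwo\nthree \u00b5"
writeLines(s)   # shows the actual text, with escapes interpreted
str_view(s)     # stringr's viewer for the same string

# One way to write the exercise string: double quotes outside,
# so the inner double quotes are escaped and the apostrophe is not.
said <- "He said \"That's amazing!\""
writeLines(said)
```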
dplyr
str_c()
str_glue()
str_flatten()
str_c(): create strings by concatenation
[1] "xy"
[1] "xyz"
[1] "Hello Bug" "Hello Dog"
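The calls that produced the three outputs above are missing from the extracted slides. A sketch of calls that would give exactly those results (the arguments are inferred from the outputs, so treat them as an assumption):

```r
library(stringr)

a <- str_c("x", "y")                     # "xy"
b <- str_c("x", "y", "z")                # "xyz"
# The length-1 "Hello " is recycled against the length-2 vector.
d <- str_c("Hello ", c("Bug", "Dog"))    # "Hello Bug" "Hello Dog"

a; b; d
```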
Similar to paste0() in base R, but friendlier for dplyr: it obeys tidyverse rules for recycling and propagating missing values.
str_c() vs paste0()
# A tibble: 4 × 1
nm
<chr>
1 Bug
2 Cat
3 Ant
4 <NA>
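To make the difference concrete, here is a small sketch using a tibble like the one shown, with both functions applied side by side. The column names with_str_c and with_paste0 are illustrative assumptions.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))

out <- df |>
  mutate(
    with_str_c  = str_c("Hi ", nm, "!"),   # NA in, NA out
    with_paste0 = paste0("Hi ", nm, "!")   # NA becomes the text "NA"
  )
out
```

In the last row, str_c() propagates the missing value, while paste0() produces the string "Hi NA!".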
str_glue(): create strings with {}
When mixing fixed and variable strings with str_c(), the many "s make it hard to read.
str_glue(): anything inside of {} will be evaluated like it doesn’t have quotes:
# A tibble: 4 × 2
  nm    gr
  <chr> <glue>
1 Bug   Hi Bug!
2 Cat   Hi Cat!
3 Ant   Hi Ant!
4 <NA>  Hi NA!
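A minimal sketch of a call that could produce the gr column above, assuming a tibble df with the nm column shown. The second column, gr_na (a name made up here), passes .na = NULL, which glue documents as propagating missing values instead of printing the text NA.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(nm = c("Bug", "Cat", "Ant", NA))

out <- df |>
  mutate(
    gr    = str_glue("Hi {nm}!"),              # {nm} is evaluated as code
    gr_na = str_glue("Hi {nm}!", .na = NULL)   # NA stays NA, like str_c()
  )
out
```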
The default behavior for NA is to copy over the literal NA, which is inconsistent with str_c(). If you set .na = NULL, str_glue() matches the behavior of str_c().
str_flatten(): create strings from a vector
str_c() and str_glue() return vectors of the same length as their inputs. This is useful for mutate().
To collapse a vector into a single string (e.g., inside summarize()), we can use str_flatten().
str_length(): returns number of characters in the string
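The following sketch illustrates str_flatten() inside summarize() and str_length() on a small vector. The data frame and its name and fruit columns are hypothetical, invented only for this example.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical data: one row per (name, fruit) pair.
df <- tibble(
  name  = c("Carmen", "Carmen", "Marvin", "Marvin"),
  fruit = c("banana", "apple", "kiwi", "apple")
)

# str_flatten() collapses each group's vector into a single string,
# so it returns one value per group, as summarize() expects.
fruits <- df |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", "))
fruits

# str_length() is vectorised and keeps NA as NA.
lens <- str_length(c("a", "R for data science", NA))
lens
```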
str_sub(string, start, end): subsets string from start index to end index.
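A short sketch of str_sub() on an illustrative vector (the fruit names are an assumption, not from the slides). Both ends of the range are included, and negative indices count from the end of the string.

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

first3 <- str_sub(x, 1, 3)     # characters 1 through 3
last3  <- str_sub(x, -3, -1)   # the last three characters

first3   # "App" "Ban" "Pea"
last3    # "ple" "ana" "ear"
```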
We’ll focus on four useful tidyr functions for extracting data from strings:
_longer creates new rows / collapses columns to make df longer
_wider creates new columns / collapses rows to make df wider
_delim splits up a string with a delimiter like ", " or " "
_position splits at specified widths of the string, like c(3,5,2)
Examples of each will be shown in the next few slides.
separate_longer_delim()
_longer creates new rows / collapses columns to make df longer
_delim splits up a string with a delimiter like ", " or " "
separate_longer_position()
Less common, but sometimes each character in a value records a value itself
separate_longer_...()
Compare delim vs position based on different formatting:
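The example data for this comparison did not survive extraction. As a sketch, the two tibbles below (df1 and df2, both hypothetical) show one column split by a delimiter and another split one character at a time.

```r
library(tidyr)
library(tibble)

# Delimited values: each comma-separated piece becomes its own row.
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
by_delim <- df1 |> separate_longer_delim(x, delim = ",")
by_delim

# Fixed-width values: every single character becomes its own row.
df2 <- tibble(x = c("1211", "131", "21"))
by_position <- df2 |> separate_longer_position(x, width = 1)
by_position
```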
separate_wider_delim()
Separates into columns (wider)
Consider the following tibble:
x has a code, edition number, and year, separated by "."
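The tibble itself is not preserved in the extracted slides. A sketch with made-up values in the described layout (a code, an edition number, and a year, separated by "."), split into three named columns:

```r
library(tidyr)
library(tibble)

# Hypothetical values following the code.edition.year pattern.
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))

wide <- df3 |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
wide
```

All new columns are character vectors; converting year to a number would be a separate step.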
separate_wider_position()
Separates into columns (wider)
Consider the following tibble:
x has
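The description of this tibble is cut off in the source, so the layout below is an assumption: a 4-character year, a 2-character age, and a 2-character state code packed into one string. The sketch shows how separate_wider_position() takes a named vector of widths.

```r
library(tidyr)
library(tibble)

# Hypothetical fixed-width values: year (4), age (2), state (2).
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

wide_pos <- df4 |>
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
wide_pos
```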
separate_wider_delim() requires a fixed and known set of columns
When some rows have a different number of pieces, the too_few and too_many args of separate_wider_delim() can help here.
library(tidyverse)
# One column u whose values split into 3, 2, 3, and 1 pieces
df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))
df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `u`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
Let’s try its suggestion to use debug:
too_few
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and `u_remainder`.
# A tibble: 4 × 7
  x     y     z     u     u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr> <lgl>    <int> <chr>
1 1     1     1     1-1-1 TRUE         3 ""
2 1     3     <NA>  1-3   FALSE        2 ""
3 1     3     2     1-3-2 TRUE         3 ""
4 1     <NA>  <NA>  1     FALSE        1 ""
Three columns get added:
u_ok: which inputs failed
u_pieces: how many pieces were found
u_remainder isn’t useful when there are too few pieces, but we will see it is useful when there are too many.
Using debug will typically reveal a problem with the delimiter strategy.
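A sketch of a debug call that would yield output like the tibble above. The data values are read off that output; the name debug_out is an assumption. Filtering on u_ok then isolates the rows that need attention.

```r
library(tidyr)
library(tibble)

df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# too_few = "debug" keeps going (with a warning) and adds
# u_ok, u_pieces, and u_remainder next to the split columns.
debug_out <- df |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )

# Look only at the inputs that did not split into 3 pieces.
debug_out[!debug_out$u_ok, ]
```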
too_few
Can fill in missing pieces with NAs by either…
…putting NAs at the tail end (too_few = "align_start")
…or putting NAs at the front (too_few = "align_end")
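A sketch of both alignment options on the same data as the debug example (the result names at_start and at_end are assumptions). With "align_start" the pieces fill columns from the left, so NAs land at the tail end; with "align_end" they fill from the right.

```r
library(tidyr)
library(tibble)

df <- tibble(u = c("1-1-1", "1-3", "1-3-2", "1"))

# Pieces fill x, y, z from the left; missing ones become NA on the right.
at_start <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_start")

# Pieces fill from the right; missing ones become NA on the left.
at_end <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_end")

at_start
at_end
```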
too_many
Same principles apply for too many pieces.
Debugging shows the purpose of the u_remainder column:
Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and `u_remainder`.
# A tibble: 4 × 7
  x     y     z     u         u_ok  u_pieces u_remainder
  <chr> <chr> <chr> <chr>     <lgl>    <int> <chr>
1 1     1     1     1-1-1     TRUE         3 ""
2 1     1     2     1-1-2     TRUE         3 ""
3 1     3     5     1-3-5-6   FALSE        4 "-6"
4 1     3     5     1-3-5-7-9 FALSE        5 "-7-9"
too_many
To handle too many pieces, you can either…
“drop” the extras (too_many = "drop"), or
“merge” the extras into the final column (too_many = "merge")
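A sketch of the two too_many options. The input values are taken from the u column in the debug output shown for too many pieces; the result names dropped and merged are assumptions.

```r
library(tidyr)
library(tibble)

df <- tibble(u = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-5-7-9"))

# "drop" silently discards any pieces beyond the third.
dropped <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_many = "drop")

# "merge" keeps the extras by leaving them attached to the last column.
merged <- df |>
  separate_wider_delim(u, delim = "-", names = c("x", "y", "z"),
                       too_many = "merge")

dropped
merged
```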
University of California, Davis · STA35B · Spring 2025