Shuffling Strings in R

2019-05-14

Let's say you need to share some data that contains potentially identifying, sensitive information: people's addresses, phone numbers and so on. Maybe these fields are not particularly important, but you don't want to drop them outright, and you don't want to go through an encryption and decryption process either. One quick and useful option is the stri_rand_shuffle() function from the stringi package, which randomly reorders the characters within each string.
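Before applying it to a whole data frame, here is a minimal sketch of what the function does on a plain character vector. The vector x is made up for illustration, and the set.seed() calls assume that stringi draws from R's own random number generator, so the same seed should give the same shuffle.

```r
library(stringi)

# Made-up example values, not part of the dataset below
x <- c("John Smyth", "99-2278-122")

# Shuffle the characters within each element; set a seed for reproducibility
set.seed(2019)
first_run <- stri_rand_shuffle(x)

set.seed(2019)
second_run <- stri_rand_shuffle(x)

first_run

# Each shuffled string keeps the length of its original
stri_length(first_run) == stri_length(x)

# Same seed, same shuffle (under the assumption stated above)
identical(first_run, second_run)
```

The function is vectorised, which is why it can be handed a whole column at once.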

Imagine you have the following fake data:

library(stringi); library(dplyr)

df <- tibble(
  name = c("John Smyth", "Alan Pear", "Don Baker", "Bjarne Andersson"),
  client_id = c(243, 22, 441, 994),
  address_1 = c("2 Corner View Road", "106 Southfield Ave.", "213 North 25th Street",
                "11 Apple Boulevard"),
  address_2 = c("Dunkirk Springs", "Ballyvourney", "Oakland Heights", "Rintinville"),
  address_3 = c("New York", "Cork", "Essex", "Stockholm"),
  phone_number = c("99-2278-122", "088-766653221", "112341-991", "011-221-324"),
  zip_code = c("11517", "E45NN12", "WX133Y", "213337"),
  registration = c("Full", "Part-time", "Full", "Part-time"),
  profile = c("Advanced", "Advanced", "Beginner", "Intermediate")
)

head(df)
## # A tibble: 4 x 9
##   name  client_id address_1 address_2 address_3 phone_number zip_code
##   <chr>     <dbl> <chr>     <chr>     <chr>     <chr>        <chr>
## 1 John…       243 2 Corner… Dunkirk … New York  99-2278-122  11517
## 2 Alan…        22 106 Sout… Ballyvou… Cork      088-7666532… E45NN12
## 3 Don …       441 213 Nort… Oakland … Essex     112341-991   WX133Y
## 4 Bjar…       994 11 Apple… Rintinvi… Stockholm 011-221-324  213337
## # … with 2 more variables: registration <chr>, profile <chr>

...and suppose we're only interested in client ID, site/region (address_3), ZIP code, registration and profile. We can quickly scramble the identifying information in the other columns with stringi and dplyr:

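df %>%
  # No-op here (phone_number is already character); guards against a numeric column
  mutate(phone_number = as.character(phone_number)) %>%
  # Shuffle the characters within each value of the identifying columns
  mutate_at(vars(name, address_1, address_2, phone_number), stri_rand_shuffle)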
## # A tibble: 4 x 9
##   name  client_id address_1 address_2 address_3 phone_number zip_code
##   <chr>     <dbl> <chr>     <chr>     <chr>     <chr>        <chr>
## 1 "ntJ…       243 RV2 edwr… nnuriSsk… New York  1-928-22297  11517
## 2 " aA…        22 d6h itle… eBlnyyru… Cork      22610588676… E45NN12
## 3 e rk…       441 t 2 rthS… etdHnkaa… Essex     39-4121911   WX133Y
## 4 nBjn…       994 p1erAaud… vltiRlne… Stockholm 3421-2102-1  213337
## # … with 2 more variables: registration <chr>, profile <chr>
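
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). Here is a minimal sketch of the same idea written with across(), on a cut-down two-row version of the data. The names df_small and df_shuffled are made up for this example.

```r
library(stringi)
library(dplyr)

# Cut-down, made-up version of the data above
df_small <- tibble(
  name = c("John Smyth", "Alan Pear"),
  client_id = c(243, 22),
  phone_number = c("99-2278-122", "088-766653221")
)

# Shuffle the characters of the identifying columns only;
# client_id is left untouched
df_shuffled <- df_small %>%
  mutate(across(c(name, phone_number), stri_rand_shuffle))

df_shuffled
```

Both spellings should do the same thing here, so which one to use mostly depends on the dplyr version you have installed.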

If you want a closer look at, for example, the address_1 column:

df %>%
  mutate(phone_number = as.character(phone_number)) %>%
  mutate_at(vars(name, address_1, address_2, phone_number), stri_rand_shuffle) %>%
  pull(address_1)
## [1] "rdrn woa e VoiRCe2"    ".it0e6SoAf h ld1veu"   "S oe t2r3et215hNtrt h"
## [4] "u vA 1reoealBpp1dl"

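Using this method, the identifying fields are no longer readable at a glance, so you can share the dataset with much less concern 😎. Keep in mind that shuffling keeps every original character, so it is obfuscation rather than encryption, and short values may still be guessable.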
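
One property worth checking when deciding what to share: shuffling rearranges the characters of a string but does not remove or replace any of them. A small sketch of that check, using one address from the fake data (sorted_chars is a helper made up for this example):

```r
library(stringi)

original <- "2 Corner View Road"
shuffled <- stri_rand_shuffle(original)

# Hypothetical helper: split a string into single characters and sort them
sorted_chars <- function(s) sort(strsplit(s, "")[[1]])

# The shuffled string contains exactly the same characters as the original
identical(sorted_chars(original), sorted_chars(shuffled))
## [1] TRUE
```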