class: center, middle, inverse, title-slide # Data Wrangling in R ## ### Connor Gilroy ### 2020-05-01 --- # Data wrangling What do we mean by *data wrangling*? (Or "data munging", or even ["data janitoring"](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html)?) <blockquote class="twitter-tweet tw-align-center"><p lang="en" dir="ltr">Them: Can you just quickly pull this data for me?<br><br>Me: Sure, let me just: <br><br>SELECT * FROM some_ideal_clean_and_pristine.table_that_you_think_exists</p>— Seth Rosen (@sethrosen) <a href="https://twitter.com/sethrosen/status/1252291581320757249?ref_src=twsrc%5Etfw">April 20, 2020</a></blockquote> -- Data wrangling is **everything you have to do to a data set to get it ready for analysis** <center><img width="40%" src="https://d33wubrfki0l68.cloudfront.net/795c039ba2520455d833b4034befc8cf360a70ba/558a5/diagrams/data-science-explore.png"></center> --- # How to wrangle data in R There's more than one way to do it! R has two main **dialects**: -- #### #1. **base** R ```r head(data.frame(giving)[(giving$`2016_dollars`<0 & !is.na(giving$`2016_dollars`)), c("transaction_amt","memo_text")], 100) ``` -- #### #2. the **tidyverse** ```r giving %>% filter(`2016_dollars` < 0, !is.na(`2016_dollars`)) %>% select(transaction_amt, memo_text) %>% head(100) ``` Many people mix the two. I'll teach a more pure tidyverse style. **Note:** the tidyverse evolves over time. Use current packages that are well-established. Steer clear of retired packages (like **plyr**) or ones that are especially new (unless you need something cutting-edge) --- # Key concepts tidyverse-style data transformation has 3 components 1. data (a data frame object) 2. verbs (functions) 3. the pipe (`%>%`) -- ```r # this some_data %>% do_something() # is the same as this do_something(some_data) ``` -- Why bother with the pipe? You can combine steps into a *series* of transformations Remember, if you want to keep something, you also have to give it a name: ```r some_new_data <- some_data %>% do_something() %>% do_something_else() ``` --- # Simple **dplyr** verbs #### `select()` - choose columns (variables) by name ```r gss %>% select(id, sex, race) # remove columns with -name gss %>% select(-year) # there are select helpers too gss %>% select(starts_with("vote")) ``` -- #### `filter()` - choose rows (observations) by some characteristics ```r gss %>% filter(age < 50) # "a equals b" is `a == b` gss %>% filter(race == "BLACK") # `!` means "not" gss %>% filter(!is.na(divorce)) ``` -- #### `arrange()` - put data in order by one or more variables ```r gss %>% arrange(age, race) ``` --- # Simple **dplyr** verbs #### `select()` - choose columns (variables) by name ```r gss %>% select(id, sex, race) ``` ``` ## # A tibble: 4,761 x 3 ## id sex race ## <fct> <fct> <fct> ## 1 9 FEMALE BLACK ## 2 3001 FEMALE OTHER ## 3 6001 FEMALE BLACK ## 4 10 FEMALE OTHER ## 5 3002 FEMALE OTHER ## 6 6002 FEMALE WHITE ## 7 11 FEMALE BLACK ## 8 3003 FEMALE BLACK ## 9 6003 FEMALE BLACK ## 10 12 MALE BLACK ## # … with 4,751 more rows ``` --- # Simple **dplyr** verbs #### `filter()` - choose rows (observations) by some characteristics ```r gss %>% filter(race == "BLACK") ``` ``` ## # A tibble: 665 x 15 ## firstid wave year id age sex race divorce income income06 rincome vote00 vote04 ## <fct> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 9 1 2006 9 23 FEMA… BLACK <NA> $2500… $40000 … $25000… INELI… VOTED ## 2 9 3 2010 6001 27 FEMA… BLACK NO $2500… $60000 … $25000… IAP VOTED ## 3 11 1 2006 11 81 FEMA… BLACK YES <NA> <NA> <NA> VOTED VOTED ## 4 11 2 2008 3003 83 FEMA… BLACK YES $2000… $20000 … <NA> IAP VOTED ## 5 11 3 2010 6003 85 FEMA… BLACK NO <NA> <NA> <NA> IAP DID N… ## 6 12 1 2006 12 47 MALE BLACK <NA> <NA> <NA> <NA> DID N… DID N… ## 7 12 2 2008 3004 49 MALE BLACK <NA> $1000… $12500 … <NA> IAP DID N… ## 8 13 1 2006 13 26 MALE BLACK <NA> $8000… $8 000 … <NA> VOTED VOTED ## 9 13 2 2008 3005 28 MALE BLACK <NA> $1000… $12500 … $10000… IAP VOTED ## 10 13 3 2010 6005 30 MALE BLACK <NA> $2500… $30000 … $25000… IAP VOTED ## # … with 655 more rows, and 2 more variables: vote08 <dbl>, earthsun <fct> ``` --- # Simple **dplyr** verbs #### `arrange()` - put data in order by one or more variables ```r gss %>% arrange(desc(age)) ``` ``` ## # A tibble: 4,761 x 15 ## firstid wave year id age sex race divorce income income06 rincome vote00 vote04 ## <fct> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 181 1 2006 181 89 MALE WHITE NO $2500… $40000 … <NA> VOTED VOTED ## 2 181 2 2008 3062 89 MALE WHITE NO $2500… $50000 … <NA> IAP VOTED ## 3 431 1 2006 431 89 FEMA… WHITE NO $1500… $15000 … <NA> VOTED VOTED ## 4 431 2 2008 3137 89 FEMA… WHITE NO $1500… $15000 … <NA> IAP VOTED ## 5 431 3 2010 6110 89 FEMA… WHITE NO $1000… $12500 … <NA> IAP VOTED ## 6 771 1 2006 771 89 FEMA… OTHER NO <NA> <NA> <NA> DID N… DID N… ## 7 771 2 2008 3257 89 FEMA… WHITE NO <NA> <NA> <NA> IAP DID N… ## 8 1101 2 2008 3356 89 FEMA… WHITE NO $2500… $150000… <NA> IAP VOTED ## 9 1101 3 2010 6289 89 FEMA… WHITE NO <NA> <NA> <NA> IAP VOTED ## 10 1292 3 2010 6342 89 MALE WHITE NO $2500… $50000 … <NA> IAP VOTED ## # … with 4,751 more rows, and 2 more variables: vote08 <dbl>, earthsun <fct> ``` --- # Making summaries #### `group_by()` creates groups from variables #### `summarize()` aggregates a set of rows -- Combine `group_by()` + `summarize()` to make group-wise summaries: ```r gss %>% group_by(sex, race) %>% summarize(age = mean(age)) ``` ``` ## # A tibble: 6 x 3 ## # Groups: sex [2] ## sex race age ## <fct> <fct> <dbl> ## 1 MALE WHITE 50.0 ## 2 MALE BLACK 46.3 ## 3 MALE OTHER 38.9 ## 4 FEMALE WHITE 50.6 ## 5 FEMALE BLACK 46.6 ## 6 FEMALE OTHER 42.4 ``` **Note:** You can also use `group_by()` with `mutate()`. When might that be useful? --- # Making new variables Create new variables with the verb `mutate()`: ```r new_data <- data %>% mutate(new_variable = some_function(old_variable)) ``` The key is that the function needs to work on a *vector*. Why? Because columns of data frames are vectors, and mutate transforms a whole column. For example, `+` works on vectors: ```r x <- c(1, 2, 3, 4) y <- c(5, 6, 7, 8) x + y ``` ``` ## [1] 6 8 10 12 ``` --- # Making new variables Because `+` works on vectors, we can use `mutate()` to create a new column `z` by adding `x` and `y`: ```r d <- tibble(x, y) d %>% mutate(z = x + y) ``` ``` ## # A tibble: 4 x 3 ## x y z ## <dbl> <dbl> <dbl> ## 1 1 5 6 ## 2 2 6 8 ## 3 3 7 10 ## 4 4 8 12 ``` --- # More on making new variables #### Change the type of a variable ```r gss %>% mutate(age = as.numeric(age)) ``` `as.numeric`, `as.character`, `as.factor` (and `as_factor`) ... -- #### Test a TRUE/FALSE condition ```r gss %>% mutate(divorce_bin = ifelse(divorce == "YES", 1, 0)) gss %>% mutate(divorce_lgl = (divorce == "YES"), divorce_dbl = as.numeric(divorce_lgl)) ``` -- #### Factors with **forcats** Factors represent categorical data. forcats functions of the form `forcats::fct_*()` manipulate factors. -- #### Text data with **stringr** Characters (also called *strings*) represent text data. stringr functions of the form `stringr::str_*()` manipulate strings. --- # More on making new variables - factors #### Factors with **forcats** Factors represent categorical data. forcats functions of the form `forcats::fct_*()` manipulate factors. Useful functions include - `fct_relevel`: change the order of the factor levels (categories). In an R model, the *first* level is the reference category - `fct_drop`: get rid of levels that don't appear in the data - `fct_recode`: manually change factor levels - `fct_collapse`: manually combine factor levels - `fct_reorder`: reorder levels based on a second variable (good for plotting) -- ```r gss_divorce <- gss %>% mutate(divorce = fct_drop(divorce), divorce = fct_relevel(divorce, "NO", "YES")) levels(gss_divorce$divorce) ``` ``` ## [1] "NO" "YES" ``` --- # More on making new variables - strings #### Text data with **stringr** Characters (also called *strings*) represent text data. stringr functions of the form `stringr::str_*()` manipulate strings. Useful functions include - `str_c`: combine strings - `str_detect`: TRUE/FALSE if a pattern is in the string - `str_extract`: pattern detection, but gives you the pattern itself - `str_replace`, `str_replace_all`: replace a pattern -- **Note:** To get the most out of working with text data, you'll need to learn a bit about ***regular expressions*** ("regex"), a general way for representing text patterns. For instance, with regex you can find: - the beginning of a piece of text (`"^"`), or the end (`"$"`) - any number (`"[0-9]"` or `"\\d"`) - any number of numbers (`"[0-9]*"`, or `"[0-9]+"` for at least one) --- # Other good stuff *You might need to do these things, you might not. It's good to know they exist.* -- #### Reshape data with `tidyr::pivot_longer()` and `tidyr::pivot_wider()` Sometimes it's better to have data in "long" form for visualization and "wide" form for modeling. -- #### Join data sets with `dplyr::left_join()` If you have more than one table, you can join them using identifying variables. -- #### Apply functions with `purrr::map()` If you want to use a normal, non-vector function inside mutate, the purrr package can "map" any function to a list or vector of values. **Note:** `map()` returns a list, but often you want a vector---use `map_chr()`, `map_dbl()`, `map_lgl()`, ..., to get a vector column of the appropriate type. --- # An advanced example If you want a challenge, try to understand how this code works. It fits a separate model to each wave of the gss panel data. ```r gss_nested <- gss %>% group_by(wave) %>% nest() gss_nested %>% mutate(fit = map(data, ~lm(age ~ race + sex, data = .))) ``` ``` ## # A tibble: 3 x 3 ## # Groups: wave [3] ## wave data fit ## <dbl> <list<df[,14]>> <list> ## 1 1 [1,992 × 14] <lm> ## 2 2 [1,514 × 14] <lm> ## 3 3 [1,255 × 14] <lm> ``` --- # Rewriting code in a tidyverse style ```r test <- gss_panel10_long %>% dplyr::select("earthsun", names(gss_panel10_long)[grep("sci", names(gss_panel10_long))]) %>% filter(is.na(earthsun) == FALSE) ## Some of these we grabbed, dont't make sense, e.g., vissci test <- test[, -which(colnames(test) == "uscitzn")] test <- test[, -which(colnames(test) == "vissci")] sum_sci <- apply(data.frame(test)[, -1], 1, sum, na.rm = TRUE) sum(is.na(sum_sci)) ``` -- ```r test <- gss_panel10_long %>% dplyr::select(earthsun, contains("sci")) %>% filter(!is.na(earthsun)) %>% dplyr::select(-uscitzn, -vissci, -earthsun) # TODO: find a tidier way to sum rows... test %>% mutate(sum_sci = rowSums(., na.rm = TRUE)) %>% summarize(sum_sci = sum(is.na(sum_sci))) ``` --- class: center, middle, inverse # Lab exercise (in `data_wrangling.R`)