3 Intro to dplyr

In this section, we’ll discuss data wrangling (also called data transformation) with the dplyr package. We’ll explore ways to choose subsets of data, aggregate data to create summaries, make new variables, and sort data frames. We recommend exploring the RStudio Cheatsheet on Data Transformation alongside this content.

Back to gapminder

Here is a look at the gapminder data frame in the gapminder package.

library(gapminder)
gapminder

Say we wanted the mean life expectancy across all years for Asia:

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)

[1] 60.0649

# dplyr
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

3.1 The pipe `%>%`

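A way to chain commands together: x %>% f(y) passes x as the first argument of f(), so it is evaluated as f(x, y)
It plays a role in dplyr similar to that of + in ggplot2, although + adds layers to a plot while %>% passes data from one function to the next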
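
To make the chaining concrete, here is a minimal sketch comparing a piped call with the equivalent nested call. The object names piped and nested are ours, chosen only for illustration.

```r
library(gapminder)
library(dplyr)

# x %>% f(y) is evaluated as f(x, y): the left-hand side becomes
# the first argument of the function on the right.
piped  <- gapminder %>% filter(continent == "Asia")
nested <- filter(gapminder, continent == "Asia")

# Check that both forms return the same data frame.
identical(piped, nested)
```

With a single step the two forms are equally readable; the pipe starts to pay off once several steps are chained.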

3.2 The Five Main Verbs (5MV) of data wrangling

filter()
summarize()
group_by()
mutate()
arrange()

3.2.1 `filter()`

Select a subset of the rows of a data frame.
The arguments are the “filters” that you’d like to apply.

library(gapminder)
library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007

Use == (not =) to compare a variable to a value
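
The other comparison operators work inside filter() in the same way as ==. The sketch below shows a few of them; the object names long_lived and not_asia, and the cutoff of 80 years, are assumptions made for illustration.

```r
library(gapminder)
library(dplyr)

# == tests equality; a single = would name an argument instead of comparing.
gap_2007 <- gapminder %>% filter(year == 2007)

# Other comparison operators follow the same pattern.
long_lived <- gapminder %>% filter(lifeExp >= 80)        # at least 80 years
not_asia   <- gapminder %>% filter(continent != "Asia")  # every continent but Asia
```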

3.2.2 Logical operators

Use | to keep rows where at least one of several conditions is true:

gapminder %>% 
  filter(year == 2002 | continent == "Europe")

Use & or , to keep rows where all of the conditions are true:

gapminder %>% 
  filter(year == 2002, continent == "Europe")
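
As a quick check that , and & mean the same thing inside filter(), the sketch below runs both forms and compares the results, then contrasts them with |. The object names are ours.

```r
library(gapminder)
library(dplyr)

# Separating conditions with a comma is the same as joining them with &.
with_comma <- gapminder %>% filter(year == 2002, continent == "Europe")
with_amp   <- gapminder %>% filter(year == 2002 & continent == "Europe")
identical(with_comma, with_amp)

# | keeps a row if either condition holds, so it returns more rows here.
either <- gapminder %>% filter(year == 2002 | continent == "Europe")
nrow(either) > nrow(with_comma)
```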

Use %in% to keep rows whose value matches any element of a vector (a shortcut for repeating == with |):

gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
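
To see that %in% really is a shortcut, the sketch below writes the country condition both ways and compares the results. The names countries, with_in, and with_or are introduced here for illustration.

```r
library(gapminder)
library(dplyr)

countries <- c("Argentina", "Belgium", "Mexico")

# Short form: one %in% test against a vector of values.
with_in <- gapminder %>% filter(country %in% countries)

# Long form: one == test per value, joined with |.
with_or <- gapminder %>%
  filter(country == "Argentina" | country == "Belgium" | country == "Mexico")

# Check that both forms select the same rows.
identical(with_in, with_or)
```

The long form grows with every value added, which is why %in% is usually preferred beyond two or three values.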

3.2.3 `summarize()`

Any numerical summary that you want to apply to a column of a data frame is specified within summarize().

max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
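
summarize() can also compute several summaries in one call, each as a named argument. The following is a minimal sketch; the column names min_exp, median_exp, and n_countries are our own choices.

```r
library(gapminder)
library(dplyr)

summary_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(
    min_exp     = min(lifeExp),     # smallest life expectancy in 1997
    median_exp  = median(lifeExp),  # middle value
    max_exp     = max(lifeExp),     # largest life expectancy in 1997
    n_countries = n()               # number of rows summarized
  )
summary_1997
```

The result is a data frame with one row and one column per summary.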

3.2.4 Combining `summarize()` with `group_by()`

Use group_by() before summarize() when you’d like a numerical summary for each level of a categorical variable:

max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
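
As a further sketch, grouping by continent over all years gives one row per continent, and the Asia row can be checked against the base R calculation from the start of this section. The names by_cont, mean_exp, and n_obs are assumptions for illustration.

```r
library(gapminder)
library(dplyr)

# One summary row per continent, across all years.
by_cont <- gapminder %>%
  group_by(continent) %>%
  summarize(mean_exp = mean(lifeExp), n_obs = n())
by_cont

# The Asia row should agree with the base R result.
asia <- gapminder[gapminder$continent == "Asia", ]
all.equal(by_cont$mean_exp[by_cont$continent == "Asia"], mean(asia$lifeExp))
```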

3.2.5 Without the `%>%`

It’s hard to appreciate the %>% without seeing what the code would look like without it:

max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
        year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
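
Another way to avoid nesting without the pipe is to save each intermediate result. The sketch below does that and checks that it matches the piped version; the intermediate names gap_1997 and gap_1997_grouped are ours.

```r
library(gapminder)
library(dplyr)

# Step by step, naming every intermediate data frame.
gap_1997         <- filter(gapminder, year == 1997)
gap_1997_grouped <- group_by(gap_1997, continent)
stepwise         <- summarize(gap_1997_grouped, max_exp = max(lifeExp))

# The same computation written with the pipe.
piped <- gapminder %>%
  filter(year == 1997) %>%
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))

identical(stepwise, piped)
```

The stepwise form reads in order but leaves extra objects in the workspace; the pipe keeps the reading order without them.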

3.3 `mutate()`

Allows you to
1. create a new variable based on other variables OR
2. change the contents of an existing variable

Create a new variable based on other variables:

gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp

3.4 `mutate()`

Change the contents of an existing variable:

gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
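
A single mutate() call can create several variables, and a later one may use an earlier one from the same call. Here is a minimal sketch; gap_more and gdp_billions are hypothetical names chosen for illustration.

```r
library(gapminder)
library(dplyr)

gap_more <- gapminder %>%
  mutate(
    gdp          = pop * gdpPercap,  # total GDP from population and GDP per capita
    gdp_billions = gdp / 1e9         # uses the gdp column created one line above
  )
gap_more

# mutate() returns a new data frame; gapminder itself is unchanged.
ncol(gapminder)
ncol(gap_more)
```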

3.5 `arrange()`

Reorders the rows in a data frame based on the values of one or more variables. Ties in the first variable are broken by the next one listed.

gapminder %>%
  arrange(year, country)

Wrap a variable in desc() to sort in descending order:

gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
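
Ascending and descending keys can be mixed in one arrange() call. The sketch below sorts the 2007 rows by continent and, within each continent, from highest to lowest life expectancy; the name gap_2007_sorted is ours.

```r
library(gapminder)
library(dplyr)

gap_2007_sorted <- gapminder %>%
  filter(year == 2007) %>%
  arrange(continent, desc(lifeExp))  # continent ascending, lifeExp descending
gap_2007_sorted
```

Because continent is a factor in gapminder, its rows are ordered by factor level, which here is alphabetical.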

3.6 Other useful `dplyr` verbs

select()
top_n()
sample_n()
slice()
glimpse()
rename()
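
As a rough sketch of what these verbs do, the code below applies each of them to gapminder. The object names gap_small, richest, and random_rows, and the choice of columns and row counts, are assumptions for illustration only.

```r
library(gapminder)
library(dplyr)

gap_small <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, gdpPercap) %>%  # keep only these columns
  rename(life_exp = lifeExp) %>%           # new_name = old_name
  slice(1:3)                               # rows 1 to 3 by position
gap_small

# The three rows with the largest gdpPercap, and five rows drawn at random.
richest     <- gapminder %>% filter(year == 2007) %>% top_n(3, gdpPercap)
random_rows <- gapminder %>% sample_n(5)

# Compact overview: one line per column with its type and first values.
glimpse(gapminder)
```

In dplyr 1.0.0 and later, top_n() and sample_n() still work but are superseded by slice_max() and slice_sample().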

3.7 Your Task

Determine which African country had the highest GDP per capita in 1982 using the gapminder data in the gapminder package. Store your answer as one row including all six of the variables in gapminder. Name the resulting 1 x 6 data frame top_africa.

3.8 Your Tasks (Challenge)

For both problems below, use the bechdel data frame in the fivethirtyeight package.

Use the count() function in the dplyr package to determine how many movies in 2013 fell into each of the different categories for clean_test.

Determine the percentage of movies, across all years, that received the value "ok".

3.9 Your Task

Determine the top five movies in terms of domestic return on investment, based on the 2013-scaled data, using the bechdel data frame in the fivethirtyeight package.
