3 Intro to dplyr

In this section, we’ll discuss data wrangling (also called data transformation) with the dplyr package. We’ll explore ways to choose subsets of rows, aggregate data into summaries, create new variables, and sort data frames. We recommend exploring the RStudio Data Transformation cheatsheet alongside this content.

Back to gapminder

Here is a look at the gapminder data frame in the gapminder package.

library(gapminder)
gapminder

Say we wanted the mean life expectancy across all years for Asia:

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649

# dplyr
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

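3.1 The pipe %>%

  • A way to chain together commands: x %>% f(y) is equivalent to f(x, y), so the result on the left becomes the first argument of the function on the right.
  • It plays a role in dplyr similar to that of + in ggplot2: each step builds on the one before it.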
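To make the chaining concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The object names nested and piped are chosen here for illustration only.

```r
library(gapminder)
library(dplyr)

# Nested call: read from the inside out
nested <- nrow(filter(gapminder, continent == "Asia"))

# Piped call: read from top to bottom
piped <- gapminder %>%
  filter(continent == "Asia") %>%
  nrow()

# Both approaches count the same rows
identical(nested, piped)
```

Both versions do the same work; the piped form simply lists the steps in the order they happen.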

3.2 The Five Main Verbs (5MV) of data wrangling

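  • filter(): select a subset of the rows
  • summarize(): compute numerical summaries
  • group_by(): define groups so that summaries are computed per group
  • mutate(): create new variables or change existing ones
  • arrange(): reorder the rows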


3.2.1 filter()

  • Select a subset of the rows of a data frame.
  • The arguments are the “filters” that you’d like to apply.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007
  • Use == (two equals signs, not =) to test whether a variable equals a value
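filter() accepts any comparison that returns TRUE or FALSE for each row, not only ==. A short sketch follows; the cutoff of 80 years and the names long_lived and not_oceania are picked purely for illustration.

```r
library(gapminder)
library(dplyr)

# Keep 2007 rows where life expectancy exceeds 80 years
long_lived <- gapminder %>%
  filter(year == 2007) %>%
  filter(lifeExp > 80)
long_lived

# != keeps the rows that do NOT match the value
not_oceania <- gapminder %>%
  filter(continent != "Oceania")
not_oceania
```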

3.2.2 Logical operators

  • Use | to keep rows where at least one of several conditions is true:
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
  • Use & or , to keep rows where all of several conditions are true:
gapminder %>% 
  filter(year == 2002, continent == "Europe")
  • Use %in% to keep rows where a variable matches any value in a vector (a shortcut for repeating == with |):
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
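The equivalences described above can be checked directly. This is a small sketch; picks, with_in, with_or, with_comma, and with_and are hypothetical names introduced only for this comparison.

```r
library(gapminder)
library(dplyr)

picks <- c("Argentina", "Belgium", "Mexico")

# %in% versus repeated == joined by |
with_in <- gapminder %>%
  filter(country %in% picks)
with_or <- gapminder %>%
  filter(country == "Argentina" | country == "Belgium" | country == "Mexico")

# A comma between conditions versus &
with_comma <- gapminder %>%
  filter(year == 2002, continent == "Europe")
with_and <- gapminder %>%
  filter(year == 2002 & continent == "Europe")

identical(with_in, with_or)
identical(with_comma, with_and)
```

Both calls to identical() should return TRUE, since each pair expresses the same condition in two ways.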

3.2.3 summarize()

  • Any numerical summary that you want to apply to a column of a data frame is specified within summarize().
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
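summarize() can also compute several summaries in one call, returning a single row with one column per summary. A minimal sketch, where the column names and the object name summary_1997 are chosen for illustration:

```r
library(gapminder)
library(dplyr)

summary_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(
    n_countries = n(),          # number of rows (countries) in 1997
    min_exp = min(lifeExp),
    mean_exp = mean(lifeExp),
    max_exp = max(lifeExp)
  )
summary_1997
```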

3.2.4 Combining summarize() with group_by()

Use group_by() before summarize() when you’d like a numerical summary for each level of a categorical variable:

max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
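Grouping works with any summary function, and n() counts the rows in each group. The sketch below uses 2007 and the median as arbitrary choices; by_cont_2007 is a hypothetical name.

```r
library(gapminder)
library(dplyr)

by_cont_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(
    n_countries = n(),             # countries per continent
    median_exp = median(lifeExp)   # median life expectancy per continent
  )
by_cont_2007
```

The result has one row per continent rather than one row overall.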

3.2.5 Without the %>%

It’s hard to appreciate the %>% without seeing what the code would look like without it:

max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
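Another way to avoid the nesting, shown here only for comparison, is to store each intermediate result. The names gap_1997, gap_1997_grouped, by_steps, and by_pipe are hypothetical.

```r
library(gapminder)
library(dplyr)

# Step by step, saving each intermediate data frame
gap_1997 <- filter(gapminder, year == 1997)
gap_1997_grouped <- group_by(gap_1997, continent)
by_steps <- summarize(gap_1997_grouped, max_exp = max(lifeExp))

# The same computation written with the pipe
by_pipe <- gapminder %>%
  filter(year == 1997) %>%
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))

# The two results hold the same values
identical(by_steps$max_exp, by_pipe$max_exp)
```

The step-by-step version is readable but leaves extra objects in the workspace, which the pipe avoids.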

3.3 mutate()

  • Allows you to
    1. create a new variable based on other variables, or
    2. change the contents of an existing variable
  1. Create a new variable based on other variables:
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp

3.4 mutate() (continued)

  2. Change the contents of an existing variable:
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
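A single mutate() call can create several variables, and later expressions may refer to variables created earlier in the same call. A minimal sketch, where gdp_billions and gap_more are names made up for this example:

```r
library(gapminder)
library(dplyr)

gap_more <- gapminder %>%
  mutate(
    gdp = pop * gdpPercap,      # total GDP
    gdp_billions = gdp / 1e9    # uses the gdp column defined just above
  )
gap_more
```

Note that mutate() returns a new data frame; gapminder itself is unchanged unless you assign the result back to it.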

3.5 arrange()

  • Reorders the rows in a data frame based on the values of one or more variables
gapminder %>%
  arrange(year, country)
  • Wrap a variable in desc() to sort in descending order:
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
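Ascending and descending keys can be mixed in one arrange() call. The sketch below sorts by continent first and then, within each continent, from highest to lowest life expectancy; sorted_2007 is a hypothetical name.

```r
library(gapminder)
library(dplyr)

sorted_2007 <- gapminder %>%
  filter(year == 2007) %>%
  arrange(continent, desc(lifeExp))  # ties in continent are broken by lifeExp
sorted_2007
```

Because continent is a factor, its rows are ordered by factor level, which for gapminder is alphabetical.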

3.6 Other useful dplyr verbs

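  • select(): choose a subset of the columns
  • top_n(): keep the rows with the largest values of a variable (superseded by slice_max() in dplyr 1.0.0 and later)
  • sample_n(): draw a random sample of rows (superseded by slice_sample() in dplyr 1.0.0 and later)
  • slice(): choose rows by position
  • glimpse(): print a compact overview of every column
  • rename(): change column names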
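A few of these verbs can be combined in one short pipeline. This is only a sketch; the new column name life_exp and the object name small are illustrative choices.

```r
library(gapminder)
library(dplyr)

small <- gapminder %>%
  select(country, year, lifeExp) %>%   # keep three columns
  rename(life_exp = lifeExp) %>%       # new_name = old_name
  slice(1:3)                           # first three rows by position

# Compact, column-by-column view of the result
glimpse(small)
```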

3.7 Your Task

Determine which African country had the highest GDP per capita in 1982 using the gapminder data in the gapminder package. Store your answer as one row that includes all six of the variables in gapminder. Name the resulting 1 x 6 data frame top_africa.


3.8 Your Tasks (Challenge)

For both problems below, use the bechdel data frame in the fivethirtyeight package:

  • Use the count() function in the dplyr package to determine how many movies from 2013 fell into each category of clean_test.
  • Determine the percentage of movies, across all years, that received a clean_test value of "ok".

3.9 Your Task

Determine the top five movies by domestic return on investment, computed from the 2013-scaled dollar amounts, using the bechdel data frame in the fivethirtyeight package.
