3 Intro to dplyr
In this section, we’ll discuss Data Wrangling/Transformation via the dplyr package. We’ll explore ways to choose subsets of data, aggregate data to create summaries, make new variables, and sort your data frames. It is recommended that you also explore the RStudio Cheatsheet on Data Transformation as we discuss this content.
Back to gapminder
Here is a look at the gapminder data frame in the gapminder package.
library(gapminder)
gapminder
Say we wanted the mean life expectancy across all years for Asia:
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649
# dplyr
library(dplyr)
gapminder %>%
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))
3.1 The pipe %>%
- A way to chain together commands: x %>% f(y) is equivalent to f(x, y)
- It is essentially the dplyr equivalent to the + in ggplot2
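To make the chaining idea concrete, here is a minimal sketch comparing a nested call with its piped form, using the Asia example from above. The object names nested and piped are made up for illustration.

```r
library(gapminder)
library(dplyr)

# Nested call: has to be read from the inside out
nested <- summarize(filter(gapminder, continent == "Asia"),
                    mean_exp = mean(lifeExp))

# Piped call: reads top to bottom, one step per line
piped <- gapminder %>%
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

# Check whether the two approaches give the same one-row result
identical(nested, piped)
```

Each %>% passes the result on its left in as the first argument of the function on its right.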
3.2 The Five Main Verbs (5MV) of data wrangling
filter()
summarize()
group_by()
mutate()
arrange()
3.2.1 filter()
- Select a subset of the rows of a data frame.
- The arguments are the “filters” that you’d like to apply.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007
- Use == to compare a variable to a value
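As a small illustration, == works for text values as well as numbers, as long as the text is quoted. The sketch below assumes "Japan" as the example country, and the name japan is made up for illustration.

```r
library(gapminder)
library(dplyr)

# Compare a numeric variable to a number
gapminder %>% filter(year == 1952)

# Compare a text (factor) variable to a quoted value
japan <- gapminder %>% filter(country == "Japan")
japan
```

Note the double equals sign: a single = names an argument rather than testing for equality.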
3.2.2 Logical operators
- Use | to check whether any of multiple filters is true:
gapminder %>%
  filter(year == 2002 | continent == "Europe")
- Use & or a comma (,) to check whether all of multiple filters are true:
gapminder %>%
  filter(year == 2002, continent == "Europe")
- Use %in% to check whether a value matches any entry in a set (a shortcut for using | repeatedly with ==):
gapminder %>%
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
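The equivalences described in these bullets can be checked directly. This is a minimal sketch; the object names (with_comma, with_and, with_in, with_or) are made up for illustration.

```r
library(gapminder)
library(dplyr)

# A comma between conditions behaves like &
with_comma <- gapminder %>% filter(year == 2002, continent == "Europe")
with_and   <- gapminder %>% filter(year == 2002 & continent == "Europe")
identical(with_comma, with_and)

# %in% is shorthand for chaining == comparisons with |
with_in <- gapminder %>%
  filter(country %in% c("Argentina", "Belgium", "Mexico"))
with_or <- gapminder %>%
  filter(country == "Argentina" | country == "Belgium" | country == "Mexico")
identical(with_in, with_or)
```

The %in% form stays readable as the list of values grows, while the | form gets longer with every value added.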
3.2.3 summarize()
- Any numerical summary that you want to apply to a column of a data frame is specified within summarize().
max_exp_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997
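summarize() is not limited to one summary at a time. The sketch below computes several in a single call; the column names n_countries, mean_exp, and max_exp are our own choices, and n() is the dplyr helper that counts rows.

```r
library(gapminder)
library(dplyr)

summary_1997 <- gapminder %>%
  filter(year == 1997) %>%
  summarize(
    n_countries = n(),           # number of rows (one per country in 1997)
    mean_exp    = mean(lifeExp), # average life expectancy
    max_exp     = max(lifeExp)   # highest life expectancy
  )
summary_1997
```

The result is still a single row, with one column per summary.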
3.2.4 Combining summarize() with group_by()
When you’d like to determine a numerical summary for all levels of a different categorical variable:
max_exp_1997_by_cont <- gapminder %>%
  filter(year == 1997) %>%
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
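To see what group_by() changes, it can help to run the same summarize() call with and without it. This is a sketch only; the names overall and by_cont are made up for illustration.

```r
library(gapminder)
library(dplyr)

# Without group_by(): one row summarizing every country in 1997
overall <- gapminder %>%
  filter(year == 1997) %>%
  summarize(n_countries = n(), max_exp = max(lifeExp))

# With group_by(): one row per continent
by_cont <- gapminder %>%
  filter(year == 1997) %>%
  group_by(continent) %>%
  summarize(n_countries = n(), max_exp = max(lifeExp))

overall
by_cont
```

group_by() does not change the data on its own; it only tells summarize() to work within each group.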
3.2.5 Without the %>%
It’s hard to appreciate the %>% without seeing what the code would look like without it:
max_exp_1997_by_cont <-
  summarize(
    group_by(
      filter(
        gapminder,
        year == 1997),
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
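Another way to avoid both the pipe and the nesting is to save each step in its own object. The sketch below does this; the intermediate names (gap_1997, gap_1997_grouped, max_exp_by_steps) are made up for illustration.

```r
library(gapminder)
library(dplyr)

# One step per line, each saved under its own name
gap_1997         <- filter(gapminder, year == 1997)
gap_1997_grouped <- group_by(gap_1997, continent)
max_exp_by_steps <- summarize(gap_1997_grouped, max_exp = max(lifeExp))
max_exp_by_steps
```

This reads in order like the piped version, but it leaves extra objects behind that you may never use again.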
3.3 mutate()
- Allows you to
  - create a new variable based on other variables OR
  - change the contents of an existing variable
- For example, to create a new variable based on other variables:
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp
3.4 mutate()
- change the contents of an existing variable
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
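A single mutate() call can also create several variables, and later ones can use earlier ones. This is a minimal sketch; the column name gdp_billions and the object name gap_gdp are made up for illustration.

```r
library(gapminder)
library(dplyr)

gap_gdp <- gapminder %>%
  mutate(
    gdp          = pop * gdpPercap, # total GDP for each country-year
    gdp_billions = gdp / 1e9        # uses the gdp column created just above
  )
gap_gdp

# The original data frame is untouched unless you overwrite it
ncol(gapminder)
ncol(gap_gdp)
```

As with the other verbs, mutate() returns a new data frame, so the result must be assigned to a name if you want to keep it.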
3.5 arrange()
- Reorders the rows in a data frame based on the values of one or more variables
gapminder %>%
  arrange(year, country)
- Can also sort in descending order using desc()
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
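Ascending and descending keys can be mixed in one arrange() call. The sketch below sorts the 2007 rows by continent, then by life expectancy from highest to lowest within each continent; the name gap_sorted is made up for illustration.

```r
library(gapminder)
library(dplyr)

gap_sorted <- gapminder %>%
  filter(year == 2007) %>%
  arrange(continent, desc(lifeExp)) # continent ascending, lifeExp descending
gap_sorted
```

The first variable listed is the main sort key; later variables only break ties among rows that share the same value of the earlier ones.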
3.6 Other useful dplyr verbs
select
top_n
sample_n
slice
glimpse
rename
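Here is a brief sketch of what each of these verbs does, using gapminder. The object name small and the new column name life_exp are made up for illustration. Note that recent versions of dplyr mark top_n() and sample_n() as superseded in favor of slice_max() and slice_sample(), although the older functions still work.

```r
library(gapminder)
library(dplyr)

# glimpse(): compact overview of every column
glimpse(gapminder)

# select(): keep only some columns; rename(): new_name = old_name
small <- gapminder %>%
  select(country, year, lifeExp) %>%
  rename(life_exp = lifeExp)

# slice(): pick rows by position
small %>% slice(1:3)

# top_n(): rows with the largest values of a variable
small %>% top_n(3, life_exp)

# sample_n(): randomly chosen rows (changes from run to run)
small %>% sample_n(3)
```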
3.7 Your Task
Determine which African country had the highest GDP per capita in 1982 using the gapminder data in the gapminder package. Store your answer as one row including all six of the variables in gapminder. Give the name top_africa to the resulting 1 x 6 data frame.
3.8 Your Tasks (Challenge)
For both of the problems below, use the bechdel data frame in the fivethirtyeight package:
- Use the count function in the dplyr package to determine how many movies in 2013 fell into each of the different categories for clean_test
- Determine the percentage of movies that received the value of "ok" across all years
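Since count() has not appeared earlier in this section, here is a hedged sketch of how it behaves, shown on gapminder rather than bechdel so that the tasks above are left for you. The object names counts and long_form are made up for illustration.

```r
library(gapminder)
library(dplyr)

# count() tallies the rows for each value of a variable
counts <- gapminder %>%
  filter(year == 2007) %>%
  count(continent)
counts

# It is a shortcut for group_by() followed by summarize(n = n())
long_form <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(n = n())
long_form
```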
3.9 Your Task
Determine the top five movies in terms of domestic return on investment for 2013 scaled data using the bechdel data frame in the fivethirtyeight package.