4 Tidy Data: Why and How

4.1 What is Tidy Data?

each row corresponds to an observation
each variable is a column
each type of observation is in a different table

4.2 Why Tidy Data?

Tidy data enables us to do lots of things!

Great ggplots
Summarize/slice the data in multiple ways
Enable Exploratory Data Analysis
Ensure assumptions are met for methods
Enable Confirmatory Data Analysis

4.3 Beware of columns masquerading as variables!

These columns are actually categories!

Ask yourself: do these columns go together as a single observation for your analysis?

Also ask yourself: What is the unit of observation?

library(tidyr)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)

fertilityData[1:10, 1:10]

4.4 Making data tidy: `gather()`

Use gather() when you need to make a bunch of columns into one column.

library(tidyr)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)

#gather() takes three arguments: data, key, and value
#key is what you want your new categorical column to be named
#value is for the actual values in the columns

#We don't want the `Total fertility rate` column to be included as part of the
#gather() operation, so we use the `-` notation to exclude it.

fertilityDataTidy <- 
  gather(fertilityData, "Year", "fertilityRate", -`Total fertility rate`) %>% 
  select(Country = `Total fertility rate`, Year, fertilityRate) %>% 
  #remove na values (there are countries that have no information)
  filter(!is.na(fertilityRate))

fertilityDataTidy[1:10,]

4.5 Your Task: using tidy data

How would we find the average fertility within a year? Call the summarized data meanYearRate.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnJhcnkodGlkeXIpXG5saWJyYXJ5KGRwbHlyKVxuZmVydGlsaXR5RGF0YSA8LSByZWFkLmNzdihcImRhdGEvdG90YWxfZmVydGlsaXR5LmNzdlwiLCBjaGVjay5uYW1lcyA9IEZBTFNFKVxuXG5mZXJ0aWxpdHlEYXRhVGlkeSA8LSBcbiAgZ2F0aGVyKGZlcnRpbGl0eURhdGEsIFwiWWVhclwiLCBcImZlcnRpbGl0eVJhdGVcIiwgLWBUb3RhbCBmZXJ0aWxpdHkgcmF0ZWApICU+JSBcbiAgc2VsZWN0KENvdW50cnkgPSBgVG90YWwgZmVydGlsaXR5IHJhdGVgLCBZZWFyLCBmZXJ0aWxpdHlSYXRlKSAlPiUgXG4gICNyZW1vdmUgbmEgdmFsdWVzICh0aGVyZSBhcmUgY291bnRyaWVzIHRoYXQgaGF2ZSBubyBpbmZvcm1hdGlvbilcbiAgZmlsdGVyKCFpcy5uYShmZXJ0aWxpdHlSYXRlKSkiLCJzYW1wbGUiOiJhdmVyYWdlRmVydGlsaXR5IDwtIGZlcnRpbGl0eURhdGFUaWR5ICU+JSBcbiAgXG5hdmVyYWdlRmVydGlsaXR5Iiwic29sdXRpb24iOiJhdmVyYWdlRmVydGlsaXR5IDwtIGZlcnRpbGl0eURhdGFUaWR5ICU+JVxuICBncm91cF9ieShZZWFyKSAlPiUgc3VtbWFyaXplKG1lYW5ZZWFyUmF0ZT1tZWFuKGZlcnRpbGl0eVJhdGUpKVxuXG5hdmVyYWdlRmVydGlsaXR5Iiwic2N0IjoidGVzdF9vYmplY3QoXCJhdmVyYWdlRmVydGlsaXR5XCIsIGluY29ycmVjdF9tc2cgPSBcIk5vdCBxdWl0ZS4gRGlkIHlvdSBgZ3JvdXBfYnkoKWAgYW5kXG4gICAgICAgICAgICBgc3VtbWFyaXplKClgIGNvcnJlY3RseT9cIikifQ==

4.6 Making one column into many: `spread()`

Sometimes, you will need to go the other direction: take a long format dataset and make it into a more matrix-like format. This is necessary for such functions such as heatmap().

Let’s change things around and make the Country column into the variables (columns) in the dataset.

fertilityCountryColumns <- fertilityDataTidy %>% 
  #spread takes a key (Country) and value (fertilityRate) argument
  spread(Country, fertilityRate) 

fertilityCountryColumns[1:10, 1:10]

4.7 Your Task - Who is the most democratic?

Load the dem_score.csv dataset in the data/ folder. Tidy it up using gather(). Which countries had the highest democracy score in 2007?

Hint: you’ll have to use your dplyr skills as well. Also, to remove characters, you can use str_replace(var, "X", "") in the stringr package to do it (what dplyr function should you wrap it in?). For more info, check out ?str_replace.

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHN0cmluZ3IpXG5kZW1TY29yZSA8LSByZWFkLmNzdihcImRhdGEvZGVtX3Njb3JlLmNzdlwiKVxuXG4jI3Byb2Nlc3MgZGVtU2NvcmUgaGVyZVxuZGVtU2NvcmVUaWR5IDwtIGRlbVNjb3JlICU+JVxuXG4jc2hvdyB0aGUgdG9wIDUgY2F0ZWdvcmllcyB1c2luZyB0b3BfbigpXG50b3BDb3VudHJpZXMgPC0gZGVtU2NvcmVUaWR5ICU+JVxuICBcbnRvcENvdW50cmllcyAgIiwic29sdXRpb24iOiJsaWJyYXJ5KHN0cmluZ3IpXG5kZW1TY29yZSA8LSByZWFkLmNzdihcImRhdGEvZGVtX3Njb3JlLmNzdlwiKVxuXG5kZW1TY29yZVRpZHkgPC0gZGVtU2NvcmUgJT4lXG4gIGdhdGhlcihrZXkgPSBcInllYXJcIiwgdmFsdWUgPSBcImRlbV9zY29yZVwiLCAtY291bnRyeSkgJT4lXG4gIG11dGF0ZSh5ZWFyID0gYXMubnVtZXJpYyhzdHJfcmVwbGFjZSh5ZWFyLCBcIlhcIiwgXCJcIikpKVxuXG50b3BDb3VudHJpZXMgPC0gZGVtU2NvcmVUaWR5ICU+JVxuICBmaWx0ZXIoeWVhciA9PSAyMDA3KSAlPiVcbiAgdG9wX24obiA9IDUsIHd0ID0gZGVtX3Njb3JlKVxuXG50b3BDb3VudHJpZXMiLCJzY3QiOiJ0ZXN0X2xpYnJhcnlfZnVuY3Rpb24oXCJnYXBtaW5kZXJcIilcbnRlc3RfbGlicmFyeV9mdW5jdGlvbihcImRwbHlyXCIpXG50ZXN0X29iamVjdChcInRvcENvdW50cmllc1wiKVxudGVzdF9vYmplY3QoXCJkZW1TY29yZVRpZHlcIikifQ==

4.8 Challenge - if that was too easy…

Take a look at the who dataset (it’s built into tidyverse)

data(who)

who

4.9 Make it look like this:

load("data/who_tidy.rda")

who_tidy

4.10 Some Hints on the Challenge

Look at the documentation for separate(). You will first have to gather a bunch of the columns into a single column. Then you will have to apply separate() twice, with different parameters.

What does each column mean? Here’s some info from the data dictionary:

The first three letters of entries in the key column correspond to new or old cases of TB.
The next two letters (after the _) correspond to TB type:
- rel for relapse,
- ep for extrapulmonary TB
- sn for smear negative,
- sp for smear positive
The next letter after the second _ corresponds to the sex of the TB patient.
The remaining numbers correspond to age group:
- 014 for 0 to 14 years
- 65 for 65 or older
- etc.