class: middle, title-slide

# Working with Data
## Data Types & Data Classes
### Dennis A. V. Dittrich
### 2021

---
layout: true

<div class="my-footer">
<span><img src="img/tcb-logo.png" height="40px"></span>
</div>

---
class: middle

# .hand[We...]

.huge[.green[have]] .hand[data organised in an unideal way for our analysis]

.huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis]

---

## Data: Sales

<br>

.pull-left[
### .green[We have...]

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--

.pull-right[
### .pink[We want...]

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]

.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---
class: middle

# Pivoting data

---

## Not this...

<img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" />

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

---

## Wider vs. longer

.pull-left[
### .green[wider]
more columns

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--

.pull-right[
### .pink[longer]
more rows

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
]

.pull-right[
```r
pivot_longer(
* data,
  cols,
  names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
]

.pull-right[
```r
pivot_longer(
  data,
* cols,
  names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
- `names_to`: name of the column where column names of pivoted variables go (character string)
]

.pull-right[
```r
pivot_longer(
  data,
  cols,
* names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]

.pull-right[
```r
pivot_longer(
  data,
  cols,
  names_to = "name",
* values_to = "value"
)
```
]

---

## Customers `\(\rightarrow\)` purchases

```r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
* )
purchases
```

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```

---

## Why pivot?
.row[.col-7[
Most likely, because the next step of your analysis needs it
]]

--

.pull-left[
```r
prices
```

```
## # A tibble: 5 x 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5
## 2 banana        0.15
## 3 bread         1
## 4 milk          0.8
## 5 toilet paper  3
```
]

.pull-right[
```r
purchases %>%
* left_join(prices)
```

```
## # A tibble: 6 x 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1
## 2           1 item_2  milk          0.8
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8
## 5           2 item_2  toilet paper  3
## 6           2 item_3  <NA>         NA
```
]

---

## Purchases `\(\rightarrow\)` customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`: which column in the long format contains what should become the column names in the wide format
- `values_from`: which column in the long format contains what should become the values in the new columns in the wide format
]

.pull-right-wide[
```r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

---
class: middle

# Case study: Approval rating of Donald Trump

---

<img src="img/trump-approval.png" width="70%" style="display: block; margin: auto;" />

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/)
]

---

## Data

```r
trump
```

```
## # A tibble: 2,702 x 4
##    subgroup date       approval disapproval
##    <chr>    <date>        <dbl>       <dbl>
##  1 Voters   2020-10-04     44.7        52.2
##  2 Adults   2020-10-04     43.2        52.6
##  3 Adults   2020-10-03     43.2        52.6
##  4 Voters   2020-10-03     45.0        51.7
##  5 Adults   2020-10-02     43.3        52.4
##  6 Voters   2020-10-02     44.5        52.1
##  7 Voters   2020-10-01     44.1        52.8
##  8 Adults   2020-10-01     42.7        53.3
##  9 Adults   2020-09-30     42.2        53.7
## 10 Voters   2020-09-30     44.2        52.7
## # … with 2,692 more rows
```

---

## Goal

.pull-left-wide[
<img src="04.wwdata2_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right-narrow[
**Aesthetic mappings:**

✅ x = `date`

❌ y = `rating_value`

❌ color = `rating_type`

**Facet:**

✅ `subgroup` (Adults and Voters)
]

---

## Pivot

```r
trump_longer <- trump %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )
trump_longer
```

```
## # A tibble: 5,404 x 4
##   subgroup date       rating_type rating_value
##   <chr>    <date>     <chr>              <dbl>
## 1 Voters   2020-10-04 approval            44.7
## 2 Voters   2020-10-04 disapproval         52.2
## 3 Adults   2020-10-04 approval            43.2
## 4 Adults   2020-10-04 disapproval         52.6
## 5 Adults   2020-10-03 approval            43.2
## 6 Adults   2020-10-03 disapproval         52.6
## 7 Voters   2020-10-03 approval            45.0
## 8 Voters   2020-10-03 disapproval         51.7
...
```

---

## Plot

```r
ggplot(trump_longer,
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

<img src="04.wwdata2_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />

---

.panelset[

.panel[.panel-name[Code]
```r
ggplot(trump_longer,
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Donald Trump?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="04.wwdata2_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

.panelset[

.panel[.panel-name[Code]
```r
ggplot(trump_longer,
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) +
  labs(
    x = "Date", y = "Rating",
    color = NULL,
    title = "How (un)popular is Donald Trump?",
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
    caption = "Source: FiveThirtyEight modeling estimates"
  ) +
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="04.wwdata2_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---
class: middle

# Why should you care about data types?

---

## Example: Cat lovers

.row[.col-7[
A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.
]]

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 x 3
##    name           number_of_cats handedness
##    <chr>          <chr>          <chr>
##  1 Bernice Warren 0              left
##  2 Woodrow Stone  0              left
##  3 Willie Bass    1              left
##  4 Tyrone Estrada 3              left
##  5 Alex Daniels   3              left
##  6 Jane Bates     2              left
##  7 Latoya Simpson 1              left
##  8 Darin Woods    1              left
##  9 Agnes Cobb     0              left
## 10 Tabitha Grant  0              left
## # … with 50 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in mean.default(number_of_cats): argument is not numeric
## or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

```r
?mean
```

<img src="img/mean-help.png" width="75%" style="display: block; margin: auto;" />

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning in mean.default(number_of_cats, na.rm = TRUE): argument
## is not numeric or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data

.row[.col-7[
.question[
What is the type of the `number_of_cats` variable?
]
]]

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## Let's take another look

.midi[
]

---

## Sometimes you might need to babysit your respondents

.midi[
```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced
## by coercion
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## You always need to respect data types

```r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story

.row[.col-7[
- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.
]]

---
class: middle

.light-blue[now that we have a good motivation for learning about data types in R]

<br>

.large[
.hand[.light-blue[let's learn about data types in R!]]
]

---
class: middle

# Data types

---

## Data types in R

.row[.col-7[
- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those
]]

---

## Logical & character

.pull-left[
**logical** - boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]

.pull-right[
**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

.pull-left[
**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]

.pull-right[
**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Converting between types

with intention...

.pull-left[
```r
x <- 1:3
x
```

```
## [1] 1 2 3
```

```r
typeof(x)
```

```
## [1] "integer"
```
]

--

.pull-right[
```r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

```r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types

with intention...

.pull-left[
```r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

```r
typeof(x)
```

```
## [1] "logical"
```
]

--

.pull-right[
```r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

```r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types

without intention...

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!
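
A minimal sketch of the rule behind this behaviour, using base R only (`mixed_types` is just an illustrative name, not part of the course materials): when types are mixed, `c()` coerces every element to the most flexible type present, in the order logical, integer, double, character.

```r
# Record the type each mixed vector ends up with after c() coerces it
mixed_types <- c(
  logical_integer  = typeof(c(TRUE, 1L)),   # logical + integer  -> "integer"
  integer_double   = typeof(c(1L, 2.5)),    # integer + double   -> "double"
  double_character = typeof(c(2.5, "a"))    # double + character -> "character"
)
mixed_types
```

Checking `typeof()` on a freshly built vector is a cheap way to catch a conversion you did not ask for.
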
.pull-left[
```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]

--

.pull-right[
```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

```r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]

---

## Explicit vs. implicit coercion

.row[.col-7[
Let's give formal names to what we've seen so far:
]]

--

.row[.col-7[
**Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`
]]

--

.row[.col-7[
**Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector
]]

???

.your-turn[
## Your turn!

.row[.col-7[
- RStudio Cloud > `Ex 04 - Hotels + Data types` > open `type-coercion.Rmd` and knit.
- What is the type of the given vectors? First, guess. Then, try it out in R. If your guess was correct, great! If not, discuss why they have that type.
]
]]

.row[.col-7[
**Example:** Suppose we want to know the type of `c(1, "a")`. First, I'd look at:
]]

.pull-left[
```r
typeof(1)
```

```
## [1] "double"
```
]

.pull-right[
```r
typeof("a")
```

```
## [1] "character"
```
]

.row[.col-7[
and make a guess based on these. Then finally I'd check:
]]

.pull-left[
```r
typeof(c(1, "a"))
```

```
## [1] "character"
```
]
]

---
class: middle

# Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

--

.pull-left[
```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```
]

.pull-right[
```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]

---

## `NA`s are special

```r
x <- c(1, 2, 3, 4, NA)
```

```r
mean(x)
```

```
## [1] NA
```

```r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

```r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

## `NA`s are logical

.row[.col-7[
R uses `NA` to represent missing values in its data structures.
]]

```r
typeof(NA)
```

```
## [1] "logical"
```

---

## Mental model for `NA`s

.row[.col-7[
- Unlike `NaN`, `NA`s are genuinely unknown values
- But that doesn't mean they can't function in a logical way
- Let's think about why `NA`s are logical...
]]

--

.row[.col-7[
.question[
Why do the following give different answers?
]]]

.pull-left[
```r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```
]

.pull-right[
```r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]

`\(\rightarrow\)` See next slide for answers...

---

- `NA` is unknown, so it could be `TRUE` or `FALSE`

.pull-left[
.midi[
- `TRUE | NA`

```r
TRUE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
TRUE | FALSE # if NA was FALSE
```

```
## [1] TRUE
```
]
]

.pull-right[
.midi[
- `FALSE | NA`

```r
FALSE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
FALSE | FALSE # if NA was FALSE
```

```
## [1] FALSE
```
]
]

- Doesn't make sense for mathematical operations
- Makes sense in the context of missing data

---
class: middle

# Data classes

---

## Data classes

.row[.col-7[
We talked about *types* so far, next we'll introduce the concept of *classes*

- Vectors are like Lego building blocks
- We stick them together to build more complicated constructs, e.g. *representations of data*
- The **class** attribute relates to the S3 class of an object which determines its behaviour
  - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious
- Examples: factors, dates, and data frames
]]

---

## Factors

.row[.col-7[
R uses factors to handle categorical variables, variables that have a fixed and known set of possible values
]]

```r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```

--

.pull-left[
```r
typeof(x)
```

```
## [1] "integer"
```
]

.pull-right[
```r
class(x)
```

```
## [1] "factor"
```
]

---

## More on factors

.row[.col-7[
We can think of factors like character (level labels) and an integer (level numbers) glued together
]]

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
as.integer(x)
```

```
## [1] 1 2 3 2
```

---

## Dates

```r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

```r
typeof(y)
```

```
## [1] "double"
```

```r
class(y)
```

```
## [1] "Date"
```

---

## More on dates

.row[.col-7[
We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together
]]

```r
as.integer(y)
```

```
## [1] 18262
```

```r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```

---

## Data frames

.row[.col-7[
We can think of data frames like vectors of equal length glued together
]]

```r
df <- data.frame(x = 1:2, y = 3:4)
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

.pull-left[
```r
typeof(df)
```

```
## [1] "list"
```
]

.pull-right[
```r
class(df)
```

```
## [1] "data.frame"
```
]

---

## Lists

.row[.col-7[
Lists are a generic vector container, vectors of any type can go in them
]]

```r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
```

```
## $x
## [1] 1 2 3 4
##
## $y
## [1] "hi"    "hello" "jello"
##
## $z
## [1]  TRUE FALSE
```

---

## Lists and data frames

.row[.col-7[
- A data frame is a special list containing vectors of equal length
- When we use the `pull()` function, we extract a vector from the data frame
]]

```r
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

```r
df %>%
  pull(y)
```

```
## [1] 3 4
```

---
class: middle

# Working with factors

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## But coerce when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="04.wwdata2_files/figure-html/unnamed-chunk-72-1.png" width="60%" style="display: block; margin: auto;" />

---

## Use forcats to manipulate factors

```r
cat_lovers %>%
* mutate(handedness = fct_infreq(handedness)) %>%
  ggplot(mapping = aes(x = handedness)) +
  geom_bar()
```

<img src="04.wwdata2_files/figure-html/unnamed-chunk-73-1.png" width="55%" style="display: block; margin: auto;" />

---

## Forcats for Factors

.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]

.pull-left-wide[
- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
]

???

.your-turn[
## Your turn!

.row[.col-7[
- [RStudio Cloud](http://rstd.io/dsbox-cloud) > `Ex 04 - Hotels + Data types` > `hotels-forcats.Rmd` > knit
- Recreate the x-axis of the following plot.
- **Stretch goal:** Recreate the y-axis.
]
]

<img src="04.wwdata2_files/figure-html/unnamed-chunk-75-1.png" width="90%" style="display: block; margin: auto;" />
]
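
---

## Reordering levels: a small sketch

.row[.col-7[
To see what "overriding the ordering" means in practice, here is a minimal sketch on a made-up vector. The `handedness` values below are hypothetical and are not the `cat_lovers` data; the sketch assumes the forcats package is installed.
]]

```r
library(forcats)

# Hypothetical responses, invented for illustration only
handedness <- c("left", "right", "right", "ambidextrous", "right", "left")

# factor() orders the levels alphabetically by default
default_levels <- levels(factor(handedness))

# fct_infreq() reorders the levels by decreasing frequency
frequency_levels <- levels(fct_infreq(handedness))

default_levels
frequency_levels
```

.row[.col-7[
Bar charts draw one bar per level in level order, so the frequency-ordered version should put the most common category first.
]]
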