# Working with Data
## Data Types & Data Classes
### Dennis A. V. Dittrich
### 2021

---

# .hand[We...]

---

## Data: Sales

<br>

.pull-left[

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--
.pull-right[
### .pink[We want...]

```
## # A tibble: 6 x 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3; display: block; margin: auto;" />
]

---

## Wider vs. longer

--
.pull-right[
### .pink[longer]
more rows
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
]
.pull-right[

```r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
]
.pull-right[

```r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Customers `$\rightarrow$` purchases

```r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```
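
To run the pivot end to end, the `customers` table has to exist first. The sketch below rebuilds it from the two rows shown on the "Data: Sales" slide (using `tribble()` here is an assumption, not necessarily how the original data was created) and also shows the optional `values_drop_na` argument, which drops the row for the missing third item.

```r
library(tidyr)
library(tibble)

# Rebuild the two-row customers table from the "Data: Sales" slide
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_
)

# One row per customer and item slot, missing items included
purchases <- pivot_longer(
  customers,
  cols = item_1:item_3,
  names_to = "item_no",
  values_to = "item"
)

# Same pivot, but rows whose value is NA are dropped
purchases_complete <- pivot_longer(
  customers,
  cols = item_1:item_3,
  names_to = "item_no",
  values_to = "item",
  values_drop_na = TRUE
)

nrow(purchases)           # 6
nrow(purchases_complete)  # 5
```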

---

## Why pivot?
.row[.col-7[
Most likely, because the next step of your analysis needs it.
]]
--
.pull-left[

```r
prices
```

```
## # A tibble: 5 x 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]
.pull-right[

```r
purchases %>%
* left_join(prices)
```

```
## # A tibble: 6 x 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1   
## 2           1 item_2  milk          0.8 
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8 
## 5           2 item_2  toilet paper  3   
## 6           2 item_3  <NA>         NA
```
]
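
As an illustration of a "next step" that needs the long format, here is a minimal self-contained sketch that joins prices and totals them per customer. The `total` column and the per-customer summary are assumptions added for illustration; naming the key with `by = "item"` avoids the message that `left_join()` prints when it has to guess the join column.

```r
library(dplyr)
library(tibble)

# Long-format purchases, as produced by pivot_longer() on the slides
purchases <- tribble(
  ~customer_id, ~item_no, ~item,
  1, "item_1", "bread",
  1, "item_2", "milk",
  1, "item_3", "banana",
  2, "item_1", "milk",
  2, "item_2", "toilet paper",
  2, "item_3", NA_character_
)

# Subset of the price list from the slides
prices <- tribble(
  ~item,          ~price,
  "banana",       0.15,
  "bread",        1,
  "milk",         0.8,
  "toilet paper", 3
)

# Join on the shared key, then add up prices per customer;
# the missing item has no price and is skipped by na.rm = TRUE
totals <- purchases %>%
  left_join(prices, by = "item") %>%
  group_by(customer_id) %>%
  summarise(total = sum(price, na.rm = TRUE))

totals
```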

---

## Purchases `$\rightarrow$` customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`: the column in the long format that contains what should become the column names in the wide format
- `values_from`: the column in the long format that contains what should become the values of the new columns in the wide format
]
.pull-right-wide[

```r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```
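
A quick way to convince yourself that `pivot_wider()` undoes `pivot_longer()` is a round trip. This is a sketch that rebuilds `customers` with `tribble()` (an assumption about how the data was created); the name `customers_again` is made up for the example.

```r
library(tidyr)
library(tibble)

customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_
)

# Wide -> long
purchases <- pivot_longer(
  customers,
  cols = item_1:item_3,
  names_to = "item_no",
  values_to = "item"
)

# Long -> wide again
customers_again <- pivot_wider(
  purchases,
  names_from = item_no,
  values_from = item
)

# Compare the round-tripped table with the original
all.equal(customers, customers_again)
```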

]

---

# Case study: Approval rating of Donald Trump

---

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/)
]

---

## Data

```r
trump
```

```
## # A tibble: 2,702 x 4
##    subgroup date       approval disapproval
##    <chr>    <date>        <dbl>       <dbl>
##  1 Voters   2020-10-04     44.7        52.2
##  2 Adults   2020-10-04     43.2        52.6
##  3 Adults   2020-10-03     43.2        52.6
##  4 Voters   2020-10-03     45.0        51.7
##  5 Adults   2020-10-02     43.3        52.4
##  6 Voters   2020-10-02     44.5        52.1
##  7 Voters   2020-10-01     44.1        52.8
##  8 Adults   2020-10-01     42.7        53.3
##  9 Adults   2020-09-30     42.2        53.7
## 10 Voters   2020-09-30     44.2        52.7
## # … with 2,692 more rows
```

---

## Goal

.pull-left-wide[
<img src="04.wwdata2_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right-narrow[
**Aesthetic mappings:**  
✅  x = `date`  
❌      y = `rating_value`  
❌      color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Pivot

```r
trump_longer <- trump %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

trump_longer
```

```
## # A tibble: 5,404 x 4
##    subgroup date       rating_type rating_value
##    <chr>    <date>     <chr>              <dbl>
##  1 Voters   2020-10-04 approval            44.7
##  2 Voters   2020-10-04 disapproval         52.2
##  3 Adults   2020-10-04 approval            43.2
##  4 Adults   2020-10-04 disapproval         52.6
##  5 Adults   2020-10-03 approval            43.2
##  6 Adults   2020-10-03 disapproval         52.6
##  7 Voters   2020-10-03 approval            45.0
##  8 Voters   2020-10-03 disapproval         51.7
...
```

---

## Plot

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

---

.panelset[
.panel[.panel-name[Code]

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Donald Trump?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="04.wwdata2_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

.panelset[
.panel[.panel-name[Code]

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is Donald Trump?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="04.wwdata2_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

# Why should you care about data types?

---

## Example: Cat lovers
.row[.col-7[
A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.
]]

```r
cat_lovers <- read_csv("data/cat-lovers.csv")
```

```
## # A tibble: 60 x 3
##    name           number_of_cats handedness
##    <chr>          <chr>          <chr>     
##  1 Bernice Warren 0              left      
##  2 Woodrow Stone  0              left      
##  3 Willie Bass    1              left      
##  4 Tyrone Estrada 3              left      
##  5 Alex Daniels   3              left      
##  6 Jane Bates     2              left      
##  7 Latoya Simpson 1              left      
##  8 Darin Woods    1              left      
##  9 Agnes Cobb     0              left      
## 10 Tabitha Grant  0              left      
## # … with 50 more rows
```

---

## Oh why won't you work?!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in mean.default(number_of_cats): argument is not numeric
## or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

```r
?mean
```

---

## Oh why won't you still work??!!

```r
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
```

```
## Warning in mean.default(number_of_cats, na.rm = TRUE): argument
## is not numeric or logical: returning NA
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA
```

---

## Take a breath and look at your data
.row[.col-7[
.question[
What is the type of the `number_of_cats` variable?
]
]]

```r
glimpse(cat_lovers)
```

```
## Rows: 60
## Columns: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Will…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", …
## $ handedness     <chr> "left", "left", "left", "left", "left", …
```

---

## Let's take another look

.midi[
Browsing all 60 responses shows two `number_of_cats` entries that are not plain numbers:

- Ginger Clark: `1.5 - honestly I think one of my cats is half human`
- Doug Bass: `three`
]

---

## Sometimes you might need to babysit your respondents

```r
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced
## by coercion
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```
]

---

## You always need to respect data types

```r
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
```

```
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.833
```
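
The first attempt warned because `case_when()` evaluates `as.numeric(number_of_cats)` on the whole column, including the two text answers, before picking values. A small sketch for locating such entries yourself: the vector below is a hypothetical five-element stand-in for the real column, and the names `converted` and `bad_entries` are made up for the example.

```r
# Hypothetical stand-in for cat_lovers$number_of_cats
number_of_cats <- c(
  "0", "1",
  "1.5 - honestly I think one of my cats is half human",
  "three", "2"
)

# Convert once, silencing the expected "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(number_of_cats))

# Entries that were present as text but became NA after conversion
bad_entries <- number_of_cats[is.na(converted) & !is.na(number_of_cats)]
bad_entries
```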

---

## Now that we know what we're doing...

```r
*cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )
```

---

## Moral of the story
.row[.col-7[
- If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
- Go in and investigate your data, apply the fix, *save your data*, live happily ever after.
]]
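
A minimal sketch of how a single text answer changes the type of a whole column at read time. The inline CSV strings are invented for illustration, and passing literal data with `I()` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical survey files supplied inline; only the second has a text answer
clean_csv <- I("name,number_of_cats\nA,1\nB,3\n")
messy_csv <- I("name,number_of_cats\nA,1\nB,three\n")

clean <- read_csv(clean_csv, show_col_types = FALSE)
messy <- read_csv(messy_csv, show_col_types = FALSE)

typeof(clean$number_of_cats)  # "double"
typeof(messy$number_of_cats)  # "character"
```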
---


# Data types

---

## Data types in R
.row[.col-7[
- **logical**
- **double**
- **integer**
- **character**
- and some more, but we won't be focusing on those
]]

---

## Logical & character

.pull-left[
**logical** - Boolean values `TRUE` and `FALSE`

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]
.pull-right[
**character** - character strings

```r
typeof("hello")
```

```
## [1] "character"
```
]

---

## Double & integer

.pull-left[
**double** - floating point numerical values (default numerical type)

```r
typeof(1.335)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]
.pull-right[
**integer** - integer numerical values (indicated with an `L`)

```r
typeof(7L)
```

```
## [1] "integer"
```

```r
typeof(1:3)
```

```
## [1] "integer"
```
]

---

## Concatenation

Vectors can be constructed using the `c()` function.

```r
c(1, 2, 3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(c("hi", "hello"), c("bye", "jello"))
```

```
## [1] "hi"    "hello" "bye"   "jello"
```

---

## Converting between types

with intention...

.pull-left[

```r
x <- 1:3
x
```

```
## [1] 1 2 3
```

```r
typeof(x)
```

```
## [1] "integer"
```
]
--
.pull-right[

```r
y <- as.character(x)
y
```

```
## [1] "1" "2" "3"
```

```r
typeof(y)
```

```
## [1] "character"
```
]

---

## Converting between types

with intention...

.pull-left[

```r
x <- c(TRUE, FALSE)
x
```

```
## [1]  TRUE FALSE
```

```r
typeof(x)
```

```
## [1] "logical"
```
]
--
.pull-right[

```r
y <- as.numeric(x)
y
```

```
## [1] 1 0
```

```r
typeof(y)
```

```
## [1] "double"
```
]

---

## Converting between types

without intention...

R will happily convert between various types without complaint when different types of data are concatenated in a vector, and that's not always a great thing!

.pull-left[

```r
c(1, "Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```
]
--
.pull-right[

```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```

```r
c(2L, "two")
```

```
## [1] "2"   "two"
```
]
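
The four results above follow one rule: when types are mixed in `c()`, R converts everything to the most flexible type present, in the order logical, integer, double, character. A short sketch that walks up that ladder:

```r
typeof(c(TRUE, 1L))            # "integer"   (logical -> integer)
typeof(c(1L, 1.5))             # "double"    (integer -> double)
typeof(c(1.5, "a"))            # "character" (double -> character)
typeof(c(TRUE, 1L, 1.5, "a"))  # "character" (most flexible type wins)
```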

---

## Explicit vs. implicit coercion
.row[.col-7[
Let's give formal names to what we've seen so far:
]]
--
.row[.col-7[
**Explicit coercion** is when you call a function like `as.logical()`, `as.numeric()`, `as.integer()`, `as.double()`, or `as.character()`
]]

--
.row[.col-7[
**Implicit coercion** happens when you use a vector in a specific context that expects a certain type of vector
]]
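
A few everyday cases of implicit coercion, as a sketch: arithmetic functions treat logical values as 0 and 1, and `paste()` turns its arguments into character strings. The vector `x` is made up for the example.

```r
x <- c(3, 8, 1, 10)

# x > 2 is logical; sum() and mean() treat TRUE as 1 and FALSE as 0
n_above  <- sum(x > 2)    # 3
share    <- mean(x > 2)   # 0.75

# paste() converts every argument to character
label <- paste(1L, TRUE)  # "1 TRUE"
```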

???

.your-turn[
## Your turn!
.row[.col-7[
- RStudio Cloud > `Ex 04 - Hotels + Data types` > open `type-coercion.Rmd` and knit.
- What is the type of the given vectors? First, guess. Then, try it out in R. 
If your guess was correct, great! If not, discuss why they have that type.
]
]]
.row[.col-7[
**Example:** Suppose we want to know the type of `c(1, "a")`. First, I'd look at: 
]]
.pull-left[

```r
typeof(1)
```

```
## [1] "double"
```
]
.pull-right[

```r
typeof("a")
```

```
## [1] "character"
```
]
.row[.col-7[
and make a guess based on these. Then finally I'd check:
]]
.pull-left[

```r
typeof(c(1, "a"))
```

```
## [1] "character"
```
]
]

---

# Special values

---

## Special values

- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity

.pull-left[

```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```
]
.pull-right[

```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]
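
Base R has a test function for each of these special values. The sketch below applies them to one vector; note that `is.na()` is also `TRUE` for `NaN`, while `is.nan()` is `TRUE` only for `NaN`.

```r
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
```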

---

## `NA`s are special

```r
x <- c(1, 2, 3, 4, NA)
```

```r
mean(x)
```

```
## [1] NA
```

```r
mean(x, na.rm = TRUE)
```

```
## [1] 2.5
```

```r
summary(x)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.75    2.50    2.50    3.25    4.00       1
```

---

## `NA`s are logical
.row[.col-7[
R uses `NA` to represent missing values in its data structures.
]]

```r
typeof(NA)
```

```
## [1] "logical"
```
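
A bare `NA` is logical, but R also has typed missing values, and a plain `NA` is coerced to match the vector it sits in. A short sketch:

```r
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# A plain NA takes on the type of its neighbours
typeof(c(1.5, NA))     # "double"
typeof(c("a", NA))     # "character"
```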

---

## Mental model for `NA`s
.row[.col-7[
- Unlike `NaN`, `NA`s are genuinely unknown values
- But that doesn't mean they can't function in a logical way
- Let's think about why `NA`s are logical...
]]
--
.row[.col-7[
.question[
Why do the following give different answers?
]]]
.pull-left[

```r
# TRUE or NA
TRUE | NA
```

```
## [1] TRUE
```
]
.pull-right[

```r
# FALSE or NA
FALSE | NA
```

```
## [1] NA
```
]

`$\rightarrow$` See next slide for answers...

---

- `NA` is unknown, so it could be `TRUE` or `FALSE`

.pull-left[
.midi[
- `TRUE | NA`

```r
TRUE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
TRUE | FALSE # if NA was FALSE
```

```
## [1] TRUE
```
]
]
.pull-right[
.midi[
- `FALSE | NA`

```r
FALSE | TRUE  # if NA was TRUE
```

```
## [1] TRUE
```

```r
FALSE | FALSE # if NA was FALSE
```

```
## [1] FALSE
```
]
]

- Doesn't make sense for mathematical operations 
- Makes sense in the context of missing data
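
The same reasoning applies to `&`, with the roles reversed: the result is known only when one side already settles it. Arithmetic and comparisons with `NA` stay unknown. A sketch:

```r
FALSE & NA  # FALSE: the answer is FALSE whatever NA is
TRUE & NA   # NA:    the answer depends on the unknown value

NA + 1      # NA
NA > 1      # NA

# any() and all() follow the same logic
any(c(TRUE, NA))   # TRUE
all(c(FALSE, NA))  # FALSE
```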

---

# Data classes

---

## Data classes
.row[.col-7[
So far we have talked about *types*; next we'll introduce the concept of *classes*.

- Vectors are like Lego building blocks
- We stick them together to build more complicated constructs, e.g. *representations of data*
- The **class** attribute relates to the S3 class of an object, which determines its behaviour
  - You don't need to worry about what S3 classes really mean, but you can read more about it [here](https://adv-r.hadley.nz/s3.html#s3-classes) if you're curious

- Examples: factors, dates, and data frames
]]

---

## Factors
.row[.col-7[
R uses factors to handle categorical variables: variables that have a fixed and known set of possible values
]]
.pull-left[

```r
x <- factor(c("BS", "MS", "PhD", "MS"))
x
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

```r
typeof(x)
```

```
## [1] "integer"
```
]
.pull-right[

```r
class(x)
```

```
## [1] "factor"
```
]

---

## More on factors
.row[.col-7[
We can think of factors like character (level labels) and an integer (level numbers) glued together
]]

```r
glimpse(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
as.integer(x)
```

```
## [1] 1 2 3 2
```
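
Because the integers are level numbers, they depend on the order of the levels, not on the labels. The sketch below sets the level order explicitly and then shows a common pitfall with number-like labels; the names `degrees` and `f` are made up for the example.

```r
# Explicit level order changes the underlying integers
degrees <- factor(c("BS", "MS", "PhD", "MS"),
                  levels = c("PhD", "MS", "BS"))
levels(degrees)      # "PhD" "MS"  "BS"
as.integer(degrees)  # 3 2 1 2

# Pitfall: converting a factor with number-like labels
f <- factor(c("10", "20", "10"))
as.integer(f)                # 1 2 1     (level numbers, not the labels)
as.numeric(as.character(f))  # 10 20 10  (go through character first)
```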

---

## Dates

```r
y <- as.Date("2020-01-01")
y
```

```
## [1] "2020-01-01"
```

```r
typeof(y)
```

```
## [1] "double"
```

```r
class(y)
```

```
## [1] "Date"
```

---

## More on dates
.row[.col-7[
We can think of dates like an integer (the number of days since the origin, 1 Jan 1970) and an integer (the origin) glued together
]]

```r
as.integer(y)
```

```
## [1] 18262
```

```r
as.integer(y) / 365 # roughly 50 yrs
```

```
## [1] 50.03288
```
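
Since a date is a day count underneath, date arithmetic is ordinary arithmetic on that number. A short sketch using the same `y`:

```r
y <- as.Date("2020-01-01")

unclass(y)                              # 18262, the stored day count
days_since_origin <- y - as.Date("1970-01-01")
days_since_origin                       # Time difference of 18262 days

y + 30                                  # "2020-01-31"
as.Date(18262, origin = "1970-01-01")   # "2020-01-01"
```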

---

## Data frames
.row[.col-7[
We can think of data frames like vectors of equal length glued together
]]
.pull-left[

```r
df <- data.frame(x = 1:2, y = 3:4)
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

```r
typeof(df)
```

```
## [1] "list"
```
]
.pull-right[

```r
class(df)
```

```
## [1] "data.frame"
```
]

---

## Lists
.row[.col-7[
Lists are a generic vector container; vectors of any type can go in them
]]

```r
l <- list(
  x = 1:4,
  y = c("hi", "hello", "jello"),
  z = c(TRUE, FALSE)
)
l
```

```
## $x
## [1] 1 2 3 4
## 
## $y
## [1] "hi"    "hello" "jello"
## 
## $z
## [1]  TRUE FALSE
```

---

## Lists and data frames
.row[.col-7[
- A data frame is a special list containing vectors of equal length
- When we use the `pull()` function, we extract a vector from the data frame
]]

```r
df
```

```
##   x y
## 1 1 3
## 2 2 4
```

```r
df %>%
  pull(y)
```

```
## [1] 3 4
```
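
Because a data frame is a list of columns, the usual list tools work on it too. The sketch below compares `pull()` with the base R ways of extracting a column.

```r
library(dplyr)

df <- data.frame(x = 1:2, y = 3:4)

is.list(df)  # TRUE
length(df)   # 2, the number of columns

# Three ways to get the same vector
df$y
df[["y"]]
pull(df, y)

identical(df[["y"]], pull(df, y))  # TRUE
```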

---

# Working with factors

---

## Read data in as character strings

```r
glimpse(cat_lovers)
```

---

## But coerce when plotting

```r
ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
```

---

## Use forcats to manipulate factors

```r
cat_lovers %>%
* mutate(handedness = fct_infreq(handedness)) %>%
  ggplot(mapping = aes(x = handedness)) +
  geom_bar()
```
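
What `fct_infreq()` changes is the order of the levels, which is what `geom_bar()` uses to order the bars. A minimal sketch on a made-up vector (the six values below are hypothetical, not the survey data):

```r
library(forcats)

# Hypothetical handedness answers
handedness <- factor(c("left", "right", "right",
                       "ambidextrous", "right", "left"))

levels(handedness)              # alphabetical: "ambidextrous" "left" "right"
levels(fct_infreq(handedness))  # by frequency: "right" "left" "ambidextrous"
```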

---

## Forcats for Factors

.pull-right[
<img src="img/forcats-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]

.pull-left-wide[
- Factors are useful when you have true categorical data and you want to override the ordering of character vectors to improve display
- They are also useful in modeling scenarios
- The **forcats** package provides a suite of useful tools that solve common problems with factors
]

???

.your-turn[
## Your turn!
.row[.col-7[
- [RStudio Cloud](http://rstd.io/dsbox-cloud) > `Ex 04 - Hotels + Data types` > `hotels-forcats.Rmd` > knit
- Recreate the x-axis of the following plot. 
- **Stretch goal:** Recreate the y-axis.
]
]

<img src="04.wwdata2_files/figure-html/unnamed-chunk-75-1.png" width="90%" style="display: block; margin: auto;" />
]