class: middle, title-slide # Exploratory Data Analysis ## and Data Visualization II ### Dennis A. V. Dittrich ### 2021 --- layout: true <div class="my-footer"> <span><img src="img/tcb-logo.png" height="40px"></span> </div> --- class: middle # Terminology --- ## Number of variables involved .row[.col-7[ **Univariate** data analysis - distribution of a single variable **Bivariate** data analysis - relationship between two variables **Multivariate** data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others ]] --- ## Types of variables .row[.col-7[ A variable is a general characteristic being observed on objects of interest. Types of variables * **Qualitative**: gender, race, political affiliation. * **Quantitative**: test scores, age, weight. Quantitative variables must have units. The units indicate... * how each value has been measured. * the corresponding scale of measurement. * how much of something we have. * how far apart two values are. ] .col-5[ **Numerical variables** can be classified as **continuous** or **discrete**: a continuous variable can take any of infinitely many values within a range, while a discrete variable takes only separate values, typically non-negative whole numbers. A **categorical** variable is **ordinal** if its levels have a natural ordering. 
]] --- class: middle # Data --- ## Data: Lending Club .pull-left-wide[ - Thousands of loans made through the Lending Club, which is a platform that allows individuals to lend to other individuals - Not all loans are created equal -- ease of getting a loan depends on (apparent) ability to pay back the loan - The data include only loans *made*; these are not loan applications ] .pull-right-narrow[ <img src="img/lending-club.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Take a peek at data ```r library(tidyverse) library(openintro) glimpse(loans_full_schema) ``` ``` ## Rows: 10,000 ## Columns: 55 ## $ emp_title <chr> "global config engine… ## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, … ## $ state <fct> NJ, HI, WI, PA, CA, K… ## $ homeownership <fct> MORTGAGE, RENT, RENT,… ## $ annual_income <dbl> 90000, 40000, 40000, … ## $ verified_income <fct> Verified, Not Verifie… ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 1… ## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000… ## $ verification_income_joint <fct> , , , , Verified, , N… ## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66… ## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, … ## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3… ## $ earliest_credit_line <dbl> 2001, 1996, 2006, 200… ## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, … ## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32… ## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12… ... 
``` --- ## Selected variables ```r loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income) glimpse(loans) ``` ``` ## Rows: 10,000 ## Columns: 8 ## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, … ## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72,… ## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36,… ## $ grade <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, … ## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL,… ## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 3400… ## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, … ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46,… ``` --- ## Selected variables <br> .midi[ variable | description ----------------|------------- `loan_amount` | Amount of the loan received, in US dollars `interest_rate` | Interest rate on the loan, as an annual percentage `term` | The length of the loan, which is always set as a whole number of months `grade` | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid `state` | US state where the borrower resides `annual_income` | Borrower’s annual income, including any second income, in US dollars `homeownership` | Indicates whether the borrower owns outright, owns with a mortgage, or rents `debt_to_income` | Debt-to-income ratio ] --- ## Variable types <br> variable | type ----------------|------------- `loan_amount` | numerical, continuous `interest_rate` | numerical, continuous `term` | numerical, discrete `grade` | categorical, ordinal `state` | categorical, not ordinal `annual_income` | numerical, continuous `homeownership` | categorical, not ordinal `debt_to_income` | numerical, continuous --- class: middle # Visualizing numerical data --- ## Describing shapes of numerical distributions .row[.col-7[ **shape**: - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer 
tail) - modality: unimodal, bimodal, multimodal, uniform **center**: mean (`mean`), median (`median`), mode (not always useful) **spread**: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`) unusual observations ]] --- class: middle # Histogram --- ## Histogram ```r ggplot(loans, aes(x = loan_amount)) + geom_histogram() ``` ``` ## `stat_bin()` using `bins = 30`. Pick better value with ## `binwidth`. ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Histograms and binwidth .panelset[ .panel[.panel-name[binwidth = 1000] ```r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 1000) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[binwidth = 5000] ```r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[binwidth = 20000] ```r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 20000) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" /> ] ] --- ## Customizing histograms .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount)) + geom_histogram(binwidth = 5000) + * labs( * x = "Loan amount ($)", * y = "Frequency", * title = "Amounts of Lending Club loans" * ) ``` ] ] --- ## Fill with a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, * fill = homeownership)) + geom_histogram(binwidth 
= 5000, * alpha = 0.5) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) ``` ] ] --- ## Facet with a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, fill = homeownership)) + geom_histogram(binwidth = 5000) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + * facet_wrap(~ homeownership, nrow = 3) ``` ] ] --- class: middle # Dot plot --- ## Dot plot ```r smallloans <- loans %>% sample_n(50) smallloans %>% ggplot(aes(x = loan_amount)) + geom_dotplot() + scale_y_continuous(NULL, breaks = NULL) ``` ``` ## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-12-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Fill with a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r smallloans %>% ggplot(aes(x = loan_amount, * fill = homeownership)) + * geom_dotplot(stackgroups = TRUE, binpositions = "all", alpha = 0.5) + scale_y_continuous(NULL, breaks = NULL) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) ``` ] ] --- ## Facet with a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r smallloans %>% ggplot(aes(x = loan_amount, fill = homeownership)) + geom_dotplot() + scale_y_continuous(NULL, breaks = NULL) + labs( x = "Loan amount ($)", y = "Frequency", title = "Amounts of Lending Club loans" ) + * facet_wrap(~ homeownership, nrow = 3) ``` ] ] --- ## Beeswarm plot ```r 
library(ggbeeswarm) loans %>% sample_n(200) %>% ggplot(aes(y = loan_amount, x = homeownership, color = homeownership)) + geom_quasirandom(varwidth = TRUE) + guides(color = "none") ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" /> --- class: middle # Density plot --- ## Density plot ```r ggplot(loans, aes(x = loan_amount)) + geom_density() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Density plots and adjusting bandwidth .panelset[ .panel[.panel-name[adjust = 0.5] ```r ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 0.5) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-17-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[adjust = 1] ```r ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 1) # default bandwidth ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-18-1.png" width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[adjust = 2] ```r ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 2) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" /> ] ] --- ## Customizing density plots .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount)) + geom_density(adjust = 2) + * labs( * x = "Loan amount ($)", * y = "Density", * title = "Amounts of Lending Club loans" * ) ``` ] ] --- ## Adding a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = loan_amount, * fill = homeownership)) + geom_density(adjust = 2, * alpha = 0.5) + labs( x = 
"Loan amount ($)", y = "Density", title = "Amounts of Lending Club loans", * fill = "Homeownership" ) ``` ] ] --- ## Ridge plots ```r library(ggridges) ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) + geom_density_ridges(alpha = 0.5) ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Box plot --- ## Box plot ```r ggplot(loans, aes(x = interest_rate)) + geom_boxplot() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Box plot and outliers ```r ggplot(loans, aes(x = annual_income)) + geom_boxplot() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Customizing box plots .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = interest_rate)) + geom_boxplot() + labs( x = "Interest rate (%)", y = NULL, title = "Interest rates of Lending Club loans" ) + * theme( * axis.ticks.y = element_blank(), * axis.text.y = element_blank() * ) ``` ] ] --- ## Adding a categorical variable .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(x = interest_rate, * y = grade)) + geom_boxplot() + labs( x = "Interest rate (%)", y = "Grade", title = "Interest rates of Lending Club loans", * subtitle = "by grade of loan" ) ``` ] ] --- ## Adding a beeswarm .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-27-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r loans %>% sample_n(200) %>% ggplot(aes(y = loan_amount, x = homeownership)) + * 
geom_boxplot() + * geom_quasirandom(varwidth = TRUE, alpha = 0.3) + labs( x = "Home Ownership", y = "Loan Amount", title = "Amounts of Lending Club loans" ) + theme_minimal() ``` ] ] --- class: middle # Relationships between numerical variables --- ## Scatterplot ```r ggplot(loans, aes(x = debt_to_income, y = interest_rate)) + geom_point() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Hex plot ```r ggplot(loans, aes(x = debt_to_income, y = interest_rate)) + geom_hex() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-29-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Hex plot ```r ggplot(loans %>% filter(debt_to_income < 100), aes(x = debt_to_income, y = interest_rate)) + geom_hex() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Recap --- ## Variables .row[.col-7[ **Numerical** variables can be classified as **continuous** or **discrete**: a continuous variable can take any of infinitely many values within a range, while a discrete variable takes only separate values, typically non-negative whole numbers. A **categorical** variable is **ordinal** if its levels have a natural ordering. 
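
In R, part of this classification can be read off how a column is stored. The sketch below assumes a small made-up data frame `toy` (hypothetical values and column names, not the Lending Club data) and uses only base R. Storage type is a hint, not a verdict: a discrete variable such as a loan term in months may well be stored as a double.

```r
# Hypothetical toy data, standing in for a few loan records
toy <- data.frame(
  amount = c(5000, 12000, 7500),                 # numerical, continuous
  term   = c(36L, 60L, 36L),                     # numerical, discrete (whole months)
  grade  = factor(c("C", "A", "B"),
                  levels = c("A", "B", "C"),
                  ordered = TRUE),               # categorical, ordinal
  state  = factor(c("NJ", "HI", "WI"))           # categorical, not ordinal
)

# Which columns are numerical?
sapply(toy, is.numeric)

# Which columns are categorical, and which of those are ordinal?
sapply(toy, is.factor)
sapply(toy, is.ordered)
```

Here `is.ordered()` is `TRUE` only for `grade`, the one categorical column whose levels were given an explicit order.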
]] --- ## Data ```r library(tidyverse) library(openintro) loans <- loans_full_schema %>% select(loan_amount, interest_rate, term, grade, state, annual_income, homeownership, debt_to_income) glimpse(loans) ``` ``` ## Rows: 10,000 ## Columns: 8 ## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, … ## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72,… ## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36,… ## $ grade <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, … ## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL,… ## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 3400… ## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, … ## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46,… ``` --- class: middle # Bar plot --- ## Bar plot ```r ggplot(loans, aes(x = grade)) + geom_bar() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-32-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Don't do Pie Charts ```r ggplot(loans, aes(x = 1, fill = grade)) + geom_bar(position = "fill") + coord_polar("y", start = 0) + theme_void() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-33-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Segmented bar plot ```r ggplot(loans, aes(x = homeownership, * fill = grade)) + geom_bar() ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Segmented bar plot ```r ggplot(loans, aes(x = homeownership, fill = grade)) + * geom_bar(position = "fill") ``` <img src="03.dataviz-2_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" /> --- .row[.col-7[ .question[ Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade? 
]]] .pull-left[ <img src="03.dataviz-2_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="03.dataviz-2_files/figure-html/unnamed-chunk-37-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Customizing bar plots .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-38-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r *ggplot(loans, aes(y = homeownership, fill = grade)) + geom_bar(position = "fill") + * labs( * x = "Proportion", * y = "Homeownership", * fill = "Grade", * title = "Grades of Lending Club loans", * subtitle = "and homeownership of lendee" * ) ``` ] ] --- ## Customizing bar plots .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-39-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ```r ggplot(loans, aes(y = homeownership, * fill = fct_rev(grade))) + * geom_bar(position = "dodge2") + * labs( x = "Count", y = "Homeownership", * fill = "Grade", title = "Grades of Lending Club loans", subtitle = "and homeownership of lendee" ) + * scale_fill_ordinal(guide=(guide_legend(reverse=T))) + * theme_minimal() ``` ]] --- class: middle # Relationships between numerical and categorical variables --- ## Already talked about... 
- Colouring and faceting histograms and density plots - Side-by-side box plots --- ## Cleveland Dot Plot .panelset[ .panel[.panel-name[Plot] <img src="03.dataviz-2_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] .small[ ```r loans %>% group_by(grade, homeownership) %>% summarise(m_loan = mean(loan_amount)) %>% ggplot(aes(x = m_loan, y= fct_rev(grade))) + geom_line(aes(group = grade), alpha=0.3) + geom_point(aes(color = homeownership), size=3) + labs( x = "Average Loan Amount", y = "Grade", fill = "Grade", title = "Average Loan Amount & Grades of Lending Club Loans", subtitle = "and homeownership of lendee" ) + theme_minimal() + scale_color_discrete(guide=(guide_legend(reverse=T))) + theme( panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), legend.title = element_blank(), legend.justification = c(0, 1), legend.position = c(.1, 1.075), legend.background = element_blank(), legend.direction="horizontal", plot.title = element_text(size = 20, margin = margin(b = 10)), plot.subtitle = element_text(size = 16, color = "darkslategrey", margin = margin(b = 25)), plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0)) ``` ]]]
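
---

## Computing the group means

The `group_by()` and `summarise()` step in the Cleveland dot plot code reduces the loans to one mean loan amount per combination of grade and homeownership. The sketch below illustrates that reduction under two assumptions: the data frame `toy` is made up (hypothetical values, not the Lending Club data), and base R's `aggregate()` stands in for the dplyr calls so that the example runs without extra packages.

```r
# Hypothetical toy data: six loans, two grades, two homeownership levels
toy <- data.frame(
  grade         = c("A", "A", "A", "B", "B", "B"),
  homeownership = c("RENT", "RENT", "OWN", "RENT", "OWN", "OWN"),
  loan_amount   = c(10000, 20000, 12000, 8000, 16000, 24000)
)

# One row per (grade, homeownership) pair, holding the mean loan amount.
# This plays the role of group_by(grade, homeownership) followed by
# summarise(m_loan = mean(loan_amount)).
m_loan <- aggregate(loan_amount ~ grade + homeownership, data = toy, FUN = mean)

m_loan
```

Each row of `m_loan` would become one point in the dot plot, and the points sharing a grade are the ones joined by a line.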