class: middle, title-slide

# Introduction to Statistics
## and Data Science
### Dennis A. V. Dittrich
### 2022

---
layout: true

<div class="my-footer">
<span><img src="img/tcb-logo.png" height="40px"></span>
</div>

---

## What Is Statistics?

.row[.col-8[
**Statistics (the discipline)** is a way of reasoning, along with a collection of tools and methods for extracting useful information from a data set, designed to help us understand the world.

**Statistics (plural)** are particular calculations made from data.

**Data** are values with a context.

To do good statistical analysis, you must:

1. Find the right data.
2. Use the appropriate statistical tools.
3. Clearly communicate the numerical information in written language.
]]

---

## What Is Statistics Really About?

.pull-left[
* Statistics is about variation.
* People have different opinions about important issues. It can be important to see how their answers vary.
* When we take measurements in an experiment, we expect individuals to be slightly different. How much difference is simply due to random variation? And when is a difference so large that we believe something other than random variation is at work?
]

.pull-right[
**Data Analysis** is the process of examining collected data to look for patterns or numerical indicators that capture the essence of what the data are telling us.
]

---

## Data Science and Business Analytics:<br/>The Changing Face Of Statistics

.pull-left-wide[
* Use statistical methods to analyze and explore data to uncover unforeseen relationships.
* Use management science methods to develop optimization models that impact an organization’s strategy, planning, and operations.
* Use information systems methods to collect and process data sets of all sizes, including very large data sets that would otherwise be hard to examine efficiently.
]

---

## Business Analytics Is Applied In Many Business Decision-Making Contexts

.pull-left-wide[
* HR managers understanding relationships between HR drivers, key business outcomes, employee skills, capabilities, and motivation.
* Financial analysts determining why certain trends occur to predict future financial environments.
* Marketers shaping loyalty programs and customer marketing decisions to drive sales.
* Supply chain managers planning and forecasting based on product distribution and optimizing sales distribution based on key inventory measures.
* Studies show an increase in productivity, innovation, and competitiveness for organizations that embrace business analytics.
]

---

## Data science

.pull-left-wide[
- Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
- We're going to learn to do this in a `tidy` way -- more on that later!
- This is an introductory course on data science, with an emphasis on statistical thinking.
]

---
class: middle

# Software

---

<img src="img/excel.png" width="75%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/emacs.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---
class: middle

# Data science life cycle

---

<img src="img/data-science-cycle/data-science-cycle.001.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.002.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.003.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.004.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.005.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.006.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.007.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="img/data-science-cycle/data-science-cycle.008.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---
class: middle

# Let's dive in!

---

## Course toolkit

### Doing data science and statistical analyses

.pull-left[
Reproducible Statistical Analyses:

- R
- RStudio
- tidyverse
- R Markdown
]

---
class: col7-slide

## Learning goals

By the end of the course, you will be able to...
--

- gain insight from data

--

- gain insight from data, **reproducibly**

--

- gain insight from data, reproducibly, **using modern statistical tools and techniques**

--

- gain insight from data, reproducibly **(with literate programming)**, using modern statistical tools and techniques

---
class: middle

# Reproducible data analysis

---
class: col7-slide

## Reproducibility checklist

.question[
What does it mean for a data analysis to be "reproducible"?
]

--

Near-term goals:

- Are the tables and figures reproducible from the commands and data?
- Do the commands actually do what you think they do?
- In addition to what was done, is it clear *why* it was done?

Long-term goals:

- Can the commands be used for other data?
- Can you extend the commands to do other things?

---
class: col7-slide

## Toolkit for reproducibility

- Scriptability `\(\rightarrow\)` R
- Literate programming (commands, narrative, output in one place) `\(\rightarrow\)` R Markdown
- .gray[Version control `\(\rightarrow\)` Git / GitHub ]

---
class: middle

# R and RStudio

---

## R and RStudio

.row[.col-6[
<img src="img/r-logo.png" width="25%" style="display: block; margin: auto;" />
]
.col-6[
<img src="img/rstudio-logo.png" width="50%" style="display: block; margin: auto;" />
]]

.row[.col-6[
- R is an open-source statistical **programming language**
- R is also an environment for statistical computing and graphics
- It's easily extensible with *packages*
]
.col-6[
- RStudio is a convenient interface for R called an **IDE** (integrated development environment), e.g. *"I write R code in the RStudio IDE"*
- RStudio is not a requirement for data analysis with R, but it's very commonly used by R users and data scientists
]]

---

## R is widely used in industry and academia

.pull-left[
* Facebook - For behavior analysis related to status updates and profile pictures.
* Google - For advertising effectiveness and economic forecasting.
* Twitter - For data visualization and semantic clustering
* New York Times - For data visualization
* Microsoft, IBM, HP
* Uber
* Airbnb
* Novartis
* Roche
* McKinsey, BCG, Bain
]

.pull-right[
* American Express
* Bank of America
* Barclays Bank
* Bharti Axa Insurance
* Blackrock
* Citibank
* HSBC, RBS, UBS
* JP Morgan
* Lloyds Banking
* Wells Fargo
* Goldman Sachs
* Morgan Stanley
]

---
class: col7-slide

## R packages

.pull-left-wide[
- **Packages** are the fundamental units of reproducible R commands. They include reusable R functions, the documentation that describes how to use them, and sample data<sup>1</sup>
- As of February 2022, there are over 18,800 R packages available on **CRAN** (the Comprehensive R Archive Network)<sup>2</sup>
- We're going to work with a small (but important) subset of these!
]

.pull-right-narrow[
.footnote[
<sup>1</sup> Wickham and Bryan, [R Packages](https://r-pkgs.org/).

<sup>2</sup> [CRAN contributed packages](https://cran.r-project.org/web/packages/).
]
]

---

## Tour: R and RStudio

<img src="img/tour-r-rstudio.png" width="80%" style="display: block; margin: auto;" />

---

## A short list (for now) of R essentials

.pull-left[
**Functions** are (most often) verbs, followed by what they will be applied to in parentheses:

```r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

**Packages** are installed once with the `install.packages` function and loaded with the `library` function, once per session:

```r
install.packages("package_name")  # install once
library(package_name)             # load once per session
```
]

.pull-right[
Object **documentation** can be accessed with `?`

```r
?mean
```
]

---

## tidyverse

.pull-left[
<img src="img/tidyverse.png" width="99%" style="display: block; margin: auto;" />
]

.pull-right[
.center[.large[
[tidyverse.org](https://www.tidyverse.org/)
]]

The **tidyverse** is an opinionated collection of R packages designed for data science.

All packages share an underlying philosophy and a common grammar.
]

---

## rmarkdown

.pull-left[
.center[.large[
[rmarkdown.rstudio.com](https://rmarkdown.rstudio.com/)
]]

**rmarkdown** and the various packages that support it enable R users to write their commands and prose in reproducible computational documents.

We will generally refer to R Markdown documents (with `.Rmd` extension), e.g.
*"Do this in your R Markdown document"* and rarely discuss loading the rmarkdown package
]

.pull-right[
<img src="img/rmarkdown.png" width="60%" style="display: block; margin: auto;" />
]

---
class: middle

# R Markdown

---

## R Markdown

.pull-left-wide[
- Fully reproducible reports -- each time you knit, the analysis is run from the beginning
- Simple markdown syntax for text
- R commands go in chunks, delimited by three backticks; narrative goes outside of chunks
]

---

## Tour: R Markdown

<img src="img/tour-rmarkdown.png" width="90%" style="display: block; margin: auto;" />

---

## R Markdown help

.pull-left[
.center[
.midi[R Markdown Cheat Sheet `Help -> Cheatsheets`]
]
<img src="img/rmd-cheatsheet.png" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
.center[
.midi[Markdown Quick Reference `Help -> Markdown Quick Reference`]
]
<img src="img/md-cheatsheet.png" width="80%" style="display: block; margin: auto;" />
]

---

## How will we use R Markdown?

.pull-left-wide[
- Every assignment / report / project / etc. is an R Markdown document
- You'll always have a template R Markdown document to start with
- The amount of scaffolding in the template will decrease over the semester
]

---

.your-turn[

## Your turn: `Ex 01 - unvotes`

.row[.col-7[
- Go to RStudio Cloud and start Ex 01 - unvotes.
- In the Files pane (bottom right corner), spot the file called unvotes.Rmd.
- Open it and click "Knit".
- Then...
- Go back to the file and change your name at the top (in the YAML -- we'll talk about what this means later) and knit again.
- Change the country names to those you're interested in. Spelling and capitalization must match how the countries appear in the data, so take a peek at the Appendix to confirm spelling.
- Knit again. Voilà, your first data visualization!
]]]
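
---

## Sketch: anatomy of an R Markdown file

A minimal sketch of what an R Markdown file contains, assuming a hypothetical report. This is not the actual `unvotes.Rmd` from the exercise; the title, author, and chunk label below are placeholders.

````markdown
---
title: "My first report"
author: "Your Name"
output: html_document
---

Narrative text goes here, outside of any chunk.

```{r example-chunk}
# R commands go inside the chunk
x <- c(2, 4, 6, 8)
mean(x)
```
````

Changing the `author` field in the YAML header and knitting again reruns the whole document from the top.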