class: middle, title-slide

# Inference I
## Estimation of Population Parameters and their Margin of Error - Continuous Data
### Dennis A. V. Dittrich
### 2021

---
layout: true

<div class="my-footer">
<span><img src="img/tcb-logo.png" height="40px"></span>
</div>

---

## Statistical Inference and Estimation

.row[.col-7[

**Inference** is the process of using sample data to make conclusions about the underlying population the sample came from.

So far we have done lots of **estimation** (mean, median, etc.), i.e., we have

- used data from samples to calculate sample statistics,
- which can then be used as estimates for population parameters.

]]

---

<br/>

.row[.col-7[
.question[
If you want to catch a fish, do you prefer a spear or a net?
]
]]

<br>

.pull-left[
<img src="img/spear.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/net.png" width="80%" style="display: block; margin: auto;" />
]

---

.row[.col-7[
.question[
If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?
]

<br>

- If we report a **point estimate**, we probably won't hit the exact population parameter.
- If we report a **range of plausible values**, we have a good shot at capturing the parameter.

]]

---

## Population and Random Samples
### Parameter and Sample Statistics

.row[.col-7[

**Parameter**

A measure computed from the entire **population**. As long as the population does not change, the value of the parameter will not change.

**Random sampling**

1. Every member of the population must have an equal probability of being sampled; and
2. all members of the sample must be chosen independently.

]
.col-5[

**Sampling distribution**

The probability distribution of a sample statistic that is formed when random samples of size `\(n\)` are repeatedly taken from a population.

The sampling distribution of the sample mean is the distribution created by the means of many samples.

]]

---

## Sampling Distribution Simulation

.col-7[

In simulation, unlike in real-life research,

1. we specify `\(\mu\)`, `\(\sigma\)`, and the shape of the population; and
2. we take many samples.

We use the sample statistics `\(M\)` and `\(s\)` as point estimates of the population parameters `\(\mu\)` and `\(\sigma\)`.
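A minimal simulation sketch in R (illustrative only; the variable names and settings here are assumptions, not the code behind the plots that follow):

```r
# Draw many samples from a normal population with known
# mu and sigma, and collect the sample means
set.seed(42)
mu <- 0; sigma <- 1; n <- 30; reps <- 5000

sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)  # close to mu
sd(sample_means)    # close to sigma / sqrt(n), the standard error
```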
]

---

## Sampling Distribution Simulation
### Normally distributed population

.row[.col-6[
<img src="07.inference-1_files/figure-html/stdnormal-plot-1.png" width="100%" style="display: block; margin: auto;" />
<img src="07.inference-1_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]
.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## Sampling Distribution Simulation
### Normally distributed population

.row[.col-6[
Take an infinite number of samples: the distribution of their means is the **theoretical sampling distribution** of the sample mean.
]
.col-6[
We can compare the theoretical sampling distribution of the sample mean with the empirical sampling distribution of the sample mean.
]]

.row[.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## The Standard Error

.row[.col-6[
The difference between a measure computed from a sample (a statistic) and the corresponding measure computed from the population (a parameter) is the **sampling error**.

* The size of the sampling error depends on which sample is selected.
* The sampling error may be positive or negative.
* There is potentially a different `\(\bar{x}\)` for each sample.
]
.col-6[
The **standard error** is the standard deviation of the sampling distribution of the sample mean.

`$$SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$`

To decrease the standard error we need to increase the sample size `\(N\)`.
]
]

---

## The Standard Error depends on Sample Size

.row[.col-6[
<img src="07.inference-1_files/figure-html/stdnormal-plot-1.png" width="100%" style="display: block; margin: auto;" />
]
.col-6[
<img src="07.inference-1_files/figure-html/s100-dot-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## Properties of Sampling Distributions
## of Sample Means

.row[.col-7[

1. The expectation (mean) of the sample means, `\(\mu_{\bar{x}}\)`, is equal to the population mean `\(\mu\)`.
`$$\mu_{\bar{x}}=\mu$$`
2. The standard deviation of the sample means, `\(\sigma_{\bar{x}}\)`, is equal to the population standard deviation, `\(\sigma\)`, divided by the square root of the sample size, `\(N\)`.
`$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$`
This is called the **standard error of the mean**. The larger the sample, the smaller the standard error of the mean.

]]

---

.row[.col-7[
## The Central Limit Theorem

The central limit theorem states that the sum, or the mean, of a number of independent variables has, approximately, a normal distribution, almost regardless of the distributions of those variables.

If the population itself is normally distributed, the sampling distribution of the sample means is normally distributed for any sample size `\(N\)`.
]
.col-5[
![](img/clt2.png)
]]

.row[.col-7[
## The Central Limit Theorem

If samples of sufficient size ( `\(N \geq 30\)` is often considered sufficient) are drawn from any population with mean `\(\mu\)` and standard deviation `\(\sigma\)`, then the sampling distribution of the sample means approximates a normal distribution. The greater the sample size, the better the approximation.
]
.col-5[
![](img/clt1.png)
]]

---
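## The Central Limit Theorem in R

.row[.col-7[

A quick numeric illustration (a sketch under arbitrary settings, not the code used for the plots in this deck): sample means from a skewed exponential population are already roughly normal at `\(N = 30\)`.

```r
# Means of 5000 samples of size 30 from an exponential
# population (population mean 1, population sd 1)
set.seed(42)
means <- replicate(5000, mean(rexp(30, rate = 1)))

mean(means)  # close to the population mean, 1
sd(means)    # close to sigma / sqrt(N) = 1 / sqrt(30)
hist(means)  # roughly bell-shaped despite the skewed population
```

]]

---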
## The Central Limit Theorem

.row[.col-7[

In either case, the sampling distribution of sample means has a mean equal to the population mean.

`$$\text{Mean of the sample means: } \mu_{\bar{x}} = \mu$$`

The sampling distribution of sample means has a variance equal to `\(1/n\)` times the variance of the population and a standard deviation equal to the population standard deviation divided by the square root of `\(n\)`.
]
.col-5[
**Variance of the sample means**

`$$\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$`

**Standard deviation of the sample means** (standard error of the mean)

`$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$`
]]

---

## Sampling Distribution Simulation
### Uniformly distributed population

.row[.col-6[
<img src="07.inference-1_files/figure-html/unif-plot-1.png" width="100%" style="display: block; margin: auto;" />
<img src="07.inference-1_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />
]
.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## Sampling Distribution Simulation
### Skewed population (exponential distribution)

.row[.col-6[
<img src="07.inference-1_files/figure-html/exp-plot-1.png" width="100%" style="display: block; margin: auto;" />
<img src="07.inference-1_files/figure-html/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" />
]
.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## Expected Value and Variance
## of Continuous Random Variables

.row[
.col-6[
If `\(X\)` is a continuous random variable with pdf `\(f(x)\)`, then the **expected value** (or **mean**) of `\(X\)` is given by

`$$\mu = \mu_X = \int_{-\infty}^\infty x\cdot f(x)\,dx$$`

The formula for the expected value of a continuous random variable is the continuous analog of the expected value of a discrete random variable, where instead of summing over all possible values we integrate.
]
.col-6[
For the **variance** of a continuous random variable, the definition is the same as before, only we now integrate to calculate the value:

`\begin{align*} \text{Var}(X)&=\text{E}[X^2]-\mu^2\\&=\int_{-\infty}^\infty x^2\cdot f(x)\,dx - \mu^2 \end{align*}`
]]

---

## Confidence intervals

.row[.col-7[

A plausible range of values for the population parameter is a **confidence interval**.

- In order to construct a confidence interval we need to quantify the variability of our sample statistic
- For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean
- This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean
- Quantifying this requires a measure of how much we would expect the sample mean to vary from sample to sample

]]

---

## Estimation Error and the Margin of Error

.row[.col-6[
The **estimation error**, `\(M-\mu\)`, is the distance between our point estimate based on the sample and the population parameter we are estimating.

The **margin of error** (MoE) is the largest likely estimation error. Choosing "likely" to mean 95%, the MoE is `\(1.96 \times SE\)`.

`\(z_{.95} = 1.96\)`, meaning 95% of the area under the standard normal curve lies between `\(z = -1.96\)` and `\(z = 1.96\)`. Also, `\(z_{.99} = 2.58\)`.
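A quick check of these critical values in R:

```r
qnorm(0.975)  # ~1.96, the z for a 95% interval
qnorm(0.995)  # ~2.58, the z for a 99% interval
```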
]
.col-6[
<img src="07.inference-1_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
<img src="07.inference-1_files/figure-html/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## Confidence Interval

.row[.col-7[
The 95% **confidence interval** (CI) is an interval calculated from sample data; it is one of a notionally infinite sequence of such intervals, 95% of which include the population parameter.
]]

.row[.col-6[
For example, the CI on the sample mean is `\([M - \text{MoE}, M + \text{MoE}]\)`. In the long run, 95% of such intervals include `\(\mu\)`.

For 95% of samples, `\(|M - \mu| < \text{MoE}\)`, meaning that for most samples `\(M\)` is close to `\(\mu\)`. Therefore, in most cases `\(\mu\)` is close to `\(M\)`.
]
.col-6[
If `\(\sigma\)` (the population standard deviation) is known, the 95% CI of the sample estimate `\(M\)` for the population mean `\(\mu\)` is

`$$\left[ M-1.96\times\frac{\sigma}{\sqrt{N}}, M+1.96\times\frac{\sigma}{\sqrt{N}} \right]$$`
]
]

---

## Level of Confidence

.row[.col-7[
The **level of confidence**, or **confidence level**, is the 95 in "95% CI". It specifies how confident we can be that our CI includes the population parameter.
]]

.row[.col-7[
More generally, for the C% CI of a normally distributed sample estimate we would use `\(z_{C/100}\)`, and write the C% CI, if the population standard deviation `\(\sigma\)` is known, as:
]
.col-5[
For larger C the CI is longer; for smaller C it is shorter. A 99% CI is longer than a 95% CI.
]]

.row[.col-7[
`$$\left[ M-z_{C/100}\times\frac{\sigma}{\sqrt{N}}, M+z_{C/100}\times\frac{\sigma}{\sqrt{N}} \right]$$`

Therefore, if `\(\sigma\)` is known:

`$$\text{MoE}_C = z_{C/100}\times\frac{\sigma}{\sqrt{N}}$$`
]]

---

.row[.col-5[
## Confidence Interval

For any CI, bear in mind that it might be one of the intervals that doesn't capture the true population parameter, although in real life we'll never know.
]
.col-7[
<img src="07.inference-1_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />
]]

---

## The t Distribution

.row[.col-7[
Recall the `\(z\)` score of a single value `\(X\)`, which tells us where that value falls in its distribution
].col-5[
`$$z=\frac{X-\mu}{\sigma}$$`
]]

.row[.col-7[
For a sample mean `\(M\)`, the `\(z\)` score we want refers to the sampling distribution curve, which is a normal distribution with mean `\(\mu\)` and SD of `\(SE = \frac{\sigma}{\sqrt{N}}\)`
]
.col-5[
`$$z=\frac{M-\mu}{\sigma/\sqrt{N}}$$`
]]

.row[.col-7[
If we don't know `\(\sigma\)`, we have to estimate it from the data. Rather than using the `\(z\)` score for our sample, which needs `\(\sigma\)`, we define a value of `\(t\)` instead. `\(t\)` follows a **Student's t distribution** with `\(df=N-1\)` degrees of freedom.
]
.col-5[
`$$t=\frac{M-\mu}{s/\sqrt{N}}$$`
]]

---

## Student's t Distribution

.col-7[
<img src="07.inference-1_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Degrees of Freedom

.row[.col-7[
The standard deviation, `\(s\)`, is given by

`$$s=\sqrt{\frac{\sum(X_i-M)^2}{N-1}}$$`

The number of **degrees of freedom**, df, is the number of separate, relevant pieces of information that are available.
]
.col-5[
When we know the mean `\(M\)` and the values of `\(N-1\)` observations, we can infer the value of the remaining observation. When we estimate `\(s\)`, there are only `\(N-1\)` pieces of information from our sample of `\(N\)` left "free."
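In R, we can compare `\(t\)` and `\(z\)` critical values (a sketch; `\(N = 15\)` is an arbitrary example):

```r
qt(0.975, df = 14)  # ~2.14, t critical value for N = 15
qnorm(0.975)        # ~1.96, the z it approaches as N grows
```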
]
]

---

## CIs When `\(\sigma\)` is not Known

.row[.col-7[

Almost always in practice we don't know `\(\sigma\)`, so we need to use `\(s\)`, the SD of our sample. To calculate CIs and MoE we replace `\(z_{C/100}\)` with `\(t_{C/100}(df)\)` and `\(\sigma\)` with `\(s\)`.

Making those substitutions gives us the C% CI when `\(\sigma\)` is not known:

`$$\left[ M-t_{C/100}(df)\times\frac{s}{\sqrt{N}}, M+t_{C/100}(df)\times\frac{s}{\sqrt{N}} \right]$$`

Therefore, if `\(\sigma\)` is not known:

`$$\text{MoE}_C = t_{C/100}(df)\times\frac{s}{\sqrt{N}}$$`

]]

---

## Effect Sizes

.row[.col-7[
**Effect size** (ES) is the amount of anything that's of research interest.
]]

.row[.col-7[
A **population effect size** is the true value of an effect in the population.

A **sample effect size**, or effect size estimate, is calculated from sample data.
]
.col-5[
We calculate the sample effect size from our data, and use this as our estimate of the population effect size, which is typically what we would like to know.
]]

---

## Interpreting Effect Sizes and Confidence Intervals

.row[.col-7[
* Our CI is one from the dance, a notionally infinite sequence of repeats of the experiment. Most likely it captures the parameter we're estimating, but "It might be red!" If N is less than about 6 to 8, CI length may be a very poor indicator of precision and of the width of the dance. Such a CI may mislead.
* The density plot shows how plausibility is greatest near the center of the CI and decreases smoothly towards and beyond either limit. The CI includes the values that are most plausible for `\(\mu\)`. The lower limit (LL) of our interval is a likely lower bound for `\(\mu\)`, and the upper limit (UL) a likely upper bound.
]
.col-5[
* The margin of error gives the precision: the MoE of a 95% CI indicates precision, and is the maximum likely estimation error.
* Our CI gives useful information about replication. A replication mean is the mean obtained in a close replication.
]]

---

## Bootstrapping

.row[.col-7[
- _"pulling oneself up by one's bootstraps"_: accomplishing an impossible task without any outside help
- **Impossible task:** estimating a population parameter using data from only the given sample (without making additional assumptions)
- **Note:** the notion of saying something about a population parameter using only information from an observed sample is the essence of statistical inference
]
.col-5[
.huge[
🥾
]
]]

---

## Bootstrapping scheme

.row[.col-7[

1. Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, variance, etc. computed on the bootstrap samples
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
4. Calculate the bounds of the C% confidence interval as the middle C% of the bootstrap distribution

]]

---

class: middle

# Bootstrapping in `R`

---

## Rent in Edinburgh

.col-7[
.question[
Take a guess! How much does a typical 3 BR flat in Edinburgh rent for?
]
]

---

## Sample

.row[.col-7[

Fifteen 3 BR flats in Edinburgh were randomly selected on rightmove.co.uk.
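Printing the data frame gives a preview (assuming the data are stored in `edi_3br`, the name used in the code later in this deck):

```r
edi_3br %>%
  print(n = 6)  # show the first six flats
```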
```
## # A tibble: 15 x 4
##   flat_id  rent title                 address                      
##   <chr>   <dbl> <chr>                 <chr>                        
## 1 flat_01   825 3 bedroom apartmen…   Burnhead Grove, Edinburgh, M…
## 2 flat_02  2400 3 bedroom flat to …   Simpson Loan, Quartermile, E…
## 3 flat_03  1900 3 bedroom flat to …   FETTES ROW, NEW TOWN, EH3 6SE
## 4 flat_04  1500 3 bedroom apartmen…   Eyre Crescent, Edinburgh, Mi…
## 5 flat_05  3250 3 bedroom flat to …   Walker Street, Edinburgh     
## 6 flat_06  2145 3 bedroom flat to …   George Street, City Centre, …
## # … with 9 more rows
```

]]

---

## Observed sample

<img src="07.inference-1_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" />

---

## Observed sample

Sample mean ≈ £1895 😱

<br>

<img src="img/rent-bootsamp.png" width="90%" style="display: block; margin: auto;" />

---

## Bootstrap population

.col-7[
Generated assuming there are more flats like the ones in the observed sample...

Population mean = ❓
]

<img src="img/rent-bootpop.png" width="65%" style="display: block; margin: auto;" />

---

## Bootstrapping scheme

.row[.col-7[

1. Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
4. Calculate the bounds of the C% confidence interval as the middle C% of the bootstrap distribution

]]

---

class: middle

# Bootstrapping with tidymodels

---

## Generate bootstrap means

.row[.col-7[

```r
edi_3br %>%
  # specify the variable of interest
  specify(response = rent)
```

]
.col-5[

**specify the variable of interest**

]
]

---

## Generate bootstrap means

.row[.col-7[

```r
edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 5000 bootstrap samples
  generate(reps = 5000, type = "bootstrap")
```

]
.col-5[

specify the variable of interest

**generate bootstrap samples**

]
]

---

## Generate bootstrap means

.row[.col-7[

```r
edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 5000 bootstrap samples
  generate(reps = 5000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
```

]
.col-5[

specify the variable of interest

generate bootstrap samples

**calculate the mean of each bootstrap sample**

]
]

---

## Generate bootstrap means

.row[.col-7[

```r
# save resulting bootstrap distribution
boot_df <- edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 5000 bootstrap samples
  generate(reps = 5000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
```

]
.col-5[

specify the variable of interest

generate bootstrap samples

calculate the mean of each bootstrap sample

**save resulting bootstrap distribution**

]
]

---

## The bootstrap sample

.row[.col-7[
.question[
How many observations are there in `boot_df`? What does each observation represent?
]

.midi[

```r
boot_df
```

```
## # A tibble: 5,000 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1 1793.
## 2         2 1938.
## 3         3 2175 
## 4         4 2159.
## 5         5 2084 
## 6         6 1761 
## # … with 4,994 more rows
```

]
]]

---

## Visualize the bootstrap distribution

.row[.col-6[

```r
ggplot(data = boot_df, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 100) +
  labs(title = "Bootstrap distribution of means")
```

<img src="07.inference-1_files/figure-html/unnamed-chunk-27-1.png" width="100%" style="display: block; margin: auto;" />

]
.col-6[

```r
boot_df %>%
  visualize()
```

<img src="07.inference-1_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />

]
]

---

## Calculate the confidence interval

.row[.col-6[

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

```r
boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 1605. 2215.
```

]
.col-6[

#### Percentile method

```r
boot_df %>%
  get_ci(level = 0.95)
```

```
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1605.    2215.
```

#### Standard error method

```r
boot_df %>%
  get_ci(level = 0.95,
         point_estimate = sample_mean,
         type = "se")
```

```
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1587.    2203.
```

#### Bias-corrected method

```r
boot_df %>%
  get_ci(level = 0.95,
         point_estimate = sample_mean,
         type = "bias-corrected")
```

```
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    1621.    2235.
```

]]

---

## Visualize the confidence interval

.row[.col-6[

```r
boot_df %>%
  ggplot(mapping = aes(x = stat)) +
  geom_histogram(binwidth = 100) +
  geom_vline(xintercept = c(lower_bound, upper_bound),
             color = "#A7D5E8", size = 2) +
  labs(title = "Bootstrap distribution of means",
       subtitle = "and 95% confidence interval")
```

<img src="07.inference-1_files/figure-html/unnamed-chunk-35-1.png" width="80%" style="display: block; margin: auto;" />

]
.col-6[

```r
percentile_ci <- boot_df %>%
  get_ci(level = 0.95)

boot_df %>%
  visualize() +
  shade_confidence_interval(endpoints = percentile_ci)
```

<img src="07.inference-1_files/figure-html/unnamed-chunk-36-1.png" width="80%" style="display: block; margin: auto;" />

]
]

---

## Interpret the confidence interval

.row[.col-7[
.question[
The 95% confidence interval for the mean rent of three bedroom flats in Edinburgh was calculated as (1605, 2215). Which of the following is the correct interpretation of this interval?

**(a)** 95% of the time the mean rent of three bedroom flats in this sample is between £1605 and £2215.

**(b)** 95% of all three bedroom flats in Edinburgh have rents between £1605 and £2215.

**(c)** We are 95% confident that the mean rent of all three bedroom flats is between £1605 and £2215.

**(d)** We are 95% confident that the mean rent of three bedroom flats in this sample is between £1605 and £2215.
]
]]

---

class: middle

# Accuracy vs. precision

---

## Confidence level

.row[.col-7[

**We are 95% confident that ...**

- Suppose we took many samples from the original population and built a 95% confidence interval based on each sample.
- Then about 95% of those intervals would contain the true population parameter.

]]

---

## Commonly used confidence levels

.row[.col-7[
.question[
Which line (orange dash/dot, blue dash, green dot) represents which confidence level (90%, 95%, 99%)?
]

<img src="07.inference-1_files/figure-html/unnamed-chunk-37-1.png" width="60%" style="display: block; margin: auto;" />
]]

---

## Precision vs. accuracy

.row[.col-7[
.question[
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval?
What drawbacks are associated with using a wider interval?
]
]]

<img src="img/garfield.png" width="60%" style="display: block; margin: auto;" />

.row[.col-7[
.question[
How can we get the best of both worlds -- high precision and high accuracy?
]
]]

---

## Changing confidence level

.row[.col-7[
.question[
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
]
]]

.row[.col-7[

```r
edi_3br %>%
  specify(response = rent) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

]
]

---

## Recap

.row[.col-7[

- Sample statistic `\(\ne\)` population parameter, but if the sample is good, it can be a good estimate
- We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population
- Since we can't continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability
- We can do this for any sample statistic, as in the sketch on the next slide:
    - For a mean: `calculate(stat = "mean")`
    - For a median: `calculate(stat = "median")`

]]
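---

## Recap in code

.row[.col-7[

A minimal end-to-end sketch (assuming the `edi_3br` data and the tidymodels functions used above; the seed is arbitrary): a 95% bootstrap confidence interval for the median rent.

```r
library(tidymodels)  # provides specify(), generate(), calculate(), get_ci()

set.seed(1234)  # make the resampling reproducible

edi_3br %>%
  specify(response = rent) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "median") %>%
  get_ci(level = 0.95)
```

]]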