class: center, middle, inverse, title-slide

.title[
# An Overview of Linear Regression and Bias-Variance Tradeoff in Predictive Modeling
]

.author[
### Cengiz Zopluoglu
]

.institute[
### College of Education, University of Oregon
]

---

<style>

.blockquote {
  border-left: 5px solid #007935;
  background: #f9f9f9;
  padding: 10px;
  padding-left: 30px;
  margin-left: 16px;
  margin-right: 0;
  border-radius: 0px 4px 4px 0px;
}

#infobox {
  padding: 1em 1em 1em 4em;
  margin-bottom: 10px;
  border: 2px solid black;
  border-radius: 10px;
  background: #E6F6DC 5px center/3em no-repeat;
}

.centering[
  float: center;
]

.left-column2 {
  width: 50%;
  height: 92%;
  float: left;
  padding-top: 1em;
}

.right-column2 {
  width: 50%;
  float: right;
  padding-top: 1em;
}

.remark-code {
  font-size: 18px;
}

.tiny .remark-code { /*Change made here*/
  font-size: 75% !important;
}

.tiny2 .remark-code { /*Change made here*/
  font-size: 50% !important;
}

.indent {
  margin-left: 3em;
}

.single {
  line-height: 1 ;
}

.double {
  line-height: 2 ;
}

.title-slide h1 {
  padding-top: 0px;
  font-size: 40px;
  text-align: center;
  padding-bottom: 18px;
  margin-bottom: 18px;
}

.title-slide h2 {
  font-size: 30px;
  text-align: center;
  padding-top: 0px;
  margin-top: 0px;
}

.title-slide h3 {
  font-size: 30px;
  color: #26272A;
  text-align: center;
  text-shadow: none;
  padding: 10px;
  margin: 10px;
  line-height: 1.2;
}

</style>

### Today's Goals:

- An Overview of Linear Regression

  - Model Description

  - Model Estimation

  - Performance Evaluation

- Understanding the concept of bias-variance tradeoff for predictive models

- How to balance the model bias and variance when building predictive models

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<center>

# An Overview of Linear Regression

---

- Prediction algorithms are classified into two main categories: *supervised* and *unsupervised*.

- **Supervised algorithms** are used when the dataset has an actual outcome of interest to predict (labels), and the goal is to build the "best" model predicting the outcome of interest.

- **Unsupervised algorithms** are used when the dataset doesn't have an outcome of interest. The goal is typically to identify similar groups of observations (rows of data) or similar groups of variables (columns of data) in the data.

- This course will cover several *supervised* algorithms. Linear regression is one of the most straightforward of these algorithms and the easiest to interpret.

---

## Model Description

The linear regression model with `\(P\)` predictors and an outcome variable `\(Y\)` can be written as

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \epsilon$$`

In this model,

- `\(Y\)` represents the observed value for the outcome of an observation,

- `\(X_{p}\)` represents the observed value of the `\(p^{th}\)` variable for the same observation,

- `\(\beta_p\)` is the associated model parameter for the `\(p^{th}\)` variable,

- and `\(\epsilon\)` is the model error (residual) for the observation.

This model includes only the main effects of each predictor.
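---

The notation above maps directly onto R's model formula syntax. Below is a minimal sketch of how a main-effects model like this could be specified with `lm()`; the data frame `dat` and its columns `Y`, `X1`, `X2`, and `X3` are hypothetical placeholders used only for illustration.

```r
# A minimal sketch of fitting a main-effects linear regression in R.
# `dat` is a hypothetical data frame with an outcome Y and predictors X1, X2, X3.
fit <- lm(Y ~ X1 + X2 + X3, data = dat)

coef(fit)    # estimated coefficients: beta_0, beta_1, beta_2, beta_3
resid(fit)   # model errors (residuals), one per observation
```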
---

- The previous model can be easily extended by including quadratic or higher-order polynomial terms for all (or a specific subset of) predictors.

- A model with the first-order, second-order, and third-order polynomial terms for all predictors can be written as

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \sum_{k=1}^{P} \beta_{k+P}X_{k}^2 + \sum_{m=1}^{P} \beta_{m+2P}X_{m}^3 + \epsilon$$`

- Example: A model with only main effects

`$$Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \beta_3X_{3}+ \epsilon.$$`

- Example: A model with polynomial terms up to the 3rd degree added:

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \\ \beta_4X_1^2 + \beta_5X_2^2 + \beta_6X_3^2+ \\ \beta_{7}X_1^3 + \beta_{8}X_2^3 + \beta_{9}X_3^3 + \epsilon$$`

---

- The effect of predictor variables on the outcome variable is sometimes not additive.

- When the effect of one predictor on the response variable depends on the levels of another predictor, non-additive effects (a.k.a. interaction effects) can also be added to the model.

- The interaction effects can be first-order interactions (an interaction between two variables, e.g., `\(X_1*X_2\)`), second-order interactions (e.g., `\(X_1*X_2*X_3\)`), or higher orders.

- For instance, the model below also adds the first-order interactions.

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \sum_{k=1}^{P} \beta_{k+P}X_{k}^2 + \sum_{m=1}^{P} \beta_{m+2P}X_{m}^3 + \sum_{i=1}^{P}\sum_{j=i+1}^{P}\beta_{i,j}X_iX_j + \epsilon$$`

- A model with both interaction terms and polynomial terms up to the 3rd degree added:

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \\ \beta_4X_1^2 + \beta_5X_2^2 + \beta_6X_3^2+ \\ \beta_{7}X_1^3 + \beta_{8}X_2^3 + \beta_{9}X_3^3+ \\ \beta_{1,2}X_1X_2+ \beta_{1,3}X_1X_3 + \beta_{2,3}X_2X_3 + \epsilon$$`

---

## Model Estimation

- Suppose that we would like to predict the target readability score for a given text from Feature 220.

- Note that there are 768 features extracted from the NLP model as numerical embeddings. For the sake of simplicity, we will only use one of them (Feature 220).

- Below is a scatterplot showing the relationship between these two variables for a random sample of 20 observations.

<img src="slide3_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

- Consider a simple linear regression model

  - Outcome: the readability score ( `\(Y\)` )

  - Predictor: Feature 220 ( `\(X\)` )

- Our regression model is

`$$Y = \beta_0 + \beta_1X + \epsilon$$`

- The set of coefficients, { `\(\beta_0,\beta_1\)` }, represents a straight line.

- We can choose any set of { `\(\beta_0,\beta_1\)` } coefficients and use it as our model.

- For instance, suppose we guesstimate that these coefficients are { `\(\beta_0,\beta_1\)` } = {-1.5,2}. Then, our model would be

`$$Y = -1.5 + 2X + \epsilon$$`

---

<img src="slide3_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

We can predict the target readability score for any observation in the dataset using this model.

`$$Y_{(1)} = -1.5 + 2X_{(1)} + \epsilon_{(1)}.$$`

`$$\hat{Y}_{(1)} = -1.5 + 2*(-0.139) = -1.778$$`

`$$\hat{\epsilon}_{(1)} = -2.062 - (-1.778) = -0.284$$`

The discrepancy between the observed value and the model prediction is the model error (residual) for the first observation and is captured in the `\(\epsilon_{(1)}\)` term in the model.

<img src="slide3_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
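---

As a quick check, the same calculation for the first observation can be reproduced in R. This is only a sketch that plugs the guesstimated coefficients {-1.5, 2} and the observed values shown on the previous slide into the model equation.

```r
# Guesstimated coefficients
b0 <- -1.5
b1 <-  2

# Observed values for the first observation (from the previous slide)
x1 <- -0.139   # Feature 220
y1 <- -2.062   # target readability score

# Model prediction and residual
y1_hat <- b0 + b1 * x1   # -1.778
e1_hat <- y1 - y1_hat    # -0.284
```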
---

We can do the same thing for the second observation.

`$$Y_{(2)} = -1.5 + 2X_{(2)} + \epsilon_{(2)}.$$`

`$$\hat{Y}_{(2)} = -1.5 + 2*(0.218) = -1.065$$`

`$$\hat{\epsilon}_{(2)} = 0.583 - (-1.065) = 1.648$$`

<img src="slide3_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---

Using a similar approach, we can calculate the model error for every observation.

<img src="slide3_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />

---

.single[

```r
# Predictions and errors for all 20 observations under {beta_0, beta_1} = {-1.5, 2}
d <- readability_sub[,c('V220','target')]

d$predicted <- -1.5 + 2*d$V220
d$error     <- d$target - d$predicted

round(d,3)
```

]

.pull-left[
.single[

```
     V220 target predicted  error
1  -0.139 -2.063    -1.778 -0.285
2   0.218  0.583    -1.065  1.647
3   0.058 -1.653    -1.384 -0.269
4   0.025 -0.874    -1.449  0.576
5   0.224 -1.740    -1.051 -0.689
6  -0.078 -3.640    -1.656 -1.984
7   0.434 -0.623    -0.632  0.009
8  -0.244 -0.344    -1.987  1.643
9   0.159 -1.123    -1.182  0.059
10  0.145 -0.999    -1.210  0.211
11  0.342 -0.877    -0.816 -0.061
12  0.252 -0.033    -0.996  0.963
13  0.035 -0.495    -1.429  0.934
14  0.364  0.125    -0.772  0.896
15  0.300  0.097    -0.900  0.997
16  0.198  0.384    -1.103  1.487
17  0.078 -0.581    -1.344  0.762
18  0.079 -0.343    -1.341  0.998
19  0.570 -0.391    -0.360 -0.031
20  0.345 -0.675    -0.810  0.134
```

]
]

.pull-right[

`$$SSR = \sum_{i=1}^{N}(Y_{(i)} - (\beta_0+\beta_1X_{(i)}))^2$$`

`$$SSR = \sum_{i=1}^{N}(Y_{(i)} - \hat{Y}_{(i)})^2$$`

`$$SSR = \sum_{i=1}^{N}(\epsilon_{(i)})^2$$`

`$$SSR = 17.767$$`

For the set of coefficients { `\(\beta_0,\beta_1\)` } = {-1.5,2}, the SSR is equal to 17.767. Could you find another set of coefficients that would do a better job in terms of prediction (smaller SSR)?

]

---

**Thought Experiment**

- Suppose the potential range for the intercept, `\(\beta_0\)`, is from -10 to 10, and we will consider every single possible value from -10 to 10 with increments of .1.

- Also, suppose the potential range for the slope, `\(\beta_1\)`, is from -5 to 5, and we will consider every single possible value from -5 to 5 with increments of .01.

- Note that every single possible combination of `\(\beta_0\)` and `\(\beta_1\)` indicates a different model.

- How many possible sets of coefficients are there?

- Can we try every single possible set of coefficients and compute the SSR?

---
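One way to answer the last question is brute force: evaluate the SSR for every candidate pair of coefficients on the grid described above and keep the pair with the smallest SSR. Below is a rough sketch of that idea in R. It assumes the `readability_sub` data from the previous slides is loaded; in practice, we would estimate the coefficients directly rather than by exhaustive search.

```r
# Candidate values for the intercept and the slope
b0_grid <- seq(-10, 10, by = 0.1)    # 201 candidate intercepts
b1_grid <- seq(-5, 5, by = 0.01)     # 1001 candidate slopes

# Every possible combination: 201 x 1001 = 201,201 candidate models
grid <- expand.grid(b0 = b0_grid, b1 = b1_grid)

# Compute the SSR for each candidate set of coefficients
grid$SSR <- sapply(1:nrow(grid), function(i) {
  pred <- grid$b0[i] + grid$b1[i] * readability_sub$V220
  sum((readability_sub$target - pred)^2)
})

# The candidate set of coefficients with the smallest SSR
grid[which.min(grid$SSR), ]
```

---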