Multiple linear regression: a linear relationship between a dependent variable Y (response) and a set of predictors

Explanatory Modeling
- Model goal: Fit the data well and understand the contribution of explanatory variables to the model
- Model performance assessed by residual analysis
- Model fitted to the entire dataset

Predictive Modeling
Goal: Predict target values in new data where we have predictor values, but not target values
- Classic data mining context
- Model goal: Optimize predictive accuracy, i.e. how accurately the fitted model can predict new cases
- Model trained on training data and performance assessed on validation data
- Explaining the role of predictors is not the primary purpose (although useful)

Regression Method
- Predict the value of the dependent variable $Y$ based on predictors $X_1, \ldots, X_p$
- Regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ in the equation:
  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$
- Coefficients estimated via the ordinary least squares (OLS) method
- Estimated using the training sample
- Predictive capacity assessed by prediction results on the validation set (average squared error)
- Assumptions: normality, independence, linearity

Example: Prices of Toyota Corolla (ToyotaCorolla.xls)
Goal: Predict sale prices of used Toyota Corollas based on their specifications
Data: Prices of 1442 used Toyota Corollas, with their specification information (age, mileage, fuel type, engine size)
Data sample (showing only the variables to be used in analysis)

Variables Used
- Price in Euros
- Age in months as of 8/04
- KM (kilometers)
- Fuel Type (diesel, petrol, CNG)
- HP (horsepower)
- Metallic color (1=yes, 0=no)
- Automatic transmission (1=yes, 0=no)
- CC (cylinder volume)
- Quarterly_Tax (road tax)
- Weight (in kg)

Preprocessing
Fuel type is categorical and must be transformed into binary variables:
- Diesel (1=yes, 0=no)
- CNG (1=yes, 0=no)
- None needed for "Petrol" (reference category)

Subset of the records selected for the training partition (limited number of variables shown)
- 60% training data / 40% validation data
- Multiple linear regression model fitted using ONLY the training data

The Fitted Regression Model (XLMiner output)

Predicted Values
- Predicted price computed using the regression coefficients
- Residual = difference between actual and predicted prices

Error Reports
- Error for the validation set is usually larger than that of the training set (as expected)

Distribution of Residuals
- Symmetric distribution
- Some outliers
- 50% of errors between

Selecting Subsets of Predictors
Goal: Find a parsimonious model (the simplest model that performs sufficiently well)
- It may be expensive or impossible to measure all predictors for future predictions
- Simpler models are more robust
- Multicollinearity can lead to unstable regression coefficients, and hence to increased variation in predictions and lower predictive accuracy
- Sometimes dropping correlated predictors increases bias (average error)
- Trade-off between too few and too many predictors: the bias-variance trade-off

Variable Selection Methods
Use domain knowledge. Some practical considerations:
- Expense of collecting future data on predictors
- Missing values and inaccurate measurements
- Relevance to the problem at hand
- High correlations
Two primary methods:
- Exhaustive search
- Partial search algorithms
  - Forward selection
  - Backward elimination
  - Stepwise regression

Exhaustive Search
- All possible subsets of predictors assessed (single, pairs, triplets, etc.)
- Computationally intensive
- Judge by "adjusted R²" (R² is the proportion of the variation in Y explained by the model): the higher the better
- Mallows' Cp: value near p+1
- Small p
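The exhaustive search criteria above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the XLMiner procedure: the data, the column names x0..x3, and the helper names `sse` and `exhaustive_search` are assumptions made for the example. It uses the usual formulas, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1) and Cp = SSE_p / s² - n + 2(p + 1), where p is the number of predictors in the subset and s² is the residual variance of the full model.

```python
import itertools
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def exhaustive_search(X, y, names):
    """Score every non-empty subset of predictors by adjusted R^2 and Mallows' Cp."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    sigma2_full = sse(X, y) / (n - k - 1)  # error variance estimated from the full model
    rows = []
    for p in range(1, k + 1):
        for cols in itertools.combinations(range(k), p):
            sse_p = sse(X[:, list(cols)], y)
            r2 = 1 - sse_p / sst
            adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
            cp = sse_p / sigma2_full - n + 2 * (p + 1)
            rows.append({"predictors": [names[c] for c in cols],
                         "adj_r2": adj_r2, "cp": cp})
    return sorted(rows, key=lambda r: r["adj_r2"], reverse=True)

# Synthetic data (assumption): only x0 and x2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
names = ["x0", "x1", "x2", "x3"]

results = exhaustive_search(X, y, names)
for r in results[:5]:
    print(r["predictors"], round(r["adj_r2"], 4), round(r["cp"], 2))
```

With k predictors the loop fits 2^k - 1 models, which is why the method becomes computationally intensive as k grows. The full model always has Cp equal to k + 1 under this definition, so Cp is only informative for the smaller subsets.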
Partial Search Algorithms
- Popular methods of finding the best subset of predictors
- Rely on a partial, iterative search through the space of all possible regression models
- End product: best subset
- Computationally cheaper, but can potentially "miss" good combinations

Forward Selection
- Start with no predictors
- Add them one by one (add the one with the largest contribution to R² on top of the predictors already present)
- Stop when the addition is not statistically significant (large p-value)
- Drawback: May miss pairs or groups of predictors that perform well together but perform poorly as single predictors

Backward Elimination
- Start with all predictors
- Successively eliminate the least useful predictors one by one according to statistical significance (largest p-values)
- Stop when all remaining predictors have a statistically significant contribution (low p-values)
- Drawback: Computing the initial model with all predictors can be time consuming and unstable

Stepwise
- Like forward selection, except that at each step predictors that are no longer statistically significant are also considered for removal

Backward Elimination (showing last 7 models)
- First model has a single predictor (Age_08_04)
- Second model has two predictors, etc.

All 12 Models

Diagnostics for the 12 Models
- Good model has: high adj-R², Cp near # of predictors + 1
- Good predictors: Petrol, Weight, Age, HP, Quarterly_Tax, Mileage
- Least useful: Doors, CC, Diesel, Metallic, Automatic

Model with Only 6 Predictors
- Model fit
- Predictive performance: same as the 12-predictor model!
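As a rough sketch of backward elimination, the loop below repeatedly drops the weakest predictor until all that remain look significant. It is an illustration under stated assumptions, not the XLMiner implementation: the data are synthetic, the names `t_statistics` and `backward_elimination` are made up for the example, and the stopping rule uses |t| >= 2 as a stand-in for a p-value cutoff of roughly 0.05. Within a single fitted model the predictor with the smallest |t| is also the one with the largest p-value, so the elimination order matches the rule on the slide.

```python
import numpy as np

def t_statistics(X, y):
    """t-statistics of the slope coefficients of an OLS fit with an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = float(resid @ resid) / (n - k - 1)   # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)         # coefficient covariance matrix
    se = np.sqrt(np.diag(cov))
    return (beta / se)[1:]                        # skip the intercept

def backward_elimination(X, y, names, t_threshold=2.0):
    """Drop the least significant predictor until every |t| >= t_threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        t = np.abs(t_statistics(X[:, keep], y))
        worst = int(np.argmin(t))                 # smallest |t| = largest p-value
        if t[worst] >= t_threshold:
            break                                 # all remaining predictors pass
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data (assumption): x0 and x1 matter, x2..x4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)
names = ["x0", "x1", "x2", "x3", "x4"]

selected = backward_elimination(X, y, names)
print(selected)
```

Forward selection is the mirror image: start from an empty list and add the candidate with the largest contribution, stopping when the best remaining candidate fails the same significance check.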
Next Step
- Subset selection methods give candidate models that might be "good models"
- Different variable selection methods may end up with different subsets (for example, the exhaustive search result may differ from the backward elimination result)
- Notion: Stepwise regression is better than both forward and backward selection methods. FALSE!
- The methods do not guarantee that the "best" model is indeed best
- The "best" model can still have insufficient predictive accuracy
- Must run the candidates and assess predictive accuracy

Summary
- Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction
- A good predictive model has high predictive accuracy (to a useful practical level)
- Predictive models are built using a training data set, and evaluated on a separate validation data set
- Removing redundant predictors is key to achieving predictive accuracy and robustness
- Subset selection methods help find "good" candidate models, which should then be run and assessed
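The overall workflow (dummy-code the fuel type, partition 60/40, fit on the training data only, then compare candidate models on the validation data) might look like the sketch below. Everything numeric here is an assumption: the records are randomly generated stand-ins, not the Toyota Corolla data, the price formula is invented, and the candidate subsets are arbitrary. The error measure is the root of the average squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in records (assumption): NOT the real Toyota Corolla data.
age = rng.uniform(1, 80, n)
km = rng.uniform(1_000, 200_000, n)
hp = rng.choice([69, 86, 110], n).astype(float)
fuel = rng.choice(["Petrol", "Diesel", "CNG"], n, p=[0.8, 0.15, 0.05])
price = 20_000 - 120 * age - 0.02 * km + 30 * hp + rng.normal(0, 800, n)

# Dummy coding: two binary columns, with Petrol as the reference category.
columns = {
    "Age": age, "KM": km, "HP": hp,
    "Diesel": (fuel == "Diesel").astype(float),
    "CNG": (fuel == "CNG").astype(float),
}

# 60% training / 40% validation partition.
idx = rng.permutation(n)
n_train = int(0.6 * n)
train, valid = idx[:n_train], idx[n_train:]

def design(names, rows):
    """Design matrix with an intercept column for the given predictors and rows."""
    return np.column_stack([np.ones(len(rows))] + [columns[c][rows] for c in names])

def rmse_train_valid(names):
    """Fit OLS on the training rows only; return (training RMSE, validation RMSE)."""
    beta, *_ = np.linalg.lstsq(design(names, train), price[train], rcond=None)
    def rmse(rows):
        resid = price[rows] - design(names, rows) @ beta   # actual minus predicted
        return float(np.sqrt(np.mean(resid ** 2)))
    return rmse(train), rmse(valid)

# Candidate subsets (assumption), as a subset selection method might propose.
candidates = {
    "all": ["Age", "KM", "HP", "Diesel", "CNG"],
    "no fuel dummies": ["Age", "KM", "HP"],
    "age only": ["Age"],
}
report = {label: rmse_train_valid(names) for label, names in candidates.items()}
for label, (tr, va) in report.items():
    print(f"{label:16s} train RMSE {tr:8.1f}   validation RMSE {va:8.1f}")
```

Ranking the candidates by validation error rather than training error is the point: adding predictors can never raise the training error of an OLS fit, so only the held-out partition shows whether a smaller model predicts just as well.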