Multiple Linear Regression

Multiple linear regression models a linear relationship between a dependent variable Y (the response) and a set of predictors X1, ..., Xp.

Explanatory modeling
- Goal: fit the data well and understand the contribution of the explanatory variables to the model
- Model performance assessed by residual analysis
- Model fitted to the entire dataset

Predictive modeling (the classic data mining context)
- Goal: predict target values in new data where we have predictor values but not target values
- Goal: optimize predictive accuracy – how accurately can the fitted model predict new cases?
- Model trained on training data; performance assessed on holdout data
- Explaining the role of predictors is not the primary purpose (although it can be useful)

Regression Method
- Predict the value of the dependent variable Y based on predictors X1, ..., Xp
- Regression coefficients β0, β1, ..., βp in the equation: Y = β0 + β1X1 + ... + βpXp + ε
- Coefficients estimated via the ordinary least squares (OLS) method
- Estimated using the training sample
- Predictive capacity assessed by prediction results on the validation set – average squared error
- Assumptions: normality, independence, linearity

Example: Prices of Used Toyota Corollas (ToyotaCorolla.xls)
Goal: Predict sale prices of used Toyota Corollas based on their specifications
Data: Prices of 1442 used Toyota Corollas, with their specification information – age, mileage, fuel type, engine size, etc.

Data sample (showing only the variables used in the analysis)

Variables used:
- Price (in euros)
- Age (in months, as of 8/04)
- KM (kilometers driven)
- Fuel_Type (Diesel, Petrol, CNG)
- HP (horsepower)
- Metallic color (1=yes, 0=no)
- Automatic transmission (1=yes, 0=no)
- CC (cylinder volume)
- Quarterly_Tax (road tax)
- Weight (in kg)

Preprocessing
- Fuel_Type is categorical and must be transformed into binary (dummy) variables
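This transformation can be sketched with pandas `get_dummies`; the small data frame and its values below are illustrative stand-ins, not the actual Corolla file:

```python
import pandas as pd

# Toy frame standing in for the Toyota Corolla data (hypothetical values).
df = pd.DataFrame({
    "Price": [13500, 13750, 12950, 14950],
    "Fuel_Type": ["Diesel", "Diesel", "Petrol", "CNG"],
})

# One dummy per category except the reference ("Petrol"),
# which is encoded implicitly as Diesel = 0 and CNG = 0.
dummies = pd.get_dummies(df["Fuel_Type"])
df = pd.concat([df.drop(columns="Fuel_Type"), dummies[["Diesel", "CNG"]]], axis=1)

print(df.columns.tolist())   # ['Price', 'Diesel', 'CNG']
```

Keeping only two of the three dummies avoids perfect collinearity among the fuel-type columns.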
- Diesel (1=yes, 0=no)
- CNG (1=yes, 0=no)
- None needed for "Petrol" (the reference category)

Fitting the model
- A subset of the records is selected for the training partition (limited number of variables shown)
- 60% training data / 40% validation data
- The multiple linear regression model is fitted using ONLY the training data

The Fitted Regression Model (XLMiner output)
- Predicted values: the predicted price is computed from the estimated coefficients
- Residuals = difference between actual and predicted prices

Error reports
- The error for the validation set is usually larger than that for the training set (as expected)

Distribution of Residuals
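The partition-fit-assess workflow can be sketched with numpy's least-squares routine. The data below are synthetic stand-ins for the Corolla records; every coefficient and variable name is illustrative, not from the actual file:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: price driven by age and mileage plus noise (illustrative only).
n = 1000
X = np.column_stack([np.ones(n),                 # intercept column
                     rng.uniform(1, 80, n),      # "Age" in months
                     rng.uniform(1, 240, n)])    # "KM" in thousands
y = 20000 - 170 * X[:, 1] - 20 * X[:, 2] + rng.normal(0, 1200, n)

# 60% training / 40% validation partition.
idx = rng.permutation(n)
train, valid = idx[:600], idx[600:]

# OLS coefficients estimated from the training sample only.
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Residuals = actual - predicted; report the error on each partition.
for name, rows in [("train", train), ("valid", valid)]:
    resid = y[rows] - X[rows] @ beta
    print(name, "RMSE:", round(float(np.sqrt(np.mean(resid ** 2)))))
```

Because the coefficients are tuned to the training rows, the validation error is typically somewhat larger, mirroring the error report described above.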
- Roughly symmetric distribution
- Some outliers
- 50% of the errors fall between ...

Selecting Subsets of Predictors
Goal: find a parsimonious model – the simplest model that performs sufficiently well
- It can be expensive or impossible to measure all predictors for future predictions
- Fewer predictors make the model more robust
- Multicollinearity can lead to unstable regression coefficients, and hence increased variation in predictions and lower predictive accuracy
- On the other hand, dropping correlated predictors can increase bias (average error)
- Trade-off between too few and too many predictors – the bias–variance trade-off

Variable selection methods
- Use domain knowledge; some practical considerations:
  - Expense of collecting future data on the predictors
  - Missing values and inaccurate measurements
  - Relevance to the problem at hand
  - High correlations among predictors
- Two primary methods:
  - Exhaustive search
  - Partial search algorithms: forward selection, backward elimination, stepwise regression

Exhaustive Search
- All possible subsets of predictors are assessed (singles, pairs, triplets, etc.)
- Computationally intensive
- Models judged by adjusted R² (R² is the proportion of variance explained – the higher the better) and by Mallows Cp (look for Cp near p+1, with small p)
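The exhaustive-search idea can be sketched in a few lines of numpy: fit every non-empty subset and rank by adjusted R². The data are synthetic and the predictor names illustrative; Mallows Cp is omitted for brevity:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, names = 200, ["Age", "KM", "HP", "Doors"]

# Synthetic data in which only "Age" and "KM" actually matter (illustrative).
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, n)

def adj_r2(cols):
    """Fit OLS on the chosen columns (plus intercept); return adjusted R^2."""
    Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    ss_res = np.sum((y - Xs @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    p = len(cols)
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

# Assess every non-empty subset of predictors: singles, pairs, triplets, ...
subsets = [c for k in range(1, 5) for c in combinations(range(4), k)]
best = max(subsets, key=adj_r2)
print("best subset:", [names[i] for i in best])
```

With 4 predictors there are 15 subsets; with p predictors there are 2^p - 1, which is why exhaustive search quickly becomes computationally intensive.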
Partial Search Algorithms
- Popular methods for finding a good subset of predictors
- Rely on a partial, iterative search through the space of all possible regression models
- End product: a best subset (of the models examined)
- Computationally cheaper, but can potentially "miss" good combinations

Forward Selection
- Start with no predictors
- Add them one by one (add the one with the largest contribution to R², on top of the predictors already present)
- Stop when the addition is not statistically significant (large p-value)
- Drawback: may miss pairs or groups of predictors that perform well together but perform poorly as single predictors

Backward Elimination
- Start with all predictors
- Successively eliminate the least useful predictor, one at a time, according to statistical significance (largest p-value)
- Stop when all remaining predictors have statistically significant contributions (low p-values)
- Drawback: computing the initial model with all predictors can be time-consuming and unstable

Stepwise Regression
- Like forward selection, except that at each step we also consider dropping predictors that are no longer statistically significant, as in backward elimination

Backward elimination output (showing the last 7 models)
- The first model has a single predictor (Age_08_04), the second model has two predictors, etc.

Diagnostics for all 12 models
- A good model has a high adjusted R² and Cp close to (number of predictors + 1)
- Good predictors: Petrol, Weight, Age, HP, Quarterly_Tax, Mileage (KM)
- Least useful: Doors, CC, Diesel, Metallic, Automatic
- The model with only 6 predictors has model fit and predictive performance essentially the same as the 12-predictor model!
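Forward selection can be sketched as a greedy loop over candidate predictors. The data below are synthetic and the names illustrative; as a simplification, this sketch stops when the R² gain drops below a fixed threshold (0.01) rather than applying the p-value test described above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, names = 300, ["Age", "KM", "HP", "Doors", "Metallic"]

# Synthetic data in which only Age, KM and HP matter (illustrative).
X = rng.normal(size=(n, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, n)

def r2(cols):
    """R^2 of the OLS fit using the given predictor columns plus an intercept."""
    Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    ss_res = np.sum((y - Xs @ beta) ** 2)
    return 1 - ss_res / np.sum((y - y.mean()) ** 2)

selected, remaining = [], list(range(5))
while remaining:
    # Add the predictor with the largest contribution to R^2 ...
    gains = {c: r2(selected + [c]) - r2(selected) for c in remaining}
    best = max(gains, key=gains.get)
    # ... and stop when the improvement is negligible (stand-in for a p-value test).
    if gains[best] < 0.01:
        break
    selected.append(best)
    remaining.remove(best)

print("selected:", [names[c] for c in selected])
```

Each iteration refits only (number of remaining predictors) models, which is what makes the search cheaper than exhaustive enumeration, and also why it can miss predictors that only help in combination.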