COMPARISON OF PENALIZED REGRESSION TECHNIQUES WITH CLASSICAL LEAST SQUARES IN MINIMIZING THE EFFECT OF MULTICOLLINEARITY

CHAPTER ONE

INTRODUCTION

1.1 Background of the Study

In order to reduce possible bias, a large number of predictor variables is often introduced into a model. This leads to a serious concern about multicollinearity among the predictor variables in multiple linear regression, and it makes variable selection an important issue (Mathew and Yahaya, 2015).

Multicollinearity and high dimensionality are two problems, statistical as well as computational, that bring challenges to regression analysis. To deal with these challenges, variable selection and shrinkage estimation have become important and useful tools. The traditional approaches of automatic selection (such as forward selection, backward elimination and stepwise selection) and best subset selection are computationally expensive and may not necessarily produce the best model.

The penalized least squares (PLS) method deals with the multicollinearity problem by placing constraints on the values of the estimated parameters; as a result, the entries of the variance-covariance matrix are significantly reduced. When multicollinearity exists, predictor variables that are highly correlated form groups. One way the collinearity problem can be dealt with is to remove one or more of the predictor variables within the same group, but deciding which of the variables in a group should be eliminated tends to be difficult and complicated. The consequence of multicollinearity is that the parameter estimates and their variances or standard errors tend to be large, and prediction may be very inaccurate.
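As an illustrative sketch of the idea described above, written in standard notation (y_i the response, x_ij the predictors, lambda >= 0 the tuning parameter), the penalized least squares estimator may be expressed as

\hat{\beta}_{PLS} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda\, P(\beta) \right\},

where P(β) is the penalty function: P(β) = \sum_{j=1}^{p} |\beta_j| gives the LASSO, P(β) = \sum_{j=1}^{p} \beta_j^2 gives ridge regression, the Elastic Net combines both, and SCAD replaces them with a nonconvex penalty that reduces the bias on large coefficients.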

In situations where the data are correlated, or where the number of predictors is much larger than the sample size, penalized regression methods have been introduced because they produce more stable results. Penalized regression methods do not select variables explicitly; instead they minimize the residual sum of squares subject to a penalty on the size of the regression coefficients. This penalty causes the regression coefficients to shrink toward zero, which may produce biased estimates, although these estimates will have smaller variance. This can improve prediction accuracy because of the smaller mean squared error (Hastie et al., 2009), which is why penalized regression methods are also known as shrinkage or regularization methods. If the shrinkage is large enough, some regression coefficients are set exactly to zero; thus, penalized regression methods perform variable selection and coefficient estimation simultaneously. The Least Absolute Shrinkage and Selection Operator (LASSO) enables selection such that only the important variables stay in the model (Szymczak et al., 2009).
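A minimal sketch of this shrinkage behaviour is given below. It assumes Python with numpy and scikit-learn, and uses simulated data with three nearly collinear predictors; neither the software nor the data are those analysed in this study, and the penalty value chosen is purely illustrative.

```python
# Illustrative sketch: shrinkage can set some coefficients exactly to zero.
# Assumes Python with numpy and scikit-learn; data are simulated, not the study data.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
x1 = rng.normal(size=n)
# First three columns are nearly collinear copies of x1; the rest are independent.
X = np.column_stack([x1 + 0.01 * rng.normal(size=n) for _ in range(3)] +
                    [rng.normal(size=n) for _ in range(p - 3)])
y = 2.0 * X[:, 0] + 1.0 * X[:, 3] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # alpha is the tuning (shrinkage) parameter

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("LASSO coefficients:", np.round(lasso.coef_, 2))
# The OLS coefficients of the collinear columns are unstable, while the LASSO
# shrinks them and typically sets some of them exactly to zero.
```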

1.2 Research Motivation

The motivation for using penalized regression is that ordinary least squares estimates may not be unique and are subject to high variability in the presence of multicollinearity. With penalization, however, the estimates become unique when appropriate tuning parameters are chosen, and the variances of the estimators are controlled. Most of the comparisons made by Mathew and Yahaya (2015) were between the Least Absolute Shrinkage and Selection Operator (LASSO), the Elastic Net (EN) and the Correlation-Adjusted Elastic Net (CAEN). This research attempts to compare LASSO, EN, CAEN and Smoothly Clipped Absolute Deviation (SCAD) regression.

1.3 Statement of the Problem

When multicollinearity exists in a model, the parameter estimates β of the multiple linear regression model may not be unique. Most often, the issue of multicollinearity arises when there are strong linear relationships between two or more predictors. In recent years, alternative methods known as shrinkage and variable selection methods, in particular penalized regression methods, have been introduced to deal with multicollinearity. This study deals with multicollinearity by considering different penalized regression methods.

1.4 Aim and Objectives of the Study

The aim of this study is to compare the performance of penalized regression techniques with classical regression methods in minimizing the effect of multicollinearity. We intend to achieve this aim through the following objectives:

i. Determine variables that possess multicollinearity using the Variance Inflation Factor (a brief computational sketch follows this list);

ii. Apply penalized regression techniques, namely LASSO, CAEN, EN and SCAD regression, to minimize the effect of multicollinearity; and

iii. Assess the adequacy of the fitted penalized regression models and the classical least squares.
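For objective (i), a minimal computational sketch is given below. It assumes Python with numpy and statsmodels (not necessarily the software used in this study), and the predictor matrix is simulated purely for illustration; the common rule of thumb that a VIF above 10 signals serious multicollinearity is applied.

```python
# Sketch of objective (i): flag predictors with a high Variance Inflation Factor.
# Assumes Python with numpy and statsmodels; X is a hypothetical predictor matrix.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
# Columns 1 and 2 are almost identical, so both should show a large VIF.
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=200), rng.normal(size=200)])

Xc = sm.add_constant(X)                      # include an intercept column
for j in range(1, Xc.shape[1]):              # skip the constant itself
    vif = variance_inflation_factor(Xc, j)
    flag = "high (VIF > 10)" if vif > 10 else "acceptable"
    print(f"Predictor {j}: VIF = {vif:.1f} ({flag})")
```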

1.5 Significance of the Study

This study is expected to demonstrate the importance of variable selection through penalized regression as a prior step for removing unimportant factors or variables before model building, and to assist researchers in deciding which technique to use when confronted with the problem of multicollinearity.

1.6 Scope and Limitations of the Study

This study revolves around the use of Generalized Cross-Validation (GCV), a good approximation to leave-one-out cross-validation (LOOCV), to determine the number of variables selected by each of the methods under study (LASSO, CAEN, EN and SCAD), and the use of the Mean Squared Error and linear fits to determine the predictive accuracy of the methods. The research gives an insight into each of the procedures in an attempt to highlight the similarities and differences existing among the four penalized methods.
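As a rough illustration of how GCV trades off goodness of fit against effective degrees of freedom, a minimal sketch is given below. It assumes Python with numpy, applies the criterion GCV(lambda) = RSS(lambda) / (n (1 - tr(H_lambda)/n)^2) to ridge regression on simulated data (chosen here only because its hat matrix H_lambda has a closed form), and is not the implementation used for the methods compared in this study.

```python
# Sketch of the GCV criterion used for tuning-parameter selection.
# Assumes Python with numpy; data and lambda grid are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 6
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

def gcv(lmbda):
    # Ridge hat matrix H = X (X'X + lambda I)^{-1} X'
    H = X @ np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T)
    resid = y - H @ y
    edf = np.trace(H)                                  # effective degrees of freedom
    return np.sum(resid ** 2) / (n * (1.0 - edf / n) ** 2)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(grid, key=gcv)
print({lam: round(gcv(lam), 3) for lam in grid}, "-> chosen lambda:", best)
```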
