Linear Regression Interview Questions and Answers
Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.
When the data is transformed, the intention is to improve the correlation between the features and the target. A side effect is that the transformation may increase the correlation while also increasing the error. R² combines two measures, the SSR and the SSE: R² = SSR / (SSR + SSE). The SSR (regression sum of squares) measures how much of the change in the output is explained by the inputs; in simple linear regression, it can be understood as how far the fitted line departs from the baseline, i.e. the mean of the target. The SSE (error sum of squares) is the sum of the squared differences between the actual and predicted values. Sometimes, after the transformation, the model's SSR increases substantially while the SSE also increases, but not to the same extent. Both sums can grow together only if the total variation SSR + SSE grows, for example when the transformation is applied to the target and changes its scale. The overall R² may therefore increase, but the trade-off is that the RMSE, which depends on the SSE, also increases.
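The interplay between R², SSR, SSE and RMSE described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up, the helper names `fit_line` and `fit_stats` are hypothetical, and the transformation (squaring the target) was chosen simply because it changes the scale of the output.

```python
import math

def fit_line(x, y):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def fit_stats(x, y):
    """Fit a line and return SST, SSR, SSE, R2 and RMSE on the same data."""
    a, b = fit_line(x, y)
    pred = [a + b * xi for xi in x]
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)             # total variation
    ssr = sum((pi - mean_y) ** 2 for pi in pred)          # explained by the model
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))  # left unexplained
    return {
        "SST": sst,
        "SSR": ssr,
        "SSE": sse,
        "R2": 1 - sse / sst,
        "RMSE": math.sqrt(sse / len(y)),
    }

# Made-up data (assumption): y grows roughly like the square root of x.
x = [1, 2, 3, 4, 5, 6]
y = [10, 14, 17, 20, 22, 25]

raw = fit_stats(x, y)
# Illustrative transformation (assumption): square the target.
transformed = fit_stats(x, [yi ** 2 for yi in y])

for name, stats in (("raw", raw), ("squared target", transformed)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

On this toy data both R² and RMSE go up after the target is squared. The two RMSE values are in different units (y versus y squared), so they are not directly comparable; for a fair comparison the predictions would first have to be mapped back to the original scale.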
Explanation: Heteroscedasticity is the situation in which the variance of the errors is not constant: the error variance alternates between high and low, or follows a pattern such as a funnel shape. A residual plot helps to diagnose it: calculate the squared residuals and plot them against the explanatory variable. If the spread of the points in the scatterplot of the dependent variable against the independent variable varies in magnitude, this suggests unequal variances. When this problem exists, the population used in the regression has unequal variance, and the analysis results may be invalid. To fix it, we can transform the variables.
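The squared-residual check described above can be sketched as follows. This is a rough numerical stand-in for the plot, not a formal test: the data are made up so that the noise grows with x, and comparing the mean squared residual in the lower and upper halves of x is an assumption chosen for simplicity.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up data (assumption): the noise grows with x, giving a funnel shape.
x = list(range(1, 11))
y = [2.5, 3.0, 7.5, 6.0, 12.5, 9.0, 17.5, 12.0, 22.5, 15.0]

a, b = fit_line(x, y)
squared_residuals = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]

# These are the points one would plot: squared residual against x.
for xi, sq in zip(x, squared_residuals):
    print(f"x={xi:2d}  squared residual={sq:7.3f}")

# Crude summary: compare the average squared residual for small and large x
# (x is already sorted in ascending order here).
half = len(x) // 2
low_mean = sum(squared_residuals[:half]) / half
high_mean = sum(squared_residuals[half:]) / (len(x) - half)
print("mean squared residual, lower half of x:", round(low_mean, 3))
print("mean squared residual, upper half of x:", round(high_mean, 3))
```

A markedly larger average in one half than in the other, as on this toy data, is the numerical counterpart of a funnel in the plot.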
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. For example, when HDL and LDL are both used as independent variables in a regression, they can be strongly related to each other. The extreme case, in which one variable is an exact linear combination of the others, is perfect collinearity; it is typically caused by the incorrect use of dummy variables. Multicollinearity inflates the variance of the estimated coefficients, so the estimates are unreliable. It also prevents the individual effect of each independent variable on the target variable from being isolated. As a result, standard errors are inflated and t values are depressed. Multicollinearity can be detected with the Variance Inflation Factor.
A variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model. It gives a quick measure of how much a variable contributes to the standard error in the regression, that is, how much the variance (or standard error) of an estimated regression coefficient is inflated due to collinearity. VIF = 1 / tolerance = 1 / (1 - R²), where R² is obtained by regressing that variable on the other independent variables. A VIF of 1 indicates that the variable is not correlated with the others; a VIF greater than 10 indicates that it is highly correlated. Because of this inflated variance, the coefficients are difficult to interpret when multicollinearity is present. Equivalently, the VIF of a variable is the ratio of the variance of its coefficient in the full model to the variance of the same coefficient in a model that includes only that single independent variable.
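The VIF formula above can be illustrated with a small sketch. To stay self-contained it covers only the special case of two predictors, where the R² from regressing one predictor on the other equals their squared Pearson correlation; the function names `pearson_r` and `vif_two_predictors` and all data values are hypothetical. With more predictors, each R² would come from a full auxiliary regression instead.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    suu = sum((ui - mean_u) ** 2 for ui in u)
    svv = sum((vi - mean_v) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2) when the model has exactly two predictors.

    With two predictors, R^2 of x1 regressed on x2 is the squared correlation,
    and both predictors share the same VIF.
    """
    tolerance = 1 - pearson_r(x1, x2) ** 2
    return math.inf if tolerance == 0 else 1 / tolerance

# Made-up predictors (assumption).
x1 = [1, 2, 3, 4, 5, 6]
x2_unrelated = [1, -1, 0, 0, -1, 1]            # zero correlation with x1
x2_related = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]    # almost a copy of x1

vif_unrelated = vif_two_predictors(x1, x2_unrelated)
vif_related = vif_two_predictors(x1, x2_related)

print("VIF with an unrelated predictor:", round(vif_unrelated, 3))
print("VIF with a nearly collinear predictor:", round(vif_related, 1))
```

The unrelated pair gives a VIF of 1, while the nearly collinear pair lands far above the threshold of 10 mentioned above.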
Absolute error is the amount of deviation in our predictions: the absolute difference between the predicted and the actual value. The Mean Absolute Error (MAE) is the mean of all absolute errors. Squared error is often preferred for optimization because it is differentiable everywhere: to minimize the squared error, we can take the derivative, set it equal to 0, and solve. Minimizing the absolute error requires more complex techniques with more computation. We use the Root Mean Squared Error (RMSE) instead of the Mean Squared Error (MSE) so that the RMSE has the same unit as the dependent variable and the results are interpretable. MAE is preferred when the dataset contains many outliers, because MAE is robust to outliers, whereas MSE and RMSE are very sensitive to them: squaring the error terms, commonly known as residuals, amplifies the large errors that outliers produce.
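The different sensitivity of MAE and RMSE to outliers can be seen in a small sketch. The actual and predicted values below are made up for illustration, and the helper names `mae` and `rmse` are hypothetical.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Made-up values (assumption): every prediction is off by exactly 1.
actual = [10, 12, 14, 16, 18]
predicted = [11, 11, 15, 15, 19]

mae_clean, rmse_clean = mae(actual, predicted), rmse(actual, predicted)

# Add a single outlier whose prediction is off by 20.
actual_out = actual + [20]
predicted_out = predicted + [40]

mae_out, rmse_out = mae(actual_out, predicted_out), rmse(actual_out, predicted_out)

print("without outlier: MAE =", round(mae_clean, 3), " RMSE =", round(rmse_clean, 3))
print("with outlier:    MAE =", round(mae_out, 3), " RMSE =", round(rmse_out, 3))
```

Without the outlier both metrics agree; with it, the RMSE rises to roughly twice the MAE, because the single large error is squared before averaging.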
Regression is a parametric approach, and it makes assumptions for the analysis. If the assumptions are not satisfied, the results are unreliable. There should be a linear relationship between x and y, i.e. a one-unit change in x produces a constant change in y. The relationship should also be additive: the effect of one input on y is independent of the other variables in the data. If the errors are related to one another, we end up with an autocorrelation problem. If the inputs have strong relationships among themselves, we have a multicollinearity problem. Errors should be normally distributed and should have constant variance. If the error terms are not normally distributed, confidence intervals may become too wide or too narrow.
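As one example of checking these assumptions, the sketch below fits a straight line to data that is actually curved and then measures how strongly each residual is related to the previous one. The data are made up, and the lag-1 autocorrelation used here is just one simple way to look for related errors, chosen as an assumption for illustration.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def lag1_autocorrelation(residuals):
    """Sum of products of neighbouring residuals divided by the sum of squares.

    Assumes the residuals have mean zero, which holds for a least-squares
    fit that includes an intercept.
    """
    numerator = sum(
        residuals[t] * residuals[t - 1] for t in range(1, len(residuals))
    )
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator

# Made-up data (assumption): y = x squared, so a straight line is the wrong shape.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
lag1 = lag1_autocorrelation(residuals)

print("residuals:", [round(r, 3) for r in residuals])
print("lag-1 autocorrelation of residuals:", round(lag1, 3))
```

The residuals are positive at both ends and negative in the middle, so neighbouring residuals tend to share a sign and the lag-1 autocorrelation is clearly positive. A value near zero would be expected if the linearity and independence assumptions held.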