A Friendly Introduction To Linear Regression With Python Implementation

Manoj Gadde · Published in Analytics Vidhya · 9 min read · May 20, 2021


Most ML beginners start their machine learning journey with the linear regression algorithm, and it is one of the easiest algorithms to understand.

In this article, let's try to understand linear regression along with a Python implementation from scratch. I will cover the theory and the practical implementation in Python side by side.

Introduction:

When I first studied linear regression, I struggled a lot to understand this simple algorithm; I could not even make sense of a simple train/test split, among many other things. I don't want that to happen to you, so if you are a beginner in ML, this article is for you. Let's get started.

What is Linear Regression?

Linear regression is a supervised learning algorithm where we have one or more independent variables and exactly one dependent variable, which is a continuous numerical variable. Confused?

Here's an example: suppose we want to predict the weight of a student based on his/her height. Here the weight variable is the output or dependent variable, and the height variable is the independent variable. This is simple linear regression, since we have only one independent variable. If we add another independent variable, say one related to his/her education background, it becomes multiple linear regression.

[Image source: http://www.quickmeme.com/meme/3okk37]

In both cases, our goal is to find the relationship between the independent and dependent variables: how much the height variable contributes to predicting a student's weight, and which variables are most significant for that prediction.

How do we predict using linear regression?

In simple linear regression you have two variables, and we try to find the best-fit line that passes through the data points (see the image linked below); in multiple linear regression, we fit a hyperplane instead.

[Image source: https://www.numpyninja.com/post/what-is-line-of-best-fit-in-linear-regression]

If you remember your school maths, the straight-line equation looks like y = mx + c. Here you can think of x as the independent variable and y as the target variable; c is the intercept and m is the coefficient or slope value that we need to find. For a good m value, the error (actual − predicted) will be low, and we can measure this error using the RSS (residual sum of squares): RSS = Σ(yᵢ − ŷᵢ)².

[Image source: https://julienbeaulieu.gitbook.io/wiki/books/untitled/statistics/regression-and-prediction]

Here yᵢ is the actual value and ŷᵢ (y-hat) is the predicted value; we square the error for every data point and add them up to get the total error. Let's pause here and think about that for a moment.
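As a quick illustration, here is a minimal sketch of computing the RSS for candidate lines on toy height/weight data (the numbers are invented for the example):

```python
import numpy as np

# toy data: heights in cm (x) and weights in kg (y); values invented for illustration
x = np.array([150, 160, 170, 180, 190], dtype=float)
y = np.array([50, 56, 63, 70, 78], dtype=float)

def rss(m, c, x, y):
    """Residual sum of squares for the line y = m*x + c."""
    y_pred = m * x + c
    return np.sum((y - y_pred) ** 2)

print(rss(0.7, -55.0, x, y))  # error for one candidate (m, c)
print(rss(0.5, -20.0, x, y))  # a worse guess gives a larger error
```

Different (m, c) pairs give different errors; the best-fit line is the pair with the smallest RSS.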

Finding the best-fit line, i.e. the optimal slope value, isn't easy; it is an iterative process. We use gradient descent, an optimization algorithm for finding the minimum of a function (here, the cost function). We compute the gradient of the cost function with respect to m; in other words, we try different values of m over the iterations, compute the loss for each one, and stop once we find the optimal m (the global minimum). In short, that is what gradient descent does; for a detailed explanation, refer to this wonderful link.
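To make that concrete, here is a minimal gradient-descent sketch for simple linear regression on the same toy data; the learning rate and iteration count are arbitrary choices for the example:

```python
import numpy as np

x = np.array([150, 160, 170, 180, 190], dtype=float)
y = np.array([50, 56, 63, 70, 78], dtype=float)

# standardise x so a single learning rate works for both parameters
x = (x - x.mean()) / x.std()

m, c = 0.0, 0.0       # start from an arbitrary line
learning_rate = 0.1   # step size

for _ in range(1000):
    error = (m * x + c) - y            # predicted minus actual
    grad_m = 2 * np.mean(error * x)    # d(MSE)/dm
    grad_c = 2 * np.mean(error)        # d(MSE)/dc
    m -= learning_rate * grad_m        # step downhill
    c -= learning_rate * grad_c

print(m, c)  # slope and intercept of the (near-)best-fit line
```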

For simplicity, think of gradient descent as a magical box: throw your data into it and it gives you back the best-fit line (the optimal slope values) that minimises the error between actual and predicted values. With this knowledge, you are ready to implement linear regression in Python.

Linear Regression in Python:

Here I am using a real estate company dataset containing the prices of properties in the Delhi region. The company wishes to use the data to optimize the sale prices of properties based on important factors such as area, bedrooms, parking, etc. The company wants:

  • To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.
  • To know the accuracy of the model, i.e. how well these variables can predict house prices.

The steps we are going to follow:

a. Reading and understanding the data

b. Performing some preprocessing steps (encoding, scaling)

c. Training the model (using statsmodels) and making predictions.

1. First, let's import all the necessary libraries
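A sketch of the imports used throughout this walkthrough (statsmodels for the model, scikit-learn for splitting and scaling):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
```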

2. Read the data
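Something like the following, assuming the dataset is saved locally as Housing.csv (the file name and path are my assumption; adjust them to your copy):

```python
housing = pd.read_csv('Housing.csv')  # file name/path assumed

print(housing.shape)  # (rows, columns)
housing.head()
```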

Our dataset consists of 545 rows and 13 columns, such as price, bedrooms, stories, etc.

3. Check for null values
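A quick way to count the missing values in each column:

```python
housing.isnull().sum()  # number of nulls per column; all zeros for this dataset
```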

Luckily, we don't have any null values in our dataset. If we did, we would need to drop them or fill them with appropriate values.

4. Let's also do some plotting to see the correlation between variables

Here I have used a simple seaborn pairplot, sns.pairplot(housing), to plot.
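A minimal version of that call:

```python
sns.pairplot(housing)
plt.show()
```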

From the pairplot we can see that price and area are correlated with each other.

5. Preparing data for modelling

In our dataset we have categorical variables like mainroad, guestroom, etc. that say whether a particular house has a guest room (yes/no) or is on a main road (yes/no). But in order to fit a regression line we need numerical variables, not categorical ones, so let's convert the categorical variables into numerical values using a simple mapping with a lambda function.
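A sketch of that step; the exact list of yes/no columns is an assumption based on the dataset description:

```python
# binary yes/no columns (list assumed from the dataset description)
varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
           'airconditioning', 'prefarea']

# map yes -> 1 and no -> 0 in each of these columns
housing[varlist] = housing[varlist].apply(lambda col: col.map({'yes': 1, 'no': 0}))
```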

We also have a column named furnishingstatus that tells us whether the house is semi-furnished, furnished, or unfurnished. For this column we use the pandas get_dummies function to convert it into numerical values.
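A sketch using drop_first=True, so that 'furnished' becomes the baseline level encoded implicitly as (0, 0):

```python
# one-hot encode furnishingstatus; dropping the first level avoids a redundant column
status = pd.get_dummies(housing['furnishingstatus'], drop_first=True)

housing = pd.concat([housing, status], axis=1)
housing = housing.drop('furnishingstatus', axis=1)
```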

Here, in the first row we have (0, 0), which means that particular house is furnished (the dropped baseline level), and in the third row we have (1, 0), which means it is semi-furnished. We finally concatenate these dummies to the original dataframe and drop the original furnishingstatus variable.

6. Train/test split and scaling

We split the data into train and test sets; we use the training data to train the model and the test data to evaluate it. Here we have used 70% of the data for training and 30% for testing.
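A sketch of the split; the random_state is an arbitrary choice for reproducibility:

```python
df_train, df_test = train_test_split(housing, train_size=0.7,
                                     test_size=0.3, random_state=100)
```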

After the train/test split, it is important to scale the data so that all variables are on the same scale. Here we have used MinMaxScaler.

From df_train we derive our feature matrix X_train and the corresponding label y_train.
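A sketch of the scaling and the feature/label split; the list of numeric columns is my assumption, since the binary and dummy columns are already 0/1:

```python
scaler = MinMaxScaler()

# scale only the genuinely numeric columns (list assumed from the dataset)
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking', 'price']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

# pop the label out of the training frame; what remains are the features
y_train = df_train.pop('price')
X_train = df_train
```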

7. Model Training

Finally, after preprocessing, we arrive at model training. Here we are using the statsmodels library, which gives us detailed statistics such as p-values, the F-statistic, etc. Note that statsmodels does not add a constant (intercept) by default; we need to add it ourselves.
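A sketch of fitting the model with an explicit constant term:

```python
# add_constant appends an intercept column, which sm.OLS does not add by itself
X_train_sm = sm.add_constant(X_train)

lr = sm.OLS(y_train, X_train_sm).fit()
print(lr.summary())  # coefficients, p-values, R-squared, F-statistic, ...
```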

Hypothesis Testing:

After training our model, the important next step is to remove redundant variables, and if you know hypothesis testing and p-values, your job becomes much easier here. Our null hypothesis is that an independent variable has no significant effect on the dependent variable; the alternate hypothesis is that it does. So if the p-value < 0.05 we reject the null hypothesis, and if the p-value > 0.05 we fail to reject it. Our goal is to keep only the significant variables and remove the redundant ones.

But wait! Our job is not done after checking p-values; we need to look at the VIF value as well. So what is this VIF?

Basically, the VIF (variance inflation factor) tells us whether our independent variables are correlated with each other. You may be wondering why we need to care about correlation between independent variables. In multiple linear regression we have more than one independent variable, and there is a chance that these variables are correlated with each other (the so-called multicollinearity problem). Because of multicollinearity, a unit increase in one independent variable also moves the correlated variables, so the coefficient estimates change wildly and the p-values become unreliable.

VIFᵢ = 1 / (1 − Rᵢ²), where i represents the i-th independent variable and Rᵢ² is the R² obtained by regressing that variable on all the other independent variables.

Note that multicollinearity does not affect predictions or goodness of fit. If your primary goal is to make predictions and you don't need to understand the role of each independent variable, you don't need to worry about it. For our problem, however, we do need to take care of the multicollinearity issue.

So, by checking both the p-values and the VIF values, we remove redundant variables.

A rule of thumb: if a variable has a p-value > 0.05, remove it; and if a variable has a VIF > 5, remove it.

Checking p-values and VIF:
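A sketch of computing the VIF for every feature with statsmodels:

```python
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i)
              for i in range(X_train.shape[1])]
print(vif.sort_values(by='VIF', ascending=False))
```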

Here we can see that the semi-furnished variable has a high p-value (0.93), so we need to remove it and train our model again.
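The drop-and-refit pattern looks like this (the column name comes from the get_dummies step above; check it against your dataframe):

```python
X_train = X_train.drop('semi-furnished', axis=1)  # remove the insignificant variable

X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit()
print(lr.summary())
```

The same pattern applies whenever a variable fails the p-value or VIF check, as with bedrooms below.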

This time the p-values are fine, but we need to take a look at the VIFs as well, and they show that the bedrooms variable has a multicollinearity issue. So let's remove this variable too.

After removing bedrooms, we train our model once more and look at the p-values and VIF values again.

All values look fine now, so we can go ahead and make predictions on the test set.

Let's first make predictions on the training set.
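A sketch of the train-set predictions and the residual distribution (the plotting call is my choice; the original may use a different one):

```python
y_train_pred = lr.predict(X_train_sm)

# residuals should look roughly normal and centred at zero
res = y_train - y_train_pred
sns.histplot(res, kde=True)
plt.xlabel('Residuals')
plt.show()
```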

The error terms are normally distributed, which is one of the assumptions of linear regression.

8. Test predictions and R² score

We follow the same approach that we used for the training set.
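A sketch, reusing the scaler and the final feature list from training; note we call transform, not fit_transform, so the test set is scaled with the training statistics:

```python
df_test[num_vars] = scaler.transform(df_test[num_vars])

y_test = df_test.pop('price')
X_test = df_test[X_train.columns]   # keep only the variables left in the final model

X_test_sm = sm.add_constant(X_test)
y_test_pred = lr.predict(X_test_sm)

print(r2_score(y_test, y_test_pred))
```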

Finally, we get an R² score of 67%, which is not great; we could improve it by trying other models such as ridge and lasso regression. But since our goal in this article is to understand the linear regression implementation in Python, I will stop here.

I hope this article helped you understand linear regression and its implementation in Python. Thank you so much for spending your time learning something new.

If you liked this article, please show your support by clapping for it below; it really motivates me to write more interesting articles. And if you have any questions, leave a comment below; I'd love to hear from you.

About me:

I'm a data science enthusiast, currently pursuing a PG Diploma in Machine Learning & AI from the International Institute of Information Technology, Bangalore.

You can also find me at www.linkedin.com/in/manoj-gadde.
