House price prediction using Linear Regression.

Suraj
2 min read · Mar 20, 2021

Hi there! In this blog, we will explore linear regression techniques to predict house prices from several distinguishing input features. We will use the following Kaggle dataset for the task: Dataset Link

The dataset comes as a .csv file, which we read using Pandas.

The first five rows of the dataset can be viewed here. Moving ahead, we will segment the entire life cycle of this project into the following steps.

  1. Exploratory Data Analysis and Data Cleaning.
  2. Feature Engineering.
  3. Model Selection, HyperParameters Tuning, Fitting and Evaluation.
  4. Deployment on Heroku.

1. Exploratory Data Analysis and Data Cleaning.

In this stage, we computed the following statistics and used them to transform the dataset.

a. The percentage of missing values in the dataset, followed by dropping the column with the most missing values.

b. The skewness and kurtosis of the features of the dataframe, followed by transformations to bring their distributions closer to normal (Gaussian). For example, since the kurtosis of the price was very high, we reduced it by converting the price to a logarithmic scale.
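As a minimal sketch of this step (the toy `price` values below are made up; the real ones come from the Kaggle CSV), Pandas exposes `skew()` and `kurtosis()` directly, and `np.log1p` applies the log transform:

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed prices standing in for the real column.
df = pd.DataFrame({"price": [35.0, 50.0, 62.0, 120.0, 600.0, 2900.0]})

# Skewness and kurtosis of the raw price column.
print(df["price"].skew(), df["price"].kurtosis())

# Log-transform to pull in the heavy right tail.
df["log_price"] = np.log1p(df["price"])
print(df["log_price"].skew(), df["log_price"].kurtosis())
```

After the transform, both skewness and kurtosis drop sharply, which is exactly why the log scale helps a linear model.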

After this, we cleaned the data by splitting the size column into two separate feature columns (BHK_size and Bed_size) and filling the empty rows with 0 using fillna(0). Similarly, features in columns like availability, bath and total_sqft were processed into relevant categorical and numerical forms. The processed dataframe can be viewed here in cell 40.
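The size-column split can be sketched like this. The exact string formats are an assumption (entries such as "2 BHK" and "4 Bedroom" are typical of this dataset), so treat it as illustrative rather than the article's exact code:

```python
import pandas as pd

# Toy frame mimicking the raw `size` column; None stands in for a missing row.
df = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", None, "3 BHK"]})

# Pull the leading number out of each entry.
num = pd.to_numeric(df["size"].str.split().str[0], errors="coerce")

# Route it into two columns depending on the unit, then fill gaps with 0.
df["BHK_size"] = num.where(df["size"].str.contains("BHK", na=False)).fillna(0)
df["Bed_size"] = num.where(df["size"].str.contains("Bedroom", na=False)).fillna(0)
print(df)
```

Each row ends up contributing to exactly one of the two new columns, with missing entries becoming 0 in both.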

The data was then split into train and test sets.
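With sklearn this is a one-liner; the tiny arrays and the 80/20 split ratio below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target; shapes mirror the real dataframe's role.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for evaluation; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```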

2. Feature Engineering

In this stage, we explored data standardisation and normalisation. We used the StandardScaler from sklearn to transform the train and test input features. The same can be viewed in cell 46.
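The key detail with StandardScaler is fitting on the training split only and reusing those statistics on the test split. A small sketch with made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (sqft, bath) rows; the real features come from the dataframe.
X_train = np.array([[1000.0, 2.0], [1500.0, 3.0], [2400.0, 4.0]])
X_test = np.array([[1800.0, 3.0]])

# Fit on the training split only, then apply the same statistics to test.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0))  # ~0 per column
print(X_train_s.std(axis=0))   # ~1 per column
```

Fitting the scaler on the full data before the split would leak test-set statistics into training, which is why the fit/transform asymmetry matters.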

3. Model Selection, HyperParameters Tuning, Fitting and Evaluation.

Since our intention is primarily to explore linear regression techniques with regularisation (Lasso and Ridge), we constrained our model selection to simple linear regression, Lasso and Ridge, along with an experiment with polynomial features. We fitted these models with and without hyperparameter tuning on the train set and evaluated them on the test set; the results can be viewed in cells 47–69. Feel free to run the hyperparameter tuning over a larger search space of ridge__alpha and lasso__alpha for better accuracy. From our limited experimentation and evaluation, we dumped the model with the best score as a .pkl file.
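The Ridge branch of this search can be sketched as a pipeline plus GridSearchCV; the synthetic data, the degree-2 polynomial, and the tiny alpha grid are all assumptions here (the Lasso branch is analogous with `Lasso()` and `lasso__alpha`):

```python
import pickle
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data standing in for the processed house features.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

# Polynomial features feeding a regularised linear model; widen the alpha
# grid for a more thorough search than this toy one.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Persist the best estimator for serving.
with open("model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```

Naming the pipeline step `ridge` is what makes the grid key `ridge__alpha` work, matching the parameter names mentioned above.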

4. Deployment on Heroku.

The fitted model was served on the web for prediction as a Flask-based microservice on Heroku. Its source code can be viewed at link.
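A minimal sketch of such a Flask service follows. The `/predict` route, the JSON payload shape, and the `DummyModel` are assumptions for illustration; the real app would unpickle model.pkl instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the pickled estimator; the real app loads model.pkl here.
class DummyModel:
    def predict(self, rows):
        return [42.0 for _ in rows]

model = DummyModel()

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[sqft, bath, bhk, ...]]}.
    payload = request.get_json(force=True)
    preds = model.predict(payload["features"])
    return jsonify({"predictions": list(preds)})

if __name__ == "__main__":
    app.run()
```

Heroku then only needs a Procfile pointing a WSGI server at `app`.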

Feel free to experiment more with the source code. You can easily replace Flask with FastAPI in app.py with minor changes.

Until next time!

Bye!

