Exploratory Data Analysis.

Suraj
Mar 31, 2021
Figure 1: Exploratory Data Analysis. (Image credits: https://www.youtube.com/watch?v=AM6w_tUlIn4)

Hi there! In this blog, we will cover the fundamental step of Exploratory Data Analysis (EDA). As the name suggests, EDA is all about exploring the data, and it is typically performed at an early stage, right after data acquisition. It is the first step in the life cycle of a machine learning project and lays a key foundation for feature engineering and for extracting meaningful insights from the data.

The data can span many dimensions, and unfortunately, unwanted outliers often find their way into the acquired data. Removing these outliers is a crucial step in a data-science project: it helps ensure that the machine learning model trained on the data performs within the desired ballpark.
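For instance, one common (though by no means the only) way to drop outliers from a numeric column is the interquartile-range (IQR) rule; the helper below is a minimal sketch assuming the data sits in a pandas DataFrame (the function name is our own, not from the original post):

import pandas as pd

# One common, IQR-based outlier filter; "column" is any numeric column name
def remove_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only the rows whose value lies within the 1.5 * IQR whiskers
    return df[(df[column] >= lower) & (df[column] <= upper)]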

Data can exist in structured and unstructured forms. Structured data mostly comprises tabular data, which is commonly processed with pandas, whereas unstructured data includes text, audio, and visual data. The following Google Colab notebooks give a walk-through of EDA on structured data (Link1 and Link2), while EDA on one kind of unstructured data can be viewed in Link5 (EDA on sound). EDA for text, sound, and image data often concludes with a few preliminary checks for outliers and missing values (thanks to the power of convolutions and recurrent units in deep neural networks), whereas EDA on tabular data is a lengthy process that often requires one to view the data through the lens of a statistician.

The crucial steps for performing EDA on structured data can be summarised as follows:

1. Assuming we have our structured data stored in a tabular data structure (a DataFrame) that pandas can read, the first step often involves inspecting the head and tail of the data and printing its shape.
import pandas as pd

# Load the tabular data into a DataFrame
dataframe = pd.read_csv("/path/to/DataframeName.csv")

dataframe.head()   # first five rows
dataframe.tail()   # last five rows
dataframe.shape    # (number of rows, number of columns)

2. Once the data is loaded, we should fetch the datatypes of the dependent and independent variables. The datatypes typically belong to int64, float64, object, or datetime. This step often plays a crucial role in the feature-scaling stage.

dataframe.info()   # column names, non-null counts, and dtypes

3. It is also important to check the variation in the data via summary statistics such as count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles. Keeping an eye on these metrics also plays a crucial role in feature engineering.

dataframe.describe()   # summary statistics for each numeric column

4. The next step is often regarded as univariate and bivariate analysis, which can be done with seaborn’s distplot (superseded by histplot/displot in recent seaborn releases) and jointplot. It is also advisable to compute the correlation between features in the DataFrame via the corr() method. A high correlation (in absolute value) between an input feature and the target variable signals that the feature is likely useful for training our machine learning models, whereas features whose correlation with the target is close to zero are candidates for dropping; note that a strong negative correlation is just as informative as a strong positive one. We can also check the skewness and kurtosis (tailedness) of the data to see whether the features approximately follow a Gaussian distribution, which is favourable for using StandardScaler during the feature-scaling stage.
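As a minimal sketch (the column names feature_1 and target are hypothetical placeholders, and histplot is used in place of the deprecated distplot):

import seaborn as sns
import matplotlib.pyplot as plt

# Univariate analysis: distribution of a single feature
sns.histplot(data=dataframe, x="feature_1", kde=True)
plt.show()

# Bivariate analysis: feature vs. target with marginal distributions
sns.jointplot(data=dataframe, x="feature_1", y="target")
plt.show()

# Pairwise correlation between numeric columns
print(dataframe.corr(numeric_only=True))

# Skewness and kurtosis (tailedness) of each numeric column
print(dataframe.skew(numeric_only=True))
print(dataframe.kurtosis(numeric_only=True))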

5. If the features are categorical, then seaborn’s countplot, boxplot, and violinplot are often recommended for spotting outliers and gaining valuable insights, as sketched below. Refer here.
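A minimal sketch, assuming a hypothetical categorical column named category and a numeric column named value:

import seaborn as sns
import matplotlib.pyplot as plt

# Frequency of each category ("category" is a hypothetical column name)
sns.countplot(x="category", data=dataframe)
plt.show()

# Spread, median, and outliers of a numeric column per category
sns.boxplot(x="category", y="value", data=dataframe)
plt.show()

# Full distribution shape per category
sns.violinplot(x="category", y="value", data=dataframe)
plt.show()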

6. The last step often involves handling missing values in the data via data-imputation strategies. A few strategies, such as mean imputation and end-of-tail imputation, are discussed in the Colab notebook’s cells.
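As a rough sketch of these two strategies in pandas (the column names feature_1 and feature_2 are again hypothetical):

# Count missing values per column
print(dataframe.isnull().sum())

# Mean imputation: fill missing entries with the column average
dataframe["feature_1"] = dataframe["feature_1"].fillna(dataframe["feature_1"].mean())

# End-of-tail imputation: fill with a value from the far end of the
# distribution (here, mean + 3 standard deviations)
tail_value = dataframe["feature_2"].mean() + 3 * dataframe["feature_2"].std()
dataframe["feature_2"] = dataframe["feature_2"].fillna(tail_value)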

For more details, you can read the chapter on exploratory data analysis from CMU Statistics here.

Until next time! I hope the steps above help you carry out EDA on your own data.

Bye!!
