Blur Code

Working towards providing new and cool programming articles.
A perfect blend of knowledge & technology to become a better coder

Rookie Data Scientist Guide

A comprehensive walkthrough for the Big Mart Sales prediction project using Machine Learning. A perfect start to becoming a Data Scientist.

Suraj Sarangi

11 minute read

‘Machine learning’ is one of the hottest topics in the world of tech right now. Are you confused about where to start? Maybe you have finished some online courses but still don’t feel confident enough to develop your own project. Online courses that walk you through a project are a great start, but they are still nowhere close to developing your own machine learning models from scratch.

Contrary to popular belief, machine learning is not all about making a model that gives you predictions. This flow chart will help you understand the principal components of a machine learning project:

Created by Suraj Sarangi
The chart shows the steps to making a working model.

  • The first step is preprocessing. In my opinion, this is the most time-consuming step. It requires careful observation of the datasets we’re given and analysis of the relations between the attributes and the target. It also involves imputing missing values, since NaN values are not very friendly to work with. The last part is dropping features that don’t seem very important.

  • The second step is feature engineering. This is the most important step, as it can make poor datasets work really well. It involves scaling features whose ranges are very wide.

Some attributes may be too unwieldy to work with, so we create derived variables that are computationally more efficient.

Then comes building a model for predictions. There are linear models like Linear Regression, Lasso, and Ridge, and tree models like Decision Tree and Random Forest. And there is always a neural network somewhere around the corner.

After building the model, the hyperparameters are carefully adjusted based on training-set and test-set accuracy. After this, the model can be ensembled with other algorithms. There is one final step, not shown in the chart: deployment of the model, which requires making an app and uploading the data to the cloud.
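To make the ensembling idea concrete, here is a minimal sketch (not the exact method used later in this article). It assumes feature matrices X_train, X_test and a target y_train have already been prepared, as we will do further below, and simply averages the predictions of a linear model and a tree model.

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Assumes X_train, y_train and X_test already exist (they are built later in this article).
ridge = Ridge().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# The simplest form of ensembling: average the two prediction vectors.
ensemble_pred = (ridge.predict(X_test) + forest.predict(X_test)) / 2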

Preprocessing

The first step involves getting data. You can download a dataset from the internet or, if you have the resources, create your own. This data usually comes with errors, missing values, or unwanted attributes. While the rules of English put great emphasis on punctuation marks and spaces, our machine simply doesn’t care. Cleaning the data, such as removing extra white space and unnecessary punctuation, is generally the first step when building models for text classification. Image classification tasks, by contrast, usually come with fairly robust datasets and rarely require this kind of preprocessing.
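As a minimal sketch of that kind of text cleaning (the raw_texts list below is purely hypothetical and has nothing to do with the Big Mart data):

import re

raw_texts = ["  Hello,   World!! ", "Machine   learning; is FUN... "]   # hypothetical documents

def clean_text(text):
    text = text.lower().strip()            # normalise case and trim outer whitespace
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces

cleaned = [clean_text(t) for t in raw_texts]
print(cleaned)   # ['hello world', 'machine learning is fun']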

Let’s work on a business problem by making sales predictions for Big Mart. This model can help the business understand which products will be in demand and which ones it should think about getting rid of. Let’s load the datasets and the libraries required for processing the data.

import pandas as pd
url1 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Train.csv'
train = pd.read_csv(url1)
url2 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Test.csv'
test = pd.read_csv(url2)
train.head()
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
0 FDA15 9.30 Low Fat 0.016047 Dairy 249.8092 OUT049 1999 Medium Tier 1 Supermarket Type1 3735.1380
1 DRC01 5.92 Regular 0.019278 Soft Drinks 48.2692 OUT018 2009 Medium Tier 3 Supermarket Type2 443.4228
2 FDN15 17.50 Low Fat 0.016760 Meat 141.6180 OUT049 1999 Medium Tier 1 Supermarket Type1 2097.2700
3 FDX07 19.20 Regular 0.000000 Fruits and Vegetables 182.0950 OUT010 1998 NaN Tier 3 Grocery Store 732.3800
4 NCD19 8.93 Low Fat 0.000000 Household 53.8614 OUT013 1987 High Tier 3 Supermarket Type1 994.7052

Two important functions which help in getting some information about the dataset are:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8523 non-null object
Item_Weight                  7060 non-null float64
Item_Fat_Content             8523 non-null object
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Outlet_Size                  6113 non-null object
Outlet_Location_Type         8523 non-null object
Outlet_Type                  8523 non-null object
Item_Outlet_Sales            8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB

train.describe()
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales
count 7060.000000 8523.000000 8523.000000 8523.000000 8523.000000
mean 12.857645 0.066132 140.992782 1997.831867 2181.288914
std 4.643456 0.051598 62.275067 8.371760 1706.499616
min 4.555000 0.000000 31.290000 1985.000000 33.290000
25% 8.773750 0.026989 93.826500 1987.000000 834.247400
50% 12.600000 0.053931 143.012800 1999.000000 1794.331000
75% 16.850000 0.094585 185.643700 2004.000000 3101.296400
max 21.350000 0.328391 266.888400 2009.000000 13086.964800

Visualizing the data with different plots can be really helpful for determining the correlation between attributes. Distplots, countplots, histograms, and scatterplots are some of the major visualization methods. Analyzing correlation is an important part as well, since it helps determine the most important features in the data. We use the seaborn and matplotlib libraries for our visualization:

import matplotlib.pyplot as plt
import seaborn as sb
plt.style.use('seaborn')
plt.figure(figsize=(12,5))
sb.distplot(train.Item_Outlet_Sales,bins=20,color='red')
plt.xlabel('Item Outlet Sales',fontsize=15)
plt.ylabel('No. of Sales',fontsize=15)
plt.title('Histogram of Target',fontweight='bold',fontsize=17);

Created by Suraj Sarangi
I love graphs; playing with colors and plots is a must if you’re new to this. You might be wondering about the semicolon (;) at the end. “It’s Python, it’s a sin to use a semicolon!”

Well, when working with plots, the semicolon suppresses the useless object output of the last line. Try it for yourself and see.


Univariate Analysis

We split the numerical and categorical data into separate dataframes, as the analysis for each is different.

import numpy as np
num=train.select_dtypes(include=np.number)
cate=train.select_dtypes(exclude=np.number)

Numeric Data

Numeric data is analysed using correlation. Correlation is a measure of how sensitively the values of one column change with respect to changes in the values of another column. We can use corr() to get the correlation matrix:

co=num.corr()
co
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales
Item_Weight 1.000000 -0.014048 0.027141 -0.011588 0.014123
Item_Visibility -0.014048 1.000000 -0.001315 -0.074834 -0.128625
Item_MRP 0.027141 -0.001315 1.000000 0.005020 0.567574
Outlet_Establishment_Year -0.011588 -0.074834 0.005020 1.000000 -0.049135
Item_Outlet_Sales 0.014123 -0.128625 0.567574 -0.049135 1.000000

Now, if you found that matrix boring, I wouldn’t say you’re entirely wrong. Let’s see the same matrix in a much better way.

plt.figure(figsize=(8,6))
sb.heatmap(co,square=True,annot=True)
ax=plt.gca()
bo,to=ax.get_ylim()
ax.set_ylim(bo+0.5,to-0.5)    # widen the y-limits so the top and bottom rows of the heatmap aren't cut off
plt.title('Correlation Heatmap',fontweight='bold',fontsize=17);

Created by Suraj Sarangi
Now let’s look at the most important correlations with the target:

co['Item_Outlet_Sales'].sort_values(ascending=False)

Item_Outlet_Sales            1.000000
Item_MRP                     0.567574
Item_Weight                  0.014123
Outlet_Establishment_Year   -0.049135
Item_Visibility             -0.128625
Name: Item_Outlet_Sales, dtype: float64

As we can see, the sales column has its strongest correlation with MRP (we disregard the 1.000000, as that is just the column’s correlation with itself).


Categorical Data

For Categorical Data, we plot Countplots.

plt.figure(figsize=(15,5))
cols=list(cate.columns)
cols.remove('Item_Identifier')       # we don’t need the identifiers
cols.remove('Outlet_Identifier')
sb.countplot(cate[cols[0]]);

Created by Suraj Sarangi

We can see irregularities such as LF, low fat, and reg, which are just different spellings of Low Fat and Regular. We’ll fix these later.

plt.figure(figsize=(15,5))
sb.countplot(cate[cols[1]])
plt.title(cols[1],fontweight='bold',fontsize=15)
plt.xticks(rotation=90,fontsize=15)
plt.xlabel(cols[1],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Count',fontsize=15);

Created by Suraj Sarangi
There are a lot of categories in Item_Type. Computation will be really wasteful if we keep so many of them, so we’ll reduce them later in Feature Engineering.

Likewise, we can plot the graphs for all the other categorical data.
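Assuming the cols list and the cate dataframe defined above, a small loop like the sketch below would produce one countplot per remaining categorical column (newer seaborn versions expect x=cate[c] instead of the positional argument used here, which matches the style of the snippets above):

for c in cols:                       # cols already excludes the identifier columns
    plt.figure(figsize=(15,5))
    sb.countplot(cate[c])
    plt.title(c,fontweight='bold',fontsize=15)
    plt.xticks(rotation=90,fontsize=13)
    plt.ylabel('Count',fontsize=15)
    plt.show()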


Bivariate Analysis

We make another variable holding all the column names of the training set:

cols2=list(train.columns)

Bivariate analysis is again divided into numeric, categorical, and mixed analysis.

Numeric

vis_pt=train.pivot_table(index=cols2[4], values=cols2[3], aggfunc=np.median)
vis_pt.plot(kind='bar',color='darkorchid',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[4],fontsize=15)
plt.ylabel(cols2[3],fontsize=15)
plt.title('Item Visibility and Item Type',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
Created by Suraj Sarangi

As we can see, there are a lot of zeros in this feature; these need to be fixed when we impute missing values.

year_pt=train.pivot_table(index=cols2[-5], values=cols2[-1], aggfunc=np.median)
year_pt.plot(kind='bar',color='k',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[-5],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Establishment Year and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();

Created by Suraj Sarangi
There is an unusual drop in the year 1998. This makes the trend very irregular, and hence we need to do some feature engineering for this particular attribute.


Categorical

We can use pivot tables to plot the categorical data as well:

type_pt=train.pivot_table(index=cols[1], values=cols2[-1], aggfunc=np.median)
type_pt.plot(kind='bar',color='deeppink',figsize=(15,5),alpha=0.8)
plt.xlabel(cols[1],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Item type and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
Created by Suraj Sarangi

We can plot for all the remaining pairs and see if there is any irregularity.

Now that we have visualized the data and found the problems, we move on to imputing missing values and Feature Engineering.


Imputation of Missing Values

Discontinuities in the form of NaN values are really disliked by models, but real data is rarely free of them. Filling these values is generally done by taking the average over the column or simply placing a zero. Based on the analysis of the correlation matrix, the best features are generally shortlisted while the less important ones are dropped.

From info() we found that ‘Item_Weight’ and ‘Outlet_Size’ had null values.

mr=np.mean(train['Item_Weight'])
train['Item_Weight'].fillna(mr,inplace=True)
mr2=np.mean(test['Item_Weight'])
test['Item_Weight'].fillna(mr2,inplace=True)

This imputes the NaN values in the weight column of train and test with the mean of the column. The mean is generally the preferred choice for numeric data.

kp=(train.mode(axis=0))['Outlet_Size'].iloc[0]
train['Outlet_Size'].fillna(value=kp,inplace=True)
kp2=(test.mode(axis=0))['Outlet_Size'].iloc[0]
test['Outlet_Size'].fillna(kp2,inplace=True)

This imputes the NaN values in Outlet_Size with the mode of the column in the train and test sets. The mode is generally the preferred choice for categorical data.
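As an aside, scikit-learn's SimpleImputer can perform the same mean and mode imputations. The sketch below is an equivalent alternative to the fillna calls above, with one small difference: it is fitted on the training set and reuses those statistics for the test set.

from sklearn.impute import SimpleImputer

num_imp = SimpleImputer(strategy='mean')               # mean for the numeric weight column
train[['Item_Weight']] = num_imp.fit_transform(train[['Item_Weight']])
test[['Item_Weight']] = num_imp.transform(test[['Item_Weight']])

cat_imp = SimpleImputer(strategy='most_frequent')      # mode for the categorical outlet size
train[['Outlet_Size']] = cat_imp.fit_transform(train[['Outlet_Size']])
test[['Outlet_Size']] = cat_imp.transform(test[['Outlet_Size']])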


Feature Engineering

Resurrecting Fat Content

train.replace("LF","Low Fat",inplace=True)
train.replace("low fat","Low Fat",inplace=True)
train.replace("reg","Regular",inplace=True)

test.replace("LF","Low Fat",inplace=True)
test.replace("low fat","Low Fat",inplace=True)
test.replace("reg","Regular",inplace=True)

This replaces the irregularities in the fat content column of train and test. The new visualization looks like this:

Created by Suraj Sarangi
Some items had 0 visibility, which makes no sense, but the same items had non-zero visibility in other outlets. Hence, we replace the zeros with the item’s average visibility.

pt=train.pivot_table(values='Item_Visibility',index='Item_Identifier')    # average visibility per item
for i in range(len(train)):
    if train['Item_Visibility'].iloc[i]==0:
        # .loc avoids the chained-indexing assignment warning that .iloc[i]= would raise
        train.loc[train.index[i],'Item_Visibility']=pt['Item_Visibility'].loc[train['Item_Identifier'].iloc[i]]

The new visualization:

Created by Suraj Sarangi
We do the same for the test dataframe as well.
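One way to avoid repeating the loop is to wrap the replacement in a small helper that reuses the training pivot table pt as a lookup. The sketch below is just one possibility and assumes every test item identifier also appears in the training data (otherwise the mapped value is NaN).

def fix_zero_visibility(df, lookup):
    # look up each row's average visibility by its Item_Identifier
    item_avg = df['Item_Identifier'].map(lookup['Item_Visibility'])
    # replace zeros with the looked-up average, leaving other values untouched
    df['Item_Visibility'] = df['Item_Visibility'].mask(df['Item_Visibility'] == 0, item_avg)
    return df

test = fix_zero_visibility(test, pt)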

We change the Establishment Year to the number of years operated. This data is from 2013.

train['Years_operated']=2013-train['Outlet_Establishment_Year']
test['Years_operated']=2013-test['Outlet_Establishment_Year']

It looks like this:

Created by Suraj Sarangi

We saw that Item_Type has a lot of categories. Taking a closer look at the dataset, the Item_Identifier column has values beginning with FD, DR, or NC. We can use this to create a new item type: FD can be Food, DR can be Drinks, and NC can be Non-Consumable.

di={'FD':'Food','DR':'Drinks','NC':'Non-Consumable'}
l=[]
for i in train['Item_Identifier']:
    l.append(di[i[:2]])
train['Item_Type2']=l

New Visualization:

Created by Suraj Sarangi

We do the same for the test dataframe. Since Non-Consumable items shouldn’t have a fat content, we also alter the Item_Fat_Content column.

for i in range(len(train)):
    if train['Item_Type2'].iloc[i]=='Non-Consumable':
        train.loc[train.index[i],'Item_Fat_Content']='Non-Consumable'    # .loc avoids chained-indexing assignment
for i in range(len(test)):
    if test['Item_Type2'].iloc[i]=='Non-Consumable':
        test.loc[test.index[i],'Item_Fat_Content']='Non-Consumable'

Item_Fat_Content now looks like this:

Created by Suraj Sarangi


Encoding

Since we are using the scikit-learn library for our models, and its estimators cannot work with categorical variables directly, we need to use encoding to generate numeric attributes from these categorical columns. We use One Hot Encoding (funny name, eh?) for our attributes. It assigns 1 to the rows that have a particular value and 0 to those that don’t.

cols.append('Item_Type2')    # add the new feature we created
cols.remove('Item_Type')     # remove the old Item_Type

from sklearn.preprocessing import LabelEncoder
# LabelEncoder converts each label in a column to an integer code
for i in cols:
    train[i]=LabelEncoder().fit_transform(train[i])

for i in cols:
    test[i]=LabelEncoder().fit_transform(test[i])

# get_dummies then expands each encoded column into one indicator column per label
train=pd.get_dummies(train,columns=cols)
test=pd.get_dummies(test,columns=cols)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 25 columns):
Item_Identifier              8523 non-null object
Item_Weight                  8523 non-null float64
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Item_Outlet_Sales            8523 non-null float64
Years_operated               8523 non-null int64
Item_Fat_Content_0           8523 non-null uint8
Item_Fat_Content_1           8523 non-null uint8
Item_Fat_Content_2           8523 non-null uint8
Outlet_Size_0                8523 non-null uint8
Outlet_Size_1                8523 non-null uint8
Outlet_Size_2                8523 non-null uint8
Outlet_Location_Type_0       8523 non-null uint8
Outlet_Location_Type_1       8523 non-null uint8
Outlet_Location_Type_2       8523 non-null uint8
Outlet_Type_0                8523 non-null uint8
Outlet_Type_1                8523 non-null uint8
Outlet_Type_2                8523 non-null uint8
Outlet_Type_3                8523 non-null uint8
Item_Type2_0                 8523 non-null uint8
Item_Type2_1                 8523 non-null uint8
Item_Type2_2                 8523 non-null uint8
dtypes: float64(4), int64(2), object(3), uint8(16)
memory usage: 732.6+ KB

As we can see, the dummy variables have been generated according to the number of distinct labels present in each categorical attribute. For example, since Item_Fat_Content had ‘Low Fat’, ‘Regular’, and ‘Non-Consumable’, we get three dummy columns for it.
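To see the mechanics on a tiny, purely hypothetical example (older pandas shows the dummies as 0/1, newer versions as True/False):

toy = pd.DataFrame({'Fat': ['Low Fat', 'Regular', 'Non-Consumable', 'Low Fat']})
pd.get_dummies(toy, columns=['Fat'])
#    Fat_Low Fat  Fat_Non-Consumable  Fat_Regular
# 0            1                   0            0
# 1            0                   0            1
# 2            0                   1            0
# 3            1                   0            0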

The next step is to remove the now-redundant attributes from our dataframes, namely Item_Type and Outlet_Establishment_Year.

train.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
test.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
train.head()

Now we’ll prepare our training and test sets:

cols3=list(train.columns)
j=cols3.pop(5) #to get outlet_sales at last
cols3.append(j)
cols3.remove('Item_Identifier')
cols3.remove('Outlet_Identifier')
X_train=train[cols3[:-1]]
X_test=test[cols3[:-1]]
y_train=train[cols3[-1]]

After so much preprocessing and feature engineering, we have finally reached my favourite part: model selection. Since sales is continuous data, we need to use regression here.

from sklearn.linear_model import Lasso
las=Lasso().fit(X_train, y_train)
test2=pd.read_csv(url2)    # reload the raw test set so the identifier columns stay untouched
test2["Item_Outlet_Sales"]=las.predict(X_test)     # a new column is made for the predictions
test2[["Item_Identifier","Outlet_Identifier","Item_Outlet_Sales"]].to_csv("Predictions.csv",index=False)
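It's also worth estimating how well the model generalizes before trusting those predictions. The snippet above doesn't check this; as one simple, optional addition, 5-fold cross-validation on the training data gives a quick RMSE estimate for the Lasso model:

from sklearn.model_selection import cross_val_score

neg_mse = cross_val_score(Lasso(), X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse = np.sqrt(-neg_mse)
print('Cross-validated RMSE: %.2f (+/- %.2f)' % (rmse.mean(), rmse.std()))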

This will make a csv file containing the identifiers and the sales for each of them. It’ll look like this:

Item_Identifier Outlet_Identifier Item_Outlet_Sales
0 FDW58 OUT049 1767.496700
1 FDW14 OUT017 1523.687095
2 NCN55 OUT010 1907.343992
3 FDQ58 OUT017 2539.158000
4 FDY38 OUT027 5183.936817
... ... ... ...
5676 FDB58 OUT046 2360.714311
5677 FDD47 OUT018 2464.120755
5678 NCO17 OUT045 1914.865997
5679 FDJ26 OUT017 3501.090376
5680 FDU37 OUT045 1369.458339

5681 rows × 3 columns

This completes the first three steps of our flow chart. The next part is hyperparameter tuning, which is really important when using neural networks to make predictions.
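For the Lasso model used above, a minimal hyperparameter-tuning sketch could look like the following: a grid search over the regularisation strength alpha (the grid values are only illustrative and not tuned for this dataset).

from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_, np.sqrt(-search.best_score_))   # best alpha and its cross-validated RMSE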

Check out my repository on GitHub to find the project in much more detail. You’ll find many more graphs and a lot more colors, if that’s what you’re looking for. github.com/SurajSarangi/Big-Mart-Sales-Prediction

Please give it a star if you happen to be mesmerized by the colours and graphs. Feel free to check out the other machine learning projects as well.


About the author

Suraj Sarangi is an undergrad pursuing a BTech degree. Python and Deep Learning are his weapons of choice. He is well versed in English and likes to debate. Apparently, he loves football and coffee more than anything in his life.

Suraj Sarangi

Reviews

If you find it interesting, we would really like to hear from you.

Ping us on Instagram: @the.blur.code

If you want articles on any topic, DM us on Instagram.

Thanks for reading! Happy coding!
