Rookie Data Scientist Guide
A comprehensive walkthrough for the Big Mart Sales prediction project using Machine Learning. A perfect start to becoming a Data Scientist.
‘Machine learning’ is one of the hottest topics in tech right now. Are you confused about where to start? Maybe you have finished some online courses but still don’t feel confident enough to build your own project. Courses that walk you through a guided project are a great start, but they are still nowhere close to developing your own models from scratch.
Contrary to popular belief, machine learning is not just about making a model that spits out predictions. This flow chart will help you understand the principal components of a machine learning project:
The chart shows the steps in building a working model:
- The first step is preprocessing. In my experience this is the most time-consuming step. It needs careful observation of the datasets we are given and analysis of the relations between the attributes and the target. It also involves imputing missing values, since NaN values are not very friendly to work with. The last part of this step is dropping features that don’t seem important.
- The second step is feature engineering. This is the most important step, as it can make a weak dataset work really well. It involves scaling features whose ranges are very wide, and some attributes may be too unwieldy to work with directly, so we create derived variables from them that are more informative and computationally cheaper.
Then comes building a model for predictions. There are linear models like Linear Regression, Lasso and Ridge, tree-based models like Decision Tree and Random Forest, and there is always a neural network somewhere around the corner.
After building the model, its hyperparameters are carefully tuned based on training and validation performance. The model can then be ensembled with other algorithms. There is one final step, not shown in the chart: deployment of the model, which typically means wrapping it in an app and hosting it in the cloud.
Preprocessing
The first step is getting data. You can download a dataset from the internet or, if you have the resources, collect your own. Raw data usually comes with errors, missing values and unwanted attributes. While the rules of English put great emphasis on punctuation and spacing, our machine simply doesn’t care, so cleaning the text, such as removing extra whitespace and unnecessary punctuation, is generally the first step when building text classification models. Popular image classification datasets tend to come relatively clean, though the images still typically need resizing and normalization before training.
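As a quick, purely illustrative sketch of that kind of text cleanup (on a made-up list of strings, nothing to do with the Big Mart data):
import string
# Hypothetical raw text samples; any list of strings would do.
raw_docs = ["  Great product!!  ", "not   worth it...", "Okay, I guess?  "]
def clean_text(text):
    text = text.lower().strip()                                        # normalize case, trim outer whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation marks
    return ' '.join(text.split())                                      # collapse repeated inner whitespace
cleaned = [clean_text(doc) for doc in raw_docs]
print(cleaned)   # ['great product', 'not worth it', 'okay i guess']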
Let’s work on a business problem: predicting sales for Big Mart. Such a model can help the business understand which products will be in demand and which ones it should consider dropping. Let’s load the datasets and the libraries required to process the data.
import pandas as pd
url1 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Train.csv'
train = pd.read_csv(url1)
url2 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Test.csv'
test = pd.read_csv(url2)
train.head()
 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
Two important functions which help in getting some information about the dataset are:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier 8523 non-null object
Item_Weight 7060 non-null float64
Item_Fat_Content 8523 non-null object
Item_Visibility 8523 non-null float64
Item_Type 8523 non-null object
Item_MRP 8523 non-null float64
Outlet_Identifier 8523 non-null object
Outlet_Establishment_Year 8523 non-null int64
Outlet_Size 6113 non-null object
Outlet_Location_Type 8523 non-null object
Outlet_Type 8523 non-null object
Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB
train.describe()
 | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
---|---|---|---|---|---
count | 7060.000000 | 8523.000000 | 8523.000000 | 8523.000000 | 8523.000000 |
mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
std | 4.643456 | 0.051598 | 62.275067 | 8.371760 | 1706.499616 |
min | 4.555000 | 0.000000 | 31.290000 | 1985.000000 | 33.290000 |
25% | 8.773750 | 0.026989 | 93.826500 | 1987.000000 | 834.247400 |
50% | 12.600000 | 0.053931 | 143.012800 | 1999.000000 | 1794.331000 |
75% | 16.850000 | 0.094585 | 185.643700 | 2004.000000 | 3101.296400 |
max | 21.350000 | 0.328391 | 266.888400 | 2009.000000 | 13086.964800 |
Visualizing the data with different plots is really helpful for understanding how the attributes relate to each other and to the target. Distplots, countplots, histograms and scatterplots are some of the major visualization methods, and analyzing correlation helps determine the most important features in the data. We use the seaborn and matplotlib libraries for our visualization:
import matplotlib.pyplot as plt
import seaborn as sb
plt.style.use('seaborn')
plt.figure(figsize=(12,5))
sb.distplot(train.Item_Outlet_Sales,bins=20,color='red')
plt.xlabel('Item Outlet Sales',fontsize=15)
plt.ylabel('No. of Sales',fontsize=15)
plt.title('Histogram of Target',fontweight='bold',fontsize=17);
I love graphs, and playing with colors and plots is a must if you’re new to this. You might be wondering about the semicolon (;) at the end.
> “It’s Python, it’s a sin to use a semicolon!”
Well, when working with plots in a notebook, the semicolon suppresses the noisy object output that would otherwise be printed above the figure. Try it for yourself and see.
Univariate Analysis
We split the numerical and categorical columns into separate dataframes, as the analysis for each is different.
import numpy as np
num=train.select_dtypes(include=np.number)
cate=train.select_dtypes(exclude=np.number)
Numeric Data
Numeric data is analysed using correlation. Correlation measures how strongly the values of one column move with the values of another, from -1 for a perfect inverse relationship to +1 for a perfect direct one. We can use corr() to get the correlation matrix.
co=num.corr()
co
 | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
---|---|---|---|---|---
Item_Weight | 1.000000 | -0.014048 | 0.027141 | -0.011588 | 0.014123 |
Item_Visibility | -0.014048 | 1.000000 | -0.001315 | -0.074834 | -0.128625 |
Item_MRP | 0.027141 | -0.001315 | 1.000000 | 0.005020 | 0.567574 |
Outlet_Establishment_Year | -0.011588 | -0.074834 | 0.005020 | 1.000000 | -0.049135 |
Item_Outlet_Sales | 0.014123 | -0.128625 | 0.567574 | -0.049135 | 1.000000 |
Now, if you found that matrix boring, I wouldn’t say you’re entirely wrong. Let’s see the same matrix in a much nicer way.
plt.figure(figsize=(8,6))
sb.heatmap(co,square=True,annot=True)
ax=plt.gca()
bo,to=ax.get_ylim()
ax.set_ylim(bo+0.5,to-0.5)   # workaround for a matplotlib 3.1.1 bug that clips the top and bottom rows of heatmaps
plt.title('Correlation Heatmap',fontweight='bold',fontsize=17);
Now let’s look at the correlations with the target, sorted from strongest to weakest:
co['Item_Outlet_Sales'].sort_values(ascending=False)
Item_Outlet_Sales            1.000000
Item_MRP                     0.567574
Item_Weight                  0.014123
Outlet_Establishment_Year   -0.049135
Item_Visibility             -0.128625
Name: Item_Outlet_Sales, dtype: float64
As we can see, the sales column is most strongly correlated with MRP (we disregard the 1.000000 at the top, which is just the column’s correlation with itself).
Categorical Data
For Categorical Data, we plot Countplots.
plt.figure(figsize=(15,5))
cols=list(cate.columns)
cols.remove('Item_Identifier') # we don’t need the identifiers
cols.remove('Outlet_Identifier')
sb.countplot(cate[cols[0]]);
We can see irregular labels such as ‘LF’, ‘low fat’ and ‘reg’, which are just different spellings of ‘Low Fat’ and ‘Regular’. We’ll fix these later.
plt.figure(figsize=(15,5))
sb.countplot(cate[cols[1]])
plt.title(cols[1],fontweight='bold',fontsize=15)
plt.xticks(rotation=90,fontsize=15)
plt.xlabel(cols[1],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Count',fontsize=15);
There are a lot of categories in Item_Type. Keeping so many of them would make computation wasteful, so we’ll reduce them later in feature engineering.
Likewise, we can plot the graphs for all the other categorical data.
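The remaining countplots can be generated with a small loop like the one below; this is just a sketch in the same style as the plots above, not code from the original notebook.
for col in cols[2:]:                       # the categorical columns not plotted yet
    plt.figure(figsize=(15,5))
    sb.countplot(cate[col])
    plt.title(col,fontweight='bold',fontsize=15)
    plt.xticks(rotation=90,fontsize=13)
    plt.show()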
Bivariate Analysis
We make another list holding all the column names of the training set.
cols2=list(train.columns)
This is again divided into numeric, categorical and mixed analysis.
Numeric
vis_pt=train.pivot_table(index=cols2[4], values=cols2[3], aggfunc=np.median)
vis_pt.plot(kind='bar',color='darkorchid',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[4],fontsize=15)
plt.ylabel(cols2[3],fontsize=15)
plt.title('Item Visibility and Item Type',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
As we can see, there are a lot of zeros in this feature; these need to be fixed when we impute missing values.
year_pt=train.pivot_table(index=cols2[-5], values=cols2[-1], aggfunc=np.median)
year_pt.plot(kind='bar',color='k',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[-5],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Establishment Year and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
There is an unusual drop in the year 1998. This makes the trend very irregular, and hence we need to do some feature engineering for this particular attribute.
Categorical
We can use pivot tables to plot categorical data against the target.
type_pt=train.pivot_table(index=cols[1], values=cols2[-1], aggfunc=np.median)
type_pt.plot(kind='bar',color='deeppink',figsize=(15,5),alpha=0.8)
plt.xlabel(cols[1],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Item type and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
We can plot for all the remaining pairs and see if there is any irregularity.
Now that we have visualized the data and found the problems, we move on to fixing them, starting with the imputation of missing values and then feature engineering.
Imputation of Missing Values
Models really dislike gaps in the data in the form of NaN values, but real data is rarely free of them. Filling these values is usually done by taking the average over the column, or sometimes simply by placing a zero. Separately, based on the correlation analysis, the most useful features are shortlisted while the less important ones are dropped.
From info() we found that ‘Item_Weight’ and ‘Outlet_Size’ had null values.
mr=np.mean(train['Item_Weight'])
train['Item_Weight'].fillna(mr,inplace=True)
mr2=np.mean(test['Item_Weight'])
test['Item_Weight'].fillna(mr2,inplace=True)
This imputes the NaN values in the weight column of train and test with the mean of the respective column. The mean is generally a preferred choice for numeric data.
kp=(train.mode(axis=0))['Outlet_Size'].iloc[0]
train['Outlet_Size'].fillna(value=kp,inplace=True)
kp2=(test.mode(axis=0))['Outlet_Size'].iloc[0]
test['Outlet_Size'].fillna(kp2,inplace=True)
This imputes the NaN values in Outlet_Size with the mode of the column in the train and test sets. The mode is generally a preferred choice for categorical data.
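If you prefer to keep the imputation inside scikit-learn, SimpleImputer can do both jobs. This is only an alternative sketch, not part of the original notebook; also note that it fills the test set with the training-set statistics, which is the usual convention, whereas the code above fills each set with its own mean/mode.
from sklearn.impute import SimpleImputer
# Mean imputation for the numeric weight column.
num_imp = SimpleImputer(strategy='mean')
train[['Item_Weight']] = num_imp.fit_transform(train[['Item_Weight']])
test[['Item_Weight']] = num_imp.transform(test[['Item_Weight']])
# Most-frequent (mode) imputation for the categorical outlet size.
cat_imp = SimpleImputer(strategy='most_frequent')
train[['Outlet_Size']] = cat_imp.fit_transform(train[['Outlet_Size']])
test[['Outlet_Size']] = cat_imp.transform(test[['Outlet_Size']])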
Feature Engineering
Resurrecting Fat Content
train.replace("LF","Low Fat",inplace=True)
train.replace("low fat","Low Fat",inplace=True)
train.replace("reg","Regular",inplace=True)
test.replace("LF","Low Fat",inplace=True)
test.replace("low fat","Low Fat",inplace=True)
test.replace("reg","Regular",inplace=True)
This replaces the irregular labels in the fat-content column of train and test. The new visualization looks like this:
Some items had zero visibility, which makes no sense, but the same items had non-zero visibility in other outlets. Hence, we replace the zeros with each item’s average visibility:
pt = train.pivot_table(values='Item_Visibility', index='Item_Identifier')   # average visibility per item
for i in range(len(train)):
    if train.loc[i, 'Item_Visibility'] == 0:
        train.loc[i, 'Item_Visibility'] = pt.loc[train.loc[i, 'Item_Identifier'], 'Item_Visibility']
The new visualization:
We do the same for the test dataframe as well.
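The original post doesn’t show that test-set version, but a direct adaptation of the loop above would look roughly like this (using the test set’s own item-level averages):
pt_test = test.pivot_table(values='Item_Visibility', index='Item_Identifier')   # average visibility per item in test
for i in range(len(test)):
    if test.loc[i, 'Item_Visibility'] == 0:
        test.loc[i, 'Item_Visibility'] = pt_test.loc[test.loc[i, 'Item_Identifier'], 'Item_Visibility']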
We change the Establishment Year to the number of years operated. This data is from 2013.
train['Years_operated']=2013-train['Outlet_Establishment_Year']
test['Years_operated']=2013-test['Outlet_Establishment_Year']
It looks like this:
We saw that Item_Type has a lot of categories. Taking a closer look at the dataset, the Item_Identifier column has values beginning with FD, DR or NC. We can use this to create a new, coarser item type: FD for Food, DR for Drinks, NC for Non-Consumable.
di={'FD':'Food','DR':'Drinks','NC':'Non-Consumable'}
l=[]
for i in train['Item_Identifier']:
    l.append(di[i[:2]])   # the first two characters of the identifier give the broad type
train['Item_Type2']=l
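For the test dataframe, and as a more compact alternative to the loop, the same mapping can be done in one line per dataframe with pandas string slicing; this is a sketch equivalent, not the original code.
train['Item_Type2'] = train['Item_Identifier'].str[:2].map(di)   # same result as the loop above
test['Item_Type2'] = test['Item_Identifier'].str[:2].map(di)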
New Visualization:
We do the same for the test dataframe (covered in the sketch above). Since Non-Consumable items can’t meaningfully have a fat content, we alter the Item_Fat_Content column accordingly.
for i in range(len(train)):
    if train.loc[i, 'Item_Type2'] == 'Non-Consumable':
        train.loc[i, 'Item_Fat_Content'] = 'Non-Consumable'
for i in range(len(test)):
    if test.loc[i, 'Item_Type2'] == 'Non-Consumable':
        test.loc[i, 'Item_Fat_Content'] = 'Non-Consumable'
Item_Fat_Content now looks like this:
Encoding
Since we are using the scikit-learn library for our models, and its estimators cannot work directly with string-valued categorical variables, we need encoding to turn these categorical columns into numeric attributes. We used One Hot Encoding (funny name, eh?) for our attributes: each category gets its own column, with 1 for the rows that have that value and 0 for those that don’t.
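As a tiny illustration on made-up values (not the Big Mart data), this is what one-hot encoding does to a single column:
demo = pd.DataFrame({'Fat': ['Low Fat', 'Regular', 'Low Fat']})
print(pd.get_dummies(demo, columns=['Fat']))   # one indicator column per category: Fat_Low Fat, Fat_Regular
Now let’s encode the actual columns: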
cols.append('Item_Type2')   # add the new feature we created
cols.remove('Item_Type')    # remove the old Item_Type
from sklearn.preprocessing import LabelEncoder
for i in cols:
    train[i]=LabelEncoder().fit_transform(train[i])
for i in cols:
    test[i]=LabelEncoder().fit_transform(test[i])   # note: fitting separately on test assumes it has the same categories as train
train=pd.get_dummies(train,columns=cols)
test=pd.get_dummies(test,columns=cols)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 25 columns):
Item_Identifier 8523 non-null object
Item_Weight 8523 non-null float64
Item_Visibility 8523 non-null float64
Item_Type 8523 non-null object
Item_MRP 8523 non-null float64
Outlet_Identifier 8523 non-null object
Outlet_Establishment_Year 8523 non-null int64
Item_Outlet_Sales 8523 non-null float64
Years_operated 8523 non-null int64
Item_Fat_Content_0 8523 non-null uint8
Item_Fat_Content_1 8523 non-null uint8
Item_Fat_Content_2 8523 non-null uint8
Outlet_Size_0 8523 non-null uint8
Outlet_Size_1 8523 non-null uint8
Outlet_Size_2 8523 non-null uint8
Outlet_Location_Type_0 8523 non-null uint8
Outlet_Location_Type_1 8523 non-null uint8
Outlet_Location_Type_2 8523 non-null uint8
Outlet_Type_0 8523 non-null uint8
Outlet_Type_1 8523 non-null uint8
Outlet_Type_2 8523 non-null uint8
Outlet_Type_3 8523 non-null uint8
Item_Type2_0 8523 non-null uint8
Item_Type2_1 8523 non-null uint8
Item_Type2_2 8523 non-null uint8
dtypes: float64(4), int64(2), object(3), uint8(16)
memory usage: 732.6+ KB
As we can see, dummy variables have been generated according to the number of distinct labels in each categorical attribute. For example, Item_Fat_Content had ‘Low Fat’, ‘Regular’ and ‘Non-Consumable’, so we get 3 dummy columns for it.
The next step is to remove the now-redundant attributes from our dataframes, namely Item_Type and Outlet_Establishment_Year.
train.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
test.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
train.head()
Now, we’ll prepare our training and test sets
cols3=list(train.columns)
j=cols3.pop(5) #to get outlet_sales at last
cols3.append(j)
cols3.remove('Item_Identifier')
cols3.remove('Outlet_Identifier')
X_train=train[cols3[:-1]]
X_test=test[cols3[:-1]]
y_train=train[cols3[-1]]
After all that preprocessing and feature engineering, we have finally reached my favourite part: model selection. Since sales is a continuous variable, we need regression here.
from sklearn.linear_model import Lasso
las=Lasso().fit(X_train, y_train)
test2=pd.read_csv(url2)   # reload the raw test set for easy viewing
test2["Item_Outlet_Sales"]=las.predict(X_test)   # a new column is made for the predictions
test2[["Item_Identifier","Outlet_Identifier","Item_Outlet_Sales"]].to_csv("Predictions.csv", index=False)
This creates a CSV file containing the identifiers and the predicted sales for each item-outlet pair. It’ll look like this:
 | Item_Identifier | Outlet_Identifier | Item_Outlet_Sales |
---|---|---|---
0 | FDW58 | OUT049 | 1767.496700 |
1 | FDW14 | OUT017 | 1523.687095 |
2 | NCN55 | OUT010 | 1907.343992 |
3 | FDQ58 | OUT017 | 2539.158000 |
4 | FDY38 | OUT027 | 5183.936817 |
… | … | … | … |
5676 | FDB58 | OUT046 | 2360.714311 |
5677 | FDD47 | OUT018 | 2464.120755 |
5678 | NCO17 | OUT045 | 1914.865997 |
5679 | FDJ26 | OUT017 | 3501.090376 |
5680 | FDU37 | OUT045 | 1369.458339 |
5681 rows × 3 columns
This completes the first three steps of our flow chart. The next part is hyperparameter tuning, which becomes especially important when using neural networks to make predictions.
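As a taste of what that tuning step can look like for the Lasso model used here, scikit-learn’s GridSearchCV can search over the regularization strength alpha. This is only a sketch building on the X_train and y_train defined above, not something from the original post.
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}   # candidate regularization strengths
search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_)          # the alpha with the best cross-validated RMSE
best_model = search.best_estimator_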
Check out my repository on GitHub to find the project in much more detail. You’ll find many more graphs and a lot more colors, if that’s what you’re looking for. github.com/SurajSarangi/Big-Mart-Sales-Prediction
Please give it a star if you happen to be mesmerized by the colours and graphs. Feel free to check out the other machine learning projects as well.
About the author
Suraj Sarangi is an undergrad pursuing a BTech degree. Python and deep learning are his weapons of choice. He is well versed in English and likes to debate. Apparently, he loves football and coffee more than anything in his life.
Reviews
If you find it interesting, we would really like to hear from you.
Ping us on Instagram: @the.blur.code
If you want articles on any topic, DM us on Instagram.
Thanks for reading!!
Happy Coding