Rookie Data Scientist Guide
A comprehensive walkthrough for the Big Mart Sales prediction project using Machine Learning. A perfect start to becoming a Data Scientist.
‘Machine learning’ is one of the hottest topics in tech right now. Are you confused about where to start? Maybe you have finished some online courses but still don’t feel confident enough to build your own project. Courses that walk you through a guided project are a great start, but they are still nowhere close to developing your own models from scratch.
Contrary to popular belief, machine learning is not just about making a model that spits out predictions. This flow chart will help you understand the principal components of a machine learning project:
The chart shows the steps in building a working model:
- The first step is preprocessing. In my experience this is the most time-consuming step. It needs careful observation of the datasets we are given and analysis of the relations between the attributes and the target. It also involves imputing missing values, since NaN values are not very friendly to work with. The last part of this step is dropping features that don’t seem important.
- The second step is feature engineering. This is the most important step, as it can make a weak dataset work really well. It involves scaling features whose ranges are very wide, and some attributes may be too unwieldy to work with directly, so we create derived variables from them that are more informative and computationally cheaper.
Then comes building a model for predictions. There are linear models like Linear Regression, Lasso and Ridge, tree-based models like Decision Tree and Random Forest, and there is always a neural network somewhere around the corner.
After building the model, its hyperparameters are carefully tuned based on training and validation performance. The model can then be ensembled with other algorithms. There is one final step, not shown in the chart: deployment of the model, which typically means wrapping it in an app and hosting it in the cloud.
Preprocessing
The first step is getting data. You can download a dataset from the internet or, if you have the resources, collect your own. Raw data usually comes with errors, missing values and unwanted attributes. While the rules of English put great emphasis on punctuation and spacing, our machine simply doesn’t care, so cleaning the text, such as removing extra whitespace and unnecessary punctuation, is generally the first step when building text classification models. Popular image classification datasets tend to come relatively clean, though the images still typically need resizing and normalization before training.
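As a quick, purely illustrative sketch of that kind of text cleanup (on a made-up list of strings, nothing to do with the Big Mart data):
import string
# Hypothetical raw text samples; any list of strings would do.
raw_docs = ["  Great product!!  ", "not   worth it...", "Okay, I guess?  "]
def clean_text(text):
    text = text.lower().strip()                                        # normalize case, trim outer whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation marks
    return ' '.join(text.split())                                      # collapse repeated inner whitespace
cleaned = [clean_text(doc) for doc in raw_docs]
print(cleaned)   # ['great product', 'not worth it', 'okay i guess']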
Let’s work on a business problem: predicting sales for Big Mart. Such a model can help the business understand which products will be in demand and which ones it should consider dropping. Let’s load the datasets and the libraries required to process the data.
import pandas as pd
url1 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Train.csv'
train = pd.read_csv(url1)
url2 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Test.csv'
test = pd.read_csv(url2)
train.head()
 | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
2 | FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
3 | FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
4 | NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
Two important functions which help in getting some information about the dataset are:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier 8523 non-null object
Item_Weight 7060 non-null float64
Item_Fat_Content 8523 non-null object
Item_Visibility 8523 non-null float64
Item_Type 8523 non-null object
Item_MRP 8523 non-null float64
Outlet_Identifier 8523 non-null object
Outlet_Establishment_Year 8523 non-null int64
Outlet_Size 6113 non-null object
Outlet_Location_Type 8523 non-null object
Outlet_Type 8523 non-null object
Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB
train.describe()
 | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
---|---|---|---|---|---
count | 7060.000000 | 8523.000000 | 8523.000000 | 8523.000000 | 8523.000000 |
mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
std | 4.643456 | 0.051598 | 62.275067 | 8.371760 | 1706.499616 |
min | 4.555000 | 0.000000 | 31.290000 | 1985.000000 | 33.290000 |
25% | 8.773750 | 0.026989 | 93.826500 | 1987.000000 | 834.247400 |
50% | 12.600000 | 0.053931 | 143.012800 | 1999.000000 | 1794.331000 |
75% | 16.850000 | 0.094585 | 185.643700 | 2004.000000 | 3101.296400 |
max | 21.350000 | 0.328391 | 266.888400 | 2009.000000 | 13086.964800 |
Visualizing the data with different plots is really helpful for understanding how the attributes relate to each other and to the target. Distplots, countplots, histograms and scatterplots are some of the major visualization methods, and analyzing correlation helps determine the most important features in the data. We use the seaborn and matplotlib libraries for our visualization:
import matplotlib.pyplot as plt
import seaborn as sb
plt.style.use('seaborn')
plt.figure(figsize=(12,5))
sb.distplot(train.Item_Outlet_Sales,bins=20,color='red')
plt.xlabel('Item Outlet Sales',fontsize=15)
plt.ylabel('No. of Sales',fontsize=15)
plt.title('Histogram of Target',fontweight='bold',fontsize=17);
I love graphs, and playing with colors and plots is a must if you’re new to this. You might be wondering about the semicolon (;) at the end.
> “It’s Python, it’s a sin to use a semicolon!”
Well, when working with plots in a notebook, the semicolon suppresses the noisy object output that would otherwise be printed above the figure. Try it for yourself and see.
Univariate Analysis
We split the numerical and categorical columns into separate dataframes, as the analysis for each is different.
import numpy as np
num=train.select_dtypes(include=np.number)
cate=train.select_dtypes(exclude=np.number)
Numeric Data
Numeric data is analysed using correlation. Correlation measures how strongly the values of one column move with the values of another, from -1 for a perfect inverse relationship to +1 for a perfect direct one. We can use corr() to get the correlation matrix.
co=num.corr()
co
 | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
---|---|---|---|---|---
Item_Weight | 1.000000 | -0.014048 | 0.027141 | -0.011588 | 0.014123 |
Item_Visibility | -0.014048 | 1.000000 | -0.001315 | -0.074834 | -0.128625 |
Item_MRP | 0.027141 | -0.001315 | 1.000000 | 0.005020 | 0.567574 |
Outlet_Establishment_Year | -0.011588 | -0.074834 | 0.005020 | 1.000000 | -0.049135 |
Item_Outlet_Sales | 0.014123 | -0.128625 | 0.567574 | -0.049135 | 1.000000 |
Now, if you found that matrix boring, I wouldn’t say you’re entirely wrong. Let’s see the same matrix in a much nicer way.
plt.figure(figsize=(8,6))
sb.heatmap(co,square=True,annot=True)
ax=plt.gca()
bo,to=ax.get_ylim()
ax.set_ylim(bo+0.5,to-0.5)   # workaround for a matplotlib 3.1.1 bug that clips the top and bottom rows of heatmaps
plt.title('Correlation Heatmap',fontweight='bold',fontsize=17);
Now let’s look at the correlations with the target, sorted from strongest to weakest:
co['Item_Outlet_Sales'].sort_values(ascending=False)
Item_Outlet_Sales            1.000000
Item_MRP                     0.567574
Item_Weight                  0.014123
Outlet_Establishment_Year   -0.049135
Item_Visibility             -0.128625
Name: Item_Outlet_Sales, dtype: float64
As we can see, the sales column is most strongly correlated with MRP (we disregard the 1.000000 at the top, which is just the column’s correlation with itself).
Categorical Data
For Categorical Data, we plot Countplots.
plt.figure(figsize=(15,5))
cols=list(cate.columns)
cols.remove('Item_Identifier') # we don’t need the identifiers
cols.remove('Outlet_Identifier')
sb.countplot(cate[cols[0]]);
We can see irregular labels such as ‘LF’, ‘low fat’ and ‘reg’, which are just different spellings of ‘Low Fat’ and ‘Regular’. We’ll fix these later.
plt.figure(figsize=(15,5))
sb.countplot(cate[cols[1]])
plt.title(cols[1],fontweight='bold',fontsize=15)
plt.xticks(rotation=90,fontsize=15)
plt.xlabel(cols[1],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Count',fontsize=15);
There are a lot of categories in Item_Type. Keeping so many of them would make computation wasteful, so we’ll reduce them later in feature engineering.
Likewise, we can plot the graphs for all the other categorical data.
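The remaining countplots can be generated with a small loop like the one below; this is just a sketch in the same style as the plots above, not code from the original notebook.
for col in cols[2:]:                       # the categorical columns not plotted yet
    plt.figure(figsize=(15,5))
    sb.countplot(cate[col])
    plt.title(col,fontweight='bold',fontsize=15)
    plt.xticks(rotation=90,fontsize=13)
    plt.show()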
Bivariate Analysis
We make another list holding all the column names of the training set.
cols2=list(train.columns)
This is again divided into numeric, categorical and mixed analysis.
Numeric
vis_pt=train.pivot_table(index=cols2[4], values=cols2[3], aggfunc=np.median)
vis_pt.plot(kind='bar',color='darkorchid',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[4],fontsize=15)
plt.ylabel(cols2[3],fontsize=15)
plt.title('Item Visibility and Item Type',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
As we can see, there are a lot of zeros in this feature; these need to be fixed when we impute missing values.
year_pt=train.pivot_table(index=cols2[-5], values=cols2[-1], aggfunc=np.median)
year_pt.plot(kind='bar',color='k',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[-5],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Establishment Year and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
There is an unusual drop in the year 1998. This makes the trend very irregular, and hence we need to do some feature engineering for this particular attribute.
Categorical
We can use pivot tables to plot categorical data against the target.
type_pt=train.pivot_table(index=cols[1], values=cols2[-1], aggfunc=np.median)
type_pt.plot(kind='bar',color='deeppink',figsize=(15,5),alpha=0.8)
plt.xlabel(cols[1],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Item type and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();
We can plot for all the remaining pairs and see if there is any irregularity.
Now that we have visualized the data and found the problems, we move on to fixing them, starting with the imputation of missing values and then feature engineering.
Imputation of Missing Values
Models really dislike gaps in the data in the form of NaN values, but real data is rarely free of them. Filling these values is usually done by taking the average over the column, or sometimes simply by placing a zero. Separately, based on the correlation analysis, the most useful features are shortlisted while the less important ones are dropped.
From info() we found that ‘Item_Weight’ and ‘Outlet_Size’ had null values.
mr=np.mean(train['Item_Weight'])
train['Item_Weight'].fillna(mr,inplace=True)
mr2=np.mean(test['Item_Weight'])
test['Item_Weight'].fillna(mr2,inplace=True)
This imputes the NaN values in the weight column of train and test with the mean of the respective column. The mean is generally a preferred choice for numeric data.
kp=(train.mode(axis=0))['Outlet_Size'].iloc[0]
train['Outlet_Size'].fillna(value=kp,inplace=True)
kp2=(test.mode(axis=0))['Outlet_Size'].iloc[0]
test['Outlet_Size'].fillna(kp2,inplace=True)
This imputes the NaN values in Outlet_Size with the mode of the column in the train and test sets. The mode is generally a preferred choice for categorical data.
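If you prefer to keep the imputation inside scikit-learn, SimpleImputer can do both jobs. This is only an alternative sketch, not part of the original notebook; also note that it fills the test set with the training-set statistics, which is the usual convention, whereas the code above fills each set with its own mean/mode.
from sklearn.impute import SimpleImputer
# Mean imputation for the numeric weight column.
num_imp = SimpleImputer(strategy='mean')
train[['Item_Weight']] = num_imp.fit_transform(train[['Item_Weight']])
test[['Item_Weight']] = num_imp.transform(test[['Item_Weight']])
# Most-frequent (mode) imputation for the categorical outlet size.
cat_imp = SimpleImputer(strategy='most_frequent')
train[['Outlet_Size']] = cat_imp.fit_transform(train[['Outlet_Size']])
test[['Outlet_Size']] = cat_imp.transform(test[['Outlet_Size']])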
Feature Engineering
Resurrecting Fat Content
train.replace("LF","Low Fat",inplace=True)
train.replace("low fat","Low Fat",inplace=True)
train.replace("reg","Regular",inplace=True)
test.replace("LF","Low Fat",inplace=True)
test.replace("low fat","Low Fat",inplace=True)
test.replace("reg","Regular",inplace=True)
This replaces the irregular labels in the fat-content column of train and test. The new visualization looks like this:
Some items had zero visibility, which makes no sense, but the same items had non-zero visibility in other outlets. Hence, we replace the zeros with each item’s average visibility:
pt = train.pivot_table(values='Item_Visibility', index='Item_Identifier')   # average visibility per item
for i in range(len(train)):
    if train.loc[i, 'Item_Visibility'] == 0:
        train.loc[i, 'Item_Visibility'] = pt.loc[train.loc[i, 'Item_Identifier'], 'Item_Visibility']
The new visualization:
We do the same for the test dataframe as well.
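The original post doesn’t show that test-set version, but a direct adaptation of the loop above would look roughly like this (using the test set’s own item-level averages):
pt_test = test.pivot_table(values='Item_Visibility', index='Item_Identifier')   # average visibility per item in test
for i in range(len(test)):
    if test.loc[i, 'Item_Visibility'] == 0:
        test.loc[i, 'Item_Visibility'] = pt_test.loc[test.loc[i, 'Item_Identifier'], 'Item_Visibility']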
We change the Establishment Year to the number of years operated. This data is from 2013.
train['Years_operated']=2013-train['Outlet_Establishment_Year']
test['Years_operated']=2013-test['Outlet_Establishment_Year']
It looks like this:
We saw that Item_Type has a lot of categories. Taking a closer look at the dataset, the Item_Identifier column has values beginning with FD, DR or NC. We can use this to create a new, coarser item type: FD for Food, DR for Drinks, NC for Non-Consumable.
di={'FD':'Food','DR':'Drinks','NC':'Non-Consumable'}
l=[]
for i in train['Item_Identifier']:
    l.append(di[i[:2]])   # the first two characters of the identifier give the broad type
train['Item_Type2']=l
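For the test dataframe, and as a more compact alternative to the loop, the same mapping can be done in one line per dataframe with pandas string slicing; this is a sketch equivalent, not the original code.
train['Item_Type2'] = train['Item_Identifier'].str[:2].map(di)   # same result as the loop above
test['Item_Type2'] = test['Item_Identifier'].str[:2].map(di)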
New Visualization:
We do the same for the test dataframe (covered in the sketch above). Since Non-Consumable items can’t meaningfully have a fat content, we alter the Item_Fat_Content column accordingly.
for i in range(len(train)):
    if train.loc[i, 'Item_Type2'] == 'Non-Consumable':
        train.loc[i, 'Item_Fat_Content'] = 'Non-Consumable'
for i in range(len(test)):
    if test.loc[i, 'Item_Type2'] == 'Non-Consumable':
        test.loc[i, 'Item_Fat_Content'] = 'Non-Consumable'
Item_Fat_Content now looks like this:
Encoding
Since we are using the scikit-learn library for our models, and its estimators cannot work directly with string-valued categorical variables, we need encoding to turn these categorical columns into numeric attributes. We used One Hot Encoding (funny name, eh?) for our attributes: each category gets its own column, with 1 for the rows that have that value and 0 for those that don’t.
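As a tiny illustration on made-up values (not the Big Mart data), this is what one-hot encoding does to a single column:
demo = pd.DataFrame({'Fat': ['Low Fat', 'Regular', 'Low Fat']})
print(pd.get_dummies(demo, columns=['Fat']))   # one indicator column per category: Fat_Low Fat, Fat_Regular
Now let’s encode the actual columns: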
cols.append('Item_Type2')   # add the new feature we created
cols.remove('Item_Type')    # remove the old Item_Type
from sklearn.preprocessing import LabelEncoder
for i in cols:
    train[i]=LabelEncoder().fit_transform(train[i])
for i in cols:
    test[i]=LabelEncoder().fit_transform(test[i])   # note: fitting separately on test assumes it has the same categories as train
train=pd.get_dummies(train,columns=cols)
test=pd.get_dummies(test,columns=cols)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 25 columns):
Item_Identifier 8523 non-null object
Item_Weight 8523 non-null float64
Item_Visibility 8523 non-null float64
Item_Type 8523 non-null object
Item_MRP 8523 non-null float64
Outlet_Identifier 8523 non-null object
Outlet_Establishment_Year 8523 non-null int64
Item_Outlet_Sales 8523 non-null float64
Years_operated 8523 non-null int64
Item_Fat_Content_0 8523 non-null uint8
Item_Fat_Content_1 8523 non-null uint8
Item_Fat_Content_2 8523 non-null uint8
Outlet_Size_0 8523 non-null uint8
Outlet_Size_1 8523 non-null uint8
Outlet_Size_2 8523 non-null uint8
Outlet_Location_Type_0 8523 non-null uint8
Outlet_Location_Type_1 8523 non-null uint8
Outlet_Location_Type_2 8523 non-null uint8
Outlet_Type_0 8523 non-null uint8
Outlet_Type_1 8523 non-null uint8
Outlet_Type_2 8523 non-null uint8
Outlet_Type_3 8523 non-null uint8
Item_Type2_0 8523 non-null uint8
Item_Type2_1 8523 non-null uint8
Item_Type2_2 8523 non-null uint8
dtypes: float64(4), int64(2), object(3), uint8(16)
memory usage: 732.6+ KB
As we can see, dummy variables have been generated according to the number of distinct labels in each categorical attribute. For example, Item_Fat_Content had ‘Low Fat’, ‘Regular’ and ‘Non-Consumable’, so we get 3 dummy columns for it.
The next step is to remove the now-redundant attributes from our dataframes, namely Item_Type and Outlet_Establishment_Year.
train.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
test.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
train.head()
Now, we’ll prepare our training and test sets
cols3=list(train.columns)
j=cols3.pop(5) #to get outlet_sales at last
cols3.append(j)
cols3.remove('Item_Identifier')
cols3.remove('Outlet_Identifier')
X_train=train[cols3[:-1]]
X_test=test[cols3[:-1]]
y_train=train[cols3[-1]]
After all that preprocessing and feature engineering, we have finally reached my favourite part: model selection. Since sales is a continuous variable, we need regression here.
from sklearn.linear_model import Lasso
las=Lasso().fit(X_train, y_train)
test2=pd.read_csv(url2)   # reload the raw test set for easy viewing
test2["Item_Outlet_Sales"]=las.predict(X_test)   # a new column is made for the predictions
test2[["Item_Identifier","Outlet_Identifier","Item_Outlet_Sales"]].to_csv("Predictions.csv", index=False)
This creates a CSV file containing the identifiers and the predicted sales for each item-outlet pair. It’ll look like this:
 | Item_Identifier | Outlet_Identifier | Item_Outlet_Sales |
---|---|---|---
0 | FDW58 | OUT049 | 1767.496700 |
1 | FDW14 | OUT017 | 1523.687095 |
2 | NCN55 | OUT010 | 1907.343992 |
3 | FDQ58 | OUT017 | 2539.158000 |
4 | FDY38 | OUT027 | 5183.936817 |
… | … | … | … |
5676 | FDB58 | OUT046 | 2360.714311 |
5677 | FDD47 | OUT018 | 2464.120755 |
5678 | NCO17 | OUT045 | 1914.865997 |
5679 | FDJ26 | OUT017 | 3501.090376 |
5680 | FDU37 | OUT045 | 1369.458339 |
5681 rows × 3 columns
This completes the first three steps of our flow chart. The next part is hyperparameter tuning, which becomes especially important when using neural networks to make predictions.
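As a taste of what that tuning step can look like for the Lasso model used here, scikit-learn’s GridSearchCV can search over the regularization strength alpha. This is only a sketch building on the X_train and y_train defined above, not something from the original post.
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}   # candidate regularization strengths
search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_)          # the alpha with the best cross-validated RMSE
best_model = search.best_estimator_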
Check out my repository on GitHub to find the project in much more detail. You’ll find many more graphs and a lot more colors, if that’s what you’re looking for. github.com/SurajSarangi/Big-Mart-Sales-Prediction
Please give it a star if you happen to be mesmerized by the colours and graphs. Feel free to check out the other machine learning projects as well.
About the author
Suraj Sarangi is an undergrad pursuing a BTech degree. Python and deep learning are his weapons of choice. He is well versed in English and likes to debate. Apparently, he loves football and coffee more than anything in his life.
Reviews
If you find it interesting, we would really like to hear from you.
Ping us on Instagram: @the.blur.code
If you want articles on any topic, DM us on Instagram.
Thanks for reading!!
Happy Coding