1) Data Preparation
2) Exploratory Data Analysis (EDA)
2.1) How many orders for each country?
2.2) How much money did each country spend?
2.3) How were sales over the months?
2.4) How were orders distributed by day (of month)?
2.5) How were orders distributed by day (of week)?
2.6) How were orders distributed by hour?
2.7) What are the most common stock codes?
2.8) Which stock codes contribute the most to revenue?
3) Categorizing Products
3.1) Cleaning Product Data
3.2) Product Clustering
Loading all the modules that will be used:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from webcolors import CSS3_NAMES_TO_HEX
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)
from pandas_profiling import ProfileReport
Then, reading the data set and taking a first look:
df = pd.read_csv('../1985_3404_bundle_archive/data.csv', encoding = 'ISO-8859-1')
df.head()
Let's standardize the dataframe's column names and text to make handling easier.
# New column names
new_cols = ['invoice_num', 'stock_code', 'description', 'quantity',
'invoice_date', 'unit_price', 'customer_id', 'country']
df.columns = new_cols
df.head()
# New columns to represent stock code, description and country in lower case
# Lowercase stock code
df.loc[~df['stock_code'].isna(), 'lowercase_stockcode'] = df.loc[
~df['stock_code'].isna(), 'stock_code'
].apply(lambda text: text.lower())
# Lowercase description
df.loc[~df['description'].isna(), 'lowercase_description'] = df.loc[
~df['description'].isna(), 'description'
].apply(lambda text: text.lower())
# Lowercase country
df.loc[~df['country'].isna(), 'lowercase_country'] = df.loc[
~df['country'].isna(), 'country'
].apply(lambda text: text.lower())
df.head()
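As a side note, the same lowercase columns can be built more concisely with the .str accessor, which propagates missing values automatically (a sketch producing equivalent results):
# Equivalent alternative: .str.lower() returns NaN for missing values, so no masking is needed
df['lowercase_stockcode'] = df['stock_code'].str.lower()
df['lowercase_description'] = df['description'].str.lower()
df['lowercase_country'] = df['country'].str.lower()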
Creating a report to get an overview of the data:
# This doesn't appear on github, but the design is great
# It is used only as an overview
# Everything we need from this will be explored later
profile = ProfileReport(df, title='Profiling E-Commerce',html={'style':{'full_width':True}})
profile.to_notebook_iframe()
Let's focus on data types.
df.info()
# Distribution of missing values
plt.figure(figsize=(13,9))
sns.heatmap(df.isna(), cmap='magma')
plt.xticks(rotation=45, fontsize=12)
plt.tight_layout()
There is no problem keeping fields we won't do calculations on as "object", such as invoice_num and stock_code. There are missing values in customer_id, but I will keep them for now and convert customer_id to string ("object").
# We don't need ".0" in the end of the customer_id values
df.loc[~df['customer_id'].isna(), 'customer_id'] = df.loc[
~df['customer_id'].isna(), 'customer_id'
].apply(lambda ID: str(ID)[:-2])
df['customer_id'].value_counts()
df.head()
The next step is to transform invoice_date to "datetime".
df['invoice_date'] = pd.to_datetime(df['invoice_date'])
df.info()
I will split the invoice_date column into year, month, month_day, week_day, hour and year_month. This way, we can group the data however we want for any time-based analysis.
df['year'] = df['invoice_date'].apply(lambda time: time.year)
df['month'] = df['invoice_date'].apply(lambda time: time.month)
df['month_day'] = df['invoice_date'].apply(lambda time: time.day)
df['week_day'] = df['invoice_date'].apply(lambda time: time.weekday())
df['hour'] = df['invoice_date'].apply(lambda date: date.hour)
# zfill(2) is used to fill the string with zeros until it is 2 characters long.
df['year_month'] = df['invoice_date'].apply(lambda time: str(time.year) + str(time.month).zfill(2))
df.head()
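As a side note, the same time columns can be derived with the vectorized .dt accessor instead of row-by-row apply calls (a sketch producing equivalent results):
# Vectorized alternative using the .dt accessor
df['year'] = df['invoice_date'].dt.year
df['month'] = df['invoice_date'].dt.month
df['month_day'] = df['invoice_date'].dt.day
df['week_day'] = df['invoice_date'].dt.weekday
df['hour'] = df['invoice_date'].dt.hour
df['year_month'] = df['invoice_date'].dt.strftime('%Y%m')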
We have quantity and unit_price, but we don't have the amount spent on each line. Creating the amount_spent column will make it easier to analyse revenue.
df['amount_spent'] = df['quantity'] * df['unit_price']
Taking a look at missing values...
df.isna().sum().sort_values(ascending=False)
Perhaps there are non-numeric fields with missing values that did not appear in the previous analysis. Missing values can appear in strange ways, such as the strings "", "na", "nan" and "null".
nullsdict = {}
nulls_vec = ['', 'na', 'nan', 'null']
for col in df.columns:
str_nan = df.loc[~df[col].isna(), col].apply(lambda row: row in nulls_vec)
real_nan = df[col].isna()
nullsdict.update({col:{'str_nan':sum(str_nan),
'real_nan':sum(real_nan)}})
pd.DataFrame(nullsdict).T
We could erase the NaN values, but that would damage the revenue total.
There is nothing we can do about the missing customer_id values for now. I will analyse the description column in the Exploratory Data Analysis section.
In the Data Preparation section, we observed missing values in the lowercase_description column. Let's take an overview of the products.
products = df[['lowercase_stockcode', 'lowercase_description']].drop_duplicates()
products.head()
products['lowercase_stockcode'].value_counts()
There is more than one description for the same product. Looking at the product with stock code 20713...
products[products['lowercase_stockcode'] == '20713']
The product with stock_code 20713 actually is "jumbo bag owls". Besides that, there are missing values, a note that the product was incorrectly marked as 23343, and a "found(?)" entry.
Regardless, we can assume the revenue amount is correct, so we will not change any description or unit price.
products['lowercase_stockcode'].map(len).value_counts()
According to the data set's source, if the invoice number starts with the letter "C", it indicates a cancellation.
Separately, there are stock_code values with different lengths.
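As a quick check of how many invoices are cancellations (a sketch; invoice_num values are strings in this dataframe):
# Invoices whose number starts with "c"/"C" are cancellations
df['invoice_num'].astype(str).str.lower().str.startswith('c').sum()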
# stock_code with 6 characters
products[products['lowercase_stockcode'].map(len) == 6].head()
# stock_code with 5 characters
products[products['lowercase_stockcode'].map(len) == 5].head()
It seems that stock_code values with 5 or 6 characters represent standard sales.
products['lowercase_stockcode'].map(len).value_counts().index
# Create a dataframe with stock_code descriptions
def desc_stockcode(prod):
# stock_code lengths different from 5 and 6
stockcode_len = [8, 12, 1, 7, 4, 9, 3, 2]
stockcode_len = sorted(stockcode_len)
df_stockcode = (prod[prod['lowercase_stockcode'].map(len) == stockcode_len[0]])
for length in stockcode_len[1:]:
df_stockcode = pd.concat([df_stockcode,
prod[prod['lowercase_stockcode'].map(len) == length]],
ignore_index=True)
return df_stockcode
desc_stockcode(products)
Analysing the previous dataframe, stock_code can also represent special operations such as discounts, bad debt and gift vouchers.
df.describe()
There are negative values in quantity and unit_price.
len(df[(df['quantity'] < 0)])
len(df[(df['unit_price'] < 0)])
There are 10624 negative values in quantity and only 2 in unit_price. I will explore these values and find out what they mean.
# Starting with unit_price, because there are only 2 negative values.
df[(df['unit_price'] < 0)]
Negative unit_price values are bad debt.
# Checking negative quantity
df[(df['quantity'] < 0)]
# Checking invoice_num starting character.
df[(df['quantity'] < 0)]['invoice_num'].apply(lambda invoice: invoice[0]).value_counts()
df[(df['quantity'] < 0)].sort_values('invoice_num')
# How many sales had unit_price equal to 0?
sum(df['unit_price'] == 0)
# How many invoice_num codes start with "5"?
sum(df[df['unit_price'] == 0]['invoice_num'].apply(lambda invoice: invoice[0] == '5'))
All invoices with free items have an invoice_num starting with "5".
# Analysing unit_price equal 0
plt.figure(figsize=(12,8))
df[df['unit_price'] == 0]['quantity'].hist(bins=30, color='steelblue')
plt.xticks(fontsize=12)
plt.tight_layout()
# Descriptions of "free items" with positive quantity
df[(df['quantity'] > 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
# Descriptions of "free items" with negative quantity
df[(df['quantity'] < 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
There are positive and negative quantity values for invoices with unit_price equal to 0. Some of them are damaged or display items, as we can see. Perhaps some are free items for promotions or events.
Now that we know our data set better, let's do some visualization and answer some questions.
plt.figure(figsize=(12,8))
ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Orders per Country', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
# UK stands out too much on the chart
ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts().iloc[1:]
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Orders per Country (without UK)', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False)
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Revenue per Country', fontsize=15)
plt.xlabel('Amount (millions)', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
# UK stands out too much on the chart
ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False).iloc[1:]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Revenue per Country (without UK)', fontsize=15)
plt.xlabel('Amount', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
This simple analysis shows that we have two important points of view:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Revenue per Month
ax = df.groupby('year_month').sum()['amount_spent']
sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])
axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
axes[0].set_xticklabels(label, rotation=45)
# Orders x Month
ax = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])
axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)
axes[1].set_xticklabels(label, rotation=45)
plt.tight_layout()
Remember that sales occurred between 1 December 2010 and 9 December 2011, so December 2011 only includes sales through the 9th.
Although the behavior of revenue and orders per month tends to be similar, the average amount spent per order varies from month to month. This will be discussed in the Average Ticket section.
January, February and April had the worst sales, apart from the partial month of December 2011. On the other hand, September, October and November had the best sales: in these months there is growing demand for Christmas gifts and wholesalers are preparing for Black Friday (the Friday after US Thanksgiving, in late November).
# Comparing UK sales to sales outside UK
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Revenue per Month (Only UK)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('year_month').sum()['amount_spent']
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[0])
# Revenue per Month (Without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('year_month').sum()['amount_spent']
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])
axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
axes[0].set_xticklabels(label, rotation=45)
axes[0].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
# Orders x Month (Only UK)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[1])
# Orders x Month (Without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])
axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)
axes[1].set_xticklabels(label, rotation=45)
axes[1].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Sales outside the UK appear to be more stable. Although the "Without UK" curve is much lower than the "UK Only" curve, it looks smoother over the months. Christmas demand is more pronounced for sales in the UK than for sales outside it.
plt.figure(figsize=(10,6))
ax1 = df.groupby('year_month')['amount_spent'].sum()
ax2 = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Average Ticket', fontsize=15)
plt.xlabel('Month', fontsize=13)
plt.ylabel('Amount', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
plt.xticks(ticks=range(len(label)), labels=label, rotation=45, fontsize=12)
plt.tight_layout()
# Comparing UK Average Ticket to Average Ticket outside UK
fig, axes = plt.subplots(figsize=(12, 6))
# Average Ticket (UK only)
ax1 = df[
df['lowercase_country'] == 'united kingdom'
].groupby('year_month')['amount_spent'].sum()
ax2 = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
# Average Ticket (without UK)
ax1 = df[
df['lowercase_country'] != 'united kingdom'
].groupby('year_month')['amount_spent'].sum()
ax2 = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Average Ticket', fontsize=15)
axes.set_xlabel('Month', fontsize=13)
axes.set_ylabel('Amount', fontsize=13)
axes.set_xticklabels(label, rotation=45)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Although revenue and the number of orders in the UK are higher than abroad, the opposite is true for the average ticket. This indicates that the average amount spent per order outside the UK is higher.
Looking at all the sales over the months, we can see a significant drop in April.
There are some key dates that move retail, and April falls right between them:
February: Valentine's Day
May: Mother's Day
June: Father's Day
August/September: Back to school (most of the countries in the data set are in the northern hemisphere)
October: Halloween
November: Black Friday
December: Christmas
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Day', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
# Comparing UK orders to orders outside UK per day
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per day (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Day', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
There is substantial demand between the 5th and the 10th, but it is not enough to state a trend with certainty.
For this analysis, I will use polynomial regression, taking the number of orders per day as a series. The goal is just to plot a trend line.
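The repeated PolynomialFeatures + LinearRegression pattern can be wrapped in a small helper. The sketch below shows the idea; the cells that follow keep the inline version:
# Fit a polynomial trend of the given degree and return the fitted values
def polynomial_trend(x, y, degree):
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    return model.predict(x_poly)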
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
for i in range(4):
# Where each subplot will be
if i < 2:
plot = (0,i)
else:
plot = (1,(i-2))
# Order per day (UK only)
y = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='orange', ax=axes[plot])
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=(i+1))
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes[plot].plot(x, y_poly_pred, 'r--')
axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
axes[plot].set_ylabel('')
axes[plot].set_ylim(ymin=0)
plt.tight_layout()
The dashed red line shows the trend. The second- and third-degree fits are almost the same. I will use the third degree; higher degrees would just overfit.
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (UK only)
y = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='orange', ax=axes)
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes.plot(x, y_poly_pred, 'r--')
axes.set_title('Orders per Day (UK Only)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)
plt.ylim(ymin=0)
plt.tight_layout()
In the UK, the number of orders starts high at the beginning of the month and falls over the days.
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
for i in range(4):
# Where each subplot will be
if i < 2:
plot = (0,i)
else:
plot = (1,(i-2))
# Orders per day (without UK)
y = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='steelblue', ax=axes[plot])
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=(i+1))
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes[plot].plot(x, y_poly_pred, 'r--')
axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
axes[plot].set_ylabel('')
axes[plot].set_ylim(ymin=0)
plt.tight_layout()
The third-degree polynomial fits the number of orders per day better.
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (without UK)
y = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='steelblue', ax=axes)
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes.plot(x, y_poly_pred, 'r--')
axes.set_title('Orders per Day (Without UK)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)
plt.ylim(ymin=0)
plt.tight_layout()
Outside the UK, we can see that the number of orders remains stable until the 20th and then starts to fall.
Although a third-degree polynomial cannot fully express the sales forecast, it did a good job helping us understand the chaotic behavior of orders per day.
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Day of Week', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
# week_day goes from 0 (Monday) to 6 (Sunday); there are no Saturday orders,
# so only six labels are needed
label = ['Mon','Tue','Wed','Thur','Fri','Sun']
tick = range(len(label))
plt.xticks(ticks=tick, labels=label, fontsize=12)
plt.tight_layout()
# Comparing UK orders to orders outside UK per day of week
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day of week (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per day of week (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Day of Week', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
label = ['','Mon','Tue','Wed','Thur','Fri', 'Sat', 'Sun']
axes.set_xticklabels(label, fontsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
The peak in the number of orders is on Thursday. Although Saturday appears on the chart, there are no recorded Saturday sales; the line simply interpolates between Friday and Sunday.
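We can confirm that there are no Saturday orders (a quick check; weekday() returns 5 for Saturday):
# Should return 0 if there really are no Saturday records
(df['week_day'] == 5).sum()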
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Hour', fontsize=15)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
Sales are concentrated around noon, which indicates a preference for buying around lunchtime.
# Comparing UK orders to orders outside UK per hour
fig, axes = plt.subplots(figsize=(10, 6))
# Order per hour (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per hour (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Hour', fontsize=15)
axes.set_xlabel('Hour', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Sales in the UK take place between 6 am and 8 pm, while sales outside the UK take place between 7 am and 7 pm.
These are the 20 most common stock codes and how much of the revenue they represent (percentage).
common_codes = []
for code in df['lowercase_stockcode'].value_counts()[:20].index:
common_codes.append((code,
products[
(products[~products.isna()]['lowercase_stockcode'] == code)
]['lowercase_description'].iloc[0],
df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
fig, axes = plt.subplots(2, 1, figsize=(12, 12))
# The 20 most common stock codes
ax = df['lowercase_stockcode'].value_counts()[:20]
sns.barplot(x=ax.index, y=ax,
palette=sns.cubehelix_palette(20, start=.5, rot=-.75),
ax=axes[0])
axes[0].set_title('Most Common Stock Codes', fontsize=15)
axes[0].set_xlabel('Stock Code', fontsize=13)
axes[0].set_ylabel('Quantity', fontsize=13)
axes[0].tick_params(axis="x", labelsize=12)
# Percentage of revenue (20 most common stock codes)
common_codes = []
for code in df['lowercase_stockcode'].value_counts()[:20].index:
common_codes.append((code,
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
ax = pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'percentage'])
sns.barplot(x='lowercase_stockcode', y='percentage', data=ax,
palette=sns.cubehelix_palette(20),
ax=axes[1])
axes[1].set_title('Percentage of Revenue', fontsize=15)
axes[1].set_xlabel('Stock Code', fontsize=13)
axes[1].set_ylabel('Percentage', fontsize=13)
axes[1].tick_params(axis="x", labelsize=12)
plt.tight_layout()
Being the most frequent does not necessarily mean representing a large share of the revenue. The product 85123A - "white hanging heart t-light holder" is the best seller but represents roughly 1.0% of the revenue, while the product 22423 - "regency cakestand 3 tier" is in second place and represents more than 1.6% of the revenue.
The top 20 stock codes that contribute the most to revenue:
top_revenue = []
for code in df.groupby(
'lowercase_stockcode'
)[
'amount_spent'
].sum().sort_values(ascending=False)[:20].index:
top_revenue.append((code,
products[
(products[~products.isna()]['lowercase_stockcode'] == code)
]['lowercase_description'].iloc[0],
df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
pd.DataFrame(top_revenue, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
plt.figure(figsize=(12,6))
# top 20 stock codes in revenue
ax = df.groupby(
'lowercase_stockcode'
)[
'amount_spent'
].sum().sort_values(ascending=False)[:20] / df['amount_spent'].sum() * 100
sns.barplot(x=ax.index, y=ax, palette='GnBu')
plt.title('Percentage of Revenue', fontsize=15)
plt.xlabel('Stock Code', fontsize=13)
plt.ylabel('Percentage', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
DOT means "dotcom postage", that is, it is not a product but a service. The real top product is stock code 22423 - "regency cakestand 3 tier", which stands out compared to the others.
We previously performed a superficial analysis of the stock codes. It's time to clean up this mess, separating actual products from other operations. This way, it will be possible to group products into categories.
products
Remember that we interpreted stock codes with lengths of 5 and 6 as products.
# stock_code lengths different from 5 and 6
desc_stockcode(products).dropna()
Of all the stock codes above, only a few actually represent products.
I will drop the stock codes that are not products.
products_clean = products.dropna()
products_clean
# List of stock codes to drop (not a product)
codesToDrop = ['d', 'm', 's', 'b', 'c2', 'dot', 'post', 'pads', 'cruk',
'79323gr', '79323lp', 'dcgs0073', 'dcgs0069', 'dcgs0068',
'dcgs0067', 'dcgs0003', 'amazonfee', 'bank charges',
'gift_0001_40', 'gift_0001_50', 'gift_0001_30', 'gift_0001_20',
'gift_0001_10']
products_clean = products_clean[~products_clean['lowercase_stockcode'].isin(codesToDrop)]
products_clean
We need to remove unnecessary content from the product descriptions before our analysis.
Punctuation, color names and numbers will not be very useful when we try to correlate products by their descriptions. Suppose one product is called "Action Figure: Red Swamp Monster" and another "15cm Action Figure"; between these two descriptions, what matters is that both are action figures, not the colon, the color or the size.
Removing stopwords will eliminate generic words that are common in many descriptions, and stemming will let us group variations of words with the same root and meaning.
# Stemming
ps = PorterStemmer()
def cleanText(text):
# Stripping punctuation from the string
# This performs raw string operations in C
text = text.translate(str.maketrans('', '', string.punctuation))
# Stripping numbers
text = ''.join([i for i in text if not i.isdigit()])
# Removing color names
text = ' '.join([i for i in text.split() if(i not in CSS3_NAMES_TO_HEX)])
text = text.strip().split()
# Removing stopwords and stemming
text = [ps.stem(word) for word in text if(word not in stopwords.words('english'))]
return ' '.join(text)
products_clean['cleanned_description'] = products_clean['lowercase_description'].apply(cleanText)
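As a quick sanity check on a hypothetical description (not from the dataset), cleanText strips punctuation, digits and color names, drops stopwords and stems what remains:
# Example: the colon, the digits in "15cm" and the color "red" are removed,
# and the remaining words are stemmed (e.g. "figures" -> "figur")
cleanText('action figures: red swamp monster 15cm')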
Tokenizing lets us count how many words are in each description and search for products containing a specific word.
products_clean['tokenized_description'] = products_clean['cleanned_description'].apply(lambda text: text.split())
products_clean
The frequency of the number of words in each description can help define what is a product and what is dirty data.
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
The peak of descriptions with 0, 1 or 2 words is a hint that there are "defective descriptions" among them.
# Wordless descriptions
products_clean[
products_clean['tokenized_description'].map(len) == 0
].head()
Wordless descriptions are just numbers and punctuation. Taking stock code 21275 at random, we can see there is a real description associated with this "noise".
products_clean[products_clean['lowercase_stockcode'] == '21275']
# Removing wordless descriptions
products_clean = products_clean[products_clean['tokenized_description'].map(len) != 0]
products_clean[products_clean['tokenized_description'].map(len) == 0]
# Description with 1 word
products_clean[
products_clean['tokenized_description'].map(len) == 1
].head()
# Removing duplicate descriptions
products_clean[
products_clean['tokenized_description'].map(len) == 1
].drop_duplicates(subset=['cleanned_description'])
The stock code 62018 - sombrero is the only product from the list above.
# List of descriptions to drop (not a product)
descToDrop = ['amazon', 'check', 'damag', 'faulti', 'found', 'count',
'dotcom', 'samplesdamag', 'showroom', 'mia', 'adjust',
'damagesdisplay', 'broken', 'lost', 'crack', 'sold',
'display', 'miss', 'return', 'crush', 'sampl', 'mailout',
'wetrusti', 'damagesdotcom', 'smash', 'fba', 'damagessampl',
'test', 'dagam', 'wetmouldi', 'mouldi', 'readjust', 'breakag',
'mark', 'dotcomstock', 'wet', 'mix']
products_clean = products_clean[~products_clean['cleanned_description'].isin(descToDrop)]
products_clean[
products_clean['tokenized_description'].map(len) == 1
].drop_duplicates(subset=['cleanned_description'])
# Description with 2 words
print(
products_clean[
products_clean['tokenized_description'].map(len) == 2
]['cleanned_description'].value_counts().index.to_list()
)
Based on the words that indicate the stock code is not a product, we can make a list and drop all records containing those words.
# Returns True if at least one word from wordsList appears in the description tokens
def isInDesc(desc, wordsList):
    return any(word in wordsList for word in desc)
# Listing possible words that represent that the stock code is not a product
print(
products_clean[products_clean[
'tokenized_description'
].apply(lambda desc: isInDesc(desc, descToDrop))
]['cleanned_description'].to_list()
)
# Words chosen from the list above
wordsToDrop = ['dotcom', 'amazon', 'adjust', 'mouldi', 'label',
'wrongli', 'sold', 'damag', 'stock', 'barcod',
'wet', 'wrong', 'rusti', 'alloc', 'crush', 'code',
'crack', 'mark', 'unsal', 'c', 'sampl', 'found']
# Removing records
products_clean = products_clean[~products_clean[
'tokenized_description'
].apply(lambda desc: isInDesc(desc, wordsToDrop))]
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
After cleaning, we only have products in the dataframe. Is there more than one description per product?
# Products with more than one description
products_clean[
'lowercase_stockcode'
].value_counts()[products_clean['lowercase_stockcode'].value_counts() > 1]
Looking at stock code 23196, we can see that the descriptions are similar, but there are grammatical errors and/or some adjectives.
products_clean[products_clean['lowercase_stockcode'] == '23196']
If there were different stock codes for the same product with different colors or other characteristics, it would be great. We can't change that, but we can reduce it to just one description per product. We do not want to lose information, so we will keep the description with the greatest number of words and discard the rest.
# Size of description
pd.DataFrame(zip(products_clean['lowercase_stockcode'],
products_clean['tokenized_description'].map(len)),
columns=['stock_code', 'desc_length'],
index=products_clean.index)
# Sorting by stock code
descSize = pd.DataFrame(zip(products_clean['lowercase_stockcode'],
products_clean['tokenized_description'].map(len)),
columns=['stock_code', 'desc_length'],
index=products_clean.index).sort_values('stock_code')
# Keep the description with the most words for each stock code
index = 0
indexToKeep = []
while(index != descSize.shape[0]):
# Calculate the range with the same stock code
count = 0
count += descSize[descSize['stock_code'] == descSize.iloc[index]['stock_code']]['stock_code'].count()
# Finds the description with more words in the range
indexToKeep.append(descSize['desc_length'].iloc[index:(index+count)].idxmax())
index += count
products_clean = products_clean.loc[indexToKeep]
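An equivalent and more idiomatic pandas approach (a sketch; it reproduces the same selection of the longest description per stock code) uses groupby with idxmax:
# Index of the description with the most words within each stock code group
keep_idx = (products_clean['tokenized_description'].map(len)
            .groupby(products_clean['lowercase_stockcode']).idxmax())
products_clean = products_clean.loc[keep_idx]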
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
We now have clean descriptions with no duplicates.
It's time to group products by description and identify which category each cluster represents. For this task, I will choose a simple path using the K-Means approach.
Feature extraction is not an easy thing when it comes to product clustering. I will use 2 methods: CountVectorizer and TF-IDF.
CountVectorizer simply converts our text into a vector of token counts. There is no weighting based on the "relevance" of a word, making it one of the simplest ways to approach this problem.
vectorizer = CountVectorizer()
# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
#vectorizer.vocabulary_
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])
# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
There is no clue as to how many groups we are looking for, that is, how many product categories there are. Let's use the elbow method to help us find a good value of K.
# Determine k using elbow method
Sum_of_squared_distances = []
# There are about 3800 product descriptions,
# so it is reasonable to search up to 100 clusters (roughly 38 products each)
K = range(1, 101)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(vector)
Sum_of_squared_distances.append(km.inertia_)
plt.figure(figsize=(12,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)
plt.xticks(fontsize=12)
plt.tight_layout()
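Fitting a full KMeans for 100 values of k can take a while; a faster option for this search (a sketch, not used below) is MiniBatchKMeans, which trades a little accuracy for speed:
from sklearn.cluster import MiniBatchKMeans
# Approximate elbow curve using mini-batch k-means
mb_inertias = []
for k in range(1, 101):
    mbk = MiniBatchKMeans(n_clusters=k, random_state=42).fit(vector)
    mb_inertias.append(mbk.inertia_)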
It is not easy to say exactly where the "elbow" is, but 35 clusters (K = 35) is a fair choice.
K = 35
km = KMeans(n_clusters=K)
km.fit(vector)
km.labels_
plt.figure(figsize=(12,6))
sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
prod_clusters = pd.DataFrame({
'lowercase_description': products_clean['lowercase_description'],
'cleanned_description': products_clean['cleanned_description'],
'category': km.labels_
})
prod_clusters
It is difficult to measure whether the clusters are correct or not. I will create a dataframe where each row is a group/category and the columns are samples of product descriptions from the same cluster.
This way, our metric will be common sense: the more sense the groups make, the better.
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)
top_desc_cluster = pd.DataFrame({
'category': 0,
'desc_1': prod_clusters[prod_clusters['category'] == 0].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == 0].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == 0].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == 0].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == 0].iloc[4, 0]
}, index=[0])
for cluster in range(1, K):
top_desc_cluster = pd.concat([top_desc_cluster,
pd.DataFrame({
'category': cluster,
'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
}, index=[0])],
ignore_index=True)
top_desc_cluster
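A more compact way to pull a few sample descriptions per cluster (a sketch producing a similar overview) is groupby plus head:
# Five sample descriptions for each cluster
prod_clusters.groupby('category')['lowercase_description'].apply(lambda s: s.head(5).tolist())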
There is some correlation between the products within the same cluster, but it is far from what we want. I will move on to the next approach.
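Unlike simple counts, TF-IDF (term frequency - inverse document frequency) weights each token by how rare it is across all descriptions, so words that appear everywhere contribute less than distinctive ones. A toy illustration with made-up descriptions (not from the dataset):
# "bag" occurs in every toy description, so its TF-IDF weight is lower
# than the weights of rarer words such as "lantern" or "jumbo"
toy = ['red bag', 'jumbo bag', 'metal lantern bag']
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy)
pd.DataFrame(toy_matrix.toarray(), columns=toy_vectorizer.get_feature_names())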
vectorizer = TfidfVectorizer()
# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
#vectorizer.vocabulary_
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])
# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
# Determine k using elbow method
Sum_of_squared_distances = []
# There are about 3800 product descriptions,
# so it is reasonable to search up to 100 clusters (roughly 38 products each)
K = range(1, 101)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(vector)
Sum_of_squared_distances.append(km.inertia_)
plt.figure(figsize=(12,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)
plt.xticks(fontsize=12)
plt.tight_layout()
# This time, 36 seems to be a reasonable number
K = 36
km = KMeans(n_clusters=K)
km.fit(vector)
km.labels_
plt.figure(figsize=(12,6))
sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
prod_clusters = pd.DataFrame({
'lowercase_description': products_clean['lowercase_description'],
'cleanned_description': products_clean['cleanned_description'],
'category': km.labels_
})
prod_clusters
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)
top_desc_cluster = pd.DataFrame({
'category': 0,
'desc_1': prod_clusters[prod_clusters['category'] == 0].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == 0].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == 0].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == 0].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == 0].iloc[4, 0]
}, index=[0])
for cluster in range(1, K):
top_desc_cluster = pd.concat([top_desc_cluster,
pd.DataFrame({
'category': cluster,
'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
}, index=[0])],
ignore_index=True)
top_desc_cluster
products_clean[products_clean['lowercase_description'] == 'colouring pencils brown tube']
The TF-IDF approach seems to work better than CountVectorizer: the clusters appear to be more coherent. There is more than one product with the same description, but I will assume they are different products.
In both approaches, there are clusters that stand out for their disproportionately large number of products. Next time, I will apply the same method to these clusters to create subcategories.