1) Data Preparation
2) Exploratory Data Analysis (EDA)
2.1) How many orders for each country?
2.2) How much money did each country spend?
2.3) How were sales over the months?
2.4) How were orders distributed by day (of month)?
2.5) How were orders distributed by day (of week)?
2.6) How were orders distributed by hour?
2.7) What are the most common stock codes?
2.8) Which stock codes contribute the most to revenue?
3) Categorizing Products
3.1) Cleaning Product Data
3.2) Product Clustering
Loading all the modules that will be used:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from webcolors import CSS3_NAMES_TO_HEX
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)
from pandas_profiling import ProfileReport
Then, reading the data set and taking a first look:
df = pd.read_csv('../1985_3404_bundle_archive/data.csv', encoding = 'ISO-8859-1')
df.head()
Let's standardize the dataframe's column names and text to make handling easier.
# New column names
new_cols = ['invoice_num', 'stock_code', 'description', 'quantity',
'invoice_date', 'unit_price', 'customer_id', 'country']
df.columns = new_cols
df.head()
# New columns to represent stock code, description and country in lower case
# Lowercase stock code
df.loc[~df['stock_code'].isna(), 'lowercase_stockcode'] = df.loc[
~df['stock_code'].isna(), 'stock_code'
].apply(lambda text: text.lower())
# Lowercase description
df.loc[~df['description'].isna(), 'lowercase_description'] = df.loc[
~df['description'].isna(), 'description'
].apply(lambda text: text.lower())
# Lowercase country
df.loc[~df['country'].isna(), 'lowercase_country'] = df.loc[
~df['country'].isna(), 'country'
].apply(lambda text: text.lower())
df.head()
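As a side note, the same lowercase columns can be built more concisely with the .str accessor, which propagates missing values automatically (a sketch producing equivalent results):
# Equivalent alternative: .str.lower() returns NaN for missing values, so no masking is needed
df['lowercase_stockcode'] = df['stock_code'].str.lower()
df['lowercase_description'] = df['description'].str.lower()
df['lowercase_country'] = df['country'].str.lower()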
Creating a report to get an overview of the data:
# This doesn't appear on github, but the design is great
# It is used only as an overview
# Everything we need from this will be explored later
profile = ProfileReport(df, title='Profiling E-Commerce',html={'style':{'full_width':True}})
profile.to_notebook_iframe()
Let's focus on data types.
df.info()
# Distribution of missing values
plt.figure(figsize=(13,9))
sns.heatmap(df.isna(), cmap='magma')
plt.xticks(rotation=45, fontsize=12)
plt.tight_layout()
There is no problem keeping fields we won't do calculations on as "object", such as invoice_num and stock_code. There are missing values in customer_id, but I will keep them for now and convert customer_id to string ("object").
# We don't need ".0" in the end of the customer_id values
df.loc[~df['customer_id'].isna(), 'customer_id'] = df.loc[
~df['customer_id'].isna(), 'customer_id'
].apply(lambda ID: str(ID)[:-2])
df['customer_id'].value_counts()
df.head()
The next step is to transform invoice_date to "datetime".
df['invoice_date'] = pd.to_datetime(df['invoice_date'])
df.info()
I will split the invoice_date column into year, month, month_day, week_day, hour and year_month. This way, we can group the data however we want for any time-based analysis.
df['year'] = df['invoice_date'].apply(lambda time: time.year)
df['month'] = df['invoice_date'].apply(lambda time: time.month)
df['month_day'] = df['invoice_date'].apply(lambda time: time.day)
df['week_day'] = df['invoice_date'].apply(lambda time: time.weekday())
df['hour'] = df['invoice_date'].apply(lambda date: date.hour)
# zfill(2) is used to fill the string with zeros until it is 2 characters long.
df['year_month'] = df['invoice_date'].apply(lambda time: str(time.year) + str(time.month).zfill(2))
df.head()
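As a side note, the same time columns can be derived with the vectorized .dt accessor instead of row-by-row apply calls (a sketch producing equivalent results):
# Vectorized alternative using the .dt accessor
df['year'] = df['invoice_date'].dt.year
df['month'] = df['invoice_date'].dt.month
df['month_day'] = df['invoice_date'].dt.day
df['week_day'] = df['invoice_date'].dt.weekday
df['hour'] = df['invoice_date'].dt.hour
df['year_month'] = df['invoice_date'].dt.strftime('%Y%m')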
We have quantity and unit_price, but we don't have the amount spent on each line. Creating the amount_spent column will make it easier to analyse revenue.
df['amount_spent'] = df['quantity'] * df['unit_price']
Taking a look at missing values...
df.isna().sum().sort_values(ascending=False)
Perhaps there are non-numeric fields with missing values that did not appear in the previous analysis. Missing values can appear in strange ways, such as the strings "", "na", "nan" and "null".
nullsdict = {}
nulls_vec = ['', 'na', 'nan', 'null']
for col in df.columns:
str_nan = df.loc[~df[col].isna(), col].apply(lambda row: row in nulls_vec)
real_nan = df[col].isna()
nullsdict.update({col:{'str_nan':sum(str_nan),
'real_nan':sum(real_nan)}})
pd.DataFrame(nullsdict).T
We could erase the NaN values, but that would damage the revenue total.
There is nothing we can do about the missing customer_id values for now. I will analyse the description column in the Exploratory Data Analysis section.
In the Data Preparation section, we observed missing values in the lowercase_description column. Let's take an overview of the products.
products = df[['lowercase_stockcode', 'lowercase_description']].drop_duplicates()
products.head()
products['lowercase_stockcode'].value_counts()
There is more than one description for the same product. Looking at the product with stock code 20713...
products[products['lowercase_stockcode'] == '20713']
The product with stock_code 20713 actually is "jumbo bag owls". Besides that, there are missing values, a note that the product was incorrectly marked as 23343, and a "found(?)" entry.
Regardless, we can assume the revenue amount is correct, so we will not change any description or unit price.
products['lowercase_stockcode'].map(len).value_counts()
According to the data set's source, if the invoice number starts with the letter "C", it indicates a cancellation.
Separately, there are stock_code values with different lengths.
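As a quick check of how many invoices are cancellations (a sketch; invoice_num values are strings in this dataframe):
# Invoices whose number starts with "c"/"C" are cancellations
df['invoice_num'].astype(str).str.lower().str.startswith('c').sum()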
# stock_code with 6 characters
products[products['lowercase_stockcode'].map(len) == 6].head()
# stock_code with 5 characters
products[products['lowercase_stockcode'].map(len) == 5].head()
It seems that stock_code values with 5 or 6 characters represent standard sales.
products['lowercase_stockcode'].map(len).value_counts().index
# Create a dataframe with stock_code descriptions
def desc_stockcode(prod):
# stock_code lengths different from 5 and 6
stockcode_len = [8, 12, 1, 7, 4, 9, 3, 2]
stockcode_len = sorted(stockcode_len)
df_stockcode = (prod[prod['lowercase_stockcode'].map(len) == stockcode_len[0]])
for length in stockcode_len[1:]:
df_stockcode = pd.concat([df_stockcode,
prod[prod['lowercase_stockcode'].map(len) == length]],
ignore_index=True)
return df_stockcode
desc_stockcode(products)
Analysing the previous dataframe, stock_code can also represent special operations such as discounts, bad debt and gift vouchers.
df.describe()
There are negative values in quantity and unit_price.
len(df[(df['quantity'] < 0)])
len(df[(df['unit_price'] < 0)])
There are 10624 negative values in quantity and only 2 in unit_price. I will explore these values and find out what they mean.
# Starting with unit_price, because there are only 2 negative values.
df[(df['unit_price'] < 0)]
Negative unit_price values are bad debt.
# Checking negative quantity
df[(df['quantity'] < 0)]
# Checking invoice_num starting character.
df[(df['quantity'] < 0)]['invoice_num'].apply(lambda invoice: invoice[0]).value_counts()
df[(df['quantity'] < 0)].sort_values('invoice_num')
# How many sales had unit_price equal to 0?
sum(df['unit_price'] == 0)
# How many invoice_num codes start with "5"?
sum(df[df['unit_price'] == 0]['invoice_num'].apply(lambda invoice: invoice[0] == '5'))
All invoices with free items have an invoice_num starting with "5".
# Analysing unit_price equal 0
plt.figure(figsize=(12,8))
df[df['unit_price'] == 0]['quantity'].hist(bins=30, color='steelblue')
plt.xticks(fontsize=12)
plt.tight_layout()
# Descriptions of "free items" with positive quantity
df[(df['quantity'] > 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
# Descriptions of "free items" with negative quantity
df[(df['quantity'] < 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
There are positive and negative quantity values for invoices with unit_price equal to 0. Some of them are damaged or display items, as we can see. Perhaps some are free items for promotions or events.
Now that we know our data set better, let's do some visualization and answer some questions.
plt.figure(figsize=(12,8))
ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Orders per Country', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
# UK stands out too much on the chart
ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts().iloc[1:]
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Orders per Country (without UK)', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False)
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Revenue per Country', fontsize=15)
plt.xlabel('Amount (millions)', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.figure(figsize=(12,8))
# UK stands out too much on the chart
ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False).iloc[1:]
sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')
plt.title('Revenue per Country (without UK)', fontsize=15)
plt.xlabel('Amount', fontsize=13)
plt.ylabel('Country', fontsize=13)
plt.yticks(fontsize=12)
plt.tight_layout()
This simple analysis shows that we have two important points of view:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Revenue per Month
ax = df.groupby('year_month').sum()['amount_spent']
sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])
axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
axes[0].set_xticklabels(label, rotation=45)
# Orders x Month
ax = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])
axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)
axes[1].set_xticklabels(label, rotation=45)
plt.tight_layout()
Remember that sales occurred between 1 December 2010 and 9 December 2011, so December 2011 only includes sales through the 9th.
Although the behavior of revenue and orders per month tends to be similar, the average amount spent per order varies from month to month. This will be discussed in the Average Ticket section.
January, February and April had the worst sales, apart from the partial month of December 2011. On the other hand, September, October and November had the best sales: in these months there is growing demand for Christmas gifts and wholesalers are preparing for Black Friday (the Friday after US Thanksgiving, in late November).
# Comparing UK sales to sales outside UK
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Revenue per Month (Only UK)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('year_month').sum()['amount_spent']
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[0])
# Revenue per Month (Without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('year_month').sum()['amount_spent']
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])
axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
axes[0].set_xticklabels(label, rotation=45)
axes[0].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
# Orders x Month (Only UK)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[1])
# Orders x Month (Without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])
axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)
axes[1].set_xticklabels(label, rotation=45)
axes[1].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Sales outside the UK appear to be more stable. Although the "Without UK" curve is much lower than the "UK Only" curve, it looks smoother over the months. Christmas demand is more pronounced for sales in the UK than for sales outside it.
plt.figure(figsize=(10,6))
ax1 = df.groupby('year_month')['amount_spent'].sum()
ax2 = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Average Ticket', fontsize=15)
plt.xlabel('Month', fontsize=13)
plt.ylabel('Amount', fontsize=13)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
plt.xticks(ticks=range(len(label)), labels=label, rotation=45, fontsize=12)
plt.tight_layout()
# Comparing UK Average Ticket to Average Ticket outside UK
fig, axes = plt.subplots(figsize=(12, 6))
# Average Ticket (UK only)
ax1 = df[
df['lowercase_country'] == 'united kingdom'
].groupby('year_month')['amount_spent'].sum()
ax2 = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()
# Average Ticket (without UK)
ax1 = df[
df['lowercase_country'] != 'united kingdom'
].groupby('year_month')['amount_spent'].sum()
ax2 = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()
# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Average Ticket', fontsize=15)
axes.set_xlabel('Month', fontsize=13)
axes.set_ylabel('Amount', fontsize=13)
axes.set_xticklabels(label, rotation=45)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Although revenue and the number of orders in the UK are higher than abroad, the opposite is true for the average ticket. This indicates that the average amount spent per order outside the UK is higher.
Looking at all the sales over the months, we can see a significant drop in April.
There are some key dates that move retail, and April falls right between them:
February: Valentine's Day
May: Mother's Day
June: Father's Day
August/September: Back to school (most of the countries in the data set are in the northern hemisphere)
October: Halloween
November: Black Friday
December: Christmas
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Day', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
# Comparing UK orders to orders outside UK per day
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per day (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Day', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
There is substantial demand between the 5th and the 10th, but it is not enough to state a trend with certainty.
For this analysis, I will use polynomial regression, taking the number of orders per day as a series. The goal is just to plot a trend line.
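The repeated PolynomialFeatures + LinearRegression pattern can be wrapped in a small helper. The sketch below shows the idea; the cells that follow keep the inline version:
# Fit a polynomial trend of the given degree and return the fitted values
def polynomial_trend(x, y, degree):
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    return model.predict(x_poly)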
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
for i in range(4):
# Where each subplot will be
if i < 2:
plot = (0,i)
else:
plot = (1,(i-2))
# Order per day (UK only)
y = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='orange', ax=axes[plot])
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=(i+1))
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes[plot].plot(x, y_poly_pred, 'r--')
axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
axes[plot].set_ylabel('')
axes[plot].set_ylim(ymin=0)
plt.tight_layout()
The dashed red line shows the trend. The second- and third-degree fits are almost the same. I will use the third degree; higher degrees would just overfit.
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (UK only)
y = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='orange', ax=axes)
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes.plot(x, y_poly_pred, 'r--')
axes.set_title('Orders per Day (UK Only)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)
plt.ylim(ymin=0)
plt.tight_layout()
In the UK, the number of orders starts high at the beginning of the month and falls over the days.
fig, axes = plt.subplots(2, 2, figsize=(10, 7))
for i in range(4):
# Where each subplot will be
if i < 2:
plot = (0,i)
else:
plot = (1,(i-2))
# Orders per day (without UK)
y = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='steelblue', ax=axes[plot])
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=(i+1))
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes[plot].plot(x, y_poly_pred, 'r--')
axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
axes[plot].set_ylabel('')
axes[plot].set_ylim(ymin=0)
plt.tight_layout()
The third-degree polynomial fits the number of orders per day better.
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day (without UK)
y = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
# Removing the index brackets
x = [index[0] for index in y.index]
sns.lineplot(x=x, y=y, color='steelblue', ax=axes)
# transforming the data to include another axis
x = np.array(x)
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
axes.plot(x, y_poly_pred, 'r--')
axes.set_title('Orders per Day (Without UK)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)
plt.ylim(ymin=0)
plt.tight_layout()
Outside the UK, we can see that the number of orders remains stable until the 20th and then starts to fall.
Although a third-degree polynomial cannot fully express the sales forecast, it did a good job helping us understand the chaotic behavior of orders per day.
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Day of Week', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
# week_day goes from 0 (Monday) to 6 (Sunday); there are no Saturday orders,
# so only six labels are needed
label = ['Mon','Tue','Wed','Thur','Fri','Sun']
tick = range(len(label))
plt.xticks(ticks=tick, labels=label, fontsize=12)
plt.tight_layout()
# Comparing UK orders to orders outside UK per day of week
fig, axes = plt.subplots(figsize=(10, 6))
# Order per day of week (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per day of week (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Day of Week', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
label = ['','Mon','Tue','Wed','Thur','Fri', 'Sat', 'Sun']
axes.set_xticklabels(label, fontsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
The peak in the number of orders is on Thursday. Although Saturday appears on the chart, there are no recorded Saturday sales; the line simply interpolates between Friday and Sunday.
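We can confirm that there are no Saturday orders (a quick check; weekday() returns 5 for Saturday):
# Should return 0 if there really are no Saturday records
(df['week_day'] == 5).sum()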
plt.figure(figsize=(10,6))
ax = df.groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.barplot(x=ax.index, y=ax, color='steelblue')
plt.title('Orders per Hour', fontsize=15)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
Sales are concentrated around noon, which indicates a preference for buying around lunchtime.
# Comparing UK orders to orders outside UK per hour
fig, axes = plt.subplots(figsize=(10, 6))
# Order per hour (UK only)
ax = df[
df['lowercase_country'] == 'united kingdom'
].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)
# Order per hour (without UK)
ax = df[
df['lowercase_country'] != 'united kingdom'
].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()
# Removing the index brackets
ax.index = [index[0] for index in ax.index]
sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)
axes.set_title('Orders per Hour', fontsize=15)
axes.set_xlabel('Hour', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)
axes.tick_params(axis="x", labelsize=12)
axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)
plt.tight_layout()
Sales in the UK take place between 6 am and 8 pm, while sales outside the UK take place between 7 am and 7 pm.
These are the 20 most common stock codes and how much of the revenue they represent (percentage).
common_codes = []
for code in df['lowercase_stockcode'].value_counts()[:20].index:
common_codes.append((code,
products[
(products[~products.isna()]['lowercase_stockcode'] == code)
]['lowercase_description'].iloc[0],
df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
fig, axes = plt.subplots(2, 1, figsize=(12, 12))
# The 20 most common stock codes
ax = df['lowercase_stockcode'].value_counts()[:20]
sns.barplot(x=ax.index, y=ax,
palette=sns.cubehelix_palette(20, start=.5, rot=-.75),
ax=axes[0])
axes[0].set_title('Most Common Stock Codes', fontsize=15)
axes[0].set_xlabel('Stock Code', fontsize=13)
axes[0].set_ylabel('Quantity', fontsize=13)
axes[0].tick_params(axis="x", labelsize=12)
# Percentage of revenue (20 most common stock codes)
common_codes = []
for code in df['lowercase_stockcode'].value_counts()[:20].index:
common_codes.append((code,
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
ax = pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'percentage'])
sns.barplot(x='lowercase_stockcode', y='percentage', data=ax,
palette=sns.cubehelix_palette(20),
ax=axes[1])
axes[1].set_title('Percentage of Revenue', fontsize=15)
axes[1].set_xlabel('Stock Code', fontsize=13)
axes[1].set_ylabel('Percentage', fontsize=13)
axes[1].tick_params(axis="x", labelsize=12)
plt.tight_layout()
Being the most frequent does not necessarily mean representing a large share of the revenue. The product 85123A - "white hanging heart t-light holder" is the best seller but represents roughly 1.0% of the revenue, while the product 22423 - "regency cakestand 3 tier" is in second place and represents more than 1.6% of the revenue.
The top 20 stock codes that contribute the most to revenue:
top_revenue = []
for code in df.groupby(
'lowercase_stockcode'
)[
'amount_spent'
].sum().sort_values(ascending=False)[:20].index:
top_revenue.append((code,
products[
(products[~products.isna()]['lowercase_stockcode'] == code)
]['lowercase_description'].iloc[0],
df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
pd.DataFrame(top_revenue, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
plt.figure(figsize=(12,6))
# top 20 stock codes in revenue
ax = df.groupby(
'lowercase_stockcode'
)[
'amount_spent'
].sum().sort_values(ascending=False)[:20] / df['amount_spent'].sum() * 100
sns.barplot(x=ax.index, y=ax, palette='GnBu')
plt.title('Percentage of Revenue', fontsize=15)
plt.xlabel('Stock Code', fontsize=13)
plt.ylabel('Percentage', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
DOT means "dotcom postage", that is, it is not a product but a service. The real top product is stock code 22423 - "regency cakestand 3 tier", which stands out compared to the others.
We previously performed a superficial analysis of the stock codes. It's time to clean up this mess, separating actual products from other operations. This way, it will be possible to group products into categories.
products
Remember that we interpreted stock codes with lengths of 5 and 6 as products.
# stock_code lengths different from 5 and 6
desc_stockcode(products).dropna()
Of all the stock codes above, only a few actually represent products.
I will drop the stock codes that are not products.
products_clean = products.dropna()
products_clean
# List of stock codes to drop (not a product)
codesToDrop = ['d', 'm', 's', 'b', 'c2', 'dot', 'post', 'pads', 'cruk',
'79323gr', '79323lp', 'dcgs0073', 'dcgs0069', 'dcgs0068',
'dcgs0067', 'dcgs0003', 'amazonfee', 'bank charges',
'gift_0001_40', 'gift_0001_50', 'gift_0001_30', 'gift_0001_20',
'gift_0001_10']
products_clean = products_clean[~products_clean['lowercase_stockcode'].isin(codesToDrop)]
products_clean
We need to remove unnecessary content from the product descriptions before our analysis.
Punctuation, color names and numbers will not be very useful when we try to correlate products by their descriptions. Suppose one product is called "Action Figure: Red Swamp Monster" and another "15cm Action Figure"; between these two descriptions, what matters is that both are action figures, not the colon, the color or the size.
Removing stopwords will eliminate generic words that are common in many descriptions, and stemming will let us group variations of words with the same root and meaning.
# Stemming
ps = PorterStemmer()
def cleanText(text):
# Stripping punctuation from the string
# This performs raw string operations in C
text = text.translate(str.maketrans('', '', string.punctuation))
# Stripping numbers
text = ''.join([i for i in text if not i.isdigit()])
# Removing color names
text = ' '.join([i for i in text.split() if(i not in CSS3_NAMES_TO_HEX)])
text = text.strip().split()
# Removing stopwords and stemming
text = [ps.stem(word) for word in text if(word not in stopwords.words('english'))]
return ' '.join(text)
products_clean['cleanned_description'] = products_clean['lowercase_description'].apply(cleanText)
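As a quick sanity check on a hypothetical description (not from the dataset), cleanText strips punctuation, digits and color names, drops stopwords and stems what remains:
# Example: the colon, the digits in "15cm" and the color "red" are removed,
# and the remaining words are stemmed (e.g. "figures" -> "figur")
cleanText('action figures: red swamp monster 15cm')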
Tokenizing lets us count how many words are in each description and search for products containing a specific word.
products_clean['tokenized_description'] = products_clean['cleanned_description'].apply(lambda text: text.split())
products_clean
The frequency of the number of words in each description can help define what is a product and what is dirty data.
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
The peak of descriptions with 0, 1 or 2 words is a hint that there are "defective descriptions" among them.
# Wordless descriptions
products_clean[
products_clean['tokenized_description'].map(len) == 0
].head()
Wordless descriptions are just numbers and punctuation. Taking stock code 21275 at random, we can see there is a real description associated with this "noise".
products_clean[products_clean['lowercase_stockcode'] == '21275']
# Removing wordless descriptions
products_clean = products_clean[products_clean['tokenized_description'].map(len) != 0]
products_clean[products_clean['tokenized_description'].map(len) == 0]
# Description with 1 word
products_clean[
products_clean['tokenized_description'].map(len) == 1
].head()
# Removing duplicate descriptions
products_clean[
products_clean['tokenized_description'].map(len) == 1
].drop_duplicates(subset=['cleanned_description'])
The stock code 62018 - sombrero is the only product from the list above.
# List of descriptions to drop (not a product)
descToDrop = ['amazon', 'check', 'damag', 'faulti', 'found', 'count',
'dotcom', 'samplesdamag', 'showroom', 'mia', 'adjust',
'damagesdisplay', 'broken', 'lost', 'crack', 'sold',
'display', 'miss', 'return', 'crush', 'sampl', 'mailout',
'wetrusti', 'damagesdotcom', 'smash', 'fba', 'damagessampl',
'test', 'dagam', 'wetmouldi', 'mouldi', 'readjust', 'breakag',
'mark', 'dotcomstock', 'wet', 'mix']
products_clean = products_clean[~products_clean['cleanned_description'].isin(descToDrop)]
products_clean[
products_clean['tokenized_description'].map(len) == 1
].drop_duplicates(subset=['cleanned_description'])
# Description with 2 words
print(
products_clean[
products_clean['tokenized_description'].map(len) == 2
]['cleanned_description'].value_counts().index.to_list()
)
Based on the words that indicate the stock code is not a product, we can make a list and drop all records containing those words.
# Returns True if at least one word from wordsList appears in the description tokens
def isInDesc(desc, wordsList):
    return any(word in wordsList for word in desc)
# Listing possible words that represent that the stock code is not a product
print(
products_clean[products_clean[
'tokenized_description'
].apply(lambda desc: isInDesc(desc, descToDrop))
]['cleanned_description'].to_list()
)
# Words chosen from the list above
wordsToDrop = ['dotcom', 'amazon', 'adjust', 'mouldi', 'label',
'wrongli', 'sold', 'damag', 'stock', 'barcod',
'wet', 'wrong', 'rusti', 'alloc', 'crush', 'code',
'crack', 'mark', 'unsal', 'c', 'sampl', 'found']
# Removing records
products_clean = products_clean[~products_clean[
'tokenized_description'
].apply(lambda desc: isInDesc(desc, wordsToDrop))]
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
After cleaning, we only have products in the dataframe. Is there more than one description per product?
# Products with more than one description
products_clean[
'lowercase_stockcode'
].value_counts()[products_clean['lowercase_stockcode'].value_counts() > 1]
Looking at stock code 23196, we can see that the descriptions are similar, but there are grammatical errors and/or some adjectives.
products_clean[products_clean['lowercase_stockcode'] == '23196']
If there were different stock codes for the same product with different colors or other characteristics, it would be great. We can't change that, but we can reduce it to just one description per product. We do not want to lose information, so we will keep the description with the greatest number of words and discard the rest.
# Size of description
pd.DataFrame(zip(products_clean['lowercase_stockcode'],
products_clean['tokenized_description'].map(len)),
columns=['stock_code', 'desc_length'],
index=products_clean.index)
# Sorting by stock code
descSize = pd.DataFrame(zip(products_clean['lowercase_stockcode'],
products_clean['tokenized_description'].map(len)),
columns=['stock_code', 'desc_length'],
index=products_clean.index).sort_values('stock_code')
# Keep the description with the most words for each stock code
index = 0
indexToKeep = []
while(index != descSize.shape[0]):
# Calculate the range with the same stock code
count = 0
count += descSize[descSize['stock_code'] == descSize.iloc[index]['stock_code']]['stock_code'].count()
# Finds the description with more words in the range
indexToKeep.append(descSize['desc_length'].iloc[index:(index+count)].idxmax())
index += count
products_clean = products_clean.loc[indexToKeep]
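An equivalent and more idiomatic pandas approach (a sketch; it reproduces the same selection of the longest description per stock code) uses groupby with idxmax:
# Index of the description with the most words within each stock code group
keep_idx = (products_clean['tokenized_description'].map(len)
            .groupby(products_clean['lowercase_stockcode']).idxmax())
products_clean = products_clean.loc[keep_idx]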
plt.figure(figsize=(10,6))
products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')
plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()
We now have clean descriptions with no duplicates.
It's time to group products by description and identify which category each cluster represents. For this task, I will choose a simple path using the K-Means approach.
Feature extraction is not an easy thing when it comes to product clustering. I will use 2 methods: CountVectorizer and TF-IDF.
CountVectorizer simply converts our text into a vector of token counts. There is no weighting based on the "relevance" of a word, making it one of the simplest ways to approach this problem.
vectorizer = CountVectorizer()
# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
#vectorizer.vocabulary_
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])
# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
There is no clue as to how many groups we are looking for, that is, how many product categories there are. Let's use the elbow method to help us find a good value of K.
# Determine k using elbow method
Sum_of_squared_distances = []
# There are about 3800 product descriptions,
# so it is reasonable to search up to 100 clusters (roughly 38 products each)
K = range(1, 101)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(vector)
Sum_of_squared_distances.append(km.inertia_)
plt.figure(figsize=(12,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)
plt.xticks(fontsize=12)
plt.tight_layout()
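Fitting a full KMeans for 100 values of k can take a while; a faster option for this search (a sketch, not used below) is MiniBatchKMeans, which trades a little accuracy for speed:
from sklearn.cluster import MiniBatchKMeans
# Approximate elbow curve using mini-batch k-means
mb_inertias = []
for k in range(1, 101):
    mbk = MiniBatchKMeans(n_clusters=k, random_state=42).fit(vector)
    mb_inertias.append(mbk.inertia_)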
It is not easy to say exactly where the "elbow" is, but 35 clusters (K = 35) is a fair choice.
K = 35
km = KMeans(n_clusters=K)
km.fit(vector)
km.labels_
plt.figure(figsize=(12,6))
sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
prod_clusters = pd.DataFrame({
'lowercase_description': products_clean['lowercase_description'],
'cleanned_description': products_clean['cleanned_description'],
'category': km.labels_
})
prod_clusters
It is difficult to measure whether the clusters are correct or not. I will create a dataframe where each row is a group/category and the columns are samples of product descriptions from the same cluster.
This way, our metric will be common sense: the more sense the groups make, the better.
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)
top_desc_cluster = pd.DataFrame({
'category': 0,
'desc_1': prod_clusters[prod_clusters['category'] == 0].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == 0].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == 0].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == 0].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == 0].iloc[4, 0]
}, index=[0])
for cluster in range(1, K):
top_desc_cluster = pd.concat([top_desc_cluster,
pd.DataFrame({
'category': cluster,
'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
}, index=[0])],
ignore_index=True)
top_desc_cluster
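A more compact way to pull a few sample descriptions per cluster (a sketch producing a similar overview) is groupby plus head:
# Five sample descriptions for each cluster
prod_clusters.groupby('category')['lowercase_description'].apply(lambda s: s.head(5).tolist())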
There is some correlation between the products within the same cluster, but it is far from what we want. I will move on to the next approach.
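Unlike simple counts, TF-IDF (term frequency - inverse document frequency) weights each token by how rare it is across all descriptions, so words that appear everywhere contribute less than distinctive ones. A toy illustration with made-up descriptions (not from the dataset):
# "bag" occurs in every toy description, so its TF-IDF weight is lower
# than the weights of rarer words such as "lantern" or "jumbo"
toy = ['red bag', 'jumbo bag', 'metal lantern bag']
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy)
pd.DataFrame(toy_matrix.toarray(), columns=toy_vectorizer.get_feature_names())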
vectorizer = TfidfVectorizer()
# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
#vectorizer.vocabulary_
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])
# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
# Determine k using elbow method
Sum_of_squared_distances = []
# There are about 3800 product descriptions,
# so it is reasonable to search up to 100 clusters (roughly 38 products each)
K = range(1, 101)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(vector)
Sum_of_squared_distances.append(km.inertia_)
plt.figure(figsize=(12,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)
plt.xticks(fontsize=12)
plt.tight_layout()
# This time, 36 seems to be a reasonable number
K = 36
km = KMeans(n_clusters=K)
km.fit(vector)
km.labels_
plt.figure(figsize=(12,6))
sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
prod_clusters = pd.DataFrame({
'lowercase_description': products_clean['lowercase_description'],
'cleanned_description': products_clean['cleanned_description'],
'category': km.labels_
})
prod_clusters
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)
top_desc_cluster = pd.DataFrame({
'category': 0,
'desc_1': prod_clusters[prod_clusters['category'] == 0].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == 0].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == 0].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == 0].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == 0].iloc[4, 0]
}, index=[0])
for cluster in range(1, K):
top_desc_cluster = pd.concat([top_desc_cluster,
pd.DataFrame({
'category': cluster,
'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
}, index=[0])],
ignore_index=True)
top_desc_cluster
products_clean[products_clean['lowercase_description'] == 'colouring pencils brown tube']
The TF-IDF approach seems to work better than CountVectorizer: the clusters appear to be more coherent. There is more than one product with the same description, but I will assume they are different products.
In both approaches, there are clusters that stand out for their disproportionately large number of products. Next time, I will apply the same method to these clusters to create subcategories.