E-Commerce Data Analysis


Loading all the modules that will be used:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

from webcolors import CSS3_NAMES_TO_HEX

import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

from pandas_profiling import ProfileReport




Then reading the data set and taking the first look:

In [3]:
df = pd.read_csv('../1985_3404_bundle_archive/data.csv', encoding = 'ISO-8859-1')
In [4]:
df.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom




Let's standardize the dataframe's column names and text to make them easier to work with.

In [5]:
# New column names

new_cols = ['invoice_num', 'stock_code', 'description', 'quantity',
            'invoice_date', 'unit_price', 'customer_id', 'country']
df.columns = new_cols

df.head()
Out[5]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
In [6]:
# New columns to represent stock code, description and country in lower case

# Lowercase stock code
df.loc[~df['stock_code'].isna(), 'lowercase_stockcode'] = df.loc[
                                                                 ~df['stock_code'].isna(), 'stock_code'
                                                                ].apply(lambda text: text.lower())

# Lowercase description
df.loc[~df['description'].isna(), 'lowercase_description'] = df.loc[
                                                                    ~df['description'].isna(), 'description'
                                                                   ].apply(lambda text: text.lower())

# Lowercase country
df.loc[~df['country'].isna(), 'lowercase_country'] = df.loc[
                                                            ~df['country'].isna(), 'country'
                                                           ].apply(lambda text: text.lower())
In [7]:
df.head()
Out[7]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom 85123a white hanging heart t-light holder united kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 71053 white metal lantern united kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom 84406b cream cupid hearts coat hanger united kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 84029g knitted union flag hot water bottle united kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom 84029e red woolly hottie white heart. united kingdom




Creating a report to get an overview of the data:

In [8]:
# The report doesn't render on GitHub, but the design is great
# It is used only as an overview
# Everything we need from it will be explored later
profile = ProfileReport(df, title='Profiling E-Commerce',html={'style':{'full_width':True}})
profile.to_notebook_iframe()
Summarize dataset: 100%|████████████████████████████████████████████████████| 25/25 [00:54<00:00,  2.18s/it, Completed]
Generate report structure: 100%|█████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.87s/it]
Render HTML: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.10s/it]




Let's focus on data types.

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   invoice_num            541909 non-null  object 
 1   stock_code             541909 non-null  object 
 2   description            540455 non-null  object 
 3   quantity               541909 non-null  int64  
 4   invoice_date           541909 non-null  object 
 5   unit_price             541909 non-null  float64
 6   customer_id            406829 non-null  float64
 7   country                541909 non-null  object 
 8   lowercase_stockcode    541909 non-null  object 
 9   lowercase_description  540455 non-null  object 
 10  lowercase_country      541909 non-null  object 
dtypes: float64(2), int64(1), object(8)
memory usage: 45.5+ MB
In [10]:
# Distribution of missing values
plt.figure(figsize=(13,9))

sns.heatmap(df.isna(), cmap='magma')

plt.xticks(rotation=45, fontsize=12)

plt.tight_layout()




There is no problem keeping fields we won't use in calculations, such as invoice_num and stock_code, as "object". customer_id has missing values; I will leave those as they are for now and convert the non-missing values to strings ("object").

In [11]:
# We don't need ".0" in the end of the customer_id values
df.loc[~df['customer_id'].isna(), 'customer_id'] = df.loc[
                                                          ~df['customer_id'].isna(), 'customer_id'
                                                         ].apply(lambda ID: str(ID)[:-2])
In [12]:
df['customer_id'].value_counts()
Out[12]:
17841    7983
14911    5903
14096    5128
12748    4642
14606    2782
         ... 
17443       1
16428       1
14119       1
14424       1
13185       1
Name: customer_id, Length: 4372, dtype: int64
In [13]:
df.head()
Out[13]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom 85123a white hanging heart t-light holder united kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom 71053 white metal lantern united kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom 84406b cream cupid hearts coat hanger united kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom 84029g knitted union flag hot water bottle united kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom 84029e red woolly hottie white heart. united kingdom




The next step is to transform invoice_date to "datetime".

In [14]:
df['invoice_date'] = pd.to_datetime(df['invoice_date'])
In [15]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   invoice_num            541909 non-null  object        
 1   stock_code             541909 non-null  object        
 2   description            540455 non-null  object        
 3   quantity               541909 non-null  int64         
 4   invoice_date           541909 non-null  datetime64[ns]
 5   unit_price             541909 non-null  float64       
 6   customer_id            406829 non-null  object        
 7   country                541909 non-null  object        
 8   lowercase_stockcode    541909 non-null  object        
 9   lowercase_description  540455 non-null  object        
 10  lowercase_country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(8)
memory usage: 45.5+ MB




I will split the invoice_date column into year, month, month_day, week_day, hour and year_month. This way, we can group the data by any time granularity we want.

In [16]:
df['year'] = df['invoice_date'].apply(lambda time: time.year)

df['month'] = df['invoice_date'].apply(lambda time: time.month)

df['month_day'] = df['invoice_date'].apply(lambda time: time.day)

df['week_day'] = df['invoice_date'].apply(lambda time: time.weekday())

df['hour'] = df['invoice_date'].apply(lambda date: date.hour)

# zfill(2) is used to fill the string with zeros until it is 2 characters long.
df['year_month'] = df['invoice_date'].apply(lambda time: str(time.year) + str(time.month).zfill(2))
In [17]:
df.head()
Out[17]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country year month month_day week_day hour year_month
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 85123a white hanging heart t-light holder united kingdom 2010 12 1 2 8 201012
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 71053 white metal lantern united kingdom 2010 12 1 2 8 201012
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 84406b cream cupid hearts coat hanger united kingdom 2010 12 1 2 8 201012
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 84029g knitted union flag hot water bottle united kingdom 2010 12 1 2 8 201012
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 84029e red woolly hottie white heart. united kingdom 2010 12 1 2 8 201012




We have quantity and unit_price, but we don't have the amount spent on each line. Creating the amount_spent column will make it easier to analyse revenue.

In [18]:
df['amount_spent'] = df['quantity'] * df['unit_price']




Taking a look at missing values...

In [19]:
df.isna().sum().sort_values(ascending=False)
Out[19]:
customer_id              135080
description                1454
lowercase_description      1454
amount_spent                  0
country                       0
stock_code                    0
quantity                      0
invoice_date                  0
unit_price                    0
lowercase_stockcode           0
year_month                    0
lowercase_country             0
year                          0
month                         0
month_day                     0
week_day                      0
hour                          0
invoice_num                   0
dtype: int64




Perhaps some non-numeric fields contain missing values that did not show up in our previous analysis. Missing values can hide in strange ways, such as the strings "", "na", "nan" and "null".

In [20]:
nullsdict = {}
nulls_vec = ['', 'na', 'nan', 'null']

for col in df.columns:    
    str_nan = df.loc[~df[col].isna(), col].apply(lambda row: row in nulls_vec)
    real_nan = df[col].isna()
    
    nullsdict.update({col:{'str_nan':sum(str_nan),
                           'real_nan':sum(real_nan)}})

pd.DataFrame(nullsdict).T
Out[20]:
str_nan real_nan
invoice_num 0 0
stock_code 0 0
description 0 1454
quantity 0 0
invoice_date 0 0
unit_price 0 0
customer_id 0 135080
country 0 0
lowercase_stockcode 0 0
lowercase_description 0 1454
lowercase_country 0 0
year 0 0
month 0 0
month_day 0 0
week_day 0 0
hour 0 0
year_month 0 0
amount_spent 0 0




We could drop the NaN rows, but that would distort the revenue totals.

There is nothing we can do about customer_id's missing values for now. I will analyse the description column in the Exploratory Data Analysis section.
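Since the NaN rows stay in, any customer-level grouping should drop them explicitly, while revenue sums keep the full frame. A minimal sketch of the idea, on a toy frame standing in for df:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df (not the real data)
df = pd.DataFrame({
    'customer_id': ['17850', '17850', np.nan, '14911'],
    'amount_spent': [15.3, 20.34, 22.0, 9.75],
})

# Revenue view: keep every line, NaN customer_id included,
# so the revenue total stays intact
total_revenue = df['amount_spent'].sum()

# Customer view: drop lines without a customer_id only when
# grouping by customer
per_customer = (df.dropna(subset=['customer_id'])
                  .groupby('customer_id')['amount_spent'].sum())

print(total_revenue)
print(per_customer.to_dict())
```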




In the Data Preparation section we observed missing values in the lowercase_description column. Let's take an overview of the products.

In [21]:
products = df[['lowercase_stockcode', 'lowercase_description']].drop_duplicates()
products.head()
Out[21]:
lowercase_stockcode lowercase_description
0 85123a white hanging heart t-light holder
1 71053 white metal lantern
2 84406b cream cupid hearts coat hanger
3 84029g knitted union flag hot water bottle
4 84029e red woolly hottie white heart.
In [22]:
products['lowercase_stockcode'].value_counts()
Out[22]:
20713     8
23084     8
21830     7
85175     7
23131     6
         ..
23368     1
84661b    1
21908     1
20763     1
90001d    1
Name: lowercase_stockcode, Length: 3958, dtype: int64




There is more than one description for the same product. Let's look at the product with stock code 20713...

In [23]:
products[products['lowercase_stockcode'] == '20713']
Out[23]:
lowercase_stockcode lowercase_description
522 20713 jumbo bag owls
136272 20713 NaN
263884 20713 wrongly marked. 23343 in box
380688 20713 wrongly coded-23343
403534 20713 found
415582 20713 wrongly marked 23343
420390 20713 marked as 23343
422750 20713 wrongly coded 23343




The product with stock_code 20713 is actually "jumbo bag owls". The other rows carry missing values, notes that the product was incorrectly marked as 23343, and a puzzling "found". Regardless, we can assume the revenue amounts are correct, so we will not change any descriptions or unit prices.
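If a single canonical name per stock code were ever needed, one option (not applied in this notebook) is to keep the most frequent non-null description for each code. A sketch on a toy frame standing in for products:

```python
import pandas as pd

# Toy stand-in for the products frame (not the real data)
df = pd.DataFrame({
    'lowercase_stockcode': ['20713', '20713', '20713', '20713', '71053'],
    'lowercase_description': ['jumbo bag owls', 'jumbo bag owls',
                              'wrongly marked 23343', None,
                              'white metal lantern'],
})

# Most frequent non-null description per stock code
canonical = (df.dropna(subset=['lowercase_description'])
               .groupby('lowercase_stockcode')['lowercase_description']
               .agg(lambda s: s.mode().iloc[0]))

print(canonical.to_dict())
```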

In [24]:
products['lowercase_stockcode'].map(len).value_counts()
Out[24]:
5     4186
6     1405
8       17
12       9
1        4
4        4
7        3
9        3
3        2
2        2
Name: lowercase_stockcode, dtype: int64




According to the data source, an invoice number starting with the letter "C" indicates a cancellation.

The stock_code values also come in different lengths.

In [25]:
# stock_code with 6 characters
products[products['lowercase_stockcode'].map(len) == 6].head()
Out[25]:
lowercase_stockcode lowercase_description
0 85123a white hanging heart t-light holder
2 84406b cream cupid hearts coat hanger
3 84029g knitted union flag hot water bottle
4 84029e red woolly hottie white heart.
60 82494l wooden frame antique white
In [26]:
# stock_code with 5 characters
products[products['lowercase_stockcode'].map(len) == 5].head()
Out[26]:
lowercase_stockcode lowercase_description
1 71053 white metal lantern
5 22752 set 7 babushka nesting boxes
6 21730 glass star frosted t-light holder
7 22633 hand warmer union jack
8 22632 hand warmer red polka dot




It seems that stock_code values with 5 or 6 characters represent standard sales.
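One way to make the "standard sale" pattern explicit is a regular expression: five digits optionally followed by a letter. A sketch; note that the pattern is an assumption read off the samples above, not a rule documented by the data source:

```python
import pandas as pd

# Sample codes taken from the outputs above
codes = pd.Series(['85123a', '71053', 'post', 'd', 'gift_0001_40', '84406b'])

# Five digits, optionally one trailing letter -> assumed "standard" product code
is_standard = codes.str.fullmatch(r'\d{5}[a-z]?')

print(codes[is_standard].tolist())
```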

In [27]:
products['lowercase_stockcode'].map(len).value_counts().index
Out[27]:
Int64Index([5, 6, 8, 12, 1, 4, 7, 9, 3, 2], dtype='int64')
In [28]:
# Create a dataframe with stock_code descriptions
def desc_stockcode(prod):
    
    # stock_code lengths different from 5 and 6
    stockcode_len = [8, 12, 1, 7, 4, 9, 3, 2]
    
    stockcode_len = sorted(stockcode_len)
    
    df_stockcode = (prod[prod['lowercase_stockcode'].map(len) == stockcode_len[0]])
    
    for length in stockcode_len[1:]:
        df_stockcode = pd.concat([df_stockcode,
                                  prod[prod['lowercase_stockcode'].map(len) == length]],
                                  ignore_index=True)
    
    return df_stockcode
                             
desc_stockcode(products)
Out[28]:
lowercase_stockcode lowercase_description
0 d discount
1 m manual
2 s samples
3 b adjust bad debt
4 c2 carriage
5 c2 NaN
6 dot dotcom postage
7 dot NaN
8 post postage
9 post NaN
10 pads pads to match all cushions
11 cruk cruk commission
12 15056bl edwardian parasol black
13 79323gr unsaleable, destroyed.
14 79323lp unsaleable, destroyed.
15 dcgs0076 sunjar led night night light
16 dcgs0003 boxed glass ashtray
17 dcgs0070 camouflage dog collar
18 dcgs0055 NaN
19 dcgs0072 NaN
20 dcgs0074 NaN
21 dcgs0069 ooh la la dogs collar
22 dcgs0057 NaN
23 dcgssboy boys party bag
24 dcgs0004 haynes camper shoulder bag
25 dcgs0073 ebay
26 dcgs0071 NaN
27 dcgs0070 NaN
28 dcgs0069 ebay
29 dcgs0068 ebay
30 dcgs0067 ebay
31 dcgs0003 ebay
32 amazonfee amazon fee
33 dcgssgirl girls party bag
34 dcgs0066p NaN
35 bank charges bank charges
36 gift_0001_40 dotcomgiftshop gift voucher £40.00
37 gift_0001_50 dotcomgiftshop gift voucher £50.00
38 gift_0001_30 dotcomgiftshop gift voucher £30.00
39 gift_0001_20 dotcomgiftshop gift voucher £20.00
40 gift_0001_10 dotcomgiftshop gift voucher £10.00
41 gift_0001_20 to push order througha s stock was
42 gift_0001_30 NaN
43 gift_0001_10 NaN




Analysing the previous dataframe, we see that stock_code can also represent special entries such as discounts, bad debt and gift vouchers.
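To gauge how much these special entries affect revenue, one could sum amount_spent over every code that doesn't look like a product code. A sketch on toy numbers, using an assumed five-digits-plus-optional-letter pattern for regular codes:

```python
import pandas as pd

# Toy stand-in: a few lines with special and regular codes (not the real data)
df = pd.DataFrame({
    'lowercase_stockcode': ['85123a', 'd', 'm', '71053', 'bank charges'],
    'amount_spent': [15.3, -27.5, 224.69, 3.39, 15.0],
})

# Anything that is not "five digits plus optional letter" is treated as special
special = ~df['lowercase_stockcode'].str.fullmatch(r'\d{5}[a-z]?')

# Net contribution of the special entries to the ledger
special_total = df.loc[special, 'amount_spent'].sum()
print(special_total)
```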

In [29]:
df.describe()
Out[29]:
quantity unit_price year month month_day week_day hour amount_spent
count 541909.000000 541909.000000 541909.000000 541909.000000 541909.000000 541909.000000 541909.000000 541909.000000
mean 9.552250 4.611114 2010.921609 7.553128 15.023096 2.431277 13.078729 17.987795
std 218.081158 96.759853 0.268787 3.509055 8.664063 1.844709 2.443270 378.810824
min -80995.000000 -11062.060000 2010.000000 1.000000 1.000000 0.000000 6.000000 -168469.600000
25% 1.000000 1.250000 2011.000000 5.000000 7.000000 1.000000 11.000000 3.400000
50% 3.000000 2.080000 2011.000000 8.000000 15.000000 2.000000 13.000000 9.750000
75% 10.000000 4.130000 2011.000000 11.000000 22.000000 4.000000 15.000000 17.400000
max 80995.000000 38970.000000 2011.000000 12.000000 31.000000 6.000000 20.000000 168469.600000




There are negative values in quantity and unit_price.

In [30]:
len(df[(df['quantity'] < 0)])
Out[30]:
10624
In [31]:
len(df[(df['unit_price'] < 0)])
Out[31]:
2




There are 10624 negative values in quantity and only 2 in unit_price. Let's find out what these values mean.

In [32]:
# Starting with unit_price, because there are only 2 negative values.
df[(df['unit_price'] < 0)]
Out[32]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country year month month_day week_day hour year_month amount_spent
299983 A563186 B Adjust bad debt 1 2011-08-12 14:51:00 -11062.06 NaN United Kingdom b adjust bad debt united kingdom 2011 8 12 4 14 201108 -11062.06
299984 A563187 B Adjust bad debt 1 2011-08-12 14:52:00 -11062.06 NaN United Kingdom b adjust bad debt united kingdom 2011 8 12 4 14 201108 -11062.06




Negative unit_price values are bad debt.

In [33]:
# Checking negative quantity
df[(df['quantity'] < 0)]
Out[33]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country year month month_day week_day hour year_month amount_spent
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527 United Kingdom d discount united kingdom 2010 12 1 2 9 201012 -27.50
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311 United Kingdom 35004c set of 3 coloured flying ducks united kingdom 2010 12 1 2 9 201012 -4.65
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548 United Kingdom 22556 plasters in tin circus parade united kingdom 2010 12 1 2 10 201012 -19.80
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548 United Kingdom 21984 pack of 12 pink paisley tissues united kingdom 2010 12 1 2 10 201012 -6.96
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548 United Kingdom 21983 pack of 12 blue paisley tissues united kingdom 2010 12 1 2 10 201012 -6.96
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
540449 C581490 23144 ZINC T-LIGHT HOLDER STARS SMALL -11 2011-12-09 09:57:00 0.83 14397 United Kingdom 23144 zinc t-light holder stars small united kingdom 2011 12 9 4 9 201112 -9.13
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498 United Kingdom m manual united kingdom 2011 12 9 4 10 201112 -224.69
541715 C581568 21258 VICTORIAN SEWING BOX LARGE -5 2011-12-09 11:57:00 10.95 15311 United Kingdom 21258 victorian sewing box large united kingdom 2011 12 9 4 11 201112 -54.75
541716 C581569 84978 HANGING HEART JAR T-LIGHT HOLDER -1 2011-12-09 11:58:00 1.25 17315 United Kingdom 84978 hanging heart jar t-light holder united kingdom 2011 12 9 4 11 201112 -1.25
541717 C581569 20979 36 PENCILS TUBE RED RETROSPOT -5 2011-12-09 11:58:00 1.25 17315 United Kingdom 20979 36 pencils tube red retrospot united kingdom 2011 12 9 4 11 201112 -6.25

10624 rows × 18 columns

In [34]:
# Checking invoice_num starting character.
df[(df['quantity'] < 0)]['invoice_num'].apply(lambda invoice: invoice[0]).value_counts()
Out[34]:
C    9288
5    1336
Name: invoice_num, dtype: int64
In [35]:
df[(df['quantity'] < 0)].sort_values('invoice_num')
Out[35]:
invoice_num stock_code description quantity invoice_date unit_price customer_id country lowercase_stockcode lowercase_description lowercase_country year month month_day week_day hour year_month amount_spent
2406 536589 21777 NaN -10 2010-12-01 16:50:00 0.00 NaN United Kingdom 21777 NaN united kingdom 2010 12 1 2 16 201012 -0.00
4347 536764 84952C NaN -38 2010-12-02 14:42:00 0.00 NaN United Kingdom 84952c NaN united kingdom 2010 12 2 3 14 201012 -0.00
7188 536996 22712 NaN -20 2010-12-03 15:30:00 0.00 NaN United Kingdom 22712 NaN united kingdom 2010 12 3 4 15 201012 -0.00
7189 536997 22028 NaN -20 2010-12-03 15:30:00 0.00 NaN United Kingdom 22028 NaN united kingdom 2010 12 3 4 15 201012 -0.00
7190 536998 85067 NaN -6 2010-12-03 15:30:00 0.00 NaN United Kingdom 85067 NaN united kingdom 2010 12 3 4 15 201012 -0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
540449 C581490 23144 ZINC T-LIGHT HOLDER STARS SMALL -11 2011-12-09 09:57:00 0.83 14397 United Kingdom 23144 zinc t-light holder stars small united kingdom 2011 12 9 4 9 201112 -9.13
541541 C581499 M Manual -1 2011-12-09 10:28:00 224.69 15498 United Kingdom m manual united kingdom 2011 12 9 4 10 201112 -224.69
541715 C581568 21258 VICTORIAN SEWING BOX LARGE -5 2011-12-09 11:57:00 10.95 15311 United Kingdom 21258 victorian sewing box large united kingdom 2011 12 9 4 11 201112 -54.75
541716 C581569 84978 HANGING HEART JAR T-LIGHT HOLDER -1 2011-12-09 11:58:00 1.25 17315 United Kingdom 84978 hanging heart jar t-light holder united kingdom 2011 12 9 4 11 201112 -1.25
541717 C581569 20979 36 PENCILS TUBE RED RETROSPOT -5 2011-12-09 11:58:00 1.25 17315 United Kingdom 20979 36 pencils tube red retrospot united kingdom 2011 12 9 4 11 201112 -6.25

10624 rows × 18 columns

In [36]:
# How many sales had unit_price equal to 0?
sum(df['unit_price'] == 0)
Out[36]:
2515
In [37]:
# How many invoice_num codes start with "5"?
sum(df[df['unit_price'] == 0]['invoice_num'].apply(lambda invoice: invoice[0] == '5'))
Out[37]:
2515




All invoices with free items have an invoice_num starting with "5".

In [38]:
# Analysing unit_price equal 0
plt.figure(figsize=(12,8))
df[df['unit_price'] == 0]['quantity'].hist(bins=30, color='steelblue')

plt.xticks(fontsize=12)
plt.tight_layout()
In [39]:
# Descriptions of "free items" with positive quantity
df[(df['quantity'] > 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
Out[39]:
check                                  39
found                                  34
amazon                                 18
adjustment                             16
french blue metal door sign 1           9
                                       ..
set of 2 trays home sweet home          1
christmas pudding trinket pot           1
fba                                     1
fridge magnets les enfants assorted     1
childs garden fork pink                 1
Name: lowercase_description, Length: 241, dtype: int64
In [40]:
# Descriptions of "free items" with negative quantity
df[(df['quantity'] < 0) & (df['unit_price'] == 0)]['lowercase_description'].value_counts()
Out[40]:
check                                  123
damaged                                 57
damages                                 46
?                                       41
sold as set on dotcom                   20
                                      ... 
oops ! adjustment                        1
sold in set?                             1
incorrectly credited c550456 see 47      1
lost??                                   1
damaged stock                            1
Name: lowercase_description, Length: 127, dtype: int64




There are positive and negative quantity values among the lines with unit_price equal to 0. As the descriptions show, some items are damaged or were used for display; perhaps others are free items for promotions or events.

Now that we know our data set better, let's do some visualization and answer some questions.
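For questions about regular sales only, one could carve out a view that excludes cancellations and free items. A sketch of such a filter on a toy frame; whether to exclude these lines depends on the question being asked:

```python
import pandas as pd

# Toy stand-in for the notebook's df (not the real data)
df = pd.DataFrame({
    'invoice_num': ['536365', 'C536379', '536996', '536366'],
    'quantity': [6, -1, -20, 2],
    'unit_price': [2.55, 27.50, 0.00, 1.25],
})

# One possible "regular sales" view:
# no cancellations (invoice starting with "C"), no returns, no free items
regular = df[~df['invoice_num'].str.startswith('C')
             & (df['quantity'] > 0)
             & (df['unit_price'] > 0)]

print(regular['invoice_num'].tolist())
```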


In [41]:
plt.figure(figsize=(12,8))

ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')

plt.title('Orders per Country', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)

plt.yticks(fontsize=12)
plt.tight_layout()
In [42]:
plt.figure(figsize=(12,8))

# UK stands out too much on the chart
ax = df.groupby('invoice_num')['lowercase_country'].unique().value_counts().iloc[1:]

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')

plt.title('Orders per Country (without UK)', fontsize=15)
plt.xlabel('Number of Orders', fontsize=13)
plt.ylabel('Country', fontsize=13)

plt.yticks(fontsize=12)
plt.tight_layout()




In [43]:
plt.figure(figsize=(12,8))

ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False)

sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')

plt.title('Revenue per Country', fontsize=15)
plt.xlabel('Amount (millions)', fontsize=13)
plt.ylabel('Country', fontsize=13)

plt.yticks(fontsize=12)
plt.tight_layout()
In [44]:
plt.figure(figsize=(12,8))

# UK stands out too much on the chart
ax = df.groupby('lowercase_country').sum()['amount_spent'].sort_values(ascending=False).iloc[1:]

sns.barplot(x=ax, y=ax.index, palette='magma_r', orient='h')

plt.title('Revenue per Country (without UK)', fontsize=15)
plt.xlabel('Amount', fontsize=13)
plt.ylabel('Country', fontsize=13)

plt.yticks(fontsize=12)
plt.tight_layout()




This simple analysis shows two important points of view:

  • Look at the big picture
  • Explore UK sales separately from sales outside the UK

In [45]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Revenue per Month
ax = df.groupby('year_month').sum()['amount_spent']

sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])

axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)

label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()

axes[0].set_xticklabels(label, rotation=45)

# Orders x Month
ax = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])

axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)

axes[1].set_xticklabels(label, rotation=45)

plt.tight_layout()




Remember that sales occurred between 01/12/2010 and 09/12/2011, so December 2011 only covers sales through the 9th.

Although the behavior of revenue and orders per month tends to be similar, we can see that the average amount spent on each order per month is different. This will be discussed in the Average Ticket section.

January, February and April had the worst sales, apart from the truncated December 2011. On the other hand, September, October and November were the best months: demand for Christmas gifts grows and wholesalers stock up for Black Friday (late November).
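One simple way to quantify the monthly swings is month-over-month growth via pct_change. A sketch on toy numbers standing in for the grouped monthly revenue:

```python
import pandas as pd

# Toy monthly revenue keyed by year_month, standing in for
# df.groupby('year_month')['amount_spent'].sum() (toy numbers, not the real data)
monthly = pd.Series([100.0, 80.0, 90.0, 120.0],
                    index=['201012', '201101', '201102', '201103'])

# Month-over-month growth makes the swings explicit
mom_growth = monthly.pct_change().round(3)

print(mom_growth.to_dict())
```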

In [46]:
# Comparing UK sales to sales outside UK
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Revenue per Month (Only UK)
ax = df[
        df['lowercase_country'] == 'united kingdom'
       ].groupby('year_month').sum()['amount_spent']

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[0])

# Revenue per Month (Without UK)
ax = df[
        df['lowercase_country'] != 'united kingdom'
       ].groupby('year_month').sum()['amount_spent']

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[0])

axes[0].set_title('Revenue per Month', fontsize=15)
axes[0].set_xlabel('Month', fontsize=13)
axes[0].set_ylabel('Amount (millions)', fontsize=13)

label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()

axes[0].set_xticklabels(label, rotation=45)

axes[0].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

# Orders x Month (Only UK)
ax = df[
        df['lowercase_country'] == 'united kingdom'
       ].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes[1])

# Orders x Month (Without UK)
ax = df[
        df['lowercase_country'] != 'united kingdom'
       ].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes[1])

axes[1].set_title('Orders per Month', fontsize=15)
axes[1].set_xlabel('Month', fontsize=13)
axes[1].set_ylabel('Number of Orders', fontsize=13)

axes[1].set_xticklabels(label, rotation=45)

axes[1].legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

plt.tight_layout()




Sales outside the UK appear to be more stable. Although the "Without UK" curve sits much lower than "UK Only", it looks smoother over the months. Christmas demand is more pronounced inside the UK than outside it.
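The impression of stability can be quantified with the coefficient of variation (standard deviation over mean) of the monthly revenue: the lower it is, the smoother the series. A sketch on toy numbers standing in for the two grouped series:

```python
import pandas as pd

# Toy monthly revenue for the two segments (toy numbers, not the real data)
uk = pd.Series([100.0, 140.0, 60.0, 180.0])
abroad = pd.Series([30.0, 32.0, 28.0, 34.0])

# Coefficient of variation: std / mean (lower -> smoother series)
def cv(s):
    return s.std() / s.mean()

print(cv(uk), cv(abroad))
```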

In [47]:
plt.figure(figsize=(10,6))

ax1 = df.groupby('year_month')['amount_spent'].sum()
ax2 = df.groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2

sns.barplot(x=ax.index, y=ax, color='steelblue')

plt.title('Average Ticket', fontsize=15)
plt.xlabel('Month', fontsize=13)
plt.ylabel('Amount', fontsize=13)

label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()

plt.xticks(ticks=range(len(label)), labels=label, rotation=45, fontsize=12)
plt.tight_layout()
In [48]:
# Comparing UK Average Ticket to Average Ticket outside UK
fig, axes = plt.subplots(figsize=(12, 6))

# Average Ticket (UK only)
ax1 = df[
         df['lowercase_country'] == 'united kingdom'
        ].groupby('year_month')['amount_spent'].sum()

ax2 = df[
         df['lowercase_country'] == 'united kingdom'
        ].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)

label = df['year_month'].apply(lambda date: date[4:] + '_' + date[:4]).unique()

# Average Ticket (without UK)
ax1 = df[
         df['lowercase_country'] != 'united kingdom'
        ].groupby('year_month')['amount_spent'].sum()

ax2 = df[
         df['lowercase_country'] != 'united kingdom'
        ].groupby('invoice_num')['year_month'].unique().value_counts().sort_index()

# Removing the index brackets
ax2.index = [index[0] for index in ax2.index]
ax = ax1 / ax2

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)

axes.set_title('Average Ticket', fontsize=15)
axes.set_xlabel('Month', fontsize=13)
axes.set_ylabel('Amount', fontsize=13)

axes.set_xticklabels(label, rotation=45)

axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

plt.tight_layout()




Although revenue and the number of orders in the UK are higher than abroad, the opposite is true for the average ticket. This indicates that the amount spent per order outside the UK is higher.
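The monthly ratio plotted above can also be condensed into a single figure per region. A minimal sketch, assuming the `amount_spent` and `invoice_num` columns built earlier:

```python
import pandas as pd

def average_ticket(frame: pd.DataFrame) -> float:
    """Total revenue divided by the number of distinct invoices."""
    return frame['amount_spent'].sum() / frame['invoice_num'].nunique()

# In the notebook:
# average_ticket(df[df['lowercase_country'] == 'united kingdom'])
# average_ticket(df[df['lowercase_country'] != 'united kingdom'])
```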

Looking at all the sales over the months, we can see a significant drop in April.

There are some key dates that drive retail, and April falls right between them:

  • February: Valentine's Day

  • May: Mother's Day

  • June: Father's Day

  • August/September: Back-to-school (most countries are in the northern hemisphere)

  • October: Halloween

  • November: Black Friday

  • December: Christmas

In [49]:
plt.figure(figsize=(10,6))

ax = df.groupby('invoice_num')['month_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax.index, y=ax, color='steelblue')

plt.title('Orders per Day', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)

plt.xticks(fontsize=12)
plt.tight_layout()
In [50]:
# Comparing UK orders to orders outside UK per day
fig, axes = plt.subplots(figsize=(10, 6))

# Order per day (UK only)
ax = df[
        df['lowercase_country'] == 'united kingdom'
       ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)

# Order per day (without UK)
ax = df[
        df['lowercase_country'] != 'united kingdom'
       ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)

axes.set_title('Orders per Day', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)

axes.tick_params(axis="x", labelsize=12)

axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

plt.tight_layout()




There is substantial demand between the 5th and the 10th, but it is not enough to state the trend with certainty.

For this analysis, I will use a polynomial regression approach, treating the number of orders per day as a series. The goal is just to plot a trend line.

In [51]:
fig, axes = plt.subplots(2, 2, figsize=(10, 7))

for i in range(4):
    
    # Where each subplot will be
    if i < 2:
        plot = (0,i)
    else:
        plot = (1,(i-2))

    # Order per day (UK only)
    y = df[
           df['lowercase_country'] == 'united kingdom'
          ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
    
    # Removing the index brackets
    x = [index[0] for index in y.index]
    
    sns.lineplot(x=x, y=y, color='orange', ax=axes[plot])
    
    # transforming the data to include another axis
    x = np.array(x)[:, np.newaxis]
    y = y.to_numpy()[:, np.newaxis]  # Series don't support positional np.newaxis indexing
    
    polynomial_features = PolynomialFeatures(degree=(i+1))
    x_poly = polynomial_features.fit_transform(x)
    
    model = LinearRegression()
    model.fit(x_poly, y)
    y_poly_pred = model.predict(x_poly)
    
    axes[plot].plot(x, y_poly_pred, 'r--')
    
    axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
    axes[plot].set_ylabel('')
    
    axes[plot].set_ylim(ymin=0)

plt.tight_layout()




The dashed red line shows the trend. The second- and third-degree fits are almost the same. I will use the third degree; higher degrees would just overfit.
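The degree choice can also be backed with a number rather than just the eye: compare the in-sample RMSE per degree and check that the gain from degree 3 onwards is marginal. A minimal sketch, assuming `x` and `y` are shaped as in the cell above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

def poly_rmse(x, y, degree):
    """In-sample RMSE of a polynomial trend of the given degree."""
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    return np.sqrt(mean_squared_error(y, model.predict(x_poly)))

# e.g. for the day-of-month series above:
# for d in range(1, 5):
#     print(d, poly_rmse(x, y, d))
```

Note this is fit-set error, so it can only decrease with degree; the point is to see where the decrease flattens out.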

In [52]:
fig, axes = plt.subplots(figsize=(10, 6))

# Order per day (UK only)
y = df[
       df['lowercase_country'] == 'united kingdom'
      ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()

# Removing the index brackets
x = [index[0] for index in y.index]

sns.lineplot(x=x, y=y, color='orange', ax=axes)

# transforming the data to include another axis
x = np.array(x)[:, np.newaxis]
y = y.to_numpy()[:, np.newaxis]  # Series don't support positional np.newaxis indexing

polynomial_features = PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)

axes.plot(x, y_poly_pred, 'r--')

axes.set_title('Orders per Day (UK Only)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)

axes.tick_params(axis="x", labelsize=12)

axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)

plt.ylim(ymin=0)

plt.tight_layout()




In the UK, the number of orders starts high at the beginning of the month and falls over the days.

In [53]:
fig, axes = plt.subplots(2, 2, figsize=(10, 7))

for i in range(4):
    
    # Where each subplot will be
    if i < 2:
        plot = (0,i)
    else:
        plot = (1,(i-2))

    # Order per day (without UK)
    y = df[
           df['lowercase_country'] != 'united kingdom'
          ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()
    
    # Removing the index brackets
    x = [index[0] for index in y.index]
    
    sns.lineplot(x=x, y=y, color='steelblue', ax=axes[plot])
    
    # transforming the data to include another axis
    x = np.array(x)[:, np.newaxis]
    y = y.to_numpy()[:, np.newaxis]  # Series don't support positional np.newaxis indexing
    
    polynomial_features = PolynomialFeatures(degree=(i+1))
    x_poly = polynomial_features.fit_transform(x)
    
    model = LinearRegression()
    model.fit(x_poly, y)
    y_poly_pred = model.predict(x_poly)
    
    axes[plot].plot(x, y_poly_pred, 'r--')
    
    axes[plot].set_xlabel('Degree ' + str(i+1), fontsize=13)
    axes[plot].set_ylabel('')
    
    axes[plot].set_ylim(ymin=0)

plt.tight_layout()




The third-degree polynomial fits the number of orders per day best here as well.

In [54]:
fig, axes = plt.subplots(figsize=(10, 6))

# Order per day (without UK)
y = df[
       df['lowercase_country'] != 'united kingdom'
      ].groupby('invoice_num')['month_day'].unique().value_counts().sort_index()

# Removing the index brackets
x = [index[0] for index in y.index]

sns.lineplot(x=x, y=y, color='steelblue', ax=axes)

# transforming the data to include another axis
x = np.array(x)[:, np.newaxis]
y = y.to_numpy()[:, np.newaxis]  # Series don't support positional np.newaxis indexing

polynomial_features = PolynomialFeatures(degree=3)
x_poly = polynomial_features.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)

axes.plot(x, y_poly_pred, 'r--')

axes.set_title('Orders per Day (Without UK)', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)

axes.tick_params(axis="x", labelsize=12)

axes.legend(['Orders', 'Trend Line'], fontsize=12, framealpha=1)

plt.ylim(ymin=0)

plt.tight_layout()




Outside the UK, we can see that the number of orders remains stable until the 20th and then starts to fall.

Although a third-degree polynomial cannot fully express the sales forecast, it did a good job helping us understand the chaotic behavior of orders per day.

In [55]:
plt.figure(figsize=(10,6))

ax = df.groupby('invoice_num')['week_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax.index, y=ax, color='steelblue')

plt.title('Orders per Day of Week', fontsize=15)
plt.xlabel('Day', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)

# WeekDay goes from 0-Monday to 6-Sunday
label = ['Mon','Tue','Wed','Thur','Fri','Sun']
tick = range(len(label))

plt.xticks(ticks=tick, labels=label, fontsize=12)
plt.tight_layout()
In [56]:
# Comparing UK orders to orders outside UK per day of week
fig, axes = plt.subplots(figsize=(10, 6))

# Order per day of week (UK only)
ax = df[
        df['lowercase_country'] == 'united kingdom'
       ].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)

# Order per day of week (without UK)
ax = df[
        df['lowercase_country'] != 'united kingdom'
       ].groupby('invoice_num')['week_day'].unique().value_counts().sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)

axes.set_title('Orders per Day of Week', fontsize=15)
axes.set_xlabel('Day', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)

label = ['','Mon','Tue','Wed','Thur','Fri', 'Sat', 'Sun']

axes.set_xticklabels(label, fontsize=12)

axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

plt.tight_layout()




The peak in the number of orders is on Thursday. Although Saturday appears on the chart, there are no recorded sales on that day; the line simply interpolates between Friday and Sunday.
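The "no Saturday sales" observation is easy to sanity-check directly from the data. A minimal sketch, assuming the `week_day` column built earlier (0 = Monday ... 6 = Sunday):

```python
import pandas as pd

def missing_weekdays(frame: pd.DataFrame) -> set:
    """Weekday numbers (0 = Monday ... 6 = Sunday) with no recorded orders."""
    return set(range(7)) - set(frame['week_day'].unique())

# In the notebook, missing_weekdays(df) should come back as {5}, i.e. Saturday.
```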

In [57]:
plt.figure(figsize=(10,6))

ax = df.groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.barplot(x=ax.index, y=ax, color='steelblue')

plt.title('Orders per Hour', fontsize=15)
plt.xlabel('Hour', fontsize=13)
plt.ylabel('Number of Orders', fontsize=13)

plt.xticks(fontsize=12)
plt.tight_layout()




Sales are concentrated around noon, indicating a preference for buying at lunchtime.


In [58]:
# Comparing UK orders to orders outside UK per hour
fig, axes = plt.subplots(figsize=(10, 6))

# Order per hour (UK only)
ax = df[
        df['lowercase_country'] == 'united kingdom'
       ].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='orange', ax=axes)

# Order per hour (without UK)
ax = df[
        df['lowercase_country'] != 'united kingdom'
       ].groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index()

# Removing the index brackets
ax.index = [index[0] for index in ax.index]

sns.lineplot(x=ax.index, y=ax, color='steelblue', ax=axes)

axes.set_title('Orders per Hour', fontsize=15)
axes.set_xlabel('Hour', fontsize=13)
axes.set_ylabel('Number of Orders', fontsize=13)

axes.tick_params(axis="x", labelsize=12)

axes.legend(['UK Only', 'Without UK'], fontsize=12, framealpha=1)

plt.tight_layout()




Sales in the UK take place between 6 am and 8 pm, while external sales take place between 7 am and 7 pm.
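The trading-hour ranges quoted above can be read straight off the data. A minimal sketch, assuming the `hour` column built earlier:

```python
import pandas as pd

def trading_hours(frame: pd.DataFrame) -> tuple:
    """Earliest and latest hour with at least one recorded order."""
    hours = frame['hour'].astype(int)
    return int(hours.min()), int(hours.max())

# In the notebook, per the text above, one would expect roughly:
# trading_hours(df[df['lowercase_country'] == 'united kingdom'])  # UK
# trading_hours(df[df['lowercase_country'] != 'united kingdom'])  # abroad
```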



These are the 20 most frequent stock codes and how much they represent in revenue (as a percentage).

In [59]:
common_codes = []

for code in df['lowercase_stockcode'].value_counts()[:20].index:
    
    common_codes.append((code,
                         products[
                                  (products[~products.isna()]['lowercase_stockcode'] == code)
                                 ]['lowercase_description'].iloc[0],
                         df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
                         df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
    

pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
Out[59]:
lowercase_stockcode description amount percentage
0 85123a white hanging heart t-light holder 99846.98 1.024308
1 22423 regency cakestand 3 tier 164762.19 1.690259
2 85099b jumbo bag red retrospot 92356.03 0.947460
3 47566 party bunting 98302.98 1.008469
4 20725 lunch bag red retrospot 35187.31 0.360979
5 84879 assorted colour bird ornament 58959.73 0.604855
6 22720 set of 3 cake tins pantry design 37413.44 0.383816
7 22197 small popcorn holder 50987.47 0.523069
8 21212 pack of 72 retrospot cake cases 21059.72 0.216047
9 20727 lunch bag black skull. 22219.01 0.227940
10 22383 lunch bag suki design 22318.64 0.228962
11 22457 natural slate heart chalkboard 27995.36 0.287198
12 23203 mailout 40991.38 0.420522
13 post postage 66230.64 0.679446
14 22386 jumbo bag pink polkadot 41619.66 0.426967
15 22469 heart of wicker small 31899.85 0.327254
16 22960 jam making set with jars 36116.09 0.370507
17 21931 jumbo storage bag suki 30167.79 0.309485
18 22086 paper chain kit 50's christmas 63791.94 0.654427
19 22411 jumbo shopper vintage red paisley 28012.48 0.287374
In [60]:
fig, axes = plt.subplots(2, 1, figsize=(12, 12))

# The 20 most common stock codes
ax = df['lowercase_stockcode'].value_counts()[:20]

sns.barplot(x=ax.index, y=ax, 
            palette=sns.cubehelix_palette(20, start=.5, rot=-.75),
            ax=axes[0])

axes[0].set_title('Most Common Stock Codes', fontsize=15)
axes[0].set_xlabel('Stock Code', fontsize=13)
axes[0].set_ylabel('Quantity', fontsize=13)

axes[0].tick_params(axis="x", labelsize=12)

# Percentage of revenue (20 most common stock codes)
common_codes = []

for code in df['lowercase_stockcode'].value_counts()[:20].index:
    
    common_codes.append((code,
                         df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
    

ax = pd.DataFrame(common_codes, columns=['lowercase_stockcode', 'percentage'])

sns.barplot(x='lowercase_stockcode', y='percentage', data=ax,
            palette=sns.cubehelix_palette(20),
            ax=axes[1])

axes[1].set_title('Percentage of Revenue', fontsize=15)
axes[1].set_xlabel('Stock Code', fontsize=13)
axes[1].set_ylabel('Percentage', fontsize=13)

axes[1].tick_params(axis="x", labelsize=12)

plt.tight_layout()




Being the most frequent does not necessarily mean representing a large share of the revenue. The product 85123A - "white hanging heart t-light holder" is the best seller yet represents roughly 1.0% of the revenue, while the runner-up, 22423 - "regency cakestand 3 tier", represents more than 1.6%.




The top 20 stock codes that contribute the most to revenue:


In [61]:
top_revenue = []

for code in df.groupby(
                       'lowercase_stockcode'
                      )[
                        'amount_spent'
                       ].sum().sort_values(ascending=False)[:20].index:
    
    top_revenue.append((code,
                        products[
                                 (products[~products.isna()]['lowercase_stockcode'] == code)
                                ]['lowercase_description'].iloc[0],
                        df[df['lowercase_stockcode'] == code]['amount_spent'].sum(),
                        df[df['lowercase_stockcode'] == code]['amount_spent'].sum() / df['amount_spent'].sum() * 100))
    

pd.DataFrame(top_revenue, columns=['lowercase_stockcode', 'description', 'amount', 'percentage'])
Out[61]:
lowercase_stockcode description amount percentage
0 dot dotcom postage 206245.48 2.115827
1 22423 regency cakestand 3 tier 164762.19 1.690259
2 85123a white hanging heart t-light holder 99846.98 1.024308
3 47566 party bunting 98302.98 1.008469
4 85099b jumbo bag red retrospot 92356.03 0.947460
5 23084 rabbit night light 66756.59 0.684841
6 post postage 66230.64 0.679446
7 22086 paper chain kit 50's christmas 63791.94 0.654427
8 84879 assorted colour bird ornament 58959.73 0.604855
9 79321 chilli lights 53768.06 0.551595
10 22502 picnic basket wicker small 51041.37 0.523622
11 22197 small popcorn holder 50987.47 0.523069
12 23298 spotty bunting 42700.02 0.438050
13 22386 jumbo bag pink polkadot 41619.66 0.426967
14 23203 mailout 40991.38 0.420522
15 21137 black record cover frame 40596.96 0.416475
16 22720 set of 3 cake tins pantry design 37413.44 0.383816
17 23284 doormat keep calm and come in 36565.39 0.375116
18 22960 jam making set with jars 36116.09 0.370507
19 82484 wood black board ant white finish 35859.27 0.367872
In [62]:
plt.figure(figsize=(12,6))

# top 20 stock codes in revenue
ax = df.groupby(
                'lowercase_stockcode'
               )[
                 'amount_spent'
                ].sum().sort_values(ascending=False)[:20] / df['amount_spent'].sum() * 100

sns.barplot(x=ax.index, y=ax, palette='GnBu')

plt.title('Percentage of Revenue', fontsize=15)
plt.xlabel('Stock Code', fontsize=13)
plt.ylabel('Percentage', fontsize=13)

plt.xticks(fontsize=12)
plt.tight_layout()




DOT means "dotcom postage"; that is, it is not a product but a service. The real top product is stock code 22423 - "regency cakestand 3 tier", which stands out from the others.






We previously performed only a superficial analysis of stock codes. It's time to clean up this mess, separating actual products from other operations. That will make it possible to group products into categories.

In [63]:
products
Out[63]:
lowercase_stockcode lowercase_description
0 85123a white hanging heart t-light holder
1 71053 white metal lantern
2 84406b cream cupid hearts coat hanger
3 84029g knitted union flag hot water bottle
4 84029e red woolly hottie white heart.
... ... ...
536908 23090 missing
537621 85123a cream hanging heart t-light holder
538554 85175 NaN
538919 23169 smashed
540421 23843 paper craft , little birdie

5635 rows × 2 columns



Remember that we interpreted stock codes with lengths of 5 and 6 as products.
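For reference, a helper like `desc_stockcode` (defined earlier in the notebook; this is only a sketch of the idea, the real implementation may differ) keeps the rows whose stock code length is not 5 or 6, i.e. the likely non-products:

```python
import pandas as pd

def desc_stockcode_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Rows whose stock code length differs from 5 and 6 (likely not products)."""
    mask = ~frame['lowercase_stockcode'].str.len().isin([5, 6])
    return frame[mask].drop_duplicates().reset_index(drop=True)
```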

In [64]:
# stock_code lengths different from 5 and 6
desc_stockcode(products).dropna()
Out[64]:
lowercase_stockcode lowercase_description
0 d discount
1 m manual
2 s samples
3 b adjust bad debt
4 c2 carriage
6 dot dotcom postage
8 post postage
10 pads pads to match all cushions
11 cruk cruk commission
12 15056bl edwardian parasol black
13 79323gr unsaleable, destroyed.
14 79323lp unsaleable, destroyed.
15 dcgs0076 sunjar led night night light
16 dcgs0003 boxed glass ashtray
17 dcgs0070 camouflage dog collar
21 dcgs0069 ooh la la dogs collar
23 dcgssboy boys party bag
24 dcgs0004 haynes camper shoulder bag
25 dcgs0073 ebay
28 dcgs0069 ebay
29 dcgs0068 ebay
30 dcgs0067 ebay
31 dcgs0003 ebay
32 amazonfee amazon fee
33 dcgssgirl girls party bag
35 bank charges bank charges
36 gift_0001_40 dotcomgiftshop gift voucher £40.00
37 gift_0001_50 dotcomgiftshop gift voucher £50.00
38 gift_0001_30 dotcomgiftshop gift voucher £30.00
39 gift_0001_20 dotcomgiftshop gift voucher £20.00
40 gift_0001_10 dotcomgiftshop gift voucher £10.00
41 gift_0001_20 to push order througha s stock was



Of all the stock codes above, only these represent products:

  • 15056bl - edwardian parasol black
  • dcgs0076 - sunjar led night night light
  • dcgs0003 - boxed glass ashtray
  • dcgs0070 - camouflage dog collar
  • dcgs0069 - ooh la la dogs collar
  • dcgssboy - boys party bag
  • dcgs0004 - haynes camper shoulder bag
  • dcgssgirl - girls party bag

I will drop the remaining stock codes.

In [65]:
products_clean = products.dropna().copy()  # copy to avoid SettingWithCopyWarning later
products_clean
Out[65]:
lowercase_stockcode lowercase_description
0 85123a white hanging heart t-light holder
1 71053 white metal lantern
2 84406b cream cupid hearts coat hanger
3 84029g knitted union flag hot water bottle
4 84029e red woolly hottie white heart.
... ... ...
535336 22576 check
536908 23090 missing
537621 85123a cream hanging heart t-light holder
538919 23169 smashed
540421 23843 paper craft , little birdie

4675 rows × 2 columns

In [66]:
# List of stock codes to drop (not a product)
codesToDrop = ['d', 'm', 's', 'b', 'c2', 'dot', 'post', 'pads', 'cruk',
               '79323gr', '79323lp', 'dcgs0073', 'dcgs0069', 'dcgs0068',
               'dcgs0067', 'dcgs0003', 'amazonfee', 'bank charges',
               'gift_0001_40', 'gift_0001_50', 'gift_0001_30', 'gift_0001_20',
               'gift_0001_10', 'gift_0001_20']

products_clean = products_clean[~products_clean['lowercase_stockcode'].isin(codesToDrop)].copy()
In [67]:
products_clean
Out[67]:
lowercase_stockcode lowercase_description
0 85123a white hanging heart t-light holder
1 71053 white metal lantern
2 84406b cream cupid hearts coat hanger
3 84029g knitted union flag hot water bottle
4 84029e red woolly hottie white heart.
... ... ...
535336 22576 check
536908 23090 missing
537621 85123a cream hanging heart t-light holder
538919 23169 smashed
540421 23843 paper craft , little birdie

4649 rows × 2 columns



We need to remove unnecessary elements from the product descriptions before our analysis.

Punctuation, color names and numbers will not be very effective when we try to correlate products by their descriptions. Suppose one product is called "Action Figure: Red Swamp Monster" and another "15cm Action Figure". Between these two descriptions, what matters is that both are action figures, not the colon, the color or the size.

Removing stopwords will eliminate generic words that are common to many descriptions, and stemming will let us group variations of words that share a root and meaning.

In [68]:
# Stemming
ps = PorterStemmer()

def cleanText(text):
    # Stripping punctuation from the string
    # (str.translate applies the character mapping at C speed)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Stripping numbers
    text = ''.join([i for i in text if not i.isdigit()])
    # Removing color names
    text = ' '.join([i for i in text.split() if(i not in CSS3_NAMES_TO_HEX)])
    text = text.strip().split()
    # Removing stopwords and stemming
    text = [ps.stem(word) for word in text if(word not in stopwords.words('english'))]
    
    return ' '.join(text)
In [69]:
products_clean['cleanned_description'] = products_clean['lowercase_description'].apply(cleanText)



Tokenizing lets us count how many words each description has and search for products containing a specific word.

In [70]:
products_clean['tokenized_description'] = products_clean['cleanned_description'].apply(lambda text: text.split())
In [71]:
products_clean
Out[71]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
0 85123a white hanging heart t-light holder hang heart tlight holder [hang, heart, tlight, holder]
1 71053 white metal lantern metal lantern [metal, lantern]
2 84406b cream cupid hearts coat hanger cream cupid heart coat hanger [cream, cupid, heart, coat, hanger]
3 84029g knitted union flag hot water bottle knit union flag hot water bottl [knit, union, flag, hot, water, bottl]
4 84029e red woolly hottie white heart. woolli hotti heart [woolli, hotti, heart]
... ... ... ... ...
535336 22576 check check [check]
536908 23090 missing miss [miss]
537621 85123a cream hanging heart t-light holder cream hang heart tlight holder [cream, hang, heart, tlight, holder]
538919 23169 smashed smash [smash]
540421 23843 paper craft , little birdie paper craft littl birdi [paper, craft, littl, birdi]

4649 rows × 4 columns



The distribution of the number of words per description can help separate real products from dirty data.

In [72]:
plt.figure(figsize=(10,6))

products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')

plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()



The descriptions with 0, 1 or 2 words hint that there are "defective descriptions" among them.

In [73]:
# Wordless descriptions
products_clean[
               products_clean['tokenized_description'].map(len) == 0
              ].head()
Out[73]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
7313 21275 ? []
21518 20956 ? []
38261 21479 ? []
43662 22837 ? []
50806 21258 ? []



Wordless descriptions are just numbers and punctuation. Taking stock code 21275 at random, we can see there is a real description associated with this "noise".

In [74]:
products_clean[products_clean['lowercase_stockcode'] == '21275']
Out[74]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
7313 21275 ? []
17390 21275 zinc top 2 door wooden shelf zinc top door wooden shelf [zinc, top, door, wooden, shelf]
In [75]:
# Removing wordless descriptions
products_clean = products_clean[products_clean['tokenized_description'].map(len) != 0]
In [76]:
products_clean[products_clean['tokenized_description'].map(len) == 0]
Out[76]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
In [77]:
# Description with 1 word
products_clean[
               products_clean['tokenized_description'].map(len) == 1
              ].head()
Out[77]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
144 21832 chocolate calculator calcul [calcul]
1010 71406c black orange squeezer squeezer [squeezer]
6391 22734 amazon amazon [amazon]
6392 22139 amazon amazon [amazon]
8591 62018 sombrero sombrero [sombrero]
In [78]:
# Removing duplicate descriptions
products_clean[
               products_clean['tokenized_description'].map(len) == 1
              ].drop_duplicates(subset=['cleanned_description'])
Out[78]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
144 21832 chocolate calculator calcul [calcul]
1010 71406c black orange squeezer squeezer [squeezer]
6391 22734 amazon amazon [amazon]
8591 62018 sombrero sombrero [sombrero]
12953 21336 gold washbag washbag [washbag]
13217 84968f check check [check]
13264 35833g damages damag [damag]
21338 22423 faulty faulti [faulti]
39047 85135b found found [found]
55189 84050 counted count [count]
65077 46000s dotcom dotcom [dotcom]
73589 85161 samples/damages samplesdamag [samplesdamag]
107041 21769 showroom showroom [showroom]
112036 20681 mia mia [mia]
112219 21736 adjustment adjust [adjust]
115805 20893 damages/display damagesdisplay [damagesdisplay]
115813 21320 broken broken [broken]
128876 22459 ?lost lost [lost]
140281 82494l cracked crack [crack]
140283 22719 sold as 22467 sold [sold]
165021 84763 display display [display]
165448 22458 missing miss [miss]
166646 84531a returned return [return]
170542 84228 crushed crush [crush]
174044 21656 samples sampl [sampl]
177705 23200 mailout mailout [mailout]
192289 15058a wet/rusty wetrusti [wetrusti]
192293 37370 damages/dotcom? damagesdotcom [damagesdotcom]
201488 22700 smashed smash [smash]
270339 82583 fba fba [fba]
285726 84352 damages/samples damagessampl [damagessampl]
332404 22355 test test [test]
355423 84422 dagamed dagam [dagam]
384867 21090 wet/mouldy wetmouldi [wetmouldi]
394643 23322 mouldy mouldi [mouldi]
417108 23595 re-adjustment readjust [readjust]
418065 23118 breakages breakag [breakag]
420390 20713 marked as 23343 mark [mark]
476866 72817 dotcomstock dotcomstock [dotcomstock]
514649 84859c wet? wet [wet]
535329 21693 mixed up mix [mix]



Among the list above, stock code 62018 - "sombrero" is the only entry whose full description is a single-word product name; "calcul", "squeezer" and "washbag" also belong to real products (their descriptions were merely shortened by the cleaning), while the remaining entries are not products at all.

In [79]:
# List of descriptions to drop (not a product)
descToDrop = ['amazon', 'check', 'damag', 'faulti', 'found', 'count',
              'dotcom', 'samplesdamag', 'showroom', 'mia', 'adjust',
              'damagesdisplay', 'broken', 'lost', 'crack', 'sold',
              'display', 'miss', 'return', 'crush', 'sampl', 'mailout',
              'wetrusti', 'damagesdotcom', 'smash', 'fba', 'damagessampl',
              'test', 'dagam', 'wetmouldi', 'mouldi', 'readjust', 'breakag',
              'mark', 'dotcomstock', 'wet', 'mix']

products_clean = products_clean[~products_clean['cleanned_description'].isin(descToDrop)]
In [80]:
products_clean[
               products_clean['tokenized_description'].map(len) == 1
              ].drop_duplicates(subset=['cleanned_description'])
Out[80]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
144 21832 chocolate calculator calcul [calcul]
1010 71406c black orange squeezer squeezer [squeezer]
8591 62018 sombrero sombrero [sombrero]
12953 21336 gold washbag washbag [washbag]
In [121]:
# Description with 2 words
print(
products_clean[
               products_clean['tokenized_description'].map(len) == 2
              ]['cleanned_description'].value_counts().index.to_list()
      )
['sweetheart bracelet', 'feltcraft hairband', 'paper parasol', 'diamant necklac', 'vintag ear', 'edwardian parasol', 'kitchen scale', 'cherri light', 'refectori clock', 'shark helicopt', 'space cadet', 'thrown away', 'dragonfli helicopt', 'polkadot bowl', 'disco handbag', 'stonecryst ear', 'polkadot plate', 'fli swat', 'polkadot wrap', 'wicker star', 'frangipani necklac', 'fish gnome', 'teacoffeesugar jar', 'ubolight triobas', 'polkadot cup', 'polkadot beaker', 'floral monster', 'photo frame', 'ninja rabbit', 'cat bowl', 'geisha girl', 'henchick basket', 'babi bunt', 'teddi bear', 'birdhous decor', 'dog bowl', 'crystalglass bracelet', 'breakfast tray', 'gemston bracelet', 'fabric mirror', 'pud spoon', 'stand gnome', 'egg spoon', 'aperitif glass', 'rose washbag', 'frost base', 'parti bag', 'tumbler baroqu', 'tlight set', 'wrap carousel', 'champagn glass', 'rosebud ring', 'mug dotcomgiftshopcom', 'bendi straw', 'retrospot bowl', 'doormat flock', 'classic frame', 'string curtain', 'bag stone', 'basket toadstool', 'chrysanthemum journal', 'feather penlight', 'riviera necklac', 'flower handbag', 'chrysanthemum sketchbook', 'feather curtain', 'lipstick pen', 'hole punch', 'doormat heart', 'paisley journal', 'water bunni', 'frangipani hairclip', 'wrap pear', 'hand sign', 'box ribbon', 'newspap stand', 'carniv bracelet', 'lovebird lantern', 'flower sticker', 'sweeti sticker', 'retrospot purs', 'toast dinosaur', 'riviera handbag', 'frog candl', 'knit hen', 'daisi sketchbook', 'calcul ruler', 'heart calcul', 'parti sunglass', 'orbit bracelet', 'purs heart', 'mirror cornic', 'chambr hook', 'paisley notebook', 'retrospot lamp', 'necklacebracelet butterfli', 'capiz chandeli', 'skull sticker', 'lariat cm', 'woodland sticker', 'vintag mug', 'butterfli sticker', 'glass candlehold', 'cinderella chandeli', 'plasmatron lamp', 'water dinosaur', 'daisi notebook', 'doormat airmail', 'space owl', 'water eleph', 'harmonica box', 'grow plant', 'tile tray', 'fern journal', 'bathroom hook', 
'button box', 'retrospot plate', 'fern notebook', 'candlepot jardin', 'photo cube', 'votiv candl', 'dolphin windmil', 'wrap flock', 'urban ribbon', 'retrospot tape', 'retrospot mug', 'flamingo light', 'roccoco chandeli', 'ribbon empir', 'cinammon wreath', 'bingo set', 'wrap cowboy', 'sandalwood fan', 'dog cannist', 'glassshellpearl necklac', 'cat plate', 'flower mug', 'parti bunt', 'throw away', 'toybox wrap', 'grand chocolatecandl', 'glamor mug', 'feather penhot', 'polkadot pen', 'doormat friendship', 'embroid quilt', 'congratul bunt', 'tile hook', 'doggi rubber', 'pictur domino', 'rocket balloon', 'rain poncho', 'wrap appl', 'retrospot wrap', 'spaceboy beaker', 'potter mug', 'feather pencoal', 'owl doorstop', 'space frog', 'spotti bunt', 'cacti candl', 'photo album', 'easter hensflow', 'polkadot mug', 'daisi journal', 'retrospot apron', 'sage incens', 'paisley sketchbook', 'doormat topiari', 'wine goblet', 'pinkamethystgold necklac', 'chrysanthemum notebook', 'doormat retrospot', 'tissu ream', 'retrospot umbrella', 'butterfli washbag', 'heart handbag', 'chilli light', 'strawberri handbag', 'thrown awayrusti', 'daisi mobil', 'pompom curtain', 'retrospot washbag', 'ribbon purs', 'fern sketchbook', 'look mirror', 'skull tape', 'toilett bottl', 'paint rabbit', 'john lewi', 'toast love', 'christma tree', 'voil lampshad', 'damagescredit aso', 'anim sticker', 'polkadot purs', 'pad mobil', 'retrospot cup', 'pack charg', 'fan frame', 'string butterfliespink', 'wrap christma', 'sweeti necklac', 'incens tin', 'wheelbarrow children', 'decoupagegreet card', 'heart sticker']



Based on the words that indicate the stock code does not refer to a product, we can build a list and drop all records containing those words.

In [82]:
# Returns True if at least one word from the description is in the list
def isInDesc(desc, wordsList):
    return any(word in wordsList for word in desc)
In [120]:
# Listing descriptions that contain words suggesting the stock code is not a product
print(
products_clean[products_clean[
                              'tokenized_description'
                             ].apply(lambda desc: isInDesc(desc, descToDrop))
              ]['cleanned_description'].to_list()
      )
['check cat doorstop', 'robot mug display box', 'set tone egg shape mix bowl', 'set colour mix bowl', 'cruso check lampshad', 'pair pad hanger check', 'check bag w handl xcm', 'small dolli mix design bowl', 'mix nut light bowl', 'sunset check hammock', 'crystal stud ear clear display']
In [84]:
# Words chosen from the list above
wordsToDrop = ['dotcom', 'amazon', 'adjust', 'mouldi', 'label',
               'wrongli', 'sold', 'damag', 'stock', 'barcod',
               'wet', 'wrong', 'rusti', 'alloc', 'crush', 'code',
               'crack', 'mark', 'unsal', 'c', 'sampl', 'found']

# Removing records
products_clean = products_clean[~products_clean[
                                                'tokenized_description'
                                               ].apply(lambda desc: isInDesc(desc, wordsToDrop))]
In [85]:
plt.figure(figsize=(10,6))

products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')

plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()



After cleaning, we only have products in the dataframe. Is there more than one description per product?

In [86]:
# Products with more than one description
products_clean[
               'lowercase_stockcode'
              ].value_counts()[products_clean['lowercase_stockcode'].value_counts() > 1]
Out[86]:
23131     4
23236     4
23196     4
23240     3
22937     3
         ..
23061     2
23041     2
22502     2
79030d    2
23497     2
Name: lowercase_stockcode, Length: 240, dtype: int64



Looking at stock code 23196, we can see that the descriptions are similar, but there are grammatical errors and/or some adjectives.

In [87]:
products_clean[products_clean['lowercase_stockcode'] == '23196']
Out[87]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
237422 23196 retro leaves magnetic notepad retro leav magnet notepad [retro, leav, magnet, notepad]
238991 23196 reto leaves magnetic shopping list reto leav magnet shop list [reto, leav, magnet, shop, list]
246802 23196 leaves magnetic shopping list leav magnet shop list [leav, magnet, shop, list]
252851 23196 vintage leaf magnetic notepad vintag leaf magnet notepad [vintag, leaf, magnet, notepad]



It would have been better if different stock codes had been assigned to variants of the same product (different colors or other characteristics). We can't change that, but we can reduce it to a single description per product. To avoid losing information, we will keep the description with the greatest number of words and discard the rest.
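The dedup-by-longest-description step could also be sketched with pandas' `groupby` + `idxmax`, which avoids the manual index loop. This is a minimal sketch on hypothetical toy data mimicking the `products_clean` columns, not the notebook's actual dataframe:

```python
import pandas as pd

# Hypothetical toy rows with the same shape as products_clean
products = pd.DataFrame({
    'lowercase_stockcode': ['23196', '23196', '71053'],
    'tokenized_description': [['leav', 'magnet', 'notepad'],
                              ['retro', 'leav', 'magnet', 'shop', 'list'],
                              ['white', 'metal', 'lantern']],
})

# Description length per row
desc_len = products['tokenized_description'].map(len)

# For each stock code, the row label of the longest description
idx_to_keep = desc_len.groupby(products['lowercase_stockcode']).idxmax()

# One row per stock code, keeping the wordiest description
deduped = products.loc[idx_to_keep]
```

The result keeps the 5-word description for stock code `23196` and drops the 3-word one, matching what the while-loop in the notebook does.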

In [88]:
# Size of description
pd.DataFrame(zip(products_clean['lowercase_stockcode'],
                 products_clean['tokenized_description'].map(len)),
             columns=['stock_code', 'desc_length'],
             index=products_clean.index)
Out[88]:
stock_code desc_length
0 85123a 4
1 71053 2
2 84406b 5
3 84029g 6
4 84029e 3
... ... ...
510117 20711 2
512588 23617 5
527065 90214u 5
537621 85123a 5
540421 23843 4

4063 rows × 2 columns

In [89]:
# Sorting by stock code
descSize = pd.DataFrame(zip(products_clean['lowercase_stockcode'],
                            products_clean['tokenized_description'].map(len)),
                        columns=['stock_code', 'desc_length'],
                        index=products_clean.index).sort_values('stock_code')
In [90]:
# Keeps the longer description

index = 0
indexToKeep = []

while(index != descSize.shape[0]):
    
    # Count the consecutive rows sharing the current stock code
    count = descSize[descSize['stock_code'] == descSize.iloc[index]['stock_code']]['stock_code'].count()
    
    # Keep the index of the description with the most words in that range
    indexToKeep.append(descSize['desc_length'].iloc[index:(index+count)].idxmax())
    
    index += count
In [91]:
products_clean = products_clean.loc[indexToKeep]
In [92]:
plt.figure(figsize=(10,6))

products_clean['tokenized_description'].map(len).hist(bins=30, color='steelblue')

plt.xlabel('Number of Words', fontsize=13)
plt.xticks(fontsize=12)
plt.tight_layout()



We now have clean descriptions with no duplicates.



It's time to group products by description and identify which category each cluster represents. For this task, I will take a simple path using K-Means.

Feature extraction is not easy when it comes to product clustering. I will try two methods: CountVectorizer and TF-IDF.


CountVectorizer simply converts the text into a vector of token counts. No weighting based on a word's "relevance" is applied, which makes it one of the simplest ways to approach this problem.
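To see what "a vector of token counts" means concretely, here is a minimal sketch on two hypothetical cleaned descriptions (not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned descriptions
docs = ['red heart lantern', 'white heart holder']

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()

# vec.vocabulary_ maps each token to a column index;
# each row holds raw counts only, with no relevance weighting
```

Every distinct token becomes one column, so a shared word like "heart" gets the same weight (1) in both rows regardless of how common it is across the corpus.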

In [93]:
vectorizer = CountVectorizer()

# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
Out[93]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
In [119]:
#vectorizer.vocabulary_
In [95]:
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])

# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
Out[95]:
abc abstract acapulco account acryl address ade adult advent afghan ... xcm xma yellowblu yelloworang yellowpink your yuletid zero zinc zincglass
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3795 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3796 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3797 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3798 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3799 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3800 rows × 1791 columns



There is no clue as to how many groups we are looking for, i.e. how many product categories there are. Let's use the Elbow Method to help us find a good value of K.

In [96]:
# Determine k using elbow method

Sum_of_squared_distances = []

# There are 3800 product descriptions
# It is reasonable to think there could be up to 100 groups of 38 products each
K = range(1, 101)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(vector)
    Sum_of_squared_distances.append(km.inertia_)

plt.figure(figsize=(12,6))    
    
plt.plot(K, Sum_of_squared_distances, 'bx-')

plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)

plt.xticks(fontsize=12)
plt.tight_layout()



It is not easy to say exactly where the "elbow" is, but choosing 35 clusters (K = 35) seems reasonable.
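When the elbow is ambiguous, the silhouette score can serve as a complementary, quantitative criterion: it peaks when points sit close to their own cluster and far from the others. A minimal sketch on hypothetical synthetic blobs (standing in for the real `vector` matrix, which would be far slower to scan):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with four well-separated groups
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=42)

# Score each candidate K; the maximum suggests the cluster count
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On this toy data the score is maximized at the true number of groups, whereas on real sparse text vectors the peak is usually much flatter.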

In [97]:
K = 35

km = KMeans(n_clusters=K)
km.fit(vector)
Out[97]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=35, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [98]:
km.labels_
Out[98]:
array([12, 12, 12, ..., 33, 10, 10])
In [99]:
plt.figure(figsize=(12,6))

sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
In [100]:
prod_clusters = pd.DataFrame({
                              'lowercase_description': products_clean['lowercase_description'],
                              'cleanned_description': products_clean['cleanned_description'],
                              'category': km.labels_
                             })
In [101]:
prod_clusters
Out[101]:
lowercase_description cleanned_description category
31 inflatable political globe inflat polit globe 12
103332 groovy cactus inflatable groovi cactu inflat 12
5452 doggy rubber doggi rubber 12
5451 hearts wrapping tape heart wrap tape 6
12492 spots on red bookcover tape spot bookcov tape 12
... ... ... ...
170783 haynes camper shoulder bag hayn camper shoulder bag 4
40052 camouflage dog collar camouflag dog collar 12
21326 sunjar led night night light sunjar led night night light 33
84016 boys party bag boy parti bag 10
84017 girls party bag girl parti bag 10

3800 rows × 3 columns



It is difficult to measure whether the clusters are correct. I will create a dataframe where each row is a group/category and the columns are samples of product descriptions from the same cluster.

This way, our metric is common sense: the more sense the samples make together, the better.

In [102]:
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)

top_desc_cluster = pd.concat([
    pd.DataFrame({
        'category': cluster,
        'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
        'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
        'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
        'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
        'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
    }, index=[0])
    for cluster in range(K)
], ignore_index=True)
In [103]:
top_desc_cluster
Out[103]:
category desc_1 desc_2 desc_3 desc_4 desc_5
0 0 clear stationery box set box of 9 pebble candles vintage keepsake box pink flower vintage keepsake box paris days vintage keepsake box travelogue
1 1 colouring pencils brown tube colouring pencils brown tube assorted colours silk fan popart wooden pencils asst teatime round pencil sharpener
2 2 glitter hanging butterfly string pink poodle hanging decoration ivory hanging decoration heart ivory hanging decoration egg ivory hanging decoration bird
3 3 vintage notebook paris days vintage notebook travelogue vintage notebook beauty girl vintage photo album paris days set/4 modern vintage cotton napkins
4 4 the king gift bag funky monkey gift bag medium swirly circular rubbers in bag jumbo bag toys jumbo bag woodland animals
5 5 asstd design racing car pen ice cream design garden parasol key fob , garage design skulls design flannel cakes and rabbits design flannel
6 6 hearts wrapping tape red purse with pink heart pink heart red handbag red heart luggage tag zinc heart lattice 2 wall planter
7 7 porcelain budah incense holder assorted laquered incense holders pink tall porcelain t-light holder white tall porcelain t-light holder zinc heart lattice t-light holder
8 8 small folding scissor(pointed edge) small chinese style scissor ass col small sand gecko p'weight small blue provencal ceramic ball small glass sundae dish clear
9 9 wrap english rose origami rose incense in tube blue stones on wire for candle blue rose patch purse pink butterfl blue rose fabric mirror
10 10 party time pencil erasers party food shopper bag party invites woodland party invites dinosaurs party invites balloon girl
11 11 large chinese style scissor ass col large sand frog p'weight large black diamante hairslide large blue provencal ceramic ball large pink glass sundae dish
12 12 inflatable political globe groovy cactus inflatable doggy rubber spots on red bookcover tape army camo bookcover tape
13 13 blue glass gems in bag clear milkshake glass silver aperitif glass gold aperitif glass antique all glass candlestick
14 14 beware of the cat metal sign cook with wine metal sign you're confusing me metal sign bathroom metal sign party metal sign
15 15 black ear muff headphones gold ear muff headphones camouflage ear muff headphones antique raspberry flower earrings antique mid blue flower earrings
16 16 fairy cake design umbrella wrap pink fairy cakes fairy cake flannel assorted colour pack of 72 retrospot cake cases pack of 72 skull cake cases
17 17 piece of camo stationery set origami vanilla incense/candle set origami jasmine incense/candle set origami lavender incense/candl set origami rose incense/candle set
18 18 home sweet home metal sign doormat fancy font home sweet home doormat spotty home sweet home home sweet home mug paperweight home sweet home
19 19 blue polkadot garden parasol pink polkadot garden parasol pop art push down rubber garden path pocket book garden path sketchbook
20 20 ribbons purse mummy mouse red gingham ribbon 6 ribbons shimmering pinks 6 ribbons empire 6 ribbons rustic charm
21 21 dolly mixture children's umbrella moody girl door hanger keep out girls door hanger tv dinner tray dolly girl magic drawing slate dolly girl
22 22 flowers handbag blue and orange flower fairy,5 summer b'draw liners flower fairy incense bouquet rose flower candle+incense 16x16cm toast its - fairy flower
23 23 mini funky design tapes mini highlighter pens gold mini tape measure silver mini tape measure black mini tape measure
24 24 disco ball christmas decoration crazy daisy heart decoration etched glass star tree decoration glitter snow pear tree decoration swiss chalet tree decoration
25 25 blue polkadot passport cover cherry blossom passport cover first class passport cover queen of the skies passport cover tropical passport cover
26 26 tumbler, baroque new baroque jewellery box pink new baroque flock candlestick blue new baroque flock candlestick new baroque black boxes
27 27 key ring baseball boot assorted key ring baseball boot union jack bling key ring stand letter "a" bling key ring letter "b" bling key ring
28 28 raspberry ant copper flower necklac midnight blue copper flower necklac copper/olive green flower necklace midnight blue drop crystal necklace black drop crystal necklace
29 29 balloon art make your own flowers art lights,funk monkey large white/pink rose art flower medium white/pink rose art flower small white/pink rose art flower
30 30 rattle snake eggs blue polkadot egg cup red egg spoon blue egg spoon 2 picture book eggs easter chicks
31 31 wrap 50's christmas bird box christmas tree decoration christmas toilet roll green goose feather christmas tree white goose feather christmas tree
32 32 fan black frame pastel pink photo album pastel blue photo album black photo album maxwell 2 tone blue 60 page photo a
33 33 red toadstool led night light star portable table light snowflake portable table light set 10 night owl lights pig keyring with light & sound
34 34 red stripe ceramic drawer knob blue stripe ceramic drawer knob blue spot ceramic drawer knob red spot ceramic drawer knob white spot red ceramic drawer knob



There is some correlation between the products within each cluster, but it is far from what we want. I will move on to the next approach.
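Unlike raw counts, TF-IDF down-weights words that appear in many descriptions, so ubiquitous tokens contribute less to the distance between products. A minimal sketch on hypothetical descriptions (not from the dataset) where "heart" appears in every document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions; 'heart' occurs in all three
docs = ['red heart lantern', 'pink heart purse', 'white heart holder']

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs).toarray()

vocab = vec.vocabulary_
# Within any one row, the ubiquitous 'heart' scores lower
# than a word unique to that description, e.g. 'red'
```

This is the property that should make clusters sharper: products end up grouped by their distinctive words rather than by filler terms shared across the whole catalog.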


In [105]:
vectorizer = TfidfVectorizer()

# Tokenize and build vocab
vectorizer.fit(products_clean['cleanned_description'])
Out[105]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
In [123]:
#vectorizer.vocabulary_
In [107]:
# Encode document
vector = vectorizer.transform(products_clean['cleanned_description'])

# Summarize encoded vector
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())
Out[107]:
abc abstract acapulco account acryl address ade adult advent afghan ... xcm xma yellowblu yelloworang yellowpink your yuletid zero zinc zincglass
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3795 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3796 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3797 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3798 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3799 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3800 rows × 1791 columns

In [108]:
# Determine k using elbow method

Sum_of_squared_distances = []

# There are 3800 product descriptions
# It is reasonable to think there could be up to 100 groups of 38 products each
K = range(1, 101)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(vector)
    Sum_of_squared_distances.append(km.inertia_)

plt.figure(figsize=(12,6))    
    
plt.plot(K, Sum_of_squared_distances, 'bx-')

plt.xlabel('k', fontsize=13)
plt.ylabel('Sum of Squared Distances', fontsize=13)
plt.title('Elbow Method For Optimal k', fontsize=15)

plt.xticks(fontsize=12)
plt.tight_layout()
In [109]:
# This time, 36 seems to be a reasonable number
K = 36

km = KMeans(n_clusters=K)
km.fit(vector)
Out[109]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=36, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [110]:
km.labels_
Out[110]:
array([ 5,  5,  5, ...,  5, 23, 23])
In [111]:
plt.figure(figsize=(12,6))

sns.barplot(x=pd.DataFrame(km.labels_)[0].value_counts().index, y=pd.DataFrame(km.labels_)[0].value_counts())
plt.xticks(fontsize=12)
plt.tight_layout()
In [112]:
prod_clusters = pd.DataFrame({
                              'lowercase_description': products_clean['lowercase_description'],
                              'cleanned_description': products_clean['cleanned_description'],
                              'category': km.labels_
                             })
In [113]:
prod_clusters
Out[113]:
lowercase_description cleanned_description category
31 inflatable political globe inflat polit globe 5
103332 groovy cactus inflatable groovi cactu inflat 5
5452 doggy rubber doggi rubber 5
5451 hearts wrapping tape heart wrap tape 7
12492 spots on red bookcover tape spot bookcov tape 10
... ... ... ...
170783 haynes camper shoulder bag hayn camper shoulder bag 5
40052 camouflage dog collar camouflag dog collar 5
21326 sunjar led night night light sunjar led night night light 5
84016 boys party bag boy parti bag 23
84017 girls party bag girl parti bag 23

3800 rows × 3 columns

In [114]:
# Top descriptions per cluster
# The lowercase_description column position is 0 (first column)

top_desc_cluster = pd.concat([
    pd.DataFrame({
        'category': cluster,
        'desc_1': prod_clusters[prod_clusters['category'] == cluster].iloc[0, 0],
        'desc_2': prod_clusters[prod_clusters['category'] == cluster].iloc[1, 0],
        'desc_3': prod_clusters[prod_clusters['category'] == cluster].iloc[2, 0],
        'desc_4': prod_clusters[prod_clusters['category'] == cluster].iloc[3, 0],
        'desc_5': prod_clusters[prod_clusters['category'] == cluster].iloc[4, 0]
    }, index=[0])
    for cluster in range(K)
], ignore_index=True)
In [115]:
top_desc_cluster
Out[115]:
category desc_1 desc_2 desc_3 desc_4 desc_5
0 0 silver looking mirror goldie looking mirror blue rose fabric mirror pink rose fabric mirror silver fabric mirror
1 1 crazy daisy heart decoration etched glass star tree decoration glitter snow pear tree decoration swiss chalet tree decoration dark bird house tree decoration
2 2 blue stones on wire for candle red retrospot luggage tag red retrospot purse red retrospot bowl red retrospot childrens umbrella
3 3 vintage union jack bunting vintage union jack cushion cover vintage union jack memoboard vintage union jack doorstop vintage union jack apron
4 4 french style embossed heart cabinet french lavender scent heart classic french style basket green classic french style basket brown classic french style basket natural
5 5 inflatable political globe groovy cactus inflatable doggy rubber army camo bookcover tape colouring pencils brown tube
6 6 zinc heart lattice 2 wall planter set of skull wall stickers mirrored dove wall decoration mirrored wall art stars mirrored wall art photo frames
7 7 hearts wrapping tape wrap, carousel wrap bad hair day wrap pink flock wrap english rose
8 8 wrap 50's christmas disco ball christmas decoration set/9 christmas t-lights scented bird box christmas tree decoration christmas toilet roll
9 9 blue polkadot garden parasol pink polkadot garden parasol blue polkadot passport cover blue polkadot luggage tag blue polkadot purse
10 10 spots on red bookcover tape cream slice flannel chocolate spot cream slice flannel pink spot large cake towel, chocolate spots large cake towel pink spots
11 11 bling key ring stand letter "a" bling key ring letter "b" bling key ring letter "d" bling key ring letter "e" bling key ring
12 12 black ear muff headphones gold ear muff headphones camouflage ear muff headphones midnight blue crystal drop earrings black crystal drop earrings
13 13 midnight blue copper flower necklac copper/olive green flower necklace midnight blue drop crystal necklace black drop crystal necklace silver/mop orbit necklace
14 14 mini funky design tapes asstd design racing car pen vintage red mug vintage red teatime mug vintage billboard love/hate mug
15 15 ice cream design garden parasol mini highlighter pens gold mini tape measure silver mini tape measure black mini tape measure
16 16 flowers handbag blue and orange flower fairy,5 summer b'draw liners flower fairy incense bouquet rose flower candle+incense 16x16cm toast its - fairy flower
17 17 fairy cake design umbrella wrap pink fairy cakes pack of 72 retrospot cake cases pack of 72 skull cake cases ivory paper cup cake cases
18 18 rattle snake eggs 2 picture book eggs easter chicks 2 picture book eggs easter ducks 2 picture book eggs easter bunny yellow easter egg hunt start post
19 19 midnight blue glass/silver bracelet black/white glass/silver bracelet amber glass/silver bracelet silver m.o.p orbit bracelet gold m.o.p orbit bracelet
20 20 small folding scissor(pointed edge) small chinese style scissor red purse with pink heart pink heart red handbag red heart luggage tag
21 21 love potion masala incense red flock love heart photo frame toast its - i love you love large wood letters red heart shape love bucket
22 22 skulls greeting card fancy font birthday card, elephant birthday card vintage kid dolly card cowboys and indians birthday card
23 23 party time pencil erasers party food shopper bag party invites woodland party invites dinosaurs party invites balloon girl
24 24 large black diamante hairslide pink diamante pen in gift box green diamante pen in gift box silver diamante pen in gift box blue diamante pen in gift box
25 25 piece of camo stationery set origami vanilla incense/candle set origami jasmine incense/candle set origami lavender incense/candl set origami rose incense/candle set
26 26 victorian metal postcard spring home sweet home metal sign beware of the cat metal sign cook with wine metal sign you're confusing me metal sign
27 27 s/3 pot pouri cushions blue colours cherry blossom passport cover first class passport cover queen of the skies passport cover tropical passport cover
28 28 clear stationery box set box of 9 pebble candles vintage keepsake box pink flower vintage keepsake box paris days vintage keepsake box travelogue
29 29 pink tall porcelain t-light holder white tall porcelain t-light holder zinc heart lattice t-light holder hanging bauble t-light holder small hanging bauble t-light holder large
30 30 assorted colours silk fan assorted incense pack assorted laquered incense holders assorted colour mini cases funky washing up gloves assorted
31 31 blue glass gems in bag clear milkshake glass large pink glass sundae dish small glass sundae dish clear small pink glass sundae dish
32 32 pack of 12 skull tissues pack of 12 red retrospot tissues pack of 12 woodland tissues pack of 12 suki tissues pack of 12 blue paisley tissues
33 33 white skull hot water bottle fawn blue hot water bottle chick grey hot water bottle retrospot heart hot water bottle pink heart dots hot water bottle
34 34 vintage kitchen print puddings red polkadot pudding bowl blue polkadot pudding bowl red pudding spoon blue pudding spoon
35 35 blue tiled tray blue tile hook number tile cottage garden 0 number tile cottage garden 1 number tile cottage garden 2
In [116]:
products_clean[products_clean['lowercase_description'] == 'colouring pencils brown tube']
Out[116]:
lowercase_stockcode lowercase_description cleanned_description tokenized_description
741 10133 colouring pencils brown tube colour pencil tube [colour, pencil, tube]
2461 10135 colouring pencils brown tube colour pencil tube [colour, pencil, tube]



The TF-IDF approach seems better than CountVectorizer: the clusters appear more cohesive. There is more than one product with the same description, but I will assume they are different products.

In both approaches, some clusters stand out for their disproportionate number of products. Next time, I will apply the same method to those clusters and create subcategories.
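The subcategory idea can be sketched as re-running K-Means only on the rows of the largest cluster. This is a minimal sketch on a hypothetical miniature corpus standing in for `products_clean['cleanned_description']`, with the sub-cluster count chosen arbitrarily:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus
docs = ['red mug', 'blue mug', 'red bowl', 'blue bowl',
        'heart card', 'star card', 'heart wrap', 'star wrap']

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# First-level clustering
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Select the biggest cluster and split it again into subcategories
largest = np.bincount(labels).argmax()
mask = labels == largest
sub_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[mask])
```

On the real data, the same elbow/silhouette reasoning used above would have to be repeated per oversized cluster to choose each sub-K.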

In [ ]: