
Module 2 final project for Flatiron School data science immersive program


gdurante2019/Mod2-Final-Project-Durante


Blog post URL: https://github.com/gdurante2019/gdurante2019.github.io/blob/master/_posts/2019-07-01-what_do_we_know_about_discounts_driving_sales.markdown

Project Description

Goals

The goal of the project is to use statistical analysis of data in the Northwind SQL database to answer at least four questions of business relevance to Northwind, an importer of food and beverage products. To answer these questions using statistical analysis, it will be necessary to perform hypothesis testing.

The first of these four questions is a required question: Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?

Deliverables

The following deliverables are required:

  1. A Jupyter Notebook containing any code written for this project.
  2. A Blog Post explaining process, methodology, and findings.
  3. An "Executive Summary" PowerPoint Presentation that explains the hypothesis tests run, findings, and relevance to company stakeholders.
  4. A Video Walkthrough of the “Executive Summary” presentation.

Questions Addressed

  1. (Required) Do discounts have a statistically significant effect on the average quantity of products per order?
  2. Do discounts have a statistically significant effect on the average revenue per order?
  3. Does the level of discount (e.g., 5%, 10%, 15%...) have a statistically significant effect on the average quantity of products per order?
  4. Does the level of discount (e.g., 5%, 10%, 15%...) have a statistically significant effect on the average revenue per order?
  5. Do the revenue distributions differ in a statistically significant manner from one product category to the next? Put another way, do some product categories have statistically significantly higher revenues per order than others?

Importing data

Import libraries

import pandas as pd
import numpy as np
import pprint as pp
import matplotlib.pyplot as plt
from sqlalchemy import create_engine, inspect
%matplotlib inline

Create connection object

engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
con = engine.connect()

Inspect the database schema (table names and column names)

# Get names of tables in database

inspector = inspect(engine)
print("Table names")
table_names = inspector.get_table_names()
pp.pprint(table_names)
print()
Table names
['Category',
 'Customer',
 'CustomerCustomerDemo',
 'CustomerDemographic',
 'Employee',
 'EmployeeTerritory',
 'Order',
 'OrderDetail',
 'Product',
 'Region',
 'Shipper',
 'Supplier',
 'Territory',
 'placeholder']
# Get column names for a specific table
    
def print_col_names(var_table):
    print(f"Column names for table {var_table}:")
    for col in inspector.get_columns(var_table):
        print(col['name'])
    print()
print_col_names('Customer')
Column names for table Customer:
Id
CompanyName
ContactName
ContactTitle
Address
City
Region
PostalCode
Country
Phone
Fax
print_col_names('Order')
Column names for table Order:
Id
CustomerId
EmployeeId
OrderDate
RequiredDate
ShippedDate
ShipVia
Freight
ShipName
ShipAddress
ShipCity
ShipRegion
ShipPostalCode
ShipCountry
print_col_names('OrderDetail')
Column names for table OrderDetail:
Id
OrderId
ProductId
UnitPrice
Quantity
Discount
print_col_names('Product')
Column names for table Product:
Id
ProductName
SupplierId
CategoryId
QuantityPerUnit
UnitPrice
UnitsInStock
UnitsOnOrder
ReorderLevel
Discontinued
print_col_names('Category')
Column names for table Category:
Id
CategoryName
Description

Importing data and setting up dataframes

# Create an engine and a connection object to execute SQL commands
engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
con = engine.connect()

# Set up query
query = """
        SELECT o.id, o.CustomerID, o.OrderDate, 
        od.ProductID, od.UnitPrice as [OrderUnitPrice], 
        od.Quantity as [OrderQty], od.Discount, 
        p.ProductName, p.CategoryId, 
        cat.CategoryName, cat.Description as [CatDescription]
        FROM [Order] as o 
        JOIN OrderDetail as od
        ON o.ID = od.OrderID   
        JOIN Product as p
        ON od.ProductID = p.ID
        JOIN Category as cat
        ON cat.id = p.CategoryId
        ORDER BY ProductID
        """

# Execute query and create df from results (pd.read_sql both executes
# the query and builds the dataframe)
df = pd.read_sql(query, con)
df.head()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |

Function to pull original data from sqlite db and load into pandas dataframe

# Function load_df() to pull original data from sqlite db and load into pandas 
# dataframe.  This function is used as a subfunction later in the notebook 
# to enable reconstruction of the dataframe if needed.


def load_df(query):
    '''Run the given SQL query against the Northwind database and
    return the results as a pandas dataframe.'''
    
    # Create an engine and a connection object to execute SQL commands
    engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
    con = engine.connect()
    
    # Execute the query passed in and create df from the results
    df = pd.read_sql(query, con)
    return df

df = load_df(query)
df.head()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |

First question: Quantity and discount/no discount

Do discounts result in a statistically significant effect on quantity of products per order?

Null and alternative hypotheses

  • Null hypothesis: No statistically significant difference between the means of quantities of discounted products purchased per order versus quantities of non-discounted products purchased per order

    • Ho: $\mu_1 = \mu_2$
  • Alternative hypothesis: Statistically significant differences between the means of quantities of discounted products purchased per order versus quantities of non-discounted products purchased per order are unlikely to be due to chance

    • Ha: $\mu_1 \neq \mu_2$

EDA

Create dataframes looking at subsets of data

Create dataframes that contain products ordered without discounts and products ordered that were discounted:

df_no_disc = df.loc[df['Discount'] == 0]
df_no_disc.head()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 6 | 10406 | QUEE | 2013-01-07 | 1 | 14.4 | 10 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 7 | 10413 | LAMAI | 2013-01-14 | 1 | 14.4 | 24 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
df_disc = df.loc[df['Discount'] > 0]
df_disc.head()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 9 | 10522 | LEHMS | 2013-04-30 | 1 | 18.0 | 40 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
| 10 | 10526 | WARTH | 2013-05-05 | 1 | 18.0 | 8 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |

Create plotting functions for visualizations

# Functions for plotting average order quantities

def hist_plot1_qty(series1, label=None, color='darkcyan', alpha=0.7, bins=30, xlim=[0, 150]):
    plt.hist(series1, label=label, color=color, alpha=alpha, bins=bins)
    plt.title("Order quantity by occurrence")
    plt.xlabel("Order Quantity")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)
    plt.show()
    
# function for plotting two series    
def hist_plot2_qty(series1, series2, label=None, label1=None, color1='b', color2='r', alpha=0.4, bins=30, xlim=[0, 150]):
    plt.hist(series1, label=label, color=color1, alpha=alpha, bins=bins)
    plt.hist(series2, label=label1, color=color2, alpha=alpha, bins=bins)
    plt.title("Order quantity by occurrence")
    plt.xlabel("Order Quantity")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)  
    plt.show()
    
# function for plotting three different series 
def hist_plot3_qty(var1, var2, var3, label=None, label1=None, label2=None, alpha=0.5, bins=30, xlim=[0, 150]):
    plt.hist(var1, label=label, color='r', alpha=alpha, bins=bins)
    plt.hist(var2, label=label1, color='b', alpha=alpha, bins=bins)
    plt.hist(var3, label=label2, color='g', alpha=alpha, bins=bins)
    plt.title("Order quantity by occurrence")
    plt.xlabel("Order Quantity")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)
    plt.show()

Histogram of order quantities: no discount vs. some discount

# plt.figure(figsize=(8, 6))
hist_plot2_qty(df_no_disc.OrderQty, df_disc.OrderQty, label='No discounts (X = 0% discount)', label1='All discounts (X = 1%-25%+ discount)')

[figure: overlaid histograms of order quantity, no discount vs. all discounts]

The histogram shows the distribution of order quantity values (x-axis). The distributions of order quantities without discounts and order quantities with discounts appear very similar, with the smallest order quantities appearing at the far left-hand side of the graph and the largest at the far right.

If discounts have an effect on the quantity of products sold, we would expect to see the distribution of quantities sold from discounts look different from the distribution of quantities sold that didn't involve discounts. If the shape of the distribution changes, such that a greater proportion of quantities appear in the middle or far right-hand-side of the graph, then that difference in distribution suggests that discounts increase the quantity of products in a given order.
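As a quick numeric companion to the histogram (an illustrative cell added here, not part of the original analysis), we can compare the raw group means directly; the hypothesis tests below make this comparison formally:

# raw (un-resampled) group means for a first impression
print(df_no_disc.OrderQty.mean())
print(df_disc.OrderQty.mean())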

Testing for significance: sampling, test setup and results

To test whether or not the presence of a discount likely has an effect on the quantity of products sold, we can use a two-sided t-test. Two prerequisites for performing this test are that 1) both datasets have approximately normal distributions and 2) the variances of the two datasets are equal.

Test chosen: t-test (2-sided, 2-sample)

We can see that the above distributions are not normal. Before proceeding with the 2-sided, 2-sample t-test, we must first address this issue.
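As a quick sanity check (a sketch added for illustration, not a cell from the original notebook), scipy's D'Agostino-Pearson test can quantify the visual impression; a very small p-value rejects the null hypothesis of normality:

import scipy.stats as scs

# null hypothesis: the sample comes from a normal distribution
print(scs.normaltest(df_no_disc.OrderQty))
print(scs.normaltest(df_disc.OrderQty))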

Fortunately, we can obtain a normal distribution (per the Central Limit Theorem) by sampling each dataset multiple times and computing each sample's mean. With a large enough sample size (n > 30), we can expect the distribution of the sample means to be normal or very nearly normal. We can then run statistical tests on these distributions of means.
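In symbols: for samples of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$, the Central Limit Theorem says the distribution of sample means is approximately $N(\mu, \sigma/\sqrt{n})$ once $n$ is reasonably large, regardless of the shape of the underlying population.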

Each time the cells below are run, samples of size 100 are drawn 300 times from each of the two populations (orders with no discounts and orders with some level of discount). The mean and standard deviation are computed for each of the 300 sampling events, and each resulting mean and standard deviation is appended to its respective list.

Repeated sampling produces slightly different results on each run; however, because each sample contains 100 observations, well over 30, we can assume per the Central Limit Theorem that the distributions of these means and standard deviations are approximately normal. We can now perform hypothesis tests that require normally distributed statistics on these means and standard deviations.

Sampling function to obtain normal distribution of means

def sampling_mean(var, sampling_events=300, size=100):   # var is the dataframe column of interest
    samp_mean = []
    samp_std = []
    for i in range(sampling_events):
        sample = np.random.choice(var, size=size, replace=False)
        samp_mean.append(sample.mean())
        samp_std.append(sample.std())
    return samp_mean, samp_std
sampling_no_disc_mean, sampling_no_disc_std = sampling_mean(df_no_disc.OrderQty)
print(np.mean(sampling_no_disc_mean))
print(np.std(sampling_no_disc_std))
21.6004
2.3409277308454226

Histograms of sampling distributions

hist_plot1_qty(sampling_no_disc_mean, label="Order qty, no discounts", xlim=None)

[figure: histogram of sample means, order quantity, no discounts]

sampling_w_disc_mean, sampling_w_disc_std = sampling_mean(df_disc.OrderQty)
print(np.mean(sampling_w_disc_mean))
print(np.std(sampling_w_disc_std))
27.1053
2.24426228479168
hist_plot1_qty(sampling_w_disc_mean, label="Order qty, w/discounts", xlim=None)

[figure: histogram of sample means, order quantity, with discounts]

Bartlett's Test for variance

The t-test performed above requires not only that the samples come from approximately normally distributed populations, but also that they come from populations that have approximately equal variances. We use Bartlett’s test on normal populations to test the null hypothesis that the samples are from populations with equal variances.

import scipy.stats

scipy.stats.bartlett(sampling_no_disc_mean, sampling_w_disc_mean)
BartlettResult(statistic=2.564061882357949, pvalue=0.10931738849981974)

Since the p-value is > 0.05, we cannot reject the null hypothesis that the samples come from populations with equal variances.

One interesting point is that there appears to be more fluctuation in the means and distributions for order quantities than for order revenues (presumably because the revenue values are much larger than the quantity values). Sometimes the distributions that result from resampling pass Bartlett's test, and sometimes they don't. One thing that is fairly consistent, though, is that the distributions are reasonably normal and similar in shape--meaning that the distributions might not always pass Bartlett's test, but when they don't, they aren't far off the mark.

Because of this, I would still like to run the t-test to find out what results we get.
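One alternative worth noting (a sketch added here, not part of the original analysis): when Bartlett's test fails on a particular run, Levene's test is a common substitute because it is less sensitive to departures from normality:

import scipy.stats

# Levene's test with the median center is robust to non-normal data
print(scipy.stats.levene(sampling_no_disc_mean, sampling_w_disc_mean, center='median'))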

T-test for two independent samples

import scipy.stats as scs

scs.ttest_ind(sampling_no_disc_mean, sampling_w_disc_mean)
Ttest_indResult(statistic=-35.273313157553034, pvalue=3.1316578958420464e-148)

The p-value is far below 0.05, so if the other conditions for this test (normality, similar variances) are met, we should reject the null hypothesis that the samples come from populations with the same means. Depending on the particular samples that are pulled when the notebook is run, the distributions may or may not pass Bartlett's test. If they do not pass Bartlett's test on a particular run, we can look at the distributions to see how similar or dissimilar they are to each other and decide whether an alternative test would be better.

If we decide to perform this test and we get a p-value of << 0.05, then it's likely that there is an actual statistically significant difference between the populations that is not due to chance. But what is the effect size of that difference? Enter Cohen's d.

Effect size: Cohen's d

Cohen's d allows us to view the magnitude of the effect of a statistically significant difference. From the "Trending Sideways: Practical Engineering and Statistics Updates" blog page, Cohen’s d is "simply a measure of the distance between two means, measured in standard deviations." So, we could expect that the majority of analyses will yield Cohen's d results in a range from 0 to 3 or 4 (representing the differences between two means in terms of standard deviations).
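In symbols, this is the pooled-variance form implemented in the function below:

$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$, where $s_p = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}}$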

# From learn.co Section 20 Effect Size lesson, with conditional print statements added

def Cohen_d(group1, group2, print_stmt=True):

    # Compute Cohen's d.

    # group1: Series or NumPy array  ----> in this case, the array is the list of sample means from resampling
    # group2: Series or NumPy array  ----> in this case, the array is the list of sample means from resampling

    # returns a floating point number 

    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()

    # Calculate the pooled variance
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    
    if print_stmt:           # changing print_stmt argument in the function call to False will turn off these print statements
        if abs(d) >= 0 and abs(d) < .4:
            print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **small** effect size.")
        elif abs(d) >= .4 and abs(d) < .7:
            print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **medium** effect size.")
        else:
            print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **large** effect size.")
    
    return d
group1 = np.array(sampling_w_disc_mean)
group2 = np.array(sampling_no_disc_mean)

d = Cohen_d(group1, group2)
d
This Cohen's d result (2.88486608) represents a difference of 2.88 pooled standard deviations between the two groups, and suggests a **large** effect size.

2.884866082701662

The functions below let us plot what the overlap of two randomly generated normal distributions looks like at a given Cohen's d. Following each pair's Cohen's d result is a visualization of the amount of overlap that the given Cohen's d implies. The functions below are adapted from the Effect Sizes lesson in Section 20 of Module 2 of the Flatiron Full-Time Online Data Science Bootcamp.

def evaluate_PDF(var, x=4):
    '''Input: a scipy.stats random variable object and the number of
       standard deviations (x) to span on either side of the mean.
       Output: x and y values for the normal distribution PDF.
       '''
    
    # Identify the mean and standard deviation of random variable 
    mean = var.mean()
    std = var.std()

    # Use numpy to calculate evenly spaced numbers over the specified interval (4 sd) and generate 100 samples.
    xs = np.linspace(mean - x*std, mean + x*std, 100)
    
    # Calculate the peak of normal distribution i.e. probability density. 
    ys = var.pdf(xs)

    return xs, ys # Return calculated values

def plot_pdfs(cohen_d=2):
    """Plot randomly generated, normally distributed PDFs for 
    distributions that differ by some number of stds.
    
    cohen_d: number of standard deviations between the means
    """
    group1 = scipy.stats.norm(cohen_d,  1)
    group2 = scipy.stats.norm(0, 1)
    xs, ys = evaluate_PDF(group1)
    plt.fill_between(xs, ys, label='Group1', color='#ff2289', alpha=0.7)

    xs, ys = evaluate_PDF(group2)
    plt.fill_between(xs, ys, label='Group2', color='#376cb0', alpha=0.7)
plot_pdfs(d)

[figure: overlapping normal PDFs separated by Cohen's d ≈ 2.88]

Red = group1 = all discounted products;
Blue = group2 = all non-discounted products

This graph provides a way to visualize the difference of the means, as given by the Cohen's d result. Cohen's d is a measure of the distance between the means, depicted in units of pooled standard deviation. In this case, the means are roughly 2.9 pooled standard deviations apart, so the two distributions barely overlap.

Findings and interpretation of test results for Question 1

Finding

There is a statistically significant difference in the quantity of products per order for non-discounted products vs. discounted products.

Guidelines for interpreting Cohen's d

Cohen offered basic guidelines for interpreting the result of his test (along with a caution not to simply mindlessly apply the guidance, but to evaluate it in the larger context of the field of analysis).

In general, d = 0.2 would be considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect. In the function above, I have included conditional print statements to accompany the result of running the function.

Sign matters when interpreting Cohen's d

Note that Cohen's d can be positive or negative, depending on which dataset is assigned to group1 and which to group2. In the Cohen's d calculation in the previous cell:

  • 'group1' contained the distribution of means from resampling of the dataset of order quantities for discounted products, and
  • 'group2' contained the distribution of means from resampling of the dataset of order quantities for non-discounted products.

In conclusion

A positive Cohen's d tells us that the means in the order quantities from discounted products are d pooled standard deviations higher than the means from the order quantities from non-discounted products.

Second question: Revenues and discount / no discount

Do discounts result in a statistically significant effect on per-order revenue?

Null and alternative hypotheses

  • Null hypothesis: No statistically significant difference between the means of per-order revenues resulting from discounts and per-order revenues from non-discounted products

    • Ho: $\mu_1 = \mu_2$
  • Alternative hypothesis: There are statistically significant differences between the means of revenues resulting from discounts and revenues from non-discounted products that are unlikely to be due to chance

    • Ha: $\mu_1 \neq \mu_2$

EDA

Before proceeding, we need to add a "Revenue" column

Add Revenue, Revenue Percentage, and Revenue Fraction (revenue per row divided by total revenues)

# Add Revenue, Revenue Percentage, and Revenue Fraction (actual revenue per row divided by total revenues)

def add_rev_col():
    
    # add a revenue column
    df["Revenue"] = df["OrderUnitPrice"] * (1 - df["Discount"]) * df["OrderQty"]
        
    # add columns showing percentage / fraction of total revenue 
    df["RevPercentTotal"] = (df["Revenue"] / df["Revenue"].sum())*100  
    df["RevFractionTotal"] = (df["Revenue"] / df["Revenue"].sum())  
    
    return df
df = add_rev_col()
df.head()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
| 4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 172.8 | 0.013652 | 0.000137 |
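As an arithmetic check against the first row above: 14.4 × (1 − 0.20) × 45 = 518.40, which matches the Revenue column.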

Create sub-dataframes -- all sales; non-discounted product sales; discounted product sales

total_revenue = df.loc[df.Discount >= 0.00]
revenue_no_discount = df.loc[df.Discount == 0.00]
revenue_all_discounts = df.loc[df.Discount >= 0.01]
total_revenue.head(3)
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.2 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
revenue_no_discount.head(3)
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
| 4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 172.8 | 0.013652 | 0.000137 |
revenue_all_discounts.head(3)
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
| 5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |

Create visualization functions

# Functions for plotting

def hist_plot1(series1, label=None, color='darkcyan', alpha=0.7, bins=50, xlim=[0, 3500], print_stmt=True):
    plt.hist(series1, label=label, color=color, alpha=alpha, bins=bins)
    plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
    plt.xlabel("Revenue")
    plt.ylabel("Number")
    plt.legend(loc="best")
    plt.xlim(xlim)
    plt.show()
    if print_stmt:
        print(f"Total revenues (discount and no discount):  {int(total_revenue.Revenue.sum())}")
        print(f"Total revenues from all discounted products:  {int(revenue_all_discounts.Revenue.sum())}")
        print(f"Percentage of total revenue from discounts:  {round((revenue_all_discounts.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")

def hist_plot2(series1, series2, label=None, label1=None, color1='b', color2='r', alpha=0.4, bins=100, xlim=[0, 3500], ylim=None):
    plt.hist(series1, label=label, color=color1, alpha=alpha, bins=bins)
    plt.hist(series2, label=label1, color=color2, alpha=alpha, bins=bins)
    plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
    plt.xlabel("Revenue")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)  
    plt.ylim(ylim)
    plt.show()
hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label="Avg order revenue, no discount", label1="Avg order revenue, all discounts", color1='g', color2='b', xlim=[0, 3500])

[figure: overlaid revenue histograms, no discount vs. all discounts, xlim 0-3500]

hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label="Avg order revenue, no discount", label1="Avg order revenue, all discounts", color1='g', color2='b', xlim=[1500, 10000], ylim=[0,20])

[figure: overlaid revenue histograms, right tail, xlim 1500-10000]

The histogram shows the distribution of revenue values (x-axis). The distributions of revenues without discounts and revenues from discounted sales appear very similar, with the smallest order amounts appearing most frequently (far left-hand side of the graph) and the highest-revenue orders appearing at the far right.

If discounts have an effect on sales revenues, we would expect to see the distribution of revenues from discounts look different from the distribution of revenues due to sales that didn't involve discounts. If the shape of the distribution changes, such that a greater proportion of revenues appear in the middle or far right-hand-side of the graph, then that difference in distribution suggests that discounts actually increase revenues (since higher revenues would occur more frequently in that distribution than in the no-discount distribution).

Testing for significance: sampling, test setup and results

As with question one, we'd like to perform a two-sided, two-sample t-test. We need to check for normality and compare variances between the two distributions.

Test chosen: t-test (2-sided, 2-sample)

We can see that the above distributions are not normal. As we did for Question 1, we will resample each dataset to obtain a reasonably normal distribution of means.

Each time the cells below are run, samples of size 100 are drawn 300 times from each of the two populations (revenues from no discounts, revenues from all discounts). The mean and standard deviation are computed for each of the 300 sampling events, and each resulting mean and standard deviation is appended to its respective list.

Sampling to obtain normal distribution of means

# using sampling_mean(var, sampling_events=300, size=100) function from Question 1

sampling_no_disc_mean, sampling_no_disc_std = sampling_mean(revenue_no_discount.Revenue)
print(np.mean(sampling_no_disc_mean))
print(np.std(sampling_no_disc_std))
570.5213759999999
381.14048065742645
hist_plot1(sampling_no_disc_mean, label="Avg order revenue, no discounts", xlim=None)

[figure: histogram of sample means, revenue, no discounts]

Total revenues (discount and no discount):  1265793
Total revenues from all discounted products:  515094
Percentage of total revenue from discounts:  40.693%
sampling_w_disc_mean, sampling_w_disc_std = sampling_mean(revenue_all_discounts.Revenue)
print(np.mean(sampling_w_disc_mean))
print(np.std(sampling_w_disc_std))
618.2533294666667
331.2482261270112
hist_plot1(sampling_w_disc_mean, label="Avg order revenue, all discounts", xlim=None)

[figure: histogram of sample means, revenue, all discounts]

Total revenues (discount and no discount):  1265793
Total revenues from all discounted products:  515094
Percentage of total revenue from discounts:  40.693%

Bartlett's Test for variance

The t-test performed above requires not only that the samples come from approximately normally distributed populations, but also that they come from populations that have approximately equal variances. We use Bartlett’s test on normal populations to test the null hypothesis that the samples are from populations with equal variances.

import scipy.stats

scipy.stats.bartlett(sampling_no_disc_mean, sampling_w_disc_mean)
BartlettResult(statistic=0.04396538104808006, pvalue=0.8339180343436614)

The p-value is > 0.05, so we cannot reject the null hypothesis that the samples come from populations with equal variances; we therefore proceed as though the variances of the two populations are approximately equal.

T-test for two independent samples

import scipy.stats as scs

scs.ttest_ind(sampling_no_disc_mean, sampling_w_disc_mean)
Ttest_indResult(statistic=-6.189709371451119, pvalue=1.1197350179704822e-09)

The p-value is << 0.05, so we should reject the null hypothesis that the samples come from populations with the same means. Therefore, we can say that it's unlikely that the differences in sample means from one dataset to another are due to random chance.

So, it's likely that there is an actual statistically significant difference between the two populations. But what is the effect size of that difference? Enter Cohen's d.

Effect size: Cohen's d

Using function from Cohen's d section in Question 1:

group1 = np.array(sampling_w_disc_mean)
group2 = np.array(sampling_no_disc_mean)

d = Cohen_d(group1, group2)   # store the result so plot_pdfs(d) below uses this question's value
This Cohen's d result (0.50623208) represents a difference of 0.51 pooled standard deviations between the two groups, and suggests a **medium** effect size.

0.506232078277466

Using evaluate_PDF and plot_pdfs functions defined in Question 1:

plot_pdfs(d)

[figure: overlapping normal PDFs separated by Cohen's d ≈ 0.51]

Red = all discounted products; Blue = all non-discounted products

This graph provides a way to visualize the difference of the means, as given by the Cohen's d result. Cohen's d is a measure of the distance between the means, depicted in units of pooled standard deviation. In this case, the means are about half a pooled standard deviation apart, so the distributions overlap substantially even though the difference is statistically significant.

Findings and interpretation of test results for Question 2

Findings

There is a statistically significant difference in the average revenues per order for non-discounted products vs. discounted products. The average revenue per order was higher when discounts were present than when discounts were not present.

Guidelines for interpreting Cohen's d

As described in the Cohen's d section in Question 1 above, Cohen suggested rules of thumb for how to interpret Cohen's d. In general, d = 0.2 would be considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect.

Sign matters when interpreting Cohen's d

Note that Cohen's d can be positive or negative, depending on which dataset is assigned to group1 and which to group2. In the Cohen's d calculation in the previous cell:

  • 'group1' contained the distribution of means from resampling of the dataset containing only revenues from discounted products, and
  • 'group2' contained the distribution of means from resampling of the dataset containing revenues only from non-discounted products.

In Conclusion

A positive Cohen's d from this ordering tells us that the means in the revenues from discounted products are d pooled standard deviations higher than the means from the revenues from non-discounted products.

Third question: Per-order quantity vs. level of discount

Does the discount level (e.g., 5%, 10%, 15%...) affect the average order quantity?

While we were able to reject the null hypothesis in the first question (that order quantities are the same whether products are discounted or not), we should dig a bit deeper to find out whether some discount levels result in greater quantities per order than others.

Null and alternative hypotheses for third question

  • Null hypothesis: No significant difference among the means of product order quantities from the various discount values

    • Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
  • Alternative hypothesis: At least one mean of product order quantities differs across discount levels; differences that large are unlikely to be due to chance

    • Ha: $\mu_j \neq \mu_k$ for at least one pair $j \neq k$

EDA

Create sub-dataframes for discount levels

# Create dataframes with different discount levels (1-5%, 5-10%, 10-15%, 15-20%, and 20% and up)
    
def create_disc_dfs():
    '''
    This function creates dataframes for each category of discount in 5% increments, up to 
    20% (which is represented as 20% and up).  The function creates the dataframes 
    for future use.  
    '''
    
    df_disc_01to05 = df.loc[(df["Discount"] >= .01) & (df["Discount"] < .05)]
    df_disc_05to10 = df.loc[(df["Discount"] >= .05) & (df["Discount"] < .10)]
    df_disc_10to15 = df.loc[(df["Discount"] >= .10) & (df["Discount"] < .15)]
    df_disc_15to20 = df.loc[(df["Discount"] >= .15) & (df["Discount"] < .2)]
    df_disc_20andup = df.loc[df["Discount"] >= .20]
    return df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup

# when running this function, run the following line to extract all of the dataframes:
# df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup = create_disc_dfs()
df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup = create_disc_dfs()

Run the cells below to see the first three rows of each dataframe; uncomment the .info() lines for details on the dataframe columns and datatypes

df_disc_01to05.head(3)
# df_disc_01to05.info()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135 | 11077 | RATTC | 2014-05-06 | 6 | 25.00 | 1 | 0.02 | Grandma's Boysenberry Spread | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 24.5000 | 0.001936 | 0.000019 |
| 329 | 11077 | RATTC | 2014-05-06 | 14 | 23.25 | 1 | 0.03 | Tofu | 7 | Produce | Dried fruit and bean curd | 22.5525 | 0.001782 | 0.000018 |
| 378 | 11077 | RATTC | 2014-05-06 | 16 | 17.45 | 2 | 0.03 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 33.8530 | 0.002674 | 0.000027 |
df_disc_05to10.head(3)
# df_disc_05to10.info()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 10905 | WELLI | 2014-02-24 | 1 | 18.0 | 20 | 0.05 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 342.00 | 0.027019 | 0.000270 |
| 54 | 10632 | WANDK | 2013-08-14 | 2 | 19.0 | 30 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 541.50 | 0.042780 | 0.000428 |
| 61 | 10787 | LAMAI | 2013-12-19 | 2 | 19.0 | 15 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 270.75 | 0.021390 | 0.000214 |
df_disc_10to15.head(3)
# df_disc_10to15.info()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 11025 | WARTH | 2014-04-15 | 1 | 18.0 | 10 | 0.1 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 162.0 | 0.012798 | 0.000128 |
| 50 | 10485 | LINOD | 2013-03-25 | 2 | 15.2 | 20 | 0.1 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 273.6 | 0.021615 | 0.000216 |
| 84 | 10485 | LINOD | 2013-03-25 | 3 | 8.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 144.0 | 0.011376 | 0.000114 |
df_disc_15to20.head(3)
# df_disc_15to20.info()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
| 5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
| 10 | 10526 | WARTH | 2013-05-05 | 1 | 18.0 | 8 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 122.4 | 0.009670 | 0.000097 |
df_disc_20andup.head(3)
# df_disc_20andup.info()
| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
| 9 | 10522 | LEHMS | 2013-04-30 | 1 | 18.0 | 40 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 576.0 | 0.045505 | 0.000455 |
| 16 | 10646 | HUNGO | 2013-08-27 | 1 | 18.0 | 15 | 0.25 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 202.5 | 0.015998 | 0.000160 |
df_disc_05to10.groupby(['CategoryName']).count()  # shows sample size for each category
| CategoryName | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beverages | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
| Condiments | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
| Confections | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 |
| Dairy Products | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 |
| Grains/Cereals | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| Meat/Poultry | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| Produce | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| Seafood | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |

Visualizations

Summary of quantities by level of discount

df.groupby('Discount')['OrderQty'].sum().plot(kind='bar', 
            title ="Total Number of Items Sold by Discount Level", figsize=(8, 5), fontsize=12)
plt.show()

[figure: bar chart, total number of items sold by discount level]

Around 28,500 product units sold (approximately 55%) were not discounted, whereas approximately 23,000 (around 45%) product units were sold at a discount. Of these ~23,000 product units sold at a discount, product units sold at a 5% discount comprised the largest subset at just over 5000 units sold at this discount level. The remaining four discount levels--10%, 15%, 20%, and 25%--represent almost identical unit sales of approximately 4500 units each.
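The percentages quoted above can be recomputed directly from the dataframe (a quick sketch using objects already defined in this notebook):

qty_by_disc = df.groupby('Discount')['OrderQty'].sum()
print((qty_by_disc / qty_by_disc.sum() * 100).round(1))   # percent of all units sold at each discount level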

Visualizations of order quantities versus levels of discounts

As in the first case, we can create some initial visualizations to see if different discount levels result in differences in average order quantities.

hist_plot2_qty(df_no_disc.OrderQty, df_disc.OrderQty, label="Avg order qty, no discounts",  label1="Avg order qty, all discounts")

[figure: order quantity histograms, no discount vs. all discounts]

hist_plot2_qty(df_no_disc.OrderQty, df_disc_01to05.OrderQty, label="Avg order qty, no discounts",  label1="Avg order qty, Discounts = 1-4.9%", xlim=([0,150]))

# there are very few data points in the df_disc_01to05 category; going forward, I will delete this one.

[figure: order quantity histograms, no discount vs. 1-4.9% discount]

hist_plot2_qty(df_no_disc.OrderQty, df_disc_05to10.OrderQty, label="Avg order qty, no discounts",  label1="Avg order qty, Discounts = 5-9.9%", xlim=([0,150]))

[figure: order quantity histograms, no discount vs. 5-9.9% discount]

hist_plot2_qty(df_no_disc.OrderQty, df_disc_10to15.OrderQty, label="Avg order qty, no discounts",  label1="Avg order qty, Discounts = 10-14.9%", xlim=([0,150]))

[figure: order quantity histograms, no discount vs. 10-14.9% discount]

hist_plot2_qty(df_no_disc.OrderQty, df_disc_15to20.OrderQty,  label="Avg order qty, no discounts",  label1="Avg order qty, Discounts = 15-19.9%", xlim=([0,150]))

[figure: order quantity histograms, no discount vs. 15-19.9% discount]

hist_plot2_qty(df_no_disc.OrderQty, df_disc_20andup.OrderQty,  label="Avg order qty, no discounts",  label1="Avg order qty, Discounts = 20% and up", xlim=([0,150]))

[figure: order quantity histograms, no discount vs. 20%+ discount]

Testing for significance: sampling, test setup and results

A review of the histogram plots for order quantities at the various discount levels shows that, much like the overall order-quantity histograms, these subsets are not normally distributed. To run the statistical tests needed for the third question's hypotheses, I will perform multiple sampling of the datasets and then run statistical tests on the distributions of the means. Per the Central Limit Theorem, these distributions should be approximately normal, and so can be used for the tests that follow (ANOVA and then Tukey's test).

First test chosen: ANOVA

This test will tell us whether at least one of the datasets is significantly different from the others. (It doesn't tell us which one, though.)

As with the t-test above, I will first perform resampling and create distributions of the means of these resampling events to form approximately normal distributions to perform the ANOVA test.

Multiple sampling of datasets to obtain normal distribution of means

no_discounts = sampling_mean(revenue_no_discount.OrderQty)
print(f"The mean of the distribution is {round(np.mean(no_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(no_discounts[1]), 2)}")
hist_plot1_qty(no_discounts[0],  label="Avg order qty, no discount", xlim=([0, 40]))
The mean of the distribution is 21.76
The std deviation of the distribution is 2.29

[figure: histogram of sample means, order quantity, no discount]

all_discounts = sampling_mean(revenue_all_discounts.OrderQty)
print(f"The mean of the distribution is {round(np.mean(all_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(all_discounts[1]), 2)}")
hist_plot1_qty(all_discounts[0], label="Avg order qty, all discounts", xlim=([0, 50]))
The mean of the distribution is 27.12
The std deviation of the distribution is 2.25

[figure: histogram of sample means, order quantity, all discounts]

# testing that function results can be plugged into ttest_ind

scs.ttest_ind(no_discounts[0], all_discounts[0])
Ttest_indResult(statistic=-36.465964150081504, pvalue=3.945476681113779e-154)
Sub-dataframes based on a given discount level

As a reminder, here are the dataframes I've built so far:

  • total_revenue -- Total revenues, with and without discounts
  • revenue_no_discount -- All revenues resulting from sales of non-discounted products
  • revenue_all_discounts -- All revenues resulting from sales of products at all discount levels
  • df_disc_01to05.Revenue -- Products discounted 1-4.9%
  • df_disc_05to10.Revenue -- Products discounted 5-9.9%
  • df_disc_10to15.Revenue -- Products discounted 10-14.9%
  • df_disc_15to20.Revenue -- Products discounted 15-19.9%
  • df_disc_20andup.Revenue -- Products discounted 20% or more
total_rev_means = sampling_mean(total_revenue.OrderQty)
np.mean(total_rev_means[0]), np.std(total_rev_means[1])
(23.843966666666667, 2.5112563991838885)
revs_no_disc_means = sampling_mean(revenue_no_discount.OrderQty)
np.mean(revs_no_disc_means[0]), np.std(revs_no_disc_means[1])
(21.763033333333333, 2.3470803659571597)
revs_all_disc_means = sampling_mean(revenue_all_discounts.OrderQty)
np.mean(revs_all_disc_means[0]), np.std(revs_all_disc_means[1])
(27.08096666666667, 2.2441359474713334)

Note--I'm not including the 1-4.9% discount band because it has only 7 data points--far too few to draw samples of size 100 without replacement
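A quick way to see the problem (an illustrative cell, not from the original notebook): np.random.choice raises a ValueError when asked for more unique draws than the population contains.

print(len(df_disc_01to05))   # far smaller than the sample size of 100 used in sampling_mean
# np.random.choice(df_disc_01to05.OrderQty, size=100, replace=False)  # would raise ValueError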

df_5to10_disc_means = sampling_mean(df_disc_05to10.OrderQty)
np.mean(df_5to10_disc_means[0]), np.std(df_5to10_disc_means[1])
(27.908199999999997, 1.5019613711301814)
df_10to15_disc_means = sampling_mean(df_disc_10to15.OrderQty)
np.mean(df_10to15_disc_means[0]), np.std(df_10to15_disc_means[1])
(25.406566666666663, 1.999842545129951)
df_15to20_disc_means = sampling_mean(df_disc_15to20.OrderQty)
np.mean(df_15to20_disc_means[0]), np.std(df_15to20_disc_means[1])
(28.39096666666667, 1.170205827015813)
df_20andup_disc_means = sampling_mean(df_disc_20andup.OrderQty)
np.mean(df_20andup_disc_means[0]), np.std(df_20andup_disc_means[1])
(27.48933333333333, 1.7520488624991175)

One-way ANOVA test using scipy.stats:

scipy.stats.f_oneway(*args)

A p-value < $\alpha$ (0.05) signifies that, if the null hypothesis were true, there would be less than a 5% chance of observing differences among the sample-means distributions as large as those we see. A p-value >= 0.05 signifies that there is a greater than 5% chance that the observed differences among the distributions are due to chance.

scs.f_oneway(revs_no_disc_means[0], df_5to10_disc_means[0], df_10to15_disc_means[0], df_15to20_disc_means[0], df_20andup_disc_means[0])
F_onewayResult(statistic=1017.8604514479314, pvalue=0.0)

The extremely low p-value of the ANOVA test tells us that we should reject the null hypothesis (that there is no statistically significant difference between the distributions of means) and strongly suggests that at least one of the distributions of means differs significantly from at least one other.

Because ANOVA doesn't tell us which variables differ from others, I'll run Tukey's test to find out.

Second test chosen: Tukey's Range Test

This test can tell us which variables are significantly different from one another.

Per the Codecademy Hypothesis Testing presentation by Hillary Green-Lerman, three things are required for pairwise_tukeyhsd:

  1. A vector of all data (concatenated using np.concatenate)
  2. A vector of labels for the data
  3. A level of significance (usually 0.05)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
no_disc = np.array(revs_no_disc_means[0])
disc_05to10 = np.array(df_5to10_disc_means[0])
disc_10to15 = np.array(df_10to15_disc_means[0])
disc_15to20 = np.array(df_15to20_disc_means[0])
disc_20andup = np.array(df_20andup_disc_means[0])
# the number of means in each array equals the sampling_events argument
# passed to the sampling function that generated the data

v = np.concatenate([no_disc, disc_05to10, disc_10to15, disc_15to20, disc_20andup])
labels = ['no_disc'] * len(no_disc) + ['disc_05to10'] * len(disc_05to10) + ['disc_10to15'] * len(disc_10to15) + ['disc_15to20']*len(disc_15to20) + ['disc_20andup']*len(disc_20andup)
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

type(tukey_results)
    Multiple Comparison of Means - Tukey HSD,FWER=0.05   
=========================================================
   group1       group2    meandiff  lower   upper  reject
---------------------------------------------------------
disc_05to10  disc_10to15  -2.5016  -2.8315 -2.1718  True 
disc_05to10  disc_15to20   0.4828   0.1529  0.8126  True 
disc_05to10  disc_20andup -0.4189  -0.7487  -0.089  True 
disc_05to10    no_disc    -6.1452   -6.475 -5.8153  True 
disc_10to15  disc_15to20   2.9844   2.6545  3.3143  True 
disc_10to15  disc_20andup  2.0828   1.7529  2.4126  True 
disc_10to15    no_disc    -3.6435  -3.9734 -3.3137  True 
disc_15to20  disc_20andup -0.9016  -1.2315 -0.5718  True 
disc_15to20    no_disc    -6.6279  -6.9578 -6.2981  True 
disc_20andup   no_disc    -5.7263  -6.0562 -5.3964  True 
---------------------------------------------------------

statsmodels.sandbox.stats.multicomp.TukeyHSDResults

Once again, for average order quantity (much more so than for average revenue per order), the relatively small number of potential sample values can produce different test results from one set of sampling runs to the next. The first time I ran the sampling function and then Tukey's test, every single pairing was found to be significant (True for "reject null hypothesis" for all pairings). The second time I ran the sampling functions and the tests, the result was to reject the null for all pairs except disc_05to10 / disc_20andup.

In any case, what I find is that the average order quantity is statistically significantly different for most, if not all, pairs. In other words, each of these distributions of average order quantity by discount level (except, on some runs, the disc_05to10 / disc_20andup pair) is significantly different from the others at the 0.05 level.

Effect size: Cohen's d pairwise comparisons and visualization

To find out the effect sizes of the various discount levels, we can run Cohen's d to see which differences matter most. Except for one pairing, all were significant, so we will run Cohen's d on each pair.

d = Cohen_d(disc_05to10, disc_10to15)
plot_pdfs(d)
This Cohen's d result (1.7861853) represents a difference of 1.79 pooled standard deviations between the two groups, and suggests a **large** effect size.

[figure: overlapping PDFs, disc_05to10 vs. disc_10to15]

Magenta = group1 = disc_05to10 ; blue = group2 = disc_10to15

The average order quantities from discounts of 5-9.9% are significantly higher than the average order quantities from discounts of 10-14.9%.

That is a rather strange outcome; whereas one might expect such a result for revenues, where two somewhat opposing forces are at work (higher discounts might drive customers to purchase more, but at the same time reduce margins for the seller), it is not clear why we would see higher quantities sold for a discount of 5% as opposed to a discount of 10%.

It could be that Northwind is only offering the larger discounts on more expensive products: the larger discount might prompt a customer to buy the product (as opposed to not buying it), but might not be enough in and of itself to prompt the customer to buy more units than planned.
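One way to probe that hypothesis (a sketch added for illustration, using columns already in df) is to compare the average unit price at each discount level:

print(df.groupby('Discount')['OrderUnitPrice'].mean().round(2))   # avg unit price by discount level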

d = Cohen_d(disc_05to10, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-0.36344668) represents a difference of -0.36 pooled standard deviations between the two groups, and suggests a **small** effect size.

[figure: overlapping PDFs, disc_05to10 vs. disc_15to20]

Magenta = group1 = disc_05to10 ; blue = group2 = disc_15to20

The average order quantity on products discounted 5-9.9% is slightly lower than the average order quantity on products discounted 15-19.9%.

Directionally, this result makes sense (the greater the discount, the larger the quantity sold), and some of the comments I made in the previous pairwise comparison might apply here also. Perhaps a 10% discount isn't enough to spur additional quantity buying (though it might induce a customer to buy at least one unit), but as the discount gets bigger, it might start driving increased quantity purchasing, even on more expensive products. This might represent a continuum, as seen in the next graph...

d = Cohen_d(disc_05to10, disc_20andup)
plot_pdfs(d)

# note: in one sampling run, this pairing was not found to be statistically significant in Tukey's test.
This Cohen's d result (0.26794938) represents a difference of 0.27 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = disc_20andup

The average order quantity on products discounted 5-9.9% is slightly higher than the average order quantity on products discounted 20% and up. This is somewhat counterintuitive; all things being equal, one would expect to see more product units purchased at a 20%+ discount than at a 5% discount. However, it is likely that a lot fewer products get the 20%+ discount treatment, so there just isn't as much volume available at that discount level. It could also be that the larger discounts are relatively more available on more expensive products, and so one might not see as much of an increase in quantity purchased as with lower-cost items.

d = Cohen_d(disc_05to10, no_disc)
plot_pdfs(d)
This Cohen's d result (4.14557783) represents a difference of 4.15 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = no_disc

The average order quantity on products discounted 5-9.9% is substantially higher than the average order quantity on non-discounted products, which is not at all surprising.

d = Cohen_d(disc_10to15, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-2.23519295) represents a difference of -2.24 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = disc_15to20

The average order quantity on products discounted 10-14.9% is significantly lower than the average order quantity on products discounted 15-19.9%. This is unsurprising.

d = Cohen_d(disc_10to15, disc_20andup)
plot_pdfs(d)
This Cohen's d result (-1.32737613) represents a difference of -1.33 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = disc_20andup

The average order quantity on products discounted 10-14.9% is significantly lower than the average order quantity on products discounted 20% and up. Again, unsurprising, and consistent with the previous two Cohen's d results and graphs.

d = Cohen_d(disc_10to15, no_disc)
plot_pdfs(d)
This Cohen's d result (2.44776268) represents a difference of 2.45 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = no_disc

The average order quantity on products discounted 10-14.9% is significantly higher than the average order quantity on non-discounted products. Again, consistent with the previous findings described above.

d = Cohen_d(disc_15to20, disc_20andup)  
plot_pdfs(d)
This Cohen's d result (0.59911024) represents a difference of 0.6 pooled standard deviations between the two groups, and suggests a **medium** effect size.

png

Magenta = group1 = disc_15to20 ; blue = group2 = disc_20andup

The average order quantity on products discounted 15-19.9% is somewhat lower than the average order quantity on products discounted 20% and up. Again, consistent with the previous findings described above.

d = Cohen_d(disc_15to20, no_disc)
plot_pdfs(d)
This Cohen's d result (4.66508814) represents a difference of 4.67 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_15to20 ; blue = group2 = no_disc

There is a big difference between average order quantity for substantially discounted products (magenta) and non-discounted products (blue).

d = Cohen_d(disc_20andup, no_disc)
plot_pdfs(d)
This Cohen's d result (3.48630871) represents a difference of 3.49 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_20andup ; blue = group2 = no_disc

And once again, a very large difference between average order quantity for deeply discounted products (magenta) and non-discounted products (blue).

Findings and interpretation of test results for Question 3

Results regarding average order quantity relative to discount offered are, with a few exceptions, intuitive.

  • Not surprising: Larger discounts tend to correlate in a statistically significant way to greater order unit quantities.
  • Somewhat surprising: Products in the 5-9.9% discount band (which consists almost entirely of 5% discounts) tend to have larger effect sizes (measured by Cohen's d) than those at 10-14.9%. They even show a slightly higher average order quantity than discounts at 20% and above (though Cohen's d for this pairing is small).
  • Future exploration could focus on why these patterns appear. A few possibilities:
    • Smaller discounts are likely to be applied to a broader range of products than deeper discounts, so there are just a lot more products that can be purchased at that level
    • It could be that larger discounts are applied relatively more often to more expensive products than less expensive ones. Rather than driving additional unit purchases, these larger discounts might be applied to more expensive products to induce a purchase that might not otherwise have happened, and/or to hasten a sale in the case of perishable products.

These findings, combined with those on average order revenue by discount levels, provide some intriguing insights into how pricing occurs and sales are made. Some focused exploration on discounts by product categories and specific products will likely lead to targeted strategies for improving sales in certain categories and/or geographies.


Fourth question

Does the level of discount affect average revenues per order?

Null and alternative hypotheses for fourth question

  • Null hypothesis: No significant difference among the means of per-order revenues from the various discount values

    • Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
  • Alternative hypothesis: Differences among the means of per-order revenues resulting from various discount values are unlikely to be due to chance

    • Ha: $\mu_i \neq \mu_j$ for at least one pair $(i, j)$

EDA

Summary of revenues by level of discount

df.groupby('Discount')['Revenue'].sum().plot(kind='bar', 
            title ="Revenues by Discount Level", figsize=(8, 5), fontsize=12)
plt.show()

png

Almost 60% of product revenues come from non-discounted products, which means that over 40% come from discounted products. Among discounted products, the 5% discount level brings in the most revenue (~\$148,000), with the 25% level next (~\$99,000). The remaining discount levels with significant revenues are 10%, 15%, and 20%; all three are virtually equal, though 10% appears to bring in slightly more revenue than 15% or 20%.
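
These shares can be read directly off the order-detail dataframe; a quick sanity check (a sketch, using the df built earlier):

```python
# Revenue share of each discount level as a percentage of total revenue
shares = df.groupby('Discount')['Revenue'].sum() / df['Revenue'].sum() * 100
print(shares.round(2))
```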

Visualizations of discount levels vs. revenues from no discounts

As in the first case, we can create some initial visualizations to see if the magnitude of the discount results in differences in the sales revenues distributions.

# function for plotting three different series

def hist_plot3(var1, var2, var3, label=None, label1=None, label2=None, alpha=0.5, bins=100, xlim=[0, 3500]):
    plt.hist(var1, label=label, color='r', alpha=alpha, bins=bins)
    plt.hist(var2, label=label1, color='b', alpha=alpha, bins=bins)
    plt.hist(var3, label=label2, color='g', alpha=alpha, bins=bins)  # distinct color so the third series is visible
    plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
    plt.xlabel("Revenue")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)
    plt.show()
hist_plot3(total_revenue.Revenue, revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label='Avg order revenue, discounts and no discounts', label1='Avg order revenue, no discounts', label2='Avg order revenue, all discounts')

print(f"Total revenues (discount and no discount):  {int(total_revenue.Revenue.sum())}")
print(f"Total revenues, no discount:  {int(revenue_no_discount.Revenue.sum())}")
print(f"Total revenues from all discounted products:  {int(revenue_all_discounts.Revenue.sum())}")

png

Total revenues (discount and no discount):  1265793
Total revenues, no discount:  750698
Total revenues from all discounted products:  515094
hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, ALL discounts', alpha=0.5, bins=100)

png

Discounts = 1% to 4.99%: Effect on total revenues

Because the revenues from this group are so small, I will be ignoring them for the purposes of this analysis.

Discounts = 5% to 9.99%: Effect on total revenues
hist_plot2(revenue_no_discount.Revenue, df_disc_05to10.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 5-9.9%', alpha=0.5)

print(f"Revenues from discount level of 5-9.99%:  {int(df_disc_05to10.Revenue.sum())}")
print(f"Discount 5-9.99% as a percentage of total revenue:  {round((df_disc_05to10.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 5-9.99% as a percentage of all discount revenue:  {round((df_disc_05to10.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")

png

Revenues from discount level of 5-9.99%:  147681
Discount 5-9.99% as a percentage of total revenue:  11.667%
Discount 5-9.99% as a percentage of all discount revenue:  28.671%
Discounts = 10% to 14.99%: Effect on total revenues
hist_plot2(revenue_no_discount.Revenue, df_disc_10to15.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 10-14.9%', alpha=.5)

print(f"Revenues from discount level of 10-14.99%:  {int(df_disc_10to15.Revenue.sum())}")
print(f"Discount 10-14.99% as a percentage of total revenue:  {round((df_disc_10to15.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 10-14.99% as a percentage of all discount revenue:  {round((df_disc_10to15.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")

png

Revenues from discount level of 10-14.99%:  91499
Discount 10-14.99% as a percentage of total revenue:  7.229%
Discount 10-14.99% as a percentage of all discount revenue:  17.764%
Discounts = 15% to 19.99%: Effect on total revenues
hist_plot2(total_revenue.Revenue, df_disc_15to20.Revenue, label='Avg order revenue, discounts and no discounts', label1='Avg order revenue, discounts 15-19.9%', alpha=0.5, bins=100)

print(f"Revenues from discount level of 15-19.99%:  {int(df_disc_15to20.Revenue.sum())}")
print(f"Discount 15-19.99% as a percentage of total revenue:  {round((df_disc_15to20.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 15-19.99% as a percentage of all discount revenue:  {round((df_disc_15to20.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")

png

Revenues from discount level of 15-19.99%:  87506
Discount 15-19.99% as a percentage of total revenue:  6.913%
Discount 15-19.99% as a percentage of all discount revenue:  16.988%
Discounts = 20% or more: Effect on total revenues
hist_plot2(total_revenue.Revenue, df_disc_20andup.Revenue, label='Avg order revenue, discounts and no discounts', label1='Avg order revenue, discounts 20% and up', alpha=0.5, bins=100)

print(f"Revenues from discount level of 20% and up:  {int(df_disc_20andup.Revenue.sum())}")
print(f"Discount 20% and up as a percentage of total revenue:  {round((df_disc_20andup.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 20% and up as a percentage of all discount revenue:  {round((df_disc_20andup.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")

png

Revenues from discount level of 20% and up:  188119
Discount 20% and up as a percentage of total revenue:  14.862%
Discount 20% and up as a percentage of all discount revenue:  36.521%
Totals and percentages of discount revenue to all revenues
df.groupby(['Discount']).sum().sort_values(['Discount'], ascending=False)

| Discount | Id | ProductId | OrderUnitPrice | OrderQty | CategoryId | Revenue | RevPercentTotal | RevFractionTotal |
|----------|----|-----------|----------------|----------|------------|---------|-----------------|------------------|
| 0.25 | 1648801 | 5820 | 4345.02 | 4349 | 643 | 98938.5675 | 7.816331 | 0.078163 |
| 0.20 | 1711705 | 6445 | 3787.81 | 4351 | 661 | 89181.1040 | 7.045473 | 0.070455 |
| 0.15 | 1672996 | 6351 | 3607.22 | 4456 | 595 | 87506.1740 | 6.913150 | 0.069132 |
| 0.10 | 1833523 | 7164 | 4354.68 | 4366 | 742 | 91499.1390 | 7.228602 | 0.072286 |
| 0.06 | 11077 | 60 | 34.00 | 2 | 4 | 63.9200 | 0.005050 | 0.000050 |
| 0.05 | 1972417 | 7311 | 5697.32 | 5182 | 783 | 147617.3745 | 11.662047 | 0.116620 |
| 0.04 | 11077 | 20 | 81.00 | 1 | 3 | 77.7600 | 0.006143 | 0.000061 |
| 0.03 | 33231 | 94 | 73.95 | 5 | 15 | 120.9105 | 0.009552 | 0.000096 |
| 0.02 | 22154 | 52 | 37.00 | 4 | 10 | 59.7800 | 0.004723 | 0.000047 |
| 0.01 | 11077 | 73 | 15.00 | 2 | 8 | 29.7000 | 0.002346 | 0.000023 |
| 0.00 | 14042897 | 54519 | 34467.91 | 28599 | 5448 | 750698.6100 | 59.306584 | 0.593066 |

print(f"The mean of the distribution is {round(revenue_no_discount.Revenue.mean(), 2)}")
hist_plot1(revenue_no_discount.Revenue, label="Average order revenues, no discounts") # plotting with xlim in place
The mean of the distribution is 570.01

png

Total revenues (discount and no discount):  1265793
Total revenues from all discounted products:  515094
Percentage of total revenue from discounts:  40.693%
# hist_plot1(df_disc_05to10.Revenue, xlim=None)  # plotting without xlim to see outliers
print(f"The mean of the distribution is {round(df_disc_05to10.Revenue.mean(), 2)}")
hist_plot1(df_disc_05to10.Revenue, label="Average order revenues, discounts of 5-9.9%", print_stmt=False) # plotting with xlim in place
The mean of the distribution is 793.99

png

# hist_plot1(df_disc_10to15.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_10to15.Revenue.mean(), 2)}")
hist_plot1(df_disc_10to15.Revenue, label="Average order revenues, discounts of 10-14.9%", print_stmt=False)
The mean of the distribution is 528.9

png

# hist_plot1(df_disc_15to20.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_15to20.Revenue.mean(), 2)}")
hist_plot1(df_disc_15to20.Revenue, label="Average order revenues, discounts of 15-19.9%", print_stmt=False)
The mean of the distribution is 557.36

png

# hist_plot1(df_disc_20andup.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_20andup.Revenue.mean(), 2)}")
hist_plot1(df_disc_20andup.Revenue, label="Average order revenues, discounts of 20% and up", xlim=[0, 3000], print_stmt=False)
The mean of the distribution is 597.21

png

Testing for significance: sampling, test setup and results

A review of the histogram plots for revenues at the various discount levels shows that, much like the overall revenue frequency histograms, these revenue-by-discount subsets are non-normally distributed. To run the statistical tests needed for this question's hypotheses, I will repeatedly sample each dataset and then run the tests on the distributions of the sample means. Per the Central Limit Theorem, those distributions should be approximately normal, and so can be used for other tests (specifically ANOVA and Tukey's test).

First test chosen: ANOVA

This test will tell us whether at least one of the datasets is significantly different from the others. (It doesn't tell us which one, though.)

As with the t-test above, I will first perform resampling and use the means of the resampled sets to form approximately normal distributions on which to run the ANOVA test.
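
The sampling_mean helper was defined earlier in the notebook. A minimal sketch of what such a helper might look like, assuming it draws repeated samples with replacement and returns the per-sample means along with the pooled raw draws (names and defaults here are illustrative):

```python
import numpy as np

def sampling_mean(data, n_samples=300, sample_size=30):
    """Draw n_samples random samples (with replacement) of sample_size
    observations each; return each sample's mean plus the pooled draws."""
    data = np.asarray(data)
    means, pooled = [], []
    for _ in range(n_samples):
        sample = np.random.choice(data, size=sample_size, replace=True)
        means.append(sample.mean())
        pooled.extend(sample)
    return means, pooled
```

Per the CLT, the list of means in element [0] is approximately normal even though the underlying revenue data is heavily right-skewed.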

Multiple sampling of datasets to obtain normal distribution of means

no_discounts = sampling_mean(revenue_no_discount.Revenue)
print(f"The mean of the distribution is {round(np.mean(no_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(no_discounts[1]), 2)}")
hist_plot1(no_discounts[0], label="Average order revenues, no discounts", xlim=([0, 1200]))
The mean of the distribution is 581.46
The std deviation of the distribution is 384.4

png

Total revenues (discount and no discount):  1265793
Total revenues from all discounted products:  515094
Percentage of total revenue from discounts:  40.693%
all_discounts = sampling_mean(revenue_all_discounts.Revenue)
print(f"The mean of the distribution is {round(np.mean(all_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(all_discounts[1]), 2)}")
hist_plot1(all_discounts[0], label="Average order revenues, all discounts", xlim=([0, 1200]))
The mean of the distribution is 612.94
The std deviation of the distribution is 335.39

png

Total revenues (discount and no discount):  1265793
Total revenues from all discounted products:  515094
Percentage of total revenue from discounts:  40.693%
# testing that function results can be plugged into ttest_ind

scs.ttest_ind(no_discounts[0], all_discounts[0])
Ttest_indResult(statistic=-4.003612486005798, pvalue=7.022917511073499e-05)
Sub-dataframes based on revenues from a given discount level

As a reminder, here are the dataframes I've built so far:

  • total_revenue -- Total revenues, with and without discounts
  • revenue_no_discount -- All revenues resulting from sales of non-discounted products
  • revenue_all_discounts -- All revenues resulting from sales of products at all discount levels
  • df_disc_01to05 -- Products discounted 1-4.9%
  • df_disc_05to10 -- Products discounted 5-9.9%
  • df_disc_10to15 -- Products discounted 10-14.9%
  • df_disc_15to20 -- Products discounted 15-19.9%
  • df_disc_20andup -- Products discounted 20% or more
total_rev_means = sampling_mean(total_revenue.Revenue)
np.mean(total_rev_means[0]), np.std(total_rev_means[1])
(583.0617516000001, 351.03901350261066)
revs_no_disc_means = sampling_mean(revenue_no_discount.Revenue)
np.mean(revs_no_disc_means[0]), np.std(revs_no_disc_means[1])
(564.6655116666666, 345.1591629191503)
revs_all_disc_means = sampling_mean(revenue_all_discounts.Revenue)
np.mean(revs_all_disc_means[0]), np.std(revs_all_disc_means[1])
(622.9363593333333, 385.2859280957691)

Note--I'm not including revenues from 1-4.9% discounts because there are only 7 data points--not enough revenue points in this sample to compute these statistics
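
A quick size check makes the point (a sketch, assuming the df_disc_01to05 sub-dataframe listed above):

```python
# The 1-4.9% band is too thin to resample meaningfully
print(f"{len(df_disc_01to05)} rows, {round(df_disc_01to05.Revenue.sum(), 2)} in total revenue")
```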

df_5to10_disc_means = sampling_mean(df_disc_05to10.Revenue)
np.mean(df_5to10_disc_means[0]), np.std(df_5to10_disc_means[1])
(785.3587769166667, 388.22407082181945)
df_10to15_disc_means = sampling_mean(df_disc_10to15.Revenue)
np.mean(df_10to15_disc_means[0]), np.std(df_10to15_disc_means[1])
(529.1909292, 51.08710103154266)
df_15to20_disc_means = sampling_mean(df_disc_15to20.Revenue)
np.mean(df_15to20_disc_means[0]), np.std(df_15to20_disc_means[1])
(555.7451172499999, 51.47884666219702)
df_20andup_disc_means = sampling_mean(df_disc_20andup.Revenue)
np.mean(df_20andup_disc_means[0]), np.std(df_20andup_disc_means[1])
(595.8233765, 239.43192635432803)

One-way ANOVA test using scipy.stats:

scipy.stats.f_oneway(*args)

A p-value below $\alpha$ (0.05) signifies that, if the null hypothesis were true, there would be less than a 5% chance of observing differences among the sample-means distributions as large as those observed. A p-value of 0.05 or greater signifies at least a 5% chance that the observed differences are due to chance alone.

scs.f_oneway(revs_no_disc_means[0], df_5to10_disc_means[0], df_10to15_disc_means[0], df_15to20_disc_means[0], df_20andup_disc_means[0])
F_onewayResult(statistic=628.196879270906, pvalue=3.506107e-318)

The extremely low p-value of the ANOVA test tells us that we should reject the null hypothesis (that there is no statistically significant difference between the distributions of means); it strongly suggests that at least one of the distributions of means differs significantly from at least one other.

Because ANOVA doesn't tell us which variables differ from others, I'll run Tukey's test to find out.

Second test chosen: Tukey's Range Test

This test can tell us which variables are significantly different from one another.

Per the Codecademy Hypothesis Testing presentation by Hillary Green-Lerman, three things are required for pairwise_tukeyhsd:

  1. A vector of all data (concatenated using np.concatenate)
  2. A vector of labels for the data
  3. A level of significance (usually 0.05)
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# The number of means in each of these arrays equals the number of sampling
# events passed to the sampling function that generated the data.
no_disc = np.array(revs_no_disc_means[0])
disc_05to10 = np.array(df_5to10_disc_means[0])
disc_10to15 = np.array(df_10to15_disc_means[0])
disc_15to20 = np.array(df_15to20_disc_means[0])
disc_20andup = np.array(df_20andup_disc_means[0])

v = np.concatenate([no_disc, disc_05to10, disc_10to15, disc_15to20, disc_20andup])
labels = (['no_disc'] * len(no_disc) + ['disc_05to10'] * len(disc_05to10)
          + ['disc_10to15'] * len(disc_10to15) + ['disc_15to20'] * len(disc_15to20)
          + ['disc_20andup'] * len(disc_20andup))
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

type(tukey_results)
      Multiple Comparison of Means - Tukey HSD,FWER=0.05      
==============================================================
   group1       group2     meandiff   lower     upper   reject
--------------------------------------------------------------
disc_05to10  disc_10to15  -256.1678 -272.0348 -240.3009  True 
disc_05to10  disc_15to20  -229.6137 -245.4806 -213.7468  True 
disc_05to10  disc_20andup -189.5354 -205.4023 -173.6685  True 
disc_05to10    no_disc    -220.6933 -236.5602 -204.8264  True 
disc_10to15  disc_15to20   26.5542   10.6873   42.4211   True 
disc_10to15  disc_20andup  66.6324   50.7655   82.4993   True 
disc_10to15    no_disc     35.4746   19.6077   51.3415   True 
disc_15to20  disc_20andup  40.0783   24.2114   55.9452   True 
disc_15to20    no_disc      8.9204   -6.9465   24.7873  False 
disc_20andup   no_disc     -31.1579  -47.0248  -15.291   True 
--------------------------------------------------------------





statsmodels.sandbox.stats.multicomp.TukeyHSDResults

On most runs of the sampling function, the result is that we can reject the null hypothesis for all pairs except the disc_15to20 / no_disc pair. What this means is that each of these distributions of means (except for the disc_15to20 / no_disc pair) is statistically significantly different from the others at the 0.05 level.

I have noticed that if I reduce the number of sampling events in the function, I occasionally see slightly different results. Sometimes I get a "reject the null hypothesis" result for every single pair; other times I get "False" for a few of the pairs (usually the 15-20% / no discount pair, though I have seen it for one or two others). Increasing the number of sampling events from 200 to 300 seems to eliminate that variability.
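
One way to quantify that variability is to repeat the sample-then-Tukey pipeline several times and count how often each pairing is rejected. A sketch, reusing the sampling_mean helper and sub-dataframes defined above:

```python
from collections import Counter
from itertools import combinations

reject_counts = Counter()
n_runs = 10
for _ in range(n_runs):
    groups = {
        'no_disc': sampling_mean(revenue_no_discount.Revenue)[0],
        'disc_05to10': sampling_mean(df_disc_05to10.Revenue)[0],
        'disc_10to15': sampling_mean(df_disc_10to15.Revenue)[0],
        'disc_15to20': sampling_mean(df_disc_15to20.Revenue)[0],
        'disc_20andup': sampling_mean(df_disc_20andup.Revenue)[0],
    }
    v = np.concatenate(list(groups.values()))
    labels = [name for name, means in groups.items() for _ in means]
    result = pairwise_tukeyhsd(v, labels, 0.05)
    # pairwise_tukeyhsd orders its result rows by sorted group label, pairwise
    for pair, rejected in zip(combinations(sorted(groups), 2), result.reject):
        reject_counts[pair] += int(rejected)

print(reject_counts)  # a pair whose count equals n_runs was rejected on every run
```

A pairing that is rejected on every run is a much safer conclusion than one that flips between runs.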

A few comments on Tukey results

Two really interesting things that emerge from running the Tukey test are:

  1. The differences between the means of the disc_15to20 distribution and those of the no_disc (no discounts) group are NOT statistically significant. 15-19.9% seems to be a fairly large discount, so I am a bit surprised to see that this pair was not significantly different.

  2. ALL other pair comparisons result in statistically significant differences.

    • While perhaps this is not surprising, it does make me wonder whether there is a linear relationship between discount level and revenue, or whether what matters is simply that a discount exists.
    • It's important to remind ourselves that sales from discounts of 1-4.9% were so low (only \$288 of more than \$1.2 million in total revenues) that they were not included; there were only 7 data points in that category. This suggests either that such small discounts don't work, or that the company (or its salespeople) simply doesn't offer them.

Effect size: Cohen's d pairwise comparisons and visualization

To find out the effect sizes of the various discount levels, we can run Cohen's d to see which differences matter most. Except for one pairing (disc_15to20 / no_disc), all were significant, so we will run Cohen's d on each pair.

d = Cohen_d(disc_05to10, disc_10to15)
plot_pdfs(d)
This Cohen's d result (3.51500644) represents a difference of 3.52 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = disc_10to15

The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 10-14.9%.

A possible mechanism with two opposing forces might explain why we see this outcome:

  1. A greater discount may drive customers to purchase more... BUT...
  2. The larger discount may ultimately put more downward pressure on the total revenue per sale...the customer may or may not be buying more stuff, but what they are buying, they are getting for a lower price --> lower revenue per unit
d = Cohen_d(disc_05to10, disc_15to20)
plot_pdfs(d)
This Cohen's d result (3.15453316) represents a difference of 3.15 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = disc_15to20

The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 15-19.9%.

d = Cohen_d(disc_05to10, disc_20andup)
plot_pdfs(d)
This Cohen's d result (2.18613461) represents a difference of 2.19 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = disc_20andup

The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 20% and up.

d = Cohen_d(disc_05to10, no_disc)
plot_pdfs(d)
This Cohen's d result (2.40795871) represents a difference of 2.41 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_05to10 ; blue = group2 = no_disc

The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from non-discounted products.

d = Cohen_d(disc_10to15, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-0.72665591) represents a difference of -0.73 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = disc_15to20

The means of revenues from discounts of 10-14.9% are somewhat lower than the means of revenues from discounts of 15-19.9%. Another slightly counterintuitive result.

d = Cohen_d(disc_10to15, disc_20andup)
plot_pdfs(d)
This Cohen's d result (-1.11771677) represents a difference of -1.12 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = disc_20andup

The means of revenues from discounts of 10-14.9% are lower than the means of revenues from discounts of 20% and up.

d = Cohen_d(disc_10to15, no_disc)
plot_pdfs(d)
This Cohen's d result (-0.53255028) represents a difference of -0.53 pooled standard deviations between the two groups, and suggests a **medium** effect size.

png

Magenta = group1 = disc_10to15 ; blue = group2 = no_disc

The means of revenues from discounts of 10-14.9% are slightly lower than the means of revenues from non-discounted products. This is a somewhat counterintuitive result that warrants further investigation.

d = Cohen_d(disc_15to20, disc_20andup)  
plot_pdfs(d)
This Cohen's d result (-0.67352883) represents a difference of -0.67 pooled standard deviations between the two groups, and suggests a **medium** effect size.

png

Magenta = group1 = disc_15to20 ; blue = group2 = disc_20andup

The means of revenues from discounts of 15-19.9% are slightly lower than the means of revenues from discounts of 20% and up.

d = Cohen_d(disc_15to20, no_disc)
plot_pdfs(d)

# note that this pairing was not found to be statistically significant in Tukey's test.
This Cohen's d result (-0.13411241) represents a difference of -0.13 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

Magenta = group1 = disc_15to20 ; blue = group2 = no_disc

The means of revenues from discounts of 15-19.9% are very similar to the means of revenues from non-discounted products.

d = Cohen_d(disc_20andup, no_disc)
plot_pdfs(d)
This Cohen's d result (0.38229256) represents a difference of 0.38 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

Magenta = group1 = disc_20andup ; blue = group2 = no_disc

The means of revenues from discounts of 20% and up are slightly higher than the means of revenues from non-discounted products.

Findings and interpretation of test results for Question 4

I've concluded the following after running Tukey's Test and Cohen's d for each pairwise comparison:

  • Discounts at the 5-10% level appear to drive the highest revenues per order, relative to all other discount levels (or no discount at all)

    • While higher discounts are likely to drive more purchases, the per-unit revenue will decrease with larger discounts.
    • Perhaps this relatively small discount level is enough to get customers to buy more while preserving higher revenue per unit
  • Generally, as the discount increases, these differences shrink (though they are still quite significant)

    • One notable exception is that in the "5-10% discount vs. no discount" pair, the difference between the means is not as great as it is for the "5-10% discount vs. 10-15% discount" pair
    • It is possible that, in the cases where the difference between a given discount level and no discount is less than one would expect, there are some high-volume and/or high-priced products that are never discounted that drive the no-discount revenue means upward.
  • More investigation is needed to derive insights and make recommendations; questions to pursue include:

    • Does the pattern of revenue by discounts hold across all categories, or is there a subset of products that create the outcomes we see?
    • Similarly, is there a relationship between geography/sales region and the discounts that drive more revenues?
    • Do we see certain discount levels drive more revenues at certain times of the year?
    • Is there a linear or non-linear relationship between discount level and revenues?
    • Does removal of outliers dramatically affect the relationships observed?

Fifth question

Does the distribution of revenues per order vary in a statistically significant way across product categories?

This can help us understand whether orders in certain product categories generate higher revenues, on average, than others.

Null and alternative hypotheses for fifth question

  • Null hypothesis: No significant difference among the means of order revenues distributions by product category

    • Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
  • Alternative hypothesis: Differences among the means of order revenues distributions by product category are unlikely to be due to chance

    • Ha: $\mu_i \neq \mu_j$ for at least one pair $(i, j)$
# df.groupby(['CategoryId', 'CategoryName', 'CatDescription']).sum().sort_values(['CategoryId'])

df.groupby(['CategoryId', 'CategoryName', 'Discount']).count().head(10)

(Every column of the .count() output shows the same per-group row count, so a single count column is shown.)

| CategoryId | CategoryName | Discount | Row count |
|------------|--------------|----------|-----------|
| 1 | Beverages | 0.00 | 246 |
|   |           | 0.05 | 26 |
|   |           | 0.10 | 25 |
|   |           | 0.15 | 36 |
|   |           | 0.20 | 39 |
|   |           | 0.25 | 32 |
| 2 | Condiments | 0.00 | 131 |
|   |            | 0.02 | 1 |
|   |            | 0.05 | 19 |
|   |            | 0.10 | 23 |

df.groupby(['CategoryId', 'CategoryName', 'Discount']).sum().head(20)

| CategoryId | CategoryName | Discount | Id | ProductId | OrderUnitPrice | OrderQty | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Beverages | 0.00 | 2623997 | 10381 | 7161.95 | 5116 | 153060.2500 | 12.092044 | 0.120920 |
| | | 0.05 | 277313 | 1038 | 1507.75 | 898 | 43567.3800 | 3.441904 | 0.034419 |
| | | 0.10 | 264816 | 1219 | 812.20 | 728 | 13027.6350 | 1.029207 | 0.010292 |
| | | 0.15 | 384608 | 1348 | 609.65 | 885 | 12889.9950 | 1.018334 | 0.010183 |
| | | 0.20 | 418197 | 1353 | 1005.40 | 1140 | 29569.3200 | 2.336031 | 0.023360 |
| | | 0.25 | 343213 | 1017 | 714.70 | 765 | 15753.6000 | 1.244564 | 0.012446 |
| 2 | Condiments | 0.00 | 1399149 | 5724 | 2834.55 | 2793 | 60577.8000 | 4.785759 | 0.047858 |
| | | 0.02 | 11077 | 6 | 25.00 | 1 | 24.5000 | 0.001936 | 0.000019 |
| | | 0.05 | 202371 | 953 | 337.90 | 719 | 11891.5300 | 0.939453 | 0.009395 |
| | | 0.10 | 243158 | 995 | 449.30 | 536 | 8934.7050 | 0.705858 | 0.007059 |
| | | 0.15 | 180730 | 890 | 373.65 | 519 | 10639.0250 | 0.840503 | 0.008405 |
| | | 0.20 | 148734 | 563 | 340.50 | 312 | 6105.2000 | 0.482322 | 0.004823 |
| | | 0.25 | 118286 | 426 | 244.40 | 418 | 7874.3250 | 0.622086 | 0.006221 |
| 3 | Confections | 0.00 | 2131272 | 7572 | 4636.56 | 4618 | 107128.4700 | 8.463348 | 0.084633 |
| | | 0.03 | 11077 | 16 | 17.45 | 2 | 33.8530 | 0.002674 | 0.000027 |
| | | 0.04 | 11077 | 20 | 81.00 | 1 | 77.7600 | 0.006143 | 0.000061 |
| | | 0.05 | 309325 | 843 | 705.63 | 795 | 18598.5775 | 1.469322 | 0.014693 |
| | | 0.10 | 318611 | 1062 | 584.23 | 545 | 10200.7440 | 0.805878 | 0.008059 |
| | | 0.15 | 308602 | 951 | 617.37 | 818 | 15433.5690 | 1.219281 | 0.012193 |
| | | 0.20 | 180335 | 660 | 314.23 | 338 | 4716.2640 | 0.372594 | 0.003726 |

EDA

Bar graphs

Product discount by product category

df.groupby('CategoryName')['Discount'].sum().plot(kind='bar', 
            title ="Discount Level in Percentages by Product Category", figsize=(8, 5), fontsize=12)
plt.show()

png

The category with the greatest total discounting is Beverages; Seafood, Dairy Products, and Confections are nearly tied for second place, each with a summed discount value of roughly 20.

Let's also look at revenues by product category and see how that compares to percentage discount:

Total revenues by product category

df.groupby('CategoryName')['Revenue'].sum().plot(kind='bar', 
            title ="Revenues by Category Name", figsize=(8, 5), fontsize=12)
plt.show()

png

Number of orders by product category

df.groupby('CategoryName')['Discount'].count().plot(kind='bar', 
            title ="Orders by Product Category", figsize=(8, 5), fontsize=12)
plt.show()

png

Comments on initial EDA

Interestingly, Beverages has the highest revenues of any category at around \$260,000, followed closely by Dairy Products at around \$230,000, then Confections at around \$170,000, and Meat/Poultry at around \$160,000. Although Seafood tends to have fairly significant discounts (around 20%), it is 5th of 8 in terms of revenue, at around \$140,000.

Some observations and comments:

  • For a category such as seafood, which is likely to include both highly perishable items and shelf-stable items (e.g., canned tuna, salmon, clams), it may not be surprising to see that discounts make up a higher proportion of revenues than is the case for less perishable items.
  • On the other hand, Produce has fairly low discounts and relatively low revenues. A couple of possible questions might warrant further exploration:
    • Although perishable, produce quantities and discounts might already be optimized for the category's highly perishable nature; in other words, maybe Northwind only has access to a certain amount of produce from suppliers, and this amount tends to sell out (rather than having to be discarded).
    • While there may be some truth to this potential explanation, Produce as a category experiences relatively high rates of loss due to spoilage compared to other categories, so there might be some opportunities to get more revenue by offering greater discounts on selected items (thereby selling produce that might otherwise have gone to waste).
  • Condiments and Grains/Cereals are almost carbon copies of each other in terms of both discount level and revenues. The relatively low cost and shelf-stable nature of these products likely leads to relatively modest discounts (lower spoilage rates mean less urgency to induce purchases before items spoil) and relatively modest revenues, given their general affordability.

Setting up product category dataframes

df_bev = df.loc[df["CategoryName"] == "Beverages"]
df_bev.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.2 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
| 1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
| 2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |

df_cond = df.loc[df["CategoryName"] == "Condiments"]
df_cond.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 10289 | BSBEV | 2012-08-26 | 3 | 8.0 | 30 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 240.0 | 0.018960 | 0.000190 |
| 83 | 10405 | LINOD | 2013-01-06 | 3 | 8.0 | 50 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 400.0 | 0.031601 | 0.000316 |
| 84 | 10485 | LINOD | 2013-03-25 | 3 | 8.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 144.0 | 0.011376 | 0.000114 |

df_confect = df.loc[df["CategoryName"] == "Confections"]
df_confect.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 336 | 10255 | RICSU | 2012-07-12 | 16 | 13.9 | 35 | 0.00 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 486.5 | 0.038434 | 0.000384 |
| 337 | 10263 | ERNSH | 2012-07-23 | 16 | 13.9 | 60 | 0.25 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 625.5 | 0.049416 | 0.000494 |
| 338 | 10287 | RICAR | 2012-08-22 | 16 | 13.9 | 40 | 0.15 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 472.6 | 0.037336 | 0.000373 |

df_dairy = df.loc[df["CategoryName"] == "Dairy Products"]
df_dairy.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 216 | 10248 | VINET | 2012-07-04 | 11 | 14.0 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 168.0 | 0.013272 | 0.000133 |
| 217 | 10296 | LILAS | 2012-09-03 | 11 | 16.8 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 201.6 | 0.015927 | 0.000159 |
| 218 | 10327 | FOLKO | 2012-10-11 | 11 | 16.8 | 50 | 0.2 | Queso Cabrales | 4 | Dairy Products | Cheeses | 672.0 | 0.053089 | 0.000531 |

df_grains = df.loc[df["CategoryName"] == "Grains/Cereals"]
df_grains.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 535 | 10251 | VICTE | 2012-07-08 | 22 | 16.8 | 6 | 0.05 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 95.76 | 0.007565 | 0.000076 |
| 536 | 10435 | CONSH | 2013-02-04 | 22 | 16.8 | 12 | 0.00 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 201.60 | 0.015927 | 0.000159 |
| 537 | 10553 | WARTH | 2013-05-30 | 22 | 21.0 | 24 | 0.00 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 504.00 | 0.039817 | 0.000398 |

df_meat = df.loc[df["CategoryName"] == "Meat/Poultry"]
df_meat.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 178 | 10420 | WELLI | 2013-01-21 | 9 | 77.6 | 20 | 0.10 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1396.8 | 0.110350 | 0.001103 |
| 179 | 10515 | QUICK | 2013-04-23 | 9 | 97.0 | 16 | 0.15 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1319.2 | 0.104219 | 0.001042 |
| 180 | 10687 | HUNGO | 2013-09-30 | 9 | 97.0 | 50 | 0.25 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 3637.5 | 0.287369 | 0.002874 |

df_produce = df.loc[df["CategoryName"] == "Produce"]
df_produce.head(3)

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136 | 10262 | RATTC | 2012-07-22 | 7 | 24.0 | 15 | 0.00 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 360.0 | 0.028441 | 0.000284 |
| 137 | 10385 | SPLIR | 2012-12-17 | 7 | 24.0 | 10 | 0.20 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 192.0 | 0.015168 | 0.000152 |
| 138 | 10459 | VICTE | 2013-02-27 | 7 | 24.0 | 16 | 0.05 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 364.8 | 0.028820 | 0.000288 |

df_seafood = df.loc[df["CategoryName"] == "Seafood"]
df_seafood.head(3)
# df_seafood.info()

| | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 183 | 10273 | QUICK | 2012-08-05 | 10 | 24.8 | 24 | 0.05 | Ikura | 8 | Seafood | Seaweed and fish | 565.44 | 0.044671 | 0.000447 |
| 184 | 10276 | TORTU | 2012-08-08 | 10 | 24.8 | 15 | 0.00 | Ikura | 8 | Seafood | Seaweed and fish | 372.00 | 0.029389 | 0.000294 |
| 185 | 10357 | LILAS | 2012-11-19 | 10 | 24.8 | 30 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 595.20 | 0.047022 | 0.000470 |

Histograms of product categories

print()
print("Category:  Beverages")
hist_plot1(df_bev.Revenue, label="Avg order revenue, discount and no discount, Beverages", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_bev.Revenue.mean(), 2)}")
hist_plot1(df_bev.Revenue, label="Avg order revenue, discount and no discount, Beverages", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_bev.Revenue.mean(), 2)}")
print()
print()

# note the large outliers
Category:  Beverages

png

The mean of the distribution is 663.04

png

The mean of the distribution is 663.04
print()
print()
print("Category:  Condiments")
# hist_plot1(df_cond.Revenue, color='tab:blue')
# print(f"The mean of the distribution is {round(df_cond.Revenue.mean(), 2)}")
hist_plot1(df_cond.Revenue, label="Avg order revenue, discount and no discount, Condiments", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_cond.Revenue.mean(), 2)}")
print()
print()
Category:  Condiments

png

The mean of the distribution is 490.96
print()
print()
print("Category:  Confections")
hist_plot1(df_confect.Revenue, label="Avg order revenue, discount and no discount, Confections", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_confect.Revenue.mean(), 2)}")
# hist_plot1(df_confect.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_confect.Revenue.mean(), 2)}")
print()
print()
Category:  Confections

png

The mean of the distribution is 501.07
print()
print()
print("Category:  Dairy Products")
hist_plot1(df_dairy.Revenue, label="Avg order revenue, all, Dairy Products", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_dairy.Revenue.mean(), 2)}")
# hist_plot1(df_dairy.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_dairy.Revenue.mean(), 2)}")
print()
print()
Category:  Dairy Products

png

The mean of the distribution is 640.73
print()
print()
print("Category:  Meat/Poultry")
hist_plot1(df_meat.Revenue, label="Avg order revenue, discount/no discount, Meat/Poultry",color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_meat.Revenue.mean(), 2)}")
# hist_plot1(df_meat.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_meat.Revenue.mean(), 2)}")
print()
print()
Category:  Meat/Poultry

png

The mean of the distribution is 942.33
print()
print()
print("Category:  Grains/Cereals")
hist_plot1(df_grains.Revenue, label="Avg order revenue, discount/no discount, Grains/Cereals", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_grains.Revenue.mean(), 2)}")
# hist_plot1(df_grains.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_grains.Revenue.mean(), 2)}")
print()
print()
Category:  Grains/Cereals

png

The mean of the distribution is 488.49
print()
print()
print("Category:  Produce")
hist_plot1(df_produce.Revenue, label="Avg order revenue, discount/no discount, Produce", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_produce.Revenue.mean(), 2)}")
# hist_plot1(df_produce.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_produce.Revenue.mean(), 2)}")
print()
print()
Category:  Produce

png

The mean of the distribution is 735.18
print()
print()
print("Category:  Seafood")
# hist_plot1(df_seafood.Revenue, color='tab:blue')
# print(f"The mean of the distribution is {round(df_seafood.Revenue.mean(), 2)}")
hist_plot1(df_seafood.Revenue, label="Avg order revenue, discount/no discount, Seafood", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_seafood.Revenue.mean(), 2)}")
print()
print()
Category:  Seafood

png

The mean of the distribution is 397.76

Testing for significance: sampling, test setup, and results

Pulling samples for each product category and taking means

As with question 2, to ensure reasonably normal distributions for hypothesis testing, I will perform resampling for each product category to get a distribution of means.

bev_means = sampling_mean(df_bev.Revenue)
mean_bev_means = np.mean(bev_means[0])
std_bev_means = np.std(bev_means[1])
print(f"The mean of the means of beverage order revenues resampling is {round(mean_bev_means,2)}")
print(f"The std deviation of the means of beverage order revenues resampling is {round(std_bev_means,2)}")
hist_plot1(bev_means[0], label="Resampling means, order revenues, Beverages", xlim=([0, 1400]), print_stmt=False)
The mean of the means of beverage order revenues resampling is 663.44
The std deviation of the means of beverage order revenues resampling is 486.78

png

cond_means = sampling_mean(df_cond.Revenue)
mean_cond_means = np.mean(cond_means[0])
std_cond_means = np.std(cond_means[1])
print(f"The mean of the means of Condiments order revenues resampling is {round(mean_cond_means,2)}")
print(f"The std deviation of the means of Condiments order revenues resampling is {round(std_cond_means,2)}")
hist_plot1(cond_means[0], label="Resampling means, order revenues, Condiments",  xlim=([0, 1000]), print_stmt=False)
The mean of the means of Condiments order revenues resampling is 491.51
The std deviation of the means of Condiments order revenues resampling is 40.06

png

confect_means = sampling_mean(df_confect.Revenue)
mean_confect_means = np.mean(confect_means[0])
std_confect_means = np.std(confect_means[1])
print(f"The mean of the means of Confections order revenues resampling is {round(mean_confect_means,2)}")
print(f"The std deviation of the means of Confections order revenues resampling is {round(std_confect_means,2)}")
hist_plot1(confect_means[0], label="Resampling means, order revenues, Confections", xlim=([0, 1000]), print_stmt=False)
The mean of the means of Confections order revenues resampling is 499.85
The std deviation of the means of Confections order revenues resampling is 100.95

png

meat_means = sampling_mean(df_meat.Revenue)
mean_meat_means = np.mean(meat_means[0])
std_meat_means = np.std(meat_means[1])
round(mean_meat_means,2)
print(f"The mean of the means of Meat/Poultry order revenues resampling is {round(mean_meat_means,2)}")
print(f"The std deviation of the means of Meat/Poultry order revenues resampling is {round(std_meat_means,2)}")
hist_plot1(meat_means[0], label="Resampling means, order revenues, Meat/Poultry", xlim=([0, 2000]), print_stmt=False)
The mean of the means of Meat/Poultry order revenues resampling is 938.78
The std deviation of the means of Meat/Poultry order revenues resampling is 184.78

png

dairy_means = sampling_mean(df_dairy.Revenue)
mean_dairy_means = np.mean(dairy_means[0])
std_dairy_means = np.std(dairy_means[1])
round(mean_dairy_means,2)
print(f"The mean of the means of Dairy Products order revenues resampling is {round(mean_dairy_means,2)}")
print(f"The std deviation of the means of Dairy Products order revenues resampling is {round(std_dairy_means,2)}")
hist_plot1(dairy_means[0], label="Resampling means, order revenues, Dairy Products", xlim=([0, 1200]), print_stmt=False)
The mean of the means of Dairy Products order revenues resampling is 645.36
The std deviation of the means of Dairy Products order revenues resampling is 119.31

png

grains_means = sampling_mean(df_grains.Revenue)
mean_grains_means = np.mean(grains_means[0])
std_grains_means = np.std(grains_means[1])
round(mean_grains_means,2)
print(f"The mean of the means of Grains/Cereals Products order revenues resampling is {round(mean_grains_means,2)}")
print(f"The std deviation of the means of Grains/Cereals Products order revenues resampling is {round(std_grains_means,2)}")
hist_plot1(grains_means[0], label="Resampling means, order revenues, Grains/Cereals", xlim=([0, 1000]), print_stmt=False)
The mean of the means of Grains/Cereals Products order revenues resampling is 487.07
The std deviation of the means of Grains/Cereals Products order revenues resampling is 80.64

png

produce_means = sampling_mean(df_produce.Revenue)
mean_produce_means = np.mean(produce_means[0])
std_produce_means = np.std(produce_means[1])
round(mean_produce_means,2)
print(f"The mean of the means of Produce order revenues resampling is {round(mean_produce_means,2)}")
print(f"The std deviation of the means of Produce order revenues resampling is {round(std_produce_means,2)}")
hist_plot1(produce_means[0], label="Resampling means, order revenues, Produce", xlim=([0, 1400]), print_stmt=False)
The mean of the means of Produce order revenues resampling is 734.53
The std deviation of the means of Produce order revenues resampling is 92.73

png

seafood_means = sampling_mean(df_seafood.Revenue)
mean_seafood_means = np.mean(seafood_means[0])
std_seafood_means = np.std(seafood_means[1])
round(mean_seafood_means,2)
print(f"The mean of the means of Seafood order revenues resampling is {round(mean_seafood_means,2)}")
print(f"The std deviation of the means of Seafood order revenues resampling is {round(std_seafood_means,2)}")
hist_plot1(seafood_means[0], label="Resampling means, order revenues, Seafood", xlim=([0, 800]), print_stmt=False)
The mean of the means of Seafood order revenues resampling is 398.49
The std deviation of the means of Seafood order revenues resampling is 61.55

png
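
The eight blocks above repeat the same resample-summarize-plot pattern. For what it's worth, the same results can be generated in a single loop (a sketch, reusing df, sampling_mean, and hist_plot1 from earlier):

```python
categories = ['Beverages', 'Condiments', 'Confections', 'Dairy Products',
              'Grains/Cereals', 'Meat/Poultry', 'Produce', 'Seafood']
category_means = {}
for cat in categories:
    revs = df.loc[df['CategoryName'] == cat, 'Revenue']
    means, samples = sampling_mean(revs)
    category_means[cat] = means
    print(f"{cat}: mean of resampled means = {round(np.mean(means), 2)}, "
          f"std of pooled samples = {round(np.std(samples), 2)}")
    hist_plot1(means, label=f"Resampling means, order revenues, {cat}", print_stmt=False)
```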

Average order revenue by product category

means = [mean_bev_means, mean_cond_means, mean_confect_means, mean_meat_means, 
         mean_dairy_means, mean_grains_means, mean_produce_means, mean_seafood_means]

names = ['Beverages', 'Condiments', 'Confections', 
         'Meat/Poultry', 'Dairy Products', 'Grains/Cereals',
        'Produce','Seafood']

df_means = pd.DataFrame(means, names)

df_means.plot(kind='bar', color='tab:blue', alpha=.9, legend=None, title ="Average Order Revenue by Product Category", figsize=(8, 5), fontsize=12)
plt.show()

png

ANOVA test on means distributions

scs.f_oneway(bev_means[0], cond_means[0], confect_means[0], meat_means[0], dairy_means[0], grains_means[0], produce_means[0], seafood_means[0])
F_onewayResult(statistic=1776.047633256583, pvalue=0.0)

The ANOVA test on the sample means shows an infinitesimally small p-value, meaning that we should reject the null hypothesis that there is no statistically significant difference among the revenue distributions of the different product categories.

To find out which distributions are different enough from one another to be statistically significant, I will run Tukey's test.

Tukey's test on means distributions

bev_means_np = np.array(bev_means[0])
cond_means_np = np.array(cond_means[0]) 
confect_means_np = np.array(confect_means[0])
dairy_means_np = np.array(dairy_means[0])
meat_means_np = np.array(meat_means[0]) 
grains_means_np = np.array(grains_means[0])
produce_means_np = np.array(produce_means[0]) 
seafood_means_np = np.array(seafood_means[0])

v = np.concatenate([bev_means_np, cond_means_np, confect_means_np,
                    dairy_means_np, meat_means_np, grains_means_np, 
                    produce_means_np, seafood_means_np])
labels = (['Beverages'] * len(bev_means_np) + ['Condiments'] * len(cond_means_np)
          + ['Confections'] * len(confect_means_np) + ['Dairy Products'] * len(dairy_means_np)
          + ['Meat/Poultry'] * len(meat_means_np) + ['Cereals/Grains'] * len(grains_means_np)
          + ['Produce'] * len(produce_means_np) + ['Seafood'] * len(seafood_means_np))
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

type(tukey_results)
        Multiple Comparison of Means - Tukey HSD,FWER=0.05        
==================================================================
    group1         group2      meandiff   lower     upper   reject
------------------------------------------------------------------
  Beverages    Cereals/Grains -176.3666 -194.1524 -158.5807  True 
  Beverages      Condiments   -171.9365 -189.7223 -154.1507  True 
  Beverages     Confections    -163.593 -181.3788 -145.8072  True 
  Beverages    Dairy Products  -18.0862  -35.872   -0.3004   True 
  Beverages     Meat/Poultry   275.3426  257.5568  293.1284  True 
  Beverages       Produce      71.0916   53.3058   88.8774   True 
  Beverages       Seafood     -264.9544 -282.7402 -247.1686  True 
Cereals/Grains   Condiments     4.4301   -13.3557  22.2159  False 
Cereals/Grains  Confections    12.7736   -5.0122   30.5594  False 
Cereals/Grains Dairy Products  158.2803  140.4945  176.0662  True 
Cereals/Grains  Meat/Poultry   451.7091  433.9233  469.4949  True 
Cereals/Grains    Produce      247.4582  229.6724  265.244   True 
Cereals/Grains    Seafood      -88.5879 -106.3737  -70.8021  True 
  Condiments    Confections     8.3435   -9.4423   26.1293  False 
  Condiments   Dairy Products  153.8503  136.0644  171.6361  True 
  Condiments    Meat/Poultry   447.279   429.4932  465.0649  True 
  Condiments      Produce      243.0281  225.2423  260.8139  True 
  Condiments      Seafood      -93.018  -110.8038  -75.2321  True 
 Confections   Dairy Products  145.5068  127.7209  163.2926  True 
 Confections    Meat/Poultry   438.9355  421.1497  456.7214  True 
 Confections      Produce      234.6846  216.8988  252.4704  True 
 Confections      Seafood     -101.3615 -119.1473  -83.5756  True 
Dairy Products  Meat/Poultry   293.4288  275.643   311.2146  True 
Dairy Products    Produce      89.1778    71.392   106.9637  True 
Dairy Products    Seafood     -246.8682  -264.654 -229.0824  True 
 Meat/Poultry     Produce     -204.2509 -222.0368 -186.4651  True 
 Meat/Poultry     Seafood      -540.297 -558.0828 -522.5112  True 
   Produce        Seafood      -336.046 -353.8319 -318.2602  True 
------------------------------------------------------------------





type(tukey_results)
statsmodels.sandbox.stats.multicomp.TukeyHSDResults

Effect size: Cohen's D

We can get a sense of the magnitude of the statistically significant differences among the product categories by computing and plotting Cohen's d for each pair.
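Rather than calling Cohen_d on each pair by hand (as in the cells that follow), the same comparisons could be generated in a loop. A minimal sketch, reusing the notebook's Cohen_d helper and the NumPy arrays defined above, and assuming Cohen_d returns the numeric effect size (as its use with plot_pdfs below suggests):

from itertools import combinations

category_means = {'Beverages': bev_means_np, 'Condiments': cond_means_np,
                  'Confections': confect_means_np, 'Dairy Products': dairy_means_np,
                  'Meat/Poultry': meat_means_np, 'Cereals/Grains': grains_means_np,
                  'Produce': produce_means_np, 'Seafood': seafood_means_np}

# Cohen's d for every unordered pair of categories
for (name1, dist1), (name2, dist2) in combinations(category_means.items(), 2):
    print(f"{name1} vs. {name2}: d = {Cohen_d(dist1, dist2):.2f}")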

d_bev_cond = Cohen_d(bev_means_np, cond_means_np)
plot_pdfs(d_bev_cond)
This Cohen's d result (1.62585469) represents a difference of 1.63 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_bev_dairy = Cohen_d(bev_means_np, dairy_means_np)
plot_pdfs(d_bev_dairy)
This Cohen's d result (0.16055547) represents a difference of 0.16 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

d_bev_meat = Cohen_d(bev_means_np, meat_means_np)
plot_pdfs(d_bev_meat)
This Cohen's d result (-2.27595947) represents a difference of -2.28 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_bev_grains = Cohen_d(bev_means_np, grains_means_np)
plot_pdfs(d_bev_grains)
This Cohen's d result (1.65508548) represents a difference of 1.66 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_bev_produce = Cohen_d(bev_means_np, produce_means_np)
plot_pdfs(d_bev_produce)
This Cohen's d result (-0.66677525) represents a difference of -0.67 pooled standard deviations between the two groups, and suggests a **medium** effect size.

png

d_bev_seafood = Cohen_d(bev_means_np, seafood_means_np)
plot_pdfs(d_bev_seafood)
This Cohen's d result (2.49154765) represents a difference of 2.49 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_cond_confect = Cohen_d(cond_means_np, confect_means_np)
plot_pdfs(d_cond_confect)
This Cohen's d result (-0.19889735) represents a difference of -0.2 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

d_cond_dairy = Cohen_d(cond_means_np, dairy_means_np)
plot_pdfs(d_cond_dairy)
This Cohen's d result (-3.02144627) represents a difference of -3.02 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_cond_meat = Cohen_d(cond_means_np, meat_means_np)
plot_pdfs(d_cond_meat)
This Cohen's d result (-6.63885587) represents a difference of -6.64 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_cond_grains = Cohen_d(cond_means_np, grains_means_np)
plot_pdfs(d_cond_grains)
This Cohen's d result (0.12488345) represents a difference of 0.12 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

d_cond_produce = Cohen_d(cond_means_np, produce_means_np)
plot_pdfs(d_cond_produce)
This Cohen's d result (-6.81656087) represents a difference of -6.82 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_cond_seafood = Cohen_d(cond_means_np, seafood_means_np)
plot_pdfs(d_cond_seafood)
This Cohen's d result (2.67215594) represents a difference of 2.67 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_confect_dairy = Cohen_d(confect_means_np, dairy_means_np)
plot_pdfs(d_confect_dairy)
This Cohen's d result (-2.54615497) represents a difference of -2.55 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_confect_meat = Cohen_d(confect_means_np, meat_means_np)
plot_pdfs(d_confect_meat)
This Cohen's d result (-6.07982689) represents a difference of -6.08 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_confect_grains = Cohen_d(confect_means_np, grains_means_np)
plot_pdfs(d_confect_grains)
This Cohen's d result (0.29065039) represents a difference of 0.29 pooled standard deviations between the two groups, and suggests a **small** effect size.

png

d_confect_produce = Cohen_d(confect_means_np, produce_means_np)
plot_pdfs(d_confect_produce)
This Cohen's d result (-5.32251585) represents a difference of -5.32 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_confect_seafood = Cohen_d(confect_means_np, seafood_means_np)
plot_pdfs(d_confect_seafood)
This Cohen's d result (2.33474402) represents a difference of 2.33 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_dairy_meat = Cohen_d(dairy_means_np, meat_means_np)
plot_pdfs(d_dairy_meat)
This Cohen's d result (-3.77394204) represents a difference of -3.77 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_dairy_grains = Cohen_d(dairy_means_np, grains_means_np)
plot_pdfs(d_dairy_grains)
This Cohen's d result (3.01034195) represents a difference of 3.01 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_dairy_produce = Cohen_d(dairy_means_np, produce_means_np)
plot_pdfs(d_dairy_produce)
This Cohen's d result (-1.69218869) represents a difference of -1.69 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_dairy_seafood = Cohen_d(dairy_means_np, seafood_means_np)
plot_pdfs(d_dairy_seafood)
This Cohen's d result (4.73531703) represents a difference of 4.74 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_meat_grains = Cohen_d(meat_means_np, grains_means_np)
plot_pdfs(d_meat_grains)
This Cohen's d result (6.58125633) represents a difference of 6.58 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_meat_produce = Cohen_d(meat_means_np, produce_means_np)
plot_pdfs(d_meat_produce)
This Cohen's d result (2.97186024) represents a difference of 2.97 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_meat_seafood = Cohen_d(meat_means_np, seafood_means_np)
plot_pdfs(d_meat_seafood)
This Cohen's d result (7.91121656) represents a difference of 7.91 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_grains_produce = Cohen_d(grains_means_np, produce_means_np)
plot_pdfs(d_grains_produce)
This Cohen's d result (-6.51463463) represents a difference of -6.51 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_grains_seafood = Cohen_d(grains_means_np, seafood_means_np)
plot_pdfs(d_grains_seafood)
This Cohen's d result (2.38169432) represents a difference of 2.38 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

d_produce_seafood = Cohen_d(produce_means_np, seafood_means_np)
plot_pdfs(d_produce_seafood)
This Cohen's d result (8.99337848) represents a difference of 8.99 pooled standard deviations between the two groups, and suggests a **large** effect size.

png

Findings and interpretation

The results of Tukey's test suggest that the differences among the distributions of sample means are statistically significant for all but a few category pairings: Cereals/Grains and Condiments; Cereals/Grains and Confections; Condiments and Confections; and occasionally one or two others (Beverages and Dairy Products has failed to reach significance in some resampling runs). This result is interesting, but not really surprising, in light of the bar graph of revenues by category: the three categories involved in the first three pairings above are the most shelf-stable, and relatively inexpensive, of the 8 product categories. They are also discounted less frequently than most other categories (with the possible exception of Produce), probably because lower spoilage rates mean fewer discounts are needed to move product before it spoils.

In addition, we see some very large differences in average order revenue among certain product categories. The bar chart of average order revenue by category shows this in an intuitive way; Tukey's test and Cohen's d tell us which differences are statistically significant (all but four) and to what degree.

This last point on the value of Cohen's d is important: while the original bar chart shows differences across product categories, the differences measured by Cohen's d are sometimes more (or less) dramatic than the bar chart implies.

As a reminder, here is the bar chart of average order revenue by product category:

df_means.plot(kind='bar', legend=None, title ="Average Order Revenue by Product Category", figsize=(8, 5), fontsize=12)
plt.show()

png

While there is a notable difference on the bar chart between, say, Produce and Condiments, the magnitude of Cohen's d for this pair is about 6.8 (the value computed above is -6.82): nearly 7 pooled standard deviations separate the means of these two categories. The difference is even more pronounced than the bar chart alone would suggest. This is instructive because it can help us identify under- or over-performing categories on which to focus specific incentives.

Future potential work: Additional questions

I have a number of general questions that I would like to explore. Below are a few that I had started to investigate, but did not have time to complete.

Sixth possible question

We have seen that there are substantial differences in the revenues and discounts offered across the 8 categories. But what about within categories? Do the discounts within categories vary by product?

Null and alternative hypotheses for sixth question

  • Null hypothesis: No significant difference among the means of discount values by product within a category

    • Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
  • Alternative hypothesis: Differences among the means of discount values by product within a category are unlikely to be due to chance

    • Ha: $\mu_i \neq \mu_j$ for at least one pair of products $(i, j)$
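A minimal sketch of how this could be tested for a single category, assuming the combined df frame used throughout the notebook and using a one-way ANOVA on discount values grouped by product:

df_bev = df.loc[df['CategoryName'] == 'Beverages']

# Collect each product's discount values and test across products
discounts_by_product = [grp['Discount'].values
                        for _, grp in df_bev.groupby('ProductName')]
scs.f_oneway(*discounts_by_product)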

EDA

High-level overview: Beverages

Looking at the distribution of revenues in this category by discount...

df_bev = df.loc[df["CategoryName"] == "Beverages"]
df_bev.groupby(['Discount']).sum().sort_values(['Discount'])
Id ProductId OrderUnitPrice OrderQty CategoryId Revenue RevPercentTotal RevFractionTotal
Discount
0.00 2623997 10381 7161.95 5116 246 153060.250 12.092044 0.120920
0.05 277313 1038 1507.75 898 26 43567.380 3.441904 0.034419
0.10 264816 1219 812.20 728 25 13027.635 1.029207 0.010292
0.15 384608 1348 609.65 885 36 12889.995 1.018334 0.010183
0.20 418197 1353 1005.40 1140 39 29569.320 2.336031 0.023360
0.25 343213 1017 714.70 765 32 15753.600 1.244564 0.012446
df_bev_dsc = df_bev.drop(['Id', 'ProductId', 'OrderUnitPrice', 'OrderQty', 'RevPercentTotal', 'RevFractionTotal', 'CustomerId', 'OrderDate', 'CategoryName', 'CatDescription'], axis=1)
df_bev_dsc.head()
Discount ProductName CategoryId Revenue
0 0.20 Chai 1 518.4
1 0.00 Chai 1 259.2
2 0.00 Chai 1 288.0
3 0.15 Chai 1 183.6
4 0.00 Chai 1 172.8

We can see that the largest revenue group by far, at over $150,000, is the 0% discount group, followed by roughly $43,600 for the 5% discount group (almost all of the discounts in the 5-9.9% range are 5% discounts).

Detailed breakdown: Beverages

Let's take a look at the details of discounts in Beverages, sorting by discount level and product name:

df_bev = df.loc[df['CategoryName']=="Beverages"]

df_bev.groupby(['Discount', 'ProductName']).sum().sort_values(['Discount']).head(20)
Id ProductId OrderUnitPrice OrderQty CategoryId Revenue RevPercentTotal RevFractionTotal
Discount ProductName
0.00 Chai 235242 22 374.40 391 22 6681.600 0.527859 0.005279
Steeleye Stout 266316 875 424.80 531 25 8812.800 0.696228 0.006962
Sasquatch Ale 128134 408 156.80 269 12 3542.000 0.279825 0.002798
Rhönbräu Klosterbier 308816 2175 212.35 631 29 4642.250 0.366746 0.003667
Laughing Lumberjack Lager 75320 469 98.00 115 7 1610.000 0.127193 0.001272
Lakkalikööri 277308 1976 446.40 591 26 10234.800 0.808568 0.008086
Outback Lager 245338 1610 324.00 438 23 6129.000 0.484202 0.004842
Guaraná Fantástica 353001 792 139.50 704 33 3011.400 0.237906 0.002379
Côte de Blaye 150244 532 3530.90 320 14 79577.000 6.286731 0.062867
Chartreuse verte 169736 624 266.40 343 16 5677.200 0.448509 0.004485
Chang 202413 38 342.00 408 19 7125.000 0.562888 0.005629
Ipoh Coffee 212129 860 846.40 375 20 16017.200 1.265389 0.012654
0.05 Outback Lager 31973 210 45.00 63 3 897.750 0.070924 0.000709
Steeleye Stout 11046 35 18.00 18 1 307.800 0.024317 0.000243
Sasquatch Ale 31849 102 36.40 135 3 1675.800 0.132391 0.001324
Rhönbräu Klosterbier 32633 225 23.25 210 3 1546.125 0.122147 0.001221
Ipoh Coffee 10340 43 36.80 40 1 1398.400 0.110476 0.001105
Lakkalikööri 10273 76 14.40 33 1 451.440 0.035665 0.000357
Côte de Blaye 52977 190 1212.10 145 5 34294.525 2.709331 0.027093
Chartreuse verte 21942 78 36.00 82 2 1402.200 0.110776 0.001108

Now let's sort by product name and revenue, to see which products bring in the most revenue.

Note: because CategoryId for all beverages is 1, the number in the CategoryId column represents the total number of orders of that product. For example, there were 24 orders for Cote de Blaye, 28 orders for Ipoh Coffee, etc.
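As an aside, the order count could be made explicit rather than inferred from the summed CategoryId column; a sketch using pandas named aggregation (the output column names NumOrders and TotalRevenue are illustrative):

# Count orders and sum revenue per product explicitly
df_bev.groupby('ProductName').agg(
    NumOrders=('Id', 'count'),
    TotalRevenue=('Revenue', 'sum')).sort_values('TotalRevenue', ascending=False)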

We see Cote de Blaye is at the top of the list for revenues by a large margin, followed by Ipoh Coffee, Chang, Lakkalikööri, and Steeleye Stout.

df_bev.groupby(['ProductName']).sum().sort_values(['Revenue'], ascending=False)
Id ProductId OrderUnitPrice OrderQty Discount CategoryId Revenue RevPercentTotal RevFractionTotal
ProductName
Côte de Blaye 255583 912 5902.40 623 1.10 24 141396.735 11.170605 0.111706
Ipoh Coffee 298531 1204 1205.20 580 1.40 28 23526.700 1.858653 0.018587
Chang 470965 88 786.60 1057 4.50 44 16355.960 1.292151 0.012922
Lakkalikööri 415841 2964 662.40 981 2.05 39 15760.440 1.245104 0.012451
Steeleye Stout 383553 1260 612.00 883 1.70 36 13644.000 1.077901 0.010779
Chai 406841 38 651.60 828 2.95 38 12788.100 1.010284 0.010103
Chartreuse verte 318790 1170 500.40 793 2.00 30 12294.540 0.971291 0.009713
Outback Lager 416162 2730 552.00 817 2.45 39 10672.650 0.843159 0.008432
Rhönbräu Klosterbier 490967 3450 339.45 1155 2.40 46 8177.490 0.646037 0.006460
Sasquatch Ale 202659 646 246.40 506 0.95 19 6350.400 0.501693 0.005017
Guaraná Fantástica 544786 1224 216.00 1125 2.90 51 4504.365 0.355853 0.003559
Laughing Lumberjack Lager 107466 670 137.20 184 0.60 10 2396.800 0.189352 0.001894
df_bev.groupby(['ProductName', 'Discount']).sum().sort_values(['ProductName'], ascending=True)
Id ProductId OrderUnitPrice OrderQty CategoryId Revenue RevPercentTotal RevFractionTotal
ProductName Discount
Chai 0.00 235242 22 374.40 391 22 6681.600 0.527859 0.005279
0.05 10905 1 18.00 20 1 342.000 0.027019 0.000270
0.10 11025 1 18.00 10 1 162.000 0.012798 0.000128
0.15 53177 5 82.80 98 5 1407.600 0.111203 0.001112
0.20 42354 4 68.40 170 4 2318.400 0.183158 0.001832
0.25 54138 5 90.00 139 5 1876.500 0.148247 0.001482
Chang 0.00 202413 38 342.00 408 19 7125.000 0.562888 0.005629
0.05 32270 6 57.00 50 3 902.500 0.071299 0.000713
0.10 10485 2 15.20 20 1 273.600 0.021615 0.000216
0.15 53993 10 87.40 125 5 1744.200 0.137795 0.001378
0.20 106974 20 174.80 247 10 3432.160 0.271147 0.002711
0.25 64830 12 110.20 207 6 2878.500 0.227407 0.002274
Chartreuse verte 0.25 21120 78 32.40 41 2 499.500 0.039461 0.000395
0.20 21501 78 36.00 13 2 187.200 0.014789 0.000148
0.15 42407 156 64.80 80 4 1009.800 0.079776 0.000798
0.05 21942 78 36.00 82 2 1402.200 0.110776 0.001108
0.00 169736 624 266.40 343 16 5677.200 0.448509 0.004485
0.10 42084 156 64.80 234 4 3518.640 0.277979 0.002780
Côte de Blaye 0.00 150244 532 3530.90 320 14 79577.000 6.286731 0.062867
0.05 52977 190 1212.10 145 5 34294.525 2.709331 0.027093
0.10 21213 76 527.00 19 2 4505.850 0.355971 0.003560
0.20 20777 76 421.60 99 2 16695.360 1.318964 0.013190
0.25 10372 38 210.80 40 1 6324.000 0.499608 0.004996
Guaraná Fantástica 0.00 353001 792 139.50 704 33 3011.400 0.237906 0.002379
0.05 31105 72 10.80 102 3 348.840 0.027559 0.000276
0.10 31552 72 12.60 80 3 307.800 0.024317 0.000243
0.15 42213 96 17.10 60 4 218.025 0.017224 0.000172
0.20 32203 72 13.50 63 3 226.800 0.017918 0.000179
0.25 54712 120 22.50 116 5 391.500 0.030929 0.000309
Ipoh Coffee 0.20 21975 86 92.00 40 2 1472.000 0.116291 0.001163
... ... ... ... ... ... ... ... ... ...
Lakkalikööri 0.15 63760 456 97.20 162 6 2249.100 0.177683 0.001777
0.20 21584 152 36.00 94 2 1353.600 0.106937 0.001069
0.25 21262 152 32.40 41 2 499.500 0.039461 0.000395
Laughing Lumberjack Lager 0.00 75320 469 98.00 115 7 1610.000 0.127193 0.001272
0.15 10939 67 14.00 40 1 476.000 0.037605 0.000376
0.20 10923 67 14.00 24 1 268.800 0.021236 0.000212
0.25 10284 67 11.20 5 1 42.000 0.003318 0.000033
Outback Lager 0.20 32933 210 45.00 82 3 984.000 0.077738 0.000777
0.15 21557 140 30.00 58 2 739.500 0.058422 0.000584
0.25 42497 280 57.00 69 4 729.000 0.057592 0.000576
0.05 31973 210 45.00 63 3 897.750 0.070924 0.000709
0.00 245338 1610 324.00 438 23 6129.000 0.484202 0.004842
0.10 41864 280 51.00 107 4 1193.400 0.094281 0.000943
Rhönbräu Klosterbier 0.00 308816 2175 212.35 631 29 4642.250 0.366746 0.003667
0.05 32633 225 23.25 210 3 1546.125 0.122147 0.001221
0.10 53081 375 37.20 103 5 684.945 0.054112 0.000541
0.15 21306 150 13.95 26 2 163.370 0.012907 0.000129
0.20 64671 450 46.50 181 6 1122.200 0.088656 0.000887
0.25 10460 75 6.20 4 1 18.600 0.001469 0.000015
Sasquatch Ale 0.25 10548 34 14.00 10 1 105.000 0.008295 0.000083
0.20 21138 68 25.20 32 2 313.600 0.024775 0.000248
0.00 128134 408 156.80 269 12 3542.000 0.279825 0.002798
0.05 31849 102 36.40 135 3 1675.800 0.132391 0.001324
0.15 10990 34 14.00 60 1 714.000 0.056407 0.000564
Steeleye Stout 0.20 21164 70 32.40 95 2 1195.200 0.094423 0.000944
0.00 266316 875 424.80 531 25 8812.800 0.696228 0.006962
0.05 11046 35 18.00 18 1 307.800 0.024317 0.000243
0.10 31858 105 50.40 95 3 1409.400 0.111345 0.001113
0.15 31877 105 50.40 105 3 1392.300 0.109994 0.001100
0.25 21292 70 36.00 39 2 526.500 0.041594 0.000416

67 rows × 8 columns

# df_bev.groupby(['ProductName', 'Discount']).sum().sort_values(['ProductName'], ascending=True)

df_bev_dsc.groupby(['Discount']).sum().sort_values(['Discount']).plot(kind='bar', 
            title ="Beverages--Revenues by Discount", legend=False, figsize=(8, 5), fontsize=12)

plt.show()

png

Observations and comments on revenues by discount within the Beverages category

We can see that virtually every product in this category is sold at almost all discount levels (including 0%). With more time, I would like to explore the distributions of revenues by discount level for each named product, and run tests (Tukey's test, among others) to determine whether those distributions differ significantly across products; a sketch of such a test follows.
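A sketch of what that test might look like for this category, reusing pairwise_tukeyhsd on per-order revenues labeled by product name (keeping in mind that some products have very few orders, so the results would need cautious interpretation):

# Tukey's HSD on per-order revenues, grouped by product, within Beverages
tukey_bev = pairwise_tukeyhsd(df_bev['Revenue'].values,
                              df_bev['ProductName'].values, 0.05)
print(tukey_bev)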

For example, within the Beverage category, I noticed an interesting distribution of revenues by discount level for Cote de Blaye (which is an expensive wine product and is the largest single product by revenue in the Beverages category).

df.loc[df.ProductName == "Côte de Blaye"].head(4)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
966 10329 SPLIR 2012-10-15 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.2 0.316418 0.003164
967 10351 ERNSH 2012-11-11 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.2 0.316418 0.003164
968 10353 PICCO 2012-11-13 38 210.8 50 0.20 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 8432.0 0.666144 0.006661
969 10360 BLONP 2012-11-22 38 210.8 10 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2108.0 0.166536 0.001665

Grouping by discount level and sorting on revenues:

df_bev_cote = df_bev.loc[df_bev['ProductName'] == 'Côte de Blaye']

df_bev_cote_disc_rev = df_bev_cote.drop(columns=['Id', 'ProductId', 'OrderUnitPrice', 'RevPercentTotal', 'RevFractionTotal'])

# Since the CategoryId for Beverages is '1' and each row contributes a 1 to
# the groupby sum, the summed CategoryId column counts the orders placed at
# each discount level; rename it accordingly.  OrderQty is the total number
# of units purchased across all orders at that discount level.
df_bev_cote_disc_rev = df_bev_cote_disc_rev.rename(columns={'CategoryId': 'Number of Orders'})

print("Revenue totals sorted by discount, Cote de Blaye")
df_bev_cote_disc_rev.groupby(['Discount']).sum().sort_values(['Revenue'], ascending=False)
Revenue totals sorted by discount, Cote de Blaye
OrderQty Number of Orders Revenue
Discount
0.00 320 14 79577.000
0.05 145 5 34294.525
0.20 99 2 16695.360
0.25 40 1 6324.000
0.10 19 2 4505.850
df_bev_dsc.groupby(['Discount']).sum().sort_values(['Revenue'], ascending=False).plot(kind='bar', 
            title ="Beverages--Discount, sorted by Revenue", figsize=(8, 5), fontsize=12)
plt.show()

png

Cote de Blaye: observations and comments on revenues by discount

Sorting revenues by discount level, we can see that the greatest order revenues occurred at the 0% and 5% discount levels, with the next highest at 20% and 25%, and relatively low order revenues at the 10% level (there were no sales at the 15% level).

  • Almost $80,000 of total sales of the product occurred without any discount.
  • The next highest revenues for the product (over $34,000) were made at the 5% discount level.
  • The third highest revenues of over $16,000 were made with a 20% discount applied.
  • Over $6,300 was made at the 25% discount level.
  • Finally, just over $4,500 in revenues was realized at the 10% discount level. (No product was sold at the 15% discount level.)

I was curious to know if the date of the sale had any bearing on these sales figures. I thought that perhaps seasonality might be at play. Below is the dataset for this product, sorted by discount.
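A quick way to start probing seasonality would be to bucket this product's orders by calendar month; a sketch, assuming OrderDate parses cleanly as a date:

# Bucket Cote de Blaye orders by calendar month to look for seasonality
df_cote = df.loc[df.ProductName == "Côte de Blaye"].copy()
df_cote['OrderMonth'] = pd.to_datetime(df_cote['OrderDate']).dt.month
df_cote.groupby('OrderMonth')[['OrderQty', 'Revenue']].sum()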

df.loc[df.ProductName == "Côte de Blaye"].sort_values(['Discount'], ascending=False)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
970 10372 QUEE 2012-12-04 38 210.8 40 0.25 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6324.000 0.499608 0.004996
968 10353 PICCO 2012-11-13 38 210.8 50 0.20 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 8432.000 0.666144 0.006661
972 10424 MEREP 2013-01-23 38 210.8 49 0.20 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 8263.360 0.652821 0.006528
978 10672 BERGS 2013-09-17 38 263.5 15 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3557.250 0.281029 0.002810
976 10541 HANAR 2013-05-19 38 263.5 4 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 948.600 0.074941 0.000749
985 10865 QUICK 2014-02-02 38 263.5 60 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 15019.500 1.186568 0.011866
981 10816 GREAL 2014-01-06 38 263.5 30 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7509.750 0.593284 0.005933
967 10351 ERNSH 2012-11-11 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
977 10616 GREAL 2013-07-31 38 263.5 15 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3754.875 0.296642 0.002966
966 10329 SPLIR 2012-10-15 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
975 10540 QUICK 2013-05-19 38 263.5 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7905.000 0.624510 0.006245
974 10518 TORTU 2013-04-25 38 263.5 15 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3952.500 0.312255 0.003123
973 10479 RATTC 2013-03-19 38 210.8 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6324.000 0.499608 0.004996
979 10783 HANAR 2013-12-18 38 263.5 5 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 1317.500 0.104085 0.001041
980 10805 THEBI 2013-12-30 38 263.5 10 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2635.000 0.208170 0.002082
971 10417 SIMOB 2013-01-16 38 210.8 50 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 10540.000 0.832680 0.008327
982 10817 KOENE 2014-01-06 38 263.5 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7905.000 0.624510 0.006245
983 10828 RANCH 2014-01-13 38 263.5 2 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 527.000 0.041634 0.000416
984 10831 SANTG 2014-01-14 38 263.5 8 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2108.000 0.166536 0.001665
969 10360 BLONP 2012-11-22 38 210.8 10 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2108.000 0.166536 0.001665
986 10889 RATTC 2014-02-16 38 263.5 40 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 10540.000 0.832680 0.008327
987 10964 SPECD 2014-03-20 38 263.5 5 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 1317.500 0.104085 0.001041
988 10981 HANAR 2014-03-27 38 263.5 60 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 15810.000 1.249019 0.012490
989 11032 WHITC 2014-04-17 38 263.5 25 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6587.500 0.520425 0.005204

The three largest discounts were given on sales between November 1, 2012 and January 31, 2013. A few other discounted sales were made at the end of the year, but more were made at the beginning of the year. There were also a number of non-discounted sales close to the start of 2014, but there were just as many, if not more, non-discounted orders at other times of the year.

The number of datapoints in this subset of product data is quite small, which is why a robust statistical test would be challenging to run. With more time, I would research which tests would be most likely to yield meaningful results given the small sample size for this product (one candidate is sketched below). Given the high revenues per order, it would be helpful to understand more about the sales patterns of this product, seasonality and geography in particular. Perhaps there is unmet customer demand that could be tapped by identifying revenue potential in different geographies or at different times of the year.
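One candidate for small samples is a nonparametric test such as Kruskal-Wallis, which makes no normality assumption; a sketch (not a definitive choice) on revenues across discount levels for this product:

# Kruskal-Wallis H-test: revenues across discount levels for Cote de Blaye
rev_by_discount = [grp['Revenue'].values
                   for _, grp in df_bev_cote.groupby('Discount')]
scs.kruskal(*rev_by_discount)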

df.loc[df.ProductName == "Côte de Blaye"].sort_values(['Revenue'], ascending=False)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
988 10981 HANAR 2014-03-27 38 263.5 60 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 15810.000 1.249019 0.012490
985 10865 QUICK 2014-02-02 38 263.5 60 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 15019.500 1.186568 0.011866
971 10417 SIMOB 2013-01-16 38 210.8 50 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 10540.000 0.832680 0.008327
986 10889 RATTC 2014-02-16 38 263.5 40 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 10540.000 0.832680 0.008327
968 10353 PICCO 2012-11-13 38 210.8 50 0.20 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 8432.000 0.666144 0.006661
972 10424 MEREP 2013-01-23 38 210.8 49 0.20 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 8263.360 0.652821 0.006528
975 10540 QUICK 2013-05-19 38 263.5 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7905.000 0.624510 0.006245
982 10817 KOENE 2014-01-06 38 263.5 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7905.000 0.624510 0.006245
981 10816 GREAL 2014-01-06 38 263.5 30 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7509.750 0.593284 0.005933
989 11032 WHITC 2014-04-17 38 263.5 25 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6587.500 0.520425 0.005204
970 10372 QUEE 2012-12-04 38 210.8 40 0.25 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6324.000 0.499608 0.004996
973 10479 RATTC 2013-03-19 38 210.8 30 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 6324.000 0.499608 0.004996
966 10329 SPLIR 2012-10-15 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
967 10351 ERNSH 2012-11-11 38 210.8 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
974 10518 TORTU 2013-04-25 38 263.5 15 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3952.500 0.312255 0.003123
977 10616 GREAL 2013-07-31 38 263.5 15 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3754.875 0.296642 0.002966
978 10672 BERGS 2013-09-17 38 263.5 15 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3557.250 0.281029 0.002810
980 10805 THEBI 2013-12-30 38 263.5 10 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2635.000 0.208170 0.002082
984 10831 SANTG 2014-01-14 38 263.5 8 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2108.000 0.166536 0.001665
969 10360 BLONP 2012-11-22 38 210.8 10 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 2108.000 0.166536 0.001665
979 10783 HANAR 2013-12-18 38 263.5 5 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 1317.500 0.104085 0.001041
987 10964 SPECD 2014-03-20 38 263.5 5 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 1317.500 0.104085 0.001041
976 10541 HANAR 2013-05-19 38 263.5 4 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 948.600 0.074941 0.000749
983 10828 RANCH 2014-01-13 38 263.5 2 0.00 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 527.000 0.041634 0.000416

Seventh possible question

Do revenues from discounts vary in a statistically significant way from revenues on non-discounted products in each product category?

Null and alternative hypotheses

  • Null hypothesis: No significant difference between mean revenues from non-discounted products in a category and mean revenues from discounted products in that category

    • Ho: $\mu_{no\ disc} = \mu_{disc}$
  • Alternative hypothesis: Differences between mean revenues from non-discounted and discounted products in a category are unlikely to be due to chance

    • Ha: $\mu_{no\ disc} \neq \mu_{disc}$
# dataframes:

# df_bev
# df_cond
# df_confect
# df_dairy
# df_meat
# df_grains
# df_produce
# df_seafood
# df.loc[df["CategoryName"] == "Condiments"]
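Since the same split-and-test steps repeat for all eight categories below, they could also be driven by a loop. A minimal sketch, assuming the per-category frames listed above, the notebook's sampling_mean helper (which returns the resampled means first), and Welch's t-test to allow for unequal variances:

category_frames = {'Beverages': df_bev, 'Condiments': df_cond,
                   'Confections': df_confect, 'Dairy Products': df_dairy,
                   'Meat/Poultry': df_meat, 'Grains/Cereals': df_grains,
                   'Produce': df_produce, 'Seafood': df_seafood}

for name, frame in category_frames.items():
    no_disc = frame.loc[frame['Discount'] == 0, 'Revenue']
    disc = frame.loc[frame['Discount'] > 0, 'Revenue']
    # Resample each group so the t-test runs on distributions of sample means
    no_disc_means, _ = sampling_mean(no_disc)
    disc_means, _ = sampling_mean(disc)
    stat, p = scs.ttest_ind(disc_means, no_disc_means, equal_var=False)
    print(f"{name}: t = {stat:.2f}, p = {p:.2e}")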

Create dataframes (discounts/no discounts), EDA, Sampling, T-tests

Beverages: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues from all discounts
df_bev_no_disc = df_bev.loc[df["Discount"] == 0]
df_bev_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
1 10294 RATTC 2012-08-30 1 14.4 18 0.0 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 259.2 0.020477 0.000205
2 10317 LONEP 2012-09-30 1 14.4 20 0.0 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 288.0 0.022753 0.000228
4 10354 PERIC 2012-11-14 1 14.4 12 0.0 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 172.8 0.013652 0.000137
df_bev_disc = df_bev.loc[df["Discount"] > 0]
df_bev_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
0 10285 QUICK 2012-08-20 1 14.4 45 0.20 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 518.4 0.040955 0.000410
3 10348 WANDK 2012-11-07 1 14.4 15 0.15 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 183.6 0.014505 0.000145
5 10370 CHOPS 2012-12-03 1 14.4 15 0.15 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 183.6 0.014505 0.000145
EDA / Visualization
hist_plot2(df_bev_no_disc.Revenue, df_bev_disc.Revenue, label="Beverage revenues per order, no discounts", label1="Beverage revenues per order, ALL discounts", xlim=([0, 2500]))

png

Sampling to create normal distribution of sample means
bev_no_disc_mean, bev_no_disc_std = sampling_mean(df_bev_no_disc.Revenue)
print(f"The mean of the distribution is {round(np.mean(bev_no_disc_mean), 2)}")
# Std of the distribution of sample means (not of the per-sample stds)
print(f"The std deviation of the distribution is {round(np.std(bev_no_disc_mean), 2)}")
hist_plot1(bev_no_disc_mean, label="Resampling means, order revs, NO discounts, Beverages", xlim=([0, 1200]), print_stmt=False)
The mean of the distribution is 618.11
The std deviation of the distribution is 460.16

png

bev_disc_mean, bev_disc_std = sampling_mean(df_bev_disc.Revenue)
print(f"The mean of the distribution is {round(np.mean(bev_disc_mean), 2)}")
# Std of the distribution of sample means (not of the per-sample stds)
print(f"The std deviation of the distribution is {round(np.std(bev_disc_mean), 2)}")
hist_plot1(bev_disc_mean, label="Resampling means, order revs, ALL discounts, Beverages", xlim=([0, 1200]), print_stmt=False)
The mean of the distribution is 731.42
The std deviation of the distribution is 339.08

png

df_bev_disc.Discount.value_counts()
0.20    39
0.15    36
0.25    32
0.05    26
0.10    25
Name: Discount, dtype: int64
# Divide df_bev_disc into two dataframes:  5-19% and 20+%

df_bev_disc_5_15 = df_bev_disc.loc[(df['Discount'] >= 0.05) & (df['Discount'] < 0.20)]
df_bev_disc_5_15.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
3 10348 WANDK 2012-11-07 1 14.4 15 0.15 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 183.6 0.014505 0.000145
5 10370 CHOPS 2012-12-03 1 14.4 15 0.15 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 183.6 0.014505 0.000145
10 10526 WARTH 2013-05-05 1 18.0 8 0.15 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 122.4 0.009670 0.000097
df_bev_disc_20plus = df_bev_disc.loc[(df['Discount'] >= 0.2)]
df_bev_disc_20plus.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
0 10285 QUICK 2012-08-20 1 14.4 45 0.20 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 518.4 0.040955 0.000410
9 10522 LEHMS 2013-04-30 1 18.0 40 0.20 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 576.0 0.045505 0.000455
16 10646 HUNGO 2013-08-27 1 18.0 15 0.25 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 202.5 0.015998 0.000160

Test chosen: t-test (2-sided, 2-sample)

We have obtained relatively normal distributions by taking the means of repeated samples. Before running the t-test, we need to check that the variances are similar. Bartlett's test can be used here to check for equal variances.

import scipy.stats

scipy.stats.bartlett(bev_disc_mean, bev_no_disc_mean)
BartlettResult(statistic=15.45656010891127, pvalue=8.442339866062341e-05)

The p-value is < 0.05, so we can reject the null hypothesis that the samples come from populations with equal variances.

scipy.stats.levene(bev_disc_mean, bev_no_disc_mean)
LeveneResult(statistic=12.009820337335572, pvalue=0.0005671280164836042)

Like Bartlett's test, the Levene test evaluates two distributions to determine whether they have equal variances. Levene's test is less sensitive to departures from normality than Bartlett's, so it is a useful alternative when the distributions deviate from normal.

Both tests yield p-values well below 0.05 (Levene's p-value is larger than Bartlett's, but still far under the threshold). We can reject the null hypothesis that the distributions have the same variances. This is useful information: while the distributions are reasonably normal, they don't appear to be the same shape. The distribution of resampled means for discounted sales seems to have a stronger left skew than that for non-discounted sales, and the means and standard deviations also differ substantially.
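That impression about skew can be checked numerically; a quick sketch using scipy.stats.skew (negative values indicate left skew):

# Quantify the skew of each distribution of resampled means
print(scs.skew(bev_disc_mean), scs.skew(bev_no_disc_mean))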

Although the distributions do not have equal variances, they are similar enough that a 2-sided, 2-sample t-test is still informative; given the unequal variances, Welch's variant is a sensible cross-check (see the sketch after the test result below).

scs.ttest_ind(bev_disc_mean, bev_no_disc_mean)
Ttest_indResult(statistic=11.800139489541447, pvalue=4.901027148387775e-29)

We see an extremely low p-value, which tells us that we should reject the null hypothesis that there is no statistically significant difference between the revenues from products with no discounts and revenues from discounted products.
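Given the unequal variances found above, Welch's variant of the t-test, which drops the equal-variance assumption, is a sensible cross-check; a sketch using SciPy's equal_var flag:

# Welch's t-test: does not assume equal population variances
scs.ttest_ind(bev_disc_mean, bev_no_disc_mean, equal_var=False)

With means distributions this far apart, the conclusion is unlikely to change under either variant.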

Next, let's take a look at Cohen's d:
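A sketch of that comparison, reusing the notebook's Cohen_d helper on the two distributions of resampled means (assuming, as above, that it returns the numeric effect size):

# Effect size between the discounted and non-discounted means distributions
d_bev_disc_vs_none = Cohen_d(np.array(bev_disc_mean), np.array(bev_no_disc_mean))
plot_pdfs(d_bev_disc_vs_none)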

For further drill-down, here is the subset of Beverage orders at the 5% and 10% discount levels:

df_bev_disc_05and10 = df_bev_disc.loc[(df_bev_disc["Discount"] >= 0.05) & (df_bev_disc["Discount"] < 0.15)]
df_bev_disc_05and10
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
26 10905 WELLI 2014-02-24 1 18.00 20 0.05 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 342.000 0.027019 0.000270
33 11025 WARTH 2014-04-15 1 18.00 10 0.10 Chai 1 Beverages Soft drinks, coffees, teas, beers, and ales 162.000 0.012798 0.000128
50 10485 LINOD 2013-03-25 2 15.20 20 0.10 Chang 1 Beverages Soft drinks, coffees, teas, beers, and ales 273.600 0.021615 0.000216
54 10632 WANDK 2013-08-14 2 19.00 30 0.05 Chang 1 Beverages Soft drinks, coffees, teas, beers, and ales 541.500 0.042780 0.000428
61 10787 LAMAI 2013-12-19 2 19.00 15 0.05 Chang 1 Beverages Soft drinks, coffees, teas, beers, and ales 270.750 0.021390 0.000214
66 10851 RICAR 2014-01-26 2 19.00 5 0.05 Chang 1 Beverages Soft drinks, coffees, teas, beers, and ales 90.250 0.007130 0.000071
571 10275 MAGAA 2012-08-07 24 3.60 12 0.05 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 41.040 0.003242 0.000032
577 10358 LAMAI 2012-11-20 24 3.60 10 0.05 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 34.200 0.002702 0.000027
580 10446 TOMSP 2013-02-14 24 3.60 20 0.10 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 64.800 0.005119 0.000051
583 10472 SEVES 2013-03-12 24 3.60 80 0.05 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 273.600 0.021615 0.000216
588 10541 HANAR 2013-05-19 24 4.50 35 0.10 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 141.750 0.011199 0.000112
590 10565 MEREP 2013-06-11 24 4.50 25 0.10 Guaraná Fantástica 1 Beverages Soft drinks, coffees, teas, beers, and ales 101.250 0.007999 0.000080
876 10358 LAMAI 2012-11-20 34 11.20 10 0.05 Sasquatch Ale 1 Beverages Soft drinks, coffees, teas, beers, and ales 106.400 0.008406 0.000084
880 10483 WHITC 2013-03-24 34 11.20 35 0.05 Sasquatch Ale 1 Beverages Soft drinks, coffees, teas, beers, and ales 372.400 0.029420 0.000294
890 11008 ERNSH 2014-04-08 34 14.00 90 0.05 Sasquatch Ale 1 Beverages Soft drinks, coffees, teas, beers, and ales 1197.000 0.094565 0.000946
898 10390 ERNSH 2012-12-23 35 14.40 40 0.10 Steeleye Stout 1 Beverages Soft drinks, coffees, teas, beers, and ales 518.400 0.040955 0.000410
911 10623 FRANK 2013-08-07 35 18.00 30 0.10 Steeleye Stout 1 Beverages Soft drinks, coffees, teas, beers, and ales 486.000 0.038395 0.000384
919 10845 QUICK 2014-01-21 35 18.00 25 0.10 Steeleye Stout 1 Beverages Soft drinks, coffees, teas, beers, and ales 405.000 0.031996 0.000320
928 11046 WANDK 2014-04-23 35 18.00 18 0.05 Steeleye Stout 1 Beverages Soft drinks, coffees, teas, beers, and ales 307.800 0.024317 0.000243
966 10329 SPLIR 2012-10-15 38 210.80 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
967 10351 ERNSH 2012-11-11 38 210.80 20 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 4005.200 0.316418 0.003164
976 10541 HANAR 2013-05-19 38 263.50 4 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 948.600 0.074941 0.000749
977 10616 GREAL 2013-07-31 38 263.50 15 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3754.875 0.296642 0.002966
978 10672 BERGS 2013-09-17 38 263.50 15 0.10 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 3557.250 0.281029 0.002810
981 10816 GREAL 2014-01-06 38 263.50 30 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 7509.750 0.593284 0.005933
985 10865 QUICK 2014-02-02 38 263.50 60 0.05 Côte de Blaye 1 Beverages Soft drinks, coffees, teas, beers, and ales 15019.500 1.186568 0.011866
993 10305 OLDWO 2012-09-13 39 14.40 30 0.10 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 388.800 0.030716 0.000307
996 10361 QUICK 2012-11-22 39 14.40 54 0.10 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 699.840 0.055289 0.000553
1006 10654 BERGS 2013-09-02 39 18.00 20 0.10 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 324.000 0.025597 0.000256
1009 10764 ERNSH 2013-12-03 39 18.00 130 0.10 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 2106.000 0.166378 0.001664
1014 10865 QUICK 2014-02-02 39 18.00 80 0.05 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 1368.000 0.108075 0.001081
1019 11077 RATTC 2014-05-06 39 18.00 2 0.05 Chartreuse verte 1 Beverages Soft drinks, coffees, teas, beers, and ales 34.200 0.002702 0.000027
1143 10340 BONAP 2012-10-29 43 36.80 40 0.05 Ipoh Coffee 1 Beverages Soft drinks, coffees, teas, beers, and ales 1398.400 0.110476 0.001105
1893 10420 WELLI 2013-01-21 70 12.00 8 0.10 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 86.400 0.006826 0.000068
1894 10453 AROUT 2013-02-21 70 12.00 25 0.10 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 270.000 0.021331 0.000213
1896 10485 LINOD 2013-03-25 70 12.00 60 0.10 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 648.000 0.051193 0.000512
1897 10506 KOENE 2013-04-15 70 15.00 14 0.10 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 189.000 0.014931 0.000149
1900 10616 GREAL 2013-07-31 70 15.00 15 0.05 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 213.750 0.016887 0.000169
1903 10659 QUEE 2013-09-05 70 15.00 40 0.05 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 570.000 0.045031 0.000450
1906 10698 ERNSH 2013-10-09 70 15.00 8 0.05 Outback Lager 1 Beverages Soft drinks, coffees, teas, beers, and ales 114.000 0.009006 0.000090
2039 10436 BLONP 2013-02-05 75 6.20 24 0.10 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 133.920 0.010580 0.000106
2043 10510 SAVEA 2013-04-18 75 7.75 36 0.10 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 251.100 0.019837 0.000198
2047 10572 BERGS 2013-06-18 75 7.75 15 0.10 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 104.625 0.008266 0.000083
2053 10631 LAMAI 2013-08-14 75 7.75 8 0.10 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 55.800 0.004408 0.000044
2063 10788 QUICK 2013-12-22 75 7.75 40 0.05 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 294.500 0.023266 0.000233
2065 10894 SAVEA 2014-02-18 75 7.75 120 0.05 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 883.500 0.069798 0.000698
2069 10932 BONAP 2014-03-06 75 7.75 20 0.10 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 139.500 0.011021 0.000110
2070 10951 RICSU 2014-03-16 75 7.75 50 0.05 Rhönbräu Klosterbier 1 Beverages Soft drinks, coffees, teas, beers, and ales 368.125 0.029083 0.000291
2079 10273 QUICK 2012-08-05 76 14.40 33 0.05 Lakkalikööri 1 Beverages Soft drinks, coffees, teas, beers, and ales 451.440 0.035665 0.000357
2095 10604 FURIB 2013-07-18 76 18.00 10 0.10 Lakkalikööri 1 Beverages Soft drinks, coffees, teas, beers, and ales 162.000 0.012798 0.000128
2115 11050 FOLKO 2014-04-27 76 18.00 50 0.10 Lakkalikööri 1 Beverages Soft drinks, coffees, teas, beers, and ales 810.000 0.063992 0.000640
# rev_disc_01to05 = df.loc[(df["Discount"] >= .01) & (df["Discount"] < .05)]

Condiments: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues due to all discounts

df_cond_no_disc = df_cond.loc[df["Discount"] == 0]
df_cond_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
82 10289 BSBEV 2012-08-26 3 8.0 30 0.0 Aniseed Syrup 2 Condiments Sweet and savory sauces, relishes, spreads, an... 240.0 0.018960 0.000190
83 10405 LINOD 2013-01-06 3 8.0 50 0.0 Aniseed Syrup 2 Condiments Sweet and savory sauces, relishes, spreads, an... 400.0 0.031601 0.000316
85 10540 QUICK 2013-05-19 3 10.0 60 0.0 Aniseed Syrup 2 Condiments Sweet and savory sauces, relishes, spreads, an... 600.0 0.047401 0.000474
df_cond_disc = df_cond.loc[df["Discount"] > 0]
df_cond_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
84 10485 LINOD 2013-03-25 3 8.0 20 0.1 Aniseed Syrup 2 Condiments Sweet and savory sauces, relishes, spreads, an... 144.00 0.011376 0.000114
89 10764 ERNSH 2013-12-03 3 10.0 20 0.1 Aniseed Syrup 2 Condiments Sweet and savory sauces, relishes, spreads, an... 180.00 0.014220 0.000142
96 10336 PRINI 2012-10-23 4 17.6 18 0.1 Chef Anton's Cajun Seasoning 2 Condiments Sweet and savory sauces, relishes, spreads, an... 285.12 0.022525 0.000225

Confections: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues due to all discounts

df_confect_no_disc = df_confect.loc[df["Discount"] == 0]
df_confect_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
336 10255 RICSU 2012-07-12 16 13.9 35 0.0 Pavlova 3 Confections Desserts, candies, and sweet breads 486.5 0.038434 0.000384
339 10296 LILAS 2012-09-03 16 13.9 30 0.0 Pavlova 3 Confections Desserts, candies, and sweet breads 417.0 0.032944 0.000329
340 10310 THEBI 2012-09-20 16 13.9 10 0.0 Pavlova 3 Confections Desserts, candies, and sweet breads 139.0 0.010981 0.000110
df_confect_disc = df_confect.loc[df["Discount"] > 0]
df_confect_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
337 10263 ERNSH 2012-07-23 16 13.9 60 0.25 Pavlova 3 Confections Desserts, candies, and sweet breads 625.500 0.049416 0.000494
338 10287 RICAR 2012-08-22 16 13.9 40 0.15 Pavlova 3 Confections Desserts, candies, and sweet breads 472.600 0.037336 0.000373
341 10324 SAVEA 2012-10-08 16 13.9 21 0.15 Pavlova 3 Confections Desserts, candies, and sweet breads 248.115 0.019602 0.000196

Dairy: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues due to all discounts

df_dairy_no_disc = df_dairy.loc[df["Discount"] == 0]
df_dairy_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
216 10248 VINET 2012-07-04 11 14.0 12 0.0 Queso Cabrales 4 Dairy Products Cheeses 168.0 0.013272 0.000133
217 10296 LILAS 2012-09-03 11 16.8 12 0.0 Queso Cabrales 4 Dairy Products Cheeses 201.6 0.015927 0.000159
220 10365 ANTO 2012-11-27 11 16.8 24 0.0 Queso Cabrales 4 Dairy Products Cheeses 403.2 0.031854 0.000319
df_dairy_disc = df_dairy.loc[df["Discount"] > 0]
df_dairy_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
218 10327 FOLKO 2012-10-11 11 16.8 50 0.2 Queso Cabrales 4 Dairy Products Cheeses 672.00 0.053089 0.000531
219 10353 PICCO 2012-11-13 11 16.8 12 0.2 Queso Cabrales 4 Dairy Products Cheeses 161.28 0.012741 0.000127
224 10443 REGGC 2013-02-12 11 16.8 6 0.2 Queso Cabrales 4 Dairy Products Cheeses 80.64 0.006371 0.000064

Meat/Poultry: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues due to all discounts

df_meat_no_disc = df_meat.loc[df["Discount"] == 0]
df_meat_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
181 10693 WHITC 2013-10-06 9 97.0 6 0.0 Mishi Kobe Niku 6 Meat/Poultry Prepared meats 582.0 0.045979 0.000460
182 10848 CONSH 2014-01-23 9 97.0 3 0.0 Mishi Kobe Niku 6 Meat/Poultry Prepared meats 291.0 0.022990 0.000230
379 10265 BLONP 2012-07-25 17 31.2 30 0.0 Alice Mutton 6 Meat/Poultry Prepared meats 936.0 0.073946 0.000739
df_meat_disc = df_meat.loc[df["Discount"] > 0]
df_meat_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
178 10420 WELLI 2013-01-21 9 77.6 20 0.10 Mishi Kobe Niku 6 Meat/Poultry Prepared meats 1396.8 0.110350 0.001103
179 10515 QUICK 2013-04-23 9 97.0 16 0.15 Mishi Kobe Niku 6 Meat/Poultry Prepared meats 1319.2 0.104219 0.001042
180 10687 HUNGO 2013-09-30 9 97.0 50 0.25 Mishi Kobe Niku 6 Meat/Poultry Prepared meats 3637.5 0.287369 0.002874

Grains/Cereals: Discounts vs. no discounts

Create dataframes for Revenues from no discounts and Revenues due to all discounts

df_grains_no_disc = df_grains.loc[df["Discount"] == 0]
df_grains_no_disc.head(3)
Id CustomerId OrderDate ProductId OrderUnitPrice OrderQty Discount ProductName CategoryId CategoryName CatDescription Revenue RevPercentTotal RevFractionTotal
536 10435 CONSH 2013-02-04 22 16.8 12 0.0 Gustaf's Knäckebröd 5 Grains/Cereals Breads, crackers, pasta, and cereal 201.6 0.015927 0.000159
537 10553 WARTH 2013-05-30 22 21.0 24 0.0 Gustaf's Knäckebröd 5 Grains/Cereals Breads, crackers, pasta, and cereal 504.0 0.039817 0.000398
538 10603 SAVEA 2013-07-18 22 21.0 48 0.0 Gustaf's Knäckebröd 5 Grains/Cereals Breads, crackers, pasta, and cereal 1008.0 0.079634 0.000796
df_grains_disc = df_grains.loc[df["Discount"] > 0]
df_grains_disc.head(3)
|     | Id    | CustomerId | OrderDate  | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName         | CategoryId | CategoryName   | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 535 | 10251 | VICTE | 2012-07-08 | 22 | 16.8 | 6 | 0.05 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 95.76 | 0.007565 | 0.000076 |
| 542 | 10651 | WANDK | 2013-09-01 | 22 | 21.0 | 20 | 0.25 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 315.00 | 0.024886 | 0.000249 |
| 556 | 10543 | LILAS | 2013-05-21 | 23 | 9.0 | 70 | 0.15 | Tunnbröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 535.50 | 0.042305 | 0.000423 |

Produce: Discounts vs. no discounts

Create dataframes for revenues from non-discounted orders and revenues from discounted orders

df_produce_no_disc = df_produce.loc[df_produce["Discount"] == 0]
df_produce_no_disc.head(3)
|     | Id    | CustomerId | OrderDate  | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName                     | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136 | 10262 | RATTC | 2012-07-22 | 7 | 24.0 | 15 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 360.0 | 0.028441 | 0.000284 |
| 139 | 10471 | BSBEV | 2013-03-11 | 7 | 24.0 | 30 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 720.0 | 0.056881 | 0.000569 |
| 141 | 10546 | VICTE | 2013-05-23 | 7 | 30.0 | 10 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 300.0 | 0.023701 | 0.000237 |
df_produce_disc = df_produce.loc[df_produce["Discount"] > 0]
df_produce_disc.head(3)
|     | Id    | CustomerId | OrderDate  | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName                     | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 137 | 10385 | SPLIR | 2012-12-17 | 7 | 24.0 | 10 | 0.20 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 192.0 | 0.015168 | 0.000152 |
| 138 | 10459 | VICTE | 2013-02-27 | 7 | 24.0 | 16 | 0.05 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 364.8 | 0.028820 | 0.000288 |
| 140 | 10511 | BONAP | 2013-04-18 | 7 | 30.0 | 50 | 0.15 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 1275.0 | 0.100727 | 0.001007 |

Seafood: Discounts vs. no discounts

Create dataframes for revenues from non-discounted orders and revenues from discounted orders

df_seafood_no_disc = df_seafood.loc[df_seafood["Discount"] == 0]
df_seafood_no_disc.head(3)
|     | Id    | CustomerId | OrderDate  | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription  | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 184 | 10276 | TORTU | 2012-08-08 | 10 | 24.8 | 15 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 372.0 | 0.029389 | 0.000294 |
| 186 | 10389 | BOTTM | 2012-12-20 | 10 | 24.8 | 16 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 396.8 | 0.031348 | 0.000313 |
| 187 | 10449 | BLONP | 2013-02-18 | 10 | 24.8 | 14 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 347.2 | 0.027429 | 0.000274 |
df_seafood_disc = df_seafood.loc[df_seafood["Discount"] > 0]
df_seafood_disc.head(3)
|     | Id    | CustomerId | OrderDate  | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription  | Revenue | RevPercentTotal | RevFractionTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 183 | 10273 | QUICK | 2012-08-05 | 10 | 24.8 | 24 | 0.05 | Ikura | 8 | Seafood | Seaweed and fish | 565.44 | 0.044671 | 0.000447 |
| 185 | 10357 | LILAS | 2012-11-19 | 10 | 24.8 | 30 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 595.20 | 0.047022 | 0.000470 |
| 188 | 10450 | VICTE | 2013-02-19 | 10 | 24.8 | 20 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 396.80 | 0.031348 | 0.000313 |

While we could simply ask whether the average revenue per order is larger for discounted than for non-discounted products in each product category, I believe it would be more informative to look at how the level of discount correlates with average revenue per order. We already know from question 1 that discounts drive revenues overall relative to no discounts, and from question 2 that discounts in the 5-9.9% range maximize revenues summed across all categories. I'd like to find out whether the result we see across all categories also holds within each category.

Originally, I considered dividing each category into discount levels of 5%, 10%, 15%, 20%, and more than 20%. In many cases, however, the sample sizes at these levels are small (often n < 30). We can get larger samples by consolidating the discount levels into two bins: 1-14.9% and 15%+.
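
A minimal sketch of this binning with numpy, using Meat/Poultry as the example category (the DiscLevel column name and bin labels are my own, not from the original notebook):

import numpy as np

# Tag each order line with a consolidated discount bin; np.select
# evaluates conditions in order, so the second catches 0 < d < 0.15
conditions = [df_meat["Discount"] == 0,
              df_meat["Discount"] < 0.15,
              df_meat["Discount"] >= 0.15]
df_meat["DiscLevel"] = np.select(conditions, ["none", "1-14.9%", "15%+"])
df_meat["DiscLevel"].value_counts()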

Depending on the findings (and time permitting), it would be interesting to compare product categories to look for trends or differences in how discounts are applied across products.

Null and alternative hypotheses

  • Null hypothesis: There is no significant difference in average revenue per order among the three groups within a category: non-discounted products, discounts of 1%-14.9%, and discounts of 15% and up

    • Ho: $\mu_1 = \mu_2 = \mu_3$
  • Alternative hypothesis: Differences in average revenue per order among these groups are unlikely to be due to chance; at least one group mean differs from the others

    • Ha: $\mu_i \neq \mu_j$ for at least one pair $i \neq j$
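
Under these hypotheses, a one-way ANOVA across the three groups is a natural test. A minimal sketch using scipy's f_oneway, with Meat/Poultry as the example category (the variable names are mine; df_meat comes from the cells above):

from scipy import stats

# Revenue samples for the three discount levels within one category
no_disc = df_meat.loc[df_meat["Discount"] == 0, "Revenue"]
low_disc = df_meat.loc[(df_meat["Discount"] > 0) & (df_meat["Discount"] < 0.15), "Revenue"]
high_disc = df_meat.loc[df_meat["Discount"] >= 0.15, "Revenue"]

# One-way ANOVA: rejects Ho if at least one group mean differs
f_stat, p_val = stats.f_oneway(no_disc, low_disc, high_disc)
print(f"F = {f_stat:.3f}, p = {p_val:.4f}")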
