The goal of the project is to use statistical analysis of data in the Northwind SQL database to answer at least four questions of business relevance to Northwind, an importer of food and beverage products. To answer these questions using statistical analysis, it will be necessary to perform hypothesis testing.
The first of these four questions is a required question: Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?
The following deliverables are required:
- A Jupyter Notebook containing any code written for this project.
- A Blog Post explaining process, methodology, and findings.
- An "Executive Summary" PowerPoint Presentation that explains the hypothesis tests run, findings, and relevance to company stakeholders.
- A Video Walkthrough of the “Executive Summary” presentation.
The questions addressed in this project are listed below.
- (Required) Do discounts have a statistically significant effect on the average quantity of products per order?
- Do discounts have a statistically significant effect on the average revenue per order?
- Does the level of discount (e.g., 5%, 10%, 15%...) have a statistically significant effect on the average quantity of products per order?
- Does the level of discount (e.g., 5%, 10%, 15%...) have a statistically significant effect on the average revenue per order?
- Do the revenue distributions differ in a statistically significant manner from one product category to the next? Put another way, do some product categories have statistically significantly higher revenues per order than others?
import pandas as pd
import numpy as np
import pprint as pp
from sqlalchemy import create_engine, inspect
import matplotlib.pyplot as plt
%matplotlib inline
engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
con = engine.connect()
# Get names of tables in database
inspector = inspect(engine)
print("Table names")
table_names = inspector.get_table_names()
pp.pprint(inspector.get_table_names())
print()
Table names
['Category',
'Customer',
'CustomerCustomerDemo',
'CustomerDemographic',
'Employee',
'EmployeeTerritory',
'Order',
'OrderDetail',
'Product',
'Region',
'Shipper',
'Supplier',
'Territory',
'placeholder']
# Get column names for a specific table
def print_col_names(var_table):
    print(f"Column names for table {var_table}:")
    for col in inspector.get_columns(var_table):
        print(col['name'])
    print()
print_col_names('Customer')
Column names for table Customer:
Id
CompanyName
ContactName
ContactTitle
Address
City
Region
PostalCode
Country
Phone
Fax
print_col_names('Order')
Column names for table Order:
Id
CustomerId
EmployeeId
OrderDate
RequiredDate
ShippedDate
ShipVia
Freight
ShipName
ShipAddress
ShipCity
ShipRegion
ShipPostalCode
ShipCountry
print_col_names('OrderDetail')
Column names for table OrderDetail:
Id
OrderId
ProductId
UnitPrice
Quantity
Discount
print_col_names('Product')
Column names for table Product:
Id
ProductName
SupplierId
CategoryId
QuantityPerUnit
UnitPrice
UnitsInStock
UnitsOnOrder
ReorderLevel
Discontinued
print_col_names('Category')
Column names for table Category:
Id
CategoryName
Description
# Create an engine and a connection object to execute SQL commands
engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
con = engine.connect()
# Set up query
query = """
SELECT o.id, o.CustomerID, o.OrderDate,
od.ProductID, od.UnitPrice as [OrderUnitPrice],
od.Quantity as [OrderQty], od.Discount,
p.ProductName, p.CategoryId,
cat.CategoryName, cat.Description as [CatDescription]
FROM [Order] as o
JOIN OrderDetail as oD
ON o.ID = od.OrderID
JOIN Product as p
ON od.ProductID = p.ID
JOIN Category as cat
ON cat.id = p.CategoryId
ORDER BY ProductID
"""
# Execute query and create df from results
df = pd.read_sql(query, con)
df.head()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription
---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
# Function load_df() to pull the original data from the sqlite db and load it into a
# pandas dataframe. This function is used as a subfunction later in the notebook
# to enable reconstruction of the dataframe if needed.
def load_df(query):
    # Create an engine and a connection object to execute SQL commands
    engine = create_engine('sqlite:///Northwind_small.sqlite', echo=False)
    con = engine.connect()
    # Execute the query passed in and create df from the results
    df = pd.read_sql(query, con)
    return df
df = load_df(query)
df.head()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription
---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
Do discounts result in a statistically significant effect on quantity of products per order?
- Null hypothesis (Ho: $\mu_1 = \mu_2$): There is no statistically significant difference between the mean quantity of discounted products purchased per order and the mean quantity of non-discounted products purchased per order.
- Alternative hypothesis (Ha: $\mu_1 \neq \mu_2$): Differences between the mean quantity of discounted products purchased per order and the mean quantity of non-discounted products purchased per order are unlikely to be due to chance.
Create dataframes containing products ordered without a discount and products ordered at a discount:
df_no_disc = df.loc[df['Discount'] == 0]
df_no_disc.head()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription
---|---|---|---|---|---|---|---|---|---|---|---
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
6 | 10406 | QUEE | 2013-01-07 | 1 | 14.4 | 10 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
7 | 10413 | LAMAI | 2013-01-14 | 1 | 14.4 | 24 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
df_disc = df.loc[df['Discount'] > 0]
df_disc.head()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription
---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
9 | 10522 | LEHMS | 2013-04-30 | 1 | 18.0 | 40 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
10 | 10526 | WARTH | 2013-05-05 | 1 | 18.0 | 8 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales |
# Functions for plotting average order quantities
def hist_plot1_qty(series1, label=None, color='darkcyan', alpha=0.7, bins=30, xlim=[0, 150]):
plt.hist(series1, label=label, color=color, alpha=alpha, bins=bins)
plt.title(f"Order quantity by occurrence")
plt.xlabel("Order Quantity")
plt.ylabel("Number")
plt.legend(loc='best')
plt.xlim(xlim)
plt.show()
# function for plotting two series
def hist_plot2_qty(series1, series2, label=None, label1=None, color1='b', color2='r', alpha=0.4, bins=30, xlim=[0, 150]):
plt.hist(series1, label=label, color=color1, alpha=alpha, bins=bins)
plt.hist(series2, label=label1, color=color2, alpha=alpha, bins=bins)
plt.title(f"Order quantity by occurrence")
plt.xlabel("Order Quantity")
plt.ylabel("Number")
plt.legend(loc='best')
plt.xlim(xlim)
plt.show()
# function for plotting three different series
def hist_plot3_qty(var1, var2, var3, label=None, label1=None, label2=None, color='b', alpha=0.5, bins=30, xlim=[0, 150]):
var1 = var1
var2 = var2
var3 = var3
alpha = alpha
plt.hist(var1, label=label, color='r', alpha=0.5, bins=bins)
plt.hist(var2, label=label1, color='b', alpha=0.5, bins=bins)
plt.hist(var3, label=label2, color='g', alpha=0.5, bins=bins)
plt.title(f"Order quantity by occurrence")
plt.xlabel("Order Quantity")
plt.ylabel("Number")
plt.legend(loc='best')
plt.xlim(xlim)
plt.show()
Histogram of order quantity by occurrence: no discount vs. some discount
# plt.figure(figsize=(8, 6))
hist_plot2_qty(df_no_disc.OrderQty, df_disc.OrderQty, label='No discounts (X = 0% discount)', label1='All discounts (X = 1%-25%+ discount)')
The histogram shows the distribution of order quantity values (x-axis). The distributions of order quantities for non-discounted and discounted line items appear very similar, with the smallest order quantities appearing at the far left-hand side of the graph and the largest order quantities appearing at the far right.
If discounts have an effect on the quantity of products sold, we would expect to see the distribution of quantities sold from discounts look different from the distribution of quantities sold that didn't involve discounts. If the shape of the distribution changes, such that a greater proportion of quantities appear in the middle or far right-hand-side of the graph, then that difference in distribution suggests that discounts increase the quantity of products in a given order.
To test whether or not the presence of a discount likely has an effect on the quantity of products sold, we can use a two-sided, two-sample t-test. Two prerequisites for performing this test are that 1) both datasets have approximately normal distributions and 2) the variances of the two datasets are approximately equal.
We can see that the above distributions are not normal. Before proceeding with the two-sided, two-sample t-test, we must first address this issue.
Fortunately, we can obtain a normal distribution (per the Central Limit Theorem) by sampling each dataset multiple times and computing each sample's mean. With a large enough number of samples (n > 30) we can expect that the distribution of the means will be normal or very nearly normal. We can take these normal distributions and run statistical tests on them.
Each time the cells below are run, 100 samples are taken 300 times for each of the two populations (orders with no discounts and orders with some level of discount). The mean and standard deviation are computed for each of the 300 sampling events, and each of the resulting means and standard deviations is appended to the respective relevant list.
Each run of the resampling produces slightly different results; however, because there are well over 30 sampling events, we can assume (per the Central Limit Theorem) that the distributions of these means and standard deviations are approximately normal. We can now perform hypothesis tests that require normally distributed statistics on these means and standard deviations.
def sampling_mean(var, sampling_events=300, size=100): # var is the dataframe column of interest
    samp_mean = []
    samp_std = []
    for i in range(sampling_events):
        sample = np.random.choice(var, size=size, replace=False)
        samp_mean.append(sample.mean())
        samp_std.append(sample.std())
    return samp_mean, samp_std
sampling_no_disc_mean, sampling_no_disc_std = sampling_mean(df_no_disc.OrderQty)
print(np.mean(sampling_no_disc_mean))
print(np.std(sampling_no_disc_std))
21.6004
2.3409277308454226
hist_plot1_qty(sampling_no_disc_mean, label="Order qty, no discounts", xlim=None)
sampling_w_disc_mean, sampling_w_disc_std = sampling_mean(df_disc.OrderQty)
print(np.mean(sampling_w_disc_mean))
print(np.std(sampling_w_disc_std))
27.1053
2.24426228479168
hist_plot1_qty(sampling_w_disc_mean, label="Order qty, w/discounts", xlim=None)
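As an optional sanity check on the normality of these resampled means, we could run a normality test such as D'Agostino and Pearson's test (scipy.stats.normaltest); a p-value above 0.05 means we cannot reject the hypothesis of normality. A minimal sketch, using the lists created above:
# Optional check (sketch): D'Agostino-Pearson normality test on the resampled means;
# p > 0.05 means we cannot reject the null hypothesis that the data are normally distributed
import scipy.stats as scs
for name, means in [("no discount", sampling_no_disc_mean), ("with discounts", sampling_w_disc_mean)]:
    stat, p = scs.normaltest(means)
    print(f"{name}: statistic = {stat:.3f}, p-value = {p:.4f}")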
The t-test we want to run requires not only that the samples come from approximately normally distributed populations, but also that those populations have approximately equal variances. We use Bartlett's test to test the null hypothesis that the samples come from populations with equal variances.
import scipy.stats
scipy.stats.bartlett(sampling_no_disc_mean, sampling_w_disc_mean)
BartlettResult(statistic=2.564061882357949, pvalue=0.10931738849981974)
Because the p-value is > 0.05, we cannot reject the null hypothesis that the samples come from populations with similar or equal variances.
One interesting point is that there appears to be more fluctuation in the means and distributions for order quantities than for order revenues (presumably because the revenue numbers are much larger than the quantity numbers). Sometimes the distributions that result from resampling pass Bartlett's test, and sometimes they don't. What is fairly consistent, though, is that the distributions are reasonably normal and similar in shape--so even when they fail Bartlett's test, they don't miss by much.
Because of this, I would still like to run the t-test to see what results we get.
import scipy.stats as scs
scs.ttest_ind(sampling_no_disc_mean, sampling_w_disc_mean)
Ttest_indResult(statistic=-35.273313157553034, pvalue=3.1316578958420464e-148)
The p-value is << 0.05, so if the other conditions for this test (normality, similar variances) are met, we should reject the null hypothesis that the samples come from populations with the same means. Depending on the particular samples drawn on a given run, the distributions may or may not pass Bartlett's test. If they do not pass on a particular run, we can look at the distributions to see how similar or dissimilar they are and decide whether an alternative test would be better.
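One such alternative, sketched below, is to pair Levene's test (which is less sensitive to departures from normality than Bartlett's) with Welch's t-test, which does not assume equal variances; both are available in scipy.stats. This is a robustness check, not a replacement for the analysis above.
# Robustness check (sketch): Levene's test tolerates non-normality better than Bartlett's,
# and Welch's t-test (equal_var=False) does not assume equal variances
import scipy.stats as scs
print(scs.levene(sampling_no_disc_mean, sampling_w_disc_mean))
print(scs.ttest_ind(sampling_no_disc_mean, sampling_w_disc_mean, equal_var=False))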
If we decide to perform this test and we get a p-value of << 0.05, then it's likely that there is an actual statistically significant difference between the populations that is not due to chance. But what is the effect size of that difference? Enter Cohen's d.
Cohen's d allows us to view the magnitude of the effect of a statistically significant difference. From the "Trending Sideways: Practical Engineering and Statistics Updates" blog page, Cohen’s d is "simply a measure of the distance between two means, measured in standard deviations." So, we could expect that the majority of analyses will yield Cohen's d results in a range from 0 to 3 or 4 (representing the differences between two means in terms of standard deviations).
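In symbols, using the same pooled-variance weighting as the function below, Cohen's d is
$d = \dfrac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \quad s_{pooled} = \sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}}$
where $\bar{x}_i$, $s_i^2$, and $n_i$ are the mean, variance, and size of each group.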
# From learn.co Section 20 Effect Size lesson, with conditional print statements added
def Cohen_d(group1, group2, print_stmt=True):
# Compute Cohen's d.
# group1: Series or NumPy array ----> in this case, the array is the list of sample means from resampling
# group2: Series or NumPy array ----> in this case, the array is the list of sample means from resampling
# returns a floating point number
diff = group1.mean() - group2.mean()
n1, n2 = len(group1), len(group2)
var1 = group1.var()
var2 = group2.var()
# Calculate the pooled threshold as shown earlier
pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
# Calculate Cohen's d statistic
d = diff / np.sqrt(pooled_var)
if print_stmt: # changing print_stmt argument in the function call to False will turn off these print statements
if abs(d) >= 0 and abs(d) < .4:
print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **small** effect size.")
elif abs(d) >= .4 and abs(d) < .7:
print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **medium** effect size.")
else:
print(f"This Cohen's d result ({round(d, 8)}) represents a difference of {round(d, 2)} pooled standard deviations between the two groups, and suggests a **large** effect size.")
return d
group1 = np.array(sampling_w_disc_mean)
group2 = np.array(sampling_no_disc_mean)
d = Cohen_d(group1, group2)
d
This Cohen's d result (2.88486608) represents a difference of 2.88 pooled standard deviations between the two groups, and suggests a **large** effect size.
2.884866082701662
The functions below make it possible to plot what the overlap of two randomly generated normal distributions looks like at a given Cohen's d. Following each pair's Cohen's d result is a visualization of the amount of overlap that the given Cohen's d implies. The functions below are adapted from the Effect Sizes lesson in Section 20 of Module 2 of the Flatiron Full-Time Online Data Science Bootcamp.
def evaluate_PDF(var, x=4):
    '''Input: a frozen random variable object and the number of standard
    deviations (x) to span on either side of the mean.
    Output: x and y values for the normal distribution PDF.
    '''
    # Identify the mean and standard deviation of the random variable
    mean = var.mean()
    std = var.std()
    # Use numpy to calculate 100 evenly spaced points over the interval mean +/- x standard deviations
    xs = np.linspace(mean - x*std, mean + x*std, 100)
    # Calculate the probability density at each point
    ys = var.pdf(xs)
    return xs, ys # Return calculated values
def plot_pdfs(cohen_d=2):
"""Plot randomly generated, normally distributed PDFs for
distributions that differ by some number of stds.
cohen_d: number of standard deviations between the means
"""
group1 = scipy.stats.norm(cohen_d, 1)
group2 = scipy.stats.norm(0, 1)
xs, ys = evaluate_PDF(group1)
plt.fill_between(xs, ys, label='Group1', color='#ff2289', alpha=0.7)
xs, ys = evaluate_PDF(group2)
plt.fill_between(xs, ys, label='Group2', color='#376cb0', alpha=0.7)
plot_pdfs(d)
Magenta = group1 = all discounted products;
Blue = group2 = all non-discounted products
This graph provides a way to visualize the difference of the means, as given by the Cohen's d result. Cohen's d is a measure of the distance between the means, depicted in units of pooled standard deviation. In this case:
There is a statistically significant difference in the quantity of products per order for non-discounted products vs. discounted products.
Cohen offered basic guidelines for interpreting the result of his test (along with a caution not to simply mindlessly apply the guidance, but to evaluate it in the larger context of the field of analysis).
In general, d = 0.2 would be considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect. In the function above, I have included conditional print statements to accompany the result of running the function.
Note that Cohen's d can be positive or negative, depending on which dataset is assigned to group1 and which to group2. In the Cohen's d calculation in the previous cell:
- 'group1' contained the distribution of means from resampling of the dataset containing order quantities only for discounted products, and
- 'group2' contained the distribution of means from resampling of the dataset containing order quantities only for non-discounted products.
A positive Cohen's d tells us that the means in the order quantities from discounted products are d pooled standard deviations higher than the means from the order quantities from non-discounted products.
Do discounts result in a statistically significant effect on per-order revenue?
- Null hypothesis (Ho: $\mu_1 = \mu_2$): There is no statistically significant difference between the mean per-order revenue from discounted products and the mean per-order revenue from non-discounted products.
- Alternative hypothesis (Ha: $\mu_1 \neq \mu_2$): Differences between the mean per-order revenue from discounted products and the mean per-order revenue from non-discounted products are unlikely to be due to chance.
Before proceeding, we need to add a "Revenue" column.
# Add Revenue, Revenue Percentage, and Revenue Fraction (actual revenue per row divided by total revenues)
def add_rev_col():
# add a revenue column
df["Revenue"] = df["OrderUnitPrice"] * (1 - df["Discount"]) * df["OrderQty"]
# add columns showing percentage / fraction of total revenue
df["RevPercentTotal"] = (df["Revenue"] / df["Revenue"].sum())*100
df["RevFractionTotal"] = (df["Revenue"] / df["Revenue"].sum())
return df
df = add_rev_col()
df.head()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.00 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 172.8 | 0.013652 | 0.000137 |
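As a quick check of the formula, the first row above works out to $14.40 \times (1 - 0.20) \times 45 = 518.40$, which matches the Revenue column.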
total_revenue = df.loc[df.Discount >= 0.00]
revenue_no_discount = df.loc[df.Discount == 0.00]
revenue_all_discounts = df.loc[df.Discount >=0.01]
total_revenue.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.2 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
revenue_no_discount.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 172.8 | 0.013652 | 0.000137 |
revenue_all_discounts.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
# Functions for plotting
def hist_plot1(series1, label=None, color='darkcyan', alpha=0.7, bins=50, xlim=[0, 3500], print_stmt=True):
plt.hist(series1, label=label, color=color, alpha=alpha, bins=bins)
plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
plt.xlabel("Revenue")
plt.ylabel("Number")
plt.legend(loc="best")
plt.xlim(xlim)
plt.show()
if print_stmt:
print(f"Total revenues (discount and no discount): {int(total_revenue.Revenue.sum())}")
print(f"Total revenues from all discounted products: {int(revenue_all_discounts.Revenue.sum())}")
print(f"Percentage of total revenue from discounts: {round((revenue_all_discounts.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
def hist_plot2(series1, series2, label=None, label1=None, color1='b', color2='r', alpha=0.4, bins=100, xlim=[0, 3500], ylim=None):
plt.hist(series1, label=label, color=color1, alpha=alpha, bins=bins)
plt.hist(series2, label=label1, color=color2, alpha=alpha, bins=bins)
plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
plt.xlabel("Revenue")
plt.ylabel("Number")
plt.legend(loc='best')
plt.xlim(xlim)
plt.ylim(ylim)
plt.show()
hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label="Avg order revenue, no discount", label1="Avg order revenue, all discounts", color1='g', color2='b', xlim=[0, 3500])
hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label="Avg order revenue, no discount", label1="Avg order revenue, all discounts", color1='g', color2='b', xlim=[1500, 10000], ylim=[0,20])
The histogram shows the distribution of revenue values (x-axis). The distributions of revenues from non-discounted and discounted line items appear very similar, with the smallest order amounts occurring most frequently (far left-hand side of the graph) and the highest-revenue orders appearing at the far right.
If discounts have an effect on sales revenues, we would expect to see the distribution of revenues from discounts look different from the distribution of revenues due to sales that didn't involve discounts. If the shape of the distribution changes, such that a greater proportion of revenues appear in the middle or far right-hand-side of the graph, then that difference in distribution suggests that discounts actually increase revenues (since higher revenues would occur more frequently in that distribution than in the no-discount distribution).
As with Question 1, we would like to perform a two-sided, two-sample t-test, so we need to check for normality and compare the variances of the two distributions.
We can see that the above distributions are not normal. As we did for Question 1, we will resample each dataset to obtain a reasonably normal distribution of means.
Each time the cells below are run, 100 samples are taken 300 times for each of the two populations (revenues from no discounts, revenues from all discounts). The mean and standard deviation are computed for each of the 300 sampling events, and each of the resulting means and standard deviations is appended to the respective relevant list.
# using sampling_mean(var, sampling_events=300, size=100) function from Question 1
sampling_no_disc_mean, sampling_no_disc_std = sampling_mean(revenue_no_discount.Revenue)
print(np.mean(sampling_no_disc_mean))
print(np.std(sampling_no_disc_std))
570.5213759999999
381.14048065742645
hist_plot1(sampling_no_disc_mean, label="Avg order revenue, no discounts", xlim=None)
Total revenues (discount and no discount): 1265793
Total revenues from all discounted products: 515094
Percentage of total revenue from discounts: 40.693%
sampling_w_disc_mean, sampling_w_disc_std = sampling_mean(revenue_all_discounts.Revenue)
print(np.mean(sampling_w_disc_mean))
print(np.std(sampling_w_disc_std))
618.2533294666667
331.2482261270112
hist_plot1(sampling_w_disc_mean, label="Avg order revenue, all discounts", xlim=None)
Total revenues (discount and no discount): 1265793
Total revenues from all discounted products: 515094
Percentage of total revenue from discounts: 40.693%
The t-test we want to run requires not only that the samples come from approximately normally distributed populations, but also that those populations have approximately equal variances. We again use Bartlett's test to test the null hypothesis that the samples come from populations with equal variances.
import scipy.stats
scipy.stats.bartlett(sampling_no_disc_mean, sampling_w_disc_mean)
BartlettResult(statistic=0.04396538104808006, pvalue=0.8339180343436614)
The p-value is > 0.05, so we cannot reject the null hypothesis that the samples come from populations with similar or equal variances. We therefore proceed on the assumption that the variances of the two populations are approximately equal.
import scipy.stats as scs
scs.ttest_ind(sampling_no_disc_mean, sampling_w_disc_mean)
Ttest_indResult(statistic=-6.189709371451119, pvalue=1.1197350179704822e-09)
The p-value is << 0.05, so we should reject the null hypothesis that the samples come from populations with the same means. Therefore, we can say that it's unlikely that the differences in sample means from one dataset to another are due to random chance.
So, it's likely that there is an actual statistically significant difference between the two populations. But what is the effect size of that difference? Enter Cohen's d.
Using function from Cohen's d section in Question 1:
group1 = np.array(sampling_w_disc_mean)
group2 = np.array(sampling_no_disc_mean)
d = Cohen_d(group1, group2)
d
This Cohen's d result (0.50623208) represents a difference of 0.51 pooled standard deviations between the two groups, and suggests a **medium** effect size.
0.506232078277466
Using evaluate_PDF and plot_pdfs functions defined in Question 1:
plot_pdfs(d)
Magenta = all discounted products; Blue = all non-discounted products
This graph provides a way to visualize the difference of the means, as given by the Cohen's d result. Cohen's d is a measure of the distance between the means, depicted in units of pooled standard deviation. In this case:
There is a statistically significant difference in the average revenues per order for non-discounted products vs. discounted products. The average revenue per order was higher when discounts were present than when discounts were not present.
As described in the Cohen's d section in Question 1 above, Cohen suggested rules of thumb for how to interpret Cohen's d. In general, d = 0.2 would be considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect.
Note that Cohen's d can be positive or negative, depending on which dataset is assigned to group1 and which to group2. In the Cohen's d calculation in the previous cell:
- 'group1' contained the distribution of means from resampling of the dataset containing only revenues from discounted products, and
- 'group2' contained the distribution of means from resampling of the dataset containing revenues only from non-discounted products.
A positive Cohen's d from this ordering tells us that the means in the revenues from discounted products are d pooled standard deviations higher than the means from the revenues from non-discounted products.
Does the discount level (e.g., 5%, 10%, 15%...) affect the average order quantity?
While we were able to reject the null hypothesis in the first question (that order quantities are the same whether products are discounted or not), we should dig a bit deeper to find out whether some discount levels result in greater quantities per order than others.
- Null hypothesis (Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$): There is no significant difference among the mean product order quantities at the various discount levels.
- Alternative hypothesis (Ha): At least one of the mean product order quantities differs from the others; such differences are unlikely to be due to chance.
# Create dataframes with different discount levels (1-5%, 5-10%, 10-15%, 15-20%, and 20% and up)
def create_disc_dfs():
'''
This function creates dataframes for each category of discount in 5% increments, up to
20% (which is represented as 20% and up). The function creates the dataframes
for future use.
'''
df_disc_01to05 = df.loc[(df["Discount"] >= .01) & (df["Discount"] < .05)]
df_disc_05to10 = df.loc[(df["Discount"] >= .05) & (df["Discount"] < .10)]
df_disc_10to15 = df.loc[(df["Discount"] >= .10) & (df["Discount"] < .15)]
df_disc_15to20 = df.loc[(df["Discount"] >= .15) & (df["Discount"] < .2)]
df_disc_20andup = df.loc[df["Discount"] >= .20]
return df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup
# when running this function, run the following line to extract all of the dataframes:
# df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup = create_disc_dfs()
df_disc_01to05, df_disc_05to10, df_disc_10to15, df_disc_15to20, df_disc_20andup = create_disc_dfs()
Run the cells below to see the top three rows of each dataframe; uncomment the .info() lines for info on the dataframe columns and datatypes.
df_disc_01to05.head(3)
# df_disc_01to05.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
135 | 11077 | RATTC | 2014-05-06 | 6 | 25.00 | 1 | 0.02 | Grandma's Boysenberry Spread | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 24.5000 | 0.001936 | 0.000019 |
329 | 11077 | RATTC | 2014-05-06 | 14 | 23.25 | 1 | 0.03 | Tofu | 7 | Produce | Dried fruit and bean curd | 22.5525 | 0.001782 | 0.000018 |
378 | 11077 | RATTC | 2014-05-06 | 16 | 17.45 | 2 | 0.03 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 33.8530 | 0.002674 | 0.000027 |
df_disc_05to10.head(3)
# df_disc_05to10.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
26 | 10905 | WELLI | 2014-02-24 | 1 | 18.0 | 20 | 0.05 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 342.00 | 0.027019 | 0.000270 |
54 | 10632 | WANDK | 2013-08-14 | 2 | 19.0 | 30 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 541.50 | 0.042780 | 0.000428 |
61 | 10787 | LAMAI | 2013-12-19 | 2 | 19.0 | 15 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 270.75 | 0.021390 | 0.000214 |
df_disc_10to15.head(3)
# df_disc_10to15.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
33 | 11025 | WARTH | 2014-04-15 | 1 | 18.0 | 10 | 0.1 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 162.0 | 0.012798 | 0.000128 |
50 | 10485 | LINOD | 2013-03-25 | 2 | 15.2 | 20 | 0.1 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 273.6 | 0.021615 | 0.000216 |
84 | 10485 | LINOD | 2013-03-25 | 3 | 8.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 144.0 | 0.011376 | 0.000114 |
df_disc_15to20.head(3)
# df_disc_15to20.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
10 | 10526 | WARTH | 2013-05-05 | 1 | 18.0 | 8 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 122.4 | 0.009670 | 0.000097 |
df_disc_20andup.head(3)
# df_disc_20andup.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
9 | 10522 | LEHMS | 2013-04-30 | 1 | 18.0 | 40 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 576.0 | 0.045505 | 0.000455 |
16 | 10646 | HUNGO | 2013-08-27 | 1 | 18.0 | 15 | 0.25 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 202.5 | 0.015998 | 0.000160 |
df_disc_05to10.groupby(['CategoryName']).count() # shows sample size for each category
CategoryName | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Beverages | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
Condiments | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
Confections | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 | 29 |
Dairy Products | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 |
Grains/Cereals | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
Meat/Poultry | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
Produce | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
Seafood | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
df.groupby('Discount')['OrderQty'].sum().plot(kind='bar',
title ="Total Number of Items Sold by Discount Level", figsize=(8, 5), fontsize=12)
plt.show()
Around 28,500 product units sold (approximately 55%) were not discounted, whereas approximately 23,000 (around 45%) product units were sold at a discount. Of these ~23,000 product units sold at a discount, product units sold at a 5% discount comprised the largest subset at just over 5000 units sold at this discount level. The remaining four discount levels--10%, 15%, 20%, and 25%--represent almost identical unit sales of approximately 4500 units each.
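The unit totals and percentages quoted above can be reproduced with a quick groupby; here is a minimal sketch using the df built earlier (the intermediate variable names are just for illustration):
# Sketch: total units sold with and without a discount, and their shares of all units
units_by_disc = df.groupby('Discount')['OrderQty'].sum()
no_disc_units = units_by_disc.loc[0.0]
disc_units = units_by_disc.sum() - no_disc_units
print(f"No discount: {no_disc_units} units ({no_disc_units / units_by_disc.sum():.1%})")
print(f"Discounted: {disc_units} units ({disc_units / units_by_disc.sum():.1%})")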
As in the first case, we can create some initial visualizations to see if different discount levels result in differences in average order quantities.
hist_plot2_qty(df_no_disc.OrderQty, df_disc.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, all discounts")
hist_plot2_qty(df_no_disc.OrderQty, df_disc_01to05.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, Discounts = 1-4.9%", xlim=([0,150]))
# there are very few data points in the df_disc_01to05 category; going forward, I will delete this one.
hist_plot2_qty(df_no_disc.OrderQty, df_disc_05to10.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, Discounts = 5-9.9%", xlim=([0,150]))
hist_plot2_qty(df_no_disc.OrderQty, df_disc_10to15.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, Discounts = 10-14.9%", xlim=([0,150]))
hist_plot2_qty(df_no_disc.OrderQty, df_disc_15to20.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, Discounts = 15-19.9%", xlim=([0,150]))
hist_plot2_qty(df_no_disc.OrderQty, df_disc_20andup.OrderQty, label="Avg order qty, no discounts", label1="Avg order qty, Discounts = 20% and up", xlim=([0,150]))
A review of the histograms of order quantities at the various discount levels shows that, much like the overall order-quantity histograms, these discount-level subsets are not normally distributed. To run the statistical tests needed for this question, I will again perform multiple sampling of each dataset and then run the tests on the distributions of the sample means. Per the Central Limit Theorem, these distributions should be approximately normal, and so can be used in the tests that follow.
The first test is a one-way ANOVA, which can tell us that at least one of the groups differs significantly from the others (though not which one). Tukey's test, run afterward, identifies which pairs differ.
As with the t-tests above, I will first perform resampling and use the resulting, approximately normal, distributions of means for the ANOVA test.
no_discounts = sampling_mean(revenue_no_discount.OrderQty)
print(f"The mean of the distribution is {round(np.mean(no_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(no_discounts[1]), 2)}")
hist_plot1_qty(no_discounts[0], label="Avg order qty, no discount", xlim=([0, 40]))
The mean of the distribution is 21.76
The std deviation of the distribution is 2.29
all_discounts = sampling_mean(revenue_all_discounts.OrderQty)
print(f"The mean of the distribution is {round(np.mean(all_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(all_discounts[1]), 2)}")
hist_plot1_qty(all_discounts[0], label="Avg order qty, all discounts", xlim=([0, 50]))
The mean of the distribution is 27.12
The std deviation of the distribution is 2.25
# testing that function results can be plugged into ttest_ind
scs.ttest_ind(no_discounts[0], all_discounts[0])
Ttest_indResult(statistic=-36.465964150081504, pvalue=3.945476681113779e-154)
As a reminder, here are the dataframes I've built so far:
- total_revenue -- Total revenues, with and without discounts
- revenue_no_discount -- All revenues resulting from sales of non-discounted products
- revenue_all_discounts -- All revenues resulting from sales of products at all discount levels
- df_disc_01to05 -- Products discounted 1-4.9%
- df_disc_05to10 -- Products discounted 5-9.9%
- df_disc_10to15 -- Products discounted 10-14.9%
- df_disc_15to20 -- Products discounted 15-19.9%
- df_disc_20andup -- Products discounted 20% or more
total_rev_means = sampling_mean(total_revenue.OrderQty)
np.mean(total_rev_means[0]), np.std(total_rev_means[1])
(23.843966666666667, 2.5112563991838885)
revs_no_disc_means = sampling_mean(revenue_no_discount.OrderQty)
np.mean(revs_no_disc_means[0]), np.std(revs_no_disc_means[1])
(21.763033333333333, 2.3470803659571597)
revs_all_disc_means = sampling_mean(revenue_all_discounts.OrderQty)
np.mean(revs_all_disc_means[0]), np.std(revs_all_disc_means[1])
(27.08096666666667, 2.2441359474713334)
Note--I'm not including the 1-4.9% discount group because it contains only 7 data points--not enough to compute these statistics reliably.
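That row count can be confirmed directly (a one-line sketch):
# Sketch: confirm the number of line items in the 1-4.9% discount band
print(len(df_disc_01to05))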
df_5to10_disc_means = sampling_mean(df_disc_05to10.OrderQty)
np.mean(df_5to10_disc_means[0]), np.std(df_5to10_disc_means[1])
(27.908199999999997, 1.5019613711301814)
df_10to15_disc_means = sampling_mean(df_disc_10to15.OrderQty)
np.mean(df_10to15_disc_means[0]), np.std(df_10to15_disc_means[1])
(25.406566666666663, 1.999842545129951)
df_15to20_disc_means = sampling_mean(df_disc_15to20.OrderQty)
np.mean(df_15to20_disc_means[0]), np.std(df_15to20_disc_means[1])
(28.39096666666667, 1.170205827015813)
df_20andup_disc_means = sampling_mean(df_disc_20andup.OrderQty)
np.mean(df_20andup_disc_means[0]), np.std(df_20andup_disc_means[1])
(27.48933333333333, 1.7520488624991175)
We use scipy.stats.f_oneway(*args) to run a one-way ANOVA across the groups. A p-value of < 0.05 would lead us to reject the null hypothesis that all of the group means are equal.
scs.f_oneway(revs_no_disc_means[0], df_5to10_disc_means[0], df_10to15_disc_means[0], df_15to20_disc_means[0], df_20andup_disc_means[0])
F_onewayResult(statistic=1017.8604514479314, pvalue=0.0)
The extremely low p-value of the ANOVA test tells us that we should reject the null hypothesis (that there is no statistically significant difference between the distributions of means) and strongly suggests that at least one of the distributions of means differs significantly from at least one other.
Because ANOVA doesn't tell us which variables differ from others, I'll run Tukey's test to find out.
This test can tell us which variables are significantly different from one another.
Per the Codecademy Hypothesis Testing presentation by Hillary Green-Lerman, three things are required for pairwise_tukeyhsd:
- A vector of all data (concatenated using np.concatenate)
- A vector of labels for the data
- A level of significance (usually 0.05)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
no_disc = np.array(revs_no_disc_means[0]) # the number of means in these lists
disc_05to10 = np.array(df_5to10_disc_means[0]) # equals the sampling events input to
disc_10to15 = np.array(df_10to15_disc_means[0]) # the sampling function that was run to
disc_15to20 = np.array(df_15to20_disc_means[0]) # obtain this data
disc_20andup = np.array(df_20andup_disc_means[0])
v = np.concatenate([no_disc, disc_05to10, disc_10to15, disc_15to20, disc_20andup])
labels = ['no_disc'] * len(no_disc) + ['disc_05to10'] * len(disc_05to10) + ['disc_10to15'] * len(disc_10to15) + ['disc_15to20']*len(disc_15to20) + ['disc_20andup']*len(disc_20andup)
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)
type(tukey_results)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=========================================================
group1 group2 meandiff lower upper reject
---------------------------------------------------------
disc_05to10 disc_10to15 -2.5016 -2.8315 -2.1718 True
disc_05to10 disc_15to20 0.4828 0.1529 0.8126 True
disc_05to10 disc_20andup -0.4189 -0.7487 -0.089 True
disc_05to10 no_disc -6.1452 -6.475 -5.8153 True
disc_10to15 disc_15to20 2.9844 2.6545 3.3143 True
disc_10to15 disc_20andup 2.0828 1.7529 2.4126 True
disc_10to15 no_disc -3.6435 -3.9734 -3.3137 True
disc_15to20 disc_20andup -0.9016 -1.2315 -0.5718 True
disc_15to20 no_disc -6.6279 -6.9578 -6.2981 True
disc_20andup no_disc -5.7263 -6.0562 -5.3964 True
---------------------------------------------------------
statsmodels.sandbox.stats.multicomp.TukeyHSDResults
Once again, for average order quantity (much more so than for average revenue per order), the relatively small number of possible quantity values can lead to differences in these test results from one set of sampling runs to the next. The first time I ran the sampling function and then Tukey's test, every single pairing was found to be significant (reject the null hypothesis for all pairings). The second time I ran the sampling functions and the tests, the result was to reject the null for all pairs except disc_05to10 / disc_20andup.
In any case, what I find is that the average order quantity is statistically significantly different for most, if not all, pairs. What this means is that each of these distributions of average order quantity by discount level (with the possible exception of the disc_05to10 / disc_20andup pair) is statistically significantly different from the others at the 0.05 level.
To find out the effect sizes of the various discount-level differences, we can run Cohen's d on each pair to see which ones matter most. Except for one pairing, all were significant, so we will compute Cohen's d for each of the other pairs.
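Rather than calling Cohen_d by hand for every pair, the same comparisons could be generated in a loop; here is a minimal sketch using itertools.combinations and the arrays defined above (print_stmt=False suppresses the per-pair commentary):
# Sketch: Cohen's d for every pairwise combination of the resampled-mean arrays
from itertools import combinations
groups = {'no_disc': no_disc, 'disc_05to10': disc_05to10, 'disc_10to15': disc_10to15,
          'disc_15to20': disc_15to20, 'disc_20andup': disc_20andup}
for (name1, g1), (name2, g2) in combinations(groups.items(), 2):
    d_pair = Cohen_d(g1, g2, print_stmt=False)
    print(f"{name1} vs {name2}: d = {d_pair:.2f}")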
d = Cohen_d(disc_05to10, disc_10to15)
plot_pdfs(d)
This Cohen's d result (1.7861853) represents a difference of 1.79 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_10to15
The average order quantity at discounts of 5-9.9% is significantly higher than the average order quantity at discounts of 10-14.9%.
That is a rather strange outcome; whereas one might expect such a result for revenues, where two somewhat opposing forces are at work (higher discounts might drive customers to purchase more, but at the same time reduce margins for the seller), it is not clear why we would see higher quantities sold for a discount of 5% as opposed to a discount of 10%.
It could be that Northwind is only offering the larger discounts on more expensive products: the larger discount might prompt a customer to buy the product (as opposed to not buying it), but might not be enough in and of itself to prompt the customer to buy more units than planned.
d = Cohen_d(disc_05to10, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-0.36344668) represents a difference of -0.36 pooled standard deviations between the two groups, and suggests a **small** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_15to20
The average order quantity on products discounted 5-9.9% is slightly lower than the average order quantity on products discounted 15-19.9%.
Directionally, this result makes sense (the greater the discount, the larger the quantity sold), but some of the comments from the previous pairwise comparison might apply here as well. In the previous case, perhaps the 10% discount wasn't enough to spur additional quantity buying (though it might have induced a customer to buy at least one unit); as the discount gets bigger, it might start driving increased quantity purchasing, even on more expensive products. This might represent a continuum, as seen in the next graph...
d = Cohen_d(disc_05to10, disc_20andup)
plot_pdfs(d)
# note that this pairing was not found to be statistically significant in Tukey's test.
This Cohen's d result (0.26794938) represents a difference of 0.27 pooled standard deviations between the two groups, and suggests a **small** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_20andup
The average order quantity on products discounted 5-9.9% is slightly higher than the average order quantity on products discounted 20% and up. This is somewhat counterintuitive; all things being equal, one would expect to see more product units purchased at a 20%+ discount than at a 5% discount. However, it is likely that a lot fewer products get the 20%+ discount treatment, so there just isn't as much volume available at that discount level. It could also be that the larger discounts are relatively more available on more expensive products, and so one might not see as much of an increase in quantity purchased as with lower-cost items.
d = Cohen_d(disc_05to10, no_disc)
plot_pdfs(d)
This Cohen's d result (4.14557783) represents a difference of 4.15 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = no_disc
The average order quantity on products discounted 5-9.9% is substantially higher than the average order quantity for non-discounted products, which is not at all surprising.
d = Cohen_d(disc_10to15, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-2.23519295) represents a difference of -2.24 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = disc_15to20
The average order quantity on products discounted 10-14.9% is significantly lower than the average order quantity on products discounted 15-20%. This is unsurprising.
d = Cohen_d(disc_10to15, disc_20andup)
plot_pdfs(d)
This Cohen's d result (-1.32737613) represents a difference of -1.33 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = disc_20andup
The average order quantity on products discounted 10-14.9% is significantly lower than the average order quantity on products discounted 20% and up. Again, unsurprising, and consistent with the previous two Cohen's d results and graphs.
d = Cohen_d(disc_10to15, no_disc)
plot_pdfs(d)
This Cohen's d result (2.44776268) represents a difference of 2.45 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = no_disc
The average order quantity on products discounted 10-14.9% is significantly higher than the average order quantity on non-discounted products. Again, consistent with the previous findings described above.
d = Cohen_d(disc_15to20, disc_20andup)
plot_pdfs(d)
This Cohen's d result (0.59911024) represents a difference of 0.6 pooled standard deviations between the two groups, and suggests a **medium** effect size.
Magenta = group1 = disc_15to20 ; blue = group2 = disc_20andup
The average order quantity on products discounted 15-19.9% is somewhat lower than the average order quantity on products discounted 20% and up. Again, consistent with the previous findings described above.
d = Cohen_d(disc_15to20, no_disc)
plot_pdfs(d)
This Cohen's d result (4.66508814) represents a difference of 4.67 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_15to20 ; blue = group2 = no_disc
Big difference between average order quantity for substantially discounted products (magenta) and non-discounted products (blue).
d = Cohen_d(disc_20andup, no_disc)
plot_pdfs(d)
This Cohen's d result (3.48630871) represents a difference of 3.49 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_20andup ; blue = group2 = no_disc
And once again, a very large difference between average order quantity for substantially discounted products (magenta) and non-discounted products (blue).
Results regarding average order quantity relative to discount offered are, with a few exceptions, intuitive.
- Not surprising: Larger discounts tend to correlate, in a statistically significant way, with greater order unit quantities.
- Somewhat surprising: Products in the 5-9.9% discount band (which consists almost entirely of 5% discounts) tend to show larger effect sizes (measured by Cohen's d) than those at 10-15%. They even show a slightly larger effect size than discounts at 20% and above (though Cohen's d for that pairing is small).
- Future exploration could focus on why these patterns appear. A few possibilities:
- Smaller discounts are likely to be applied to a broader range of products than deeper discounts, so there are just a lot more products that can be purchased at that level
- It could be that larger discounts are applied relatively more often to more expensive products than less expensive ones. Rather than driving additional unit purchases, these larger discounts might be applied to more expensive products to induce a purchase that might not otherwise have happened, and/or to hasten a sale in the case of perishable products.
These findings, combined with those on average order revenue by discount levels, provide some intriguing insights into how pricing occurs and sales are made. Some focused exploration on discounts by product categories and specific products will likely lead to targeted strategies for improving sales in certain categories and/or geographies.
Does the level of discount affect average revenues per order?
- Null hypothesis (Ho): No significant difference among the means of per-order revenues from the various discount values
  - Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
- Alternative hypothesis (Ha): Differences among the means of per-order revenues resulting from the various discount values are unlikely to be due to chance
  - Ha: $\mu_i \neq \mu_j$ for at least one pair of discount levels $(i, j)$
df.groupby('Discount')['Revenue'].sum().plot(kind='bar',
title ="Revenues by Discount Level", figsize=(8, 5), fontsize=12)
plt.show()
Almost 60% of product revenues come from non-discounted products, which means that over 40% of product revenues come from discounted products. Among discounted products, those discounted by 5% bring in the most revenue, with the 25% discount level representing the next-highest amount (~\$100,000). The remaining discount levels with significant revenues are 10%, 15%, and 20%; all three are virtually equal, though 10% appears to bring in slightly more revenue than either 15% or 20%.
As in the first case, we can create some initial visualizations to see if the magnitude of the discount results in differences in the sales revenues distributions.
# function for plotting three different series
def hist_plot3(var1, var2, var3, label=None, label1=None, label2=None, alpha=0.5, bins=100, xlim=[0, 3500]):
    plt.hist(var1, label=label, color='r', alpha=alpha, bins=bins)
    plt.hist(var2, label=label1, color='b', alpha=alpha, bins=bins)
    plt.hist(var3, label=label2, color='g', alpha=alpha, bins=bins)  # distinct color so the third series is visible
    plt.title(f"Revenue amounts by occurrence (xlim = {xlim})")
    plt.xlabel("Revenue")
    plt.ylabel("Number")
    plt.legend(loc='best')
    plt.xlim(xlim)
    plt.show()
hist_plot3(total_revenue.Revenue, revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label='Avg order revenue, discounts and no discounts', label1='Avg order revenue, no discounts', label2='Avg order revenue, all discounts')
print(f"Total revenues (discount and no discount): {int(total_revenue.Revenue.sum())}")
print(f"Total revenues, no discount: {int(revenue_no_discount.Revenue.sum())}")
print(f"Total revenues from all discounted products: {int(revenue_all_discounts.Revenue.sum())}")
Total revenues (discount and no discount): 1265793
Total revenues, no discount: 750698
Total revenues from all discounted products: 515094
hist_plot2(revenue_no_discount.Revenue, revenue_all_discounts.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, ALL discounts', alpha=0.5, bins=100)
Because the revenues from the 1-4.9% discount group are so small, I will be ignoring that group for the purposes of this analysis.
hist_plot2(revenue_no_discount.Revenue, df_disc_05to10.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 5-9.9%', alpha=0.5)
print(f"Revenues from discount level of 5-9.99%: {int(df_disc_05to10.Revenue.sum())}")
print(f"Discount 5-9.99% as a percentage of total revenue: {round((df_disc_05to10.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 5-9.99% as a percentage of all discount revenue: {round((df_disc_05to10.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")
Revenues from discount level of 5-9.99%: 147681
Discount 5-9.99% as a percentage of total revenue: 11.667%
Discount 5-9.99% as a percentage of all discount revenue: 28.671%
hist_plot2(revenue_no_discount.Revenue, df_disc_10to15.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 10-14.9%', alpha=.5)
print(f"Revenues from discount level of 10-14.99%: {int(df_disc_10to15.Revenue.sum())}")
print(f"Discount 10-14.99% as a percentage of total revenue: {round((df_disc_10to15.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 10-14.99% as a percentage of all discount revenue: {round((df_disc_10to15.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")
Revenues from discount level of 10-14.99%: 91499
Discount 10-14.99% as a percentage of total revenue: 7.229%
Discount 10-14.99% as a percentage of all discount revenue: 17.764%
hist_plot2(revenue_no_discount.Revenue, df_disc_15to20.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 15-19.9%', alpha=0.5, bins=100)
print(f"Revenues from discount level of 15-19.99%: {int(df_disc_15to20.Revenue.sum())}")
print(f"Discount 15-19.99% as a percentage of total revenue: {round((df_disc_15to20.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 15-19.99% as a percentage of all discount revenue: {round((df_disc_15to20.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")
Revenues from discount level of 15-19.99%: 87506
Discount 15-19.99% as a percentage of total revenue: 6.913%
Discount 15-19.99% as a percentage of all discount revenue: 16.988%
hist_plot2(revenue_no_discount.Revenue, df_disc_20andup.Revenue, label='Avg order revenue, no discounts', label1='Avg order revenue, discounts 20% and up', alpha=0.5, bins=100)
print(f"Revenues from discount level of 20% and up: {int(df_disc_20andup.Revenue.sum())}")
print(f"Discount 20% and up as a percentage of total revenue: {round((df_disc_20andup.Revenue.sum()/total_revenue.Revenue.sum())*100, 3)}%")
print(f"Discount 20% and up as a percentage of all discount revenue: {round((df_disc_20andup.Revenue.sum()/revenue_all_discounts.Revenue.sum())*100, 3)}%")
Revenues from discount level of 20% and up: 188119
Discount 20% and up as a percentage of total revenue: 14.862%
Discount 20% and up as a percentage of all discount revenue: 36.521%
df.groupby(['Discount']).sum().sort_values(['Discount'], ascending=False)
Discount | Id | ProductId | OrderUnitPrice | OrderQty | CategoryId | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---
0.25 | 1648801 | 5820 | 4345.02 | 4349 | 643 | 98938.5675 | 7.816331 | 0.078163
0.20 | 1711705 | 6445 | 3787.81 | 4351 | 661 | 89181.1040 | 7.045473 | 0.070455
0.15 | 1672996 | 6351 | 3607.22 | 4456 | 595 | 87506.1740 | 6.913150 | 0.069132
0.10 | 1833523 | 7164 | 4354.68 | 4366 | 742 | 91499.1390 | 7.228602 | 0.072286
0.06 | 11077 | 60 | 34.00 | 2 | 4 | 63.9200 | 0.005050 | 0.000050
0.05 | 1972417 | 7311 | 5697.32 | 5182 | 783 | 147617.3745 | 11.662047 | 0.116620
0.04 | 11077 | 20 | 81.00 | 1 | 3 | 77.7600 | 0.006143 | 0.000061
0.03 | 33231 | 94 | 73.95 | 5 | 15 | 120.9105 | 0.009552 | 0.000096
0.02 | 22154 | 52 | 37.00 | 4 | 10 | 59.7800 | 0.004723 | 0.000047
0.01 | 11077 | 73 | 15.00 | 2 | 8 | 29.7000 | 0.002346 | 0.000023
0.00 | 14042897 | 54519 | 34467.91 | 28599 | 5448 | 750698.6100 | 59.306584 | 0.593066
print(f"The mean of the distribution is {round(revenue_no_discount.Revenue.mean(), 2)}")
hist_plot1(revenue_no_discount.Revenue, label="Average order revenues, no discounts") # plotting with xlim in place
The mean of the distribution is 570.01
Total revenues (discount and no discount): 1265793
Total revenues from all discounted products: 515094
Percentage of total revenue from discounts: 40.693%
# hist_plot1(df_disc_05to10.Revenue, xlim=None) # plotting without xlim to see outliers
print(f"The mean of the distribution is {round(df_disc_05to10.Revenue.mean(), 2)}")
hist_plot1(df_disc_05to10.Revenue, label="Average order revenues, discounts of 5-9.9%", print_stmt=False) # plotting with xlim in place
The mean of the distribution is 793.99
# hist_plot1(df_disc_10to15.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_10to15.Revenue.mean(), 2)}")
hist_plot1(df_disc_10to15.Revenue, label="Average order revenues, discounts of 10-14.9%", print_stmt=False)
The mean of the distribution is 528.9
# hist_plot1(df_disc_15to20.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_15to20.Revenue.mean(), 2)}")
hist_plot1(df_disc_15to20.Revenue, label="Average order revenues, discounts of 15-19.9%", print_stmt=False)
The mean of the distribution is 557.36
# hist_plot1(df_disc_20andup.Revenue, xlim=None)
print(f"The mean of the distribution is {round(df_disc_20andup.Revenue.mean(), 2)}")
hist_plot1(df_disc_20andup.Revenue, label="Average order revenues, discounts of 20% and up", xlim=[0, 3000], print_stmt=False)
The mean of the distribution is 597.21
A review of the histogram plots of revenues at the various discount levels shows that, much like the overall revenue frequency histograms, these revenue-by-discount subsets are not normally distributed. To run the statistical tests needed for the second question's hypotheses, I will repeatedly sample each dataset and then run the tests on the distributions of the sample means. Per the Central Limit Theorem, those distributions of means should be approximately normal, and so can be used for the subsequent tests (ANOVA and Tukey's test).
The ANOVA test will tell us whether at least one of the datasets differs significantly from the others (it doesn't tell us which one, though).
As with the t-test above, I will first perform resampling and build distributions of the means of those resampling events, yielding approximately normal distributions on which to run the ANOVA test.
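For reference, the sampling_mean helper used below was defined earlier in the notebook. A minimal sketch of that kind of resampling function, under the assumption that it draws a fixed number of samples with replacement and returns the per-sample means and standard deviations as a tuple (the actual sample size and return structure may differ), might look like this:
import numpy as np

def sampling_mean_sketch(series, n_events=300, sample_size=100):
    # Draw n_events random samples (with replacement) from a revenue series and
    # record each sample's mean and standard deviation.
    # Returns (list of sample means, list of sample standard deviations).
    means, stds = [], []
    for _ in range(n_events):
        sample = np.random.choice(series, size=sample_size, replace=True)
        means.append(sample.mean())
        stds.append(sample.std())
    return means, stds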
no_discounts = sampling_mean(revenue_no_discount.Revenue)
print(f"The mean of the distribution is {round(np.mean(no_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(no_discounts[1]), 2)}")
hist_plot1(no_discounts[0], label="Average order revenues, no discounts", xlim=([0, 1200]))
The mean of the distribution is 581.46
The std deviation of the distribution is 384.4
Total revenues (discount and no discount): 1265793
Total revenues from all discounted products: 515094
Percentage of total revenue from discounts: 40.693%
all_discounts = sampling_mean(revenue_all_discounts.Revenue)
print(f"The mean of the distribution is {round(np.mean(all_discounts[0]), 2)}")
print(f"The std deviation of the distribution is {round(np.std(all_discounts[1]), 2)}")
hist_plot1(all_discounts[0], label="Average order revenues, all discounts", xlim=([0, 1200]))
The mean of the distribution is 612.94
The std deviation of the distribution is 335.39
Total revenues (discount and no discount): 1265793
Total revenues from all discounted products: 515094
Percentage of total revenue from discounts: 40.693%
# testing that function results can be plugged into ttest_ind
scs.ttest_ind(no_discounts[0], all_discounts[0])
Ttest_indResult(statistic=-4.003612486005798, pvalue=7.022917511073499e-05)
As a reminder, here are the dataframes I've built so far:
- total_revenue -- Total revenues, with and without discounts
- revenue_no_discount -- All revenues resulting from sales of non-discounted products
- revenue_all_discounts -- All revenues resulting from sales of products at all discount levels
- df_disc_01to05.Revenue -- Products discounted 1-4.9%
- df_disc_05to10.Revenue -- Products discounted 5-9.9%
- df_disc_10to15.Revenue -- Products discounted 10-14.9%
- df_disc_15to20.Revenue -- Products discounted 15-19.9%
- df_disc_20andup.Revenue -- Products discounted 20% or more
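These subsets were created earlier in the notebook. As a rough illustration of how such discount-band slices can be built (the exact band boundaries and variable definitions used earlier may differ), boolean masks on the Discount column would look something like this:
# Hypothetical reconstruction of the discount-band subsets; df is the merged
# order/product dataframe (with the computed Revenue column) built earlier.
total_revenue = df                                  # all orders, discounted or not
revenue_no_discount = df.loc[df.Discount == 0.00]   # no discount at all
revenue_all_discounts = df.loc[df.Discount > 0.00]  # any discount level
df_disc_01to05 = df.loc[(df.Discount >= 0.01) & (df.Discount < 0.05)]
df_disc_05to10 = df.loc[(df.Discount >= 0.05) & (df.Discount < 0.10)]
df_disc_10to15 = df.loc[(df.Discount >= 0.10) & (df.Discount < 0.15)]
df_disc_15to20 = df.loc[(df.Discount >= 0.15) & (df.Discount < 0.20)]
df_disc_20andup = df.loc[df.Discount >= 0.20]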
total_rev_means = sampling_mean(total_revenue.Revenue)
np.mean(total_rev_means[0]), np.std(total_rev_means[1])
(583.0617516000001, 351.03901350261066)
revs_no_disc_means = sampling_mean(revenue_no_discount.Revenue)
np.mean(revs_no_disc_means[0]), np.std(revs_no_disc_means[1])
(564.6655116666666, 345.1591629191503)
revs_all_disc_means = sampling_mean(revenue_all_discounts.Revenue)
np.mean(revs_all_disc_means[0]), np.std(revs_all_disc_means[1])
(622.9363593333333, 385.2859280957691)
Note--I'm not including revenues from 1-4.9% discounts because there are only 7 data points--not enough revenue points in this sample to compute these statistics
df_5to10_disc_means = sampling_mean(df_disc_05to10.Revenue)
np.mean(df_5to10_disc_means[0]), np.std(df_5to10_disc_means[1])
(785.3587769166667, 388.22407082181945)
df_10to15_disc_means = sampling_mean(df_disc_10to15.Revenue)
np.mean(df_10to15_disc_means[0]), np.std(df_10to15_disc_means[1])
(529.1909292, 51.08710103154266)
df_15to20_disc_means = sampling_mean(df_disc_15to20.Revenue)
np.mean(df_15to20_disc_means[0]), np.std(df_15to20_disc_means[1])
(555.7451172499999, 51.47884666219702)
df_20andup_disc_means = sampling_mean(df_disc_20andup.Revenue)
np.mean(df_20andup_disc_means[0]), np.std(df_20andup_disc_means[1])
(595.8233765, 239.43192635432803)
The ANOVA test is run with scipy.stats.f_oneway(*args), imported here as scs.f_oneway. A p-value below 0.05 indicates that we should reject the null hypothesis that all of the group means are equal.
scs.f_oneway(revs_no_disc_means[0], df_5to10_disc_means[0], df_10to15_disc_means[0], df_15to20_disc_means[0], df_20andup_disc_means[0])
F_onewayResult(statistic=628.196879270906, pvalue=3.506107e-318)
The extremely low p-value of the ANOVA test tells us that we should reject the null hypothesis (that there is no statistically significant difference between the distributions of means) and strongly suggests that at least one of the distributions of means differs significantly from at least one other.
Because ANOVA doesn't tell us which variables differ from others, I'll run Tukey's test to find out.
This test can tell us which variables are significantly different from one another.
Per the Codecademy Hypothesis Testing presentation by Hillary Green-Lerman, three things are required for pairwise_tukeyhsd:
- A vector of all data (concatenated using np.concatenate)
- A vector of labels for the data
- A level of significance (usually 0.05)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
no_disc = np.array(revs_no_disc_means[0]) # the number of means in these lists
disc_05to10 = np.array(df_5to10_disc_means[0]) # equals the sampling events input to
disc_10to15 = np.array(df_10to15_disc_means[0]) # the sampling function that was run to
disc_15to20 = np.array(df_15to20_disc_means[0]) # obtain this data
disc_20andup = np.array(df_20andup_disc_means[0])
v = np.concatenate([no_disc, disc_05to10, disc_10to15, disc_15to20, disc_20andup])
labels = ['no_disc'] * len(no_disc) + ['disc_05to10'] * len(disc_05to10) + ['disc_10to15'] * len(disc_10to15) + ['disc_15to20']*len(disc_15to20) + ['disc_20andup']*len(disc_20andup)
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)
type(tukey_results)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================================
group1 group2 meandiff lower upper reject
--------------------------------------------------------------
disc_05to10 disc_10to15 -256.1678 -272.0348 -240.3009 True
disc_05to10 disc_15to20 -229.6137 -245.4806 -213.7468 True
disc_05to10 disc_20andup -189.5354 -205.4023 -173.6685 True
disc_05to10 no_disc -220.6933 -236.5602 -204.8264 True
disc_10to15 disc_15to20 26.5542 10.6873 42.4211 True
disc_10to15 disc_20andup 66.6324 50.7655 82.4993 True
disc_10to15 no_disc 35.4746 19.6077 51.3415 True
disc_15to20 disc_20andup 40.0783 24.2114 55.9452 True
disc_15to20 no_disc 8.9204 -6.9465 24.7873 False
disc_20andup no_disc -31.1579 -47.0248 -15.291 True
--------------------------------------------------------------
statsmodels.sandbox.stats.multicomp.TukeyHSDResults
On most runs of the sampling function, the result is that we can reject the null hypothesis for all pairs except the disc_15to20 / no_disc pair. This means that each of these distributions of means (except for the disc_15to20 / no_disc pair) is statistically significantly different from the others at the 0.05 level.
I have noticed that if I reduce the number of sampling events in the function, I occasionally see slightly different results. Sometimes I get a "reject the null hypothesis" result for every single pair; other times I get "False" for a few of the pairs (usually the 15-20% / no-discount pair, though occasionally one or two others). Increasing the number of sampling events from 200 to 300 seems to eliminate that variability.
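One way to make these resampling results repeatable from run to run (a suggestion only; the notebook does not set a seed) is to fix NumPy's random seed before resampling:
import numpy as np

np.random.seed(42)  # any fixed seed makes np.random-based resampling reproducible across runs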
Two really interesting things emerge from running the Tukey test:
- The differences between the means of the disc_15to20 distribution and the means of the no_disc (no discounts) group are NOT statistically significant. 15-19.9% seems to be a fairly large discount, so I am a bit surprised to see that this pair was not significantly different.
- ALL other pair comparisons result in statistically significant differences.
  - While perhaps this is not surprising, it does make me wonder whether there is a linear relationship between discount levels, or whether it's simply the presence of a discount that matters.
  - It's important to remember that sales from discounts of 1-4.9% were so low (only \$288 of over \$1.2 million in total revenues) that they were not included; there were only 7 data points in that category. This suggests that either such small discounts don't work, or that the company (or its salespeople) simply doesn't offer them.
To find out the effect sizes associated with the various discount levels, we can compute Cohen's d for each pairing to see which differences matter most. Except for one pairing, all were statistically significant, so I will run Cohen's d on each of the other pairs.
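Cohen_d and plot_pdfs were defined earlier in the notebook. As a reminder of what is being computed, here is a minimal pooled-standard-deviation Cohen's d sketch (the earlier implementation may pool the variances slightly differently):
import numpy as np

def cohen_d_sketch(group1, group2):
    # Cohen's d: difference of group means divided by the pooled standard deviation
    g1, g2 = np.asarray(group1), np.asarray(group2)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)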
d = Cohen_d(disc_05to10, disc_10to15)
plot_pdfs(d)
This Cohen's d result (3.51500644) represents a difference of 3.52 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_10to15
The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 10-14.9%.
A possible mechanism with two opposing forces might explain why we see this outcome:
- A greater discount may drive customers to purchase more... BUT...
- The larger discount may ultimately put more downward pressure on the total revenue per sale...the customer may or may not be buying more stuff, but what they are buying, they are getting for a lower price --> lower revenue per unit
d = Cohen_d(disc_05to10, disc_15to20)
plot_pdfs(d)
This Cohen's d result (3.15453316) represents a difference of 3.15 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_15to20
The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 15-19.9%.
d = Cohen_d(disc_05to10, disc_20andup)
plot_pdfs(d)
This Cohen's d result (2.18613461) represents a difference of 2.19 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = disc_20andup
The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from discounts of 20% and up.
d = Cohen_d(disc_05to10, no_disc)
plot_pdfs(d)
This Cohen's d result (2.40795871) represents a difference of 2.41 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_05to10 ; blue = group2 = no_disc
The means of revenues from discounts of 5-9.9% are significantly higher than the means of revenues from non-discounted products.
d = Cohen_d(disc_10to15, disc_15to20)
plot_pdfs(d)
This Cohen's d result (-0.72665591) represents a difference of -0.73 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = disc_15to20
The means of revenues from discounts of 10-14.9% are somewhat lower than the means of revenues from discounts of 15-19.9%. Another slightly counterintuitive result.
d = Cohen_d(disc_10to15, disc_20andup)
plot_pdfs(d)
This Cohen's d result (-1.11771677) represents a difference of -1.12 pooled standard deviations between the two groups, and suggests a **large** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = disc_20andup
The means of revenues from discounts of 10-14.9% are lower than the means of revenues from discounts of 20% and up.
d = Cohen_d(disc_10to15, no_disc)
plot_pdfs(d)
This Cohen's d result (-0.53255028) represents a difference of -0.53 pooled standard deviations between the two groups, and suggests a **medium** effect size.
Magenta = group1 = disc_10to15 ; blue = group2 = no_disc
The means of revenues from discounts of 10-14.9% are slightly lower than the means of revenues from non-discounted products. This is a somewhat counterintuitive result that warrants further investigation.
d = Cohen_d(disc_15to20, disc_20andup)
plot_pdfs(d)
This Cohen's d result (-0.67352883) represents a difference of -0.67 pooled standard deviations between the two groups, and suggests a **medium** effect size.
Magenta = group1 = disc_15to20 ; blue = group2 = disc_20andup
The means of revenues from discounts of 15-19.9% are slightly lower than the means of revenues from discounts of 20% and up.
d = Cohen_d(disc_15to20, no_disc)
plot_pdfs(d)
# note that this pairing was not found to be statistically significant in Tukey's test.
This Cohen's d result (-0.13411241) represents a difference of -0.13 pooled standard deviations between the two groups, and suggests a **small** effect size.
Magenta = group1 = disc_15to20 ; blue = group2 = no_disc
The means of revenues from discounts of 15-19.9% are very similar to the means of revenues from non-discounted products.
d = Cohen_d(disc_20andup, no_disc)
plot_pdfs(d)
This Cohen's d result (0.38229256) represents a difference of 0.38 pooled standard deviations between the two groups, and suggests a **small** effect size.
Magenta = group1 = disc_20andup ; blue = group2 = no_disc
The means of revenues from discounts of 20% and up are slightly higher than the means of revenues from non-discounted products.
I've concluded the following after running Tukey's Test and Cohen's d for each pairwise comparison:
- Discounts at the 5-10% level appear to drive the highest revenues per order, relative to all other discount levels (or no discount at all)
  - While higher discounts are likely to drive more purchases, the per-unit revenue will decrease with larger discounts.
  - Perhaps this relatively small discount level is enough to get customers to buy more while preserving higher revenue per unit
- Generally, as the discount increases, these differences shrink (though they are still quite significant)
  - One notable exception is that in the "5-10% discount vs. no discount" pair, the difference between the means is not as great as it is for the "5-10% discount vs. 10-15% discount" pair
  - It is possible that, in the cases where the difference between a given discount level and no discount is less than one would expect, there are some high-volume and/or high-priced products that are never discounted that drive the no-discount revenue means upward.
- More investigation is needed to derive insights and make recommendations; questions to pursue include:
  - Does the pattern of revenue by discounts hold across all categories, or is there a subset of products that create the outcomes we see?
  - Similarly, is there a relationship between geography/sales region and the discounts that drive more revenues?
  - Do we see certain discount levels drive more revenues at certain times of the year?
  - Is there a linear or non-linear relationship between discount level and revenues?
  - Does removal of outliers dramatically affect the relationships observed?
Does the distribution of revenues per order vary in a statistically significant way across product categories?
This analysis can show us whether the distribution of revenues per order varies in a significant way across product categories. This can help us understand if orders in certain product categories generate higher revenues on average.
- Null hypothesis (Ho): No significant difference among the means of the order revenue distributions by product category
  - Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
- Alternative hypothesis (Ha): Differences among the means of the order revenue distributions by product category are unlikely to be due to chance
  - Ha: $\mu_i \neq \mu_j$ for at least one pair of categories $(i, j)$
# df.groupby(['CategoryId', 'CategoryName', 'CatDescription']).sum().sort_values(['CategoryId'])
df.groupby(['CategoryId', 'CategoryName', 'Discount']).count().head(10)
CategoryId | CategoryName | Discount | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | ProductName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Beverages | 0.00 | 246 | 246 | 246 | 246 | 246 | 246 | 246 | 246 | 246 | 246 | 246
1 | Beverages | 0.05 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26
1 | Beverages | 0.10 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25
1 | Beverages | 0.15 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36
1 | Beverages | 0.20 | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39 | 39
1 | Beverages | 0.25 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32
2 | Condiments | 0.00 | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 131
2 | Condiments | 0.02 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
2 | Condiments | 0.05 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19
2 | Condiments | 0.10 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23
df.groupby(['CategoryId', 'CategoryName', 'Discount']).sum().head(20)
CategoryId | CategoryName | Discount | Id | ProductId | OrderUnitPrice | OrderQty | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---
1 | Beverages | 0.00 | 2623997 | 10381 | 7161.95 | 5116 | 153060.2500 | 12.092044 | 0.120920
1 | Beverages | 0.05 | 277313 | 1038 | 1507.75 | 898 | 43567.3800 | 3.441904 | 0.034419
1 | Beverages | 0.10 | 264816 | 1219 | 812.20 | 728 | 13027.6350 | 1.029207 | 0.010292
1 | Beverages | 0.15 | 384608 | 1348 | 609.65 | 885 | 12889.9950 | 1.018334 | 0.010183
1 | Beverages | 0.20 | 418197 | 1353 | 1005.40 | 1140 | 29569.3200 | 2.336031 | 0.023360
1 | Beverages | 0.25 | 343213 | 1017 | 714.70 | 765 | 15753.6000 | 1.244564 | 0.012446
2 | Condiments | 0.00 | 1399149 | 5724 | 2834.55 | 2793 | 60577.8000 | 4.785759 | 0.047858
2 | Condiments | 0.02 | 11077 | 6 | 25.00 | 1 | 24.5000 | 0.001936 | 0.000019
2 | Condiments | 0.05 | 202371 | 953 | 337.90 | 719 | 11891.5300 | 0.939453 | 0.009395
2 | Condiments | 0.10 | 243158 | 995 | 449.30 | 536 | 8934.7050 | 0.705858 | 0.007059
2 | Condiments | 0.15 | 180730 | 890 | 373.65 | 519 | 10639.0250 | 0.840503 | 0.008405
2 | Condiments | 0.20 | 148734 | 563 | 340.50 | 312 | 6105.2000 | 0.482322 | 0.004823
2 | Condiments | 0.25 | 118286 | 426 | 244.40 | 418 | 7874.3250 | 0.622086 | 0.006221
3 | Confections | 0.00 | 2131272 | 7572 | 4636.56 | 4618 | 107128.4700 | 8.463348 | 0.084633
3 | Confections | 0.03 | 11077 | 16 | 17.45 | 2 | 33.8530 | 0.002674 | 0.000027
3 | Confections | 0.04 | 11077 | 20 | 81.00 | 1 | 77.7600 | 0.006143 | 0.000061
3 | Confections | 0.05 | 309325 | 843 | 705.63 | 795 | 18598.5775 | 1.469322 | 0.014693
3 | Confections | 0.10 | 318611 | 1062 | 584.23 | 545 | 10200.7440 | 0.805878 | 0.008059
3 | Confections | 0.15 | 308602 | 951 | 617.37 | 818 | 15433.5690 | 1.219281 | 0.012193
3 | Confections | 0.20 | 180335 | 660 | 314.23 | 338 | 4716.2640 | 0.372594 | 0.003726
df.groupby('CategoryName')['Discount'].sum().plot(kind='bar',
title ="Discount Level in Percentages by Product Category", figsize=(8, 5), fontsize=12)
plt.show()
The category with the greatest cumulative discounting is Beverages, with Seafood, Dairy Products, and Confections nearly tied for second place at a summed discount value of roughly 20 each. (Because this chart sums the per-order discount fractions, bar height reflects both how often and how deeply a category is discounted.)
Let's also look at revenues by product category and see how that compares to percentage discount:
df.groupby('CategoryName')['Revenue'].sum().plot(kind='bar',
title ="Revenues by Category Name", figsize=(8, 5), fontsize=12)
plt.show()
df.groupby('CategoryName')['Discount'].count().plot(kind='bar',
title ="Orders by Product Category", figsize=(8, 5), fontsize=12)
plt.show()
Interestingly, Beverages has the highest revenues of any category at around \$260,000, followed closely by Dairy Products at around \$230,000, then Confections at around \$170,000 and Meat/Poultry at around \$160,000. Although Seafood is among the more heavily discounted categories, it is 5th of 8 in terms of revenue, at around \$140,000.
Some observations and comments:
- For a category such as seafood, which is likely to include both highly perishable items and shelf-stable items (e.g., canned tuna, salmon, clams), it may not be surprising to see that discounts make up a higher proportion of revenues than is the case for less perishable items.
- On the other hand, Produce has fairly low discounts and relatively low revenues. A couple of possibilities might warrant further exploration:
  - Although perishable, produce quantities and discounts might already be optimized for the category's highly perishable nature; in other words, maybe Northwind only has access to a certain amount of produce from suppliers, and that amount tends to sell out (rather than having to be discarded).
  - Even if there is some truth to that explanation, Produce typically experiences relatively high rates of loss to spoilage compared to other categories, so there might be opportunities to capture more revenue by offering greater discounts on selected items (thereby selling produce that might otherwise go to waste).
- Condiments and Cereals/Grains are almost carbon copies of each other in terms of both discount level and revenues. The relatively low cost and shelf-stable nature of these products likely leads to modest discounts (lower spoilage rates mean less urgency to induce purchases before items spoil) and modest revenues, given their general affordability.
df_bev = df.loc[df["CategoryName"] == "Beverages"]
df_bev.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.2 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
df_cond = df.loc[df["CategoryName"] == "Condiments"]
df_cond.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
82 | 10289 | BSBEV | 2012-08-26 | 3 | 8.0 | 30 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 240.0 | 0.018960 | 0.000190 |
83 | 10405 | LINOD | 2013-01-06 | 3 | 8.0 | 50 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 400.0 | 0.031601 | 0.000316 |
84 | 10485 | LINOD | 2013-03-25 | 3 | 8.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 144.0 | 0.011376 | 0.000114 |
df_confect = df.loc[df["CategoryName"] == "Confections"]
df_confect.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
336 | 10255 | RICSU | 2012-07-12 | 16 | 13.9 | 35 | 0.00 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 486.5 | 0.038434 | 0.000384 |
337 | 10263 | ERNSH | 2012-07-23 | 16 | 13.9 | 60 | 0.25 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 625.5 | 0.049416 | 0.000494 |
338 | 10287 | RICAR | 2012-08-22 | 16 | 13.9 | 40 | 0.15 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 472.6 | 0.037336 | 0.000373 |
df_dairy = df.loc[df["CategoryName"] == "Dairy Products"]
df_dairy.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
216 | 10248 | VINET | 2012-07-04 | 11 | 14.0 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 168.0 | 0.013272 | 0.000133 |
217 | 10296 | LILAS | 2012-09-03 | 11 | 16.8 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 201.6 | 0.015927 | 0.000159 |
218 | 10327 | FOLKO | 2012-10-11 | 11 | 16.8 | 50 | 0.2 | Queso Cabrales | 4 | Dairy Products | Cheeses | 672.0 | 0.053089 | 0.000531 |
df_grains = df.loc[df["CategoryName"] == "Grains/Cereals"]
df_grains.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
535 | 10251 | VICTE | 2012-07-08 | 22 | 16.8 | 6 | 0.05 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 95.76 | 0.007565 | 0.000076 |
536 | 10435 | CONSH | 2013-02-04 | 22 | 16.8 | 12 | 0.00 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 201.60 | 0.015927 | 0.000159 |
537 | 10553 | WARTH | 2013-05-30 | 22 | 21.0 | 24 | 0.00 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 504.00 | 0.039817 | 0.000398 |
df_meat = df.loc[df["CategoryName"] == "Meat/Poultry"]
df_meat.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
178 | 10420 | WELLI | 2013-01-21 | 9 | 77.6 | 20 | 0.10 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1396.8 | 0.110350 | 0.001103 |
179 | 10515 | QUICK | 2013-04-23 | 9 | 97.0 | 16 | 0.15 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1319.2 | 0.104219 | 0.001042 |
180 | 10687 | HUNGO | 2013-09-30 | 9 | 97.0 | 50 | 0.25 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 3637.5 | 0.287369 | 0.002874 |
df_produce = df.loc[df["CategoryName"] == "Produce"]
df_produce.head(3)
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
136 | 10262 | RATTC | 2012-07-22 | 7 | 24.0 | 15 | 0.00 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 360.0 | 0.028441 | 0.000284 |
137 | 10385 | SPLIR | 2012-12-17 | 7 | 24.0 | 10 | 0.20 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 192.0 | 0.015168 | 0.000152 |
138 | 10459 | VICTE | 2013-02-27 | 7 | 24.0 | 16 | 0.05 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 364.8 | 0.028820 | 0.000288 |
df_seafood = df.loc[df["CategoryName"] == "Seafood"]
df_seafood.head(3)
# df_seafood.info()
 | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
183 | 10273 | QUICK | 2012-08-05 | 10 | 24.8 | 24 | 0.05 | Ikura | 8 | Seafood | Seaweed and fish | 565.44 | 0.044671 | 0.000447 |
184 | 10276 | TORTU | 2012-08-08 | 10 | 24.8 | 15 | 0.00 | Ikura | 8 | Seafood | Seaweed and fish | 372.00 | 0.029389 | 0.000294 |
185 | 10357 | LILAS | 2012-11-19 | 10 | 24.8 | 30 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 595.20 | 0.047022 | 0.000470 |
print()
print("Category: Beverages")
hist_plot1(df_bev.Revenue, label="Avg order revenue, discount and no discount, Beverages", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_bev.Revenue.mean(), 2)}")
hist_plot1(df_bev.Revenue, label="Avg order revenue, discount and no discount, Beverages", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_bev.Revenue.mean(), 2)}")
print()
print()
# note the large outliers
Category: Beverages
The mean of the distribution is 663.04
The mean of the distribution is 663.04
print()
print()
print("Category: Condiments")
# hist_plot1(df_cond.Revenue, color='tab:blue')
# print(f"The mean of the distribution is {round(df_cond.Revenue.mean(), 2)}")
hist_plot1(df_cond.Revenue, label="Avg order revenue, discount and no discount, Condiments", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_cond.Revenue.mean(), 2)}")
print()
print()
Category: Condiments
The mean of the distribution is 490.96
print()
print()
print("Category: Confections")
hist_plot1(df_confect.Revenue, label="Avg order revenue, discount and no discount, Confections", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_confect.Revenue.mean(), 2)}")
# hist_plot1(df_confect.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_confect.Revenue.mean(), 2)}")
print()
print()
Category: Confections
The mean of the distribution is 501.07
print()
print()
print("Category: Dairy Products")
hist_plot1(df_dairy.Revenue, label="Avg order revenue, all, Dairy Products", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_dairy.Revenue.mean(), 2)}")
# hist_plot1(df_dairy.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_dairy.Revenue.mean(), 2)}")
print()
print()
Category: Dairy Products
The mean of the distribution is 640.73
print()
print()
print("Category: Meat/Poultry")
hist_plot1(df_meat.Revenue, label="Avg order revenue, discount/no discount, Meat/Poultry",color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_meat.Revenue.mean(), 2)}")
# hist_plot1(df_meat.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_meat.Revenue.mean(), 2)}")
print()
print()
Category: Meat/Poultry
The mean of the distribution is 942.33
print()
print()
print("Category: Grains/Cereals")
hist_plot1(df_grains.Revenue, label="Avg order revenue, discount/no discount, Grains/Cereals", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_grains.Revenue.mean(), 2)}")
# hist_plot1(df_grains.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_grains.Revenue.mean(), 2)}")
print()
print()
Category: Grains/Cereals
The mean of the distribution is 488.49
print()
print()
print("Category: Produce")
hist_plot1(df_produce.Revenue, label="Avg order revenue, discount/no discount, Produce", color='tab:blue', print_stmt=False)
print(f"The mean of the distribution is {round(df_produce.Revenue.mean(), 2)}")
# hist_plot1(df_produce.Revenue, color='tab:blue', xlim=None)
# print(f"The mean of the distribution is {round(df_produce.Revenue.mean(), 2)}")
print()
print()
Category: Produce
The mean of the distribution is 735.18
print()
print()
print("Category: Seafood")
# hist_plot1(df_seafood.Revenue, color='tab:blue')
# print(f"The mean of the distribution is {round(df_seafood.Revenue.mean(), 2)}")
hist_plot1(df_seafood.Revenue, label="Avg order revenue, discount/no discount, Seafood", color='tab:blue', xlim=None, print_stmt=False)
print(f"The mean of the distribution is {round(df_seafood.Revenue.mean(), 2)}")
print()
print()
Category: Seafood
The mean of the distribution is 397.76
As with question 2, to ensure reasonably normal distributions for hypothesis testing, I will perform resampling for each product category to get a distribution of means.
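The cells below repeat the same resample-and-summarize step once per category. An equivalent, more compact pattern (a sketch that assumes the same sampling_mean helper and the df dataframe used above) would loop over the categories:
# Sketch: resample each category's per-order revenues in a single loop
category_means = {}
for cat_name, cat_df in df.groupby('CategoryName'):
    samples = sampling_mean(cat_df.Revenue)       # same helper used in the cells below
    category_means[cat_name] = samples[0]         # samples[0] holds the resampled means
    print(f"{cat_name}: mean of resampled means = {round(np.mean(samples[0]), 2)}")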
bev_means = sampling_mean(df_bev.Revenue)
mean_bev_means = np.mean(bev_means[0])
std_bev_means = np.std(bev_means[1])
print(f"The mean of the means of beverage order revenues resampling is {round(mean_bev_means,2)}")
print(f"The std deviation of the means of beverage order revenues resampling is {round(std_bev_means,2)}")
hist_plot1(bev_means[0], label="Resampling means, order revenues, Beverages", xlim=([0, 1400]), print_stmt=False)
The mean of the means of beverage order revenues resampling is 663.44
The std deviation of the means of beverage order revenues resampling is 486.78
cond_means = sampling_mean(df_cond.Revenue)
mean_cond_means = np.mean(cond_means[0])
std_cond_means = np.std(cond_means[1])
print(f"The mean of the means of Condiments order revenues resampling is {round(mean_cond_means,2)}")
print(f"The std deviation of the means of Condiments order revenues resampling is {round(std_cond_means,2)}")
hist_plot1(cond_means[0], label="Resampling means, order revenues, Condiments", xlim=([0, 1000]), print_stmt=False)
The mean of the means of Condiments order revenues resampling is 491.51
The std deviation of the means of Condiments order revenues resampling is 40.06
confect_means = sampling_mean(df_confect.Revenue)
mean_confect_means = np.mean(confect_means[0])
std_confect_means = np.std(confect_means[1])
print(f"The mean of the means of Confections order revenues resampling is {round(mean_confect_means,2)}")
print(f"The std deviation of the means of Confections order revenues resampling is {round(std_confect_means,2)}")
hist_plot1(confect_means[0], label="Resampling means, order revenues, Confections", xlim=([0, 1000]), print_stmt=False)
The mean of the means of Confections order revenues resampling is 499.85
The std deviation of the means of Confections order revenues resampling is 100.95
meat_means = sampling_mean(df_meat.Revenue)
mean_meat_means = np.mean(meat_means[0])
std_meat_means = np.std(meat_means[1])
round(mean_meat_means,2)
print(f"The mean of the means of Meat/Poultry order revenues resampling is {round(mean_meat_means,2)}")
print(f"The std deviation of the means of Meat/Poultry order revenues resampling is {round(std_meat_means,2)}")
hist_plot1(meat_means[0], label="Resampling means, order revenues, Meat/Poultry", xlim=([0, 2000]), print_stmt=False)
The mean of the means of Meat/Poultry order revenues resampling is 938.78
The std deviation of the means of Meat/Poultry order revenues resampling is 184.78
dairy_means = sampling_mean(df_dairy.Revenue)
mean_dairy_means = np.mean(dairy_means[0])
std_dairy_means = np.std(dairy_means[1])
round(mean_dairy_means,2)
print(f"The mean of the means of Dairy Products order revenues resampling is {round(mean_dairy_means,2)}")
print(f"The std deviation of the means of Dairy Products order revenues resampling is {round(std_dairy_means,2)}")
hist_plot1(dairy_means[0], label="Resampling means, order revenues, Dairy Products", xlim=([0, 1200]), print_stmt=False)
The mean of the means of Dairy Products order revenues resampling is 645.36
The std deviation of the means of Dairy Products order revenues resampling is 119.31
grains_means = sampling_mean(df_grains.Revenue)
mean_grains_means = np.mean(grains_means[0])
std_grains_means = np.std(grains_means[1])
round(mean_grains_means,2)
print(f"The mean of the means of Grains/Cereals Products order revenues resampling is {round(mean_grains_means,2)}")
print(f"The std deviation of the means of Grains/Cereals Products order revenues resampling is {round(std_grains_means,2)}")
hist_plot1(grains_means[0], label="Resampling means, order revenues, Grains/Cereals", xlim=([0, 1000]), print_stmt=False)
The mean of the means of Grains/Cereals Products order revenues resampling is 487.07
The std deviation of the means of Grains/Cereals Products order revenues resampling is 80.64
produce_means = sampling_mean(df_produce.Revenue)
mean_produce_means = np.mean(produce_means[0])
std_produce_means = np.std(produce_means[1])
round(mean_produce_means,2)
print(f"The mean of the means of Produce order revenues resampling is {round(mean_produce_means,2)}")
print(f"The std deviation of the means of Produce order revenues resampling is {round(std_produce_means,2)}")
hist_plot1(produce_means[0], label="Resampling means, order revenues, Produce", xlim=([0, 1400]), print_stmt=False)
The mean of the means of Produce order revenues resampling is 734.53
The std deviation of the means of Produce order revenues resampling is 92.73
seafood_means = sampling_mean(df_seafood.Revenue)
mean_seafood_means = np.mean(seafood_means[0])
std_seafood_means = np.std(seafood_means[1])
round(mean_seafood_means,2)
print(f"The mean of the means of Seafood order revenues resampling is {round(mean_seafood_means,2)}")
print(f"The std deviation of the means of Seafood order revenues resampling is {round(std_seafood_means,2)}")
hist_plot1(seafood_means[0], label="Resampling means, order revenues, Seafood", xlim=([0, 800]), print_stmt=False)
The mean of the means of Seafood order revenues resampling is 398.49
The std deviation of the means of Seafood order revenues resampling is 61.55
means = [mean_bev_means, mean_cond_means, mean_confect_means, mean_meat_means,
mean_dairy_means, mean_grains_means, mean_produce_means, mean_seafood_means]
names = ['Beverages', 'Condiments', 'Confections',
'Meat/Poultry', 'Dairy Products', 'Grains/Cereals',
'Produce','Seafood']
df_means = pd.DataFrame(means, names)
df_means.plot(kind='bar', color='tab:blue', alpha=.9, legend=None, title ="Average Order Revenue by Product Category", figsize=(8, 5), fontsize=12)
plt.show()
scs.f_oneway(bev_means[0], cond_means[0], confect_means[0], meat_means[0], dairy_means[0], grains_means[0], produce_means[0], seafood_means[0])
F_onewayResult(statistic=1776.047633256583, pvalue=0.0)
The ANOVA test on the sample means shows an infinitesimally small p-value, meaning that we should reject the null hypothesis that there is no statistically significant difference among the revenue distributions of the different product categories.
To find out which distributions are different enough from one another to be statistically significant, I will run Tukey's test.
bev_means_np = np.array(bev_means[0])
cond_means_np = np.array(cond_means[0])
confect_means_np = np.array(confect_means[0])
dairy_means_np = np.array(dairy_means[0])
meat_means_np = np.array(meat_means[0])
grains_means_np = np.array(grains_means[0])
produce_means_np = np.array(produce_means[0])
seafood_means_np = np.array(seafood_means[0])
v = np.concatenate([bev_means_np, cond_means_np, confect_means_np,
dairy_means_np, meat_means_np, grains_means_np,
produce_means_np, seafood_means_np])
labels = ['Beverages'] * len(bev_means_np) + ['Condiments'] * len(cond_means_np) + ['Confections'] * len(confect_means_np) + ['Dairy Products']*len(dairy_means_np) + ['Meat/Poultry']*len(meat_means_np) + ['Cereals/Grains']*len(grains_means_np) + ['Produce']*len(produce_means_np) + ['Seafood']*len(seafood_means_np)
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)
type(tukey_results)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==================================================================
group1 group2 meandiff lower upper reject
------------------------------------------------------------------
Beverages Cereals/Grains -176.3666 -194.1524 -158.5807 True
Beverages Condiments -171.9365 -189.7223 -154.1507 True
Beverages Confections -163.593 -181.3788 -145.8072 True
Beverages Dairy Products -18.0862 -35.872 -0.3004 True
Beverages Meat/Poultry 275.3426 257.5568 293.1284 True
Beverages Produce 71.0916 53.3058 88.8774 True
Beverages Seafood -264.9544 -282.7402 -247.1686 True
Cereals/Grains Condiments 4.4301 -13.3557 22.2159 False
Cereals/Grains Confections 12.7736 -5.0122 30.5594 False
Cereals/Grains Dairy Products 158.2803 140.4945 176.0662 True
Cereals/Grains Meat/Poultry 451.7091 433.9233 469.4949 True
Cereals/Grains Produce 247.4582 229.6724 265.244 True
Cereals/Grains Seafood -88.5879 -106.3737 -70.8021 True
Condiments Confections 8.3435 -9.4423 26.1293 False
Condiments Dairy Products 153.8503 136.0644 171.6361 True
Condiments Meat/Poultry 447.279 429.4932 465.0649 True
Condiments Produce 243.0281 225.2423 260.8139 True
Condiments Seafood -93.018 -110.8038 -75.2321 True
Confections Dairy Products 145.5068 127.7209 163.2926 True
Confections Meat/Poultry 438.9355 421.1497 456.7214 True
Confections Produce 234.6846 216.8988 252.4704 True
Confections Seafood -101.3615 -119.1473 -83.5756 True
Dairy Products Meat/Poultry 293.4288 275.643 311.2146 True
Dairy Products Produce 89.1778 71.392 106.9637 True
Dairy Products Seafood -246.8682 -264.654 -229.0824 True
Meat/Poultry Produce -204.2509 -222.0368 -186.4651 True
Meat/Poultry Seafood -540.297 -558.0828 -522.5112 True
Produce Seafood -336.046 -353.8319 -318.2602 True
------------------------------------------------------------------
statsmodels.sandbox.stats.multicomp.TukeyHSDResults
We can get a sense of the magnitude of the statistically significant differences among the product categories by computing and plotting Cohen's d for each pair.
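Rather than writing out each pairing by hand as in the cells that follow, the same set of pairwise comparisons could be generated with itertools.combinations. This is a sketch that assumes the Cohen_d and plot_pdfs helpers defined earlier accept and return a numeric d value, as their usage elsewhere in the notebook suggests:
from itertools import combinations

# Sketch: compute and plot Cohen's d for every unique pair of category distributions
category_arrays = {'Beverages': bev_means_np, 'Condiments': cond_means_np,
                   'Confections': confect_means_np, 'Dairy Products': dairy_means_np,
                   'Meat/Poultry': meat_means_np, 'Cereals/Grains': grains_means_np,
                   'Produce': produce_means_np, 'Seafood': seafood_means_np}
for (name1, arr1), (name2, arr2) in combinations(category_arrays.items(), 2):
    d = Cohen_d(arr1, arr2)
    print(f"{name1} vs. {name2}: Cohen's d = {round(d, 2)}")
    plot_pdfs(d)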
d_bev_cond = Cohen_d(bev_means_np, cond_means_np)
plot_pdfs(d_bev_cond)
This Cohen's d result (1.62585469) represents a difference of 1.63 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_bev_dairy = Cohen_d(bev_means_np, dairy_means_np)
plot_pdfs(d_bev_dairy)
This Cohen's d result (0.16055547) represents a difference of 0.16 pooled standard deviations between the two groups, and suggests a **small** effect size.
d_bev_meat = Cohen_d(bev_means_np, meat_means_np)
plot_pdfs(d_bev_meat)
This Cohen's d result (-2.27595947) represents a difference of -2.28 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_bev_grains = Cohen_d(bev_means_np, grains_means_np)
plot_pdfs(d_bev_grains)
This Cohen's d result (1.65508548) represents a difference of 1.66 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_bev_produce = Cohen_d(bev_means_np, produce_means_np)
plot_pdfs(d_bev_produce)
This Cohen's d result (-0.66677525) represents a difference of -0.67 pooled standard deviations between the two groups, and suggests a **medium** effect size.
d_bev_seafood = Cohen_d(bev_means_np, seafood_means_np)
plot_pdfs(d_bev_seafood)
This Cohen's d result (2.49154765) represents a difference of 2.49 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_cond_confect = Cohen_d(cond_means_np, confect_means_np)
plot_pdfs(d_cond_confect)
This Cohen's d result (-0.19889735) represents a difference of -0.2 pooled standard deviations between the two groups, and suggests a **small** effect size.
d_cond_dairy = Cohen_d(cond_means_np, dairy_means_np)
plot_pdfs(d_cond_dairy)
This Cohen's d result (-3.02144627) represents a difference of -3.02 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_cond_meat = Cohen_d(cond_means_np, meat_means_np)
plot_pdfs(d_cond_meat)
This Cohen's d result (-6.63885587) represents a difference of -6.64 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_cond_grains = Cohen_d(cond_means_np, grains_means_np)
plot_pdfs(d_cond_grains)
This Cohen's d result (0.12488345) represents a difference of 0.12 pooled standard deviations between the two groups, and suggests a **small** effect size.
d_cond_produce = Cohen_d(cond_means_np, produce_means_np)
plot_pdfs(d_cond_produce)
This Cohen's d result (-6.81656087) represents a difference of -6.82 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_cond_seafood = Cohen_d(cond_means_np, seafood_means_np)
plot_pdfs(d_cond_seafood)
This Cohen's d result (2.67215594) represents a difference of 2.67 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_confect_dairy = Cohen_d(confect_means_np, dairy_means_np)
plot_pdfs(d_confect_dairy)
This Cohen's d result (-2.54615497) represents a difference of -2.55 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_confect_meat = Cohen_d(confect_means_np, meat_means_np)
plot_pdfs(d_confect_meat)
This Cohen's d result (-6.07982689) represents a difference of -6.08 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_confect_grains = Cohen_d(confect_means_np, grains_means_np)
plot_pdfs(d_confect_grains)
This Cohen's d result (0.29065039) represents a difference of 0.29 pooled standard deviations between the two groups, and suggests a **small** effect size.
d_confect_produce = Cohen_d(confect_means_np, produce_means_np)
plot_pdfs(d_confect_produce)
This Cohen's d result (-5.32251585) represents a difference of -5.32 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_confect_seafood = Cohen_d(confect_means_np, seafood_means_np)
plot_pdfs(d_confect_seafood)
This Cohen's d result (2.33474402) represents a difference of 2.33 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_dairy_meat = Cohen_d(dairy_means_np, meat_means_np)
plot_pdfs(d_dairy_meat)
This Cohen's d result (-3.77394204) represents a difference of -3.77 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_dairy_grains = Cohen_d(dairy_means_np, grains_means_np)
plot_pdfs(d_dairy_grains)
This Cohen's d result (3.01034195) represents a difference of 3.01 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_dairy_produce = Cohen_d(dairy_means_np, produce_means_np)
plot_pdfs(d_dairy_produce)
This Cohen's d result (-1.69218869) represents a difference of -1.69 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_dairy_seafood = Cohen_d(dairy_means_np, seafood_means_np)
plot_pdfs(d_dairy_seafood)
This Cohen's d result (4.73531703) represents a difference of 4.74 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_meat_grains = Cohen_d(meat_means_np, grains_means_np)
plot_pdfs(d_meat_grains)
This Cohen's d result (6.58125633) represents a difference of 6.58 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_meat_produce = Cohen_d(meat_means_np, produce_means_np)
plot_pdfs(d_meat_produce)
This Cohen's d result (2.97186024) represents a difference of 2.97 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_meat_seafood = Cohen_d(meat_means_np, seafood_means_np)
plot_pdfs(d_meat_seafood)
This Cohen's d result (7.91121656) represents a difference of 7.91 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_grains_produce = Cohen_d(grains_means_np, produce_means_np)
plot_pdfs(d_grains_produce)
This Cohen's d result (-6.51463463) represents a difference of -6.51 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_grains_seafood = Cohen_d(grains_means_np, seafood_means_np)
plot_pdfs(d_grains_seafood)
This Cohen's d result (2.38169432) represents a difference of 2.38 pooled standard deviations between the two groups, and suggests a **large** effect size.
d_produce_seafood = Cohen_d(produce_means_np, seafood_means_np)
plot_pdfs(d_produce_seafood)
This Cohen's d result (8.99337848) represents a difference of 8.99 pooled standard deviations between the two groups, and suggests a **large** effect size.
The results of Tukey's test suggest that the differences in the distributions of mean order revenue are statistically significant for all but a few category pairings: Grains/Cereals and Condiments; Grains/Cereals and Confections; Condiments and Confections; and occasionally one or two others (in some resampling runs, Beverages and Dairy Products also fail to reach significance). This result is interesting, but not really surprising, in light of what is shown in the bar graph of revenues by category: the three categories in the first three pairings above are the most shelf-stable, and relatively inexpensive, of the 8 product categories. They are also discounted less frequently than most other categories (with the possible exception of Produce), probably because lower spoilage rates mean fewer discounts are needed to move product before it spoils.
In addition, we see some very large differences in average order value among certain product categories. The bar chart of average order revenue by category shows this in an intuitive way; Tukey's test and Cohen's d tell us which differences are statistically significant (all but four pairings) and to what degree.
This last point about the value of Cohen's d is important: while the original bar chart shows differences across product categories, the differences indicated by Cohen's d are sometimes more or less dramatic than the bar chart implies.
As a reminder, here is the bar chart of average order revenue by product category:
df_means.plot(kind='bar', legend=None, title ="Average Order Revenue by Product Category", figsize=(8, 5), fontsize=12)
plt.show()
While there is a notable difference on the bar chart between, say, Produce and Condiments, the Cohen's d value for this pair is roughly 6.8 in absolute value (-6.82 in the run above). That is nearly 7 pooled standard deviations separating the means of these two categories, suggesting that the gap is even larger than the bar chart alone implies. This is instructive because it can help us identify under- or over-performing categories on which to focus specific incentives.
I have a number of general questions that I would like to explore. Below are a few that I had started to investigate, but did not have time to complete.
We have seen that there are substantial differences in the revenues and discounts offered across the 8 categories. But what about within categories? Do the discounts within categories vary by product?
- Null hypothesis (Ho): No significant difference among the means of discount values by product within a category
  - Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
- Alternative hypothesis (Ha): Differences among the means of discount values by product within a category are unlikely to be due to chance
  - Ha: $\mu_1 \neq \mu_2 \neq \mu_3 \neq ... \neq \mu_i$
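One possible starting point for this question is a one-way ANOVA on discount values across products within a single category. Below is a minimal sketch (not run here), assuming the df_bev Beverages dataframe created just below and using scipy's f_oneway:
import scipy.stats as scs
# One-way ANOVA: do mean discount values differ across products within Beverages?
discounts_by_product = [grp['Discount'].values for _, grp in df_bev.groupby('ProductName')]
f_stat, p_val = scs.f_oneway(*discounts_by_product)
print(f"F = {f_stat:.3f}, p = {p_val:.5f}")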
Looking at the distribution of revenues in the Beverages category by discount level:
df_bev = df.loc[df["CategoryName"] == "Beverages"]
df_bev.groupby(['Discount']).sum().sort_values(['Discount'])
Id | ProductId | OrderUnitPrice | OrderQty | CategoryId | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|
Discount | ||||||||
0.00 | 2623997 | 10381 | 7161.95 | 5116 | 246 | 153060.250 | 12.092044 | 0.120920 |
0.05 | 277313 | 1038 | 1507.75 | 898 | 26 | 43567.380 | 3.441904 | 0.034419 |
0.10 | 264816 | 1219 | 812.20 | 728 | 25 | 13027.635 | 1.029207 | 0.010292 |
0.15 | 384608 | 1348 | 609.65 | 885 | 36 | 12889.995 | 1.018334 | 0.010183 |
0.20 | 418197 | 1353 | 1005.40 | 1140 | 39 | 29569.320 | 2.336031 | 0.023360 |
0.25 | 343213 | 1017 | 714.70 | 765 | 32 | 15753.600 | 1.244564 | 0.012446 |
df_bev_dsc = df_bev.drop(['Id', 'ProductId', 'OrderUnitPrice', 'OrderQty', 'RevPercentTotal', 'RevFractionTotal', 'CustomerId', 'OrderDate', 'CategoryName', 'CatDescription'], axis=1)
df_bev_dsc.head()
Discount | ProductName | CategoryId | Revenue | |
---|---|---|---|---|
0 | 0.20 | Chai | 1 | 518.4 |
1 | 0.00 | Chai | 1 | 259.2 |
2 | 0.00 | Chai | 1 | 288.0 |
3 | 0.15 | Chai | 1 | 183.6 |
4 | 0.00 | Chai | 1 | 172.8 |
We can see that the largest revenue group by far, at over \$150,000, is the 0% (no discount) group, followed by roughly \$43,500 for the 5% discount group (almost all of the discounts in the 5-9.9% range are exactly 5%).
Let's take a look at the details of discounts in Beverages, sorting by discount level and product name:
df_bev = df.loc[df['CategoryName']=="Beverages"]
df_bev.groupby(['Discount', 'ProductName']).sum().sort_values(['Discount']).head(20)
Id | ProductId | OrderUnitPrice | OrderQty | CategoryId | Revenue | RevPercentTotal | RevFractionTotal | ||
---|---|---|---|---|---|---|---|---|---|
Discount | ProductName | ||||||||
0.00 | Chai | 235242 | 22 | 374.40 | 391 | 22 | 6681.600 | 0.527859 | 0.005279 |
Steeleye Stout | 266316 | 875 | 424.80 | 531 | 25 | 8812.800 | 0.696228 | 0.006962 | |
Sasquatch Ale | 128134 | 408 | 156.80 | 269 | 12 | 3542.000 | 0.279825 | 0.002798 | |
Rhönbräu Klosterbier | 308816 | 2175 | 212.35 | 631 | 29 | 4642.250 | 0.366746 | 0.003667 | |
Laughing Lumberjack Lager | 75320 | 469 | 98.00 | 115 | 7 | 1610.000 | 0.127193 | 0.001272 | |
Lakkalikööri | 277308 | 1976 | 446.40 | 591 | 26 | 10234.800 | 0.808568 | 0.008086 | |
Outback Lager | 245338 | 1610 | 324.00 | 438 | 23 | 6129.000 | 0.484202 | 0.004842 | |
Guaraná Fantástica | 353001 | 792 | 139.50 | 704 | 33 | 3011.400 | 0.237906 | 0.002379 | |
Côte de Blaye | 150244 | 532 | 3530.90 | 320 | 14 | 79577.000 | 6.286731 | 0.062867 | |
Chartreuse verte | 169736 | 624 | 266.40 | 343 | 16 | 5677.200 | 0.448509 | 0.004485 | |
Chang | 202413 | 38 | 342.00 | 408 | 19 | 7125.000 | 0.562888 | 0.005629 | |
Ipoh Coffee | 212129 | 860 | 846.40 | 375 | 20 | 16017.200 | 1.265389 | 0.012654 | |
0.05 | Outback Lager | 31973 | 210 | 45.00 | 63 | 3 | 897.750 | 0.070924 | 0.000709 |
Steeleye Stout | 11046 | 35 | 18.00 | 18 | 1 | 307.800 | 0.024317 | 0.000243 | |
Sasquatch Ale | 31849 | 102 | 36.40 | 135 | 3 | 1675.800 | 0.132391 | 0.001324 | |
Rhönbräu Klosterbier | 32633 | 225 | 23.25 | 210 | 3 | 1546.125 | 0.122147 | 0.001221 | |
Ipoh Coffee | 10340 | 43 | 36.80 | 40 | 1 | 1398.400 | 0.110476 | 0.001105 | |
Lakkalikööri | 10273 | 76 | 14.40 | 33 | 1 | 451.440 | 0.035665 | 0.000357 | |
Côte de Blaye | 52977 | 190 | 1212.10 | 145 | 5 | 34294.525 | 2.709331 | 0.027093 | |
Chartreuse verte | 21942 | 78 | 36.00 | 82 | 2 | 1402.200 | 0.110776 | 0.001108 |
Now let's group by product name and sort by revenue, to see which products bring in the most revenue.
Note: because CategoryId for all beverages is 1, the number in the CategoryId column represents the total number of orders of that product. For example, there were 24 orders for Cote de Blaye, 28 orders for Ipoh Coffee, etc.
We see Cote de Blaye is at the top of the list for revenues by a large margin, followed by Ipoh Coffee, Chang, Lakkalikööri, and Steeleye Stout.
df_bev.groupby(['ProductName']).sum().sort_values(['Revenue'], ascending=False)
Id | ProductId | OrderUnitPrice | OrderQty | Discount | CategoryId | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|
ProductName | |||||||||
Côte de Blaye | 255583 | 912 | 5902.40 | 623 | 1.10 | 24 | 141396.735 | 11.170605 | 0.111706 |
Ipoh Coffee | 298531 | 1204 | 1205.20 | 580 | 1.40 | 28 | 23526.700 | 1.858653 | 0.018587 |
Chang | 470965 | 88 | 786.60 | 1057 | 4.50 | 44 | 16355.960 | 1.292151 | 0.012922 |
Lakkalikööri | 415841 | 2964 | 662.40 | 981 | 2.05 | 39 | 15760.440 | 1.245104 | 0.012451 |
Steeleye Stout | 383553 | 1260 | 612.00 | 883 | 1.70 | 36 | 13644.000 | 1.077901 | 0.010779 |
Chai | 406841 | 38 | 651.60 | 828 | 2.95 | 38 | 12788.100 | 1.010284 | 0.010103 |
Chartreuse verte | 318790 | 1170 | 500.40 | 793 | 2.00 | 30 | 12294.540 | 0.971291 | 0.009713 |
Outback Lager | 416162 | 2730 | 552.00 | 817 | 2.45 | 39 | 10672.650 | 0.843159 | 0.008432 |
Rhönbräu Klosterbier | 490967 | 3450 | 339.45 | 1155 | 2.40 | 46 | 8177.490 | 0.646037 | 0.006460 |
Sasquatch Ale | 202659 | 646 | 246.40 | 506 | 0.95 | 19 | 6350.400 | 0.501693 | 0.005017 |
Guaraná Fantástica | 544786 | 1224 | 216.00 | 1125 | 2.90 | 51 | 4504.365 | 0.355853 | 0.003559 |
Laughing Lumberjack Lager | 107466 | 670 | 137.20 | 184 | 0.60 | 10 | 2396.800 | 0.189352 | 0.001894 |
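As an aside, a less ambiguous way to get per-product order counts than reading the summed CategoryId column is to aggregate explicitly; a sketch (assuming a pandas version that supports named aggregation):
# Revenue and order count per Beverage product, without relying on the summed CategoryId
bev_summary = df_bev.groupby('ProductName').agg(Revenue=('Revenue', 'sum'), NumOrders=('Id', 'count')).sort_values('Revenue', ascending=False)
bev_summary.head()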
df_bev.groupby(['ProductName', 'Discount']).sum().sort_values(['ProductName'], ascending=True)
Id | ProductId | OrderUnitPrice | OrderQty | CategoryId | Revenue | RevPercentTotal | RevFractionTotal | ||
---|---|---|---|---|---|---|---|---|---|
ProductName | Discount | ||||||||
Chai | 0.00 | 235242 | 22 | 374.40 | 391 | 22 | 6681.600 | 0.527859 | 0.005279 |
0.05 | 10905 | 1 | 18.00 | 20 | 1 | 342.000 | 0.027019 | 0.000270 | |
0.10 | 11025 | 1 | 18.00 | 10 | 1 | 162.000 | 0.012798 | 0.000128 | |
0.15 | 53177 | 5 | 82.80 | 98 | 5 | 1407.600 | 0.111203 | 0.001112 | |
0.20 | 42354 | 4 | 68.40 | 170 | 4 | 2318.400 | 0.183158 | 0.001832 | |
0.25 | 54138 | 5 | 90.00 | 139 | 5 | 1876.500 | 0.148247 | 0.001482 | |
Chang | 0.00 | 202413 | 38 | 342.00 | 408 | 19 | 7125.000 | 0.562888 | 0.005629 |
0.05 | 32270 | 6 | 57.00 | 50 | 3 | 902.500 | 0.071299 | 0.000713 | |
0.10 | 10485 | 2 | 15.20 | 20 | 1 | 273.600 | 0.021615 | 0.000216 | |
0.15 | 53993 | 10 | 87.40 | 125 | 5 | 1744.200 | 0.137795 | 0.001378 | |
0.20 | 106974 | 20 | 174.80 | 247 | 10 | 3432.160 | 0.271147 | 0.002711 | |
0.25 | 64830 | 12 | 110.20 | 207 | 6 | 2878.500 | 0.227407 | 0.002274 | |
Chartreuse verte | 0.25 | 21120 | 78 | 32.40 | 41 | 2 | 499.500 | 0.039461 | 0.000395 |
0.20 | 21501 | 78 | 36.00 | 13 | 2 | 187.200 | 0.014789 | 0.000148 | |
0.15 | 42407 | 156 | 64.80 | 80 | 4 | 1009.800 | 0.079776 | 0.000798 | |
0.05 | 21942 | 78 | 36.00 | 82 | 2 | 1402.200 | 0.110776 | 0.001108 | |
0.00 | 169736 | 624 | 266.40 | 343 | 16 | 5677.200 | 0.448509 | 0.004485 | |
0.10 | 42084 | 156 | 64.80 | 234 | 4 | 3518.640 | 0.277979 | 0.002780 | |
Côte de Blaye | 0.00 | 150244 | 532 | 3530.90 | 320 | 14 | 79577.000 | 6.286731 | 0.062867 |
0.05 | 52977 | 190 | 1212.10 | 145 | 5 | 34294.525 | 2.709331 | 0.027093 | |
0.10 | 21213 | 76 | 527.00 | 19 | 2 | 4505.850 | 0.355971 | 0.003560 | |
0.20 | 20777 | 76 | 421.60 | 99 | 2 | 16695.360 | 1.318964 | 0.013190 | |
0.25 | 10372 | 38 | 210.80 | 40 | 1 | 6324.000 | 0.499608 | 0.004996 | |
Guaraná Fantástica | 0.00 | 353001 | 792 | 139.50 | 704 | 33 | 3011.400 | 0.237906 | 0.002379 |
0.05 | 31105 | 72 | 10.80 | 102 | 3 | 348.840 | 0.027559 | 0.000276 | |
0.10 | 31552 | 72 | 12.60 | 80 | 3 | 307.800 | 0.024317 | 0.000243 | |
0.15 | 42213 | 96 | 17.10 | 60 | 4 | 218.025 | 0.017224 | 0.000172 | |
0.20 | 32203 | 72 | 13.50 | 63 | 3 | 226.800 | 0.017918 | 0.000179 | |
0.25 | 54712 | 120 | 22.50 | 116 | 5 | 391.500 | 0.030929 | 0.000309 | |
Ipoh Coffee | 0.20 | 21975 | 86 | 92.00 | 40 | 2 | 1472.000 | 0.116291 | 0.001163 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Lakkalikööri | 0.15 | 63760 | 456 | 97.20 | 162 | 6 | 2249.100 | 0.177683 | 0.001777 |
0.20 | 21584 | 152 | 36.00 | 94 | 2 | 1353.600 | 0.106937 | 0.001069 | |
0.25 | 21262 | 152 | 32.40 | 41 | 2 | 499.500 | 0.039461 | 0.000395 | |
Laughing Lumberjack Lager | 0.00 | 75320 | 469 | 98.00 | 115 | 7 | 1610.000 | 0.127193 | 0.001272 |
0.15 | 10939 | 67 | 14.00 | 40 | 1 | 476.000 | 0.037605 | 0.000376 | |
0.20 | 10923 | 67 | 14.00 | 24 | 1 | 268.800 | 0.021236 | 0.000212 | |
0.25 | 10284 | 67 | 11.20 | 5 | 1 | 42.000 | 0.003318 | 0.000033 | |
Outback Lager | 0.20 | 32933 | 210 | 45.00 | 82 | 3 | 984.000 | 0.077738 | 0.000777 |
0.15 | 21557 | 140 | 30.00 | 58 | 2 | 739.500 | 0.058422 | 0.000584 | |
0.25 | 42497 | 280 | 57.00 | 69 | 4 | 729.000 | 0.057592 | 0.000576 | |
0.05 | 31973 | 210 | 45.00 | 63 | 3 | 897.750 | 0.070924 | 0.000709 | |
0.00 | 245338 | 1610 | 324.00 | 438 | 23 | 6129.000 | 0.484202 | 0.004842 | |
0.10 | 41864 | 280 | 51.00 | 107 | 4 | 1193.400 | 0.094281 | 0.000943 | |
Rhönbräu Klosterbier | 0.00 | 308816 | 2175 | 212.35 | 631 | 29 | 4642.250 | 0.366746 | 0.003667 |
0.05 | 32633 | 225 | 23.25 | 210 | 3 | 1546.125 | 0.122147 | 0.001221 | |
0.10 | 53081 | 375 | 37.20 | 103 | 5 | 684.945 | 0.054112 | 0.000541 | |
0.15 | 21306 | 150 | 13.95 | 26 | 2 | 163.370 | 0.012907 | 0.000129 | |
0.20 | 64671 | 450 | 46.50 | 181 | 6 | 1122.200 | 0.088656 | 0.000887 | |
0.25 | 10460 | 75 | 6.20 | 4 | 1 | 18.600 | 0.001469 | 0.000015 | |
Sasquatch Ale | 0.25 | 10548 | 34 | 14.00 | 10 | 1 | 105.000 | 0.008295 | 0.000083 |
0.20 | 21138 | 68 | 25.20 | 32 | 2 | 313.600 | 0.024775 | 0.000248 | |
0.00 | 128134 | 408 | 156.80 | 269 | 12 | 3542.000 | 0.279825 | 0.002798 | |
0.05 | 31849 | 102 | 36.40 | 135 | 3 | 1675.800 | 0.132391 | 0.001324 | |
0.15 | 10990 | 34 | 14.00 | 60 | 1 | 714.000 | 0.056407 | 0.000564 | |
Steeleye Stout | 0.20 | 21164 | 70 | 32.40 | 95 | 2 | 1195.200 | 0.094423 | 0.000944 |
0.00 | 266316 | 875 | 424.80 | 531 | 25 | 8812.800 | 0.696228 | 0.006962 | |
0.05 | 11046 | 35 | 18.00 | 18 | 1 | 307.800 | 0.024317 | 0.000243 | |
0.10 | 31858 | 105 | 50.40 | 95 | 3 | 1409.400 | 0.111345 | 0.001113 | |
0.15 | 31877 | 105 | 50.40 | 105 | 3 | 1392.300 | 0.109994 | 0.001100 | |
0.25 | 21292 | 70 | 36.00 | 39 | 2 | 526.500 | 0.041594 | 0.000416 |
67 rows × 8 columns
# df_bev.groupby(['ProductName', 'Discount']).sum().sort_values(['ProductName'], ascending=True)
df_bev_dsc.groupby(['Discount']).sum().sort_values(['Discount']).plot(kind='bar',
title ="Beverages--Revenues by Discount", legend=False, figsize=(8, 5), fontsize=12)
plt.show()
We can see that virtually every product in this category is sold at almost all discount levels (including 0%). With more time, I would like to explore the distributions of revenues by discount level for each named product, and run tests (Tukey's test, among others) to figure out whether those distributions differ significantly across product name.
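As a sketch of how that comparison could be set up, statsmodels' pairwise_tukeyhsd (the routine behind the TukeyHSDResults shown in the category-level test above) could be run on per-order revenue grouped by product within Beverages:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Pairwise Tukey HSD on per-order revenue by product within Beverages (sketch only;
# raw revenues are heavily skewed, so resampled means or a log transform may be preferable)
tukey_bev_products = pairwise_tukeyhsd(endog=df_bev['Revenue'], groups=df_bev['ProductName'], alpha=0.05)
print(tukey_bev_products.summary())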
For example, within the Beverage category, I noticed an interesting distribution of revenues by discount level for Cote de Blaye (which is an expensive wine product and is the largest single product by revenue in the Beverages category).
df.loc[df.ProductName == "Côte de Blaye"].head(4)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
966 | 10329 | SPLIR | 2012-10-15 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.2 | 0.316418 | 0.003164 |
967 | 10351 | ERNSH | 2012-11-11 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.2 | 0.316418 | 0.003164 |
968 | 10353 | PICCO | 2012-11-13 | 38 | 210.8 | 50 | 0.20 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 8432.0 | 0.666144 | 0.006661 |
969 | 10360 | BLONP | 2012-11-22 | 38 | 210.8 | 10 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2108.0 | 0.166536 | 0.001665 |
Grouping by discount level and sorting on revenues:
df_bev.groupby(['ProductName']).sum()
df_bev.loc[df['ProductName'] == 'Côte de Blaye']
df_bev_cote = df_bev.loc[df['ProductName'] == 'Côte de Blaye']
# df_bev_cote.groupby(['Discount', 'Revenue']).sum()
df_bev_cote.groupby(['Discount']).sum()
df_bev_cote_disc_rev = df_bev_cote.drop(columns=['Id', 'ProductId', 'OrderUnitPrice', 'RevPercentTotal', 'RevFractionTotal'])
df_bev_cote_disc_rev.rename(columns={'CategoryId': 'Number of Orders'})
print("Revenue totals sorted by discount, Cote de Blaye")
df_bev_cote_disc_rev.groupby(['Discount']).sum().sort_values(['Revenue'], ascending=False)
# Note that, since the CategoryId for Beverages is '1' and each row
# represents a summation of each item, the number showing in the
# CategoryId column represents the number of orders placed at that
# discount level. OrderQty is the total number of units purchased
# across all orders at that discount level.
Revenue totals sorted by discount, Cote de Blaye
OrderQty | CategoryId | Revenue | |
---|---|---|---|
Discount | |||
0.00 | 320 | 14 | 79577.000 |
0.05 | 145 | 5 | 34294.525 |
0.20 | 99 | 2 | 16695.360 |
0.25 | 40 | 1 | 6324.000 |
0.10 | 19 | 2 | 4505.850 |
df_bev_dsc.groupby(['Discount']).sum().sort_values(['Revenue'], ascending=False).plot(kind='bar',
title ="Beverages--Discount, sorted by Revenue", figsize=(8, 5), fontsize=12)
plt.show()
Sorting revenues by discount level, we can see that the greatest order revenues occurred at the 0% and 5% discount levels, with the next highest order revenues occurring at 20% and 25%, and relatively low order revenues at the 10% and 15% levels.
- Almost \$80,000 of total sales of the product occurred without any discount.
- The next highest revenues for the product (over \$34,000) were made at the 5% discount level.
- The third highest revenues of over \$16,000 were made with a 20% discount applied.
- Over \$6,300 was made at the 25% discount level.
- Finally, about \$4,500 in revenue was realized on this product at the 10% level. (None was sold at the 15% discount level.)
I was curious to know if the date of the sale had any bearing on these sales figures. I thought that perhaps seasonality might be at play. Below is the dataset for this product, sorted by discount.
df.loc[df.ProductName == "Côte de Blaye"].sort_values(['Discount'], ascending=False)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
970 | 10372 | QUEE | 2012-12-04 | 38 | 210.8 | 40 | 0.25 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6324.000 | 0.499608 | 0.004996 |
968 | 10353 | PICCO | 2012-11-13 | 38 | 210.8 | 50 | 0.20 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 8432.000 | 0.666144 | 0.006661 |
972 | 10424 | MEREP | 2013-01-23 | 38 | 210.8 | 49 | 0.20 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 8263.360 | 0.652821 | 0.006528 |
978 | 10672 | BERGS | 2013-09-17 | 38 | 263.5 | 15 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3557.250 | 0.281029 | 0.002810 |
976 | 10541 | HANAR | 2013-05-19 | 38 | 263.5 | 4 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 948.600 | 0.074941 | 0.000749 |
985 | 10865 | QUICK | 2014-02-02 | 38 | 263.5 | 60 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 15019.500 | 1.186568 | 0.011866 |
981 | 10816 | GREAL | 2014-01-06 | 38 | 263.5 | 30 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7509.750 | 0.593284 | 0.005933 |
967 | 10351 | ERNSH | 2012-11-11 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
977 | 10616 | GREAL | 2013-07-31 | 38 | 263.5 | 15 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3754.875 | 0.296642 | 0.002966 |
966 | 10329 | SPLIR | 2012-10-15 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
975 | 10540 | QUICK | 2013-05-19 | 38 | 263.5 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7905.000 | 0.624510 | 0.006245 |
974 | 10518 | TORTU | 2013-04-25 | 38 | 263.5 | 15 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3952.500 | 0.312255 | 0.003123 |
973 | 10479 | RATTC | 2013-03-19 | 38 | 210.8 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6324.000 | 0.499608 | 0.004996 |
979 | 10783 | HANAR | 2013-12-18 | 38 | 263.5 | 5 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1317.500 | 0.104085 | 0.001041 |
980 | 10805 | THEBI | 2013-12-30 | 38 | 263.5 | 10 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2635.000 | 0.208170 | 0.002082 |
971 | 10417 | SIMOB | 2013-01-16 | 38 | 210.8 | 50 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 10540.000 | 0.832680 | 0.008327 |
982 | 10817 | KOENE | 2014-01-06 | 38 | 263.5 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7905.000 | 0.624510 | 0.006245 |
983 | 10828 | RANCH | 2014-01-13 | 38 | 263.5 | 2 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 527.000 | 0.041634 | 0.000416 |
984 | 10831 | SANTG | 2014-01-14 | 38 | 263.5 | 8 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2108.000 | 0.166536 | 0.001665 |
969 | 10360 | BLONP | 2012-11-22 | 38 | 210.8 | 10 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2108.000 | 0.166536 | 0.001665 |
986 | 10889 | RATTC | 2014-02-16 | 38 | 263.5 | 40 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 10540.000 | 0.832680 | 0.008327 |
987 | 10964 | SPECD | 2014-03-20 | 38 | 263.5 | 5 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1317.500 | 0.104085 | 0.001041 |
988 | 10981 | HANAR | 2014-03-27 | 38 | 263.5 | 60 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 15810.000 | 1.249019 | 0.012490 |
989 | 11032 | WHITC | 2014-04-17 | 38 | 263.5 | 25 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6587.500 | 0.520425 | 0.005204 |
The three largest discounts were given on sales between November 1, 2012 and January 31, 2013. I also noticed that a few other discounted sales were made at the end of the year, but more were made at the beginning of the year. There were also a number of non-discounted sales close to the start of 2014, but there were just as many, if not more, non-discounted orders at other times of the year.
The number of datapoints in this subset of product data is quite small, which is why a robust statistical test would be challenging to run. With more time, I would research which tests would be most likely to yield meaningful results, given the small sample size for this product. Given the high revenues per order, it would be helpful to understand more about the sales patterns of this product--seasonality and geography in particular. Perhaps there is unmet customer demand that could be tapped by identifying revenue potential in different geographies or at different times of the year.
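As a quick first pass at the seasonality question, the orders for this product could be bucketed by calendar month; a minimal sketch, assuming OrderDate is stored as a date string as shown in the tables above:
# Aggregate Côte de Blaye revenue and quantity by calendar month
# (descriptive only, given the very small number of orders for this product)
df_cote = df.loc[df['ProductName'] == 'Côte de Blaye'].copy()
df_cote['OrderMonth'] = pd.to_datetime(df_cote['OrderDate']).dt.month
df_cote.groupby('OrderMonth')[['Revenue', 'OrderQty']].sum()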
df.loc[df.ProductName == "Côte de Blaye"].sort_values(['Revenue'], ascending=False)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
988 | 10981 | HANAR | 2014-03-27 | 38 | 263.5 | 60 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 15810.000 | 1.249019 | 0.012490 |
985 | 10865 | QUICK | 2014-02-02 | 38 | 263.5 | 60 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 15019.500 | 1.186568 | 0.011866 |
971 | 10417 | SIMOB | 2013-01-16 | 38 | 210.8 | 50 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 10540.000 | 0.832680 | 0.008327 |
986 | 10889 | RATTC | 2014-02-16 | 38 | 263.5 | 40 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 10540.000 | 0.832680 | 0.008327 |
968 | 10353 | PICCO | 2012-11-13 | 38 | 210.8 | 50 | 0.20 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 8432.000 | 0.666144 | 0.006661 |
972 | 10424 | MEREP | 2013-01-23 | 38 | 210.8 | 49 | 0.20 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 8263.360 | 0.652821 | 0.006528 |
975 | 10540 | QUICK | 2013-05-19 | 38 | 263.5 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7905.000 | 0.624510 | 0.006245 |
982 | 10817 | KOENE | 2014-01-06 | 38 | 263.5 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7905.000 | 0.624510 | 0.006245 |
981 | 10816 | GREAL | 2014-01-06 | 38 | 263.5 | 30 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7509.750 | 0.593284 | 0.005933 |
989 | 11032 | WHITC | 2014-04-17 | 38 | 263.5 | 25 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6587.500 | 0.520425 | 0.005204 |
970 | 10372 | QUEE | 2012-12-04 | 38 | 210.8 | 40 | 0.25 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6324.000 | 0.499608 | 0.004996 |
973 | 10479 | RATTC | 2013-03-19 | 38 | 210.8 | 30 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 6324.000 | 0.499608 | 0.004996 |
966 | 10329 | SPLIR | 2012-10-15 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
967 | 10351 | ERNSH | 2012-11-11 | 38 | 210.8 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
974 | 10518 | TORTU | 2013-04-25 | 38 | 263.5 | 15 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3952.500 | 0.312255 | 0.003123 |
977 | 10616 | GREAL | 2013-07-31 | 38 | 263.5 | 15 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3754.875 | 0.296642 | 0.002966 |
978 | 10672 | BERGS | 2013-09-17 | 38 | 263.5 | 15 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3557.250 | 0.281029 | 0.002810 |
980 | 10805 | THEBI | 2013-12-30 | 38 | 263.5 | 10 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2635.000 | 0.208170 | 0.002082 |
984 | 10831 | SANTG | 2014-01-14 | 38 | 263.5 | 8 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2108.000 | 0.166536 | 0.001665 |
969 | 10360 | BLONP | 2012-11-22 | 38 | 210.8 | 10 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2108.000 | 0.166536 | 0.001665 |
979 | 10783 | HANAR | 2013-12-18 | 38 | 263.5 | 5 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1317.500 | 0.104085 | 0.001041 |
987 | 10964 | SPECD | 2014-03-20 | 38 | 263.5 | 5 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1317.500 | 0.104085 | 0.001041 |
976 | 10541 | HANAR | 2013-05-19 | 38 | 263.5 | 4 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 948.600 | 0.074941 | 0.000749 |
983 | 10828 | RANCH | 2014-01-13 | 38 | 263.5 | 2 | 0.00 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 527.000 | 0.041634 | 0.000416 |
Do revenues from discounted products differ in a statistically significant way from revenues on non-discounted products within each product category?
- Null hypothesis (Ho): No significant difference between revenues from non-discounted products in a category and revenues from discounted products in that category
  - Ho: $\mu_{\text{no disc}} = \mu_{\text{disc}}$
- Alternative hypothesis (Ha): Differences between revenues from non-discounted products in a category and revenues from discounted products in that category are unlikely to be due to chance
  - Ha: $\mu_{\text{no disc}} \neq \mu_{\text{disc}}$
# dataframes:
# df_bev
# df_cond
# df_confect
# df_dairy
# df_meat
# df_grains
# df_produce
# df_seafood
# df.loc[df["CategoryName"] == "Condiments"]
df_bev_no_disc = df_bev.loc[df["Discount"] == 0]
df_bev_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10294 | RATTC | 2012-08-30 | 1 | 14.4 | 18 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 259.2 | 0.020477 | 0.000205 |
2 | 10317 | LONEP | 2012-09-30 | 1 | 14.4 | 20 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 288.0 | 0.022753 | 0.000228 |
4 | 10354 | PERIC | 2012-11-14 | 1 | 14.4 | 12 | 0.0 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 172.8 | 0.013652 | 0.000137 |
df_bev_disc = df_bev.loc[df["Discount"] > 0]
df_bev_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
hist_plot2(df_bev_no_disc.Revenue, df_bev_disc.Revenue, label="Beverage revenues per order, no discounts", label1="Beverage revenues per order, ALL discounts", xlim=([0, 2500]))
bev_no_disc_mean, bev_no_disc_std = sampling_mean(df_bev_no_disc.Revenue)
print(f"The mean of the distribution is {round(np.mean(bev_no_disc_mean), 2)}")
print(f"The std deviation of the distribution is {round(np.std(bev_no_disc_std), 2)}")
hist_plot1(bev_no_disc_mean, label="Resampling means, order revs, NO discounts, Beverages", xlim=([0, 1200]), print_stmt=False)
The mean of the distribution is 618.11
The std deviation of the distribution is 460.16
bev_disc_mean, bev_disc_std = sampling_mean(df_bev_disc.Revenue)
print(f"The mean of the distribution is {round(np.mean(bev_disc_mean), 2)}")
print(f"The std deviation of the distribution is {round(np.std(bev_disc_std), 2)}")
hist_plot1(bev_disc_mean, label="Resampling means, order revs, ALL discounts, Beverages", xlim=([0, 1200]), print_stmt=False)
The mean of the distribution is 731.42
The std deviation of the distribution is 339.08
df_bev_disc.Discount.value_counts()
0.20 39
0.15 36
0.25 32
0.05 26
0.10 25
Name: Discount, dtype: int64
# Divide df_bev_disc into two dataframes: discounts of 5-15% and 20%+
df_bev_disc_5_15 = df_bev_disc.loc[(df['Discount'] >= 0.05) & (df['Discount'] < 0.20)]
df_bev_disc_5_15.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 10348 | WANDK | 2012-11-07 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
5 | 10370 | CHOPS | 2012-12-03 | 1 | 14.4 | 15 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 183.6 | 0.014505 | 0.000145 |
10 | 10526 | WARTH | 2013-05-05 | 1 | 18.0 | 8 | 0.15 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 122.4 | 0.009670 | 0.000097 |
df_bev_disc_20plus = df_bev_disc.loc[(df['Discount'] >= 0.2)]
df_bev_disc_20plus.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10285 | QUICK | 2012-08-20 | 1 | 14.4 | 45 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.4 | 0.040955 | 0.000410 |
9 | 10522 | LEHMS | 2013-04-30 | 1 | 18.0 | 40 | 0.20 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 576.0 | 0.045505 | 0.000455 |
16 | 10646 | HUNGO | 2013-08-27 | 1 | 18.0 | 15 | 0.25 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 202.5 | 0.015998 | 0.000160 |
We have obtained relatively normal distributions by taking the means of multiple samples. Before performing a t-test, we need to check that the variances of the two distributions are similar. We can use Bartlett's test to check for equal variances.
import scipy.stats
scipy.stats.bartlett(bev_disc_mean, bev_no_disc_mean)
BartlettResult(statistic=15.45656010891127, pvalue=8.442339866062341e-05)
The p-value is < 0.05, so we can reject the null hypothesis that the samples come from populations with equal variances.
scipy.stats.levene(bev_disc_mean, bev_no_disc_mean)
LeveneResult(statistic=12.009820337335572, pvalue=0.0005671280164836042)
Like Bartlett's test, Levene's test evaluates whether two distributions have equal variances. Levene's test is less sensitive to departures from normality than Bartlett's, so it is a useful alternative when the distributions deviate from normal.
Both tests yield p-values well below 0.05, so we can reject the null hypothesis that the distributions have equal variances. This is useful information: while the distributions are reasonably normal, they do not appear to have the same shape. The distribution of discounted product revenues seems to have a stronger left skew than that of the non-discounted product revenues, and the printed means and standard deviations show that the means differ substantially.
Although the distributions do not have equal variances, they are broadly similar, so I will run a two-sided, two-sample t-test on them; given the unequal variances, it is also worth cross-checking with Welch's t-test, which does not assume equal variances (see the sketch after the results below).
scs.ttest_ind(bev_disc_mean, bev_no_disc_mean)
Ttest_indResult(statistic=11.800139489541447, pvalue=4.901027148387775e-29)
We see an extremely low p-value, so we reject the null hypothesis that there is no difference between the revenues from non-discounted products and the revenues from discounted products.
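Because Bartlett's and Levene's tests indicated unequal variances, it is worth cross-checking with Welch's t-test, which does not assume equal variances; a minimal sketch on the same resampled-mean arrays:
import scipy.stats as scs
# Welch's t-test: same comparison as above, but without the equal-variance assumption
scs.ttest_ind(bev_disc_mean, bev_no_disc_mean, equal_var=False)
If the conclusion holds under Welch's test as well, the result is more robust to the variance difference flagged above.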
Next, let's take a look at Cohen's d:
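A sketch of that calculation, reusing the Cohen_d and plot_pdfs helpers from the category-level comparisons (assuming Cohen_d accepts the two arrays of resampled means, as it did above):
# Effect size for discounted vs. non-discounted Beverage order revenues
d_bev_disc_vs_none = Cohen_d(np.array(bev_disc_mean), np.array(bev_no_disc_mean))
plot_pdfs(d_bev_disc_vs_none)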
df_bev_disc_05to15 = df_bev_disc.loc[(df["Discount"] >= 0.01) & (df["Discount"] < 0.15)]
df_bev_disc_05to15
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26 | 10905 | WELLI | 2014-02-24 | 1 | 18.00 | 20 | 0.05 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 342.000 | 0.027019 | 0.000270 |
33 | 11025 | WARTH | 2014-04-15 | 1 | 18.00 | 10 | 0.10 | Chai | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 162.000 | 0.012798 | 0.000128 |
50 | 10485 | LINOD | 2013-03-25 | 2 | 15.20 | 20 | 0.10 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 273.600 | 0.021615 | 0.000216 |
54 | 10632 | WANDK | 2013-08-14 | 2 | 19.00 | 30 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 541.500 | 0.042780 | 0.000428 |
61 | 10787 | LAMAI | 2013-12-19 | 2 | 19.00 | 15 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 270.750 | 0.021390 | 0.000214 |
66 | 10851 | RICAR | 2014-01-26 | 2 | 19.00 | 5 | 0.05 | Chang | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 90.250 | 0.007130 | 0.000071 |
571 | 10275 | MAGAA | 2012-08-07 | 24 | 3.60 | 12 | 0.05 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 41.040 | 0.003242 | 0.000032 |
577 | 10358 | LAMAI | 2012-11-20 | 24 | 3.60 | 10 | 0.05 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 34.200 | 0.002702 | 0.000027 |
580 | 10446 | TOMSP | 2013-02-14 | 24 | 3.60 | 20 | 0.10 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 64.800 | 0.005119 | 0.000051 |
583 | 10472 | SEVES | 2013-03-12 | 24 | 3.60 | 80 | 0.05 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 273.600 | 0.021615 | 0.000216 |
588 | 10541 | HANAR | 2013-05-19 | 24 | 4.50 | 35 | 0.10 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 141.750 | 0.011199 | 0.000112 |
590 | 10565 | MEREP | 2013-06-11 | 24 | 4.50 | 25 | 0.10 | Guaraná Fantástica | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 101.250 | 0.007999 | 0.000080 |
876 | 10358 | LAMAI | 2012-11-20 | 34 | 11.20 | 10 | 0.05 | Sasquatch Ale | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 106.400 | 0.008406 | 0.000084 |
880 | 10483 | WHITC | 2013-03-24 | 34 | 11.20 | 35 | 0.05 | Sasquatch Ale | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 372.400 | 0.029420 | 0.000294 |
890 | 11008 | ERNSH | 2014-04-08 | 34 | 14.00 | 90 | 0.05 | Sasquatch Ale | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1197.000 | 0.094565 | 0.000946 |
898 | 10390 | ERNSH | 2012-12-23 | 35 | 14.40 | 40 | 0.10 | Steeleye Stout | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 518.400 | 0.040955 | 0.000410 |
911 | 10623 | FRANK | 2013-08-07 | 35 | 18.00 | 30 | 0.10 | Steeleye Stout | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 486.000 | 0.038395 | 0.000384 |
919 | 10845 | QUICK | 2014-01-21 | 35 | 18.00 | 25 | 0.10 | Steeleye Stout | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 405.000 | 0.031996 | 0.000320 |
928 | 11046 | WANDK | 2014-04-23 | 35 | 18.00 | 18 | 0.05 | Steeleye Stout | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 307.800 | 0.024317 | 0.000243 |
966 | 10329 | SPLIR | 2012-10-15 | 38 | 210.80 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
967 | 10351 | ERNSH | 2012-11-11 | 38 | 210.80 | 20 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 4005.200 | 0.316418 | 0.003164 |
976 | 10541 | HANAR | 2013-05-19 | 38 | 263.50 | 4 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 948.600 | 0.074941 | 0.000749 |
977 | 10616 | GREAL | 2013-07-31 | 38 | 263.50 | 15 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3754.875 | 0.296642 | 0.002966 |
978 | 10672 | BERGS | 2013-09-17 | 38 | 263.50 | 15 | 0.10 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 3557.250 | 0.281029 | 0.002810 |
981 | 10816 | GREAL | 2014-01-06 | 38 | 263.50 | 30 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 7509.750 | 0.593284 | 0.005933 |
985 | 10865 | QUICK | 2014-02-02 | 38 | 263.50 | 60 | 0.05 | Côte de Blaye | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 15019.500 | 1.186568 | 0.011866 |
993 | 10305 | OLDWO | 2012-09-13 | 39 | 14.40 | 30 | 0.10 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 388.800 | 0.030716 | 0.000307 |
996 | 10361 | QUICK | 2012-11-22 | 39 | 14.40 | 54 | 0.10 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 699.840 | 0.055289 | 0.000553 |
1006 | 10654 | BERGS | 2013-09-02 | 39 | 18.00 | 20 | 0.10 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 324.000 | 0.025597 | 0.000256 |
1009 | 10764 | ERNSH | 2013-12-03 | 39 | 18.00 | 130 | 0.10 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 2106.000 | 0.166378 | 0.001664 |
1014 | 10865 | QUICK | 2014-02-02 | 39 | 18.00 | 80 | 0.05 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1368.000 | 0.108075 | 0.001081 |
1019 | 11077 | RATTC | 2014-05-06 | 39 | 18.00 | 2 | 0.05 | Chartreuse verte | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 34.200 | 0.002702 | 0.000027 |
1143 | 10340 | BONAP | 2012-10-29 | 43 | 36.80 | 40 | 0.05 | Ipoh Coffee | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 1398.400 | 0.110476 | 0.001105 |
1893 | 10420 | WELLI | 2013-01-21 | 70 | 12.00 | 8 | 0.10 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 86.400 | 0.006826 | 0.000068 |
1894 | 10453 | AROUT | 2013-02-21 | 70 | 12.00 | 25 | 0.10 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 270.000 | 0.021331 | 0.000213 |
1896 | 10485 | LINOD | 2013-03-25 | 70 | 12.00 | 60 | 0.10 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 648.000 | 0.051193 | 0.000512 |
1897 | 10506 | KOENE | 2013-04-15 | 70 | 15.00 | 14 | 0.10 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 189.000 | 0.014931 | 0.000149 |
1900 | 10616 | GREAL | 2013-07-31 | 70 | 15.00 | 15 | 0.05 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 213.750 | 0.016887 | 0.000169 |
1903 | 10659 | QUEE | 2013-09-05 | 70 | 15.00 | 40 | 0.05 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 570.000 | 0.045031 | 0.000450 |
1906 | 10698 | ERNSH | 2013-10-09 | 70 | 15.00 | 8 | 0.05 | Outback Lager | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 114.000 | 0.009006 | 0.000090 |
2039 | 10436 | BLONP | 2013-02-05 | 75 | 6.20 | 24 | 0.10 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 133.920 | 0.010580 | 0.000106 |
2043 | 10510 | SAVEA | 2013-04-18 | 75 | 7.75 | 36 | 0.10 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 251.100 | 0.019837 | 0.000198 |
2047 | 10572 | BERGS | 2013-06-18 | 75 | 7.75 | 15 | 0.10 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 104.625 | 0.008266 | 0.000083 |
2053 | 10631 | LAMAI | 2013-08-14 | 75 | 7.75 | 8 | 0.10 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 55.800 | 0.004408 | 0.000044 |
2063 | 10788 | QUICK | 2013-12-22 | 75 | 7.75 | 40 | 0.05 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 294.500 | 0.023266 | 0.000233 |
2065 | 10894 | SAVEA | 2014-02-18 | 75 | 7.75 | 120 | 0.05 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 883.500 | 0.069798 | 0.000698 |
2069 | 10932 | BONAP | 2014-03-06 | 75 | 7.75 | 20 | 0.10 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 139.500 | 0.011021 | 0.000110 |
2070 | 10951 | RICSU | 2014-03-16 | 75 | 7.75 | 50 | 0.05 | Rhönbräu Klosterbier | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 368.125 | 0.029083 | 0.000291 |
2079 | 10273 | QUICK | 2012-08-05 | 76 | 14.40 | 33 | 0.05 | Lakkalikööri | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 451.440 | 0.035665 | 0.000357 |
2095 | 10604 | FURIB | 2013-07-18 | 76 | 18.00 | 10 | 0.10 | Lakkalikööri | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 162.000 | 0.012798 | 0.000128 |
2115 | 11050 | FOLKO | 2014-04-27 | 76 | 18.00 | 50 | 0.10 | Lakkalikööri | 1 | Beverages | Soft drinks, coffees, teas, beers, and ales | 810.000 | 0.063992 | 0.000640 |
# rev_disc_01to05 = df.loc[(df["Discount"] >= .01) & (df["Discount"] < .05)]
Create dataframes for Revenues from no discounts and Revenues due to all discounts
df_cond_no_disc = df_cond.loc[df["Discount"] == 0]
df_cond_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
82 | 10289 | BSBEV | 2012-08-26 | 3 | 8.0 | 30 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 240.0 | 0.018960 | 0.000190 |
83 | 10405 | LINOD | 2013-01-06 | 3 | 8.0 | 50 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 400.0 | 0.031601 | 0.000316 |
85 | 10540 | QUICK | 2013-05-19 | 3 | 10.0 | 60 | 0.0 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 600.0 | 0.047401 | 0.000474 |
df_cond_disc = df_cond.loc[df["Discount"] > 0]
df_cond_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
84 | 10485 | LINOD | 2013-03-25 | 3 | 8.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 144.00 | 0.011376 | 0.000114 |
89 | 10764 | ERNSH | 2013-12-03 | 3 | 10.0 | 20 | 0.1 | Aniseed Syrup | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 180.00 | 0.014220 | 0.000142 |
96 | 10336 | PRINI | 2012-10-23 | 4 | 17.6 | 18 | 0.1 | Chef Anton's Cajun Seasoning | 2 | Condiments | Sweet and savory sauces, relishes, spreads, an... | 285.12 | 0.022525 | 0.000225 |
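The same no-discount/discounted split is repeated for each remaining category below. As a compact alternative, here is a sketch of looping the split, together with a quick Welch's t-test on raw order revenues, over all categories (note that the Beverages test above used resampled means rather than raw revenues, so the numbers would differ):
import scipy.stats as scs
# Welch's t-test of discounted vs. non-discounted order revenue, per category (sketch on raw revenues)
for cat, grp in df.groupby('CategoryName'):
    no_disc = grp.loc[grp['Discount'] == 0, 'Revenue']
    disc = grp.loc[grp['Discount'] > 0, 'Revenue']
    t_stat, p_val = scs.ttest_ind(disc, no_disc, equal_var=False)
    print(f"{cat:15s}  t = {t_stat:6.2f}  p = {p_val:.4f}")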
Create dataframes for Revenues from no discounts and Revenues due to all discounts
df_confect_no_disc = df_confect.loc[df["Discount"] == 0]
df_confect_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
336 | 10255 | RICSU | 2012-07-12 | 16 | 13.9 | 35 | 0.0 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 486.5 | 0.038434 | 0.000384 |
339 | 10296 | LILAS | 2012-09-03 | 16 | 13.9 | 30 | 0.0 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 417.0 | 0.032944 | 0.000329 |
340 | 10310 | THEBI | 2012-09-20 | 16 | 13.9 | 10 | 0.0 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 139.0 | 0.010981 | 0.000110 |
df_confect_disc = df_confect.loc[df["Discount"] > 0]
df_confect_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
337 | 10263 | ERNSH | 2012-07-23 | 16 | 13.9 | 60 | 0.25 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 625.500 | 0.049416 | 0.000494 |
338 | 10287 | RICAR | 2012-08-22 | 16 | 13.9 | 40 | 0.15 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 472.600 | 0.037336 | 0.000373 |
341 | 10324 | SAVEA | 2012-10-08 | 16 | 13.9 | 21 | 0.15 | Pavlova | 3 | Confections | Desserts, candies, and sweet breads | 248.115 | 0.019602 | 0.000196 |
Create dataframes for Revenues from no discounts and Revenues due to all discounts
df_dairy_no_disc = df_dairy.loc[df["Discount"] == 0]
df_dairy_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
216 | 10248 | VINET | 2012-07-04 | 11 | 14.0 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 168.0 | 0.013272 | 0.000133 |
217 | 10296 | LILAS | 2012-09-03 | 11 | 16.8 | 12 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 201.6 | 0.015927 | 0.000159 |
220 | 10365 | ANTO | 2012-11-27 | 11 | 16.8 | 24 | 0.0 | Queso Cabrales | 4 | Dairy Products | Cheeses | 403.2 | 0.031854 | 0.000319 |
df_dairy_disc = df_dairy.loc[df["Discount"] > 0]
df_dairy_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
218 | 10327 | FOLKO | 2012-10-11 | 11 | 16.8 | 50 | 0.2 | Queso Cabrales | 4 | Dairy Products | Cheeses | 672.00 | 0.053089 | 0.000531 |
219 | 10353 | PICCO | 2012-11-13 | 11 | 16.8 | 12 | 0.2 | Queso Cabrales | 4 | Dairy Products | Cheeses | 161.28 | 0.012741 | 0.000127 |
224 | 10443 | REGGC | 2013-02-12 | 11 | 16.8 | 6 | 0.2 | Queso Cabrales | 4 | Dairy Products | Cheeses | 80.64 | 0.006371 | 0.000064 |
Create dataframes for Revenues from no discounts and Revenues due to all discounts
df_meat_no_disc = df_meat.loc[df["Discount"] == 0]
df_meat_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
181 | 10693 | WHITC | 2013-10-06 | 9 | 97.0 | 6 | 0.0 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 582.0 | 0.045979 | 0.000460 |
182 | 10848 | CONSH | 2014-01-23 | 9 | 97.0 | 3 | 0.0 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 291.0 | 0.022990 | 0.000230 |
379 | 10265 | BLONP | 2012-07-25 | 17 | 31.2 | 30 | 0.0 | Alice Mutton | 6 | Meat/Poultry | Prepared meats | 936.0 | 0.073946 | 0.000739 |
df_meat_disc = df_meat.loc[df["Discount"] > 0]
df_meat_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
178 | 10420 | WELLI | 2013-01-21 | 9 | 77.6 | 20 | 0.10 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1396.8 | 0.110350 | 0.001103 |
179 | 10515 | QUICK | 2013-04-23 | 9 | 97.0 | 16 | 0.15 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 1319.2 | 0.104219 | 0.001042 |
180 | 10687 | HUNGO | 2013-09-30 | 9 | 97.0 | 50 | 0.25 | Mishi Kobe Niku | 6 | Meat/Poultry | Prepared meats | 3637.5 | 0.287369 | 0.002874 |
Create dataframes for Revenues from no discounts and Revenues due to all discounts
df_grains_no_disc = df_grains.loc[df["Discount"] == 0]
df_grains_no_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
536 | 10435 | CONSH | 2013-02-04 | 22 | 16.8 | 12 | 0.0 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 201.6 | 0.015927 | 0.000159 |
537 | 10553 | WARTH | 2013-05-30 | 22 | 21.0 | 24 | 0.0 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 504.0 | 0.039817 | 0.000398 |
538 | 10603 | SAVEA | 2013-07-18 | 22 | 21.0 | 48 | 0.0 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 1008.0 | 0.079634 | 0.000796 |
df_grains_disc = df_grains.loc[df["Discount"] > 0]
df_grains_disc.head(3)
Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
535 | 10251 | VICTE | 2012-07-08 | 22 | 16.8 | 6 | 0.05 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 95.76 | 0.007565 | 0.000076 |
542 | 10651 | WANDK | 2013-09-01 | 22 | 21.0 | 20 | 0.25 | Gustaf's Knäckebröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 315.00 | 0.024886 | 0.000249 |
556 | 10543 | LILAS | 2013-05-21 | 23 | 9.0 | 70 | 0.15 | Tunnbröd | 5 | Grains/Cereals | Breads, crackers, pasta, and cereal | 535.50 | 0.042305 | 0.000423 |
Create dataframes for revenues from non-discounted orders and from discounted orders in the Produce category
df_produce_no_disc = df_produce.loc[df_produce["Discount"] == 0]  # full-price Produce orders
df_produce_no_disc.head(3)
  | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
136 | 10262 | RATTC | 2012-07-22 | 7 | 24.0 | 15 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 360.0 | 0.028441 | 0.000284 |
139 | 10471 | BSBEV | 2013-03-11 | 7 | 24.0 | 30 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 720.0 | 0.056881 | 0.000569 |
141 | 10546 | VICTE | 2013-05-23 | 7 | 30.0 | 10 | 0.0 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 300.0 | 0.023701 | 0.000237 |
df_produce_disc = df_produce.loc[df_produce["Discount"] > 0]  # discounted Produce orders
df_produce_disc.head(3)
  | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
137 | 10385 | SPLIR | 2012-12-17 | 7 | 24.0 | 10 | 0.20 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 192.0 | 0.015168 | 0.000152 |
138 | 10459 | VICTE | 2013-02-27 | 7 | 24.0 | 16 | 0.05 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 364.8 | 0.028820 | 0.000288 |
140 | 10511 | BONAP | 2013-04-18 | 7 | 30.0 | 50 | 0.15 | Uncle Bob's Organic Dried Pears | 7 | Produce | Dried fruit and bean curd | 1275.0 | 0.100727 | 0.001007 |
Create dataframes for revenues from non-discounted orders and from discounted orders in the Seafood category
df_seafood_no_disc = df_seafood.loc[df_seafood["Discount"] == 0]  # full-price Seafood orders
df_seafood_no_disc.head(3)
  | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
184 | 10276 | TORTU | 2012-08-08 | 10 | 24.8 | 15 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 372.0 | 0.029389 | 0.000294 |
186 | 10389 | BOTTM | 2012-12-20 | 10 | 24.8 | 16 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 396.8 | 0.031348 | 0.000313 |
187 | 10449 | BLONP | 2013-02-18 | 10 | 24.8 | 14 | 0.0 | Ikura | 8 | Seafood | Seaweed and fish | 347.2 | 0.027429 | 0.000274 |
df_seafood_disc = df_seafood.loc[df_seafood["Discount"] > 0]  # discounted Seafood orders
df_seafood_disc.head(3)
  | Id | CustomerId | OrderDate | ProductId | OrderUnitPrice | OrderQty | Discount | ProductName | CategoryId | CategoryName | CatDescription | Revenue | RevPercentTotal | RevFractionTotal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
183 | 10273 | QUICK | 2012-08-05 | 10 | 24.8 | 24 | 0.05 | Ikura | 8 | Seafood | Seaweed and fish | 565.44 | 0.044671 | 0.000447 |
185 | 10357 | LILAS | 2012-11-19 | 10 | 24.8 | 30 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 595.20 | 0.047022 | 0.000470 |
188 | 10450 | VICTE | 2013-02-19 | 10 | 24.8 | 20 | 0.20 | Ikura | 8 | Seafood | Seaweed and fish | 396.80 | 0.031348 | 0.000313 |
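The four per-category splits above follow the same pattern, so they could also be built in one loop. The sketch below is not from the original notebook; it assumes `df` is the joined order/product/category dataframe built earlier and that the per-category frames come from filtering on `CategoryName`. The dictionary names are my own.
# Minimal consolidation sketch (assumes `df` from the earlier SQL join exists)
category_names = ["Meat/Poultry", "Grains/Cereals", "Produce", "Seafood"]
splits = {}
for name in category_names:
    df_cat = df.loc[df["CategoryName"] == name]
    splits[name] = {
        "no_disc": df_cat.loc[df_cat["Discount"] == 0],  # full-price orders
        "disc": df_cat.loc[df_cat["Discount"] > 0],      # discounted orders
    }
# Example: same view as df_seafood_disc.head(3) above
splits["Seafood"]["disc"].head(3)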
While we could simply ask whether the average revenue per order is higher for discounted than for non-discounted products in each product category, it is more informative to look at how the level of discount relates to average revenue per order. We already know from question 1 that discounts increase revenues overall relative to no discounts, and from question 2 that discounts in the 5-9.9% range maximize revenues summed across all categories. I'd like to find out whether the result we see across all categories holds within each category.
Originally, I considered dividing each category into discount levels of 5%, 10%, 15%, 20%, and more than 20%. In many cases, however, the sample sizes at these levels are small (often n < 30). We can get larger samples by consolidating the discount levels into two groups, 1-14.9% and 15%+, as sketched below.
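One way to build those consolidated bins is sketched here (not from the original notebook); it assumes the joined dataframe `df` with `Discount` stored as a fraction (0.05 = 5%), and the bin labels are hypothetical names of my own.
# Hypothetical binning sketch: 0 = no discount, (0, 0.15) = 1-14.9%, >= 0.15 = 15%+
import numpy as np
conditions = [
    df["Discount"] == 0,
    (df["Discount"] > 0) & (df["Discount"] < 0.15),
    df["Discount"] >= 0.15,
]
labels = ["no_discount", "disc_1_to_14.9", "disc_15_plus"]
df["DiscountBin"] = np.select(conditions, labels)
# Sanity check: sample sizes per category and bin (looking for n >= 30)
df.groupby(["CategoryName", "DiscountBin"]).size()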
Depending on the findings (and time permitting), it would also be worthwhile to compare product categories to look for trends or differences in how discounts are applied by product.
- Null hypothesis (Ho): There is no significant difference among revenues per order from non-discounted products in a category, products discounted 1-14.9%, and products discounted 15% and up.
  - Ho: $\mu_1 = \mu_2 = \mu_3 = ... = \mu_i$
- Alternative hypothesis (Ha): Differences among revenues per order from non-discounted products in a category, products discounted 1-14.9%, and products discounted 15% and up are unlikely to be due to chance.
  - Ha: $\mu_i \neq \mu_j$ for at least one pair of discount groups $(i, j)$
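As an illustration only (not the notebook's actual test code), a one-way ANOVA across the three discount groups for a single category could be run as sketched below; the choice of Meat/Poultry and the group variable names are assumptions for the example.
# Illustrative one-way ANOVA sketch for one category (Meat/Poultry assumed here)
from scipy import stats
df_cat = df.loc[df["CategoryName"] == "Meat/Poultry"]
grp_none = df_cat.loc[df_cat["Discount"] == 0, "Revenue"]
grp_low  = df_cat.loc[(df_cat["Discount"] > 0) & (df_cat["Discount"] < 0.15), "Revenue"]
grp_high = df_cat.loc[df_cat["Discount"] >= 0.15, "Revenue"]
f_stat, p_value = stats.f_oneway(grp_none, grp_low, grp_high)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # reject Ho if p < alpha (e.g., 0.05)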