Exploring price elasticity

Price elasticity is the responsiveness of sales to changes in price. This is an initial exploration of demand/recommendation modelling given price and other variables, extendable to elasticity questions that help drive pricing strategies.

Pat Reen https://www.linkedin.com/in/patrick-reen/
2023-06-20

Background and application

…How might we predict changes in product sales patterns based upon price? Impact on advisor recommendations?…

Price elasticity is the responsiveness of sales to changes in price and price elasticity modelling helps drive pricing strategies.
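
For reference, own-price elasticity is usually expressed as the percentage change in quantity divided by the percentage change in price. A minimal sketch of a point estimate from a single observed price change (illustrative values only):

# point estimate of own-price elasticity from a single price change
def price_elasticity(q0, q1, p0, p1):
    """Percentage change in quantity divided by percentage change in price."""
    return ((q1 - q0) / q0) / ((p1 - p0) / p0)

# e.g. a 10% price increase that reduces sales by 15% implies an elasticity of -1.5
print(price_elasticity(q0=100, q1=85, p0=100, p1=110))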

Factors such as age and occupation are likely to impact price elasticity. Insurer-specific factors, or external factors influencing advisers, might be present but hidden/not explicit in the data. There might also be a lag in the effect of price changes on recommendations.

Price elasticity of sales shares similarities with lapse rate modelling, although there the lag/inertia effect might be more stark.
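
If lag effects turn out to be material, one simple extension is to add lagged price terms as additional predictors. A minimal sketch, assuming a monthly panel df keyed by product, age group and occupation (as generated below); the price_lag column names are illustrative:

# assumes one row per product/age_group/occupation/month
df = df.sort_values(['product', 'age_group', 'occupation', 'date'])
group_cols = ['product', 'age_group', 'occupation']
df['price_lag1'] = df.groupby(group_cols)['price'].shift(1)  # price one month earlier
df['price_lag3'] = df.groupby(group_cols)['price'].shift(3)  # price three months earlier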

This initial exploration looks at modelling demand/recommendations given price and other explanatory variables, and is extendable to elasticity questions.

Further reading

A few articles of interest:

Libraries

Setting up the environment as well as a list of some of the packages used in the recipe.

# calling python from r
library(reticulate) 

# create environment if it does not exist
# conda_create("r-reticulate") 
# py_config() # to check configuration

# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks

# if not already installed, install the below. if env specified, can drop envname parameter

# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# py_install("scipy",envname = "r-reticulate")
# py_install("matplotlib",envname = "r-reticulate")
# py_install("seaborn",envname = "r-reticulate")
# py_install("scikit-learn",envname = "r-reticulate")
# py_install("linearmodels",envname = "r-reticulate")
# py_install("tabulate",envname = "r-reticulate")
# py_install("rpy2",envname = "r-reticulate")
 

# ...etc

Libraries:


# import some standard libraries
import pandas as pd
import numpy as np
from tabulate import tabulate
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from linearmodels.panel import PanelOLS

Data

Hypothetical data - generate

Creating data split into 3 age groups, 3 occupations and 3 products/insurers. Defining an initial price and assuming the initial recommendations are split roughly equally across products (in spite of initial price differences, implying a hidden preference not captured in the data, which may show up as a product-level effect in the model). A series of price changes is defined, along with their impact on the level of recommendations. A seasonal pattern is applied to the recommendations (peaking later in the year) and a random noise term is added.

Function to apply elasticity

Show code
def adjust_recommendations(row):
    # get the product, age and occupation specific elasticities
    product_elasticity = product_elasticities[row['product']]
    age_elasticity = age_elasticities[row['age_group']]
    occupation_elasticity = occupation_elasticities[row['occupation']]
    
    # calculate the combined elasticity
    # (note: product_elasticity is looked up above but not applied here; multiply it
    # in below if product-level elasticity should also feed the simulated data)
    combined_elasticity = age_elasticity * occupation_elasticity
    
    # calculate the percentage change in price
    price_change = (row['price'] - row['base_price']) / row['base_price']
    
    # adjust the recommendations based on the price change and elasticity
    adjusted_recommendations = row['recommendations'] * (1 - price_change * combined_elasticity)
    
    return adjusted_recommendations

Data generation

Show code
# Define date range
date_range = pd.date_range(start='2019-01-01', end='2024-06-30', freq='M', normalize=True)

# Define age groups, occupations and products
age_groups = ['20-35', '36-45', '46+']
occupations = ['LIGHT BLUE', 'HEAVY BLUE', 'WHITE COLLAR']
products = ['Product1', 'Product2', 'Product3']

# Define baseline price for each product
price = {'Product1': 100.00, 'Product2': 100.00, 'Product3': 100.00}

# Create initial dataframe
data = []
for date in date_range:
    for product in products:
        for age_group in age_groups:
            for occupation in occupations:
                row = {
                    'product': product,
                    'date': date.date(),
                    'age_group': age_group,
                    'occupation': occupation,
                    'base_price': price[product],
                    'price': price[product],
                    'price_changed': 0,
                    'change_direction': 'no change' 
                }
                data.append(row)


df = pd.DataFrame(data)

# Apply price changes
price_changes = [
    # (product, age_group, occupation, date, price multiplier)
    ('Product1', '46+', None, pd.to_datetime('2019-07-01').date(), 1.15),
    ('Product3', None, 'HEAVY BLUE', pd.to_datetime('2019-09-01').date(), 1.05),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product2', '46+', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product3', None, 'LIGHT BLUE', pd.to_datetime('2020-03-01').date(), 0.9),
    ('Product2', '46+', 'WHITE COLLAR', pd.to_datetime('2020-12-01').date(), 1.3),
    ('Product2', '20-35', None, pd.to_datetime('2021-04-01').date(), 0.8),
    ('Product1', None, 'HEAVY BLUE', pd.to_datetime('2021-12-01').date(), 1.05),
    ('Product1', '20-35', 'WHITE COLLAR', pd.to_datetime('2022-03-01').date(), 1.2),
    ('Product3', '46+', 'HEAVY BLUE', pd.to_datetime('2022-06-01').date(), 0.85),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2022-09-01').date(), 1.1),
    ('Product1', '36-45', 'LIGHT BLUE', pd.to_datetime('2023-01-01').date(), 1.05),
    ('Product2', '20-35', 'HEAVY BLUE', pd.to_datetime('2023-04-01').date(), 1.3),
    ('Product3', '46+', 'WHITE COLLAR', pd.to_datetime('2023-07-01').date(), 0.9),
    ('Product1', '20-35', None, pd.to_datetime('2023-10-01').date(), 1.1),
    ('Product2', None, 'HEAVY BLUE', pd.to_datetime('2024-01-01').date(), 0.95),
    ('Product3', '36-45', 'WHITE COLLAR', pd.to_datetime('2024-04-01').date(), 1.15),
]

# Apply price changes
for product, age_group, occupation, date, multiplier in price_changes:
    
    mask = (df['product'] == product) & (df['date'] >= date)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)
    
    df.loc[mask, 'price'] *= multiplier

# Flag price change
for product, age_group, occupation, date, multiplier in price_changes:
    
    mask = (df['product'] == product) & (df['date'].apply(lambda x: x.year)== date.year) & (df['date'].apply(lambda x: x.month) == date.month)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)
        
    df.loc[mask, 'price_changed'] = 1 
    df.loc[mask, 'change_direction'] = 'up' if multiplier > 1 else 'down'
    
# Ensure that each product, age, occ gets roughly equal recommendations initially
df['recommendations'] = 100 
    
# Noise terms
# Monthly seasonal pattern
np.random.seed(0)  # for reproducibility
mu, sigma = 10, 5  # location and spread of the seasonal pattern (peaks at month 10)
s = np.random.normal(mu, sigma, df.shape[0])  # random draws (not used directly below)
df['month'] = df['date'].apply(lambda x: x.month)
df['month_norm'] = 1 + norm.pdf(df['month'], mu, sigma)  # monthly seasonal uplift
df['recommendations'] = df['recommendations'] * df['month_norm']

# Random 
noise_factor = 1 + (np.random.normal(0, 0.02, df.shape[0])) 
df['recommendations'] *= noise_factor

# Apply price changes and adjust recommendations based on price elasticity
product_elasticities = {'Product1': 1.1, 'Product2': 1.1, 'Product3': 1}
age_elasticities = {'20-35': 1.5, '36-45': 1.2, '46+': 1.1}
occupation_elasticities = {'LIGHT BLUE': 1.2, 'HEAVY BLUE': 1.05, 'WHITE COLLAR': 1.05}

df['recommendations'] = df.apply(adjust_recommendations, axis=1)
# Round the recommendations to nearest integer as we're dealing with counts
df['recommendations'] = df['recommendations'].round().astype(int)

# We know that the data has seasonality, normalising this for later visuals:

# Calculate monthly averages
monthly_averages = df.groupby('month')['recommendations'].mean().reset_index()
monthly_averages.columns = ['month', 'avg_recommendations']

# Merge and norm
df = df.merge(monthly_averages, on='month', how='left')
df['normalized_recommendations'] = df['recommendations'] / df['avg_recommendations']

df.head()
    product        date  ... avg_recommendations normalized_recommendations
0  Product1  2019-01-31  ...           99.345679                   0.986455
1  Product1  2019-01-31  ...           99.345679                   1.016652
2  Product1  2019-01-31  ...           99.345679                   1.006586
3  Product1  2019-01-31  ...           99.345679                   1.026718
4  Product1  2019-01-31  ...           99.345679                   0.976389

[5 rows x 13 columns]

Exploratory plots

Show code
# Convert python df to r for plotting
df <- py$df

# Convert the date 
df$date <- as.Date(sapply(df$date, as.character))

# Plot
p <- ggplot(df, aes(x = date, y = price, color = product)) +
  geom_line() +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Price change history") +
  theme_minimal() +
  ylim(80, 150) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))  # This line makes the x-axis labels diagonal

print(p)

The recommendations plots below show the seasonality; arrows reflect the timing and direction of the price changes, and there is a fair amount of noise in the recommendations:

Show code
p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
  geom_line(size=0.5) +
  
  # Plot
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = recommendations, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Recommendations history") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

print(p)

From our inputs we know we have overlaid a seasonality factor:

Show code
# Calculate average month_norm across groups
df_avg <- df %>%
  group_by(date) %>%
  summarize(avg_month_norm = mean(month_norm))

head(df_avg)
# A tibble: 6 × 2
  date       avg_month_norm
  <date>              <dbl>
1 2019-01-31           1.02
2 2019-02-28           1.02
3 2019-03-31           1.03
4 2019-04-30           1.04
5 2019-05-31           1.05
6 2019-06-30           1.06
Show code
# Plot
p <- ggplot(df_avg, aes(x = date, y = avg_month_norm)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Monthly seasonality (data param)") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) 

print(p)
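
As a quick check on the seasonal parameter, the size and timing of the peak can be read straight off the normal density used in the data generation (a sketch, in Python):

from scipy.stats import norm

# month_norm = 1 + pdf(month; mu=10, sigma=5), so the pattern peaks at month 10
mu, sigma = 10, 5
print(round(1 + norm.pdf(mu, mu, sigma), 3))  # peak uplift, ~1.08 (October)
print(round(1 + norm.pdf(1, mu, sigma), 3))   # January uplift, ~1.016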

Taking a 3-month moving average reduces the noise, and some correlation with the price changes is visible:

Show code
# Calculate the 3-month moving average
df <- df %>% 
  group_by(age_group, occupation, product) %>%
  arrange(date) %>%
  mutate(moving_avg = zoo::rollmean(recommendations, k = 3, align = "right", fill = NA))

p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
  
  # Adding a 3-month moving average line
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  geom_line(aes(y = moving_avg), linetype = "solid", size = 0.5) +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Recommendations history with 3-Month moving average") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

print(p)

Removing the seasonality by standardising the recommendations by their monthly averages, retaining the 3-month moving average:

Show code
# Add moving average based upon normalised recommendations
df <- df %>% 
  group_by(age_group, occupation, product) %>%
  arrange(date) %>%
  mutate(moving_avg = zoo::rollmean(normalized_recommendations, k = 3, align = "right", fill = NA))

# Plot the normalized data
p <- ggplot(df, aes(x = date, y = moving_avg, color = product)) +
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  geom_line() +
  facet_grid(age_group ~ occupation) +
  labs(title = "Normalized recommendations history") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p)

From our inputs, the price elasticity factors are:

Show code
# Convert the dictionaries to lists of tuples
product_data = list(product_elasticities.items())
age_data = list(age_elasticities.items())
occupation_data = list(occupation_elasticities.items())

# Print the tables
print("Product Elasticities:")
Product Elasticities:
Show code
print(tabulate(product_data, headers=["Product", "Elasticity"], tablefmt="grid"))
+-----------+--------------+
| Product   |   Elasticity |
+===========+==============+
| Product1  |          1.1 |
+-----------+--------------+
| Product2  |          1.1 |
+-----------+--------------+
| Product3  |          1   |
+-----------+--------------+
Show code
print("\nAge Elasticities:")

Age Elasticities:
Show code
print(tabulate(age_data, headers=["Age Group", "Elasticity"], tablefmt="grid"))
+-------------+--------------+
| Age Group   |   Elasticity |
+=============+==============+
| 20-35       |          1.5 |
+-------------+--------------+
| 36-45       |          1.2 |
+-------------+--------------+
| 46+         |          1.1 |
+-------------+--------------+
Show code
print("\nOccupation Elasticities:")

Occupation Elasticities:
Show code
print(tabulate(occupation_data, headers=["Occupation", "Elasticity"], tablefmt="grid"))
+--------------+--------------+
| Occupation   |   Elasticity |
+==============+==============+
| LIGHT BLUE   |         1.2  |
+--------------+--------------+
| HEAVY BLUE   |         1.05 |
+--------------+--------------+
| WHITE COLLAR |         1.05 |
+--------------+--------------+

Looking at the correlations between price and recommendations, we see a stronger negative correlation for Products 1 and 2, as expected. The table shows correlations against both the base recommendations and the recommendations normalised for seasonality:

Show code
# Subset the data 
subset_df = df[['product', 'price', 'normalized_recommendations', 'recommendations']]

# Group the data by product and calculate the correlations
correlations = subset_df.groupby('product').apply(lambda x: pd.Series({
    'Corr Price v Norm Recommend': x['price'].corr(x['normalized_recommendations']),
    'Corr Price v Recommend': x['price'].corr(x['recommendations'])
}))

# Round the correlations to 3 decimal points
correlations = correlations.round(3)

# Reset the index and rename the columns
correlation_table = correlations.reset_index().rename(columns={'level_1': 'Correlation Type'})

# Display the correlation table
print(correlation_table)
    product  Corr Price v Norm Recommend  Corr Price v Recommend
0  Product1                       -0.970                  -0.953
1  Product2                       -0.984                  -0.977
2  Product3                       -0.957                  -0.929

Normalising for seasonality, we see a stronger negative correlation for younger ages (<36), but also stronger correlations than expected for older ages given the inputs:

Show code
# Subset the data 
subset_df = df[['product', 'age_group', 'price', 'normalized_recommendations']]

# Group the data by product and age group and calculate the correlations
correlations = subset_df.groupby(['product', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()

# Round the correlations to 3 decimal points
correlations = correlations.round(3)

# Display the correlation table
print(correlations)
age_group  20-35  36-45    46+
product                       
Product1  -0.991 -0.789 -0.945
Product2  -0.986 -0.983 -0.991
Product3  -0.980 -0.969 -0.956

Data gets a bit sparser when we add in all of product, age and occupation, but we do largely see a stronger negative correlation for light blue as expected. Cells where the price never changes have no price variance, so the correlation is undefined (NaN):

Show code
# Subset the data for the desired columns
subset_df = df[['product', 'occupation', 'age_group', 'price', 'normalized_recommendations']]

# Create an empty dictionary to store the correlation tables
correlation_tables = {}

# Iterate over each product
for product in subset_df['product'].unique():
    product_df = subset_df[subset_df['product'] == product]
    
    # Group the data by 'occupation' and 'age_group' and calculate the correlations
    correlations = product_df.groupby(['occupation', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()
    
    # Round the correlations to 3 decimal points
    correlations = correlations.round(3)
    
    # Store the correlation table in the dictionary
    correlation_tables[product] = correlations

# Display the correlation tables
for product, correlations in correlation_tables.items():
    print(f"Correlation table for {product}")
    print(correlations)
    print()
Correlation table for Product1
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.974 -0.813 -0.968
LIGHT BLUE   -0.952 -0.826 -0.958
WHITE COLLAR -0.997    NaN -0.930

Correlation table for Product2
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.992 -0.631 -0.699
LIGHT BLUE   -0.992 -0.992 -0.973
WHITE COLLAR -0.987    NaN -0.996

Correlation table for Product3
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.864 -0.750 -0.967
LIGHT BLUE   -0.958 -0.932 -0.935
WHITE COLLAR    NaN -0.865 -0.906

Models

Model variations

Initially we are modelling recommendations as explained by price and other explanatory variables. The models considered are a Generalized Linear Model (GLM), a Gradient Boosting Regressor (GB) and a Fixed Effects (FE) panel model:

GLM

To start with, let’s model the normalised recommendations. We see that month is not significant; the coefficients on the price interaction terms can be read directly as price sensitivities, and there is little differentiation by age, occupation or product.

Show code
# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors 
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
        
df_encoded = df_encoded.astype(float)

# Define target and predictors 
X = df_encoded[['month'] + [col for col in df_encoded if ':price' in col]]
Y = df['normalized_recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create Gaussian GLM (identity link)
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)

# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 0.0008
Show code
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9588
Show code
# Summary
glm_results.summary()
Generalized Linear Model Regression Results
Dep. Variable: normalized_recommendations No. Observations: 1425
Model: GLM Df Residuals: 1416
Model Family: Gaussian Df Model: 8
Link Function: Identity Scale: 0.00087058
Method: IRLS Log-Likelihood: 3008.5
Date: Wed, 26 Jul 2023 Deviance: 1.2234
Time: 13:36:11 Pearson chi2: 1.22
No. Iterations: 3 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 2.4810 0.009 268.747 0.000 2.463 2.499
month 0.0003 0.000 1.168 0.243 -0.000 0.001
product_Product1:price -0.0049 2.8e-05 -173.554 0.000 -0.005 -0.005
product_Product2:price -0.0048 3.2e-05 -149.704 0.000 -0.005 -0.005
product_Product3:price -0.0049 3.53e-05 -138.976 0.000 -0.005 -0.005
age_group_20-35:price -0.0049 3.52e-05 -138.179 0.000 -0.005 -0.005
age_group_36-45:price -0.0049 3.21e-05 -153.434 0.000 -0.005 -0.005
age_group_46+:price -0.0048 2.8e-05 -170.282 0.000 -0.005 -0.005
occupation_HEAVY BLUE:price -0.0049 3.15e-05 -154.921 0.000 -0.005 -0.005
occupation_LIGHT BLUE:price -0.0048 3.33e-05 -145.694 0.000 -0.005 -0.005
occupation_WHITE COLLAR:price -0.0048 3.04e-05 -158.408 0.000 -0.005 -0.005
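
To translate these coefficients into an approximate elasticity, note that for a given policy the price enters through three interaction terms (one each for product, age group and occupation), so the total price effect is the sum of the three relevant coefficients. At the baseline price of 100, with normalised recommendations close to 1, a rough point elasticity is (a sketch using the fitted values above):

# approximate point elasticity implied by the linear (identity link) GLM
beta_product = -0.0049   # e.g. product_Product1:price
beta_age = -0.0049       # e.g. age_group_20-35:price
beta_occ = -0.0049       # e.g. occupation_HEAVY BLUE:price

dq_dp = beta_product + beta_age + beta_occ   # change in normalised recs per unit of price
elasticity = dq_dp * 100 / 1.0               # scale by baseline price / baseline volume
print(round(elasticity, 2))                  # roughly -1.5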

Modifying this to model the recommendation counts directly, with month entering as a polynomial:

Show code
# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors 
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Add polynomial term for the month
df_encoded['month_squared'] = df_encoded['month']**2

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
        
df_encoded = df_encoded.astype(float)

# Define target and predictors 
X = df_encoded[['month', 'month_squared'] + [col for col in df_encoded if ':price' in col]]
Y = df['recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create Gaussian GLM (identity link)
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)

# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 8.9030
Show code
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9574
Show code
# Summary
glm_results.summary()
Generalized Linear Model Regression Results
Dep. Variable: recommendations No. Observations: 1425
Model: GLM Df Residuals: 1415
Model Family: Gaussian Df Model: 9
Link Function: Identity Scale: 9.2575
Method: IRLS Log-Likelihood: -3602.7
Date: Wed, 26 Jul 2023 Deviance: 13101.
Time: 13:36:12 Pearson chi2: 1.31e+04
No. Iterations: 3 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 250.7090 0.983 254.945 0.000 248.782 252.636
month 1.3797 0.100 13.787 0.000 1.184 1.576
month_squared -0.0587 0.008 -7.734 0.000 -0.074 -0.044
product_Product1:price -0.5012 0.003 -173.719 0.000 -0.507 -0.496
product_Product2:price -0.4943 0.003 -149.876 0.000 -0.501 -0.488
product_Product3:price -0.5070 0.004 -139.169 0.000 -0.514 -0.500
age_group_20-35:price -0.5018 0.004 -138.316 0.000 -0.509 -0.495
age_group_36-45:price -0.5091 0.003 -153.678 0.000 -0.516 -0.503
age_group_46+:price -0.4915 0.003 -170.436 0.000 -0.497 -0.486
occupation_HEAVY BLUE:price -0.5039 0.003 -155.091 0.000 -0.510 -0.498
occupation_LIGHT BLUE:price -0.5006 0.003 -145.874 0.000 -0.507 -0.494
occupation_WHITE COLLAR:price -0.4980 0.003 -158.591 0.000 -0.504 -0.492
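
As a quick check on the fitted seasonal shape, the month and month_squared coefficients imply a quadratic in month whose peak sits at -β_month / (2 β_month_squared) (a sketch using the fitted values above):

# vertex of the fitted quadratic seasonal term b1*month + b2*month**2
b1, b2 = 1.3797, -0.0587
print(round(-b1 / (2 * b2), 1))  # ~11.8, i.e. a peak late in the year, broadly in line with the seasonal input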

GB

Show code
# Create Gradient Boosting model
gb_model = GradientBoostingRegressor(random_state=42)

# Train model
gb_model.fit(X_train, y_train)
GradientBoostingRegressor(random_state=42)
Show code
# Use the model to make predictions
gb_predictions = gb_model.predict(X_test)

# Calculate metrics
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_r2 = r2_score(y_test, gb_predictions)

print(f'Gradient Boosting MSE: {gb_mse:.4f}')
Gradient Boosting MSE: 5.2963
Show code
print(f'Gradient Boosting R^2: {gb_r2:.4f}')
Gradient Boosting R^2: 0.9747
Show code
# Get feature importances
importance = gb_model.feature_importances_

# Create a DataFrame to display the feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the top features
print(feature_importance_df)
                          Feature  Importance
6           age_group_20-35:price    0.354780
8             age_group_46+:price    0.270391
7           age_group_36-45:price    0.124483
4          product_Product2:price    0.094404
3          product_Product1:price    0.074612
10    occupation_LIGHT BLUE:price    0.025876
1                           month    0.014360
5          product_Product3:price    0.014010
9     occupation_HEAVY BLUE:price    0.011436
11  occupation_WHITE COLLAR:price    0.008229
2                   month_squared    0.007420
0                           const    0.000000

Extracting implied elasticities from the fitted model by perturbing each price interaction term (a 10% price increase from a baseline of 100) and measuring the relative change in the prediction:

Show code
def compute_elasticity(model, feature_name):
    # Create a dummy row of zeros
    dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
    
    # If the feature ends with ":price", then it's a continuous feature.
    if ":price" in feature_name:
        dummy_data[feature_name] = 100  # Setting initial price to a standard value
        original_pred = model.predict(dummy_data)
        
        # Simulate a 10% increase
        dummy_data[feature_name] = 110  
        new_pred = model.predict(dummy_data)
        
    else:  # It's a binary/categorical feature
        dummy_data[feature_name] = 1
        original_pred = model.predict(dummy_data)
        
        # Toggle its value
        dummy_data[feature_name] = 0
        new_pred = model.predict(dummy_data)
        
    percent_change_qty = (new_pred - original_pred) / original_pred
    elasticity = percent_change_qty / 0.1 
    
    return round(elasticity[0],4)  # Extract value from the array

feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

# Create a dictionary to store computed elasticities
elasticities = {}

# For each feature of interest, compute the elasticity
for feature in feature_names:
    elasticities[feature] = compute_elasticity(gb_model, feature)

print(elasticities)
{'const': 0.0, 'product_Product1:price': -0.7521, 'product_Product2:price': -0.3182, 'product_Product3:price': -0.2497, 'age_group_20-35:price': -0.8312, 'age_group_36-45:price': -0.815, 'age_group_46+:price': -1.1606, 'occupation_HEAVY BLUE:price': 0.0, 'occupation_LIGHT BLUE:price': -0.1401, 'occupation_WHITE COLLAR:price': 0.0078}

Looking at a range of price change sensitivities:

Show code
def compute_elasticity_for_changes(model, feature_name, changes):
  
  # This dictionary will store elasticities for various changes
  results = {}
  
  # Create a dummy row of zeros
  dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
  
  for change in changes:
    
    # Avoid issues with 0% change
    if change == 0:
        results['0%'] = np.nan
        continue
          
    # If the feature ends with ":price", then it's a continuous feature.
    if ":price" in feature_name:
      dummy_data[feature_name] = 100  # Setting initial price to a standard value
      original_pred = model.predict(dummy_data)[0]  # Extract the scalar
      
      # Simulate a change
      dummy_data[feature_name] = 100 + 100 * change 
      new_pred = model.predict(dummy_data)[0]  # Extract the scalar
    
    else:  # It's a binary/categorical feature
      dummy_data[feature_name] = 1
      original_pred = model.predict(dummy_data)[0]  # Extract the scalar
      
      # Toggle its value
      dummy_data[feature_name] = 0
      new_pred = model.predict(dummy_data)[0]  # Extract the scalar
    
    percent_change_qty = (new_pred - original_pred) / original_pred
    elasticity = round(percent_change_qty / change, 4)
    
    results[str(int(change * 100)) + '%'] = elasticity
  
  return results

# Define a continuous range of sensitivities from -50% to 50% in 1% increments
changes = [i/100 for i in range(-50, 51)]
feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

elasticity_table = pd.DataFrame(index=feature_names)

for feature in feature_names:
    elasticities = compute_elasticity_for_changes(gb_model, feature, changes)
    for key, value in elasticities.items():
        elasticity_table.at[feature, key] = value

print(elasticity_table)
                                 -50%    -49%    -48%  ...     48%     49%     50%
const                         -0.0000 -0.0000 -0.0000  ...  0.0000  0.0000  0.0000
product_Product1:price        -0.0256 -0.0000 -0.0000  ... -0.1769 -0.1733 -0.1698
product_Product2:price        -0.0978 -0.0998 -0.1019  ... -0.1778 -0.1741 -0.1707
product_Product3:price        -0.0236 -0.0241 -0.0246  ... -0.0535 -0.0525 -0.0514
age_group_20-35:price         -0.4954 -0.5055 -0.5160  ... -0.2911 -0.2851 -0.2794
age_group_36-45:price         -0.1829 -0.1867 -0.1906  ... -0.2840 -0.2782 -0.2726
age_group_46+:price           -0.1886 -0.1925 -0.1965  ... -0.4674 -0.4579 -0.4487
occupation_HEAVY BLUE:price   -0.0214 -0.0219 -0.0223  ... -0.0183 -0.0179 -0.0175
occupation_LIGHT BLUE:price   -0.0977 -0.0997 -0.1018  ... -0.0636 -0.0623 -0.0611
occupation_WHITE COLLAR:price -0.0258 -0.0263 -0.0269  ... -0.0182 -0.0178 -0.0175

[10 rows x 99 columns]

Plotting the elasticities:

Show code
# Adjust the names in the dictionaries to match the format in the dataframe
product_elasticities <- list('product_Product1:price' = 1.1, 'product_Product2:price' = 1.1, 'product_Product3:price' = 1)
age_elasticities <- list('age_group_20-35:price' = 1.5, 'age_group_36-45:price' = 1.2, 'age_group_46+:price' = 1.1)
occupation_elasticities <- list('occupation_LIGHT BLUE:price' = 1.2, 'occupation_HEAVY BLUE:price' = 1.05, 'occupation_WHITE COLLAR:price' = 1.05)

# Combine all the dictionaries into a data frame
original_elasticities <- data.frame(
  feature = c(names(product_elasticities), names(age_elasticities), names(occupation_elasticities)),
  elasticity = c(unlist(product_elasticities), unlist(age_elasticities), unlist(occupation_elasticities))
)

# Flip the sign so the data-generation elasticities are comparable with the modelled (negative) elasticities
original_elasticities$elasticity <- original_elasticities$elasticity * (-1)

# Convert python df to r for plotting
elasticity_table <- py$elasticity_table

# Convert data to long format
long_data <- elasticity_table %>%
  rownames_to_column(var = "feature") %>%
  gather(key = "change", value = "elasticity", -feature)

# Extract numerical values from the "change" column for sizing
long_data$change_num <- as.numeric(gsub("%", "", long_data$change))/100

# Plot the data
ggplot(long_data, aes(x = elasticity, y = feature)) +
  geom_point(aes(size = change_num), color = "steelblue") +  # single color and size is based on change_num
  geom_point(data = original_elasticities, aes(x = elasticity, y = feature), color = "red", size = 4) +  # Data points for original elasticities
  geom_vline(aes(xintercept = 0), linetype = "dashed", color = "grey50") +  # line at x=0 for reference
  labs(title = "Elasticities by Feature for Different Percentage Changes",
       x = "Elasticity",
       y = "Feature") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(size = guide_legend(title = "Price Change"))

FE

Show code
# Convert to panel data
panel_data = df.set_index(['product', 'date'])

# Define exogenous variables
exog = panel_data[['age_group', 'occupation', 'price']]
exog = sm.add_constant(exog, has_constant='add')  # Add a constant term

# Define endogenous variable
endog = panel_data['normalized_recommendations']

# Create a model with fixed effects
fe_model = PanelOLS(endog, exog, entity_effects=True, drop_absorbed=True)

# Fit the model
fe_results = fe_model.fit()

# Use the model to make predictions
fe_predictions = fe_results.predict(exog)

# Calculate evaluation metrics
fe_mse = mean_squared_error(endog, fe_predictions)
fe_r2 = r2_score(endog, fe_predictions)

# Print evaluation metrics
print(f"FE Model MSE: {fe_mse:.4f}")
FE Model MSE: 0.0009
Show code
print(f"FE Model R-squared: {fe_r2:.4f}")
FE Model R-squared: 0.9594
Show code
# Print model summary
print(fe_results)
                              PanelOLS Estimation Summary                               
========================================================================================
Dep. Variable:     normalized_recommendations   R-squared:                        0.9547
Estimator:                           PanelOLS   R-squared (Between):              0.9917
No. Observations:                        1782   R-squared (Within):               0.9547
Date:                        Wed, Jul 26 2023   R-squared (Overall):              0.9594
Time:                                13:36:21   Log-likelihood                    3765.5
Cov. Estimator:                    Unadjusted                                           
                                                F-statistic:                      7470.5
Entities:                                   3   P-value                           0.0000
Avg Obs:                               594.00   Distribution:                  F(5,1774)
Min Obs:                               594.00                                           
Max Obs:                               594.00   F-statistic (robust):             7470.5
                                                P-value                           0.0000
Time periods:                              66   Distribution:                  F(5,1774)
Avg Obs:                               27.000                                           
Min Obs:                               27.000                                           
Max Obs:                               27.000                                           
                                                                                        
                                    Parameter Estimates                                    
===========================================================================================
                         Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------------------
const                       2.4563     0.0081     304.18     0.0000      2.4405      2.4722
age_group.36-45            -0.0077     0.0017    -4.4776     0.0000     -0.0111     -0.0043
age_group.46+               0.0062     0.0018     3.3776     0.0007      0.0026      0.0098
occupation.LIGHT BLUE       0.0053     0.0017     3.1193     0.0018      0.0020      0.0087
occupation.WHITE COLLAR     0.0043     0.0017     2.5084     0.0122      0.0009      0.0076
price                      -0.0143  8.051e-05    -177.81     0.0000     -0.0145     -0.0142
===========================================================================================

F-test for Poolability: 23.944
P-value: 0.0000
Distribution: F(2,1774)

Included effects: Entity
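
The FE price coefficient can be read the same way as the GLM interactions: with normalised recommendations near 1 and a baseline price of 100, it implies a point elasticity of roughly -1.4 (a sketch using the fitted value above):

# point elasticity implied by the fixed effects price coefficient
beta_price = -0.0143
print(round(beta_price * 100 / 1.0, 2))  # ~ -1.43 at the baseline price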

Next steps