Background and application

…How might we predict changes in product sales patterns based upon price? Impact on advisor recommendations?…

Price elasticity is the responsiveness of sales to changes in price and price elasticity modelling helps drive pricing strategies.

Factors such as age and occupation are likely to impact price elasticity. Insurer specific factors or external factors influencing advisers might be present but hidden/ not explicit in the data. There might be a lag in the effect of price changes on recommendations.

Price elasticity of sales shares similarities with lapse rate modelling, but there the lag effect/ inertia effect might be more stark.

Initial exploration looking at demand/ recommendations modelling given price and other variables. Extendable to elasticity questions.

Libraries

Setting up the environment as well as a list of some of the packages used in the recipe.

# calling python from r
library(reticulate) 

# create environment if does not exist
# conda_create("r-reticulate") 
# py_config() # to check configuration

# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks

# if not already installed, install the below. if env specified, can drop envname parameter

# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# py_install("scipy",envname = "r-reticulate")
# py_install("matplotlib",envname = "r-reticulate")
# py_install("seaborn",envname = "r-reticulate")
# py_install("scikit-learn",envname = "r-reticulate")
# py_install("linearmodels",envname = "r-reticulate")
# py_install("tabulate",envname = "r-reticulate")
# py_install("rpy2",envname = "r-reticulate")
 

# ...etc

Libraries:

# r libraries
library(dplyr)
library(ggplot2)
library(tidyverse)
library(scales)
library(lubridate)


# import some standard libraries
import pandas as pd
import numpy as np
from tabulate import tabulate
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from linearmodels.panel import PanelOLS

Data

Hypothetical data - generate

Creating data split into 3 age groups, 3 occupations and 3 products/ insurers. Defining some initial price and assumping the initial recommendations are split roughly equally across products (in spite of initial price differences - implying a hidden preference not captured in the data which will perhaps be a feature of the model by product). A series of price changes are defined as well as their impact on the level of recomendations. Recommendations are assumed to peak around the middle of the year and a random noise term is thrown in.

Function to apply elasticity

Show code

def adjust_recommendations(row):
    # get the product, age and occupation specific elasticities
    product_elasticity = product_elasticities[row['product']]
    age_elasticity = age_elasticities[row['age_group']]
    occupation_elasticity = occupation_elasticities[row['occupation']]
    
    # calculate the combined elasticity
    combined_elasticity = age_elasticity * occupation_elasticity
    
    # calculate the percentage change in price
    price_change = (row['price'] - row['base_price']) / row['base_price']
    
    # adjust the recommendations based on the price change and elasticity
    adjusted_recommendations = row['recommendations'] * (1 - price_change * combined_elasticity)
    
    return adjusted_recommendations

Data generation

Show code

# Define date range
date_range = pd.date_range(start='2019-01-01', end='2024-06-30', freq='M', normalize=True)

# Define age groups, occupations and products
age_groups = ['20-35', '36-45', '46+']
occupations = ['LIGHT BLUE', 'HEAVY BLUE', 'WHITE COLLAR']
products = ['Product1', 'Product2', 'Product3']

# Define baseline price for each product
price = {'Product1': 100.00, 'Product2': 100.00, 'Product3': 100.00}

# Create initial dataframe
data = []
for date in date_range:
    for product in products:
        for age_group in age_groups:
            for occupation in occupations:
                row = {
                    'product': product,
                    'date': date.date(),
                    'age_group': age_group,
                    'occupation': occupation,
                    'base_price': price[product],
                    'price': price[product],
                    'price_changed': 0,
                    'change_direction': 'no change' 
                }
                data.append(row)


df = pd.DataFrame(data)

# Apply price changes
price_changes = [
    # (product, age_group, occupation, date, price multiplier)
    ('Product1', '46+', None, pd.to_datetime('2019-07-01').date(), 1.15),
    ('Product3', None, 'HEAVY BLUE', pd.to_datetime('2019-09-01').date(), 1.05),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product2', '46+', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product3', None, 'LIGHT BLUE', pd.to_datetime('2020-03-01').date(), 0.9),
    ('Product2', '46+', 'WHITE COLLAR', pd.to_datetime('2020-12-01').date(), 1.3),
    ('Product2', '20-35', None, pd.to_datetime('2021-04-01').date(), 0.8),
    ('Product1', None, 'HEAVY BLUE', pd.to_datetime('2021-12-01').date(), 1.05),
    ('Product1', '20-35', 'WHITE COLLAR', pd.to_datetime('2022-03-01').date(), 1.2),
    ('Product3', '46+', 'HEAVY BLUE', pd.to_datetime('2022-06-01').date(), 0.85),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2022-09-01').date(), 1.1),
    ('Product1', '36-45', 'LIGHT BLUE', pd.to_datetime('2023-01-01').date(), 1.05),
    ('Product2', '20-35', 'HEAVY BLUE', pd.to_datetime('2023-04-01').date(), 1.3),
    ('Product3', '46+', 'WHITE COLLAR', pd.to_datetime('2023-07-01').date(), 0.9),
    ('Product1', '20-35', None, pd.to_datetime('2023-10-01').date(), 1.1),
    ('Product2', None, 'HEAVY BLUE', pd.to_datetime('2024-01-01').date(), 0.95),
    ('Product3', '36-45', 'WHITE COLLAR', pd.to_datetime('2024-04-01').date(), 1.15),
]

# Apply price changes
for product, age_group, occupation, date, multiplier in price_changes:
    
    mask = (df['product'] == product) & (df['date'] >= date)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)
    
    df.loc[mask, 'price'] *= multiplier

# Flag price change
for product, age_group, occupation, date, multiplier in price_changes:
    
    mask = (df['product'] == product) & (df['date'].apply(lambda x: x.year)== date.year) & (df['date'].apply(lambda x: x.month) == date.month)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)
        
    df.loc[mask, 'price_changed'] = 1 
    df.loc[mask, 'change_direction'] = 'up' if multiplier > 1 else 'down'
    
# Ensure that each product, age, occ gets roughly equal recommendations initially
df['recommendations'] = 100 
    
# Noise terms
# Monthly
np.random.seed(0)  # for reproducibility
mu, sigma = 10, 5  # mean and standard deviation for the normal distribution
s = np.random.normal(mu, sigma, df.shape[0])  # generate random numbers from the distribution
df['month'] = df['date'].apply(lambda x: x.month)
df['month_norm'] = 1 + norm.pdf(df['month'], mu, sigma)  # generate the monthly pattern
df['recommendations'] = df['recommendations'] *df['month_norm']

# Random 
noise_factor = 1 + (np.random.normal(0, 0.02, df.shape[0])) 
df['recommendations'] *= noise_factor

# Apply price changes and adjust recommendations based on price elasticity
product_elasticities = {'Product1': 1.1, 'Product2': 1.1, 'Product3': 1}
age_elasticities = {'20-35': 1.5, '36-45': 1.2, '46+': 1.1}
occupation_elasticities = {'LIGHT BLUE': 1.2, 'HEAVY BLUE': 1.05, 'WHITE COLLAR': 1.05}

df['recommendations'] = df.apply(adjust_recommendations, axis=1)
# Round the recommendations to nearest integer as we're dealing with counts
df['recommendations'] = df['recommendations'].round().astype(int)

# We know that the data has seasonality, normalising this for later visuals:

# Calculate monthly averages
monthly_averages = df.groupby('month')['recommendations'].mean().reset_index()
monthly_averages.columns = ['month', 'avg_recommendations']

# Merge and norm
df = df.merge(monthly_averages, on='month', how='left')
df['normalized_recommendations'] = df['recommendations'] / df['avg_recommendations']

df.head()

    product        date  ... avg_recommendations normalized_recommendations
0  Product1  2019-01-31  ...           99.345679                   0.986455
1  Product1  2019-01-31  ...           99.345679                   1.016652
2  Product1  2019-01-31  ...           99.345679                   1.006586
3  Product1  2019-01-31  ...           99.345679                   1.026718
4  Product1  2019-01-31  ...           99.345679                   0.976389

[5 rows x 13 columns]

Exploratory plots

Show code

# Convert python df to r for plotting
df <- py$df

# Convert the date 
df$date <- as.Date(sapply(df$date, as.character))

# Plot
p <- ggplot(df, aes(x = date, y = price, color = product)) +
  geom_line() +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Price change history") +
  theme_minimal() +
  ylim(80, 150) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))  # This line makes the x-axis labels diagonal

print(p)

Plots show the seasonality. Arrows reflect the timing and direction of the rate changes. There is a fair amount of noise in the recommendations:

Show code

p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
  geom_line(size=0.5) +
  
  # Plot
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = recommendations, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Recommendations history") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

print(p)

From our input we know we have overlayed a seasonality factor:

Show code

# Calculate average month_norm across groups
df_avg <- df %>%
  group_by(date) %>%
  summarize(avg_month_norm = mean(month_norm))

head(df_avg)

# A tibble: 6 × 2
  date       avg_month_norm
  <date>              <dbl>
1 2019-01-31           1.02
2 2019-02-28           1.02
3 2019-03-31           1.03
4 2019-04-30           1.04
5 2019-05-31           1.05
6 2019-06-30           1.06

Show code

# Plot
p <- ggplot(df_avg, aes(x = date, y = avg_month_norm)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Monthly seasonality (data param)") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) 

print(p)

Taking a 3 month moving average reduces the noise, and some correlation with the rate change visable:

Show code

# Calculate the 3-month moving average
df <- df %>% 
  group_by(age_group, occupation, product) %>%
  arrange(date) %>%
  mutate(moving_avg = zoo::rollmean(recommendations, k = 3, align = "right", fill = NA))

p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
  
  # Adding a 3-month moving average line
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  geom_line(aes(y = moving_avg), linetype = "solid", size = 0.5) +
  facet_grid(age_group ~ occupation) +
  scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
  labs(title = "Recommendations history with 3-Month moving average") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

print(p)

Removing the seasonality by standardising the monthly recommendations by their averages, retaining the 3 month moving average:

Show code

# Add moving average based upon normalised recommendations
df <- df %>% 
  group_by(age_group, occupation, product) %>%
  arrange(date) %>%
  mutate(moving_avg = zoo::rollmean(normalized_recommendations, k = 3, align = "right", fill = NA))

# Plot the normalized data
p <- ggplot(df, aes(x = date, y = moving_avg, color = product)) +
  geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
  scale_shape_manual(values = c(up = 24, down = 25)) +
  geom_line() +
  facet_grid(age_group ~ occupation) +
  labs(title = "Normalized recommendations history") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p)

From our inputs, the price elasticity factors are:

Show code

# Convert the dictionaries to lists of tuples
product_data = list(product_elasticities.items())
age_data = list(age_elasticities.items())
occupation_data = list(occupation_elasticities.items())

# Print the tables
print("Product Elasticities:")

Product Elasticities:

Show code

print(tabulate(product_data, headers=["Product", "Elasticity"], tablefmt="grid"))

+-----------+--------------+
| Product   |   Elasticity |
+===========+==============+
| Product1  |          1.1 |
+-----------+--------------+
| Product2  |          1.1 |
+-----------+--------------+
| Product3  |          1   |
+-----------+--------------+

Show code

print("\nAge Elasticities:")


Age Elasticities:

Show code

print(tabulate(age_data, headers=["Age Group", "Elasticity"], tablefmt="grid"))

+-------------+--------------+
| Age Group   |   Elasticity |
+=============+==============+
| 20-35       |          1.5 |
+-------------+--------------+
| 36-45       |          1.2 |
+-------------+--------------+
| 46+         |          1.1 |
+-------------+--------------+

Show code

print("\nOccupation Elasticities:")


Occupation Elasticities:

Show code

print(tabulate(occupation_data, headers=["Occupation", "Elasticity"], tablefmt="grid"))

+--------------+--------------+
| Occupation   |   Elasticity |
+==============+==============+
| LIGHT BLUE   |         1.2  |
+--------------+--------------+
| HEAVY BLUE   |         1.05 |
+--------------+--------------+
| WHITE COLLAR |         1.05 |
+--------------+--------------+

Looking at the correlations between price and recommendations, we see a stronger negative correlation for Products 1 and 2 as expected. Looking at base recommendations and normalised for seasonality:

Show code

# Subset the data 
subset_df = df[['product', 'price', 'normalized_recommendations', 'recommendations']]

# Group the data by
correlations = subset_df.groupby('product').apply(lambda x: pd.Series({
    'Corr Price v Norm Recomment': x['price'].corr(x['normalized_recommendations']),
    'Corr Price v Recommend': x['price'].corr(x['recommendations'])
}))

# Round the correlations to 3 decimal points
correlations = correlations.round(3)

# Reset the index and rename the columns
correlation_table = correlations.reset_index().rename(columns={'level_1': 'Correlation Type'})

# Display the correlation table
print(correlation_table)

    product  Corr Price v Norm Recomment  Corr Price v Recommend
0  Product1                       -0.970                  -0.953
1  Product2                       -0.984                  -0.977
2  Product3                       -0.957                  -0.929

Normalising for seasonability, we see a stronger negative correlation for younger ages (<36) but stronger than expected for older ages given the inputs:

Show code

# Subset the data 
subset_df = df[['product', 'age_group', 'price', 'normalized_recommendations']]

# Group the data by 
correlations = subset_df.groupby(['product', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()

# Round the correlations to 3 decimal points
correlations = correlations.round(3)

# Display the correlation table
print(correlations)

age_group  20-35  36-45    46+
product                       
Product1  -0.991 -0.789 -0.945
Product2  -0.986 -0.983 -0.991
Product3  -0.980 -0.969 -0.956

Data gets a bit more sparse when we add in all of product, age and occupation, but we do largely see a stronger negative correlation for light blue as expected:

Show code

# Subset the data for the desired columns
subset_df = df[['product', 'occupation', 'age_group', 'price', 'normalized_recommendations']]

# Create an empty dictionary to store the correlation tables
correlation_tables = {}

# Iterate over each product
for product in subset_df['product'].unique():
    product_df = subset_df[subset_df['product'] == product]
    
    # Group the data by 'occupation' and 'age_group' and calculate the correlations
    correlations = product_df.groupby(['occupation', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()
    
    # Round the correlations to 3 decimal points
    correlations = correlations.round(3)
    
    # Store the correlation table in the dictionary
    correlation_tables[product] = correlations

# Display the correlation tables
for product, correlations in correlation_tables.items():
    print(f"Correlation table for {product}")
    print(correlations)
    print()

Correlation table for Product1
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.974 -0.813 -0.968
LIGHT BLUE   -0.952 -0.826 -0.958
WHITE COLLAR -0.997    NaN -0.930

Correlation table for Product2
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.992 -0.631 -0.699
LIGHT BLUE   -0.992 -0.992 -0.973
WHITE COLLAR -0.987    NaN -0.996

Correlation table for Product3
age_group     20-35  36-45    46+
occupation                       
HEAVY BLUE   -0.864 -0.750 -0.967
LIGHT BLUE   -0.958 -0.932 -0.935
WHITE COLLAR    NaN -0.865 -0.906

Models

Model variations

Initially we are modelling recommendations as explained by price and other explanatory variables. Models considered were a Gradient Boosting Regressor, Generalized Linear Model (GLM), and Fixed Effects (FE) Panel Model:

Generalized Linear Model (GLM): This model is a statistical regression model that extends linear regression to handle different types of response variables and apply different types of probability distributions. It models the relationship between the response variable (recommendations) and the predictors (exogenous variables) by specifying a suitable probability distribution and a link function. In this case, it is used to model the recommendations using the Negative Binomial distribution.
Gradient Boosting Regressor: This model is a machine learning algorithm that uses an ensemble of weak prediction models (decision trees) to make predictions. It iteratively builds and combines these weak models to create a stronger predictive model. In this case, it is used to predict the recommendations based on the exogenous variables.
Fixed Effects (FE) Panel Model: This model is used to analyze panel data, where observations are collected over time for multiple entities (products in this case). The FE model accounts for unobserved entity-specific effects (fixed effects) by including dummy variables for each entity. It estimates the relationship between the endogenous variable (recommendations) and the exogenous variables (age_group, occupation, and price), while controlling for the entity-specific effects.

GLM

To start with, let’s model the normalised recommendations. We see that month is not significant. Params can be directly interpreted as price sensitivities. No differential by age, occ and product.

Show code

# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors 
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
        
df_encoded = df_encoded.astype(float)

# Define target and predictors 
X = df_encoded[['month'] + [col for col in df_encoded if ':price' in col]]
Y = df['normalized_recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create Negative Binomial model
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)

# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")

GLM MSE: 0.0008

Show code

print(f"GLM R-squared: {glm_r2:.4f}")

GLM R-squared: 0.9588

Show code

# Summary
glm_results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	normalized_recommendations	No. Observations:	1425
Model:	GLM	Df Residuals:	1416
Model Family:	Gaussian	Df Model:	8
Link Function:	Identity	Scale:	0.00087058
Method:	IRLS	Log-Likelihood:	3008.5
Date:	Wed, 26 Jul 2023	Deviance:	1.2234
Time:	13:36:11	Pearson chi2:	1.22
No. Iterations:	3	Pseudo R-squ. (CS):	1.000
Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	2.4810	0.009	268.747	0.000	2.463	2.499
month	0.0003	0.000	1.168	0.243	-0.000	0.001
product_Product1:price	-0.0049	2.8e-05	-173.554	0.000	-0.005	-0.005
product_Product2:price	-0.0048	3.2e-05	-149.704	0.000	-0.005	-0.005
product_Product3:price	-0.0049	3.53e-05	-138.976	0.000	-0.005	-0.005
age_group_20-35:price	-0.0049	3.52e-05	-138.179	0.000	-0.005	-0.005
age_group_36-45:price	-0.0049	3.21e-05	-153.434	0.000	-0.005	-0.005
age_group_46+:price	-0.0048	2.8e-05	-170.282	0.000	-0.005	-0.005
occupation_HEAVY BLUE:price	-0.0049	3.15e-05	-154.921	0.000	-0.005	-0.005
occupation_LIGHT BLUE:price	-0.0048	3.33e-05	-145.694	0.000	-0.005	-0.005
occupation_WHITE COLLAR:price	-0.0048	3.04e-05	-158.408	0.000	-0.005	-0.005

Modifying this to model recommendations but with month as a polynomial:

Show code

# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors 
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Add polynomial term for the month
df_encoded['month_squared'] = df_encoded['month']**2

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
        
df_encoded = df_encoded.astype(float)

# Define target and predictors 
X = df_encoded[['month', 'month_squared'] + [col for col in df_encoded if ':price' in col]]
Y = df['recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create Negative Binomial model
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)

# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")

GLM MSE: 8.9030

Show code

print(f"GLM R-squared: {glm_r2:.4f}")

GLM R-squared: 0.9574

Show code

# Summary
glm_results.summary()

Generalized Linear Model Regression Results
Dep. Variable:	recommendations	No. Observations:	1425
Model:	GLM	Df Residuals:	1415
Model Family:	Gaussian	Df Model:	9
Link Function:	Identity	Scale:	9.2575
Method:	IRLS	Log-Likelihood:	-3602.7
Date:	Wed, 26 Jul 2023	Deviance:	13101.
Time:	13:36:12	Pearson chi2:	1.31e+04
No. Iterations:	3	Pseudo R-squ. (CS):	1.000
Covariance Type:	nonrobust

	coef	std err	z	P>\|z\|	[0.025	0.975]
const	250.7090	0.983	254.945	0.000	248.782	252.636
month	1.3797	0.100	13.787	0.000	1.184	1.576
month_squared	-0.0587	0.008	-7.734	0.000	-0.074	-0.044
product_Product1:price	-0.5012	0.003	-173.719	0.000	-0.507	-0.496
product_Product2:price	-0.4943	0.003	-149.876	0.000	-0.501	-0.488
product_Product3:price	-0.5070	0.004	-139.169	0.000	-0.514	-0.500
age_group_20-35:price	-0.5018	0.004	-138.316	0.000	-0.509	-0.495
age_group_36-45:price	-0.5091	0.003	-153.678	0.000	-0.516	-0.503
age_group_46+:price	-0.4915	0.003	-170.436	0.000	-0.497	-0.486
occupation_HEAVY BLUE:price	-0.5039	0.003	-155.091	0.000	-0.510	-0.498
occupation_LIGHT BLUE:price	-0.5006	0.003	-145.874	0.000	-0.507	-0.494
occupation_WHITE COLLAR:price	-0.4980	0.003	-158.591	0.000	-0.504	-0.492

GB

Show code

# Create Gradient Boosting model
gb_model = GradientBoostingRegressor(random_state=42)

# Train model
gb_model.fit(X_train, y_train)

GradientBoostingRegressor(random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

# Use the model to make predictions
gb_predictions = gb_model.predict(X_test)

# Calculate metrics
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_r2 = r2_score(y_test, gb_predictions)

print(f'Gradient Boosting MSE: {gb_mse:.4f}')

Gradient Boosting MSE: 5.2963

Show code

print(f'Gradient Boosting R^2: {gb_r2:.4f}')

Gradient Boosting R^2: 0.9747

Show code

# Get feature importances
importance = gb_model.feature_importances_

# Create a DataFrame to display the feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the top features
print(feature_importance_df)

                          Feature  Importance
6           age_group_20-35:price    0.354780
8             age_group_46+:price    0.270391
7           age_group_36-45:price    0.124483
4          product_Product2:price    0.094404
3          product_Product1:price    0.074612
10    occupation_LIGHT BLUE:price    0.025876
1                           month    0.014360
5          product_Product3:price    0.014010
9     occupation_HEAVY BLUE:price    0.011436
11  occupation_WHITE COLLAR:price    0.008229
2                   month_squared    0.007420
0                           const    0.000000

Extracting elasticities:

Show code

def compute_elasticity(model, feature_name):
    # Create a dummy row of zeros
    dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
    
    # If the feature ends with ":price", then it's a continuous feature.
    if ":price" in feature_name:
        dummy_data[feature_name] = 100  # Setting initial price to a standard value
        original_pred = model.predict(dummy_data)
        
        # Simulate a 1% increase
        dummy_data[feature_name] = 110  
        new_pred = model.predict(dummy_data)
        
    else:  # It's a binary/categorical feature
        dummy_data[feature_name] = 1
        original_pred = model.predict(dummy_data)
        
        # Toggle its value
        dummy_data[feature_name] = 0
        new_pred = model.predict(dummy_data)
        
    percent_change_qty = (new_pred - original_pred) / original_pred
    elasticity = percent_change_qty / 0.1 
    
    return round(elasticity[0],4)  # Extract value from the array

feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

# Create a dictionary to store computed elasticities
elasticities = {}

# For each feature of interest, compute the elasticity
for feature in feature_names:
    elasticities[feature] = compute_elasticity(gb_model, feature)

print(elasticities)

{'const': 0.0, 'product_Product1:price': -0.7521, 'product_Product2:price': -0.3182, 'product_Product3:price': -0.2497, 'age_group_20-35:price': -0.8312, 'age_group_36-45:price': -0.815, 'age_group_46+:price': -1.1606, 'occupation_HEAVY BLUE:price': 0.0, 'occupation_LIGHT BLUE:price': -0.1401, 'occupation_WHITE COLLAR:price': 0.0078}

Looking at a range of price change sensitivities:

Show code

def compute_elasticity_for_changes(model, feature_name, changes):
  
  # This dictionary will store elasticities for various changes
  results = {}
  
  # Create a dummy row of zeros
  dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
  
  for change in changes:
    
    # Avoid issues with 0% change
    if change == 0:
        results['0%'] = np.nan
        continue
          
    # If the feature ends with ":price", then it's a continuous feature.
    if ":price" in feature_name:
      dummy_data[feature_name] = 100  # Setting initial price to a standard value
      original_pred = model.predict(dummy_data)[0]  # Extract the scalar
      
      # Simulate a change
      dummy_data[feature_name] = 100 + 100 * change 
      new_pred = model.predict(dummy_data)[0]  # Extract the scalar
    
    else:  # It's a binary/categorical feature
      dummy_data[feature_name] = 1
      original_pred = model.predict(dummy_data)[0]  # Extract the scalar
      
      # Toggle its value
      dummy_data[feature_name] = 0
      new_pred = model.predict(dummy_data)[0]  # Extract the scalar
    
    percent_change_qty = (new_pred - original_pred) / original_pred
    elasticity = round(percent_change_qty / change, 4)
    
    results[str(int(change * 100)) + '%'] = elasticity
  
  return results

# Define a continuous range of sensitivities from -50% to 50% in 1% increments
changes = [i/100 for i in range(-50, 51)]
feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

elasticity_table = pd.DataFrame(index=feature_names)

for feature in feature_names:
    elasticities = compute_elasticity_for_changes(gb_model, feature, changes)
    for key, value in elasticities.items():
        elasticity_table.at[feature, key] = value

print(elasticity_table)

                                 -50%    -49%    -48%  ...     48%     49%     50%
const                         -0.0000 -0.0000 -0.0000  ...  0.0000  0.0000  0.0000
product_Product1:price        -0.0256 -0.0000 -0.0000  ... -0.1769 -0.1733 -0.1698
product_Product2:price        -0.0978 -0.0998 -0.1019  ... -0.1778 -0.1741 -0.1707
product_Product3:price        -0.0236 -0.0241 -0.0246  ... -0.0535 -0.0525 -0.0514
age_group_20-35:price         -0.4954 -0.5055 -0.5160  ... -0.2911 -0.2851 -0.2794
age_group_36-45:price         -0.1829 -0.1867 -0.1906  ... -0.2840 -0.2782 -0.2726
age_group_46+:price           -0.1886 -0.1925 -0.1965  ... -0.4674 -0.4579 -0.4487
occupation_HEAVY BLUE:price   -0.0214 -0.0219 -0.0223  ... -0.0183 -0.0179 -0.0175
occupation_LIGHT BLUE:price   -0.0977 -0.0997 -0.1018  ... -0.0636 -0.0623 -0.0611
occupation_WHITE COLLAR:price -0.0258 -0.0263 -0.0269  ... -0.0182 -0.0178 -0.0175

[10 rows x 99 columns]

Plotting the elasticities:

Show code

# Adjust the names in the dictionaries to match the format in the dataframe
product_elasticities <- list('product_Product1:price' = 1.1, 'product_Product2:price' = 1.1, 'product_Product3:price' = 1)
age_elasticities <- list('age_group_20-35:price' = 1.5, 'age_group_36-45:price' = 1.2, 'age_group_46+:price' = 1.1)
occupation_elasticities <- list('occupation_LIGHT BLUE:price' = 1.2, 'occupation_HEAVY BLUE:price' = 1.05, 'occupation_WHITE COLLAR:price' = 1.05)

# Combine all the dictionaries into a data frame
original_elasticities <- data.frame(
  feature = c(names(product_elasticities), names(age_elasticities), names(occupation_elasticities)),
  elasticity = c(unlist(product_elasticities), unlist(age_elasticities), unlist(occupation_elasticities))
)

# Adjust the elasticity values
original_elasticities$elasticity <- original_elasticities$elasticity * (-1)

# Convert python df to r for plotting
elasticity_table <- py$elasticity_table

# Convert data to long format
long_data <- elasticity_table %>%
  rownames_to_column(var = "feature") %>%
  gather(key = "change", value = "elasticity", -feature)

# Extract numerical values from the "change" column for sizing
long_data$change_num <- as.numeric(gsub("%", "", long_data$change))/100

# Plot the data
ggplot(long_data, aes(x = elasticity, y = feature)) +
  geom_point(aes(size = change_num), color = "steelblue") +  # single color and size is based on change_num
  geom_point(data = original_elasticities, aes(x = elasticity, y = feature), color = "red", size = 4) +  # Data points for original elasticities
  geom_vline(aes(xintercept = 0), linetype = "dashed", color = "grey50") +  # line at x=0 for reference
  labs(title = "Elasticities by Feature for Different Percentage Changes",
       x = "Elasticity",
       y = "Feature") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(size = guide_legend(title = "Price Change"))

FE

Show code

# Convert to panel data
panel_data = df.set_index(['product', 'date'])

# Define exogenous variables
exog = panel_data[['age_group', 'occupation', 'price']]
exog = sm.add_constant(exog, has_constant='add')  # Add constant by product

# Define endogenous variable
endog = panel_data['normalized_recommendations']

# Create a model with fixed effects
fe_model = PanelOLS(endog, exog, entity_effects=True, drop_absorbed=True)

# Fit the model
fe_results = fe_model.fit()

# Use the model to make predictions
fe_predictions = fe_results.predict(exog)

# Calculate evaluation metrics
fe_mse = mean_squared_error(endog, fe_predictions)
fe_r2 = r2_score(endog, fe_predictions)

# Print evaluation metrics
print(f"FE Model MSE: {fe_mse:.4f}")

FE Model MSE: 0.0009

Show code

print(f"FE Model R-squared: {fe_r2:.4f}")

FE Model R-squared: 0.9594

Show code

# Print model summary
print(fe_results)

                              PanelOLS Estimation Summary                               
========================================================================================
Dep. Variable:     normalized_recommendations   R-squared:                        0.9547
Estimator:                           PanelOLS   R-squared (Between):              0.9917
No. Observations:                        1782   R-squared (Within):               0.9547
Date:                        Wed, Jul 26 2023   R-squared (Overall):              0.9594
Time:                                13:36:21   Log-likelihood                    3765.5
Cov. Estimator:                    Unadjusted                                           
                                                F-statistic:                      7470.5
Entities:                                   3   P-value                           0.0000
Avg Obs:                               594.00   Distribution:                  F(5,1774)
Min Obs:                               594.00                                           
Max Obs:                               594.00   F-statistic (robust):             7470.5
                                                P-value                           0.0000
Time periods:                              66   Distribution:                  F(5,1774)
Avg Obs:                               27.000                                           
Min Obs:                               27.000                                           
Max Obs:                               27.000                                           
                                                                                        
                                    Parameter Estimates                                    
===========================================================================================
                         Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------------------
const                       2.4563     0.0081     304.18     0.0000      2.4405      2.4722
age_group.36-45            -0.0077     0.0017    -4.4776     0.0000     -0.0111     -0.0043
age_group.46+               0.0062     0.0018     3.3776     0.0007      0.0026      0.0098
occupation.LIGHT BLUE       0.0053     0.0017     3.1193     0.0018      0.0020      0.0087
occupation.WHITE COLLAR     0.0043     0.0017     2.5084     0.0122      0.0009      0.0076
price                      -0.0143  8.051e-05    -177.81     0.0000     -0.0145     -0.0142
===========================================================================================

F-test for Poolability: 23.944
P-value: 0.0000
Distribution: F(2,1774)

Included effects: Entity

Next steps

Translate the above into price elasticities for each product, age group, and occupation combination. Plot them against the actual elasticities to assess the model’s performance.
Goodness of fit testing on same - do the models adequately captures the variations and patterns in the data. Compare the performance across the models.
Conduct a sensitivity analysis to assess the robustness of the price elasticity model.
Explore additional features that could potentially improve the model’s performance. For example, consider incorporating demographic or socio-economic factors.

Exploring price elasticity

Background and application

Further reading

Libraries

Data

Hypothetical data - generate

Function to apply elasticity

Data generation

Exploratory plots

Models

Model variations

GLM

GB

FE

Next steps