Price elasticity is the responsiveness of sales to changes in price. This recipe is an initial exploration of demand/recommendation modelling given price and other variables, and is extendable to elasticity questions that help drive pricing strategies.
…How might we predict changes in product sales patterns based upon price? And what is the impact on adviser recommendations?…
Price elasticity is the responsiveness of sales to changes in price, and price elasticity modelling helps drive pricing strategies.
Factors such as age and occupation are likely to affect price elasticity. Insurer-specific factors, or external factors influencing advisers, might be present but hidden (not explicit in the data). There might also be a lag between price changes and their effect on recommendations.
Price elasticity of sales shares similarities with lapse rate modelling, although for lapses the lag/inertia effect might be more stark.
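As a toy illustration of the definition (simple arithmetic only, not part of the recipe's data): a 10% price increase that cuts sales from 200 to 170 units implies an elasticity of -1.5.
# toy illustration: price elasticity = % change in quantity / % change in price
def price_elasticity(q_old, q_new, p_old, p_new):
    pct_change_qty = (q_new - q_old) / q_old
    pct_change_price = (p_new - p_old) / p_old
    return pct_change_qty / pct_change_price

print(round(price_elasticity(q_old=200, q_new=170, p_old=100, p_new=110), 2))  # -1.5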
This recipe is an initial exploration of demand/recommendations modelling given price and other variables, extendable to elasticity questions.
A few articles of interest:
Setting up the environment, together with a list of some of the packages used in the recipe.
# calling python from r
library(reticulate)
# create the environment if it does not exist
# conda_create("r-reticulate")
# py_config() # to check configuration
# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks
# if not already installed, install the packages below; if the environment is already set, the envname parameter can be dropped
# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# py_install("scipy",envname = "r-reticulate")
# py_install("matplotlib",envname = "r-reticulate")
# py_install("seaborn",envname = "r-reticulate")
# py_install("scikit-learn",envname = "r-reticulate")
# py_install("linearmodels",envname = "r-reticulate")
# py_install("tabulate",envname = "r-reticulate")
# py_install("rpy2",envname = "r-reticulate")
# ...etc
Libraries:
# import some standard libraries
import pandas as pd
import numpy as np
from tabulate import tabulate
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from linearmodels.panel import PanelOLS
Creating data split across 3 age groups, 3 occupations and 3 products/insurers. Some initial prices are defined, and the initial recommendations are assumed to be split roughly equally across products (in spite of any initial price differences, implying a hidden preference not captured in the data, which will perhaps be picked up as a product-level feature of the model). A series of price changes is defined, together with their impact on the level of recommendations. Recommendations are assumed to peak around the middle of the year, and a random noise term is added.
def adjust_recommendations(row):
    # get the product, age and occupation specific elasticities
    product_elasticity = product_elasticities[row['product']]
    age_elasticity = age_elasticities[row['age_group']]
    occupation_elasticity = occupation_elasticities[row['occupation']]

    # calculate the combined elasticity (product, age and occupation effects multiplied together)
    combined_elasticity = product_elasticity * age_elasticity * occupation_elasticity

    # calculate the percentage change in price relative to the base price
    price_change = (row['price'] - row['base_price']) / row['base_price']

    # adjust the recommendations based on the price change and elasticity
    adjusted_recommendations = row['recommendations'] * (1 - price_change * combined_elasticity)

    return adjusted_recommendations
# Define date range
date_range = pd.date_range(start='2019-01-01', end='2024-06-30', freq='M', normalize=True)

# Define age groups, occupations and products
age_groups = ['20-35', '36-45', '46+']
occupations = ['LIGHT BLUE', 'HEAVY BLUE', 'WHITE COLLAR']
products = ['Product1', 'Product2', 'Product3']

# Define baseline price for each product
price = {'Product1': 100.00, 'Product2': 100.00, 'Product3': 100.00}

# Create initial dataframe
data = []
for date in date_range:
    for product in products:
        for age_group in age_groups:
            for occupation in occupations:
                row = {
                    'product': product,
                    'date': date.date(),
                    'age_group': age_group,
                    'occupation': occupation,
                    'base_price': price[product],
                    'price': price[product],
                    'price_changed': 0,
                    'change_direction': 'no change'
                }
                data.append(row)

df = pd.DataFrame(data)
# Define price changes
price_changes = [
    # (product, age_group, occupation, date, price multiplier)
    ('Product1', '46+', None, pd.to_datetime('2019-07-01').date(), 1.15),
    ('Product3', None, 'HEAVY BLUE', pd.to_datetime('2019-09-01').date(), 1.05),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product2', '46+', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
    ('Product3', None, 'LIGHT BLUE', pd.to_datetime('2020-03-01').date(), 0.9),
    ('Product2', '46+', 'WHITE COLLAR', pd.to_datetime('2020-12-01').date(), 1.3),
    ('Product2', '20-35', None, pd.to_datetime('2021-04-01').date(), 0.8),
    ('Product1', None, 'HEAVY BLUE', pd.to_datetime('2021-12-01').date(), 1.05),
    ('Product1', '20-35', 'WHITE COLLAR', pd.to_datetime('2022-03-01').date(), 1.2),
    ('Product3', '46+', 'HEAVY BLUE', pd.to_datetime('2022-06-01').date(), 0.85),
    ('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2022-09-01').date(), 1.1),
    ('Product1', '36-45', 'LIGHT BLUE', pd.to_datetime('2023-01-01').date(), 1.05),
    ('Product2', '20-35', 'HEAVY BLUE', pd.to_datetime('2023-04-01').date(), 1.3),
    ('Product3', '46+', 'WHITE COLLAR', pd.to_datetime('2023-07-01').date(), 0.9),
    ('Product1', '20-35', None, pd.to_datetime('2023-10-01').date(), 1.1),
    ('Product2', None, 'HEAVY BLUE', pd.to_datetime('2024-01-01').date(), 0.95),
    ('Product3', '36-45', 'WHITE COLLAR', pd.to_datetime('2024-04-01').date(), 1.15),
]
# Apply price changes
for product, age_group, occupation, date, multiplier in price_changes:
    mask = (df['product'] == product) & (df['date'] >= date)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)

    df.loc[mask, 'price'] *= multiplier

# Flag price change
for product, age_group, occupation, date, multiplier in price_changes:
    mask = (df['product'] == product) & (df['date'].apply(lambda x: x.year) == date.year) & (df['date'].apply(lambda x: x.month) == date.month)
    if age_group is not None:
        mask = mask & (df['age_group'] == age_group)
    if occupation is not None:
        mask = mask & (df['occupation'] == occupation)

    df.loc[mask, 'price_changed'] = 1
    df.loc[mask, 'change_direction'] = 'up' if multiplier > 1 else 'down'
# Ensure that each product, age, occ gets roughly equal recommendations initially
df['recommendations'] = 100

# Noise terms
# Monthly
np.random.seed(0)  # for reproducibility
mu, sigma = 10, 5  # mean and standard deviation for the normal distribution
s = np.random.normal(mu, sigma, df.shape[0])  # generate random numbers from the distribution
df['month'] = df['date'].apply(lambda x: x.month)
df['month_norm'] = 1 + norm.pdf(df['month'], mu, sigma)  # generate the monthly pattern
df['recommendations'] = df['recommendations'] * df['month_norm']

# Random
noise_factor = 1 + (np.random.normal(0, 0.02, df.shape[0]))
df['recommendations'] *= noise_factor

# Apply price changes and adjust recommendations based on price elasticity
product_elasticities = {'Product1': 1.1, 'Product2': 1.1, 'Product3': 1}
age_elasticities = {'20-35': 1.5, '36-45': 1.2, '46+': 1.1}
occupation_elasticities = {'LIGHT BLUE': 1.2, 'HEAVY BLUE': 1.05, 'WHITE COLLAR': 1.05}

df['recommendations'] = df.apply(adjust_recommendations, axis=1)

# Round the recommendations to nearest integer as we're dealing with counts
df['recommendations'] = df['recommendations'].round().astype(int)
# We know that the data has seasonality, normalising this for later visuals:
# Calculate monthly averages
monthly_averages = df.groupby('month')['recommendations'].mean().reset_index()
monthly_averages.columns = ['month', 'avg_recommendations']

# Merge and norm
df = df.merge(monthly_averages, on='month', how='left')
df['normalized_recommendations'] = df['recommendations'] / df['avg_recommendations']
df.head()
product date ... avg_recommendations normalized_recommendations
0 Product1 2019-01-31 ... 99.345679 0.986455
1 Product1 2019-01-31 ... 99.345679 1.016652
2 Product1 2019-01-31 ... 99.345679 1.006586
3 Product1 2019-01-31 ... 99.345679 1.026718
4 Product1 2019-01-31 ... 99.345679 0.976389
[5 rows x 13 columns]
# Convert python df to r for plotting
df <- py$df
# Convert the date
df$date <- as.Date(sapply(df$date, as.character))
# Plot
p <- ggplot(df, aes(x = date, y = price, color = product)) +
geom_line() +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Price change history") +
theme_minimal() +
ylim(80, 150) +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1)) # This line makes the x-axis labels diagonal
print(p)
The plots show the seasonality. The up/down markers reflect the timing and direction of the price changes, and there is a fair amount of noise in the recommendations:
# Plot
p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
geom_line(size=0.5) +
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = recommendations, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Recommendations history") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
From our inputs, we know we have overlaid a seasonality factor:
# A tibble: 6 × 2
date avg_month_norm
<date> <dbl>
1 2019-01-31 1.02
2 2019-02-28 1.02
3 2019-03-31 1.03
4 2019-04-30 1.04
5 2019-05-31 1.05
6 2019-06-30 1.06
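The df_avg used in the plotting chunk below is not built in the chunks shown; a minimal sketch of one way to derive it, assuming it is simply the average of the month_norm seasonality factor by date, is:
# sketch (assumption): average seasonality factor by date, matching the tibble above
df_avg = (
    df.groupby('date', as_index=False)['month_norm']
      .mean()
      .rename(columns={'month_norm': 'avg_month_norm'})
)
print(df_avg.head())
The R chunk would then pick this up with df_avg <- py$df_avg, converting the date column as before.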
# Plot
p <- ggplot(df_avg, aes(x = date, y = avg_month_norm)) +
geom_line() +
scale_y_continuous(labels = percent_format()) +
labs(title = "Monthly seasonality (data param)") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
Taking a 3-month moving average reduces the noise, and some correlation with the price changes becomes visible:
# Calculate the 3-month moving average
df <- df %>%
group_by(age_group, occupation, product) %>%
arrange(date) %>%
mutate(moving_avg = zoo::rollmean(recommendations, k = 3, align = "right", fill = NA))
p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
# Adding a 3-month moving average line
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
geom_line(aes(y = moving_avg), linetype = "solid", size = 0.5) +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Recommendations history with 3-Month moving average") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
Removing the seasonality by standardising the monthly recommendations by their averages, retaining the 3 month moving average:
# Add moving average based upon normalised recommendations
df <- df %>%
group_by(age_group, occupation, product) %>%
arrange(date) %>%
mutate(moving_avg = zoo::rollmean(normalized_recommendations, k = 3, align = "right", fill = NA))
# Plot the normalized data
p <- ggplot(df, aes(x = date, y = moving_avg, color = product)) +
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
geom_line() +
facet_grid(age_group ~ occupation) +
labs(title = "Normalized recommendations history") +
theme_minimal() +
theme(legend.position = "bottom")
print(p)
From our inputs, the price elasticity factors are:
# Convert the dictionaries to lists of tuples
product_data = list(product_elasticities.items())
age_data = list(age_elasticities.items())
occupation_data = list(occupation_elasticities.items())
# Print the tables
print("Product Elasticities:")
Product Elasticities:
print(tabulate(product_data, headers=["Product", "Elasticity"], tablefmt="grid"))
+-----------+--------------+
| Product | Elasticity |
+===========+==============+
| Product1 | 1.1 |
+-----------+--------------+
| Product2 | 1.1 |
+-----------+--------------+
| Product3 | 1 |
+-----------+--------------+
print("\nAge Elasticities:")
Age Elasticities:
print(tabulate(age_data, headers=["Age Group", "Elasticity"], tablefmt="grid"))
+-------------+--------------+
| Age Group | Elasticity |
+=============+==============+
| 20-35 | 1.5 |
+-------------+--------------+
| 36-45 | 1.2 |
+-------------+--------------+
| 46+ | 1.1 |
+-------------+--------------+
print("\nOccupation Elasticities:")
Occupation Elasticities:
print(tabulate(occupation_data, headers=["Occupation", "Elasticity"], tablefmt="grid"))
+--------------+--------------+
| Occupation | Elasticity |
+==============+==============+
| LIGHT BLUE | 1.2 |
+--------------+--------------+
| HEAVY BLUE | 1.05 |
+--------------+--------------+
| WHITE COLLAR | 1.05 |
+--------------+--------------+
Looking at the correlations between price and recommendations, we see a stronger negative correlation for Products 1 and 2, as expected. The correlations are shown for both the raw recommendations and the recommendations normalised for seasonality:
# Subset the data
subset_df = df[['product', 'price', 'normalized_recommendations', 'recommendations']]

# Group the data by 'product' and calculate the correlations
correlations = subset_df.groupby('product').apply(lambda x: pd.Series({
    'Corr Price v Norm Recommend': x['price'].corr(x['normalized_recommendations']),
    'Corr Price v Recommend': x['price'].corr(x['recommendations'])
}))

# Round the correlations to 3 decimal points
correlations = correlations.round(3)

# Reset the index and rename the columns
correlation_table = correlations.reset_index().rename(columns={'level_1': 'Correlation Type'})
# Display the correlation table
print(correlation_table)
    product  Corr Price v Norm Recommend  Corr Price v Recommend
0  Product1                       -0.970                  -0.953
1  Product2                       -0.984                  -0.977
2  Product3                       -0.957                  -0.929
Normalising for seasonality, we see a stronger negative correlation for younger ages (<36), but the correlation for older ages is stronger than expected given the inputs:
# Subset the data
subset_df = df[['product', 'age_group', 'price', 'normalized_recommendations']]

# Group the data by 'product' and 'age_group' and calculate the correlations
correlations = subset_df.groupby(['product', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()

# Round the correlations to 3 decimal points
correlations = correlations.round(3)
# Display the correlation table
print(correlations)
age_group 20-35 36-45 46+
product
Product1 -0.991 -0.789 -0.945
Product2 -0.986 -0.983 -0.991
Product3 -0.980 -0.969 -0.956
The data gets sparser once we split by product, age and occupation together, but we still largely see a stronger negative correlation for LIGHT BLUE occupations, as expected:
# Subset the data for the desired columns
subset_df = df[['product', 'occupation', 'age_group', 'price', 'normalized_recommendations']]

# Create an empty dictionary to store the correlation tables
correlation_tables = {}

# Iterate over each product
for product in subset_df['product'].unique():
    product_df = subset_df[subset_df['product'] == product]

    # Group the data by 'occupation' and 'age_group' and calculate the correlations
    correlations = product_df.groupby(['occupation', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()

    # Round the correlations to 3 decimal points
    correlations = correlations.round(3)

    # Store the correlation table in the dictionary
    correlation_tables[product] = correlations
# Display the correlation tables
for product, correlations in correlation_tables.items():
print(f"Correlation table for {product}")
print(correlations)
print()
Correlation table for Product1
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.974 -0.813 -0.968
LIGHT BLUE -0.952 -0.826 -0.958
WHITE COLLAR -0.997 NaN -0.930
Correlation table for Product2
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.992 -0.631 -0.699
LIGHT BLUE -0.992 -0.992 -0.973
WHITE COLLAR -0.987 NaN -0.996
Correlation table for Product3
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.864 -0.750 -0.967
LIGHT BLUE -0.958 -0.932 -0.935
WHITE COLLAR NaN -0.865 -0.906
Initially we model recommendations as explained by price and other explanatory variables. The models considered are a Gradient Boosting Regressor, a Generalized Linear Model (GLM) and a Fixed Effects (FE) panel model.
To start with, let's model the normalised recommendations with a GLM. Month is not significant, and the parameters can be interpreted directly as price sensitivities (a rough conversion of the fitted coefficients into an implied elasticity follows the model summary below). The fitted coefficients show little differentiation by age, occupation or product.
# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']

df_encoded = df_encoded.astype(float)

# Define target and predictors
X = df_encoded[['month'] + [col for col in df_encoded if ':price' in col]]
Y = df['normalized_recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create GLM with a Gaussian family
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)
# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 0.0008
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9588
# Summary
glm_results.summary()
| Dep. Variable: | normalized_recommendations | No. Observations: | 1425 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 1416 |
| Model Family: | Gaussian | Df Model: | 8 |
| Link Function: | Identity | Scale: | 0.00087058 |
| Method: | IRLS | Log-Likelihood: | 3008.5 |
| Date: | Wed, 26 Jul 2023 | Deviance: | 1.2234 |
| Time: | 13:36:11 | Pearson chi2: | 1.22 |
| No. Iterations: | 3 | Pseudo R-squ. (CS): | 1.000 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4810 | 0.009 | 268.747 | 0.000 | 2.463 | 2.499 |
| month | 0.0003 | 0.000 | 1.168 | 0.243 | -0.000 | 0.001 |
| product_Product1:price | -0.0049 | 2.8e-05 | -173.554 | 0.000 | -0.005 | -0.005 |
| product_Product2:price | -0.0048 | 3.2e-05 | -149.704 | 0.000 | -0.005 | -0.005 |
| product_Product3:price | -0.0049 | 3.53e-05 | -138.976 | 0.000 | -0.005 | -0.005 |
| age_group_20-35:price | -0.0049 | 3.52e-05 | -138.179 | 0.000 | -0.005 | -0.005 |
| age_group_36-45:price | -0.0049 | 3.21e-05 | -153.434 | 0.000 | -0.005 | -0.005 |
| age_group_46+:price | -0.0048 | 2.8e-05 | -170.282 | 0.000 | -0.005 | -0.005 |
| occupation_HEAVY BLUE:price | -0.0049 | 3.15e-05 | -154.921 | 0.000 | -0.005 | -0.005 |
| occupation_LIGHT BLUE:price | -0.0048 | 3.33e-05 | -145.694 | 0.000 | -0.005 | -0.005 |
| occupation_WHITE COLLAR:price | -0.0048 | 3.04e-05 | -158.408 | 0.000 | -0.005 | -0.005 |
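As a rough cross-check of the "coefficients as price sensitivities" point (a sketch, not part of the original workflow): each row has exactly one product, one age group and one occupation interaction switched on, so the effective price slope is the sum of three coefficients, and with a base price of 100 and normalised recommendations centred near 1 the implied elasticity is roughly that slope multiplied by 100.
# sketch: implied point elasticity for one combination from the fitted GLM coefficients
params = glm_results.params
slope = (params['product_Product1:price']
         + params['age_group_20-35:price']
         + params['occupation_LIGHT BLUE:price'])
implied_elasticity = slope * 100 / 1.0  # elasticity ~ slope * price / quantity, with quantity ~ 1
print(f"Implied elasticity (Product1, 20-35, LIGHT BLUE): {implied_elasticity:.2f}")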
Modifying this to model the raw recommendation counts, with month entering as a polynomial:
# Set date format
df['date'] = pd.to_datetime(df['date'])

# Define predictors
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)

# Add polynomial term for the month
df_encoded['month_squared'] = df_encoded['month']**2

# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
    for dummy in df_encoded.filter(like=col).columns:
        df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']

df_encoded = df_encoded.astype(float)

# Define target and predictors
X = df_encoded[['month', 'month_squared'] + [col for col in df_encoded if ':price' in col]]
Y = df['recommendations']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Create GLM with a Gaussian family
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())

# Train model
glm_results = glm_model.fit()

# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)

# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)
# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 8.9030
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9574
# Summary
glm_results.summary()
| Dep. Variable: | recommendations | No. Observations: | 1425 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 1415 |
| Model Family: | Gaussian | Df Model: | 9 |
| Link Function: | Identity | Scale: | 9.2575 |
| Method: | IRLS | Log-Likelihood: | -3602.7 |
| Date: | Wed, 26 Jul 2023 | Deviance: | 13101. |
| Time: | 13:36:12 | Pearson chi2: | 1.31e+04 |
| No. Iterations: | 3 | Pseudo R-squ. (CS): | 1.000 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 250.7090 | 0.983 | 254.945 | 0.000 | 248.782 | 252.636 |
| month | 1.3797 | 0.100 | 13.787 | 0.000 | 1.184 | 1.576 |
| month_squared | -0.0587 | 0.008 | -7.734 | 0.000 | -0.074 | -0.044 |
| product_Product1:price | -0.5012 | 0.003 | -173.719 | 0.000 | -0.507 | -0.496 |
| product_Product2:price | -0.4943 | 0.003 | -149.876 | 0.000 | -0.501 | -0.488 |
| product_Product3:price | -0.5070 | 0.004 | -139.169 | 0.000 | -0.514 | -0.500 |
| age_group_20-35:price | -0.5018 | 0.004 | -138.316 | 0.000 | -0.509 | -0.495 |
| age_group_36-45:price | -0.5091 | 0.003 | -153.678 | 0.000 | -0.516 | -0.503 |
| age_group_46+:price | -0.4915 | 0.003 | -170.436 | 0.000 | -0.497 | -0.486 |
| occupation_HEAVY BLUE:price | -0.5039 | 0.003 | -155.091 | 0.000 | -0.510 | -0.498 |
| occupation_LIGHT BLUE:price | -0.5006 | 0.003 | -145.874 | 0.000 | -0.507 | -0.494 |
| occupation_WHITE COLLAR:price | -0.4980 | 0.003 | -158.591 | 0.000 | -0.504 | -0.492 |
# Create Gradient Boosting model
gb_model = GradientBoostingRegressor(random_state=42)
# Train model
gb_model.fit(X_train, y_train)
GradientBoostingRegressor(random_state=42)
# Use the model to make predictions
gb_predictions = gb_model.predict(X_test)

# Calculate metrics
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_r2 = r2_score(y_test, gb_predictions)
print(f'Gradient Boosting MSE: {gb_mse:.4f}')
Gradient Boosting MSE: 5.2963
print(f'Gradient Boosting R^2: {gb_r2:.4f}')
Gradient Boosting R^2: 0.9747
# Get feature importances
importance = gb_model.feature_importances_

# Create a DataFrame to display the feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': importance})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
# Display the top features
print(feature_importance_df)
Feature Importance
6 age_group_20-35:price 0.354780
8 age_group_46+:price 0.270391
7 age_group_36-45:price 0.124483
4 product_Product2:price 0.094404
3 product_Product1:price 0.074612
10 occupation_LIGHT BLUE:price 0.025876
1 month 0.014360
5 product_Product3:price 0.014010
9 occupation_HEAVY BLUE:price 0.011436
11 occupation_WHITE COLLAR:price 0.008229
2 month_squared 0.007420
0 const 0.000000
Extracting elasticities from the gradient boosting model, by perturbing the price for one feature at a time on a dummy row (a 10% increase from a base of 100) and measuring the relative change in the prediction:
def compute_elasticity(model, feature_name):
    # Create a dummy row of zeros
    dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)

    # If the feature ends with ":price", then it's a continuous feature.
    if ":price" in feature_name:
        dummy_data[feature_name] = 100  # Setting initial price to a standard value
        original_pred = model.predict(dummy_data)

        # Simulate a 10% increase
        dummy_data[feature_name] = 110
        new_pred = model.predict(dummy_data)

    else:  # It's a binary/categorical feature
        dummy_data[feature_name] = 1
        original_pred = model.predict(dummy_data)

        # Toggle its value
        dummy_data[feature_name] = 0
        new_pred = model.predict(dummy_data)

    percent_change_qty = (new_pred - original_pred) / original_pred
    elasticity = percent_change_qty / 0.1

    return round(elasticity[0], 4)  # Extract value from the array


feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

# Create a dictionary to store computed elasticities
elasticities = {}

# For each feature of interest, compute the elasticity
for feature in feature_names:
    elasticities[feature] = compute_elasticity(gb_model, feature)
print(elasticities)
{'const': 0.0, 'product_Product1:price': -0.7521, 'product_Product2:price': -0.3182, 'product_Product3:price': -0.2497, 'age_group_20-35:price': -0.8312, 'age_group_36-45:price': -0.815, 'age_group_46+:price': -1.1606, 'occupation_HEAVY BLUE:price': 0.0, 'occupation_LIGHT BLUE:price': -0.1401, 'occupation_WHITE COLLAR:price': 0.0078}
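Before varying the size of the price change, a quick side-by-side of these model-implied elasticities against the negated inputs used to simulate the data can help; this is a sketch that mirrors the original_elasticities used for the plot further below.
# sketch: compare model-implied elasticities with the negated simulation inputs
input_elasticities = {
    'product_Product1:price': -1.1, 'product_Product2:price': -1.1, 'product_Product3:price': -1.0,
    'age_group_20-35:price': -1.5, 'age_group_36-45:price': -1.2, 'age_group_46+:price': -1.1,
    'occupation_LIGHT BLUE:price': -1.2, 'occupation_HEAVY BLUE:price': -1.05, 'occupation_WHITE COLLAR:price': -1.05,
}
comparison = pd.DataFrame({'model': pd.Series(elasticities), 'input': pd.Series(input_elasticities)})
print(comparison)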
Looking at sensitivities over a range of price changes, from -50% to +50% in 1% increments:
def compute_elasticity_for_changes(model, feature_name, changes):
    # This dictionary will store elasticities for various changes
    results = {}

    # Create a dummy row of zeros
    dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)

    for change in changes:
        # Avoid issues with 0% change
        if change == 0:
            results['0%'] = np.nan
            continue

        # If the feature ends with ":price", then it's a continuous feature.
        if ":price" in feature_name:
            dummy_data[feature_name] = 100  # Setting initial price to a standard value
            original_pred = model.predict(dummy_data)[0]  # Extract the scalar

            # Simulate a change
            dummy_data[feature_name] = 100 + 100 * change
            new_pred = model.predict(dummy_data)[0]  # Extract the scalar

        else:  # It's a binary/categorical feature
            dummy_data[feature_name] = 1
            original_pred = model.predict(dummy_data)[0]  # Extract the scalar

            # Toggle its value
            dummy_data[feature_name] = 0
            new_pred = model.predict(dummy_data)[0]  # Extract the scalar

        percent_change_qty = (new_pred - original_pred) / original_pred
        elasticity = round(percent_change_qty / change, 4)

        results[str(int(change * 100)) + '%'] = elasticity

    return results


# Define a continuous range of sensitivities from -50% to 50% in 1% increments
changes = [i/100 for i in range(-50, 51)]
feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]

elasticity_table = pd.DataFrame(index=feature_names)

for feature in feature_names:
    elasticities = compute_elasticity_for_changes(gb_model, feature, changes)
    for key, value in elasticities.items():
        elasticity_table.at[feature, key] = value
print(elasticity_table)
-50% -49% -48% ... 48% 49% 50%
const -0.0000 -0.0000 -0.0000 ... 0.0000 0.0000 0.0000
product_Product1:price -0.0256 -0.0000 -0.0000 ... -0.1769 -0.1733 -0.1698
product_Product2:price -0.0978 -0.0998 -0.1019 ... -0.1778 -0.1741 -0.1707
product_Product3:price -0.0236 -0.0241 -0.0246 ... -0.0535 -0.0525 -0.0514
age_group_20-35:price -0.4954 -0.5055 -0.5160 ... -0.2911 -0.2851 -0.2794
age_group_36-45:price -0.1829 -0.1867 -0.1906 ... -0.2840 -0.2782 -0.2726
age_group_46+:price -0.1886 -0.1925 -0.1965 ... -0.4674 -0.4579 -0.4487
occupation_HEAVY BLUE:price -0.0214 -0.0219 -0.0223 ... -0.0183 -0.0179 -0.0175
occupation_LIGHT BLUE:price -0.0977 -0.0997 -0.1018 ... -0.0636 -0.0623 -0.0611
occupation_WHITE COLLAR:price -0.0258 -0.0263 -0.0269 ... -0.0182 -0.0178 -0.0175
[10 rows x 99 columns]
Plotting the model-implied elasticities by feature, with the original input elasticities shown in red:
# Adjust the names in the dictionaries to match the format in the dataframe
product_elasticities <- list('product_Product1:price' = 1.1, 'product_Product2:price' = 1.1, 'product_Product3:price' = 1)
age_elasticities <- list('age_group_20-35:price' = 1.5, 'age_group_36-45:price' = 1.2, 'age_group_46+:price' = 1.1)
occupation_elasticities <- list('occupation_LIGHT BLUE:price' = 1.2, 'occupation_HEAVY BLUE:price' = 1.05, 'occupation_WHITE COLLAR:price' = 1.05)
# Combine all the dictionaries into a data frame
original_elasticities <- data.frame(
feature = c(names(product_elasticities), names(age_elasticities), names(occupation_elasticities)),
elasticity = c(unlist(product_elasticities), unlist(age_elasticities), unlist(occupation_elasticities))
)
# Adjust the elasticity values
original_elasticities$elasticity <- original_elasticities$elasticity * (-1)
# Convert python df to r for plotting
elasticity_table <- py$elasticity_table
# Convert data to long format
long_data <- elasticity_table %>%
rownames_to_column(var = "feature") %>%
gather(key = "change", value = "elasticity", -feature)
# Extract numerical values from the "change" column for sizing
long_data$change_num <- as.numeric(gsub("%", "", long_data$change))/100
# Plot the data
ggplot(long_data, aes(x = elasticity, y = feature)) +
geom_point(aes(size = change_num), color = "steelblue") + # single color and size is based on change_num
geom_point(data = original_elasticities, aes(x = elasticity, y = feature), color = "red", size = 4) + # Data points for original elasticities
geom_vline(aes(xintercept = 0), linetype = "dashed", color = "grey50") + # line at x=0 for reference
labs(title = "Elasticities by Feature for Different Percentage Changes",
x = "Elasticity",
y = "Feature") +
theme_minimal() +
theme(legend.position = "bottom") +
guides(size = guide_legend(title = "Price Change"))
Finally, the Fixed Effects panel model, with product as the entity and date as the time dimension:
# Convert to panel data
panel_data = df.set_index(['product', 'date'])

# Define exogenous variables
exog = panel_data[['age_group', 'occupation', 'price']]
exog = sm.add_constant(exog, has_constant='add')  # Add constant by product

# Define endogenous variable
endog = panel_data['normalized_recommendations']

# Create a model with fixed effects
fe_model = PanelOLS(endog, exog, entity_effects=True, drop_absorbed=True)

# Fit the model
fe_results = fe_model.fit()

# Use the model to make predictions
fe_predictions = fe_results.predict(exog)

# Calculate evaluation metrics
fe_mse = mean_squared_error(endog, fe_predictions)
fe_r2 = r2_score(endog, fe_predictions)
# Print evaluation metrics
print(f"FE Model MSE: {fe_mse:.4f}")
FE Model MSE: 0.0009
print(f"FE Model R-squared: {fe_r2:.4f}")
FE Model R-squared: 0.9594
# Print model summary
print(fe_results)
PanelOLS Estimation Summary
========================================================================================
Dep. Variable: normalized_recommendations R-squared: 0.9547
Estimator: PanelOLS R-squared (Between): 0.9917
No. Observations: 1782 R-squared (Within): 0.9547
Date: Wed, Jul 26 2023 R-squared (Overall): 0.9594
Time: 13:36:21 Log-likelihood 3765.5
Cov. Estimator: Unadjusted
F-statistic: 7470.5
Entities: 3 P-value 0.0000
Avg Obs: 594.00 Distribution: F(5,1774)
Min Obs: 594.00
Max Obs: 594.00 F-statistic (robust): 7470.5
P-value 0.0000
Time periods: 66 Distribution: F(5,1774)
Avg Obs: 27.000
Min Obs: 27.000
Max Obs: 27.000
Parameter Estimates
===========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------------------
const 2.4563 0.0081 304.18 0.0000 2.4405 2.4722
age_group.36-45 -0.0077 0.0017 -4.4776 0.0000 -0.0111 -0.0043
age_group.46+ 0.0062 0.0018 3.3776 0.0007 0.0026 0.0098
occupation.LIGHT BLUE 0.0053 0.0017 3.1193 0.0018 0.0020 0.0087
occupation.WHITE COLLAR 0.0043 0.0017 2.5084 0.0122 0.0009 0.0076
price -0.0143 8.051e-05 -177.81 0.0000 -0.0145 -0.0142
===========================================================================================
F-test for Poolability: 23.944
P-value: 0.0000
Distribution: F(2,1774)
Included effects: Entity
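As with the GLM, the fixed-effects price coefficient can be read as an implied elasticity (a sketch, under the same base-price-of-100 and normalised-quantity-near-1 assumptions):
# sketch: implied elasticity from the fixed effects price coefficient
fe_price_coef = fe_results.params['price']
print(f"FE model implied elasticity: {fe_price_coef * 100 / 1.0:.2f}")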