Price elasticity is the responsiveness of sales to changes in price. This recipe is an initial exploration of modelling demand/recommendations given price and other variables, extendable to elasticity questions that help drive pricing strategies.
…How might we predict changes in product sales patterns based on price? And what is the impact on adviser recommendations?…
Price elasticity is the responsiveness of sales to changes in price, and price elasticity modelling helps drive pricing strategies.
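Formally, price elasticity can be written as the percentage change in quantity for a percentage change in price,

$$E = \frac{\%\Delta Q}{\%\Delta P} \approx \frac{\partial Q}{\partial P}\cdot\frac{P}{Q},$$

so an elasticity of, say, $-1.5$ means a 1% price increase reduces volumes by roughly 1.5%.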
Factors such as age and occupation are likely to affect price elasticity. Insurer-specific factors, or external factors influencing advisers, might be present but hidden/not explicit in the data. There might also be a lag in the effect of price changes on recommendations.
Price elasticity of sales shares similarities with lapse rate modelling, although there the lag/inertia effect might be more stark.
Setting up the environment, along with a list of some of the packages used in this recipe.
# calling python from r
library(reticulate)
# create environment if does not exist
# conda_create("r-reticulate")
# py_config() # to check configuration
# activate environment
use_condaenv("r-reticulate", required=TRUE) # set environment before running any Python chunks
# if not already installed, install the below. if env specified, can drop envname parameter
# py_install("pandas",envname = "r-reticulate")
# py_install("numpy",envname = "r-reticulate")
# py_install("scipy",envname = "r-reticulate")
# py_install("matplotlib",envname = "r-reticulate")
# py_install("seaborn",envname = "r-reticulate")
# py_install("scikit-learn",envname = "r-reticulate")
# py_install("linearmodels",envname = "r-reticulate")
# py_install("tabulate",envname = "r-reticulate")
# py_install("rpy2",envname = "r-reticulate")
# ...etc
Libraries:
# import some standard libraries
import pandas as pd
import numpy as np
from tabulate import tabulate
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from linearmodels.panel import PanelOLS
Creating data split into 3 age groups, 3 occupations and 3 products/insurers. We define an initial price for each product and assume the initial recommendations are split roughly equally across products (in spite of initial price differences, implying a hidden preference not captured in the data, which will perhaps be picked up by a product feature in the model). A series of price changes is then defined, along with their impact on the level of recommendations via assumed elasticities. A within-year seasonal pattern is overlaid on the recommendations and a random noise term is added.
def adjust_recommendations(row):
    # get the product, age and occupation specific elasticities
    product_elasticity = product_elasticities[row['product']]
    age_elasticity = age_elasticities[row['age_group']]
    occupation_elasticity = occupation_elasticities[row['occupation']]
    # calculate the combined elasticity across the three factors
    combined_elasticity = product_elasticity * age_elasticity * occupation_elasticity
    # calculate the percentage change in price relative to the base price
    price_change = (row['price'] - row['base_price']) / row['base_price']
    # linear approximation: a 1% price rise reduces recommendations by combined_elasticity per cent
    adjusted_recommendations = row['recommendations'] * (1 - price_change * combined_elasticity)
    return adjusted_recommendations
# Define date range
date_range = pd.date_range(start='2019-01-01', end='2024-06-30', freq='M', normalize=True)
# Define age groups, occupations and products
age_groups = ['20-35', '36-45', '46+']
occupations = ['LIGHT BLUE', 'HEAVY BLUE', 'WHITE COLLAR']
products = ['Product1', 'Product2', 'Product3']
# Define baseline price for each product
price = {'Product1': 100.00, 'Product2': 100.00, 'Product3': 100.00}
# Create initial dataframe
data = []
for date in date_range:
for product in products:
for age_group in age_groups:
for occupation in occupations:
row = {
'product': product,
'date': date.date(),
'age_group': age_group,
'occupation': occupation,
'base_price': price[product],
'price': price[product],
'price_changed': 0,
'change_direction': 'no change'
}
data.append(row)
df = pd.DataFrame(data)
# Apply price changes
price_changes = [
# (product, age_group, occupation, date, price multiplier)
('Product1', '46+', None, pd.to_datetime('2019-07-01').date(), 1.15),
('Product3', None, 'HEAVY BLUE', pd.to_datetime('2019-09-01').date(), 1.05),
('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
('Product2', '46+', 'LIGHT BLUE', pd.to_datetime('2020-05-01').date(), 1.1),
('Product3', None, 'LIGHT BLUE', pd.to_datetime('2020-03-01').date(), 0.9),
('Product2', '46+', 'WHITE COLLAR', pd.to_datetime('2020-12-01').date(), 1.3),
('Product2', '20-35', None, pd.to_datetime('2021-04-01').date(), 0.8),
('Product1', None, 'HEAVY BLUE', pd.to_datetime('2021-12-01').date(), 1.05),
('Product1', '20-35', 'WHITE COLLAR', pd.to_datetime('2022-03-01').date(), 1.2),
('Product3', '46+', 'HEAVY BLUE', pd.to_datetime('2022-06-01').date(), 0.85),
('Product2', '36-45', 'LIGHT BLUE', pd.to_datetime('2022-09-01').date(), 1.1),
('Product1', '36-45', 'LIGHT BLUE', pd.to_datetime('2023-01-01').date(), 1.05),
('Product2', '20-35', 'HEAVY BLUE', pd.to_datetime('2023-04-01').date(), 1.3),
('Product3', '46+', 'WHITE COLLAR', pd.to_datetime('2023-07-01').date(), 0.9),
('Product1', '20-35', None, pd.to_datetime('2023-10-01').date(), 1.1),
('Product2', None, 'HEAVY BLUE', pd.to_datetime('2024-01-01').date(), 0.95),
('Product3', '36-45', 'WHITE COLLAR', pd.to_datetime('2024-04-01').date(), 1.15),
]
# Apply price changes
for product, age_group, occupation, date, multiplier in price_changes:
mask = (df['product'] == product) & (df['date'] >= date)
if age_group is not None:
mask = mask & (df['age_group'] == age_group)
if occupation is not None:
mask = mask & (df['occupation'] == occupation)
df.loc[mask, 'price'] *= multiplier
# Flag price change
for product, age_group, occupation, date, multiplier in price_changes:
mask = (df['product'] == product) & (df['date'].apply(lambda x: x.year)== date.year) & (df['date'].apply(lambda x: x.month) == date.month)
if age_group is not None:
mask = mask & (df['age_group'] == age_group)
if occupation is not None:
mask = mask & (df['occupation'] == occupation)
df.loc[mask, 'price_changed'] = 1
df.loc[mask, 'change_direction'] = 'up' if multiplier > 1 else 'down'
# Ensure that each product, age, occ gets roughly equal recommendations initially
df['recommendations'] = 100
# Noise terms
# Monthly
np.random.seed(0) # for reproducibility
mu, sigma = 10, 5 # location and spread (in months) of the seasonal pattern
s = np.random.normal(mu, sigma, df.shape[0]) # generate random numbers from the distribution (not used directly below)
df['month'] = df['date'].apply(lambda x: x.month)
df['month_norm'] = 1 + norm.pdf(df['month'], mu, sigma) # generate the monthly seasonal uplift
df['recommendations'] = df['recommendations'] * df['month_norm']
# Random
noise_factor = 1 + (np.random.normal(0, 0.02, df.shape[0]))
df['recommendations'] *= noise_factor
# Apply price changes and adjust recommendations based on price elasticity
product_elasticities = {'Product1': 1.1, 'Product2': 1.1, 'Product3': 1}
age_elasticities = {'20-35': 1.5, '36-45': 1.2, '46+': 1.1}
occupation_elasticities = {'LIGHT BLUE': 1.2, 'HEAVY BLUE': 1.05, 'WHITE COLLAR': 1.05}
df['recommendations'] = df.apply(adjust_recommendations, axis=1)
# Round the recommendations to nearest integer as we're dealing with counts
df['recommendations'] = df['recommendations'].round().astype(int)
# We know that the data has seasonality, normalising this for later visuals:
# Calculate monthly averages
monthly_averages = df.groupby('month')['recommendations'].mean().reset_index()
monthly_averages.columns = ['month', 'avg_recommendations']
# Merge and norm
df = df.merge(monthly_averages, on='month', how='left')
df['normalized_recommendations'] = df['recommendations'] / df['avg_recommendations']
df.head()
    product        date  ...  avg_recommendations  normalized_recommendations
0 Product1 2019-01-31 ... 99.345679 0.986455
1 Product1 2019-01-31 ... 99.345679 1.016652
2 Product1 2019-01-31 ... 99.345679 1.006586
3 Product1 2019-01-31 ... 99.345679 1.026718
4 Product1 2019-01-31 ... 99.345679 0.976389
[5 rows x 13 columns]
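A quick sanity check on the simulated panel (a minimal sketch using the df built above): 66 month-ends from January 2019 to June 2024 across 3 products × 3 age groups × 3 occupations gives 1,782 rows.

# 66 month-ends x 3 products x 3 age groups x 3 occupations = 1782 rows
print(df.shape)              # expected: (1782, 13) after the monthly-average merge above
print(df['date'].nunique())  # expected: 66 month-end dates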
# Convert python df to r for plotting
df <- py$df
# Convert the date
df$date <- as.Date(sapply(df$date, as.character))
# Plot
p <- ggplot(df, aes(x = date, y = price, color = product)) +
geom_line() +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Price change history") +
theme_minimal() +
ylim(80, 150) +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1)) # This line makes the x-axis labels diagonal
print(p)
The plots show the seasonality; the triangle markers reflect the timing and direction of the price changes. There is a fair amount of noise in the recommendations:
p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
geom_line(size=0.5) +
# Plot
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = recommendations, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Recommendations history") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
From our inputs, we know we have overlaid a seasonality factor; averaging the month_norm factor by date (df_avg below) shows the pattern:
# A tibble: 6 × 2
date avg_month_norm
<date> <dbl>
1 2019-01-31 1.02
2 2019-02-28 1.02
3 2019-03-31 1.03
4 2019-04-30 1.04
5 2019-05-31 1.05
6 2019-06-30 1.06
# Plot
p <- ggplot(df_avg, aes(x = date, y = avg_month_norm)) +
geom_line() +
scale_y_continuous(labels = percent_format()) +
labs(title = "Monthly seasonality (data param)") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
Taking a 3-month moving average reduces the noise, and some correlation with the rate changes becomes visible:
# Calculate the 3-month moving average
df <- df %>%
group_by(age_group, occupation, product) %>%
arrange(date) %>%
mutate(moving_avg = zoo::rollmean(recommendations, k = 3, align = "right", fill = NA))
p <- ggplot(df, aes(x = date, y = recommendations, color = product)) +
# Adding a 3-month moving average line
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
geom_line(aes(y = moving_avg), linetype = "solid", size = 0.5) +
facet_grid(age_group ~ occupation) +
scale_x_date(breaks = unique(df$date)[seq(1, length(unique(df$date)), by = 10)]) +
labs(title = "Recommendations history with 3-Month moving average") +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
Removing the seasonality by standardising the monthly recommendations by their monthly averages, while retaining the 3-month moving average:
# Add moving average based upon normalised recommendations
df <- df %>%
group_by(age_group, occupation, product) %>%
arrange(date) %>%
mutate(moving_avg = zoo::rollmean(normalized_recommendations, k = 3, align = "right", fill = NA))
# Plot the normalized data
p <- ggplot(df, aes(x = date, y = moving_avg, color = product)) +
geom_point(data = subset(df, price_changed == 1), aes(x = date, y = moving_avg, color = product, fill=product, shape = change_direction), size = 3) +
scale_shape_manual(values = c(up = 24, down = 25)) +
geom_line() +
facet_grid(age_group ~ occupation) +
labs(title = "Normalized recommendations history") +
theme_minimal() +
theme(legend.position = "bottom")
print(p)
From our inputs, the price elasticity factors are:
# Convert the dictionaries to lists of tuples
product_data = list(product_elasticities.items())
age_data = list(age_elasticities.items())
occupation_data = list(occupation_elasticities.items())
# Print the tables
print("Product Elasticities:")
Product Elasticities:
print(tabulate(product_data, headers=["Product", "Elasticity"], tablefmt="grid"))
+-----------+--------------+
| Product | Elasticity |
+===========+==============+
| Product1 | 1.1 |
+-----------+--------------+
| Product2 | 1.1 |
+-----------+--------------+
| Product3 | 1 |
+-----------+--------------+
print("\nAge Elasticities:")
Age Elasticities:
print(tabulate(age_data, headers=["Age Group", "Elasticity"], tablefmt="grid"))
+-------------+--------------+
| Age Group | Elasticity |
+=============+==============+
| 20-35 | 1.5 |
+-------------+--------------+
| 36-45 | 1.2 |
+-------------+--------------+
| 46+ | 1.1 |
+-------------+--------------+
print("\nOccupation Elasticities:")
Occupation Elasticities:
print(tabulate(occupation_data, headers=["Occupation", "Elasticity"], tablefmt="grid"))
+--------------+--------------+
| Occupation | Elasticity |
+==============+==============+
| LIGHT BLUE | 1.2 |
+--------------+--------------+
| HEAVY BLUE | 1.05 |
+--------------+--------------+
| WHITE COLLAR | 1.05 |
+--------------+--------------+
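Under the adjust_recommendations function above, the elasticity applied to a given cell is the product of the three factors. A small illustration for one (arbitrarily chosen) cell:

# Combined elasticity for an illustrative cell: Product1, ages 20-35, LIGHT BLUE
combined = (product_elasticities['Product1']
            * age_elasticities['20-35']
            * occupation_elasticities['LIGHT BLUE'])
print(combined)  # 1.1 * 1.5 * 1.2 = 1.98, i.e. a 10% price rise cuts recommendations by ~19.8%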
Looking at the correlations between price and recommendations, we see a stronger negative correlation for Products 1 and 2, as expected. This holds both for the base recommendations and for those normalised for seasonality:
# Subset the data
subset_df = df[['product', 'price', 'normalized_recommendations', 'recommendations']]
# Group the data by
correlations = subset_df.groupby('product').apply(lambda x: pd.Series({
'Corr Price v Norm Recommend': x['price'].corr(x['normalized_recommendations']),
'Corr Price v Recommend': x['price'].corr(x['recommendations'])
}))
# Round the correlations to 3 decimal points
correlations = correlations.round(3)
# Reset the index and rename the columns
correlation_table = correlations.reset_index().rename(columns={'level_1': 'Correlation Type'})
# Display the correlation table
print(correlation_table)
    product  Corr Price v Norm Recommend  Corr Price v Recommend
0 Product1 -0.970 -0.953
1 Product2 -0.984 -0.977
2 Product3 -0.957 -0.929
Normalising for seasonality, we see a stronger negative correlation for younger ages (<36), but the correlation is stronger than expected for older ages given the inputs:
# Subset the data
subset_df = df[['product', 'age_group', 'price', 'normalized_recommendations']]
# Group the data by
correlations = subset_df.groupby(['product', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()
# Round the correlations to 3 decimal points
correlations = correlations.round(3)
# Display the correlation table
print(correlations)
age_group    20-35  36-45    46+
product
Product1 -0.991 -0.789 -0.945
Product2 -0.986 -0.983 -0.991
Product3 -0.980 -0.969 -0.956
The data gets a bit sparser when we add in all of product, age and occupation, but we do largely see a stronger negative correlation for LIGHT BLUE, as expected:
# Subset the data for the desired columns
subset_df = df[['product', 'occupation', 'age_group', 'price', 'normalized_recommendations']]
# Create an empty dictionary to store the correlation tables
correlation_tables = {}
# Iterate over each product
for product in subset_df['product'].unique():
product_df = subset_df[subset_df['product'] == product]
# Group the data by 'occupation' and 'age_group' and calculate the correlations
correlations = product_df.groupby(['occupation', 'age_group']).apply(lambda x: x['price'].corr(x['normalized_recommendations'])).unstack()
# Round the correlations to 3 decimal points
correlations = correlations.round(3)
# Store the correlation table in the dictionary
correlation_tables[product] = correlations
# Display the correlation tables
for product, correlations in correlation_tables.items():
print(f"Correlation table for {product}")
print(correlations)
print()
Correlation table for Product1
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.974 -0.813 -0.968
LIGHT BLUE -0.952 -0.826 -0.958
WHITE COLLAR -0.997 NaN -0.930
Correlation table for Product2
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.992 -0.631 -0.699
LIGHT BLUE -0.992 -0.992 -0.973
WHITE COLLAR -0.987 NaN -0.996
Correlation table for Product3
age_group 20-35 36-45 46+
occupation
HEAVY BLUE -0.864 -0.750 -0.967
LIGHT BLUE -0.958 -0.932 -0.935
WHITE COLLAR NaN -0.865 -0.906
Initially we model recommendations as explained by price and other explanatory variables. The models considered were a Gradient Boosting Regressor, a Generalized Linear Model (GLM), and a Fixed Effects (FE) panel model.
To start with, let's model the normalised recommendations with a GLM. We see that month is not significant, the parameters on the price interaction terms can be read directly as price sensitivities, and the fitted coefficients show essentially no differential by product, age or occupation:
# Set date format
df['date'] = pd.to_datetime(df['date'])
# Define predictors
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)
# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
for dummy in df_encoded.filter(like=col).columns:
df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
df_encoded = df_encoded.astype(float)
# Define target and predictors
X = df_encoded[['month'] + [col for col in df_encoded if ':price' in col]]
Y = df['normalized_recommendations']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
# Create Gaussian GLM (identity link)
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())
# Train model
glm_results = glm_model.fit()
# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)
# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)
# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 0.0008
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9588
# Summary
glm_results.summary()
| Dep. Variable: | normalized_recommendations | No. Observations: | 1425 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 1416 |
| Model Family: | Gaussian | Df Model: | 8 |
| Link Function: | Identity | Scale: | 0.00087058 |
| Method: | IRLS | Log-Likelihood: | 3008.5 |
| Date: | Wed, 26 Jul 2023 | Deviance: | 1.2234 |
| Time: | 13:36:11 | Pearson chi2: | 1.22 |
| No. Iterations: | 3 | Pseudo R-squ. (CS): | 1.000 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 2.4810 | 0.009 | 268.747 | 0.000 | 2.463 | 2.499 |
| month | 0.0003 | 0.000 | 1.168 | 0.243 | -0.000 | 0.001 |
| product_Product1:price | -0.0049 | 2.8e-05 | -173.554 | 0.000 | -0.005 | -0.005 |
| product_Product2:price | -0.0048 | 3.2e-05 | -149.704 | 0.000 | -0.005 | -0.005 |
| product_Product3:price | -0.0049 | 3.53e-05 | -138.976 | 0.000 | -0.005 | -0.005 |
| age_group_20-35:price | -0.0049 | 3.52e-05 | -138.179 | 0.000 | -0.005 | -0.005 |
| age_group_36-45:price | -0.0049 | 3.21e-05 | -153.434 | 0.000 | -0.005 | -0.005 |
| age_group_46+:price | -0.0048 | 2.8e-05 | -170.282 | 0.000 | -0.005 | -0.005 |
| occupation_HEAVY BLUE:price | -0.0049 | 3.15e-05 | -154.921 | 0.000 | -0.005 | -0.005 |
| occupation_LIGHT BLUE:price | -0.0048 | 3.33e-05 | -145.694 | 0.000 | -0.005 | -0.005 |
| occupation_WHITE COLLAR:price | -0.0048 | 3.04e-05 | -158.408 | 0.000 | -0.005 | -0.005 |
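As a rough reading of these coefficients (assuming a representative price of 100 and normalized recommendations near 1): because the product, age and occupation dummies each sum to one, the price slope for any single cell is the sum of one coefficient from each group,

$$\beta_{\text{price}} \approx (-0.0049) + (-0.0049) + (-0.0048) \approx -0.0146,$$

which implies an overall elasticity of roughly $-0.0146 \times 100 / 1 \approx -1.5$, close to the average of the combined elasticities used to generate the data.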
Modifying this to model the raw recommendations, with month entering as a polynomial (adding a month-squared term):
# Set date format
df['date'] = pd.to_datetime(df['date'])
# Define predictors
df_encoded = pd.get_dummies(df[['product', 'age_group', 'occupation', 'price', 'month']], drop_first=False)
# Add polynomial term for the month
df_encoded['month_squared'] = df_encoded['month']**2
# Interactions
interaction_columns = ['product', 'age_group', 'occupation']
for col in interaction_columns:
for dummy in df_encoded.filter(like=col).columns:
df_encoded[f'{dummy}:price'] = df_encoded[dummy] * df_encoded['price']
df_encoded = df_encoded.astype(float)
# Define target and predictors
X = df_encoded[['month', 'month_squared'] + [col for col in df_encoded if ':price' in col]]
Y = df['recommendations']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Add constant to predictors - statsmodels requires this for correct model specification
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
# Create Gaussian GLM (identity link)
glm_model = sm.GLM(y_train, X_train, family=sm.families.Gaussian())
# Train model
glm_results = glm_model.fit()
# Use the model to make predictions
glm_predictions = glm_results.predict(X_test)
# Calculate evaluation metrics
glm_mse = mean_squared_error(y_test, glm_predictions)
glm_r2 = r2_score(y_test, glm_predictions)
# Print evaluation metrics
print(f"GLM MSE: {glm_mse:.4f}")
GLM MSE: 8.9030
print(f"GLM R-squared: {glm_r2:.4f}")
GLM R-squared: 0.9574
# Summary
glm_results.summary()
| Dep. Variable: | recommendations | No. Observations: | 1425 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 1415 |
| Model Family: | Gaussian | Df Model: | 9 |
| Link Function: | Identity | Scale: | 9.2575 |
| Method: | IRLS | Log-Likelihood: | -3602.7 |
| Date: | Wed, 26 Jul 2023 | Deviance: | 13101. |
| Time: | 13:36:12 | Pearson chi2: | 1.31e+04 |
| No. Iterations: | 3 | Pseudo R-squ. (CS): | 1.000 |
| Covariance Type: | nonrobust |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 250.7090 | 0.983 | 254.945 | 0.000 | 248.782 | 252.636 |
| month | 1.3797 | 0.100 | 13.787 | 0.000 | 1.184 | 1.576 |
| month_squared | -0.0587 | 0.008 | -7.734 | 0.000 | -0.074 | -0.044 |
| product_Product1:price | -0.5012 | 0.003 | -173.719 | 0.000 | -0.507 | -0.496 |
| product_Product2:price | -0.4943 | 0.003 | -149.876 | 0.000 | -0.501 | -0.488 |
| product_Product3:price | -0.5070 | 0.004 | -139.169 | 0.000 | -0.514 | -0.500 |
| age_group_20-35:price | -0.5018 | 0.004 | -138.316 | 0.000 | -0.509 | -0.495 |
| age_group_36-45:price | -0.5091 | 0.003 | -153.678 | 0.000 | -0.516 | -0.503 |
| age_group_46+:price | -0.4915 | 0.003 | -170.436 | 0.000 | -0.497 | -0.486 |
| occupation_HEAVY BLUE:price | -0.5039 | 0.003 | -155.091 | 0.000 | -0.510 | -0.498 |
| occupation_LIGHT BLUE:price | -0.5006 | 0.003 | -145.874 | 0.000 | -0.507 | -0.494 |
| occupation_WHITE COLLAR:price | -0.4980 | 0.003 | -158.591 | 0.000 | -0.504 | -0.492 |
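The same back-of-the-envelope reading applies here (again assuming a representative price of 100): the price-slope coefficients sum to roughly

$$(-0.50) + (-0.50) + (-0.50) \approx -1.5 \text{ per price unit},$$

and with fitted recommendations of around 100 at that price this implies an elasticity of roughly $-1.4$ to $-1.5$, consistent with the normalised model above.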
# Create Gradient Boosting model using the same train/test split and target (recommendations) as the GLM above
gb_model = GradientBoostingRegressor(random_state=42)
# Train model
gb_model.fit(X_train, y_train)
GradientBoostingRegressor(random_state=42)
# Use the model to make predictions
gb_predictions = gb_model.predict(X_test)
# Calculate metrics
gb_mse = mean_squared_error(y_test, gb_predictions)
gb_r2 = r2_score(y_test, gb_predictions)
print(f'Gradient Boosting MSE: {gb_mse:.4f}')
Gradient Boosting MSE: 5.2963
print(f'Gradient Boosting R^2: {gb_r2:.4f}')
Gradient Boosting R^2: 0.9747
# Get feature importances
importance = gb_model.feature_importances_
# Create a DataFrame to display the feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': importance})
# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
# Display the top features
print(feature_importance_df)
                          Feature  Importance
6 age_group_20-35:price 0.354780
8 age_group_46+:price 0.270391
7 age_group_36-45:price 0.124483
4 product_Product2:price 0.094404
3 product_Product1:price 0.074612
10 occupation_LIGHT BLUE:price 0.025876
1 month 0.014360
5 product_Product3:price 0.014010
9 occupation_HEAVY BLUE:price 0.011436
11 occupation_WHITE COLLAR:price 0.008229
2 month_squared 0.007420
0 const 0.000000
Extracting implied elasticities from the Gradient Boosting model by simulating a 10% price change and measuring the change in predictions:
def compute_elasticity(model, feature_name):
# Create a dummy row of zeros
dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
# If the feature name contains ":price", it is a continuous price feature.
if ":price" in feature_name:
dummy_data[feature_name] = 100 # Setting initial price to a standard value
original_pred = model.predict(dummy_data)
# Simulate a 10% price increase (100 -> 110)
dummy_data[feature_name] = 110
new_pred = model.predict(dummy_data)
else: # It's a binary/categorical feature
dummy_data[feature_name] = 1
original_pred = model.predict(dummy_data)
# Toggle its value
dummy_data[feature_name] = 0
new_pred = model.predict(dummy_data)
percent_change_qty = (new_pred - original_pred) / original_pred
elasticity = percent_change_qty / 0.1
return round(elasticity[0],4) # Extract value from the array
feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]
# Create a dictionary to store computed elasticities
elasticities = {}
# For each feature of interest, compute the elasticity
for feature in feature_names:
elasticities[feature] = compute_elasticity(gb_model, feature)
print(elasticities)
{'const': 0.0, 'product_Product1:price': -0.7521, 'product_Product2:price': -0.3182, 'product_Product3:price': -0.2497, 'age_group_20-35:price': -0.8312, 'age_group_36-45:price': -0.815, 'age_group_46+:price': -1.1606, 'occupation_HEAVY BLUE:price': 0.0, 'occupation_LIGHT BLUE:price': -0.1401, 'occupation_WHITE COLLAR:price': 0.0078}
Looking at a range of price change sensitivities:
def compute_elasticity_for_changes(model, feature_name, changes):
# This dictionary will store elasticities for various changes
results = {}
# Create a dummy row of zeros
dummy_data = pd.DataFrame(np.zeros((1, len(X_train.columns))), columns=X_train.columns)
for change in changes:
# Avoid issues with 0% change
if change == 0:
results['0%'] = np.nan
continue
# If the feature name contains ":price", it is a continuous price feature.
if ":price" in feature_name:
dummy_data[feature_name] = 100 # Setting initial price to a standard value
original_pred = model.predict(dummy_data)[0] # Extract the scalar
# Simulate a change
dummy_data[feature_name] = 100 + 100 * change
new_pred = model.predict(dummy_data)[0] # Extract the scalar
else: # It's a binary/categorical feature
dummy_data[feature_name] = 1
original_pred = model.predict(dummy_data)[0] # Extract the scalar
# Toggle its value
dummy_data[feature_name] = 0
new_pred = model.predict(dummy_data)[0] # Extract the scalar
percent_change_qty = (new_pred - original_pred) / original_pred
elasticity = round(percent_change_qty / change, 4)
results[str(int(change * 100)) + '%'] = elasticity
return results
# Define a continuous range of sensitivities from -50% to 50% in 1% increments
changes = [i/100 for i in range(-50, 51)]
feature_names = [f for f in X_train.columns if f not in ['month', 'month_squared']]
elasticity_table = pd.DataFrame(index=feature_names)
for feature in feature_names:
elasticities = compute_elasticity_for_changes(gb_model, feature, changes)
for key, value in elasticities.items():
elasticity_table.at[feature, key] = value
print(elasticity_table)
                                  -50%     -49%     -48%  ...      48%      49%      50%
const -0.0000 -0.0000 -0.0000 ... 0.0000 0.0000 0.0000
product_Product1:price -0.0256 -0.0000 -0.0000 ... -0.1769 -0.1733 -0.1698
product_Product2:price -0.0978 -0.0998 -0.1019 ... -0.1778 -0.1741 -0.1707
product_Product3:price -0.0236 -0.0241 -0.0246 ... -0.0535 -0.0525 -0.0514
age_group_20-35:price -0.4954 -0.5055 -0.5160 ... -0.2911 -0.2851 -0.2794
age_group_36-45:price -0.1829 -0.1867 -0.1906 ... -0.2840 -0.2782 -0.2726
age_group_46+:price -0.1886 -0.1925 -0.1965 ... -0.4674 -0.4579 -0.4487
occupation_HEAVY BLUE:price -0.0214 -0.0219 -0.0223 ... -0.0183 -0.0179 -0.0175
occupation_LIGHT BLUE:price -0.0977 -0.0997 -0.1018 ... -0.0636 -0.0623 -0.0611
occupation_WHITE COLLAR:price -0.0258 -0.0263 -0.0269 ... -0.0182 -0.0178 -0.0175
[10 rows x 99 columns]
Plotting the elasticities:
# Adjust the names in the dictionaries to match the format in the dataframe
product_elasticities <- list('product_Product1:price' = 1.1, 'product_Product2:price' = 1.1, 'product_Product3:price' = 1)
age_elasticities <- list('age_group_20-35:price' = 1.5, 'age_group_36-45:price' = 1.2, 'age_group_46+:price' = 1.1)
occupation_elasticities <- list('occupation_LIGHT BLUE:price' = 1.2, 'occupation_HEAVY BLUE:price' = 1.05, 'occupation_WHITE COLLAR:price' = 1.05)
# Combine all the dictionaries into a data frame
original_elasticities <- data.frame(
feature = c(names(product_elasticities), names(age_elasticities), names(occupation_elasticities)),
elasticity = c(unlist(product_elasticities), unlist(age_elasticities), unlist(occupation_elasticities))
)
# Adjust the elasticity values
original_elasticities$elasticity <- original_elasticities$elasticity * (-1)
# Convert python df to r for plotting
elasticity_table <- py$elasticity_table
# Convert data to long format
long_data <- elasticity_table %>%
rownames_to_column(var = "feature") %>%
gather(key = "change", value = "elasticity", -feature)
# Extract numerical values from the "change" column for sizing
long_data$change_num <- as.numeric(gsub("%", "", long_data$change))/100
# Plot the data
ggplot(long_data, aes(x = elasticity, y = feature)) +
geom_point(aes(size = change_num), color = "steelblue") + # single color and size is based on change_num
geom_point(data = original_elasticities, aes(x = elasticity, y = feature), color = "red", size = 4) + # Data points for original elasticities
geom_vline(aes(xintercept = 0), linetype = "dashed", color = "grey50") + # line at x=0 for reference
labs(title = "Elasticities by Feature for Different Percentage Changes",
x = "Elasticity",
y = "Feature") +
theme_minimal() +
theme(legend.position = "bottom") +
guides(size = guide_legend(title = "Price Change"))
# Convert to panel data: product is the entity, date the time index
panel_data = df.set_index(['product', 'date'])
# Define exogenous variables
exog = panel_data[['age_group', 'occupation', 'price']]
exog = sm.add_constant(exog, has_constant='add') # Add constant
# Define endogenous variable
endog = panel_data['normalized_recommendations']
# Create a model with fixed effects
fe_model = PanelOLS(endog, exog, entity_effects=True, drop_absorbed=True)
# Fit the model
fe_results = fe_model.fit()
# Use the model to make predictions
fe_predictions = fe_results.predict(exog)
# Calculate evaluation metrics
fe_mse = mean_squared_error(endog, fe_predictions)
fe_r2 = r2_score(endog, fe_predictions)
# Print evaluation metrics
print(f"FE Model MSE: {fe_mse:.4f}")
FE Model MSE: 0.0009
print(f"FE Model R-squared: {fe_r2:.4f}")
FE Model R-squared: 0.9594
# Print model summary
print(fe_results)
                           PanelOLS Estimation Summary
========================================================================================
Dep. Variable: normalized_recommendations R-squared: 0.9547
Estimator: PanelOLS R-squared (Between): 0.9917
No. Observations: 1782 R-squared (Within): 0.9547
Date: Wed, Jul 26 2023 R-squared (Overall): 0.9594
Time: 13:36:21 Log-likelihood 3765.5
Cov. Estimator: Unadjusted
F-statistic: 7470.5
Entities: 3 P-value 0.0000
Avg Obs: 594.00 Distribution: F(5,1774)
Min Obs: 594.00
Max Obs: 594.00 F-statistic (robust): 7470.5
P-value 0.0000
Time periods: 66 Distribution: F(5,1774)
Avg Obs: 27.000
Min Obs: 27.000
Max Obs: 27.000
Parameter Estimates
===========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------------------
const 2.4563 0.0081 304.18 0.0000 2.4405 2.4722
age_group.36-45 -0.0077 0.0017 -4.4776 0.0000 -0.0111 -0.0043
age_group.46+ 0.0062 0.0018 3.3776 0.0007 0.0026 0.0098
occupation.LIGHT BLUE 0.0053 0.0017 3.1193 0.0018 0.0020 0.0087
occupation.WHITE COLLAR 0.0043 0.0017 2.5084 0.0122 0.0009 0.0076
price -0.0143 8.051e-05 -177.81 0.0000 -0.0145 -0.0142
===========================================================================================
F-test for Poolability: 23.944
P-value: 0.0000
Distribution: F(2,1774)
Included effects: Entity
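As with the GLMs, the FE price coefficient can be read roughly as an elasticity: with normalized recommendations near 1 and a representative price of 100,

$$E \approx -0.0143 \times \frac{100}{1} \approx -1.4,$$

in line with the estimates above.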