Causality modelling in Python for data scientists
Data science is increasingly commonplace in industry and the enterprise. Industrial data scientists have a vast toolbox of descriptive and predictive analyses at their disposal. However, data science tools for decision-making in industry and the enterprise are less well established. Here we survey Python packages that can help industrial data scientists facilitate intelligent decision-making through causality modelling.
- The need for causality modelling
- How to do causality modelling
- Use case: The impact of direct marketing on customer behavior
The need for causality modelling
Intelligent planning and decision-making lie at the heart of most business success.
The decisions our business needs to evaluate range from relatively low-effort ones that we take potentially thousands or millions of times a day to high-effort ones that are taken only every couple of months:
- What will happen if I show an advertising banner to a particular user?
- What will happen if I change the retail prices for certain products in my shop?
- What will happen if I alter my manufacturing process?
- What will happen if I swap out a particular mechanical piece in a vehicle I develop?
- What will happen if I invest in new property, machinery, or processes?
- What will happen if I hire this applicant?
- What will happen if I increase the remuneration of my workforce?
As industrial data scientists we are oftentimes called upon to evaluate these proposed business decisions using analytics, machine learning methodologies, and past data.
What we may end up doing for the above proposed business decisions is:
- Compute and rank past click-through rates for given pairs of ad banner and user,
- Correlate past demand with set retail prices for product groups of interest,
- Correlate past manufacturing parameters with achieved output quality,
- Correlate the mechanical behavior of my vehicles with the mechanical parts used in them,
- Use past data to forecast the development of real estate prices,
- Use past data to correlate and predict the productivity of my team given e.g. its size or makeup, and
- Use past data to correlate productivity and remuneration levels.
The way I formulated these is already pretty suggestive: some of our common approaches to evaluating business decisions do not compare our business outcomes with and without said business decisions; rather, they look at our data outside the context of decision-making.
Put another way, we oftentimes analyze past data without considering the state our business or customer was in when those data were generated. The short simulation after the list below illustrates how this can mislead us.
So when tasked with evaluating the proposed business decisions above, we should instead think in terms of questions akin to the following:
- How would the user of interest behave differently if we didn't show them (and pay for) a banner now?
- For each Euro we shave off a price tag, how much higher will our revenue be because more customers are inclined to place an order?
- ...and so on for each of the remaining decisions above.
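Here is a small, self-contained simulation of the pitfall described above; all variable names and numbers are made up for illustration. A hidden confounder drives both the decision and the outcome, so the naive comparison of past data points in the wrong direction even though the decision's true effect is positive.
import numpy as np
rng = np.random.default_rng(0)
n = 100_000
# Hidden confounder, e.g. customer affluence: affluent customers are less
# likely to receive a discount but spend more regardless.
affluence = rng.normal(size=n)
discount = (rng.normal(size=n) - affluence > 0).astype(float)
# The true effect of granting a discount on revenue is +5 per customer.
revenue = 5.0 * discount + 10.0 * affluence + rng.normal(size=n)
# Naive "analyze past data" answer: compare average revenue with and
# without a discount. The confounder makes the discount look harmful.
naive = revenue[discount == 1].mean() - revenue[discount == 0].mean()
# Adjusted answer: control for the confounder with a linear regression.
X = np.column_stack([np.ones(n), discount, affluence])
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
print(f'naive difference:  {naive:+.2f}')    # strongly negative, misleading
print(f'adjusted estimate: {beta[1]:+.2f}')  # close to the true +5
Of course, in this toy example we could only recover the true effect because we had measured the confounder; which adjustments are valid and sufficient is exactly what causality modelling makes precise.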
How to do causality modelling
The authors Hünermund and Bareinboim (https://arxiv.org/abs/1912.09104) propose a methodology they call the data-fusion process.
The data-fusion process maps out the individual steps necessary for evaluating the impact of past and potential future decisions.
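To make these steps concrete, here is a minimal sketch of such a workflow using the DoWhy library (https://github.com/py-why/dowhy), one of the Python packages built for causality modelling. The simulated data, the column names, and the choice of estimator are illustrative assumptions on my part, not prescriptions from the paper:
import numpy as np
import pandas as pd
from dowhy import CausalModel
# Simulate a toy data set with a single confounder.
rng = np.random.default_rng(0)
confounder = rng.normal(size=1_000)
treatment = (confounder + rng.normal(size=1_000) > 0).astype(int)
outcome = 2.0 * treatment + confounder + rng.normal(size=1_000)
data = pd.DataFrame({'treatment': treatment, 'outcome': outcome, 'confounder': confounder})
# Step 1: encode our causal assumptions in a model.
model = CausalModel(data=data, treatment='treatment', outcome='outcome', common_causes=['confounder'])
# Step 2: check whether the causal effect is identifiable at all.
estimand = model.identify_effect()
# Step 3: estimate the effect, here via a simple backdoor adjustment.
estimate = model.estimate_effect(estimand, method_name='backdoor.linear_regression')
print(estimate.value)  # should be close to the true effect of 2.0
# Step 4: probe how robust the estimate is to violated assumptions.
refutation = model.refute_estimate(estimand, estimate, method_name='placebo_treatment_refuter')
print(refutation)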
Use case: The impact of direct marketing on customer behavior
We'll use a data set provided by UCI (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) that lets us study the potential impact of direct marketing on customer behavior.
Let's dive right in: download the data set and see what we are working with.
!wget --quiet https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
!unzip -oqq bank.zip
import pandas as pd
df = pd.read_csv('bank.csv', delimiter=';')
# Encode the target: 'y' records whether the marketing contact succeeded.
df['success'] = df['y'].map({'no': 0, 'yes': 1})
del df['y']
# Drop 'duration': the call duration is only known after a contact has
# taken place, so it cannot inform the decision to contact a customer.
del df['duration']
# Rename 'campaign' to the more descriptive 'no_contacts' (the number of
# contacts performed during this campaign).
df['no_contacts'] = df['campaign']
del df['campaign']
df.head()
Our tabular marketing and sales data contains a number of features we observe about a given customer and our interaction with them:
- The customer's age, job, marital status, education, current account balance, and whether or not they have taken out a loan are recorded,
- Our direct marketing interaction with a given customer is also recorded, for instance, how often we have contacted them so far.
A more detailed description of the features in our data can be found on the UCI data set page linked above.
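Since we will shortly need to distinguish numerical from categorical features, it is worth a quick look at the column types pandas inferred:
# Columns read in as 'object' hold categorical strings; the rest are numeric.
df.dtypes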
import lightgbm as lgb
from sklearn.preprocessing import OrdinalEncoder
target = 'success'
features = [column for column in df.columns if column != target]
X, y = df[features], df[target]
# Gradient-boosted trees need numeric inputs, so we ordinally encode the
# categorical features and keep the numerical ones as they are.
numerical_features = ['age', 'balance', 'no_contacts', 'previous', 'pdays']
categorical_features = [feature for feature in features if feature not in numerical_features]
encoder = OrdinalEncoder(dtype=int)
X_numeric = pd.concat(
    [
        X[numerical_features],
        pd.DataFrame(
            data=encoder.fit_transform(X[categorical_features]),
            columns=categorical_features,
            index=X.index,  # keep the original row index so concat aligns rows
        ),
    ],
    axis=1
)
X_numeric.head()
# Train a gradient-boosting classifier to predict marketing success.
model = lgb.LGBMClassifier()
model.fit(X_numeric, y)
%matplotlib inline
lgb.plot_importance(model);
There are numerous ways to compute feature importance; the one plotted by default in the LightGBM library measures the number of times a given feature is used as a split in the constructed trees:
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html
In general, feature importance gives us a measure of how well a given measured variable correlates with the target (marketing success in our case).
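For a second perspective, we can pull the raw importance values from the underlying booster and compare the default split counts with the total gain each feature contributed; a small sketch using the model trained above:
# 'split' counts how often a feature is used in the trees, while 'gain'
# sums the loss reduction achieved across all of its splits.
importances = pd.DataFrame({
    'feature': model.booster_.feature_name(),
    'split': model.booster_.feature_importance(importance_type='split'),
    'gain': model.booster_.feature_importance(importance_type='gain'),
}).sort_values('gain', ascending=False)
importances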
The question here is: How can we use our trained success predictor and our feature importances to aid intelligent planning and decision-making in our business?