An Intro to Predictive Modeling for Customer Lifetime Value (CLV) -- Tutorial Notebook

This notebook walks through the workflow needed to train a Pareto/NBD model (see, e.g., Schmittlein et al. 1987) on a transactional dataset. An extension of the Pareto/NBD model, the Gamma-Gamma model (Fader et al. 2004), adds predictions of monetary value.

The Pareto/NBD model is a good introductory probabilistic model for the non-contractual setting with continuous purchase opportunities. It is simple, easy to train, and generally produces good results when its assumptions are met. It's a good first shot at CLV modeling!
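
To make "the assumptions behind the model" concrete, here is a minimal simulation sketch. It assumes the standard Pareto/NBD assumptions from the literature rather than anything specific to this notebook: while a customer is active, purchases follow a Poisson process with rate lambda; the unobserved lifetime is exponential with rate mu; and lambda and mu vary across customers as independent Gamma(r, alpha) and Gamma(s, beta) draws. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_pareto_nbd(n_customers, T, r, alpha, s, beta, seed=0):
    """Simulate (x, t_x) summaries under the standard Pareto/NBD assumptions.

    Time 0 is the customer's first purchase, so x counts repeat purchases
    in (0, T] and t_x is the time of the last repeat purchase (0 if none).
    """
    rng = np.random.RandomState(seed)
    # Heterogeneity: each customer gets their own purchase and dropout rates.
    lam = rng.gamma(shape=r, scale=1.0 / alpha, size=n_customers)
    mu = rng.gamma(shape=s, scale=1.0 / beta, size=n_customers)
    # Unobserved lifetime: exponential with rate mu.
    tau = rng.exponential(scale=1.0 / mu)
    # Purchases can only happen while the customer is alive and observed.
    active = np.minimum(tau, T)

    x = np.zeros(n_customers, dtype=int)
    t_x = np.zeros(n_customers)
    for i in range(n_customers):
        # Poisson process: the count is Poisson(lam * active) and, given the
        # count, purchase times are uniform over the active window.
        n = rng.poisson(lam[i] * active[i])
        if n > 0:
            x[i] = n
            t_x[i] = rng.uniform(0.0, active[i], size=n).max()
    return x, t_x


# Illustrative values only; these are not estimates from the CDNOW data.
x, t_x = simulate_pareto_nbd(n_customers=1000, T=78.0,
                             r=0.5, alpha=10.0, s=0.5, beta=10.0)
print("share of customers with no repeat purchase:", (x == 0).mean())
```

Comparing simulated summaries like these with the real data is one informal way to judge whether the assumptions are plausible for a given dataset.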

A few words on the CDNOW Dataset

The CDNOW dataset is widely used in academic papers on CLV models. CDNOW was an online retailer of CDs in the 1990s. The dataset contains the transactional data of a cohort of customers who made their first purchase in the first quarter of 1997. All transactions from these customers between their first purchase and June 1998 are included. The transactional data was downsampled to contain the transactions of 10% of the customer population (2357 customers).

The CDNOW dataset is a good example of a non-contractual setting with a continuous purchasing opportunity. It has been used extensively in the CLV literature.

A Few Warnings/Disclaimers:

In this notebook, I favored code simplicity over performance. Some operations on dataframes and some parts of the Stan code could certainly be optimized.
The sample log-likelihood of the Pareto/NBD model can be derived analytically, and the model parameters can be found via standard optimization routines. In principle, you do not need Stan to fit the Pareto/NBD model. The purpose of the Stan code is to let you extend the model to cases where the sample log-likelihood cannot be expressed analytically and requires MCMC evaluation.
The main focus of this notebook is predicting the purchase count. Implementing the Gamma-Gamma model for the monetary value component, as described in this paper by Fader et al. (2005), is left to you as an exercise. The solution is at the bottom of the notebook.
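
As a small illustration of the second point above, the sketch below evaluates the individual-level Pareto/NBD log-likelihood, that is, the likelihood of one customer's summary (x, t_x, T) given that customer's own purchase rate lambda and dropout rate mu, in the form it is usually stated in the Pareto/NBD literature. The analytic sample log-likelihood additionally integrates lambda and mu over their Gamma distributions and is not reproduced here. The function name is hypothetical, and the Stan code used later in this notebook may be organized differently.

```python
import numpy as np


def individual_log_likelihood(x, t_x, T, lam, mu):
    """Log-likelihood of (x, t_x, T) given individual rates lam and mu.

    Implements, on the log scale,
        L = lam^x * mu / (lam + mu) * exp(-(lam + mu) * t_x)
          + lam^(x + 1) / (lam + mu) * exp(-(lam + mu) * T)
    where x is the number of repeat purchases, t_x the time of the last
    one (0 if x == 0) and T the length of the observation period.
    """
    x = np.asarray(x, dtype=float)
    total = lam + mu
    # The two summands share the factor lam^x / (lam + mu); the remainders
    # are combined with logaddexp for numerical stability.
    term_a = np.log(mu) - total * t_x
    term_b = np.log(lam) - total * T
    return x * np.log(lam) - np.log(total) + np.logaddexp(term_a, term_b)


# Illustrative values only: 2 repeat purchases, the last at week 3 of 10.
print(individual_log_likelihood(x=2, t_x=3.0, T=10.0, lam=0.1, mu=0.05))
```

Because the function is written with NumPy operations, it also accepts arrays of customers, so summing its output gives a conditional log-likelihood for a whole cohort.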

Requirements

This notebook was tested with

ipython version 3.0.0
Mac OS X El Capitan version 10.11.6

We recommend 2-3 GB of RAM.

Please download and place the file cdnow_transaction_log.csv in the same directory as this file.

Training a Pareto/NBD Model on the CDNOW Dataset

# Install the required packages. You can run these commands from the notebook, or
# from a terminal window if you are prompted for a password.

# You may have to use `sudo pip ...` for some or all of the packages below.

!pip install numpy==1.12.0 
!pip install pandas==0.19.2
!pip install scipy==0.18.1
!pip install matplotlib==2.0.0
# pickle is part of the Python standard library, so it does not need to be installed with pip.
!pip install cython==0.23.5
!pip install pystan==2.10.0

# Doing all the necessary imports here 

import os 
import sys 
import pandas as pd
import numpy as np 
import pystan 
import matplotlib.pyplot as plt
import pickle
from datetime import datetime 
from scipy.stats import gaussian_kde
from hashlib import md5
%matplotlib inline 
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 10)

# If you are having issues with matplotlib on Mac OS X, I recommend taking a look
# at this Stack Overflow thread:
# http://stackoverflow.com/questions/21784641/installation-issue-with-matplotlib-python 
# 1) cd ~/.matplotlib
# 2) vim matplotlibrc
# 3) add to the file :  backend: TkAgg
# 4) restart the ipython kernel

Populating the interactive namespace from numpy and matplotlib

Import the dataset into a Pandas DataFrame