Sign In

Inferential Statistics III - Bayesian


In the last two subunits, you've encountered two schools for performing inference from samples. The Frequentist school calls upon a body of theory established over the past couple of centuries or so. Under certain assumptions and conditions, this allows us to calculate what we would expect to see if an experiment were to be repeated again and again and again. The expected value of the average of a sample is one such statistic we can calculate a result for, even if the originating distribution is far from normal. The bootstrap school, on the other hand, literally does (virtually) run that experiment again and again and again and empirically observes the multitude of outcomes. It then empirically calculates a statistic of interest. While this can be for exactly the same statistics that frequentism calculates (e.g. the mean of a sample) this empirical approach can also perform inference on statistics that do not have well known sampling distributions. Because of the requirement to repeat many, many redraws (with replacement) from the sample, this approach only became feasible with modern computing power.

And thus we come to the Bayesian school of inference. Here we frame our probabilities not so much in terms of "how many times would I expect this event to occur if the experiment were to be rerun many times" but rather in terms of "what is my belief in the likelihood of this event occurring?" In a Bayesian probabilistic programming context, we can build models for systems and then let the data tell us how likely certain values for our model parameters are. This can be a very useful way to incorporate prior knowledge and deal with limited data. It can just be more than a little fiddly to produce a good model!

Medical charge data set

For the final mini-project of the stats unit, you'll once again return tot he medical charge data you've used for the other mini-projects. Previously, we considered whether we believed that the actual average(non-insured) charge had fallen below a certain threshold.

The hospital is now reviewing its financial resiliency plan, which requires a model for revenue under a range of conditions that include the number of patients treated. Its current model is based on a confidence interval for the mean, and scaling that by different numbers of patients for each scenario. This approach has a number of limitations, most acutely the breakdown of the central limit theorem for low patient volumes; the current model does not do a good job of reflecting the variability in revenue you would see as the number of cases drops. A bootstrap approach would return samples of the same size as the original. Taking subsamples would restrict the sampling to the values already present in the original sample and would not do a good job of representing the actual variability you might see. What is needed is a better model of individual charges.

So the problem here is that we want to model the distribution of individual charges and we also really want to be able to capture our uncertainty about that distribution so we can better capture the range of values we might see. This naturally leads us to a powerful, probabilistic approach — we'll use the pymc3 library to perform Bayesian inference.

Loading the data and performing an initial view

In [11]:
import pymc3 as pm
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
from scipy.stats import gamma
from IPython.core.pylabtools import figsize
import seaborn as sns
# there has been some incompatibilty between theano and numpy, if you encounter
# an error with the latest packages from anaconda, then the included
# package-list-txt should allow you to create a conda environment with compatible
# packages.
In [12]:
medical = pd.read_csv('data/insurance2.csv')
In [13]:
In [14]:
insurance = medical.charges[medical.insuranceclaim == 1]
no_insurance = medical.charges[medical.insuranceclaim == 0]
n_ins = len(insurance)
n_no_ins = len(no_insurance)
In [15]:
_ = plt.hist(insurance, bins=30, alpha=0.5, label='insurance claim')
_ = plt.hist(no_insurance, bins=30, alpha=0.5, label='no insurance claim')
_ = plt.xlabel('Charge amount')
_ = plt.ylabel('Frequency')
_ = plt.legend()
Notebook Image

We may suspect from the above that there is some sort of exponential-like distribution at play here. The charges that were not insurance claims seem most like this. The insurance claim charges may possibly be multimodal. The gamma distribution may be applicable and we could test this for the distribution of charges that weren't insurance claims first. Developing our new method for the easiest looking case first is a common and sound approach that can demonstrate a minimum viable solution/product and get, or keep, stakeholders on board.

Initial parameter estimation

An initial guess for the gamma distribution's \(\alpha\) and \(\beta\) parameters can be made as described here.

In [16]:
alpha_est = np.mean(no_insurance)**2 / np.var(no_insurance)
beta_est = np.var(no_insurance) / np.mean(no_insurance)
alpha_est, beta_est
(1.8759059725250895, 4702.486170152818)

Initial simulation

Let's draw the same number of random variates from this distribution and compare to our observed data.

In [17]:
no_ins_model_rvs = gamma(alpha_est, scale=beta_est).rvs(n_no_ins)
In [18]:
_ = plt.hist(no_ins_model_rvs, bins=30, alpha=0.5, label='simulated')
_ = plt.hist(no_insurance, bins=30, alpha=0.5, label='observed')
_ = plt.xlabel('Charge amount')
_ = plt.ylabel('Frequency')
_ = plt.legend()
Notebook Image

Well it doesn't look too bad! We're not a million miles off. But can we do better? We have a plausible form for the distribution of charge amounts and potential values for that distribution's parameters so we can already draw random variates from that distribution to perform simulations. But we don't know if we have a best estimate for the population parameters, and we also only have a single estimate each for \(\alpha\) and \(\beta\); we aren't capturing our uncertainty in their values. Can we take a Bayesian inference approach to estimate the parameters?

Creating a PyMC3 model

In [19]:
# PyMC3 Gamma seems to use rate = 1/beta
rate_est = 1/beta_est
# Initial parameter estimates we'll use below
alpha_est, rate_est
(1.8759059725250895, 0.00021265346963636103)

Q: You are now going to create your own PyMC3 model!

  1. Use an exponential prior for alpha. Call this stochastic variable alpha_.
  2. Similarly, use an exponential prior for the rate (\(1/\beta\)) parameter in PyMC3's Gamma. Call this stochastic variable rate_ (but it will be supplied as pm.Gamma's beta parameter). Hint: to set up a prior with an exponential distribution for \(x\) where you have an initial estimate for \(x\) of \(x_0\), use a scale parameter of \(1/x_0\).
  3. Create your Gamma distribution with your alpha_ and rate_ stochastic variables and the observed data.
  4. Perform 10000 draws.

Hint: you may find it helpful to work backwards. Start with your pm.Gamma, and note the required stochastic variables alpha and beta. Then, before that, you need to create those stochastic variables using pm.Exponential and the correct parameters.


In [20]:
with pm.Model() as model:
    alpha_ = pm.Exponential('alpha',alpha_est)
    gamma_ = pm.Gamma('gamma',alpha=alpha_est,beta=1/rate_est)
    rate_ = pm.Exponential('rate',gamma_)
    obs = pm.Gamma('obs',alpha=alpha_,beta=1/rate_,observed=no_insurance)
    trace = pm.sample(10000)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (2 chains in 2 jobs) NUTS: [rate, gamma, alpha] Sampling 2 chains: 100%|██████████████████████████████████████████████████████| 21000/21000 [26:03<00:00, 8.25draws/s]

If you get a warning about acceptance probability not matching the target, and that it's around 0.88 when it should be close to 0.8, don't worry. We encourage you to read up on this and see if you can adjust the parameters and/or arguments to pm.sample, but if your model runs without any additional warnings or errors then you should be doing great!

Q: Explore your posteriors for \(\alpha\) and \(\beta\) (from the trace).

  • Calculate the 95% credible interval for \(\alpha\) and \(\beta\).
  • Plot your posterior values of \(\alpha\) and \(\beta\) (both line plots and histograms).
  • Mark your CIs on the histograms.
  • Do they look okay? What would bad plots look like?


In [21]:
stats_data  = pm.stats.summary(trace)
c:\users\kavin\appdata\local\programs\python\python37\lib\site-packages\pymc3\ FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality. axis=1, join_axes=[dforg.index])
In [22]:
alpha_interval = [stats_data.loc['alpha','hpd_2.5'],stats_data.loc['alpha','hpd_97.5']]
beta_interval = [stats_data.loc['rate','hpd_2.5'],stats_data.loc['rate','hpd_97.5']]
print(f'Alpha 95% CI :{alpha_interval}')
print(f'Beta 95% CI :{beta_interval}')
Alpha 95% CI :[1.9803353509037076, 2.4583310388939816] Beta 95% CI :[3514.2715352248388, 4488.189293651638]
In [23]:
figsize(12.5, 4)
plt.title("Posterior distribution of alpha")
plt.axvline(alpha_interval[0],linestyle="--", label="lower interval 95% CI",color='red')
plt.axvline(alpha_interval[1],linestyle="--", label="higher interval 95% CI",color='green')
plt.hist(trace['alpha'], bins=25, histtype="step", normed=True)
c:\users\kavin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\axes\ MatplotlibDeprecationWarning: The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead. alternative="'density'", removal="3.1")
Notebook Image
In [24]:
figsize(12.5, 4)
plt.title("Posterior distribution of beta")
plt.axvline(beta_interval[0],linestyle="--", label="lower interval 95% CI",color='red')
plt.axvline(beta_interval[1],linestyle="--", label="higher interval 95% CI",color='green')
plt.hist(trace['rate'], bins=25, histtype="step", normed=True)
Notebook Image
In [ ]:

Q: Play around with some of the built-in diagnostic plots for your model. We suggest at least checking out the traceplot for alpha and beta. How do they look?


In [ ]:
In [ ]:

Q: Take your best shot at a new simulated sequence of medical charges using scipy.stat's gamma distribution. Don't forget the difference between functions that take \(\beta\) and functions that use \(1/\beta\) for the scale parameter. Simulate a data set the same size as the number of observations in the data and overlay the two histograms (simulated and observed).


In [27]:
best_shot_simulated = None
gamma_dist = gamma.rvs(a=alpha_est,scale=1/rate_est,size=555)
plt.hist(no_insurance,alpha=0.5,label='Observed data')
plt.hist(gamma_dist,alpha=0.5,label='Simulated data')
Notebook Image

In this exercise, we have postulated a distribution to describe the individual charge amounts for non-insured cases. This distribution has two required parameters, which we do not know, but we used PyMC3 to perform Bayesian inference to find our level of "belief" in a range of values for them. We then used the average parameter values to create one simulated data set of the same size as the original, but the distribution of our posteriors for these parameters will allow us to perform simulations of any sample size we desire and for a range of scenarios of different \(\alpha\) and \(\beta\). This could be a powerful tool to model different financial conditions for the hospital.

Well done making it through this tricky subject. Starting think Bayesian and starting to get to grips with something like PyMC3 is no easy task. As a data scientist, the most important thing is to be aware that this statistical approach exists, though you may not actually use this approach as much as you use the other approaches you've learned about. Still, we encourage you to think of ways that this approach could apply to the work that you do in this course and throughout your career.