
This assignment takes you on a brief journey through frequentist statistics. You will explore

- the *z*-statistic
- the *t*-statistic
- the difference and relationship between the two
- the Central Limit Theorem, its assumptions and consequences
- how to estimate the population mean and standard deviation from a sample
- the concept of a sampling distribution of a test statistic, particularly for the mean
- how to combine these concepts to calculate confidence intervals and p-values
- how those confidence intervals and p-values allow you to perform hypothesis (or A/B) tests

To work through this notebook, you are expected to have an understanding of:

- the idea of a random variable
- what a probability density function (pdf) is
- what the cumulative distribution function (cdf) is
- what the Normal distribution is at a high level

It will be helpful if you are familiar with the concept of a sampling distribution, but this assignment will introduce it and give you hands-on experience. As such, this notebook takes you from a basic understanding of random variables and probability, bridges the gap to applying those ideas in Python, and then moves on to a real-world application.

In the previous notebook, we used only data from a known normal distribution. Now we tackle real data, rather than simulated data, and look at answering some relevant real-world business problems from the data.

You're now in the position of a data analyst working for a hospital. An administrator is working on the hospital's business operations plan and needs you to help them answer some business questions. The next few assignment notebooks are designed to illustrate how each of the inferential statistics methods have their uses for different use cases. In this assignment notebook, you're going to use statistical inference on a data sample to answer the questions:

- has the hospital's revenue stream fallen below a key threshold?
- are patients with insurance really charged different amounts than those without?

Answering that last question with a frequentist approach makes some assumptions, or requires some knowledge, about the two groups. In the following assignment notebook you'll use bootstrapping to test that assumption. And in the final notebook you're going to create a model for simulating *individual* charges (not a sampling distribution) that the hospital can use to model a range of scenarios.

We are going to use some data on medical charges obtained from Kaggle. For the purposes of this exercise, assume the observations are the result of random sampling from our one hospital. Recall in the previous assignment, we introduced the Central Limit Theorem (CLT), and how it tells us that the distributions of sample statistics approach a normal distribution as \(n\) increases. The amazing thing about this is that it applies to the sampling distributions of statistics that have been calculated from even highly non-normal distributions of data. Remember, also, that hypothesis testing is very much based on making inferences about such sample statistics. You're going to rely heavily on the CLT to apply frequentist (parametric) tests to answer the questions in this notebook.
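The CLT claim above is easy to check empirically before touching the real data. The following sketch is illustrative only: the exponential population and all variable names are assumptions for the demonstration, not the notebook's data. It draws repeated samples from a highly skewed distribution and shows the sampling distribution of the mean tightening as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(47)

# A highly skewed (exponential) population -- nothing like a normal distribution.
population = rng.exponential(scale=1.0, size=100_000)

def sample_means(data, n, n_trials=1000, rng=rng):
    """Sampling distribution of the mean for samples of size n."""
    return np.array([rng.choice(data, size=n).mean() for _ in range(n_trials)])

means_small = sample_means(population, n=5)
means_large = sample_means(population, n=500)

# As n grows, the sampling distribution of the mean becomes tighter around the
# population mean, with spread approaching sigma / sqrt(n), despite the skew.
print(np.std(means_small), np.std(means_large))
```

Plotting histograms of `means_small` and `means_large` would also show the larger-sample distribution looking visibly more normal.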

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
from numpy.random import seed
medical = pd.read_csv('data/insurance2.csv')
```

In [2]:

`medical.shape`

Out[2]:

`(1338, 8)`

In [3]:

`medical.head()`

Out[3]:

**Q:** Plot the histogram of charges and calculate the mean and standard deviation. Comment on the appropriateness of these statistics for the data.

**A:**

In [4]:

```
_ = plt.hist(medical.charges, bins=30)
_ = plt.xlabel('Charge amount ($)')
_ = plt.ylabel('Counts')
_ = plt.title('Medical charges')
```

In [5]:

`np.mean(medical.charges), np.std(medical.charges)`

Out[5]:

`(13270.422265141257, 12105.484975561612)`

The distribution of charges is clearly non-normal, but the mean is still a quantity that can be calculated. You can still interpret it as an expected amount (for a given sample size). The standard deviation is clearly not a very useful measure for describing the variability in the distribution of values. But it is still a useful quantity to calculate for performing a frequentist hypothesis test.
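The point about the mean and standard deviation for skewed data can be illustrated with a quick simulation. This is a sketch using a lognormal stand-in for the charges (the parameters are assumptions chosen only to produce a right-skewed shape, not fitted to the real data).

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed "charges" (lognormal), standing in for the real data.
charges = rng.lognormal(mean=9.0, sigma=1.0, size=1338)

mean, median = np.mean(charges), np.median(charges)
std = np.std(charges, ddof=1)

# In a right-skewed distribution the mean sits well above the median,
# and the standard deviation is inflated by the long upper tail.
print(mean, median, std)
```

For describing such a distribution, the median and interquartile range are often more informative; the mean and standard deviation remain essential inputs to the frequentist tests that follow.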

**Q:** The administrator is concerned that the actual average charge has fallen below 12000, threatening the hospital's operational model. On the assumption that these data represent a random sample of charges, how would you justify that these data allow you to answer that question? And what would be the most appropriate frequentist test, of the ones discussed so far, to apply?

**A:** The distribution of individual values is clearly non-normal, so the mean and standard deviation are poor summary statistics for that distribution. However, the hospital is not particularly concerned here with the charges in individual cases, but rather with the charges in aggregate. A metric of real interest to an administrator would be the expected total charge for a given number of cases; in other words, for a given number of patients treated, what charges would we expect to bill? Here, the number of cases is simply a scaling factor and we may equivalently talk about the mean charge. The key value of interest is therefore the mean, and we want to make inferences about it. The CLT tells us that we can expect this statistic to approach a normal distribution with mean \(\mu\) and standard deviation \(\sigma / \sqrt n\), where \(\mu\) and \(\sigma\) are the population mean and standard deviation. We do not, however, know these parameters and must estimate them from our sample. Whilst we can generally trust that this consequence of the CLT holds, we can never know how close we are to achieving it for a given sample size in any particular problem. Because we are estimating \(\sigma\) from the sample, the appropriate test here is the *t*-test.

**Q:** Given the nature of the administrator's concern, what is the appropriate confidence interval in this case: a one-sided or a two-sided interval? Calculate the critical value and the relevant 95% confidence interval for the mean, and comment on whether the administrator should be concerned.

**A:** The administrator is concerned as to whether the average charge had fallen below a particular value. They are not concerned with whether the average charge is higher. Presumably they would not be concerned in that case! The appropriate interval, therefore, is a one-sided interval.

In [6]:

```
n = len(medical.charges)
pop_mean_est = np.mean(medical.charges)
pop_std_est = np.std(medical.charges, ddof=1)
n, pop_mean_est, pop_std_est
```

Out[6]:

`(1338, 13270.422265141257, 12110.011236694001)`

In [7]:

```
t_crit = t.ppf(.05, df=n-1)
t_crit
```

Out[7]:

`-1.6459941145571324`

In [8]:

```
lower_limit = pop_mean_est + t_crit * pop_std_est / np.sqrt(n)
lower_limit
```

Out[8]:

`12725.48718381623`

The lower limit of our one-sided 95% confidence interval is thus around 12725, comfortably above the 12000 level of concern. The administrator is relieved.
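As a cross-check (not part of the original calculation), the same lower bound can be read off scipy's `t.interval`: a two-sided 90% interval puts 5% in each tail, so its lower endpoint equals the one-sided 95% lower confidence bound computed above. The summary values are copied from the outputs above.

```python
import numpy as np
from scipy.stats import t

# Sample summaries from the cells above.
n = 1338
pop_mean_est = 13270.422265141257
pop_std_est = 12110.011236694001

# A two-sided 90% interval leaves 5% in each tail, so its lower endpoint
# matches the one-sided 95% lower bound.
lower, upper = t.interval(0.90, df=n - 1,
                          loc=pop_mean_est,
                          scale=pop_std_est / np.sqrt(n))
print(lower)  # ≈ 12725.49
```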

The administrator then wants to know whether people with insurance really are charged a different amount to those without.

**Q:** State the null and alternative hypothesis here. Use the *t*-test for the difference between means where the pooled standard deviation of the two groups is given by
\begin{equation}
s_p = \sqrt{\frac{(n_0 - 1)s^2_0 + (n_1 - 1)s^2_1}{n_0 + n_1 - 2}}
\end{equation}

and the *t* test statistic is then given by

\begin{equation} t = \frac{\bar{x}_0 - \bar{x}_1}{s_p \sqrt{1/n_0 + 1/n_1}}. \end{equation}

What assumption about the variances of the two groups are we making here?

**A:** The null hypothesis is that the average charge for patients without insurance is the same as that for those with insurance. The alternative hypothesis is that these means are different. The test to use here is the two-sample t-test. We are assuming the two groups have equal variance.

**Q:** Perform this hypothesis test both manually, using the above formulae, and then using the appropriate function from scipy.stats (hint, you're looking for a function to perform a *t*-test on two independent samples). For the manual approach, calculate the value of the test statistic and then its probability (the p-value). Verify you get the same results from both.

**A:**

In [9]:

```
x0 = medical.charges[medical.insuranceclaim == 0]
x1 = medical.charges[medical.insuranceclaim == 1]
n0 = len(x0)
n1 = len(x1)
n0, n1
```

Out[9]:

`(555, 783)`

In [10]:

```
xbar0 = np.mean(x0)
xbar1 = np.mean(x1)
s0 = np.std(x0, ddof=1)
s1 = np.std(x1, ddof=1)
sp = np.sqrt( ((n0 - 1) * s0**2 + (n1 - 1) * s1**2) / ( n0 + n1 - 2) )
t_stat = ( xbar0 - xbar1 ) / ( sp * np.sqrt( 1/n0 + 1/n1 ) )
total_dof = n0 + n1 - 2
p_value = 2 * t.cdf(t_stat, df=total_dof)
t_stat, p_value
```

Out[10]:

`(-11.89329903087671, 4.461230231620972e-31)`

The test statistic is large in magnitude and such a value is extremely unlikely to have occurred by chance under the null hypothesis. We therefore reject the null hypothesis in favour of the alternative that the means of the two groups are different. Now we verify the above using `ttest_ind` from scipy.stats.
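A brief aside on numerical practice (an alternative to the `2 * t.cdf(...)` call above, not a correction): for a two-sided p-value this far out in the tail, the survival function `t.sf` evaluated at |t| is the standard way to avoid the precision loss of computing `1 - cdf`. The values below are copied from the outputs above.

```python
from scipy.stats import t

t_stat = -11.89329903087671   # test statistic from the manual calculation
total_dof = 555 + 783 - 2     # n0 + n1 - 2

# sf(|t|) = 1 - cdf(|t|), computed without catastrophic cancellation,
# so the tiny tail probability is not rounded away.
p_value = 2 * t.sf(abs(t_stat), df=total_dof)
print(p_value)
```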

In [11]:

`from scipy.stats import ttest_ind`

In [12]:

`ttest_ind(x0, x1)`

Out[12]:

`Ttest_indResult(statistic=-11.893299030876712, pvalue=4.461230231620717e-31)`

Congratulations! Hopefully you got the same numerical results from both approaches. This confirms that you calculated the numbers correctly by hand, and that you identified the correct function, which is much easier to use: all you need to do is pass it your data.

**Q:** In the above calculations, we assumed the sample variances were equal. We may well suspect they are not (we'll explore this in another assignment). The calculation becomes a little more complicated to do by hand in this case, but we now know of a helpful function. Check the documentation for the function to tell it not to assume equal variances and perform the test again.

**A:**

In [13]:

`ttest_ind(x0, x1, equal_var=False)`

Out[13]:

`Ttest_indResult(statistic=-13.298031957975649, pvalue=1.1105103216309125e-37)`

We get somewhat different values now, but the conclusion is the same: we reject the null hypothesis.

**Q:** Conceptual question: look through the documentation for statistical test functions in scipy.stats. You'll see the above *t*-test for a sample, but can you see an equivalent one for performing a *z*-test from a sample? Comment on your answer.

**A:** There is no equivalent function in scipy.stats for performing a *z*-test on a sample. The *z*-test is applicable when we already know the population parameters (in particular the population standard deviation), so we would not be estimating them from a sample.
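Although scipy.stats offers no ready-made *z*-test, one is straightforward to code by hand for the rare case where the population standard deviation really is known. The helper below is a hypothetical sketch with toy numbers, not part of the assignment's analysis.

```python
import numpy as np
from scipy.stats import norm

def z_test(sample_mean, pop_mean, pop_std, n):
    """One-sample z-test: valid only when the population std is known."""
    z = (sample_mean - pop_mean) / (pop_std / np.sqrt(n))
    p = 2 * norm.sf(abs(z))  # two-sided p-value from the standard normal
    return z, p

# Toy numbers: is a sample mean of 102 consistent with mu=100, sigma=15, n=50?
z, p = z_test(102, 100, 15, 50)
print(z, p)
```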

Having completed this project notebook, you have good hands-on experience of

- how you can use the central limit theorem to help you apply frequentist techniques to answer questions that pertain to very non-normally distributed data from the real world
- how to then perform inference using such data to answer business questions
- forming a hypothesis and framing the null and alternative hypotheses
- testing this using a *t*-test