Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2017 L.A. Barba, N.C. Clementi

Cheers! Stats with Beers

Welcome to the second module in Engineering Computations, our series in computational thinking for undergraduate engineering students. This module explores practical statistical analysis with Python.

This first lesson explores how we can answer questions using data combined with practical methods from statistics.

We'll need some fun data to work with. We found a neat data set of canned craft beers in the US, scraped from the web and cleaned up by Jean-Nicholas Hould (@NicholasHould on Twitter)—who we want to thank for having a permissive license on his GitHub repository so we can reuse his work!

The data source (@craftcans on Twitter) doesn't say that the set includes all the canned beers brewed in the country. So we have to assume that the data is a sample and may contain biases.

We'll manipulate the data using NumPy—the array library for Python that we learned about in Module 1, lesson 4. But we'll also learn about a new Python library for data analysis called pandas.

pandas is an open-source library providing high-performance, easy-to-use data structures and data-analysis tools. Even though pandas is great for data analysis, we won't exploit all its power in this lesson. But we'll learn more about it later on!

We'll use pandas to read the data file (in CSV format, for comma-separated values), display it in a nice table, and extract the columns that we need, which we'll convert to NumPy arrays to work with.

Let's start by importing the two Python libraries that we need.

import pandas
import numpy
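Before we touch the real data, here is a minimal sketch of the workflow just described: read CSV text with pandas, pick out one column, and convert it to a NumPy array. The tiny data set and its column names (name, abv, ounces) are made up for illustration; they are an assumption, not the contents of the beer data file.

```python
import io

import numpy
import pandas

# Toy CSV text standing in for a real file.
# The rows and column names are invented for this sketch.
csv_text = """name,abv,ounces
Sample Ale,0.05,12.0
Sample Stout,0.07,16.0
Sample Lager,0.045,12.0
"""

# pandas.read_csv accepts a file path or a file-like object,
# so we wrap the text in io.StringIO to mimic a file.
toy = pandas.read_csv(io.StringIO(csv_text))

# Selecting a single column by its label gives a pandas Series.
abv_series = toy['abv']

# to_numpy() returns the column's data as a NumPy array.
abv = abv_series.to_numpy()

print(type(abv_series))
print(type(abv))
print(abv.mean())
```

Once the column is a NumPy array, everything we learned about arrays in Module 1 applies to it.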

Step 1: Read the data file

Below, we'll take a peek into the data file, beers.csv, using the system command head (which we can run from a code cell by prefixing it with a bang, !, thanks to IPython).

But first, we will download the data using urllib.request, a module in the Python standard library for opening URLs. We created a short URL for the data file in the public repository with our course materials.

The cell below should download the data to your current working directory. The next cell shows you the first few lines of the data.

from urllib.request import urlretrieve
URL = 'http://go.gwu.edu/engcomp2data1'
urlretrieve(URL, 'beers.csv')
('beers.csv', <http.client.HTTPMessage at 0x11d88c9e8>)
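Running the download cell again fetches the file again, even if you already have it. The sketch below shows one possible way to skip the download when the file is already there. The helper name download_if_missing and its retrieve parameter are hypothetical, introduced here for illustration only; they are not part of the lesson or of urllib.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filename, retrieve=urlretrieve):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    The retrieve argument (hypothetical) lets us swap in another
    function with the same (url, filename) call signature.
    """
    if os.path.exists(filename):
        # The file is already on disk: nothing to do.
        return False
    retrieve(url, filename)
    return True


# Example call (needs a network connection), with the lesson's short URL:
# download_if_missing('http://go.gwu.edu/engcomp2data1', 'beers.csv')
```

Note that this check only looks for a file with that name. It does not verify that an earlier download finished correctly, so delete the file and re-run if the data looks incomplete.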