Exploratory Data Analysis on U.S. Pollution Data


Introduction

Air pollution is one of the most pressing environmental concerns in the world today, affecting both developed and developing regions. In 2016 alone, it accounted for 4.2 million deaths globally and over 59,000 in the U.S., because of its association with stroke, lung cancer, ischemic heart disease, acute lower respiratory disease, and chronic obstructive pulmonary disease. Although the number of deaths in the U.S. trended significantly downward from 1990 to 2016, possibly as a result of stricter regulations, it is still imperative that we monitor these pollutants. Measures such as the air quality index help here: the index translates numerical data into a descriptive rating scale, making it easier for citizens of all ages to understand the level of pollution in the air they breathe.

What is an air quality index?

Pollution often takes the form of particulate matter (PM2.5 and PM10), nitric oxide (NO), nitrogen dioxide (NO2), sulphur dioxide (SO2), carbon monoxide (CO), and ozone (O3). An air quality index is a scale that shows how polluted the air is with these pollutants, along with the health risks associated with each rating.

Think of the AQI as a yardstick that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 or below represents good air quality, while an AQI value over 300 represents hazardous air quality.

For each pollutant, an AQI value of 100 generally corresponds to an ambient air concentration that equals the level of the short-term national ambient air quality standard for protection of public health. AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is unhealthy: at first for certain sensitive groups of people, then for everyone as AQI values get higher.

The AQI is divided into six categories. Each category corresponds to a different level of health concern. Each category also has a specific color. The color makes it easy for people to quickly determine whether air quality is reaching unhealthy levels in their communities.
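To make the six categories concrete, here is a minimal sketch that maps an AQI value to its category and color. The breakpoints and color names follow the EPA's published AQI scale rather than anything defined in this project. The names `AQICategory`, `AQI_CATEGORIES`, and `aqi_category` are hypothetical helpers introduced only for illustration, and treating values above 500 as Hazardous is an assumption.

```python
from typing import NamedTuple


class AQICategory(NamedTuple):
    upper: int   # highest AQI value that still belongs to this category
    label: str   # level of health concern
    color: str   # color associated with the category


# The six AQI categories, ordered from cleanest to most polluted air.
AQI_CATEGORIES = [
    AQICategory(50, "Good", "green"),
    AQICategory(100, "Moderate", "yellow"),
    AQICategory(150, "Unhealthy for Sensitive Groups", "orange"),
    AQICategory(200, "Unhealthy", "red"),
    AQICategory(300, "Very Unhealthy", "purple"),
    AQICategory(500, "Hazardous", "maroon"),
]


def aqi_category(aqi: float) -> AQICategory:
    """Return the category whose range contains the given AQI value."""
    if aqi < 0:
        raise ValueError("AQI values cannot be negative")
    for category in AQI_CATEGORIES:
        if aqi <= category.upper:
            return category
    # Assumption: readings beyond the top of the scale stay Hazardous.
    return AQI_CATEGORIES[-1]


print(aqi_category(42).label)   # Good
print(aqi_category(120).color)  # orange
```

Keeping the breakpoints in a single ordered list means the lookup and any plot legend can share one definition.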


Project Objective

In this project, we will perform exploratory data analysis (EDA) on a dataset documented by the U.S. EPA on pollution caused by four major pollutants. The dataset has a total of 28 fields and over 1.7 million observations. Each of the four pollutants (NO2, SO2, CO, and O3) has 5 specific columns.

Why EDA?

Exploratory Data Analysis (EDA) refers to the process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. EDA helps us understand the data first and gather as many insights from it as possible.

For our analysis, we will look specifically at the air quality index (AQI) of these pollutants in some key states and compare their yearly, monthly, and daily concentrations.
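As a rough illustration of what such comparisons can look like in pandas, the sketch below averages an AQI column by state and year, and by state and month. The rows are made-up sample values, not real measurements, and the column names `State`, `Date Local`, and `NO2 AQI` are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Made-up sample rows; column names are assumed, not confirmed by the text.
df = pd.DataFrame({
    "State": ["California", "California", "Texas", "Texas"],
    "Date Local": ["2015-01-10", "2015-02-10", "2015-01-10", "2016-01-10"],
    "NO2 AQI": [40, 20, 30, 10],
})

# Parse the date strings so that year and month can be extracted.
df["Date Local"] = pd.to_datetime(df["Date Local"])
df["Year"] = df["Date Local"].dt.year
df["Month"] = df["Date Local"].dt.month

# Mean AQI per state and year, and per state and calendar month.
yearly = df.groupby(["State", "Year"])["NO2 AQI"].mean()
monthly = df.groupby(["State", "Month"])["NO2 AQI"].mean()

print(yearly)
print(monthly)
```

The same pattern extends to daily comparisons by grouping on the date itself, and to the other pollutants by swapping the AQI column.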

Source: The data used for this analysis was scraped from the database of the U.S. EPA: Link

Here's an outline of the steps we will follow:

  1. Download the dataset from Kaggle using the opendatasets library
  2. Preprocess and clean the data using the pandas library
  3. Perform exploratory analysis and visualization using Python visualization libraries such as matplotlib, seaborn, plotly and folium
  4. Ask and answer relevant questions using the data to gain deep understanding and make useful inferences
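The cleaning step in the outline above could be sketched roughly as follows. This is one possible approach on a made-up frame, not necessarily the exact preprocessing applied later. The column names and the choice to drop rows with a missing AQI value are assumptions made for illustration.

```python
import pandas as pd

# Made-up raw rows containing one exact duplicate and one missing AQI value.
raw = pd.DataFrame({
    "State": ["Arizona", "Arizona", "Arizona", "Nevada"],
    "Date Local": ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-01"],
    "CO AQI": [12.0, 12.0, None, 7.0],
})

clean = (
    raw.drop_duplicates()            # remove rows that are exact repeats
       .dropna(subset=["CO AQI"])    # assumption: discard rows with no AQI
       .assign(**{"Date Local": lambda d: pd.to_datetime(d["Date Local"])})
       .reset_index(drop=True)       # renumber the remaining rows from 0
)

print(clean)
```

Chaining the steps keeps the raw frame untouched, which makes it easy to compare row counts before and after cleaning.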

How to run the code