
Data Analysis with Python: Zero to Pandas - Course Project Guidelines

(remove this cell before submission)

Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project

This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.

Step 1: Select a real-world dataset

  • Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).

  • The dataset should contain tabular data (rows & columns), preferably in CSV/JSON/XLS or another format that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a desired format.

  • The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.

Step 2: Perform data preparation & cleaning

  • Load the dataset into a data frame using Pandas
  • Explore the number of rows & columns, ranges of values etc.
  • Handle missing, incorrect and invalid data
  • Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets, etc.)
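The preparation steps above can be sketched with Pandas as follows. The column names and values here are made up for illustration; your dataset's columns will differ:

```python
import numpy as np
import pandas as pd

# Toy data standing in for a downloaded CSV (columns are illustrative)
raw = pd.DataFrame({
    "country": ["India", "Brazil", None, "India"],
    "began": ["2001-06-15", "2002-01-03", "2003-07-20", "2004-08-01"],
    "dead": [10, np.nan, 5, -1],   # a missing value and an invalid negative
})

raw["began"] = pd.to_datetime(raw["began"])    # parse dates
raw["year"] = raw["began"].dt.year             # create an additional column

clean = raw.dropna(subset=["country"]).copy()  # drop rows missing a key field
clean["dead"] = clean["dead"].fillna(0)        # fill remaining gaps
clean = clean[clean["dead"] >= 0]              # remove invalid values

print(clean.shape)                             # rows & columns after cleaning
```

For a real project you would start from `pd.read_csv("your-file.csv")` instead of constructing the data frame by hand.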

Step 3: Perform exploratory analysis & visualization

  • Compute the mean, sum, range and other interesting statistics for numeric columns
  • Explore distributions of numeric columns using histograms etc.
  • Explore relationship between columns using scatter plots, bar charts etc.
  • Make a note of interesting insights from the exploratory analysis
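A minimal sketch of these steps with Matplotlib, again using toy numbers in place of real flood-event columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy numeric columns (illustrative)
df = pd.DataFrame({"dead": [3, 10, 0, 25, 7],
                   "displaced": [100, 4000, 0, 9000, 1200]})

# Summary statistics for a numeric column
print("mean:", df["dead"].mean())
print("sum:", df["dead"].sum())
print("range:", df["dead"].max() - df["dead"].min())

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["dead"], bins=5)              # distribution of one column
axes[0].set_title("Deaths per flood event")
axes[1].scatter(df["dead"], df["displaced"])  # relationship between two columns
axes[1].set_xlabel("dead")
axes[1].set_ylabel("displaced")
plt.tight_layout()
```

Inside Jupyter, `%matplotlib inline` already renders figures, so the `matplotlib.use("Agg")` line can be dropped.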

Step 4: Ask & answer questions about the data

  • Ask at least 5 interesting questions about your dataset
  • Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
  • Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
  • Wherever you're using a library function from Pandas/Numpy/Matplotlib etc., explain briefly what it does
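For example, a question like "which country suffered the most flood deaths?" can be answered with grouping and aggregation (toy data below; real column names will differ):

```python
import pandas as pd

# Toy events table (illustrative)
events = pd.DataFrame({
    "country": ["India", "India", "Brazil", "China", "China"],
    "dead": [10, 5, 2, 20, 1],
})

# groupby() splits the rows by country; sum() aggregates each group;
# sort_values() orders countries by total death toll
totals = events.groupby("country")["dead"].sum().sort_values(ascending=False)
print(totals)
```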

Step 5: Summarize your inferences & write a conclusion

  • Write a summary of what you've learned from the analysis
  • Include interesting insights and graphs from previous sections
  • Share ideas for future work on the same topic using other relevant datasets
  • Share links to resources you found useful during your analysis

Step 6: Make a submission & share your work

(Optional) Step 7: Write a blog post

  • A blog post is a great way to present and showcase your work.
  • Sign up on Medium.com to write a blog post for your project.
  • Copy over the explanations from your Jupyter notebook into your blog post, and embed code cells & outputs
  • Check out the Jovian.ml Medium publication for inspiration: https://medium.com/jovianml

Recommended Datasets

Use the following resources for finding interesting datasets:

Example Projects

Refer to these projects for inspiration:

Evaluation Criteria

Your submission will be evaluated using the following criteria:

  • Dataset must contain at least 3 columns and 150 rows of data
  • You must ask and answer at least 5 questions about the dataset
  • Your submission must include at least 5 visualizations (graphs)
  • Your submission must include explanations using markdown cells, apart from the code.
  • Your work must not be plagiarized, i.e., copy-pasted from somewhere else.

NOTE: Remove this cell containing the instructions before making your submission. You can do this using the "Edit > Delete Cells" menu option.

Flood Archive Data

Floods are among the most common, yet most devastating, natural hazards, affecting most parts of the world. The severity of a flood event varies and depends on the damage it causes, to both livelihoods and property. This project uses the Global Active Archive of Large Flood Events (retrieved from https://data.humdata.org/dataset/global-active-archive-of-large-flood-events) to infer insights regarding the occurrence and severity of flood events globally, including how vulnerable different regions are to flooding.

This dataset contains an active archive of flood event records from 1985 to present. Details such as the country affected, the number of people killed, the number of people displaced, the cost of damages, and a measure of the magnitude of the flood are included for each flood event. The archive is updated on an ongoing basis and new flood events are added as they occur. The information presented in this archive is derived from news, governmental, instrumental, and remote sensing sources.

Caveats - Each entry in the table and related shape file represents a discrete flood event. The listing is comprehensive and global in scope. Deaths and damage estimates for tropical storms are totals from all causes, but tropical storms without significant river flooding are not included.

The statistics presented in the Dartmouth Flood Observatory Global Archive of Large Flood Events are derived from a wide variety of news and governmental sources. The quality and quantity of information available about a particular flood is not always in proportion to its actual magnitude, and the intensity of news coverage varies from nation to nation. In general, news from floods in low-tech countries tends to arrive later and to be less detailed than information from 'first world' countries. Here are some category-specific notes to be aware of when you are using our data:

DFO# - An archive number is assigned to any flood that appears to be "large", with, for example, significant damage to structures or agriculture, long (decades) reported intervals since the last similar event, and/or fatalities.

GLIDE# - GLobal IDEntifier Number. A globally common Unique ID code for disasters.

Country - Primary country of flooding. Other affected countries are listed in three separate fields to the right of the main Country column.

Locations - Includes names of the states, provinces, counties, towns, and cities.

Rivers - Names of rivers.

Begin - Ended - Occasionally there is no specific beginning date mentioned in news reports, only a month; in that case the DFO date will be the middle of that month. Ending dates are often harder to determine - sometimes the news will note when the floods start to recede. We make an estimate based on a qualitative judgement concerning the flood event.

Duration - Derived from start and end dates.

Known Dead - News reports are usually specific about this, but occasionally there is only mention of 'hundreds' or 'scores' killed; in this case we estimate as follows: "hundreds" = 300; "scores" = 30; "more than a hundred" = 110 (number given plus 10%). If there is information on the number of people 'missing', the DFO does not include them in the total of deaths. We require an exact number for analytical purposes, but caution that our numbers are never more than estimates.
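The conventions above can be expressed as a small helper function. This is only a sketch of the stated rules; the function name and input format are hypothetical, not part of the dataset:

```python
def estimate_dead(report: str) -> int:
    """Convert vague news wording into a numeric estimate using the
    DFO conventions described above (illustrative helper)."""
    if report == "hundreds":
        return 300
    if report == "scores":
        return 30
    if report.startswith("more than "):
        number = int(report.rsplit(" ", 1)[-1])
        return round(number * 1.1)   # number given plus 10%
    return int(report)               # an exact figure was reported

print(estimate_dead("more than 100"))
```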

Number Displaced - This number is sometimes the total number of people left homeless after the incident, and sometimes it is the number evacuated during the flood. News reports will often mention a number of people that are 'affected', but we do not use this. If the only information is the number of houses destroyed or damaged, then DFO assumes that 4 people live in each house. If the news report only mentions that "thousands were evacuated", the number is estimated at 3000. If the news reports mention that "more than 10,000" were displaced then the DFO number is 11,000 (number plus 10%). If the only information is the number of families left homeless, then DFO assumes that there are 4 people in each family.
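Similarly, the displacement conventions can be sketched as a helper; the dictionary-based input format is invented for illustration:

```python
def estimate_displaced(report: dict) -> int:
    """Apply the DFO displacement conventions described above
    (illustrative helper, not part of the dataset)."""
    if "houses" in report:
        return report["houses"] * 4              # 4 people assumed per house
    if "families" in report:
        return report["families"] * 4            # 4 people assumed per family
    if report.get("phrase") == "thousands evacuated":
        return 3000
    if "more_than" in report:
        return round(report["more_than"] * 1.1)  # number plus 10%
    return report.get("count", 0)

print(estimate_displaced({"more_than": 10_000}))
```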

Damage (US $) - This number is never more than an estimate, and we apply no independent criteria to verify it. Instead we accept the latest and apparently most accurate number available across the relevant sources.

Main Cause - One of eleven main causes is selected: Heavy rain, Tropical cyclone, Extra-tropical cyclone, Monsoonal rain, Snowmelt, Rain and snowmelt, Ice jam/break-up, Dam/Levee break or release, Brief torrential rain, Tidal surge, Avalanche related. Information about secondary causes is in the Notes and Comments section of the table.

Severity Class - Assessment is on a 1-2 scale, dividing floods into three classes. Class 1: large flood events: significant damage to structures or agriculture; fatalities; and/or a 1-2 decade reported interval since the last similar event. Class 1.5: very large events: with a greater than 2-decade but less than 100-year estimated recurrence interval, and/or a local recurrence interval of 1-2 decades and affecting a large geographic region (> 5000 sq. km). Class 2: extreme events: with an estimated recurrence interval greater than 100 years.

Geographic Flood Extents (sq km) - This is derived from our global map of news detected floods. Polygons representing the areas affected by flooding are drawn in a GIS program based upon information acquired from news sources. Note: These are not actual flooded areas but rather the extent of geographic regions affected by flooding.

Magnitude (M) - Flood Magnitude = LOG(Duration x Severity x Affected Area)
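Assuming LOG denotes the base-10 logarithm (an assumption; the archive documentation does not state the base), the magnitude can be computed as:

```python
import math

def flood_magnitude(duration_days, severity, affected_area_sqkm):
    """Flood magnitude M = LOG(duration x severity x affected area);
    the base-10 logarithm is assumed here."""
    return math.log10(duration_days * severity * affected_area_sqkm)

# e.g. a 10-day Class-2 flood affecting 5000 sq km
print(flood_magnitude(10, 2.0, 5000))
```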

project_name = "insights-from-global-flood-archive-data"

# Install helper libraries: jovian for saving/submitting the notebook,
# xlrd so Pandas can read Excel files
!pip install jovian --upgrade -q
!pip install xlrd --upgrade -q

import jovian
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Render Matplotlib plots inline in the notebook
%matplotlib inline