Jovian
⭐️
Sign In

Data Analysis with Python: Zero to Pandas - Course Project Guidelines

(remove this cell before submission)

Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project

This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations whererver possible using Markdown cells.

Step 1: Select a real-world dataset
  • Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).

  • The dataset should contain tabular data (rowsn & columns), preferably in CSV/JSON/XLS or other formats that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a desired format.

  • The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.

Step 2: Perform data preparation & cleaning
  • Load the dataset into a data frame using Pandas
  • Explore the number of rows & columns, ranges of values etc.
  • Handle missing, incorrect and invalid data
  • Perform any additional steps (parsing dates, creating additional columns, merging multiple dataset etc.)
Step 3: Perform exploratory Analysis & Visualization
  • Compute the mean, sum, range and other interesting statistics for numeric columns
  • Explore distributions of numeric columns using histograms etc.
  • Explore relationship between columns using scatter plots, bar charts etc.
  • Make a note of interesting insights from the exploratory analysis
Step 4: Ask & answer questions about the data
  • Ask at least 5 interesting questions about your dataset
  • Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
  • Create new columns, merge multiple dataset and perform grouping/aggregation wherever necessary
  • Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does
Step 5: Summarize your inferences & write a conclusion
  • Write a summary of what you've learned from the analysis
  • Include interesting insights and graphs from previous sections
  • Share ideas for future work on the same topic using other relevant datasets
  • Share links to resources you found useful during your analysis
Step 6: Make a submission & share your work
(Optional) Step 7: Write a blog post

Recommended Datasets

Use the following resources for finding interesting datasets:

Example Projects

Refer to these projects for inspiration:

Evaluation Criteria

Your submission will be evaluated using the following criteria:

  • Dataset must contain at least 3 columns and 150 rows of data
  • You must ask and answer at least 5 questions about the dataset
  • Your submission must include at least 5 visualizations (graphs)
  • Your submission must include explanations using markdown cells, apart from the code.
  • Your work must not be plagiarized i.e. copy-pasted for somewhere else.

NOTE: Remove this cell containing the instructions before making your submission. You can do using the "Edit > Delete Cells" menu option.

Analyzing K-12 Student Data for School Year 2021

Year 2020 has been a challenging year. On March 13, 2020, all public schools in California were shut down due to COVID-19. All students ended their 2020 school year in a remote learning setting. Early August, all students had to start school year 2021 remotely still. I work in a school district as a Database Manager in Southern California. Our school district is a K-12 school district and have roughly 50,000 students and 50 schools. I work in Information Technology department and my main role is to maintain and suport data in any information system that school district is using. Because of school closures since March this year, a lot of changes in the systems were made because students can physically go to schools to get instructions. Because of these changes, we were able to collect some new data points this school year to analyze. It is interesting to see what these new data points can tell us.

As a first step, let's upload our Jupyter notebook to Jovian.ml.

In [1]:
project_name = "analyzing-k12-student-data-2021" # change this
In [2]:
!pip install jovian --upgrade -q
In [3]:
import jovian
In [ ]:
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..

Data Preparation and Cleaning

I have the privilage working as a Database Manager in a school district. I have access to a SIS (Student Information System). The first step I am using is to write a SQL script to pull data from SIS. The data I am targeting is:

File name: data-science project-2021-k12.csv

  • Students who are currently enrolled in school district.
  • Since student data is confidential, I didn't include any sensitive student data. The unique identifier for each school and eash student are some unique id that I created just for this project.
  • Fields included in the file are
    • SchoolType
    • EnrollmentType
    • SchoolCode
    • Track
    • Grade
    • GradeLevel
    • StudentID'
    • LaptopCheckedOut
    • LaptopCheckoutDate
    • CompletedPCSurvey
    • PCSurveyLaptop
    • MiFiCheckedOut
    • MiFiCheckoutDate
    • PCSurveyInternet
    • CompletedInstrProg
    • PCSurveyInstrProg
    • TotalStudentInHousehold
    • IsSED
    • IsSPED
    • SPEDProgram
    • IsEL
    • ELPAC
    • IsDLI
    • IsGATE
    • IsGNP
    • IsFoster
    • Gender
    • Ethnicity
    • ParentHighestEdLevel
    • DaysEnrolled
    • DaysPresent
    • PresentRate
    • CountT1P_N_IDF
    • CountT1_N_DF
    • CountQ1P_IDF
    • CountQ1I_DF
    • CountQ2P_IDF
    • CountS1_DF
In [ ]:
import numpy as np
In [ ]:
import pandas as pd
In [ ]:
data_raw_df = pd.read_csv('data-science-project-2021-k12.csv')
In [ ]:
data_raw_df
In [ ]:
data_raw_df.columns
In [ ]:
data_raw_df.info()
In [ ]:
data_raw_df.describe()

Here are some findings based on these integer columns:

  • Total active student count is 51,360.
  • Average number of students in a household in the school district is 1.776. Max number of students in a household is 41.
  • Average student enrolled days in this school year is 32.79. Max is 57. Because there are some elementary schools are year-round schools in our school district and their first day of school was 7/6/2020 so it is resonable students having 57 enrolled days.
In [ ]:
k12_data = data_raw_df.copy()

Add a new column "StudentCount" so it is easier to count student in different category later.

In [ ]:
k12_data['StudentCount'] = 1
In [ ]:
import jovian
In [ ]:
jovian.commit()

Exploratory Analysis and Visualization

TODO

In [ ]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Virtual vs Traditional Setting

Because of COVID-19, all California schools began in a virtual settings. However, during summer time, our school district sent out a survey asking parents' intention of having their students beginning the school year in what settings. There are two selections: Virtual and Traditional.

In [ ]:
k12_data.EnrollmentType.unique()
In [ ]:
virtual_traditional = k12_data.EnrollmentType.value_counts()
virtual_traditional
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Enrollment Type in School Year 2021")
sns.barplot(virtual_traditional.index, virtual_traditional);

Grade Level

In [ ]:
grade_level = k12_data.Grade.value_counts()
grade_level
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Grade Level in School Year 2021")
sns.barplot(grade_level.index, grade_level);

Device Checked Out Status

In [ ]:
device_checked_out = k12_data.LaptopCheckedOut.value_counts()
device_checked_out
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Device Checked Out Status in School Year 2021")
sns.barplot(device_checked_out.index, device_checked_out);

Student Attendance

In [ ]:
plt.figure(figsize=(12, 6))
plt.title("Student Present Rate")
plt.xlabel('Present Rate')
plt.ylabel('Number of Student')

plt.hist(k12_data.PresentRate, bins=np.arange(0,1,0.1), color='blue');
In [ ]:
import jovian
In [ ]:
jovian.commit()

Asking and Answering Questions

TODO

Q:

In [ ]:
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.EnrollmentType, s=100);
In [ ]:
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.LaptopCheckedOut, s=100);
In [ ]:
 

Q: Which combination of Ethnicity and Gender group has the highest Virtual School Rate?

In [ ]:
eth_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_virtual
In [ ]:
eth_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_traditional
In [ ]:
eth_erollmenttype = eth_virtual.merge(eth_traditional, on=["Ethnicity", "Gender"])
eth_erollmenttype
In [ ]:
eth_erollmenttype['VirtualRate'] = eth_erollmenttype.StudentCount_x / (eth_erollmenttype.StudentCount_x + eth_erollmenttype.StudentCount_y)
eth_erollmenttype
In [ ]:
pivot_eth = eth_erollmenttype.pivot("Ethnicity", "Gender", "VirtualRate")
pivot_eth
In [ ]:
plt.title("Which combination of Ethnicity and Gender group has the highest Virtual School Rate?")
sns.heatmap(pivot_eth, annot=True, cmap="Blues");

Q: Which combination of SED and SPED group has the highest Virtual School Rate?

In [ ]:
stugroup_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_virtual
In [ ]:
stugroup_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_traditional
In [ ]:
stugroup_erollmenttype = stugroup_virtual.merge(stugroup_traditional, on=["IsSED", "IsSPED"])
stugroup_erollmenttype
In [ ]:
stugroup_erollmenttype['VirtualRate'] = stugroup_erollmenttype.StudentCount_x / (stugroup_erollmenttype.StudentCount_x + stugroup_erollmenttype.StudentCount_y)
stugroup_erollmenttype
In [ ]:
pivot_stugroup = stugroup_erollmenttype.pivot("IsSED", "IsSPED", "VirtualRate")
pivot_stugroup
In [ ]:
plt.title("Which combination of SED and SPED group has the highest Virtual School Rate?")
sns.heatmap(pivot_stugroup, annot=True, cmap="Blues");

Q: Elementary: Lack of Internet or lack of Laptop affects student performance?

In [ ]:
k6_data = k12_data[k12_data.GradeLevel == "Elementary"]
k6_data
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.EnrollmentType, s=100);
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.Gender, s=100);
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.HasInternet, s=100);
In [ ]:
k6_data
In [ ]:
k6_T1P = k6_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountT1P_N_IDF']].mean()
k6_T1P
In [ ]:
pivot_T1P = k6_T1P.pivot("HasLaptop", "HasInternet", "CountT1P_N_IDF")
pivot_T1P
In [ ]:
plt.title("Elementary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_T1P, annot=True, cmap="Blues");

Q: Secondary

In [ ]:
secondary_data = k12_data[k12_data.GradeLevel != "Elementary"]
secondary_data
In [ ]:
secondary_Q1P = secondary_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountQ1P_IDF']].mean()
secondary_Q1P
In [ ]:
pivot_Q1P = secondary_Q1P.pivot("HasLaptop", "HasInternet", "CountQ1P_IDF")
pivot_Q1P
In [ ]:
plt.title("Secondary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_Q1P, annot=True, cmap="Blues");
In [ ]:
sns.barplot('Ethnicity', 'PresentRate', hue='HasLaptop', data=secondary_data)
plt.xticks(rotation=45);

Q: Checkout Laptop Rate by Ethnicity and SED?

In [ ]:
laptop_yes = k12_data[k12_data.LaptopCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_yes
In [ ]:
laptop_no = k12_data[k12_data.LaptopCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_no
In [ ]:
merge_laptop = laptop_yes.merge(laptop_no, on=["Ethnicity", "IsSED"])
merge_laptop
In [ ]:
merge_laptop['CheckoutLaptopRate'] = merge_laptop.StudentCount_x / (merge_laptop.StudentCount_x + merge_laptop.StudentCount_y)
merge_laptop
In [ ]:
pivot_laptop = merge_laptop.pivot("Ethnicity", "IsSED", "CheckoutLaptopRate")
pivot_laptop
In [ ]:
plt.title("Checkout Laptop Rate by Ethnicity and SED?")
sns.heatmap(pivot_laptop, annot=True, cmap="Blues");

Q: Checkout Hot Spot Rate by Ethnicity and SED?

In [ ]:
mifi_yes = k12_data[k12_data.MiFiCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_yes
In [ ]:
mifi_no = k12_data[k12_data.MiFiCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_no
In [ ]:
merge_mifi = mifi_yes.merge(mifi_no, on=["Ethnicity", "IsSED"])
merge_mifi
In [ ]:
merge_mifi['CheckoutMiFiRate'] = merge_mifi.StudentCount_x / (merge_mifi.StudentCount_x + merge_mifi.StudentCount_y)
merge_mifi
In [ ]:
pivot_mifi = merge_mifi.pivot("Ethnicity", "IsSED", "CheckoutMiFiRate")
pivot_mifi
In [ ]:
plt.title("Checkout Hot Spot Rate by Ethnicity and SED?")
sns.heatmap(pivot_mifi, annot=True, cmap="Blues");
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

Inferences and Conclusion

Note:

  • Merge Filipino into Asian?
  • Remove Non-Binary (too little population)
  • Need to add a column for laptop help and internet help
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

References and Future Work

TODO

In [ ]:
import jovian
In [ ]:
jovian.commit()
In [ ]: