Year 2020 has been a challenging year. On March 13, 2020, all public schools in California were shut down due to COVID-19. All students ended their 2020 school year in a remote learning setting. Early August, all students had to start school year 2021 remotely still. I work in a school district as a Database Manager in Southern California. Our school district is a K-12 school district and have roughly 50,000 students and 50 schools. I work in Information Technology department and my main role is to maintain and suport data in any information system that school district is using. Because of school closures since March this year, a lot of changes in the systems were made because students can physically go to schools to get instructions. Because of these changes, we were able to collect some new data points this school year to analyze. It is interesting to see what these new data points can tell us.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "analyzing-k12-student-data-2021" # change this
!pip install jovian --upgrade -q
import jovian
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "yinyinw/analyzing-k12-student-data-2021" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/yinyinw/analyzing-k12-student-data-2021
Data source will be from our SIS (Student Information System). Our SIS is using Microsoft SQL Server as database server so I wrote a T-SQL script to pull student data from database server. Since student data is confidential, I didn't include any sensitive student data. I also replaced unique identififer for each school and student in SIS with some unique ID that I created just for this project.
Here is some information of data file I am using:
import numpy as np
import pandas as pd
data_raw_df = pd.read_csv('data-science-project-2021-k12.csv')
data_raw_df
There are 34 columns and 51,354 rows in this data file. That means there are currently 51,354 students enrolled in our school district.
data_raw_df.columns
Index(['SchoolType', 'EnrollmentType', 'SchoolCode', 'Track', 'Grade',
'GradeLevel', 'StudentID', 'HasLaptop', 'HasInternet',
'LaptopCheckedOut', 'PCSurveyLaptop', 'MiFiCheckedOut',
'PCSurveyInternet', 'PCSurveyInstrProg', 'TotalStudentInHousehold',
'IsSED', 'IsSPED', 'SPEDProgram', 'IsEL', 'ELPAC', 'IsDLI', 'IsGATE',
'IsGNP', 'IsFoster', 'Gender', 'Ethnicity', 'ParentHighestEdLevel',
'DaysEnrolled', 'DaysPresent', 'PresentRate', 'CountT1P_N_IDF',
'CountT1_N_DF', 'CountQ1P_IDF', 'CountQ1I_DF'],
dtype='object')
New data points since school year 2021 are EnrollmentType, HasLaptop, HasInternet, LaptopCheckedOut, PCSurveyLaptop, MiFiCheckedOut, PCSurveyInternet, PCSurveyInstrProg.
data_raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51354 entries, 0 to 51353
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SchoolType 51354 non-null object
1 EnrollmentType 51354 non-null object
2 SchoolCode 51354 non-null int64
3 Track 51354 non-null object
4 Grade 51354 non-null object
5 GradeLevel 51354 non-null object
6 StudentID 51354 non-null int64
7 HasLaptop 51354 non-null object
8 HasInternet 51354 non-null object
9 LaptopCheckedOut 51354 non-null object
10 PCSurveyLaptop 51354 non-null object
11 MiFiCheckedOut 51354 non-null object
12 PCSurveyInternet 51354 non-null object
13 PCSurveyInstrProg 51354 non-null object
14 TotalStudentInHousehold 51354 non-null int64
15 IsSED 51354 non-null object
16 IsSPED 51354 non-null object
17 SPEDProgram 6910 non-null object
18 IsEL 51354 non-null object
19 ELPAC 5672 non-null float64
20 IsDLI 51354 non-null object
21 IsGATE 51354 non-null object
22 IsGNP 51354 non-null object
23 IsFoster 51354 non-null object
24 Gender 51354 non-null object
25 Ethnicity 51354 non-null object
26 ParentHighestEdLevel 51354 non-null object
27 DaysEnrolled 51354 non-null float64
28 DaysPresent 51354 non-null float64
29 PresentRate 51354 non-null float64
30 CountT1P_N_IDF 51354 non-null int64
31 CountT1_N_DF 51354 non-null int64
32 CountQ1P_IDF 51354 non-null int64
33 CountQ1I_DF 51354 non-null int64
dtypes: float64(4), int64(7), object(23)
memory usage: 13.3+ MB
data_raw_df.describe()
It looks like max number of students in a household is 41. Not sure if this is bad data since 41 students in a household seems high. Let's find out.
data_raw_df[data_raw_df.TotalStudentInHousehold == 41]
Since the data file was created without any confidential student information, there is no way to identify why there is a household with 41 students so I checked SIS directly. It turns out that there are 41 students with no Household ID in SIS (bad data). I need to contact school sites to have them to correct the data and also I need to change their "TotalStudentInHousehold" to 0.
k12_data = data_raw_df.copy()
Always make a copy of raw data before change data.
k12_data.loc[k12_data.TotalStudentInHousehold == 41, 'TotalStudentInHousehold'] = 0
k12_data.describe()
Now max number of students in a household seems making more sense.
Here are some findings based on these integer columns:
k12_data['StudentCount'] = 1
Add a new column "StudentCount" so it is easier to count student in different category later.
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
This section is to explore data points in the data file using visualization. There are some existing data points that I list here for people to understand our school district's basic demographic. New data points that we collected in school year 2021 due to COVID-19 school closures are listed here for exploring the possible analysis.
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's check how grade level distribution looks like in our school district.
grade_level = k12_data.Grade.value_counts()
grade_level
10 4350
09 4331
11 4298
12 4167
08 4084
07 4015
05 3832
06 3822
04 3813
03 3609
02 3552
01 3474
KN 3035
KA 538
PS 303
TR 130
IN 1
Name: Grade, dtype: int64
plt.figure(figsize=(10,6))
plt.title("Student Grade Level in School Year 2021")
sns.barplot(grade_level.index, grade_level);
In general, K-12 student counts are evenly distributed with high and intermediate school students (7-12) slightly more than elementary students (K-12). Some other grade levels that are not K-12 are for some Special Education students that receive our services so they have relatively low counts.
How about ethnicty distribution? Let's find out.
ethnicity_count = k12_data.Ethnicity.value_counts()
ethnicity_count
Hispanic 27536
White 11953
Asian 7758
Black or African American 3284
Other 823
Name: Ethnicity, dtype: int64
plt.figure(figsize=(10,6))
plt.title("Student Ethnicity in School Year 2021")
plt.xticks(rotation=45)
sns.barplot(ethnicity_count.index, ethnicity_count);
Hispanic ethnicity is a clearly winner here followed by White and Asian.
plt.pie(ethnicity_count, labels=ethnicity_count.index, autopct='%1.1f%%', startangle=180);
Same data but differnt chart here to show percentage of each ethnicity using pie chart. A little bit more than half of our studens are hispanic.
Because of COVID-19, all California schools began school year 2021 in a remote learning (virtual) setting. During summer time, our school district sent out a survey asking parents' preferred learning models (instruction programs) for their students. There are two selections: Virtual and Traditional. Schools schedule students into differnt enrollment types based on their survey answers. For those parents who didn't answer the survey, default enrollment type is Traditional.
k12_data.EnrollmentType.unique()
array(['Traditional', 'Virtual'], dtype=object)
virtual_traditional = k12_data.EnrollmentType.value_counts()
virtual_traditional
Traditional 38678
Virtual 12676
Name: EnrollmentType, dtype: int64
plt.figure(figsize=(10,6))
plt.title("Student Enrollment Type in School Year 2021")
plt.pie(virtual_traditional, labels=virtual_traditional.index, autopct='%1.1f%%', startangle=180);
Based on this pie chart, about a quarter of students are enrolled in Virtual settings in 2021.
Because of school closures, students can't physically go to school to receive instructions. Students now have to rely on technologies to attend online meetings with teachers or do their homework. In order to support some families who can't afford these technologies, school district has budget to purchase Chromebooks and hot spots for students to check out to use at home. Parents can make requests for Chromebook or/and hot spot on parent portal. Information Technology department gathers these requests and notifies parents to come to district office to pick up devices when inventory allows. Let's check how many devices have been checked out since March, 2020.
device_checked_out = k12_data.LaptopCheckedOut.value_counts()
device_checked_out
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-f4f131803c10> in <module>
----> 1 device_checked_out = k12_data.LaptopCheckedOut.value_counts()
2 device_checked_out
NameError: name 'k12_data' is not defined
plt.figure(figsize=(10,6))
plt.title("Student Device Checked Out Status in School Year 2021")
plt.pie(device_checked_out, labels=device_checked_out.index, autopct='%1.1f%%', startangle=180);
plt.figure(figsize=(10, 6))
plt.title("Student Present Rate")
plt.xlabel('Present Rate')
plt.ylabel('Number of Student')
plt.hist(k12_data.PresentRate, bins=np.arange(0,1,0.1), color='blue');
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "yinyinw/analyzing-k12-student-data-2021" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/yinyinw/analyzing-k12-student-data-2021
TODO
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.EnrollmentType, s=100);
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.LaptopCheckedOut, s=100);
eth_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_virtual
eth_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_traditional
eth_erollmenttype = eth_virtual.merge(eth_traditional, on=["Ethnicity", "Gender"])
eth_erollmenttype
eth_erollmenttype['VirtualRate'] = eth_erollmenttype.StudentCount_x / (eth_erollmenttype.StudentCount_x + eth_erollmenttype.StudentCount_y)
eth_erollmenttype
pivot_eth = eth_erollmenttype.pivot("Ethnicity", "Gender", "VirtualRate")
pivot_eth
plt.title("Which combination of Ethnicity and Gender group has the highest Virtual School Rate?")
sns.heatmap(pivot_eth, annot=True, cmap="Blues");
stugroup_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_virtual
stugroup_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_traditional
stugroup_erollmenttype = stugroup_virtual.merge(stugroup_traditional, on=["IsSED", "IsSPED"])
stugroup_erollmenttype
stugroup_erollmenttype['VirtualRate'] = stugroup_erollmenttype.StudentCount_x / (stugroup_erollmenttype.StudentCount_x + stugroup_erollmenttype.StudentCount_y)
stugroup_erollmenttype
pivot_stugroup = stugroup_erollmenttype.pivot("IsSED", "IsSPED", "VirtualRate")
pivot_stugroup
plt.title("Which combination of SED and SPED group has the highest Virtual School Rate?")
sns.heatmap(pivot_stugroup, annot=True, cmap="Blues");
k6_data = k12_data[k12_data.GradeLevel == "Elementary"]
k6_data
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.EnrollmentType, s=100);
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.Gender, s=100);
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.HasInternet, s=100);
k6_data
k6_T1P = k6_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountT1P_N_IDF']].mean()
k6_T1P
pivot_T1P = k6_T1P.pivot("HasLaptop", "HasInternet", "CountT1P_N_IDF")
pivot_T1P
plt.title("Elementary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_T1P, annot=True, cmap="Blues");
secondary_data = k12_data[k12_data.GradeLevel != "Elementary"]
secondary_data
secondary_Q1P = secondary_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountQ1P_IDF']].mean()
secondary_Q1P
pivot_Q1P = secondary_Q1P.pivot("HasLaptop", "HasInternet", "CountQ1P_IDF")
pivot_Q1P
plt.title("Secondary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_Q1P, annot=True, cmap="Blues");
sns.barplot('Ethnicity', 'PresentRate', hue='HasLaptop', data=secondary_data)
plt.xticks(rotation=45);
laptop_yes = k12_data[k12_data.LaptopCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_yes
laptop_no = k12_data[k12_data.LaptopCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_no
merge_laptop = laptop_yes.merge(laptop_no, on=["Ethnicity", "IsSED"])
merge_laptop
merge_laptop['CheckoutLaptopRate'] = merge_laptop.StudentCount_x / (merge_laptop.StudentCount_x + merge_laptop.StudentCount_y)
merge_laptop
pivot_laptop = merge_laptop.pivot("Ethnicity", "IsSED", "CheckoutLaptopRate")
pivot_laptop
plt.title("Checkout Laptop Rate by Ethnicity and SED?")
sns.heatmap(pivot_laptop, annot=True, cmap="Blues");
mifi_yes = k12_data[k12_data.MiFiCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_yes
mifi_no = k12_data[k12_data.MiFiCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_no
merge_mifi = mifi_yes.merge(mifi_no, on=["Ethnicity", "IsSED"])
merge_mifi
merge_mifi['CheckoutMiFiRate'] = merge_mifi.StudentCount_x / (merge_mifi.StudentCount_x + merge_mifi.StudentCount_y)
merge_mifi
pivot_mifi = merge_mifi.pivot("Ethnicity", "IsSED", "CheckoutMiFiRate")
pivot_mifi
plt.title("Checkout Hot Spot Rate by Ethnicity and SED?")
sns.heatmap(pivot_mifi, annot=True, cmap="Blues");
import jovian
jovian.commit()
Note:
import jovian
jovian.commit()
TODO
import jovian
jovian.commit()