Jovian
⭐️
Sign In

Analyzing K-12 Student Data for School Year 2021

Year 2020 has been a challenging year. On March 13, 2020, all public schools in California were shut down due to COVID-19. All students ended their 2020 school year in a remote learning setting. Early August, all students had to start school year 2021 remotely still. I work in a school district as a Database Manager in Southern California. Our school district is a K-12 school district and have roughly 50,000 students and 50 schools. I work in Information Technology department and my main role is to maintain and suport data in any information system that school district is using. Because of school closures since March this year, a lot of changes in the systems were made because students can physically go to schools to get instructions. Because of these changes, we were able to collect some new data points this school year to analyze. It is interesting to see what these new data points can tell us.

As a first step, let's upload our Jupyter notebook to Jovian.ml.

In [1]:
project_name = "analyzing-k12-student-data-2021" # change this
In [2]:
!pip install jovian --upgrade -q
In [3]:
import jovian
In [4]:
jovian.commit(project=project_name)
[jovian] Attempting to save notebook.. [jovian] Please enter your API key ( from https://jovian.ml/ ): API KEY: ········ [jovian] Creating a new project "yinyinw/analyzing-k12-student-data-2021" [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ml/yinyinw/analyzing-k12-student-data-2021

Data Preparation and Cleaning

Data source will be from our SIS (Student Information System). Our SIS is using Microsoft SQL Server as database server so I wrote a T-SQL script to pull student data from database server. Since student data is confidential, I didn't include any sensitive student data. I also replaced unique identififer for each school and student in SIS with some unique ID that I created just for this project.

Here is some information of data file I am using:

  • File name is data-science-project-2021-k12.csv.
  • Only students who are currently enrolled in school district are included in the file.
  • Fields included in the file are below:
    • SchoolType: Traditional or Year Round. (Year Round means these schools have year round schedule. Students are divided into different tracks to accommdate student populations)
    • EnrollmentType: Tradional or Virtual. (In summer, we gave parents oppurtunity to select learning model they would like to have for their kids once students can return to school. Virtual meaning remotely learning and Traditional meaning students would go back to school physically when condision is allowed.)
    • SchoolCode: Student's enrolled school. This data is created just for this project.
    • Track: Student's primary track.
    • Grade: Student's grade level in school year 2021.
    • GradeLevel: Elementary, Intermediate or High
    • StudentID: Student's unique identifier. This data is created just for this project.
    • HasLaptop: This is created on these conditions: 1. if student answered "Yes" on survey question "Do you have laptop access?" or 2. if student has checked out a laptop from district.
    • HasInternet: This is created on these conditions: 1. if student answered "Yes" on survey question "Do you have internet at home?" or 2. if student has checked out a hot spot from district.
    • LaptopCheckedOut: The student has checked out a laptop from district or not.
    • PCSurveyLaptop: Student's answer on survey question "Do you have laptop access?"
    • MiFiCheckedOut: The student has checked out a hot spot from district or not.
    • PCSurveyInternet: Student's answer on survey question "Do you have internet at home?"
    • PCSurveyInstrProg: Which instruction program (learning model) do you want to have? We assigned students into different learning models based on their response.
    • TotalStudentInHousehold: Total current students in our school district in the same household.
    • IsSED: Is the student in Socio-Economic Disadvantage group?
    • IsSPED: Is the student in Special Education program?
    • SPEDProgram: If the student is in SPED, which SPED program he/she is in?
    • IsEL: Is the student in English Learner program?
    • ELPAC: ELPAC proficiency level. ELPAC is an English Learner exam in California.
    • IsDLI: Is the student in Dual Immersion program? Dual Immersion is a program which offers bi-lingual learning environment.
    • IsGATE: Is the student in GATE program? GATE means Gift And Talented Education.
    • IsGNP: Is the student in GATE Magnet Program?
    • IsFoster: Is the student a foster student?
    • Gender: Student's gender.
    • Ethnicity: Student's ethnicity.
    • ParentHighestEdLevel: Parent's highest education level.
    • DaysEnrolled: Total days enrolled in school year 2021 up to 9/25/20.
    • DaysPresent: Total present days in school year 2021 up to 9/25/20.
    • PresentRate: This is calcualted by using DaysPresent / DaysEnrolled.
    • CountT1P_N_IDF: Count "N's" (for grade K-3) or "In Danger of Failing" (for grade 4-6) courses for elementary students for T1 term progress report.
    • CountT1_N_DF: Count of "N's" (for grade K-3) or "D's and F's" (for grade 4-6) courses for elementary students for T1 term marks. Only some students in year round schools have T1 marks so far.
    • CountQ1P_IDF: Count of "In Danger of Failing" courses for secondary (intermediate and high school) students for Q1 term progress report.
    • CountQ1I_DF: Count of "D's and F's" courses for secondary (intermediate and high school) students for Q1 term interim report. Only a few students get this mark.
In [2]:
import numpy as np
In [3]:
import pandas as pd
In [4]:
data_raw_df = pd.read_csv('data-science-project-2021-k12.csv')
In [5]:
data_raw_df
Out[5]:

There are 34 columns and 51,354 rows in this data file. That means there are currently 51,354 students enrolled in our school district.

In [6]:
data_raw_df.columns
Out[6]:
Index(['SchoolType', 'EnrollmentType', 'SchoolCode', 'Track', 'Grade',
       'GradeLevel', 'StudentID', 'HasLaptop', 'HasInternet',
       'LaptopCheckedOut', 'PCSurveyLaptop', 'MiFiCheckedOut',
       'PCSurveyInternet', 'PCSurveyInstrProg', 'TotalStudentInHousehold',
       'IsSED', 'IsSPED', 'SPEDProgram', 'IsEL', 'ELPAC', 'IsDLI', 'IsGATE',
       'IsGNP', 'IsFoster', 'Gender', 'Ethnicity', 'ParentHighestEdLevel',
       'DaysEnrolled', 'DaysPresent', 'PresentRate', 'CountT1P_N_IDF',
       'CountT1_N_DF', 'CountQ1P_IDF', 'CountQ1I_DF'],
      dtype='object')

New data points since school year 2021 are EnrollmentType, HasLaptop, HasInternet, LaptopCheckedOut, PCSurveyLaptop, MiFiCheckedOut, PCSurveyInternet, PCSurveyInstrProg.

In [10]:
data_raw_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 51354 entries, 0 to 51353 Data columns (total 34 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SchoolType 51354 non-null object 1 EnrollmentType 51354 non-null object 2 SchoolCode 51354 non-null int64 3 Track 51354 non-null object 4 Grade 51354 non-null object 5 GradeLevel 51354 non-null object 6 StudentID 51354 non-null int64 7 HasLaptop 51354 non-null object 8 HasInternet 51354 non-null object 9 LaptopCheckedOut 51354 non-null object 10 PCSurveyLaptop 51354 non-null object 11 MiFiCheckedOut 51354 non-null object 12 PCSurveyInternet 51354 non-null object 13 PCSurveyInstrProg 51354 non-null object 14 TotalStudentInHousehold 51354 non-null int64 15 IsSED 51354 non-null object 16 IsSPED 51354 non-null object 17 SPEDProgram 6910 non-null object 18 IsEL 51354 non-null object 19 ELPAC 5672 non-null float64 20 IsDLI 51354 non-null object 21 IsGATE 51354 non-null object 22 IsGNP 51354 non-null object 23 IsFoster 51354 non-null object 24 Gender 51354 non-null object 25 Ethnicity 51354 non-null object 26 ParentHighestEdLevel 51354 non-null object 27 DaysEnrolled 51354 non-null float64 28 DaysPresent 51354 non-null float64 29 PresentRate 51354 non-null float64 30 CountT1P_N_IDF 51354 non-null int64 31 CountT1_N_DF 51354 non-null int64 32 CountQ1P_IDF 51354 non-null int64 33 CountQ1I_DF 51354 non-null int64 dtypes: float64(4), int64(7), object(23) memory usage: 13.3+ MB
In [11]:
data_raw_df.describe()
Out[11]:

Here are some findings based on these integer columns:

  • Total active student count is 51,354.
  • Average number of students in a household in the school district is 1.776. Max number of students in a household is 41. (Not sure if this is bad data since 41 students in a household seems high.)
  • Average student enrolled days in this school year is 32.79. Max is 57. Because some elementary schools are year-round schools in our school district and their first day of school was 7/6/2020 so it is resonable students having 57 enrolled days.
In [12]:
k12_data = data_raw_df.copy()

Add a new column "StudentCount" so it is easier to count student in different category later.

In [13]:
k12_data['StudentCount'] = 1
In [14]:
import jovian
In [ ]:
jovian.commit()
[jovian] Attempting to save notebook..

Exploratory Analysis and Visualization

TODO

In [ ]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Virtual vs Traditional Setting

Because of COVID-19, all California schools began in a virtual settings. However, during summer time, our school district sent out a survey asking parents' intention of having their students beginning the school year in what settings. There are two selections: Virtual and Traditional.

In [ ]:
k12_data.EnrollmentType.unique()
In [ ]:
virtual_traditional = k12_data.EnrollmentType.value_counts()
virtual_traditional
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Enrollment Type in School Year 2021")
sns.barplot(virtual_traditional.index, virtual_traditional);

Grade Level

In [ ]:
grade_level = k12_data.Grade.value_counts()
grade_level
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Grade Level in School Year 2021")
sns.barplot(grade_level.index, grade_level);

Device Checked Out Status

In [ ]:
device_checked_out = k12_data.LaptopCheckedOut.value_counts()
device_checked_out
In [ ]:
plt.figure(figsize=(10,6))
plt.title("Student Device Checked Out Status in School Year 2021")
sns.barplot(device_checked_out.index, device_checked_out);

Student Attendance

In [ ]:
plt.figure(figsize=(12, 6))
plt.title("Student Present Rate")
plt.xlabel('Present Rate')
plt.ylabel('Number of Student')

plt.hist(k12_data.PresentRate, bins=np.arange(0,1,0.1), color='blue');
In [ ]:
import jovian
In [ ]:
jovian.commit()

Asking and Answering Questions

TODO

Q:

In [ ]:
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.EnrollmentType, s=100);
In [ ]:
sns.scatterplot(k12_data.PresentRate, k12_data.CountT1P_N_IDF, hue=k12_data.LaptopCheckedOut, s=100);
In [ ]:
 

Q: Which combination of Ethnicity and Gender group has the highest Virtual School Rate?

In [ ]:
eth_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_virtual
In [ ]:
eth_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['Ethnicity', 'Gender', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
eth_traditional
In [ ]:
eth_erollmenttype = eth_virtual.merge(eth_traditional, on=["Ethnicity", "Gender"])
eth_erollmenttype
In [ ]:
eth_erollmenttype['VirtualRate'] = eth_erollmenttype.StudentCount_x / (eth_erollmenttype.StudentCount_x + eth_erollmenttype.StudentCount_y)
eth_erollmenttype
In [ ]:
pivot_eth = eth_erollmenttype.pivot("Ethnicity", "Gender", "VirtualRate")
pivot_eth
In [ ]:
plt.title("Which combination of Ethnicity and Gender group has the highest Virtual School Rate?")
sns.heatmap(pivot_eth, annot=True, cmap="Blues");

Q: Which combination of SED and SPED group has the highest Virtual School Rate?

In [ ]:
stugroup_virtual = k12_data[k12_data.EnrollmentType == "Virtual"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_virtual
In [ ]:
stugroup_traditional = k12_data[k12_data.EnrollmentType == "Traditional"].groupby(['IsSED', 'IsSPED', 'EnrollmentType'], as_index=False)[['StudentCount']].count()
stugroup_traditional
In [ ]:
stugroup_erollmenttype = stugroup_virtual.merge(stugroup_traditional, on=["IsSED", "IsSPED"])
stugroup_erollmenttype
In [ ]:
stugroup_erollmenttype['VirtualRate'] = stugroup_erollmenttype.StudentCount_x / (stugroup_erollmenttype.StudentCount_x + stugroup_erollmenttype.StudentCount_y)
stugroup_erollmenttype
In [ ]:
pivot_stugroup = stugroup_erollmenttype.pivot("IsSED", "IsSPED", "VirtualRate")
pivot_stugroup
In [ ]:
plt.title("Which combination of SED and SPED group has the highest Virtual School Rate?")
sns.heatmap(pivot_stugroup, annot=True, cmap="Blues");

Q: Elementary: Lack of Internet or lack of Laptop affects student performance?

In [ ]:
k6_data = k12_data[k12_data.GradeLevel == "Elementary"]
k6_data
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.EnrollmentType, s=100);
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.Gender, s=100);
In [ ]:
sns.scatterplot(k6_data.PresentRate, k6_data.CountT1P_N_IDF, hue=k6_data.HasInternet, s=100);
In [ ]:
k6_data
In [ ]:
k6_T1P = k6_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountT1P_N_IDF']].mean()
k6_T1P
In [ ]:
pivot_T1P = k6_T1P.pivot("HasLaptop", "HasInternet", "CountT1P_N_IDF")
pivot_T1P
In [ ]:
plt.title("Elementary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_T1P, annot=True, cmap="Blues");

Q: Secondary

In [ ]:
secondary_data = k12_data[k12_data.GradeLevel != "Elementary"]
secondary_data
In [ ]:
secondary_Q1P = secondary_data.groupby(['HasLaptop', 'HasInternet'], as_index=False)[['CountQ1P_IDF']].mean()
secondary_Q1P
In [ ]:
pivot_Q1P = secondary_Q1P.pivot("HasLaptop", "HasInternet", "CountQ1P_IDF")
pivot_Q1P
In [ ]:
plt.title("Secondary: Lack of Internet or lack of Laptop affects student performance?")
sns.heatmap(pivot_Q1P, annot=True, cmap="Blues");
In [ ]:
sns.barplot('Ethnicity', 'PresentRate', hue='HasLaptop', data=secondary_data)
plt.xticks(rotation=45);

Q: Checkout Laptop Rate by Ethnicity and SED?

In [ ]:
laptop_yes = k12_data[k12_data.LaptopCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_yes
In [ ]:
laptop_no = k12_data[k12_data.LaptopCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'LaptopCheckedOut'], as_index=False)[['StudentCount']].count()
laptop_no
In [ ]:
merge_laptop = laptop_yes.merge(laptop_no, on=["Ethnicity", "IsSED"])
merge_laptop
In [ ]:
merge_laptop['CheckoutLaptopRate'] = merge_laptop.StudentCount_x / (merge_laptop.StudentCount_x + merge_laptop.StudentCount_y)
merge_laptop
In [ ]:
pivot_laptop = merge_laptop.pivot("Ethnicity", "IsSED", "CheckoutLaptopRate")
pivot_laptop
In [ ]:
plt.title("Checkout Laptop Rate by Ethnicity and SED?")
sns.heatmap(pivot_laptop, annot=True, cmap="Blues");

Q: Checkout Hot Spot Rate by Ethnicity and SED?

In [ ]:
mifi_yes = k12_data[k12_data.MiFiCheckedOut == "Yes"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_yes
In [ ]:
mifi_no = k12_data[k12_data.MiFiCheckedOut == "No"].groupby(['Ethnicity', 'IsSED', 'MiFiCheckedOut'], as_index=False)[['StudentCount']].count()
mifi_no
In [ ]:
merge_mifi = mifi_yes.merge(mifi_no, on=["Ethnicity", "IsSED"])
merge_mifi
In [ ]:
merge_mifi['CheckoutMiFiRate'] = merge_mifi.StudentCount_x / (merge_mifi.StudentCount_x + merge_mifi.StudentCount_y)
merge_mifi
In [ ]:
pivot_mifi = merge_mifi.pivot("Ethnicity", "IsSED", "CheckoutMiFiRate")
pivot_mifi
In [ ]:
plt.title("Checkout Hot Spot Rate by Ethnicity and SED?")
sns.heatmap(pivot_mifi, annot=True, cmap="Blues");
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

Inferences and Conclusion

Note:

  • Merge Filipino into Asian?
  • Remove Non-Binary (too little population)
  • Need to add a column for laptop help and internet help
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

References and Future Work

TODO

In [ ]:
import jovian
In [ ]:
jovian.commit()
In [ ]: