The aim of this project is to find out which are the most relevant features that students consider to choose the preferred American university. Some of the essential questions for developing this project are related to the number of applications, admissions, and enrollments, cost of tuition and fees, cost of living on campus, types of degrees offered, and features of the states where universities are located (population and GDP).
The dataset used for this analysis was taken from https://www.kaggle.com/sumithbhongale/american-university-data-ipeds-dataset. It contains a plethora of information about American universities (that are not necessarily the top 10-20) in 2013.
Although this dataset does not contain information about all the first-ranked American universities, the patterns and insights extracted from it are highly representative of the whole behavior.
The dataset contains more than a thousand rows (universities) and 145 columns (features about those universities). Several of those features are out of the scope of this project. Only the features that have information to answer the questions to achieve the goal of the project were deployed.
The most powerful tools for data analysis used in this project are the packages Numpy
and Pandas
, and to visualize and explore the data: Matplotlib
and Seaborn
. All of these tools were meaningfully and efficiently taught in the course "Data Analysis with Python: Zero to Pandas"
given by Jovian
in partnership with freeCodeCamp
.
As a first step, I'll upload this Jupyter notebook to Jovian.ai
file_name ='student_perfere_a_university'
!pip install jovian --upgrade -q
import jovian
jovian.commit(filename=file_name)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "danycg85/student-perfere-a-university" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/danycg85/student-perfere-a-university
This chapter presents the steps to read the dataset about American Universities.
In all data-analysis projects, the data preparation step is not only necessary but also vital to find and handle features that could cause some problems while making the quantitative analysis, or that could lead to low efficient coding. According to Alivia Smith[1], this step usually takes up to 80% of the entire time of a data analysis project. Therefore, missing, invalid, and inconsistent values have been addressed. Finally, this step presents a code for changing the format of column names.
Although in many cases, the exploratory data analysis and the quantitative and qualitative analysis are separated steps, in this specific project, they have been joined. This step, previous to asking questions that could lead to reaching the project's aim, presents global-useful information about the different columns of the dataset. Some initial examinations were performed as a way of beginning finding patterns, creating hypotheses, and corroborating early assumptions. Later, deeper investigations were depicted as part of reaching the project's goal.
The hypotheses and questions generated to develop this projects are:
Although all conclusions and answers are exposed in the previous section accordingly to each question, this section also presents, in a concise manner, the most significant insights.
This section reveals some compelling features or ideas to elaborate on or to consider for future works.
All the sources that have been helpful to develop this project are exposed in this section.
We're going to begin by importing all the packages we'll use in this project.
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#to center every figure in the notebook.
#from: https://stackoverflow.com/questions/18380168/center-output-plots-in-the-notebook
from IPython.core.display import HTML as Center
Center(""" <style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style> """)
with open('universities_data.csv') as file:
universities_df=pd.read_csv(file)
Because the file is a comma-separated value (CSV) file, we use the pandas function pd.read_csv()
to open and read the file.
type(universities_df)
pandas.core.frame.DataFrame
We have verified that the file has been created as a DataFrame variable. And we can see the first five rows of the dataset as a preview.
universities_df.head()
jovian.commit(filename=file_name,files=['universities_data.csv'])
[jovian] Attempting to save notebook..
[jovian] Updating notebook "danycg85/student-perfere-a-university" on https://jovian.ai/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/danycg85/student-perfere-a-university
Let's begin this step by looking at the shape (number of rows and columns) of the dataset.
universities_df.shape
(1534, 145)
print('The dataset contains {} rows and {} columns'.format(universities_df.shape[0],universities_df.shape[1]))
The dataset contains 1534 rows and 145 columns
Now, let's see some additional information.
universities_df.info(max_cols=len(universities_df))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 0 to 1533
Data columns (total 145 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID number 1534 non-null int64
1 Name 1534 non-null object
2 year 1534 non-null int64
3 ZIP code 1534 non-null object
4 Highest degree offered 1534 non-null object
5 County name 1534 non-null object
6 Longitude location of institution 1534 non-null object
7 Latitude location of institution 1534 non-null object
8 Religious affiliation 1532 non-null object
9 Offers Less than one year certificate 1532 non-null object
10 Offers One but less than two years certificate 1532 non-null object
11 Offers Associate's degree 1532 non-null object
12 Offers Two but less than 4 years certificate 1532 non-null object
13 Offers Bachelor's degree 1532 non-null object
14 Offers Postbaccalaureate certificate 1532 non-null object
15 Offers Master's degree 1532 non-null object
16 Offers Post-master's certificate 1532 non-null object
17 Offers Doctor's degree - research/scholarship 1532 non-null object
18 Offers Doctor's degree - professional practice 1532 non-null object
19 Offers Doctor's degree - other 1532 non-null object
20 Offers Other degree 1532 non-null object
21 Applicants total 1377 non-null float64
22 Admissions total 1377 non-null float64
23 Enrolled total 1377 non-null float64
24 Percent of freshmen submitting SAT scores 1257 non-null float64
25 Percent of freshmen submitting ACT scores 1259 non-null float64
26 SAT Critical Reading 25th percentile score 1169 non-null float64
27 SAT Critical Reading 75th percentile score 1169 non-null float64
28 SAT Math 25th percentile score 1182 non-null float64
29 SAT Math 75th percentile score 1182 non-null float64
30 SAT Writing 25th percentile score 714 non-null float64
31 SAT Writing 75th percentile score 714 non-null float64
32 ACT Composite 25th percentile score 1199 non-null float64
33 ACT Composite 75th percentile score 1199 non-null float64
34 Estimated enrollment, total 1532 non-null float64
35 Estimated enrollment, full time 1531 non-null float64
36 Estimated enrollment, part time 1523 non-null float64
37 Estimated undergraduate enrollment, total 1526 non-null float64
38 Estimated undergraduate enrollment, full time 1525 non-null float64
39 Estimated undergraduate enrollment, part time 1514 non-null float64
40 Estimated freshman undergraduate enrollment, total 1515 non-null float64
41 Estimated freshman enrollment, full time 1515 non-null float64
42 Estimated freshman enrollment, part time 1451 non-null float64
43 Estimated graduate enrollment, total 1432 non-null float64
44 Estimated graduate enrollment, full time 1422 non-null float64
45 Estimated graduate enrollment, part time 1420 non-null float64
46 Associate's degrees awarded 1532 non-null float64
47 Bachelor's degrees awarded 1532 non-null float64
48 Master's degrees awarded 1532 non-null float64
49 Doctor's degrese - research/scholarship awarded 1532 non-null float64
50 Doctor's degrees - professional practice awarded 1532 non-null float64
51 Doctor's degrees - other awarded 1532 non-null float64
52 Certificates of less than 1-year awarded 1532 non-null float64
53 Certificates of 1 but less than 2-years awarded 1532 non-null float64
54 Certificates of 2 but less than 4-years awarded 1532 non-null float64
55 Postbaccalaureate certificates awarded 1532 non-null float64
56 Post-master's certificates awarded 1532 non-null float64
57 Number of students receiving an Associate's degree 1532 non-null float64
58 Number of students receiving a Bachelor's degree 1532 non-null float64
59 Number of students receiving a Master's degree 1532 non-null float64
60 Number of students receiving a Doctor's degree 1532 non-null float64
61 Number of students receiving a certificate of less than 1-year 1532 non-null float64
62 Number of students receiving a certificate of 1 but less than 4-years 1532 non-null float64
63 Number of students receiving a Postbaccalaureate or Post-master's certificate 1532 non-null float64
64 Percent admitted - total 1376 non-null float64
65 Admissions yield - total 1376 non-null float64
66 Tuition and fees, 2010-11 1490 non-null float64
67 Tuition and fees, 2011-12 1490 non-null float64
68 Tuition and fees, 2012-13 1492 non-null float64
69 Tuition and fees, 2013-14 1497 non-null float64
70 Total price for in-state students living on campus 2013-14 1400 non-null float64
71 Total price for out-of-state students living on campus 2013-14 1400 non-null float64
72 State abbreviation 1534 non-null object
73 FIPS state code 1534 non-null object
74 Geographic region 1534 non-null object
75 Sector of institution 1534 non-null object
76 Level of institution 1534 non-null object
77 Control of institution 1534 non-null object
78 Historically Black College or University 1534 non-null object
79 Tribal college 1534 non-null object
80 Degree of urbanization (Urban-centric locale) 1534 non-null object
81 Carnegie Classification 2010: Basic 1534 non-null object
82 Total enrollment 1532 non-null float64
83 Full-time enrollment 1532 non-null float64
84 Part-time enrollment 1532 non-null float64
85 Undergraduate enrollment 1532 non-null float64
86 Graduate enrollment 1532 non-null float64
87 Full-time undergraduate enrollment 1532 non-null float64
88 Part-time undergraduate enrollment 1532 non-null float64
89 Percent of total enrollment that are American Indian or Alaska Native 1532 non-null float64
90 Percent of total enrollment that are Asian 1532 non-null float64
91 Percent of total enrollment that are Black or African American 1532 non-null float64
92 Percent of total enrollment that are Hispanic/Latino 1532 non-null float64
93 Percent of total enrollment that are Native Hawaiian or Other Pacific Islander 1532 non-null float64
94 Percent of total enrollment that are White 1532 non-null float64
95 Percent of total enrollment that are two or more races 1532 non-null float64
96 Percent of total enrollment that are Race/ethnicity unknown 1532 non-null float64
97 Percent of total enrollment that are Nonresident Alien 1532 non-null float64
98 Percent of total enrollment that are Asian/Native Hawaiian/Pacific Islander 1532 non-null float64
99 Percent of total enrollment that are women 1532 non-null float64
100 Percent of undergraduate enrollment that are American Indian or Alaska Native 1522 non-null float64
101 Percent of undergraduate enrollment that are Asian 1522 non-null float64
102 Percent of undergraduate enrollment that are Black or African American 1522 non-null float64
103 Percent of undergraduate enrollment that are Hispanic/Latino 1522 non-null float64
104 Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander 1522 non-null float64
105 Percent of undergraduate enrollment that are White 1522 non-null float64
106 Percent of undergraduate enrollment that are two or more races 1522 non-null float64
107 Percent of undergraduate enrollment that are Race/ethnicity unknown 1522 non-null float64
108 Percent of undergraduate enrollment that are Nonresident Alien 1522 non-null float64
109 Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander 1522 non-null float64
110 Percent of undergraduate enrollment that are women 1522 non-null float64
111 Percent of graduate enrollment that are American Indian or Alaska Native 1269 non-null float64
112 Percent of graduate enrollment that are Asian 1269 non-null float64
113 Percent of graduate enrollment that are Black or African American 1269 non-null float64
114 Percent of graduate enrollment that are Hispanic/Latino 1269 non-null float64
115 Percent of graduate enrollment that are Native Hawaiian or Other Pacific Islander 1269 non-null float64
116 Percent of graduate enrollment that are White 1269 non-null float64
117 Percent of graduate enrollment that are two or more races 1269 non-null float64
118 Percent of graduate enrollment that are Race/ethnicity unknown 1269 non-null float64
119 Percent of graduate enrollment that are Nonresident Alien 1269 non-null float64
120 Percent of graduate enrollment that are Asian/Native Hawaiian/Pacific Islander 1269 non-null float64
121 Percent of graduate enrollment that are women 1269 non-null float64
122 Number of first-time undergraduates - in-state 911 non-null float64
123 Percent of first-time undergraduates - in-state 911 non-null float64
124 Number of first-time undergraduates - out-of-state 911 non-null float64
125 Percent of first-time undergraduates - out-of-state 911 non-null float64
126 Number of first-time undergraduates - foreign countries 911 non-null float64
127 Percent of first-time undergraduates - foreign countries 911 non-null float64
128 Number of first-time undergraduates - residence unknown 911 non-null float64
129 Percent of first-time undergraduates - residence unknown 911 non-null float64
130 Graduation rate - Bachelor degree within 4 years, total 1476 non-null float64
131 Graduation rate - Bachelor degree within 5 years, total 1476 non-null float64
132 Graduation rate - Bachelor degree within 6 years, total 1476 non-null float64
133 Percent of freshmen receiving any financial aid 1492 non-null float64
134 Percent of freshmen receiving federal, state, local or institutional grant aid 1492 non-null float64
135 Percent of freshmen receiving federal grant aid 1492 non-null float64
136 Percent of freshmen receiving Pell grants 1492 non-null float64
137 Percent of freshmen receiving other federal grant aid 1492 non-null float64
138 Percent of freshmen receiving state/local grant aid 1492 non-null float64
139 Percent of freshmen receiving institutional grant aid 1492 non-null float64
140 Percent of freshmen receiving student loan aid 1492 non-null float64
141 Percent of freshmen receiving federal student loans 1492 non-null float64
142 Percent of freshmen receiving other loan aid 1492 non-null float64
143 Endowment assets (year end) per FTE enrollment (GASB) 516 non-null float64
144 Endowment assets (year end) per FTE enrollment (FASB) 960 non-null float64
dtypes: float64(114), int64(2), object(29)
memory usage: 1.7+ MB
In this case, only three dtypes have been recognized: float64, int64, and object.
If we take a look at the number of non-null elements of each column we can see that many columns, for example the column SAT Writing 75th percentile score
, contain several null or NaN values.
Missing, incorrect, and invalid data need to be addressed to avert possible misleading mathematical operations. There are pandas functions, for example mean(), that exclude NaN values by default; however, other numerical operations consider all values, including missing values, to perform those operations. This will lead to wrong results. Incorrect and invalid data will also result in wrong outcomes. For these reasons, it is vital to tackle these different types of data according to their case.
Let's check the NaN values within the entire data frame. We will see a list of columns sorted according to the number of NaN values that they present.
universities_df.isna().sum().sort_values(ascending=False)
Endowment assets (year end) per FTE enrollment (GASB) 1018
SAT Writing 75th percentile score 820
SAT Writing 25th percentile score 820
Number of first-time undergraduates - residence unknown 623
Percent of first-time undergraduates - residence unknown 623
...
Historically Black College or University 0
Tribal college 0
Degree of urbanization (Urban-centric locale) 0
Carnegie Classification 2010: Basic 0
ID number 0
Length: 145, dtype: int64
Now, let's see only the columns with 20% or more of NaN values.
perc_nan=universities_df.isna().sum()/len(universities_df)*100
ax=perc_nan[perc_nan>=20].sort_values(ascending=False).plot.bar(title='Percentage of NaN values',figsize=(12,5));
ax.set_ylabel('% of NaN elements');
colum_off=universities_df.isna().sum()[universities_df.isna().sum()>=(0.2*len(universities_df))]
list_colum_off=colum_off.index.to_list()
All columns containing 20% of NaN values or more will be removed since those columns cannot be used as representative values. But before removing them, I'll create a copy of the data frame, and I'll continue working on this new data frame.
universitiesnw_df=universities_df.copy()
universitiesnw_df.drop(list_colum_off,axis=1,inplace=True)
From the remaining columns, only those that are related to the goal of this project will be selected to continue working on them.
interesting_columns=['Name', 'year', 'Highest degree offered', "Offers Bachelor's degree",
"Offers Master's degree",
"Offers Doctor's degree - research/scholarship",
"Offers Doctor's degree - professional practice", 'Applicants total',
'Admissions total', 'Enrolled total', 'Estimated enrollment, total',
'Tuition and fees, 2013-14',
'Total price for in-state students living on campus 2013-14',
'Total price for out-of-state students living on campus 2013-14',
'State abbreviation', 'Control of institution', 'Total enrollment',
'Full-time enrollment', 'Part-time enrollment',
'Undergraduate enrollment', 'Graduate enrollment',
'Full-time undergraduate enrollment',
'Part-time undergraduate enrollment',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are women',
'Percent of graduate enrollment that are women',
'Graduation rate - Bachelor degree within 4 years, total',
'Graduation rate - Bachelor degree within 5 years, total',
'Graduation rate - Bachelor degree within 6 years, total',
]
universitiesnw_df=universitiesnw_df[interesting_columns]
Now, some rows that contain a plethora of NaN values will also been removed.
universitiesnw_df[universitiesnw_df['Total enrollment'].isna()][['Name','Applicants total','Admissions total','Enrolled total','Total enrollment']]
a=universitiesnw_df[universitiesnw_df['Name']=='University of North Georgia'].index[0]
b=universitiesnw_df[universitiesnw_df['Name']=='Texas A & M University-Galveston'].index[0]
universitiesnw_df=universitiesnw_df.drop([a,b],axis=0)
print('The data frame now has {} columns out of the {} original columns, and {} rows out of the {} original rows.'.format(universitiesnw_df.shape[1],universities_df.shape[1],universitiesnw_df.shape[0],universities_df.shape[0]))
The data frame now has 29 columns out of the 145 original columns, and 1532 rows out of the 1534 original rows.
Now let's check if the data frame contains invalid values. These values can be more dangerous than missing values because they can produce errors that can be tricky to identify. First, we can try to find out if the data frame has negative values. It is supposed that, according to the context of this data frame, we will not find any negative value.
To perform this, we need to identify which columns contain numerical values. Only those columns will be assessed.
col=universitiesnw_df.select_dtypes(include=['float64','int64']).columns
lt=list()
for i in col:
y=any(x < 0 for x in universitiesnw_df[i])
if y==True:
lt.append(y)
print('There are {} negative values in the data frame.'.format(len(lt)))
There are 0 negative values in the data frame.
Therefore, for this dataset, we do not need to be preoccupied with handling negative invalid values.
universitiesnw_df.describe()
According to the description chart, there are some universities which, for example, in 2013, had 0 applications in total, 0 admissions in total, and 0 enrollments in total. These are inconsistent values, and they need to be handled.
Let's find the total number of entries with 0 value.
total_zero=(universitiesnw_df[universitiesnw_df.loc[0:]==0]).count().sum()
print('This data set contains {} zero values.'.format(total_zero))
This data set contains 395 zero values.
In the specific case of this data set, the entries with a zero value do not have sense; in fact, the most probable reason for them is that the information was not available at that time, but it doesn't mean they are 0. Therefore, I prefer to replace those entries with NaN values. In this way, if a pandas function is applied, none of those NaN values will be considered.
universitiesnw_df.replace(0,np.nan,inplace=True)
total_zero_nw=universitiesnw_df[universitiesnw_df.loc[0:]==0].count().sum()
print('This data set contains {} zero values.'.format(total_zero_nw))
This data set contains 0 zero values.
universitiesnw_df[['Name','Applicants total']].sort_values('Applicants total').head()
According to the chart above, in the case of the column Applicants total
, the possible inconsistent value belongs to Goddard College: Applicants total=4
. However, if we search a little more about this university, we'll find that this number of applications is, in fact, consistent with their data over the years. This is corroborated with the data found in:
https://www.collegefactual.com/colleges/goddard-college/applying/entering-class-stats/
Now, in order to use the column names as properties and take advantage of it, it's necessary to remove some special characters that some column names contain, for example, apostrophe '
, hyphen -
, quotation marks "
, colon :
, and slash /
. Also, some column names contain spaces between their words; these spaces will be removed as well.
universitiesnw_df.columns
Index(['Name', 'year', 'Highest degree offered', 'Offers Bachelor's degree',
'Offers Master's degree',
'Offers Doctor's degree - research/scholarship',
'Offers Doctor's degree - professional practice', 'Applicants total',
'Admissions total', 'Enrolled total', 'Estimated enrollment, total',
'Tuition and fees, 2013-14',
'Total price for in-state students living on campus 2013-14',
'Total price for out-of-state students living on campus 2013-14',
'State abbreviation', 'Control of institution', 'Total enrollment',
'Full-time enrollment', 'Part-time enrollment',
'Undergraduate enrollment', 'Graduate enrollment',
'Full-time undergraduate enrollment',
'Part-time undergraduate enrollment',
'Percent of total enrollment that are women',
'Percent of undergraduate enrollment that are women',
'Percent of graduate enrollment that are women',
'Graduation rate - Bachelor degree within 4 years, total',
'Graduation rate - Bachelor degree within 5 years, total',
'Graduation rate - Bachelor degree within 6 years, total'],
dtype='object')
# def remove_space(header):
# '''This function takes all the spaces between the words of column names and replaces them
# with '_' .
# The argument header corresponds to a column name. '''
# list1=list()
# words_header=header.split()
# size=int(len(words_header))
# for i in range(len(words_header)):
# if i<size-1:
# list1.append(words_header[i]+'_')
# else:
# list1.append(words_header[i])
# separator = ''
# final=separator.join(list1)
# return final
def remove_space(list_headers,charact): #charact should be: charact=[' - ',' ']
new_headers=list()
for header in list_headers:
for char in charact:
if char in header:
header=header.replace(char,'_')
header=header
new_headers.append(header)
return new_headers
def remove_sp_char(header):
'''This function takes all the special characters found in column names and replaces them
with other character accordingly to the case.
The argument header corresponds to a column name. '''
if "'" in header:
header=header.replace("'",'')
if "," in header:
header=header.replace(",",'')
if "_-_" in header:
header=header.replace("_-_",'_')
if "/" in header:
header=header.replace("/",'_or_')
if ":" in header:
header=header.replace(":",'')
if "-" in header:
header=header.replace("-",'_')
return header
def remove_sp_char(headers,chars):
new_headers=list()
for header in headers:
for char in chars:
if char=='-' or char=='/':
header=header.replace(char,'_')
if char in header:
header=header.replace(char,'')
header=header
new_headers.append(header)
return new_headers
Besides removing spaces and replacing special characters, I'll change capitalized letters to avoid a typo of this kind.
headers=remove_space(universitiesnw_df.columns,[' - ',' '])
headers=remove_sp_char(headers,["'",',',':','-','/'])
list_new_header=list()
for header in headers:
header=header.casefold() # All capitalized letters are changed.
if "degrese" in header: # One column name has a typo.
header=header.replace("degrese",'degrees')
list_new_header.append(header)
The next step is to replace the original column names with the new ones, which have the desired format.
universitiesnw_df.columns=list_new_header
universitiesnw_df.columns
Index(['name', 'year', 'highest_degree_offered', 'offers_bachelors_degree',
'offers_masters_degree', 'offers_doctors_degree_research_scholarship',
'offers_doctors_degree_professional_practice', 'applicants_total',
'admissions_total', 'enrolled_total', 'estimated_enrollment_total',
'tuition_and_fees_2013_14',
'total_price_for_in_state_students_living_on_campus_2013_14',
'total_price_for_out_of_state_students_living_on_campus_2013_14',
'state_abbreviation', 'control_of_institution', 'total_enrollment',
'full_time_enrollment', 'part_time_enrollment',
'undergraduate_enrollment', 'graduate_enrollment',
'full_time_undergraduate_enrollment',
'part_time_undergraduate_enrollment',
'percent_of_total_enrollment_that_are_women',
'percent_of_undergraduate_enrollment_that_are_women',
'percent_of_graduate_enrollment_that_are_women',
'graduation_rate_bachelor_degree_within_4_years_total',
'graduation_rate_bachelor_degree_within_5_years_total',
'graduation_rate_bachelor_degree_within_6_years_total'],
dtype='object')
For convenience, the name of the column state_abbreviation
will be replaced by just state
.
universitiesnw_df.rename(columns={'state_abbreviation':'state'}, inplace=True)
universitiesnw_df[['state']].head(2)
jovian.commit(filename=file_name,files=['universities_data.csv'])
[jovian] Attempting to save notebook..
To find insight from our dataset, we'll explore its variables. This exploratory analysis will help us to understand how the variables interact and how representative they are. In the end, it'll help us to formulate hypotheses to reach the goal of the project and think about future works.
matplotlib.rcParams['figure.facecolor']='whitesmoke'
Let's begin this step by looking for information about each column.
from IPython.display import display
with pd.option_context('display.max_columns',None):
display(universitiesnw_df.describe())
It's interesting to see that in 2013 one university received around 72000 applications; whereas, another received only 4 applications in the same year. So, let's see which universities received the highest number of applications.
high_app_df=universitiesnw_df[['name','applicants_total']].sort_values('applicants_total',ascending=False).head(20)
plt.figure(figsize=(12,8))
matplotlib.rcParams['font.size']=14
sns.barplot(x='applicants_total',y='name',data=high_app_df)
plt.title('Top 20 American Universities with the Most Applications in 2013')
plt.xlabel('Number of applications')
plt.ylabel('');
We could think that the universities with the highest number of applications are the preferred ones by students. However, one student can send several applications at the same time, but only one university could be the preferred one; the other universities could be just in case of not being admitted to the preferred one. To find out if the universities with the most applications are the preferred ones, let's find some relationships with the number of admissions and enrollments.
plt.figure(figsize=(16,6))
plt.subplot(1,3,1)
sns.histplot(universitiesnw_df.applicants_total,bins=50)
plt.title('''Histogram of Number of Applications.
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.applicants_total.mean(),universitiesnw_df.applicants_total.median()));
plt.xlabel('Number of Applications')
plt.axis([0,30000,0,350])
plt.xticks(rotation=10)
plt.grid()
plt.subplot(1,3,2)
sns.histplot(universitiesnw_df.admissions_total,bins=50)
plt.title('''Histogram of Number of Admissions.
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.admissions_total.mean(),universitiesnw_df.admissions_total.median()));
plt.axis([0,10000,0,350])
plt.xlabel('Number of Admissions')
plt.xticks(rotation=10)
plt.grid()
plt.subplot(1,3,3)
sns.histplot(universitiesnw_df.enrolled_total,bins=50)
plt.title('''Histogram of Number of Enrollments.
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.enrolled_total.mean(),universitiesnw_df.enrolled_total.median()));
plt.axis([0,5000,0,350])
plt.xlabel('Number of Enrollments')
plt.grid()
plt.xticks(rotation=10)
plt.tight_layout(pad=2);
As we can see, the majority of universities received less than 5000 applications in 2013. However, some universities received a much higher number of applications, as many as 30000 applications in the same year.
We also can see that the average number of admissions is much lower than that of applications. The majority of universities admitted to less than 2000 students in 2013; however, some other universities presented a much higher number of admissions, more than 10000 admissions. As a high rate of admissions could represent a high possibility to be accepted, we could ask: Do students prefer universities with a high rate of admission?
Talking about enrollments, we'll find lower numbers when compared to the number of applications and admissions. That happens because students usually apply to several universities at the same time, hoping that the favorite one admits them; however, when that is not the case, they still have other options.
For now, we have two closely-related questions:
The second question:
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.title('APPLICATIONS VS ADMISSIONS')
sns.scatterplot(y=universitiesnw_df.admissions_total,x=universitiesnw_df.applicants_total,hue=universitiesnw_df.control_of_institution)
plt.ylabel('Number of Admissions')
plt.xlabel('Number of Applications')
plt.grid()
plt.subplot(1,2,2)
plt.title('ADMISSIONS VS ENROLLMENTS')
sns.scatterplot(x='admissions_total',y='enrolled_total',data=universitiesnw_df,hue='control_of_institution')
plt.ylabel('Number of Enrollments')
plt.xlabel('Number of Admissions')
plt.grid()
plt.tight_layout(pad=2)
If we look at the left figure, we can see that, with a few exceptions, universities with a high number of applications also have a high number of admissions, and vice versa. However, the universities with the highest number of applications are not the ones with the highest number of admissions. Additionally, there is a batch of private universities with a high amount of applications, but their number of admissions is pretty low.
If we look at the figure on the right, we can see that the higher the number of admissions, the higher the number of enrollments, too. However, it could be more meaningful to work with a rate of admissions or acceptance and an enrollment rate.
Additionally, we should address the number of applications and admissions to the number of enrollments because if a student applies to several universities, is admitted to several universities, in the end, he will enroll only in the preferred one.
Let's find the acceptance rate and enrollment rate.
The acceptance rate will be based on the number of applications; whereas, the enrollment rate on the number of admissions.
universitiesnw_df['acceptance_rate']=(universitiesnw_df.admissions_total/universitiesnw_df.applicants_total*100).round(2)
universitiesnw_df['enrollment_rate']=(universitiesnw_df.enrolled_total/universitiesnw_df.admissions_total*100).round(2)
plt.figure(figsize=(12,5))
sns.scatterplot(x='applicants_total',y='enrollment_rate',data=universitiesnw_df)
plt.title('APPLICATIONS VS ENROLLMENT RATE')
plt.ylabel('Enrollment Rate %')
plt.xlabel('Number of Applications');
This figure tells us that the universities which receive a lower number of applications are the ones with a higher enrollment rate. Obviously, there are some exceptions, but this is the strongest tendency. Consequently, we can say that a high number of applications does not mean that a university is preferred among students.
Now, let's answer the inquiry about the number of admissions.
plt.figure(figsize=(16,6))
sns.scatterplot(x='acceptance_rate',y='enrollment_rate',data=universitiesnw_df,hue=universitiesnw_df.control_of_institution)
plt.title('ACCEPTANCE VS ENROLLMENT RATES')
plt.ylabel('Enrollment Rate %')
plt.xlabel('Acceptance Rate %');
We can see that for high acceptance rates, the enrollment rate vastly varies among public and private universities; nonetheless, there is a higher concentration where the enrollment rate is not high. That leads us to think that the acceptance rate is not a feature that strongly influences the student's preference for a university. On the other hand, we can see that the lower the acceptance rate is, the higher the enrollment rate will be. Therefore, students do not necessarily prefer a university because of its high acceptance rate or because it is easier for students to be admitted to a university. Additionally, it's interesting to spot that the universities with the lowest acceptance rate are private universities, and several of them have a high enrollment rate. Thereby, though it is difficult for students to be admitted to these private institutions when they are accepted, they enroll straightforwardly. This will be part of our next question.
To corroborate the last finding, let's take a closer look at representative universities.
high_acceptance=universitiesnw_df[universitiesnw_df.acceptance_rate.notnull()][['name','acceptance_rate','enrollment_rate']].sort_values('acceptance_rate',ascending=False).head(25)
low_acceptance=universitiesnw_df[universitiesnw_df.acceptance_rate.notnull()][['name','acceptance_rate','enrollment_rate']].sort_values('acceptance_rate',ascending=False).tail(25)
plt.figure(figsize=(16,4))
plt.subplot(1,2,1)
ind = np.arange(len(high_acceptance)) #number of universities
width = 0.35 #space
plt.bar(ind, high_acceptance.acceptance_rate, width, label='Acceptance Rate')
plt.bar(ind + width, high_acceptance.enrollment_rate, width,label='Enrollment Rate')
plt.title('''Acceptance and Enrollment Rates.
25 Universities With the Highest Acceptance Rate ''')
plt.ylabel('Rates %')
plt.xticks(ind + width,high_acceptance.name.values,rotation=90 )
plt.legend(loc='best');
plt.subplot(1,2,2)
ind = np.arange(len(low_acceptance)) #number of universities
width = 0.35 #space
plt.bar(ind, low_acceptance.acceptance_rate, width, label='Acceptance Rate')
plt.bar(ind + width, low_acceptance.enrollment_rate, width,label='Enrollment Rate')
plt.title('''Acceptance and Enrollment Rates.
25 Universities With the Lowest Acceptance Rate ''')
plt.ylabel('Rates %')
plt.xticks(ind + width,high_acceptance.name.values,rotation=90 )
plt.legend(loc='best');
Now, we verified that there is no a clear pattern of enrollment rate when universities have a high acceptance rate.
In the case that universities have a low acceptance rate (more difficult to be admitted), again, there is no a specific pattern or tendency, but it is clear that the enrollment rate outnumbers the acceptance rate. Thereby, we can be sure that, in general, the students'preference is not based in how easy is for them to be admitted to a university. Again, there should be other most relevant features that influence the students' preference.
Another interesting feature that could influence on students is the type of university.
To find out whether students prefer private or public universities, one approach could be by examining the number of applications accordingly to the type of university.
#spliting the number of applications according to the type of control: private or public.
uni_private_df=universitiesnw_df[universitiesnw_df.control_of_institution=='Private not-for-profit']
uni_private_df=uni_private_df[uni_private_df.applicants_total.notnull()]
uni_public_df=universitiesnw_df[universitiesnw_df.control_of_institution=='Public']
uni_public_df=uni_public_df[uni_public_df.applicants_total.notnull()]
plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
plt.hist([uni_public_df.applicants_total,uni_private_df.applicants_total],stacked=True,bins=25)
plt.axis([0,31000,0,700])
plt.title('Distribution of Applications')
plt.xlabel('Number of Applications')
plt.ylabel('Universities')
plt.legend(['Public universities. ({})'.format(len(uni_public_df)),'Private universities. ({})'.format(len(uni_private_df))]);
plt.subplot(1,2,2)
sns.barplot(x=universitiesnw_df.control_of_institution,y=universitiesnw_df.applicants_total);
plt.title('''Average and Variation of Applications
According to the Type of Control''')
plt.xlabel('')
plt.ylabel('Number of Applications');
plt.tight_layout(pad=1)