Jovian
⭐️
Sign In

ANALYSIS OF AMERICAN UNIVERSITIES

WHAT MAKES A STUDENT PREFER A UNIVERSITY?

The aim of this project is to find out which are the most relevant features that students consider to choose the preferred American university. Some of the essential questions for developing this project are related to the number of applications, admissions, and enrollments, cost of tuition and fees, cost of living on campus, types of degrees offered, and features of the states where universities are located (population and GDP).

The dataset used for this analysis was taken from https://www.kaggle.com/sumithbhongale/american-university-data-ipeds-dataset. It contains a plethora of information about American universities (that are not necessarily the top 10-20) in 2013. Although this dataset does not contain information about all the first-ranked American universities, the patterns and insights extracted from it are highly representative of the whole behavior. The dataset contains more than a thousand rows (universities) and 145 columns (features about those universities). Several of those features are out of the scope of this project. Only the features that have information to answer the questions to achieve the goal of the project were deployed. The most powerful tools for data analysis used in this project are the packages Numpy and Pandas, and to visualize and explore the data: Matplotlib and Seaborn. All of these tools were meaningfully and efficiently taught in the course "Data Analysis with Python: Zero to Pandas" given by Jovian in partnership with freeCodeCamp.

As a first step, I'll upload this Jupyter notebook to Jovian.ai

In [1]:
file_name ='student_perfere_a_university'
In [2]:
!pip install jovian --upgrade -q
In [3]:
import jovian
In [4]:
jovian.commit(filename=file_name)
[jovian] Attempting to save notebook.. [jovian] Updating notebook "danycg85/student-perfere-a-university" on https://jovian.ai/ [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/danycg85/student-perfere-a-university

TABLE OF CONTENTS

1. Reading the Dataset.

This chapter presents the steps to read the dataset about American Universities.

2. Data Preparation: Cleaning and Formatting.

In all data-analysis projects, the data preparation step is not only necessary but also vital to find and handle features that could cause some problems while making the quantitative analysis, or that could lead to low efficient coding. According to Alivia Smith[1], this step usually takes up to 80% of the entire time of a data analysis project. Therefore, missing, invalid, and inconsistent values have been addressed. Finally, this step presents a code for changing the format of column names.

3. Exploratory Data Analysis (EDA) and Visualization. Quantitative and qualitative analysis (Asking and Answering Questions).

Although in many cases, the exploratory data analysis and the quantitative and qualitative analysis are separated steps, in this specific project, they have been joined. This step, previous to asking questions that could lead to reaching the project's aim, presents global-useful information about the different columns of the dataset. Some initial examinations were performed as a way of beginning finding patterns, creating hypotheses, and corroborating early assumptions. Later, deeper investigations were depicted as part of reaching the project's goal.

Quantitative and qualitative analysis: Asking and Answering Questions.

The hypotheses and questions generated to develop this projects are:

  1. Do universities with a high number of applications are the preferred ones by students?; in other words, could the number of applications tell us that a university is one of the most preferred by students?.
  2. Do students prefer universities that have a high rate of admission?, in other words, do students prefer a university where it is easier for them to be admitted?.
  3. Do students prefere public or private universities?
  4. Do students prefer universities with low tuition and fees?
  5. Do students prefer a university for its low on-campus cost of living?
  6. Do students prefer universities from highly populated states?
  7. Do students prefer a university because it belongs to a state with a high GDP per capita?
  8. Do students prefer a university based on the possibility of a higher, additional academic degree in the same university?

4. Inferences and Conclusions.

Although all conclusions and answers are exposed in the previous section accordingly to each question, this section also presents, in a concise manner, the most significant insights.

5. Future Work.

This section reveals some compelling features or ideas to elaborate on or to consider for future works.

6. References.

All the sources that have been helpful to develop this project are exposed in this section.

1. READING THE DATASET

We're going to begin by importing all the packages we'll use in this project.

In [5]:
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [6]:
#to center every figure in the notebook.
#from: https://stackoverflow.com/questions/18380168/center-output-plots-in-the-notebook
from IPython.core.display import HTML as Center

Center(""" <style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style> """)
Out[6]:
In [7]:
with open('universities_data.csv') as file:
    universities_df=pd.read_csv(file)

Because the file is a comma-separated value (CSV) file, we use the pandas function pd.read_csv() to open and read the file.

In [8]:
type(universities_df)
Out[8]:
pandas.core.frame.DataFrame

We have verified that the file has been created as a DataFrame variable. And we can see the first five rows of the dataset as a preview.

In [9]:
universities_df.head()
Out[9]:
In [10]:
jovian.commit(filename=file_name,files=['universities_data.csv'])
[jovian] Attempting to save notebook.. [jovian] Updating notebook "danycg85/student-perfere-a-university" on https://jovian.ai/ [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Uploading additional files... [jovian] Committed successfully! https://jovian.ai/danycg85/student-perfere-a-university

2. DATA TREATMENT: CLEANING AND FORMATTING

Let's begin this step by looking at the shape (number of rows and columns) of the dataset.

In [11]:
universities_df.shape
Out[11]:
(1534, 145)
In [12]:
print('The dataset contains {} rows and {} columns'.format(universities_df.shape[0],universities_df.shape[1]))
The dataset contains 1534 rows and 145 columns

Now, let's see some additional information.

In [13]:
universities_df.info(max_cols=len(universities_df))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1534 entries, 0 to 1533 Data columns (total 145 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID number 1534 non-null int64 1 Name 1534 non-null object 2 year 1534 non-null int64 3 ZIP code 1534 non-null object 4 Highest degree offered 1534 non-null object 5 County name 1534 non-null object 6 Longitude location of institution 1534 non-null object 7 Latitude location of institution 1534 non-null object 8 Religious affiliation 1532 non-null object 9 Offers Less than one year certificate 1532 non-null object 10 Offers One but less than two years certificate 1532 non-null object 11 Offers Associate's degree 1532 non-null object 12 Offers Two but less than 4 years certificate 1532 non-null object 13 Offers Bachelor's degree 1532 non-null object 14 Offers Postbaccalaureate certificate 1532 non-null object 15 Offers Master's degree 1532 non-null object 16 Offers Post-master's certificate 1532 non-null object 17 Offers Doctor's degree - research/scholarship 1532 non-null object 18 Offers Doctor's degree - professional practice 1532 non-null object 19 Offers Doctor's degree - other 1532 non-null object 20 Offers Other degree 1532 non-null object 21 Applicants total 1377 non-null float64 22 Admissions total 1377 non-null float64 23 Enrolled total 1377 non-null float64 24 Percent of freshmen submitting SAT scores 1257 non-null float64 25 Percent of freshmen submitting ACT scores 1259 non-null float64 26 SAT Critical Reading 25th percentile score 1169 non-null float64 27 SAT Critical Reading 75th percentile score 1169 non-null float64 28 SAT Math 25th percentile score 1182 non-null float64 29 SAT Math 75th percentile score 1182 non-null float64 30 SAT Writing 25th percentile score 714 non-null float64 31 SAT Writing 75th percentile score 714 non-null float64 32 ACT Composite 25th percentile score 1199 non-null float64 33 ACT Composite 75th percentile score 1199 non-null float64 34 Estimated enrollment, total 1532 non-null float64 35 Estimated enrollment, full time 1531 non-null float64 36 Estimated enrollment, part time 1523 non-null float64 37 Estimated undergraduate enrollment, total 1526 non-null float64 38 Estimated undergraduate enrollment, full time 1525 non-null float64 39 Estimated undergraduate enrollment, part time 1514 non-null float64 40 Estimated freshman undergraduate enrollment, total 1515 non-null float64 41 Estimated freshman enrollment, full time 1515 non-null float64 42 Estimated freshman enrollment, part time 1451 non-null float64 43 Estimated graduate enrollment, total 1432 non-null float64 44 Estimated graduate enrollment, full time 1422 non-null float64 45 Estimated graduate enrollment, part time 1420 non-null float64 46 Associate's degrees awarded 1532 non-null float64 47 Bachelor's degrees awarded 1532 non-null float64 48 Master's degrees awarded 1532 non-null float64 49 Doctor's degrese - research/scholarship awarded 1532 non-null float64 50 Doctor's degrees - professional practice awarded 1532 non-null float64 51 Doctor's degrees - other awarded 1532 non-null float64 52 Certificates of less than 1-year awarded 1532 non-null float64 53 Certificates of 1 but less than 2-years awarded 1532 non-null float64 54 Certificates of 2 but less than 4-years awarded 1532 non-null float64 55 Postbaccalaureate certificates awarded 1532 non-null float64 56 Post-master's certificates awarded 1532 non-null float64 57 Number of students receiving an Associate's degree 1532 non-null float64 58 Number of students receiving a Bachelor's degree 1532 non-null float64 59 Number of students receiving a Master's degree 1532 non-null float64 60 Number of students receiving a Doctor's degree 1532 non-null float64 61 Number of students receiving a certificate of less than 1-year 1532 non-null float64 62 Number of students receiving a certificate of 1 but less than 4-years 1532 non-null float64 63 Number of students receiving a Postbaccalaureate or Post-master's certificate 1532 non-null float64 64 Percent admitted - total 1376 non-null float64 65 Admissions yield - total 1376 non-null float64 66 Tuition and fees, 2010-11 1490 non-null float64 67 Tuition and fees, 2011-12 1490 non-null float64 68 Tuition and fees, 2012-13 1492 non-null float64 69 Tuition and fees, 2013-14 1497 non-null float64 70 Total price for in-state students living on campus 2013-14 1400 non-null float64 71 Total price for out-of-state students living on campus 2013-14 1400 non-null float64 72 State abbreviation 1534 non-null object 73 FIPS state code 1534 non-null object 74 Geographic region 1534 non-null object 75 Sector of institution 1534 non-null object 76 Level of institution 1534 non-null object 77 Control of institution 1534 non-null object 78 Historically Black College or University 1534 non-null object 79 Tribal college 1534 non-null object 80 Degree of urbanization (Urban-centric locale) 1534 non-null object 81 Carnegie Classification 2010: Basic 1534 non-null object 82 Total enrollment 1532 non-null float64 83 Full-time enrollment 1532 non-null float64 84 Part-time enrollment 1532 non-null float64 85 Undergraduate enrollment 1532 non-null float64 86 Graduate enrollment 1532 non-null float64 87 Full-time undergraduate enrollment 1532 non-null float64 88 Part-time undergraduate enrollment 1532 non-null float64 89 Percent of total enrollment that are American Indian or Alaska Native 1532 non-null float64 90 Percent of total enrollment that are Asian 1532 non-null float64 91 Percent of total enrollment that are Black or African American 1532 non-null float64 92 Percent of total enrollment that are Hispanic/Latino 1532 non-null float64 93 Percent of total enrollment that are Native Hawaiian or Other Pacific Islander 1532 non-null float64 94 Percent of total enrollment that are White 1532 non-null float64 95 Percent of total enrollment that are two or more races 1532 non-null float64 96 Percent of total enrollment that are Race/ethnicity unknown 1532 non-null float64 97 Percent of total enrollment that are Nonresident Alien 1532 non-null float64 98 Percent of total enrollment that are Asian/Native Hawaiian/Pacific Islander 1532 non-null float64 99 Percent of total enrollment that are women 1532 non-null float64 100 Percent of undergraduate enrollment that are American Indian or Alaska Native 1522 non-null float64 101 Percent of undergraduate enrollment that are Asian 1522 non-null float64 102 Percent of undergraduate enrollment that are Black or African American 1522 non-null float64 103 Percent of undergraduate enrollment that are Hispanic/Latino 1522 non-null float64 104 Percent of undergraduate enrollment that are Native Hawaiian or Other Pacific Islander 1522 non-null float64 105 Percent of undergraduate enrollment that are White 1522 non-null float64 106 Percent of undergraduate enrollment that are two or more races 1522 non-null float64 107 Percent of undergraduate enrollment that are Race/ethnicity unknown 1522 non-null float64 108 Percent of undergraduate enrollment that are Nonresident Alien 1522 non-null float64 109 Percent of undergraduate enrollment that are Asian/Native Hawaiian/Pacific Islander 1522 non-null float64 110 Percent of undergraduate enrollment that are women 1522 non-null float64 111 Percent of graduate enrollment that are American Indian or Alaska Native 1269 non-null float64 112 Percent of graduate enrollment that are Asian 1269 non-null float64 113 Percent of graduate enrollment that are Black or African American 1269 non-null float64 114 Percent of graduate enrollment that are Hispanic/Latino 1269 non-null float64 115 Percent of graduate enrollment that are Native Hawaiian or Other Pacific Islander 1269 non-null float64 116 Percent of graduate enrollment that are White 1269 non-null float64 117 Percent of graduate enrollment that are two or more races 1269 non-null float64 118 Percent of graduate enrollment that are Race/ethnicity unknown 1269 non-null float64 119 Percent of graduate enrollment that are Nonresident Alien 1269 non-null float64 120 Percent of graduate enrollment that are Asian/Native Hawaiian/Pacific Islander 1269 non-null float64 121 Percent of graduate enrollment that are women 1269 non-null float64 122 Number of first-time undergraduates - in-state 911 non-null float64 123 Percent of first-time undergraduates - in-state 911 non-null float64 124 Number of first-time undergraduates - out-of-state 911 non-null float64 125 Percent of first-time undergraduates - out-of-state 911 non-null float64 126 Number of first-time undergraduates - foreign countries 911 non-null float64 127 Percent of first-time undergraduates - foreign countries 911 non-null float64 128 Number of first-time undergraduates - residence unknown 911 non-null float64 129 Percent of first-time undergraduates - residence unknown 911 non-null float64 130 Graduation rate - Bachelor degree within 4 years, total 1476 non-null float64 131 Graduation rate - Bachelor degree within 5 years, total 1476 non-null float64 132 Graduation rate - Bachelor degree within 6 years, total 1476 non-null float64 133 Percent of freshmen receiving any financial aid 1492 non-null float64 134 Percent of freshmen receiving federal, state, local or institutional grant aid 1492 non-null float64 135 Percent of freshmen receiving federal grant aid 1492 non-null float64 136 Percent of freshmen receiving Pell grants 1492 non-null float64 137 Percent of freshmen receiving other federal grant aid 1492 non-null float64 138 Percent of freshmen receiving state/local grant aid 1492 non-null float64 139 Percent of freshmen receiving institutional grant aid 1492 non-null float64 140 Percent of freshmen receiving student loan aid 1492 non-null float64 141 Percent of freshmen receiving federal student loans 1492 non-null float64 142 Percent of freshmen receiving other loan aid 1492 non-null float64 143 Endowment assets (year end) per FTE enrollment (GASB) 516 non-null float64 144 Endowment assets (year end) per FTE enrollment (FASB) 960 non-null float64 dtypes: float64(114), int64(2), object(29) memory usage: 1.7+ MB

In this case, only three dtypes have been recognized: float64, int64, and object.

If we take a look at the number of non-null elements of each column we can see that many columns, for example the column SAT Writing 75th percentile score, contain several null or NaN values.

2.1 Working with missing, incorrect, and invalid data

Missing, incorrect, and invalid data need to be addressed to avert possible misleading mathematical operations. There are pandas functions, for example mean(), that exclude NaN values by default; however, other numerical operations consider all values, including missing values, to perform those operations. This will lead to wrong results. Incorrect and invalid data will also result in wrong outcomes. For these reasons, it is vital to tackle these different types of data according to their case.

-Missing values (NaN values)

Let's check the NaN values within the entire data frame. We will see a list of columns sorted according to the number of NaN values that they present.

In [14]:
universities_df.isna().sum().sort_values(ascending=False)
Out[14]:
Endowment assets (year end) per FTE enrollment (GASB)       1018
SAT Writing 75th percentile score                            820
SAT Writing 25th percentile score                            820
Number of first-time undergraduates - residence unknown      623
Percent of first-time undergraduates - residence unknown     623
                                                            ... 
Historically Black College or University                       0
Tribal college                                                 0
Degree of urbanization (Urban-centric locale)                  0
Carnegie Classification 2010: Basic                            0
ID number                                                      0
Length: 145, dtype: int64

Now, let's see only the columns with 20% or more of NaN values.

In [15]:
perc_nan=universities_df.isna().sum()/len(universities_df)*100
In [16]:
ax=perc_nan[perc_nan>=20].sort_values(ascending=False).plot.bar(title='Percentage of NaN values',figsize=(12,5));
ax.set_ylabel('% of NaN elements');
In [17]:
colum_off=universities_df.isna().sum()[universities_df.isna().sum()>=(0.2*len(universities_df))]
list_colum_off=colum_off.index.to_list()

All columns containing 20% of NaN values or more will be removed since those columns cannot be used as representative values. But before removing them, I'll create a copy of the data frame, and I'll continue working on this new data frame.

In [18]:
universitiesnw_df=universities_df.copy()
In [19]:
universitiesnw_df.drop(list_colum_off,axis=1,inplace=True)

From the remaining columns, only those that are related to the goal of this project will be selected to continue working on them.

In [20]:
interesting_columns=['Name', 'year', 'Highest degree offered', "Offers Bachelor's degree",
       "Offers Master's degree",
       "Offers Doctor's degree - research/scholarship",
       "Offers Doctor's degree - professional practice", 'Applicants total',
       'Admissions total', 'Enrolled total', 'Estimated enrollment, total',
       'Tuition and fees, 2013-14',
       'Total price for in-state students living on campus 2013-14',
       'Total price for out-of-state students living on campus 2013-14',
       'State abbreviation', 'Control of institution', 'Total enrollment',
       'Full-time enrollment', 'Part-time enrollment',
       'Undergraduate enrollment', 'Graduate enrollment',
       'Full-time undergraduate enrollment',
       'Part-time undergraduate enrollment',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are women',
       'Percent of graduate enrollment that are women',
       'Graduation rate - Bachelor degree within 4 years, total',
       'Graduation rate - Bachelor degree within 5 years, total',
       'Graduation rate - Bachelor degree within 6 years, total',
       ]
In [21]:
universitiesnw_df=universitiesnw_df[interesting_columns]

Now, some rows that contain a plethora of NaN values will also been removed.

In [22]:
universitiesnw_df[universitiesnw_df['Total enrollment'].isna()][['Name','Applicants total','Admissions total','Enrolled total','Total enrollment']]
Out[22]:
In [23]:
a=universitiesnw_df[universitiesnw_df['Name']=='University of North Georgia'].index[0]
b=universitiesnw_df[universitiesnw_df['Name']=='Texas A & M University-Galveston'].index[0]
universitiesnw_df=universitiesnw_df.drop([a,b],axis=0)
In [24]:
print('The data frame now has {} columns out of the {} original columns, and {} rows out of the {} original rows.'.format(universitiesnw_df.shape[1],universities_df.shape[1],universitiesnw_df.shape[0],universities_df.shape[0]))
The data frame now has 29 columns out of the 145 original columns, and 1532 rows out of the 1534 original rows.
-Invalid values

Now let's check if the data frame contains invalid values. These values can be more dangerous than missing values because they can produce errors that can be tricky to identify. First, we can try to find out if the data frame has negative values. It is supposed that, according to the context of this data frame, we will not find any negative value.

To perform this, we need to identify which columns contain numerical values. Only those columns will be assessed.

In [25]:
col=universitiesnw_df.select_dtypes(include=['float64','int64']).columns
In [26]:
lt=list()
for i in col:
    y=any(x < 0 for x in universitiesnw_df[i])
    if y==True:
        lt.append(y)
print('There are {} negative values in the data frame.'.format(len(lt)))
There are 0 negative values in the data frame.

Therefore, for this dataset, we do not need to be preoccupied with handling negative invalid values.

-Inconsistent values
In [27]:
universitiesnw_df.describe()
Out[27]:

According to the description chart, there are some universities which, for example, in 2013, had 0 applications in total, 0 admissions in total, and 0 enrollments in total. These are inconsistent values, and they need to be handled.

Let's find the total number of entries with 0 value.

In [28]:
total_zero=(universitiesnw_df[universitiesnw_df.loc[0:]==0]).count().sum()
In [29]:
print('This data set contains {} zero values.'.format(total_zero))
This data set contains 395 zero values.

In the specific case of this data set, the entries with a zero value do not have sense; in fact, the most probable reason for them is that the information was not available at that time, but it doesn't mean they are 0. Therefore, I prefer to replace those entries with NaN values. In this way, if a pandas function is applied, none of those NaN values will be considered.

In [30]:
universitiesnw_df.replace(0,np.nan,inplace=True)
In [31]:
total_zero_nw=universitiesnw_df[universitiesnw_df.loc[0:]==0].count().sum()
In [32]:
print('This data set contains {} zero values.'.format(total_zero_nw))
This data set contains 0 zero values.
In [33]:
universitiesnw_df[['Name','Applicants total']].sort_values('Applicants total').head()
Out[33]:

According to the chart above, in the case of the column Applicants total, the possible inconsistent value belongs to Goddard College: Applicants total=4. However, if we search a little more about this university, we'll find that this number of applications is, in fact, consistent with their data over the years. This is corroborated with the data found in:
https://www.collegefactual.com/colleges/goddard-college/applying/entering-class-stats/

2.2 Giving format

Now, in order to use the column names as properties and take advantage of it, it's necessary to remove some special characters that some column names contain, for example, apostrophe ', hyphen -, quotation marks ", colon :, and slash /. Also, some column names contain spaces between their words; these spaces will be removed as well.

In [34]:
universitiesnw_df.columns
Out[34]:
Index(['Name', 'year', 'Highest degree offered', 'Offers Bachelor's degree',
       'Offers Master's degree',
       'Offers Doctor's degree - research/scholarship',
       'Offers Doctor's degree - professional practice', 'Applicants total',
       'Admissions total', 'Enrolled total', 'Estimated enrollment, total',
       'Tuition and fees, 2013-14',
       'Total price for in-state students living on campus 2013-14',
       'Total price for out-of-state students living on campus 2013-14',
       'State abbreviation', 'Control of institution', 'Total enrollment',
       'Full-time enrollment', 'Part-time enrollment',
       'Undergraduate enrollment', 'Graduate enrollment',
       'Full-time undergraduate enrollment',
       'Part-time undergraduate enrollment',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are women',
       'Percent of graduate enrollment that are women',
       'Graduation rate - Bachelor degree within 4 years, total',
       'Graduation rate - Bachelor degree within 5 years, total',
       'Graduation rate - Bachelor degree within 6 years, total'],
      dtype='object')
In [35]:
# def remove_space(header):
    
#     '''This function takes all the spaces between the words of column names and replaces them
#     with '_' . 
    
#     The argument header corresponds to a column name. '''
    
#     list1=list()
#     words_header=header.split()    
#     size=int(len(words_header))
    
#     for i in range(len(words_header)):
#         if i<size-1:
#             list1.append(words_header[i]+'_')            
#         else:
#             list1.append(words_header[i])
                
#     separator = ''
#     final=separator.join(list1)    
    
#     return final
In [36]:
def remove_space(list_headers,charact): #charact should be: charact=[' - ',' ']
    new_headers=list()
    for header in list_headers:
        for char in charact:      
            if char in header:
                header=header.replace(char,'_')
            header=header
        new_headers.append(header)
    return new_headers
In [37]:
def remove_sp_char(header):
    
    '''This function takes all the special characters found in column names and replaces them
    with other character accordingly to the case. 
    
    The argument header corresponds to a column name. '''
    
    if "'" in header:
        header=header.replace("'",'')
    if "," in header:
        header=header.replace(",",'')
    if "_-_" in header:
        header=header.replace("_-_",'_')
    if "/" in header:
        header=header.replace("/",'_or_')
    if ":" in header:
        header=header.replace(":",'')
    if "-" in header:
        header=header.replace("-",'_')
        
    return header
In [38]:
def remove_sp_char(headers,chars):
    new_headers=list()
    for header in headers:    
        for char in chars:
            if char=='-' or char=='/':
                header=header.replace(char,'_')
            if char in header:
                header=header.replace(char,'')
            
            header=header   
        new_headers.append(header)
    return new_headers    

Besides removing spaces and replacing special characters, I'll change capitalized letters to avoid a typo of this kind.

In [39]:
headers=remove_space(universitiesnw_df.columns,[' - ',' '])
In [40]:
headers=remove_sp_char(headers,["'",',',':','-','/'])
In [41]:
list_new_header=list()

for header in headers:
    header=header.casefold()   # All capitalized letters are changed.
    
    if "degrese" in header:    # One column name has a typo.  
        header=header.replace("degrese",'degrees')
            
    list_new_header.append(header)

The next step is to replace the original column names with the new ones, which have the desired format.

In [42]:
universitiesnw_df.columns=list_new_header
In [43]:
universitiesnw_df.columns
Out[43]:
Index(['name', 'year', 'highest_degree_offered', 'offers_bachelors_degree',
       'offers_masters_degree', 'offers_doctors_degree_research_scholarship',
       'offers_doctors_degree_professional_practice', 'applicants_total',
       'admissions_total', 'enrolled_total', 'estimated_enrollment_total',
       'tuition_and_fees_2013_14',
       'total_price_for_in_state_students_living_on_campus_2013_14',
       'total_price_for_out_of_state_students_living_on_campus_2013_14',
       'state_abbreviation', 'control_of_institution', 'total_enrollment',
       'full_time_enrollment', 'part_time_enrollment',
       'undergraduate_enrollment', 'graduate_enrollment',
       'full_time_undergraduate_enrollment',
       'part_time_undergraduate_enrollment',
       'percent_of_total_enrollment_that_are_women',
       'percent_of_undergraduate_enrollment_that_are_women',
       'percent_of_graduate_enrollment_that_are_women',
       'graduation_rate_bachelor_degree_within_4_years_total',
       'graduation_rate_bachelor_degree_within_5_years_total',
       'graduation_rate_bachelor_degree_within_6_years_total'],
      dtype='object')

For convenience, the name of the column state_abbreviation will be replaced by just state.

In [44]:
universitiesnw_df.rename(columns={'state_abbreviation':'state'}, inplace=True)
In [45]:
universitiesnw_df[['state']].head(2)
Out[45]:
In [46]:
jovian.commit(filename=file_name,files=['universities_data.csv'])
[jovian] Attempting to save notebook.. [jovian] Updating notebook "danycg85/student-perfere-a-university" on https://jovian.ai/ [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Uploading additional files... [jovian] Committed successfully! https://jovian.ai/danycg85/student-perfere-a-university

3. Exploratory Data Analysis (EDA) and Visualization. Quantitative and qualitative analysis (Asking and Answering Questions).

To find insight from our dataset, we'll explore its variables. This exploratory analysis will help us to understand how the variables interact and how representative they are. In the end, it'll help us to formulate hypotheses to reach the goal of the project and think about future works.

In [47]:
matplotlib.rcParams['figure.facecolor']='whitesmoke'

Let's begin this step by looking for information about each column.

In [48]:
from IPython.display import display
with pd.option_context('display.max_columns',None):
    display(universitiesnw_df.describe())

It's interesting to see that in 2013 one university received around 72000 applications; whereas, another received only 4 applications in the same year. So, let's see which universities received the highest number of applications.

In [49]:
high_app_df=universitiesnw_df[['name','applicants_total']].sort_values('applicants_total',ascending=False).head(20)
In [50]:
plt.figure(figsize=(12,8))
matplotlib.rcParams['font.size']=14
sns.barplot(x='applicants_total',y='name',data=high_app_df)
plt.title('Top 20 American Universities with the Most Applications in 2013')
plt.xlabel('Number of applications')
plt.ylabel('');

We could think that the universities with the highest number of applications are the preferred ones by students. However, one student can send several applications at the same time, but only one university could be the preferred one; the other universities could be just in case of not being admitted to the preferred one. To find out if the universities with the most applications are the preferred ones, let's find some relationships with the number of admissions and enrollments.

In [51]:
plt.figure(figsize=(16,6))

plt.subplot(1,3,1)
sns.histplot(universitiesnw_df.applicants_total,bins=50)
plt.title('''Histogram of Number of Applications. 
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.applicants_total.mean(),universitiesnw_df.applicants_total.median()));
plt.xlabel('Number of Applications')
plt.axis([0,30000,0,350])
plt.xticks(rotation=10)
plt.grid()

plt.subplot(1,3,2)
sns.histplot(universitiesnw_df.admissions_total,bins=50)
plt.title('''Histogram of Number of Admissions. 
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.admissions_total.mean(),universitiesnw_df.admissions_total.median()));
plt.axis([0,10000,0,350])
plt.xlabel('Number of Admissions')
plt.xticks(rotation=10)
plt.grid()

plt.subplot(1,3,3)
sns.histplot(universitiesnw_df.enrolled_total,bins=50)
plt.title('''Histogram of Number of Enrollments. 
Mean: {:.1f}, Median: {:.1f}'''.format(universitiesnw_df.enrolled_total.mean(),universitiesnw_df.enrolled_total.median()));
plt.axis([0,5000,0,350])
plt.xlabel('Number of Enrollments')
plt.grid()
plt.xticks(rotation=10)
plt.tight_layout(pad=2);

As we can see, the majority of universities received less than 5000 applications in 2013. However, some universities received a much higher number of applications, as many as 30000 applications in the same year.

We also can see that the average number of admissions is much lower than that of applications. The majority of universities admitted to less than 2000 students in 2013; however, some other universities presented a much higher number of admissions, more than 10000 admissions. As a high rate of admissions could represent a high possibility to be accepted, we could ask: Do students prefer universities with a high rate of admission?

Talking about enrollments, we'll find lower numbers when compared to the number of applications and admissions. That happens because students usually apply to several universities at the same time, hoping that the favorite one admits them; however, when that is not the case, they still have other options.

For now, we have two closely-related questions:

Q: Do universities with a high number of applications are the preferred ones by students?; in other words, could the number of applications tell us that a university is one of the most preferred by students?.

The second question:

Q: Do students prefer universities that have a high rate of admission?, in other words, do students prefer a university where it is easier for them to be admitted?.
In [52]:
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.title('APPLICATIONS VS ADMISSIONS')
sns.scatterplot(y=universitiesnw_df.admissions_total,x=universitiesnw_df.applicants_total,hue=universitiesnw_df.control_of_institution)
plt.ylabel('Number of Admissions')
plt.xlabel('Number of Applications')
plt.grid()

plt.subplot(1,2,2)
plt.title('ADMISSIONS VS ENROLLMENTS')
sns.scatterplot(x='admissions_total',y='enrolled_total',data=universitiesnw_df,hue='control_of_institution')
plt.ylabel('Number of Enrollments')
plt.xlabel('Number of Admissions')
plt.grid()

plt.tight_layout(pad=2)

If we look at the left figure, we can see that, with a few exceptions, universities with a high number of applications also have a high number of admissions, and vice versa. However, the universities with the highest number of applications are not the ones with the highest number of admissions. Additionally, there is a batch of private universities with a high amount of applications, but their number of admissions is pretty low.

If we look at the figure on the right, we can see that the higher the number of admissions, the higher the number of enrollments, too. However, it could be more meaningful to work with a rate of admissions or acceptance and an enrollment rate.

Additionally, we should address the number of applications and admissions to the number of enrollments because if a student applies to several universities, is admitted to several universities, in the end, he will enroll only in the preferred one.

Let's find the acceptance rate and enrollment rate.

The acceptance rate will be based on the number of applications; whereas, the enrollment rate on the number of admissions.

In [53]:
universitiesnw_df['acceptance_rate']=(universitiesnw_df.admissions_total/universitiesnw_df.applicants_total*100).round(2)
In [54]:
universitiesnw_df['enrollment_rate']=(universitiesnw_df.enrolled_total/universitiesnw_df.admissions_total*100).round(2)
In [55]:
plt.figure(figsize=(12,5))
sns.scatterplot(x='applicants_total',y='enrollment_rate',data=universitiesnw_df)
plt.title('APPLICATIONS VS ENROLLMENT RATE')
plt.ylabel('Enrollment Rate %')
plt.xlabel('Number of Applications');

This figure tells us that the universities which receive a lower number of applications are the ones with a higher enrollment rate. Obviously, there are some exceptions, but this is the strongest tendency. Consequently, we can say that a high number of applications does not mean that a university is preferred among students.

Now, let's answer the inquiry about the number of admissions.

In [56]:
plt.figure(figsize=(16,6))
sns.scatterplot(x='acceptance_rate',y='enrollment_rate',data=universitiesnw_df,hue=universitiesnw_df.control_of_institution)
plt.title('ACCEPTANCE VS ENROLLMENT RATES')
plt.ylabel('Enrollment Rate %')
plt.xlabel('Acceptance Rate %');

We can see that for high acceptance rates, the enrollment rate vastly varies among public and private universities; nonetheless, there is a higher concentration where the enrollment rate is not high. That leads us to think that the acceptance rate is not a feature that strongly influences the student's preference for a university. On the other hand, we can see that the lower the acceptance rate is, the higher the enrollment rate will be. Therefore, students do not necessarily prefer a university because of its high acceptance rate or because it is easier for students to be admitted to a university. Additionally, it's interesting to spot that the universities with the lowest acceptance rate are private universities, and several of them have a high enrollment rate. Thereby, though it is difficult for students to be admitted to these private institutions when they are accepted, they enroll straightforwardly. This will be part of our next question.

To corroborate the last finding, let's take a closer look at representative universities.

In [57]:
high_acceptance=universitiesnw_df[universitiesnw_df.acceptance_rate.notnull()][['name','acceptance_rate','enrollment_rate']].sort_values('acceptance_rate',ascending=False).head(25)
In [58]:
low_acceptance=universitiesnw_df[universitiesnw_df.acceptance_rate.notnull()][['name','acceptance_rate','enrollment_rate']].sort_values('acceptance_rate',ascending=False).tail(25)
In [59]:
plt.figure(figsize=(16,4))

plt.subplot(1,2,1)
ind = np.arange(len(high_acceptance)) #number of universities
width = 0.35       #space

plt.bar(ind, high_acceptance.acceptance_rate, width, label='Acceptance Rate')
plt.bar(ind + width, high_acceptance.enrollment_rate, width,label='Enrollment Rate')
plt.title('''Acceptance and Enrollment Rates.
25 Universities With the Highest Acceptance Rate ''')
plt.ylabel('Rates %')
plt.xticks(ind + width,high_acceptance.name.values,rotation=90 )
plt.legend(loc='best');

plt.subplot(1,2,2)
ind = np.arange(len(low_acceptance)) #number of universities
width = 0.35       #space

plt.bar(ind, low_acceptance.acceptance_rate, width, label='Acceptance Rate')
plt.bar(ind + width, low_acceptance.enrollment_rate, width,label='Enrollment Rate')
plt.title('''Acceptance and Enrollment Rates.
25 Universities With the Lowest Acceptance Rate ''')
plt.ylabel('Rates %')
plt.xticks(ind + width,high_acceptance.name.values,rotation=90 )
plt.legend(loc='best');

Now, we verified that there is no a clear pattern of enrollment rate when universities have a high acceptance rate.

In the case that universities have a low acceptance rate (more difficult to be admitted), again, there is no a specific pattern or tendency, but it is clear that the enrollment rate outnumbers the acceptance rate. Thereby, we can be sure that, in general, the students'preference is not based in how easy is for them to be admitted to a university. Again, there should be other most relevant features that influence the students' preference.

Another interesting feature that could influence on students is the type of university.

Q: Do students prefer public or private universities?

To find out whether students prefer private or public universities, one approach could be by examining the number of applications accordingly to the type of university.

In [60]:
#spliting the number of applications according to the type of control: private or public.
uni_private_df=universitiesnw_df[universitiesnw_df.control_of_institution=='Private not-for-profit']
uni_private_df=uni_private_df[uni_private_df.applicants_total.notnull()]

uni_public_df=universitiesnw_df[universitiesnw_df.control_of_institution=='Public']
uni_public_df=uni_public_df[uni_public_df.applicants_total.notnull()]
In [61]:
plt.figure(figsize=(16,7))

plt.subplot(1,2,1)
plt.hist([uni_public_df.applicants_total,uni_private_df.applicants_total],stacked=True,bins=25)
plt.axis([0,31000,0,700])
plt.title('Distribution of Applications')
plt.xlabel('Number of Applications')
plt.ylabel('Universities')
plt.legend(['Public universities. ({})'.format(len(uni_public_df)),'Private universities. ({})'.format(len(uni_private_df))]);

plt.subplot(1,2,2)
sns.barplot(x=universitiesnw_df.control_of_institution,y=universitiesnw_df.applicants_total);
plt.title('''Average and Variation of Applications 
According to the Type of Control''')
plt.xlabel('')
plt.ylabel('Number of Applications');
plt.tight_layout(pad=1)