Exploratory Data Analysis on 2017 freeCodeCamp Survey

This project is the result of the knowledge acquired during the course Data Analysis with Python: Zero to Pandas offered by Jovian.ml in partnership with freeCodeCamp.

For this project, we chose the open dataset 2017-new-coder-survey, which contains data collected from freeCodeCamp's 2017 survey of more than 20,000 developers. The main goal is to perform an initial exploratory data analysis and draw some insights from the collected data. Several Python libraries are used for data manipulation, cleaning, and visualization.

Let's install the Python libraries that we will be using.

In [1]:
%%capture
! pip install numpy pandas matplotlib seaborn wordcloud jovian --upgrade
In [2]:
import jovian
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud
import warnings

warnings.filterwarnings('ignore')

project_name='eda-freecodecamp-survey'
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "rocio-x-linares95/eda-freecodecamp-survey" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/rocio-x-linares95/eda-freecodecamp-survey

Data loading

The open dataset 2017-new-coder-survey is composed of two files:

  • 2017-new-coder-survey-part-1.csv - the first half of the survey. 100% of respondents completed this section.
  • 2017-new-coder-survey-part-2.csv - the first half of the survey, plus the second half - which about 95% of respondents also completed.

These files have a column in common: Network ID. So, a single dataset can be built using this shared key.

In [3]:
fcc1_df = pd.read_csv('2017-new-coder-survey-part-1.csv')
fcc2_df = pd.read_csv('2017-new-coder-survey-part-2.csv')
fcc_survey_raw_df=fcc1_df.merge(fcc2_df,on='Network ID',how='inner')
fcc_survey_raw_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19526 entries, 0 to 19525
Columns: 153 entries, #_x to Submit Date (UTC)_y
dtypes: float64(28), object(125)
memory usage: 22.9+ MB
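
Because the join relies on Network ID being a shared key, it can help to check how the keys line up before trusting the merged row count. The following is a minimal sketch on two tiny made-up frames (not the survey files); the columns `q1` and `q2` and all values are hypothetical. It uses the `indicator` option of `merge` to label where each key was found, and `duplicated` to count repeated keys.

```python
import pandas as pd

# Toy stand-ins for the two survey files (hypothetical values).
part1 = pd.DataFrame({'Network ID': ['a', 'b', 'c', 'c'], 'q1': [1, 2, 3, 3]})
part2 = pd.DataFrame({'Network ID': ['a', 'c', 'd'], 'q2': [10, 30, 40]})

# An outer merge with indicator=True adds a '_merge' column that says
# whether each row's key came from the left file, the right file, or both.
check = part1.merge(part2, on='Network ID', how='outer', indicator=True)
key_origin = check['_merge'].value_counts()

# Keys that repeat within one file produce one merged row per repetition.
repeated_keys = int(part1['Network ID'].duplicated().sum())

print(key_origin)
print(f'Repeated keys in part 1: {repeated_keys}')
```

With `how='inner'`, only the rows labelled `both` would remain, so this kind of check gives a rough idea of how many rows an inner join discards.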

Data Preparation and Cleaning

Select wanted columns

The dataset contains a lot of information, so we'll limit this analysis to the following areas:

  • Demographics of the survey respondents
  • Learning techniques preferences
  • Employment-related information and opinions

Let's select a subset of columns with the relevant data for our analysis.

In [4]:
survey_id = 'Network ID'

# Demographic info
# ================
gender = "What's your gender?"
age = "How old are you?"
location = "Which country do you currently live in?"
language = "Which language do you you speak at home with your family?"

# Work info
# =========

# - current work
already_working = "already_working"
employment_status  = "Regarding employment status, are you currently..." 
other_employment_status = "Other.1_y"
work_field = "Which field do you work in?"
other_work_field = "Other.2_y"
last_year_earnings = "About how much money did you make last year (in US dollars)?"

# - work preferences
home_or_remote = "Would you prefer to work..."
expected_earnings = "expected_earnings"
job_interests = fcc_survey_raw_df.columns[5:18]

# Learning path info
# ===================

week_learning_hours = "About how many hours do you spend learning each week?"
programing_months = "About how many months have you been programming for?"
degree = "What's the highest degree or level of school you have completed?"
learning_resources= fcc_survey_raw_df.columns[22: 42]
event_types = fcc_survey_raw_df.columns[42: 57]
podcasts = fcc_survey_raw_df.columns[57: 73]


# Normalize the name of the columns
renamed_columns = {x: x.strip() for x in job_interests}
renamed_columns['Other_x'] = 'Other job interests'
renamed_columns['Other.1_x'] = 'Other learning resources'
renamed_columns['Other.2_x'] = 'Other event types'
renamed_columns['Other.3'] = 'Other podcasts'

fcc_survey_raw_df.rename(columns=renamed_columns, inplace=True)
In [5]:
selected_columns = [
    survey_id,
    
    # demographic info
    gender,
    age,
    location,
    language,
    
    # current work
    already_working,
    employment_status,
    other_employment_status,
    work_field,
    other_work_field,
    last_year_earnings,

    # work preferences
    home_or_remote,
    expected_earnings,
    
    # learning path info
    week_learning_hours,
    programing_months,
    degree,
]

# Update column names
job_interests = fcc_survey_raw_df.columns[5:18]
learning_resources= fcc_survey_raw_df.columns[22: 42]
event_types = fcc_survey_raw_df.columns[42: 57]
podcasts = fcc_survey_raw_df.columns[57: 73]

Let's create a DataFrame with the subset of columns with the relevant data for our analysis.

In [6]:
# Extract a copy of the data from these columns into a new data frame
all_selected_columns = selected_columns + job_interests.to_list() + learning_resources.to_list() + event_types.to_list() + podcasts.to_list()
fcc_survey_df = fcc_survey_raw_df[all_selected_columns].copy()
In [7]:
fcc_survey_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19526 entries, 0 to 19525
Data columns (total 80 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Network ID  19526 non-null  object
 1   What's your gender?  19136 non-null  object
 2   How old are you?  19034 non-null  float64
 3   Which country do you currently live in?  18993 non-null  object
 4   Which language do you you speak at home with your family?  18994 non-null  object
 5   already_working  19416 non-null  object
 6   Regarding employment status, are you currently...  17705 non-null  object
 7   Other.1_y  1073 non-null  object
 8   Which field do you work in?  9940 non-null  object
 9   Other.2_y  1290 non-null  object
 10  About how much money did you make last year (in US dollars)?  9813 non-null  float64
 11  Would you prefer to work...  8772 non-null  object
 12  expected_earnings  7608 non-null  object
 13  About how many hours do you spend learning each week?  18109 non-null  float64
 14  About how many months have you been programming for?  18510 non-null  float64
 15  What's the highest degree or level of school you have completed?  19045 non-null  object
 16  Full-Stack Web Developer  5269 non-null  object
 17  Back-End Web Developer  3492 non-null  object
 18  Front-End Web Developer  4433 non-null  object
 19  Mobile Developer  2927 non-null  object
 20  DevOps / SysAdmin  1198 non-null  object
 21  Data Scientist  2067 non-null  object
 22  Quality Assurance Engineer  636 non-null  object
 23  User Experience Designer  1864 non-null  object
 24  Product Manager  1014 non-null  object
 25  Game Developer  2086 non-null  object
 26  Information Security  1701 non-null  object
 27  Data Engineer  1592 non-null  object
 28  Other job interests  299 non-null  object
 29  freeCodeCamp  14863 non-null  object
 30  EdX  3483 non-null  object
 31  Coursera  4750 non-null  object
 32  Khan Academy  4181 non-null  object
 33  Pluralsight / Code School  2597 non-null  object
 34  Codecademy  10249 non-null  object
 35  Udacity  4129 non-null  object
 36  Udemy  5537 non-null  object
 37  Code Wars  2058 non-null  object
 38  The Odin Project  1051 non-null  object
 39  Treehouse  2489 non-null  object
 40  Lynda.com  2826 non-null  object
 41  Stack Overflow  12123 non-null  object
 42  W3Schools  10588 non-null  object
 43  Skillcrush  507 non-null  object
 44  HackerRank  2203 non-null  object
 45  Mozilla Developer Network (MDN)  6927 non-null  object
 46  Egghead.io  1454 non-null  object
 47  CSS Tricks  5125 non-null  object
 48  Other learning resources  1079 non-null  object
 49  freeCodeCamp study groups  1832 non-null  object
 50  hackathons  2167 non-null  object
 51  conferences  1750 non-null  object
 52  workshops  1892 non-null  object
 53  Startup Weekend  552 non-null  object
 54  NodeSchool  474 non-null  object
 55  Women Who Code  513 non-null  object
 56  Girl Develop It  347 non-null  object
 57  Meetup.com events  2744 non-null  object
 58  RailsBridge  155 non-null  object
 59  Game Jam  339 non-null  object
 60  Rails Girls  147 non-null  object
 61  Django Girls  172 non-null  object
 62  weekend bootcamps  575 non-null  object
 63  Other event types  1724 non-null  object
 64  Code Newbie  1784 non-null  object
 65  The Changelog  450 non-null  object
 66  Software Engineering Daily  837 non-null  object
 67  JavaScript Jabber  1211 non-null  object
 68  Ruby Rogues  344 non-null  object
 69  Shop Talk Show  369 non-null  object
 70  Developer Tea  781 non-null  object
 71  Programming Throwdown  340 non-null  object
 72  .NET Rocks  344 non-null  object
 73  Talk Python To Me  713 non-null  object
 74  JavaScript Air  742 non-null  object
 75  The Web Ahead  328 non-null  object
 76  CodePen Radio  820 non-null  object
 77  Giant Robots Smashing into Other Giant Robots  194 non-null  object
 78  Software Engineering Radio  421 non-null  object
 79  Other podcasts  2023 non-null  object
dtypes: float64(4), object(76)
memory usage: 12.1+ MB
In [8]:
fcc_survey_df.describe()
Out[8]:

These reports summarize the selected data. The summary reveals many irregularities, such as columns with incorrect data types and a lot of missing values. Let's apply some techniques to clean and normalize this data.

Remove duplicate rows

In [9]:
initial_rows = fcc_survey_df.shape[0]
fcc_survey_df.drop_duplicates(survey_id, inplace=True)
cleaned_rows = fcc_survey_df.shape[0]

print(f'- {initial_rows - cleaned_rows} duplicate rows were successfully removed. That is a huge amount of data that would have caused inaccurate analysis.')
- 1519 duplicate rows were successfully removed. That is a huge amount of data that would have caused inaccurate analysis.
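
As a small illustration of what `drop_duplicates` does when it is given a key column, here is a sketch on made-up data (the `age` values are hypothetical). It counts the repeated IDs with `duplicated` first, then shows that only the first row per ID is kept, even when the other columns differ.

```python
import pandas as pd

# Toy frame with repeated IDs (hypothetical values).
toy = pd.DataFrame({
    'Network ID': ['a', 'a', 'b', 'c', 'c'],
    'age': [30, 31, 25, 40, 40],
})

# Rows flagged True are later occurrences of an ID that was already seen.
dup_mask = toy.duplicated(subset='Network ID', keep='first')
n_duplicates = int(dup_mask.sum())

# drop_duplicates keeps the first row per ID and discards the rest.
deduped = toy.drop_duplicates(subset='Network ID')

print(f'{n_duplicates} repeated IDs')
print(deduped)
```

Note that for ID `a` the two rows disagree on `age`, and the second answer is silently discarded; inspecting `toy[dup_mask]` before dropping is one way to see what is being lost.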

Assign correct datatype to the columns

In [10]:
# Convert the age column to numeric values; entries that cannot be parsed become NaN
fcc_survey_df[age] = pd.to_numeric(fcc_survey_df[age], errors='coerce', downcast='unsigned')

# Convert the expected_earnings column to numeric values; entries that cannot be parsed become NaN
fcc_survey_df[expected_earnings] = pd.to_numeric(fcc_survey_df[expected_earnings], errors='coerce')
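
The `errors='coerce'` option turns any value that cannot be parsed into NaN instead of raising an error. A minimal sketch on a made-up series (the values are hypothetical, not taken from the survey) shows the effect and one way to count how many entries were lost in the conversion.

```python
import pandas as pd

# Toy column mixing numeric strings, free text, and a missing value.
raw = pd.Series(['25', '40', 'thirty', None, '50000'])

# Unparseable entries become NaN rather than raising a ValueError.
as_numbers = pd.to_numeric(raw, errors='coerce')

# Entries that were present before but are NaN after the conversion.
n_lost = int(as_numbers.isna().sum() - raw.isna().sum())

print(as_numbers)
print(f'{n_lost} value(s) could not be converted')
```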

Handle null data. Drop & Replace

In [11]:
# Drop the rows where the age is lower than 10 or higher than 100 years.
fcc_survey_df.drop(fcc_survey_df[fcc_survey_df[age] < 10].index, inplace=True)
fcc_survey_df.drop(fcc_survey_df[fcc_survey_df[age] > 100].index, inplace=True)

# Drop the rows where programing_months is higher than 500 months (~42 years).
fcc_survey_df.drop(fcc_survey_df[fcc_survey_df[programing_months] > 500].index, inplace=True)

# Set the age to NaN when the respondent reports programming for longer than their age.
fcc_survey_df[age] = np.where(fcc_survey_df[age]< (fcc_survey_df[programing_months]/12), np.nan, fcc_survey_df[age])

# Drop the rows where week_learning_hours is higher than 140 hours.
fcc_survey_df.drop(fcc_survey_df[fcc_survey_df[week_learning_hours] > 140].index, inplace=True)
values = {week_learning_hours: 0}
# NOTE: fillna returns a new DataFrame and the result is not assigned here,
# so missing learning hours stay as NaN (the outputs below reflect this).
fcc_survey_df.fillna(value=values)

# Fill nan values in some columns
fcc_survey_df[gender] = np.where(fcc_survey_df[gender].isna(), 'other', fcc_survey_df[gender])
fcc_survey_df[gender] = np.where((fcc_survey_df[gender].str.contains('genderqueer', na=False)), 'other', fcc_survey_df[gender])
fcc_survey_df[gender] = np.where((fcc_survey_df[gender].str.contains('agender', na=False)), 'other', fcc_survey_df[gender])
fcc_survey_df[gender] = np.where((fcc_survey_df[gender].str.contains('trans', na=False)), 'other', fcc_survey_df[gender])

fcc_survey_df[work_field] = np.where(fcc_survey_df[work_field].isna(), fcc_survey_df[other_work_field], fcc_survey_df[work_field])
fcc_survey_df[work_field] = np.where(fcc_survey_df[work_field].isna(), 'Other', fcc_survey_df[work_field])

fcc_survey_df[other_employment_status] = fcc_survey_df[other_employment_status].str.lower()
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('scho', na=False)), 'Student', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('learn', na=False)), 'Student', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('stud', na=False)), 'Student', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('bootcamp', na=False)), 'Student', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('internship', na=False)), 'Intership', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('intership', na=False)), 'Intership', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[employment_status].str.contains('Doing an unpaid internship', na=False)), 'Intership', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where((fcc_survey_df[other_employment_status].str.contains('freelance', na=False)), 'Self-employed freelancer', fcc_survey_df[employment_status])
fcc_survey_df[employment_status] = np.where(fcc_survey_df[employment_status].isna(), 'Other', fcc_survey_df[employment_status])

fcc_survey_df.drop(other_employment_status, inplace=True, axis=1)
fcc_survey_df.drop(other_work_field, inplace=True, axis=1)

# Drop rows with nan values in some columns
fcc_survey_df.dropna(how='any', subset=[ survey_id, gender, age, language, location, already_working], inplace=True)
In [12]:
fcc_survey_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17175 entries, 0 to 19525
Data columns (total 78 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Network ID  17175 non-null  object
 1   What's your gender?  17175 non-null  object
 2   How old are you?  17175 non-null  float64
 3   Which country do you currently live in?  17175 non-null  object
 4   Which language do you you speak at home with your family?  17175 non-null  object
 5   already_working  17175 non-null  object
 6   Regarding employment status, are you currently...  17175 non-null  object
 7   Which field do you work in?  17175 non-null  object
 8   About how much money did you make last year (in US dollars)?  9012 non-null  float64
 9   Would you prefer to work...  7825 non-null  object
 10  expected_earnings  6828 non-null  float64
 11  About how many hours do you spend learning each week?  16116 non-null  float64
 12  About how many months have you been programming for?  16479 non-null  float64
 13  What's the highest degree or level of school you have completed?  17019 non-null  object
 14  Full-Stack Web Developer  4697 non-null  object
 15  Back-End Web Developer  3145 non-null  object
 16  Front-End Web Developer  3981 non-null  object
 17  Mobile Developer  2607 non-null  object
 18  DevOps / SysAdmin  1059 non-null  object
 19  Data Scientist  1848 non-null  object
 20  Quality Assurance Engineer  563 non-null  object
 21  User Experience Designer  1659 non-null  object
 22  Product Manager  900 non-null  object
 23  Game Developer  1856 non-null  object
 24  Information Security  1500 non-null  object
 25  Data Engineer  1427 non-null  object
 26  Other job interests  257 non-null  object
 27  freeCodeCamp  13252 non-null  object
 28  EdX  3103 non-null  object
 29  Coursera  4170 non-null  object
 30  Khan Academy  3659 non-null  object
 31  Pluralsight / Code School  2313 non-null  object
 32  Codecademy  9070 non-null  object
 33  Udacity  3662 non-null  object
 34  Udemy  4950 non-null  object
 35  Code Wars  1797 non-null  object
 36  The Odin Project  954 non-null  object
 37  Treehouse  2172 non-null  object
 38  Lynda.com  2467 non-null  object
 39  Stack Overflow  10765 non-null  object
 40  W3Schools  9373 non-null  object
 41  Skillcrush  433 non-null  object
 42  HackerRank  1871 non-null  object
 43  Mozilla Developer Network (MDN)  6139 non-null  object
 44  Egghead.io  1253 non-null  object
 45  CSS Tricks  4476 non-null  object
 46  Other learning resources  969 non-null  object
 47  freeCodeCamp study groups  1589 non-null  object
 48  hackathons  1851 non-null  object
 49  conferences  1497 non-null  object
 50  workshops  1650 non-null  object
 51  Startup Weekend  480 non-null  object
 52  NodeSchool  400 non-null  object
 53  Women Who Code  450 non-null  object
 54  Girl Develop It  290 non-null  object
 55  Meetup.com events  2393 non-null  object
 56  RailsBridge  118 non-null  object
 57  Game Jam  276 non-null  object
 58  Rails Girls  122 non-null  object
 59  Django Girls  142 non-null  object
 60  weekend bootcamps  510 non-null  object
 61  Other event types  1521 non-null  object
 62  Code Newbie  1583 non-null  object
 63  The Changelog  392 non-null  object
 64  Software Engineering Daily  750 non-null  object
 65  JavaScript Jabber  1070 non-null  object
 66  Ruby Rogues  309 non-null  object
 67  Shop Talk Show  321 non-null  object
 68  Developer Tea  699 non-null  object
 69  Programming Throwdown  300 non-null  object
 70  .NET Rocks  312 non-null  object
 71  Talk Python To Me  611 non-null  object
 72  JavaScript Air  650 non-null  object
 73  The Web Ahead  297 non-null  object
 74  CodePen Radio  701 non-null  object
 75  Giant Robots Smashing into Other Giant Robots  159 non-null  object
 76  Software Engineering Radio  372 non-null  object
 77  Other podcasts  1798 non-null  object
dtypes: float64(5), object(73)
memory usage: 10.4+ MB
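
Two pandas details matter when reading a cleaning cell like the one above: `fillna` returns a new object unless its result is assigned back, and range filters can also be written as a boolean mask instead of `drop(...index...)`. The sketch below illustrates both on a made-up frame; the column names `hours` and `age` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and two out-of-range values.
toy = pd.DataFrame({'hours': [10.0, np.nan, 200.0, 5.0], 'age': [20, 30, 40, 8]})

# Calling fillna without assigning the result leaves the frame unchanged.
toy.fillna(value={'hours': 0})
still_missing = int(toy['hours'].isna().sum())

# Assigning the result back is what persists the replacement.
filled = toy.fillna(value={'hours': 0})

# A boolean mask keeps only the rows inside the accepted ranges.
# between() is inclusive on both ends by default. Unlike a drop based on
# "value < 10" or "value > 100", a mask like this also excludes NaN values.
in_range = filled[filled['age'].between(10, 100) & (filled['hours'] <= 140)]

print(still_missing)
print(in_range)
```

Whether missing learning hours should be treated as zero or left as NaN is a modelling choice: a zero pulls averages down, while NaN is simply skipped by `mean()`.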

Let's look at a sample of the cleaned dataset.

In [13]:
fcc_survey_df.sample(10)
Out[13]:

Exploratory Data Analysis and Visualization

To understand how representative the survey is, and before asking questions about the collected information, it is important to know a little more about the general composition of the dataset.

In [14]:
total_respondents = fcc_survey_df.shape[0]
total_raw_respondents = fcc_survey_raw_df.shape[0]

total_locations = fcc_survey_df[location].nunique()
total_languages = fcc_survey_df[language].nunique()
genders = fcc_survey_df[gender].unique()

print(f'''
 {total_respondents} of {total_raw_respondents} people who responded to this freeCodeCamp survey will be analyzed in this project:
   - {len(genders)} gender categories were specified: {", ".join(genders)}
   - People from {total_locations} countries responded to this survey
   - A variety of {total_languages} languages is represented
   ''')
 17175 of 19526 people who responded to this freeCodeCamp survey will be analyzed in this project:
   - 3 gender categories were specified: female, male, other
   - People from 175 countries responded to this survey
   - A variety of 171 languages is represented

Let's configure the matplotlib and seaborn libraries before going deeper into the analysis.

In [15]:
sns.reset_orig()
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Country and Language

In [27]:
fig, axes = plt.subplots(1,2, figsize=(14, 7),  sharey=True)

top_countries = fcc_survey_df[location].value_counts().head(10)
sns.barplot(x=top_countries.index, y=top_countries, ax=axes[0])
axes[0].set_title(location)
axes[0].set_xlabel('Country')
# Rotate the tick labels after plotting, once the category labels exist
axes[0].tick_params(axis='x', labelrotation=70)

top_languages = fcc_survey_df[language].value_counts().head(10)
sns.barplot(x=top_languages.index, y=top_languages, ax=axes[1])
axes[1].set_title(language)
axes[1].set_xlabel('Language')
axes[1].tick_params(axis='x', labelrotation=70)

axes[0].yaxis.set_label_text('Number of respondents')

for ax in axes.flat:
    ax.label_outer()
    
fig.tight_layout(pad=2);
fig.align_xlabels(axes)
Notebook Image

The visualization shows that a disproportionately high number of respondents are from the USA and that the most used language is English. This is expected because the survey is in English, which is also the common language in the countries at the top of the list (USA, India, UK, and Canada). We can also see that non-English-speaking countries are underrepresented in the survey.
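
To put a number on how dominant the top categories are, the counts can be expressed as percentages. This is a minimal sketch on a made-up series of country answers (hypothetical values, not the survey data), using the `normalize` option of `value_counts`.

```python
import pandas as pd

# Toy column of country answers (hypothetical values).
countries = pd.Series(['USA', 'USA', 'USA', 'India', 'India', 'UK', 'Canada', 'Brazil'])

# normalize=True returns fractions of the total instead of raw counts.
shares = countries.value_counts(normalize=True).mul(100).round(1)

print(shares.head(3))
```

On the survey itself, the same call on `fcc_survey_df[location]` would give the percentage of respondents per country.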

Age and Gender

In [17]:
fig, axes = plt.subplots(1,2, figsize=(14, 5))

axes[0].set_title(age)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Number of respondents')
sns.histplot(data=fcc_survey_df, x=age, bins=np.arange(10, 80,5), ax=axes[0])

gender_counts = fcc_survey_df[gender].value_counts()
axes[1].set_title(gender)
axes[1].pie(gender_counts, autopct='%1.1f%%', startangle=180, textprops=dict(color="w", size=20))
axes[1].legend(gender_counts.index,
          title="Genders",
          loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))

plt.tight_layout(pad=2);
Notebook Image

The age distribution shows that most respondents are between 20 and 35 years old, which is consistent with the typical age of programmers around the world. A disproportionately high number of respondents also identify as male, while female respondents and other genders are in the minority. This may reflect discrimination and the lack of inclusion of these groups in the technology field.

Fields of Work

In [18]:
work_field_frequencies = fcc_survey_df[work_field].value_counts().to_dict()
work_field_frequencies.pop("Other")

wc = WordCloud(repeat=True)
wc.generate_from_frequencies(work_field_frequencies)

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
Out[18]:
(-0.5, 399.5, 199.5, -0.5)
Notebook Image

To finish this first look at the general characteristics of the dataset, it is interesting to analyze the respondents' fields of work. A word cloud was used for this: the image shows the most frequent job fields in the survey in larger letters. The most mentioned fields of work are software and IT development, education, and sales. The image also shows the variety of respondents: fields as different from programming as arts, entertainment, sports and media, health care, transportation, design, and marketing were mentioned.

Asking and Answering Questions

Now, let's look at how the dataframe can be used to draw some useful conclusions. Let's ask some questions about the data.

Q1: What lines of study, contribution or collaboration do the respondents use? Which were the most popular according to the survey?

In [30]:
fig, axes = plt.subplots(3, 1, figsize=(14, 14),  sharex=True)

learning_resources_count = fcc_survey_df[learning_resources].count().sort_values(ascending=False)
axes[0].set_title('Most used learning platforms')
axes[0].set(xlabel='Count', ylabel='Learning Platforms')
sns.barplot(x=learning_resources_count,y=learning_resources_count.index, ax=axes[0])

event_types_count = fcc_survey_df[event_types].count().sort_values(ascending=False)
axes[1].set_title('Most attended event types')
axes[1].set(xlabel='Count', ylabel='Events')
sns.barplot(x=event_types_count,y=event_types_count.index, ax=axes[1])

podcasts_count = fcc_survey_df[podcasts].count().sort_values(ascending=False)
axes[2].set_title('Most listened to podcasts')
axes[2].set(xlabel='Count', ylabel='Podcasts')
sns.barplot(x=podcasts_count,y=podcasts_count.index,  ax=axes[2])

for ax in axes.flat:
    ax.label_outer()
    
plt.tight_layout(pad=0.5)
fig.align_ylabels(axes)
Notebook Image

The survey collects 3 forms of study, contribution, or collaboration: Learning Platforms, Events, and Podcasts. It appears that the most used form, without any doubt, is the Learning Platforms, followed by Events, and Podcasts in the last place.

  • Regarding the learning platforms, freeCodeCamp is in first place, which is expected since the survey was launched by this platform, so its users were the most motivated to respond. According to the survey, other popular platforms are Stack Overflow, W3Schools, Codecademy, and MDN, which provide part of their content for free and are therefore accessible to most users.

  • Among the most attended events are meetups, hackathons, workshops, and, as expected, freeCodeCamp study groups. A notable point is that the events organized for the female community rank last, which is consistent with the fact that the majority of respondents were men.

  • Finally, there are podcasts, the least used learning route. Among the most listened to are Code Newbie, JavaScript Jabber, and Software Engineering Daily.
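
The ranking above relies on `count()`, which counts non-null cells per column. The sketch below assumes, as the non-null counts reported by `info()` suggest, that an option a respondent did not select is stored as a missing value; the toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy multi-select answers: one column per option, NaN when not selected.
toy = pd.DataFrame({
    'freeCodeCamp': ['1', '1', np.nan, '1'],
    'Coursera': [np.nan, '1', np.nan, np.nan],
    'Udemy': ['1', np.nan, np.nan, '1'],
})

# count() ignores NaN, so it gives the number of respondents per option.
selected = toy.count().sort_values(ascending=False)

# The mean of the non-null mask gives the share of all respondents instead.
share = (toy.notna().mean() * 100).sort_values(ascending=False)

print(selected)
print(share)
```

Shares are easier to compare across the three groups (platforms, events, podcasts) than raw counts, since they are all relative to the same number of respondents.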

Q2: About how many hours do the respondents spend learning each week depending on their highest degree?

In [31]:
print(f'Average learning hours overall: {fcc_survey_df[week_learning_hours].mean()}')

learning_hours_by_degree_df = fcc_survey_df.groupby(degree)[[week_learning_hours]].mean().sort_values(week_learning_hours, ascending=False)
learning_hours_by_degree_df
Average learning hours overall: 15.351638123603871
Out[31]:

Respondents whose highest level of schooling is below a university degree (some college credit without a degree, a high school diploma, or trade, technical, or vocational training) report the most learning hours, perhaps because they need to acquire this knowledge on their own. University degrees come next. However, there isn't much variation overall, and the average is around 15.4 hours per week.

Below is a scatter plot that depicts the relationship between learning hours per week, programming months, and working status. The graph suggests that the longer people have been programming, the more likely they are to have a job, as we would expect. These respondents also devote less time to studying, which may be due to several factors: lack of time, having already specialized in an area, or simply having met their goal. On the other hand, the less time people have been programming, the more time they devote to studying and the less likely they are to have a job.

In [32]:
plt.title('Programming Months vs. Learning Hours per Week')
plt.xlabel('Programming Months')
plt.ylabel('Learning Hours per Week')
sns.scatterplot(x=programing_months, y=week_learning_hours , hue=already_working, data=fcc_survey_df, s=20);
Notebook Image
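
A dense scatter plot can be hard to read, so a per-group summary is one way to back up what the plot suggests. This is only a sketch on made-up data: the `'yes'`/`'no'` labels and the column names `months` and `hours` are hypothetical and may not match the values stored in `already_working`.

```python
import pandas as pd

# Toy data: working status, months programming, weekly learning hours.
toy = pd.DataFrame({
    'already_working': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
    'months': [60, 48, 24, 6, 12, 3],
    'hours': [5, 10, 8, 20, 15, 30],
})

# The median is less sensitive to extreme answers than the mean.
summary = toy.groupby('already_working')[['months', 'hours']].median()

print(summary)
```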

Q3: What job positions are people most interested in?

In [33]:
job_interests_count = fcc_survey_df[job_interests].count().sort_values(ascending=False)
pd.DataFrame({
    'Job Positions':job_interests_count.index, 
    'Interested':job_interests_count.values
})
Out[33]:

Web development jobs appear to be the most wanted. It is important to note that many of the courses offered by these learning platforms cover web development topics. Next come other popular jobs such as Mobile Developer, Game Developer, and Data Scientist. Below is a word cloud of the listed job positions. Remember that the image shows the most frequent elements in larger letters.

In [34]:
wc = WordCloud(repeat=True)
wc.generate_from_frequencies(job_interests_count)

plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
Out[34]:
(-0.5, 399.5, 199.5, -0.5)
Notebook Image

Q4: Are earnings dependent on gender or where you would like to work?

In [35]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

sns.barplot(x=home_or_remote, y=expected_earnings , hue=gender, data=fcc_survey_df, ax= axes[1]);
axes[1].set_title('Expected Earnings')
# Rotate the tick labels after plotting, once the category labels exist
axes[1].tick_params(axis='x', labelrotation=30)

sns.barplot(x=home_or_remote, y=last_year_earnings , hue=gender, data=fcc_survey_df, ax= axes[0]);
axes[0].set_title('Last Year Earnings')
axes[0].tick_params(axis='x', labelrotation=30)

axes[0].yaxis.set_label_text('US dollars per year')

for ax in axes.flat:
    ax.label_outer()
plt.tight_layout(pad=1)
Notebook Image

The graph on the left shows the relationship between last year's earnings, the preferred workplace, and the gender of the respondents. The graph on the right shows the same analysis for the expected yearly earnings. In both cases, there is no notable difference in earnings (obtained or expected) across preferred workplaces. However, the difference between the two scenarios is notable: people earn less than they expect, which may point to a general dissatisfaction in the sector.

Another element that stands out is the situation of respondents who identify as neither male nor female. They have the lowest earnings, which may be a sign of discrimination and undervaluation. They also show the greatest variability in expected earnings, which may reflect dissatisfaction and perhaps an insecurity built up over years of exclusion.
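
The bar heights in such a plot are group means, and each earnings column has its own missing values, so the two panels may describe different sets of respondents. The following sketch on made-up data (hypothetical column names `last_year` and `expected`, hypothetical values) shows how to tabulate the means next to the number of answers behind them, and how many rows have both values.

```python
import numpy as np
import pandas as pd

# Toy earnings data with gaps in both columns.
toy = pd.DataFrame({
    'gender': ['male', 'female', 'other', 'male', 'female', 'other'],
    'last_year': [40000, 38000, np.nan, 50000, np.nan, 20000],
    'expected': [60000, np.nan, 45000, np.nan, 55000, np.nan],
})

# Mean per group, skipping missing values.
means = toy.groupby('gender')[['last_year', 'expected']].mean()

# Number of non-missing answers behind each mean.
counts = toy.groupby('gender')[['last_year', 'expected']].count()

# Rows where the same respondent gave both figures.
both_answered = int(toy[['last_year', 'expected']].notna().all(axis=1).sum())

print(means)
print(counts)
print(f'{both_answered} respondent(s) answered both questions')
```

When few respondents answer both questions, a gap between the two means compares different people rather than each person's earnings against their own expectations, and small groups give wide error bars.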

Q5: How is employment status related to the number of months respondents have been programming? Consider only employment statuses with more than 200 responses.

In [36]:
employment_