I am a software engineer, and judging the relevance of a given programming language can be daunting given the ever-growing pool of languages and their uses. This analysis tries to shed some light on how we can read the usage data, with the help of a few other datasets. I am greatly indebted to jovian.ml for this course and for the jovian package, which has made version control almost a cakewalk. Setting up an ad hoc online environment with Binder gives the flexibility of working on the fly, with resource management handled by a hosted service that is a mouse click away. The other packages used for this analysis are pandas, matplotlib and seaborn. The main data is sourced from https://www.kaggle.com/jaimevalero/developers-and-programming-languages
project_name = "usage-of-programming-languages"
!pip install jovian --upgrade -q
!pip install pandas
import jovian
jovian.commit(project=project_name)
Import the main data
import pandas as pd
user_lang_df = pd.read_csv('user-languages.csv')
Describing the user-languages dataframe
user_lang_df.info()
user_lang_df.describe()
user_lang_df.head()
Pivoting the table into one row per language with its mean and count, based only on the entries where the value is greater than zero, that is, where the language is actually in use
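# Build one summary row per language, using only the users with a non-zero share.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
language_rows = []
for col in user_lang_df.columns:
    if col != 'user_id':
        current_col_df = user_lang_df[user_lang_df[col] > 0.0][col]
        mean = current_col_df.mean(skipna=True)
        count = current_col_df.count()
        language_rows.append({'language': col, 'mean': mean, 'count': count})
language_df = pd.DataFrame(language_rows, columns=['language', 'mean', 'count'])
language_df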
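The per-column loop above can also be written without an explicit loop. The sketch below is one possible vectorized form, shown on a small hypothetical frame (`toy_user_lang_df` and its values are invented for illustration and are not part of the Kaggle data). It masks zero shares as missing so that `mean()` and `count()` skip them.

```python
import pandas as pd

# Hypothetical miniature of user-languages.csv: one row per user,
# one column per language, values are the share of that language.
toy_user_lang_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "python": [0.5, 0.0, 1.0],
    "javascript": [0.5, 1.0, 0.0],
    "cobol": [0.0, 0.0, 0.0],
})

shares = toy_user_lang_df.drop(columns="user_id")
# Replace zero shares with NaN so that mean() and count() ignore them.
positive_shares = shares.where(shares > 0.0)

toy_language_df = pd.DataFrame({
    "language": positive_shares.columns,
    "mean": positive_shares.mean().to_numpy(),
    "count": positive_shares.count().to_numpy(),
})
print(toy_language_df)
```

A language that nobody uses (here `cobol`) ends up with a count of 0 and a missing mean, which matches what the loop produces for an empty selection.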
We've taken this dataset from https://github.com/jamhall/programming-languages-csv
It lists known programming languages, and we use it as the reference for which names in the main data count as valid languages
valid_languages_df = pd.read_csv('languages.csv')
valid_languages_df
What are the top 100 languages in use?
valid_languages_df['name_lower'] = valid_languages_df['name'].str.lower()
valid_existing_languages_in_projects_df = language_df.sort_values(by=['mean'], ascending=False).merge(valid_languages_df, left_on="language", right_on="name_lower", how="inner")
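# Take an explicit copy so the new column below is added to an independent dataframe, not a slice.
top_valid_existing_languages_in_projects_df = valid_existing_languages_in_projects_df.head(100)[['count', 'language', 'mean', 'name']].copy()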
top_valid_existing_languages_in_projects_df['mean * count'] = top_valid_existing_languages_in_projects_df['count'] * top_valid_existing_languages_in_projects_df['mean']
top_valid_existing_languages_in_projects_df = top_valid_existing_languages_in_projects_df.sort_values(by='mean * count', ascending = False)
top_valid_existing_languages_in_projects_df
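A note on the ranking score used above: since the mean is the sum of a language's non-zero shares divided by the number of users who use it, `mean * count` is simply that sum. The check below illustrates this on hypothetical shares for a single language (the numbers are invented for the example).

```python
import pandas as pd

# Hypothetical shares of one language across five users (0.0 = not used).
shares = pd.Series([0.2, 0.0, 0.6, 0.0, 0.4], name="python")

positive = shares[shares > 0.0]
mean = positive.mean()    # average share among the users of the language
count = positive.count()  # number of users of the language
score = mean * count      # the "mean * count" ranking score

# The score collapses to the plain sum of the shares.
print(score, shares.sum())
```

Read this way, the score rewards languages that are both widespread and make up a large part of each user's work.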
Top 10 most relevant languages
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="mean * count", y="name", data=top_valid_existing_languages_in_projects_df.head(10))
How many unique users?
pd.DataFrame(user_lang_df.user_id.unique()).count()
Shape?
user_lang_df.shape
This means that all the users are unique here
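The comparison of the unique count with the row count can also be made directly. A minimal sketch on a hypothetical frame (`toy_df` deliberately contains one repeated `user_id`, unlike the real data):

```python
import pandas as pd

# Hypothetical frame with one repeated user_id.
toy_df = pd.DataFrame({
    "user_id": ["a", "b", "b", "c"],
    "python": [1.0, 0.5, 0.5, 0.0],
})

n_rows = len(toy_df)
n_unique = toy_df["user_id"].nunique()
has_duplicates = not toy_df["user_id"].is_unique
# keep=False marks every member of a duplicated group, not only the repeats.
duplicate_rows = toy_df[toy_df["user_id"].duplicated(keep=False)]

print(n_rows, n_unique, has_duplicates)
print(duplicate_rows)
```

When `n_rows` equals `n_unique`, as in the main data, `is_unique` is true and `duplicate_rows` is empty.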
existing_languages_df = valid_existing_languages_in_projects_df.language
technology_set = set()
technology_dict = dict()
for i, row in user_lang_df.iterrows():
    # Collect the languages this user has a non-zero share in.
    current_set_of_languages = set()
    for language in existing_languages_df:
        if row[language] > 0.0:
            current_set_of_languages.add(language)
    combo = frozenset(current_set_of_languages)
    technology_set.add(combo)
    # Count how many users share exactly this combination.
    technology_dict[combo] = technology_dict.get(combo, 0) + 1
technology_dict
# Turn the combination counts into a dataframe, one row per combination.
# Sorting the names keeps each label stable, since set order is arbitrary.
combo_rows = []
for combo in technology_dict.keys():
    curr_dict = dict()
    curr_dict['language_combo'] = ",".join(sorted(combo))
    curr_dict['frequency'] = technology_dict[combo]
    combo_rows.append(curr_dict)
technology_combo_freq_df = pd.DataFrame(combo_rows, columns=['language_combo', 'frequency'])
technology_combo_freq_df
technology_combo_freq_df = technology_combo_freq_df.sort_values(by = ['frequency'], ascending = False).head(100)
technology_combo_freq_df = technology_combo_freq_df.reset_index()
technology_combo_freq_df
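The combination counting above can be condensed with a boolean mask and `collections.Counter`. The following is a sketch on hypothetical data (`toy_df` and `language_columns` are invented names), not a drop-in replacement for the cells above.

```python
from collections import Counter

import pandas as pd

# Hypothetical shares for four users and two languages.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "javascript": [1.0, 0.6, 0.0, 0.7],
    "python": [0.0, 0.4, 0.0, 0.3],
})
language_columns = ["javascript", "python"]

# True where the user has a non-zero share of the language.
uses_language = toy_df[language_columns] > 0.0

# For each user, join the names of the languages in use into a sorted label.
combo_labels = uses_language.apply(
    lambda row: ",".join(sorted(row[row].index)), axis=1
)

combo_counts = Counter(combo_labels)
toy_combo_df = pd.DataFrame(
    combo_counts.most_common(), columns=["language_combo", "frequency"]
)
print(toy_combo_df)
```

A user with no language in use (here `u3`) produces an empty label, which is why a blank combination can show up in the frequency table.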
Top 10 language combos
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="frequency", y="language_combo", data=technology_combo_freq_df.head(10), palette="Blues_d")
Next, we look for the most-used entries (language claims) that are not on the valid language list
# Filter first, then sort, so the boolean mask lines up with the dataframe it is applied to.
invalid_existing_languages_in_projects_df = language_df[~language_df.language.isin(valid_languages_df['name_lower'])].sort_values(by=['count', 'mean'], ascending=False)
invalid_existing_languages_in_projects_df.head(100)
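The same "not on the valid list" selection can be expressed as an anti-join using the `indicator` option of `merge`. This is a sketch on hypothetical frames (`toy_language_df` and `toy_valid_df` are invented for the example):

```python
import pandas as pd

# Hypothetical language summary and reference list.
toy_language_df = pd.DataFrame({
    "language": ["python", "html", "jupyter notebook", "go"],
    "count": [50, 40, 30, 20],
})
toy_valid_df = pd.DataFrame({"name": ["Python", "Go"]})
toy_valid_df["name_lower"] = toy_valid_df["name"].str.lower()

# A left merge with indicator=True adds a "_merge" column that says
# whether each left row found a match in the reference list.
merged = toy_language_df.merge(
    toy_valid_df,
    left_on="language",
    right_on="name_lower",
    how="left",
    indicator=True,
)
toy_invalid_df = merged.loc[merged["_merge"] == "left_only", ["language", "count"]]
print(toy_invalid_df)
```

Compared with `isin`, the merge form keeps the matched reference columns around, which can help when inspecting why a name did or did not match.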
Here we look for a correlation between the mean (the average share of a language in the projects where it is present) and the count (the number of users who claim to use it). This may help us spot outliers
import seaborn as sns
import matplotlib.pyplot as plt
# Create the figure first so the scatter plot is drawn on it rather than on a default-sized one.
fig, ax = plt.subplots(figsize=(20, 20))
sns.scatterplot(data=invalid_existing_languages_in_projects_df.head(100), x="count", y="mean", hue="language", ax=ax)
# Place the legend outside the axes; seaborn already labels each hue level.
ax.legend(title='top invalid languages: count vs mean', loc='upper left', bbox_to_anchor=(1.02, 1.0), ncol=2)
plt.show()
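Alongside the scatter plot, the count-versus-mean relationship can be summarised with a correlation coefficient. The sketch below uses hypothetical values (the `toy` frame is invented) and computes Pearson's coefficient plus a rank-based (Spearman) coefficient, obtained here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical count and mean values for five entries.
toy = pd.DataFrame({
    "language": ["a", "b", "c", "d", "e"],
    "count": [100, 80, 60, 40, 20],
    "mean": [0.10, 0.20, 0.30, 0.40, 0.90],
})

# Pearson measures linear association between the raw values.
pearson = toy["count"].corr(toy["mean"])
# Spearman is the Pearson correlation of the ranks, so it is less
# sensitive to a single extreme point such as entry "e".
spearman = toy["count"].rank().corr(toy["mean"].rank())

print(pearson, spearman)
```

A large gap between the two coefficients is a hint that a few extreme points, rather than the bulk of the data, are driving the linear correlation.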
The data below comes from Wikipedia (https://en.wikipedia.org/wiki/Programming_languages_used_in_most_popular_websites); the table was converted to CSV with the help of https://www.convertcsv.com/html-table-to-csv.htm
most_popular_website_languages_df = pd.read_csv("most-popular-website-languages.csv")
most_popular_website_languages_df
We see that only the front-end and back-end languages are of significance to us. Let's first remove the \n and \r characters from those column names
most_popular_website_languages_df = most_popular_website_languages_df.rename(columns={"Front-end\r\n(Client-side)": "Front-end", "Back-end\r\n(Server-side)": "Back-end"})
most_popular_website_languages_df
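The explicit rename above works as long as the header text is exactly as written. A slightly more general sketch, assuming every affected header has the form "name, line break, note", keeps only the first line of each column name. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical header row as it might come out of the HTML-to-CSV conversion.
toy = pd.DataFrame(
    columns=["Websites", "Front-end\r\n(Client-side)", "Back-end\r\n(Server-side)"]
)

# str.splitlines() splits on \r\n, \n and \r; the first piece is the bare name.
toy = toy.rename(columns=lambda name: name.splitlines()[0])
print(list(toy.columns))
```

Column names without a line break pass through unchanged.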
Since we need only the 3rd and 4th columns, we'll shrink the dataframe to those two
most_popular_website_languages_df = most_popular_website_languages_df[["Front-end", "Back-end"]]
most_popular_website_languages_df
As the cells contain some unwanted characters (spaces, bracketed footnote markers and parenthesised notes), we'll remove them first
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r' ', value='')
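# Non-greedy patterns remove each bracketed or parenthesised group separately,
# instead of everything between the first opener and the last closer in a cell.
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\[.*?\]', value='')
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\(.*?\)', value='')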
most_popular_website_languages_df
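The clean-up step can be tried out on a small hypothetical frame before applying it to the real table. The cell contents below are invented, and the whitespace pattern `\s+` is an assumption about which characters need removing. The sketch also shows why the quantifier matters: a greedy `.*` spans from the first opening bracket to the last closing one in a cell.

```python
import pandas as pd

# Hypothetical cells with footnote markers and a parenthesised note.
toy = pd.DataFrame({
    "Front-end": ["JavaScript[1], TypeScript[2]"],
    "Back-end": ["C, C++, Go,[4] Java (Spring), Python"],
})

# Remove whitespace, then bracketed markers, then parenthesised notes.
cleaned = toy.replace(regex=r"\s+", value="")
cleaned = cleaned.replace(regex=r"\[.*?\]", value="")
cleaned = cleaned.replace(regex=r"\(.*?\)", value="")

# For comparison: the greedy pattern also swallows the text between two markers.
greedy = toy.replace(regex=r"\[.*\]", value="")

print(cleaned.loc[0, "Front-end"])
print(cleaned.loc[0, "Back-end"])
print(greedy.loc[0, "Front-end"])
```

With the greedy pattern, a language that sits between two footnote markers in the same cell is dropped, so it would be missing from the frequency counts.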
big_tech_language_freq_dict = dict()
big_tech_language_list = list()
for i, row in most_popular_website_languages_df.iterrows():
    # Each cell holds a comma-separated list of languages.
    big_tech_language_list += row["Front-end"].split(",")
    big_tech_language_list += row["Back-end"].split(",")
big_tech_language_list
for language in big_tech_language_list:
    # Count how many times each language is mentioned across all websites.
    big_tech_language_freq_dict[language] = big_tech_language_freq_dict.get(language, 0) + 1
big_tech_language_freq_dict
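The split-and-count step above is a natural fit for `collections.Counter`. A minimal sketch on hypothetical rows (the `rows` list is invented and only mimics the shape of the cleaned Front-end and Back-end columns):

```python
from collections import Counter

# Hypothetical cleaned rows, shaped like the Front-end / Back-end columns.
rows = [
    {"Front-end": "JavaScript", "Back-end": "C,C++,Go,Java,Python"},
    {"Front-end": "JavaScript", "Back-end": "PHP,Python"},
]

language_counts = Counter(
    language
    for row in rows
    for column in ("Front-end", "Back-end")
    for language in row[column].split(",")
    if language  # skip empty strings left behind by stray commas
)

print(language_counts.most_common(3))
```

`Counter` behaves like a dictionary, so it can be passed wherever the frequency dictionary is used, and `most_common()` gives the entries already sorted by frequency.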
import jovian
jovian.commit(project=project_name)
!pip install matplotlib seaborn numpy
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
big_tech_languages = big_tech_language_freq_dict.keys()
big_tech_languages_frequencies = big_tech_language_freq_dict.values()
# Creating plot
fig_tech_giant = plt.figure(figsize =(12, 12))
plt.pie(big_tech_languages_frequencies, labels = big_tech_languages)
# show plot
plt.show()
# Collect the GitHub statistics for each language that the big-tech list mentions.
matched_frames = []
for language in big_tech_languages:
    # Copy the selection so the label can be overwritten without touching language_df.
    current_df = language_df[language_df['language'] == language.lower()].copy()
    current_df['language'] = language
    matched_frames.append(current_df)
github_usage_with_big_tech_languages_df = pd.concat(matched_frames, ignore_index=True)
github_usage_with_big_tech_languages_df
github_usage_with_big_tech_languages = github_usage_with_big_tech_languages_df.language
github_usage_with_big_tech_languages_frequencies = github_usage_with_big_tech_languages_df['count'].astype(int)
fig_github_pie = plt.figure(figsize =(12, 12))
plt.pie(github_usage_with_big_tech_languages_frequencies, labels = github_usage_with_big_tech_languages)
# show plot
plt.show()
import jovian
jovian.commit()
1. What are the most relevant languages that people are working on now?
# Here are the top 10, based on the product of their average proportion and their number of users
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="mean * count", y="name", data=top_valid_existing_languages_in_projects_df.head(10))
2. What are the most popular language combinations that people love to use, and that a single programmer can feel empowered to have in their toolkit?
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="frequency", y="language_combo", data=technology_combo_freq_df.head(10), palette="Blues_d")
Shocked? A single language on its own (JavaScript) is still the top player! Looking deeper, this is quite plausible, as JavaScript has become a language of choice for full-stack development these days. The next entry is blank. Yes, blank, meaning no languages at all! These are probably the accounts of users who haven't started coding yet, or whose work all sits in private repositories. The community awaits them! The rest of the chart needs little explanation.
3. Are there duplicate users?
# Number of unique user_ids
pd.DataFrame(user_lang_df.user_id.unique()).count()
# Check whether the number of rows in the shape matches the unique count
user_lang_df.shape
This means there aren't any duplicate users
4. Which are the most-used (top 100) languages (claims) that are not on the valid language list?
import seaborn as sns
import matplotlib.pyplot as plt
# Create the figure first so the scatter plot is drawn on it rather than on a default-sized one.
fig, ax = plt.subplots(figsize=(20, 20))
sns.scatterplot(data=invalid_existing_languages_in_projects_df.head(100), x="count", y="mean", hue="language", ax=ax)
# Place the legend outside the axes; seaborn already labels each hue level.
ax.legend(title='top invalid languages: count vs mean', loc='upper left', bbox_to_anchor=(1.02, 1.0), ncol=2)
plt.show()
5. Does there seem to be a close relationship between the languages used by tech giants like Google, Facebook or WordPress and those used by programmers on GitHub?
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
big_tech_languages = big_tech_language_freq_dict.keys()
big_tech_languages_frequencies = big_tech_language_freq_dict.values()
# Creating plot
fig_tech_giant = plt.figure(figsize =(12, 12))
plt.pie(big_tech_languages_frequencies, labels = big_tech_languages)
# show plot
plt.show()
github_usage_with_big_tech_languages = github_usage_with_big_tech_languages_df.language
github_usage_with_big_tech_languages_frequencies = github_usage_with_big_tech_languages_df['count'].astype(int)
fig_github_pie = plt.figure(figsize =(12, 12))
plt.pie(github_usage_with_big_tech_languages_frequencies, labels = github_usage_with_big_tech_languages)
# show plot
plt.show()
There seems to be a clear correlation between the two charts above
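The visual comparison of the two pie charts can be backed by numbers: convert both sets of counts to shares, take the per-language difference, and compute a rank correlation. The counts below are hypothetical and chosen only to illustrate the calculation; they are not the values from the datasets.

```python
import pandas as pd

# Hypothetical counts: website mentions vs. GitHub users per language.
big_tech_counts = pd.Series({"JavaScript": 10, "Python": 4, "Java": 5, "PHP": 3})
github_counts = pd.Series({"JavaScript": 9000, "Python": 6000, "Java": 3000, "PHP": 2000})

# Normalise each source to shares so the two scales become comparable.
comparison = pd.DataFrame({
    "big_tech_share": big_tech_counts / big_tech_counts.sum(),
    "github_share": github_counts / github_counts.sum(),
})
comparison["difference"] = comparison["github_share"] - comparison["big_tech_share"]

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rank_correlation = comparison["big_tech_share"].rank().corr(
    comparison["github_share"].rank()
)

print(comparison)
print(rank_correlation)
```

A rank correlation close to 1 would support the impression given by the charts, and the `difference` column shows which languages are over- or under-represented on GitHub relative to the website list.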
import jovian
jovian.commit()
The inferences are the following:
1. There is a large gap between the usage scores of the first- and second-ranked languages, which may mean that JavaScript is used almost twice as much as Python in real projects.
2. The list of user-language combinations shows which combinations of languages may land you a job, provided you know them well.
3. The entries that are not on the valid language list suggest that if a language plays only a small role (low mean) in a user's average project, it cannot be counted on in times of crisis.
4. The languages used by the tech giants are the most relevant in the industry. The pie charts above show a clear relationship with what people writing software on GitHub use.
5. The difference between the tech giants' Python share and GitHub's is probably due to the advent of data science, because we have taken only the front-end and back-end languages from the tech giants' dataset.
We can conclude that our analysis has succeeded in highlighting the trends in the usage of these programming languages, and which languages we can pick up or brush up on at this point in time.
import jovian
jovian.commit()