
Usage of programming languages

I am a software engineer, and judging the relevance of a given programming language can be daunting given the ever-growing pool of languages and their uses. This analysis tries to shed some light on how we can read this data, with the help of some other datasets as well. I am greatly indebted to jovian.ml for this course, as well as for the jovian package, which has made version control almost a cakewalk. Setting up an ad hoc online environment with Binder gives the flexibility of working on the fly, with resource management handled on the server side and available at a mouse click. The other packages used for this analysis are pandas, matplotlib and seaborn. The main data is sourced from https://www.kaggle.com/jaimevalero/developers-and-programming-languages

In [232]:
project_name = "usage-of-programming-languages"
In [233]:
!pip install jovian --upgrade -q
!pip install pandas
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.7/site-packages (1.1.3) Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (2020.1) Requirement already satisfied: python-dateutil>=2.7.3 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (2.8.0) Requirement already satisfied: numpy>=1.15.4 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (1.19.2) Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.12.0)
In [234]:
import jovian
In [ ]:
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..

Data Preparation and Cleaning

Import the main data

In [ ]:
import pandas as pd
In [ ]:
user_lang_df = pd.read_csv('user-languages.csv')

Describing the user-languages dataframe

In [ ]:
user_lang_df.info()
In [ ]:
user_lang_df.describe()
In [ ]:
user_lang_df.head()

Pivoting the table by language to get the mean and count for each one, based only on the entries where the value is greater than zero, that is, where the language is actually in use

In [ ]:
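# Collect one record per language, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
language_records = []
for col in user_lang_df.columns:
    if col != 'user_id':
        # Keep only the users for whom this language is in use
        current_col = user_lang_df.loc[user_lang_df[col] > 0.0, col]
        language_records.append({'language': col, 'mean': current_col.mean(skipna=True), 'count': current_col.count()})
language_df = pd.DataFrame(language_records, columns=['language', 'mean', 'count'])
language_df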
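
The same per-language mean and count can also be computed without an explicit loop. The following is a minimal sketch on a made-up toy dataframe (`toy_df` and its values are assumptions for illustration, not the Kaggle data); it masks non-positive values and lets pandas skip them.

```python
import pandas as pd

# Toy stand-in for user_lang_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "python": [0.5, 0.0, 0.3],
    "javascript": [0.5, 1.0, 0.0],
})

usage = toy_df.drop(columns="user_id")
positive = usage.where(usage > 0)  # entries that are not > 0 become NaN

toy_language_df = pd.DataFrame({
    "language": list(positive.columns),
    "mean": positive.mean().values,    # NaN entries are skipped by default
    "count": positive.count().values,  # counts only the non-NaN entries
})
print(toy_language_df)
```

On the full dataset this avoids one boolean filter per column, at the cost of holding a masked copy of the table in memory.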

We've taken this dataset from https://github.com/jamhall/programming-languages-csv

This is to list the most widely used programming languages in the world

In [ ]:
valid_languages_df = pd.read_csv('languages.csv')
valid_languages_df

Exploratory Analysis and Visualization

What are the top 100 languages in use?

In [ ]:
valid_languages_df['name_lower'] = valid_languages_df['name'].str.lower()
valid_existing_languages_in_projects_df = language_df.sort_values(by=['mean'], ascending=False).merge(valid_languages_df, left_on="language", right_on="name_lower", how="inner")
# .copy() avoids a SettingWithCopyWarning when the score column is added below
top_valid_existing_languages_in_projects_df = valid_existing_languages_in_projects_df.head(100)[['count', 'language', 'mean', 'name']].copy()
top_valid_existing_languages_in_projects_df['mean * count'] = top_valid_existing_languages_in_projects_df['count'] * top_valid_existing_languages_in_projects_df['mean']
top_valid_existing_languages_in_projects_df = top_valid_existing_languages_in_projects_df.sort_values(by='mean * count', ascending = False)
top_valid_existing_languages_in_projects_df
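
It may help to see what the `mean * count` score measures. Because both the mean and the count are taken over the positive entries only, their product is simply the sum of those entries, i.e. the total share of that language across all users. A small sketch with invented values:

```python
import math
from statistics import mean

# Hypothetical per-user proportions for one language (invented values).
values = [0.0, 0.2, 0.5, 0.0, 0.3]

positive = [v for v in values if v > 0]
score = mean(positive) * len(positive)  # the notebook's "mean * count"
total = sum(positive)                   # plain sum of the positive entries

print(score, total)
print(math.isclose(score, total))
```

So ranking by `mean * count` is the same as ranking languages by their summed proportion over all users.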

Top 10 most relevant languages

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="mean * count", y="name", data=top_valid_existing_languages_in_projects_df.head(10))

How many unique users?

In [ ]:
pd.DataFrame(user_lang_df.user_id.unique()).count()

Shape?

In [ ]:
user_lang_df.shape

The number of unique user_ids matches the number of rows, which means that all the users are unique here
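
The uniqueness check can also be made explicit rather than read off two separate outputs. A minimal sketch on a toy column (the ids are made up, and one duplicate is planted on purpose to show what a failure looks like):

```python
import pandas as pd

# Toy ids with one deliberate duplicate; not taken from the real dataset.
toy_users = pd.DataFrame({"user_id": ["u1", "u2", "u2"]})

ids_are_unique = toy_users["user_id"].is_unique
unique_count = toy_users["user_id"].nunique()
duplicate_rows = int(toy_users["user_id"].duplicated().sum())

print(ids_are_unique, unique_count, duplicate_rows)
```

On the real data, `user_lang_df.user_id.is_unique` would be expected to be `True` if the conclusion above holds.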

In [ ]:
existing_languages_df = valid_existing_languages_in_projects_df.language
technology_set = set()
technology_dict = dict()
for i, row in user_lang_df.iterrows():
    current_set_of_languages = set()
    for language in existing_languages_df:
        if row[language] > 0.0:
            current_set_of_languages.add(language)
    technology_set.add(frozenset(current_set_of_languages))
    if frozenset(current_set_of_languages) in technology_dict.keys():
        technology_dict[frozenset(current_set_of_languages)] = technology_dict[frozenset(current_set_of_languages)] + 1
    else:
        technology_dict[frozenset(current_set_of_languages)] = 1
technology_dict
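
The combination counting above can be written more compactly with `collections.Counter`. This is a sketch on toy data (`toy_df` and `language_columns` are assumptions for illustration); the idea is the same: turn each user's row into a frozenset of the languages with a positive value and count how often each set occurs.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for user_lang_df; the values are invented.
toy_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "python": [0.5, 0.0, 0.3, 0.0],
    "javascript": [0.5, 1.0, 0.7, 0.0],
})
language_columns = ["python", "javascript"]

# Boolean table: True where a language is in use for that user.
used = toy_df[language_columns] > 0

combo_counts = Counter(
    frozenset(col for col, flag in zip(language_columns, row) if flag)
    for row in used.itertuples(index=False, name=None)
)
print(combo_counts.most_common())
```

Note that a user with no positive values contributes the empty frozenset, which is where a blank combination in the results comes from.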
In [ ]:
# Build the dataframe from a list of records (DataFrame.append was removed in pandas 2.0)
technology_combo_freq_df = pd.DataFrame(
    [{'language_combo': ",".join(combo), 'frequency': frequency} for combo, frequency in technology_dict.items()],
    columns=['language_combo', 'frequency'])
technology_combo_freq_df
In [ ]:
technology_combo_freq_df = technology_combo_freq_df.sort_values(by = ['frequency'], ascending = False).head(100)
technology_combo_freq_df = technology_combo_freq_df.reset_index()
technology_combo_freq_df

Top 10 language combos

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="frequency", y="language_combo", data=technology_combo_freq_df.head(10), palette="Blues_d")

We try to find the most used languages (or claims) that are not in the valid language list

In [ ]:
# Filter first, then sort, so the boolean mask lines up with the rows it is applied to
invalid_existing_languages_in_projects_df = language_df[~language_df.language.isin(valid_languages_df['name_lower'])].sort_values(by=['count', 'mean'], ascending=False)
invalid_existing_languages_in_projects_df.head(100)

This looks for a correlation between the mean (the average proportion of a language in the projects where it is present) and the frequency (the number of users who claim to use it). This may help us find the outliers

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
# Create the figure before plotting so that the size applies to this chart
fig = plt.figure(figsize =(20, 20))
ax = sns.scatterplot(data=invalid_existing_languages_in_projects_df.head(100), x="count", y="mean", hue="language")

plt.legend(title='top invalid languages: count vs mean', loc='upper right', bbox_to_anchor=(2.00, 2.00), ncol=1, labels=invalid_existing_languages_in_projects_df.head(100).language)

plt.show()
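
A scatter plot shows the shape of the relationship, but a single number can complement it. As a rough sketch, pandas can compute the Pearson correlation between the two columns; the toy values below are invented and only illustrate the call, not the real result.

```python
import pandas as pd

# Invented count/mean pairs standing in for the top "invalid" languages.
toy_invalid_df = pd.DataFrame({
    "language": ["a", "b", "c", "d"],
    "count": [10, 20, 30, 40],
    "mean": [0.8, 0.6, 0.5, 0.1],
})

# Series.corr uses the Pearson coefficient by default.
pearson_r = toy_invalid_df["count"].corr(toy_invalid_df["mean"])
print(round(pearson_r, 3))
```

A value near -1 or 1 would indicate a strong linear relationship, while points far from the overall trend in the plot are the candidates for outliers.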

We've taken the data below from Wikipedia (https://en.wikipedia.org/wiki/Programming_languages_used_in_most_popular_websites) and converted the table to CSV with the help of https://www.convertcsv.com/html-table-to-csv.htm

In [ ]:
most_popular_website_languages_df = pd.read_csv("most-popular-website-languages.csv")
In [ ]:
most_popular_website_languages_df

We see that only the front-end and the back-end languages are of significance to us. Let's first remove the \r and \n characters from the column names

In [ ]:
most_popular_website_languages_df = most_popular_website_languages_df.rename(columns={"Front-end\r\n(Client-side)": "Front-end", "Back-end\r\n(Server-side)": "Back-end"})
most_popular_website_languages_df

Since we need only the 3rd and 4th columns, we'll shrink the dataframe to those two

In [ ]:
most_popular_website_languages_df = most_popular_website_languages_df[["Front-end", "Back-end"]]
most_popular_website_languages_df

As there are some unwanted characters, we'll remove them first

In [ ]:
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r' ', value='')
# Non-greedy patterns, so text between two separate citations or parentheses is kept
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\[.*?\]', value='')
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\(.*?\)', value='')
most_popular_website_languages_df
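
When a cell contains more than one bracketed citation, a greedy pattern and a non-greedy one behave differently. The sketch below uses the standard `re` module on a made-up cell value (an assumption for illustration, not a row copied from the CSV) to show the difference.

```python
import re

# Hypothetical cell after spaces have been stripped.
cell = "PHP(HHVM),Python,D,[12]XHP,[13]Haskell[14]"

# Greedy: matches from the first "[" to the last "]" in the string.
greedy = re.sub(r"\[.*\]", "", cell)

# Non-greedy: each "[...]" group is matched and removed separately.
lazy = re.sub(r"\[.*?\]", "", cell)
lazy_clean = re.sub(r"\(.*?\)", "", lazy)

print(greedy)
print(lazy_clean)
```

With the greedy pattern, every language that sits between the first and the last citation is deleted along with them, so it never reaches the frequency count.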
In [ ]:
big_tech_language_freq_dict = dict()
big_tech_language_list = list()
for i, row in most_popular_website_languages_df.iterrows():
    big_tech_language_list += row["Front-end"].split(",")
    big_tech_language_list += row["Back-end"].split(",")
big_tech_language_list
for language in big_tech_language_list:
    if language in big_tech_language_freq_dict.keys():
        big_tech_language_freq_dict[language] += 1
    else:
        big_tech_language_freq_dict[language] = 1
big_tech_language_freq_dict
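
The split-and-count step can also be expressed with `collections.Counter`. A minimal sketch, assuming each row is a pair of comma-separated strings (the two rows below are invented):

```python
from collections import Counter

# Hypothetical (front-end, back-end) cells after cleaning.
toy_rows = [
    ("JavaScript", "PHP,Python"),
    ("JavaScript", "Java,Python"),
]

toy_language_counts = Counter(
    language
    for front_end, back_end in toy_rows
    for cell in (front_end, back_end)
    for language in cell.split(",")
    if language  # skip empty strings left behind by the cleanup
)
print(toy_language_counts.most_common())
```

The `if language` guard drops empty entries, which can appear when a citation or a parenthesised note was the only thing between two commas.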
In [ ]:
import jovian
In [ ]:
jovian.commit(project=project_name)
In [ ]:
!pip install matplotlib seaborn numpy
In [ ]:
from matplotlib import pyplot as plt 
import numpy as np 
%matplotlib inline
  
big_tech_languages = big_tech_language_freq_dict.keys()
  
big_tech_languages_frequencies = big_tech_language_freq_dict.values()
  
# Creating plot 
fig_tech_giant = plt.figure(figsize =(12, 12)) 
plt.pie(big_tech_languages_frequencies, labels = big_tech_languages) 
  
# show plot 
plt.show() 
In [ ]:
# Collect the matching rows and concatenate once (DataFrame.append was removed in pandas 2.0)
matching_frames = []
for language in big_tech_languages:
    # Copy the slice so the display name can be assigned without a SettingWithCopyWarning
    current_df = language_df[language_df['language'] == language.lower()].copy()
    current_df['language'] = language
    matching_frames.append(current_df)
github_usage_with_big_tech_languages_df = pd.concat(matching_frames, ignore_index=True)
github_usage_with_big_tech_languages_df
In [ ]:
github_usage_with_big_tech_languages = github_usage_with_big_tech_languages_df.language
github_usage_with_big_tech_languages_frequencies = github_usage_with_big_tech_languages_df['count'].astype(int)
  
fig_github_pie = plt.figure(figsize =(12, 12)) 
plt.pie(github_usage_with_big_tech_languages_frequencies, labels = github_usage_with_big_tech_languages) 
  
# show plot 
plt.show() 
In [ ]:
import jovian
In [ ]:
jovian.commit()

Asking and Answering Questions

1. What are the most relevant languages that people are working with now?

In [ ]:
# Here are the top 10, based on the product of their average proportion and number of occurrences
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="mean * count", y="name", data=top_valid_existing_languages_in_projects_df.head(10))

2. What are the most popular language combos that people love to use, and that a single programmer can feel empowered by having in their toolkit?

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="frequency", y="language_combo", data=technology_combo_freq_df.head(10), palette="Blues_d")

Surprised? A single language on its own (JavaScript) is the top entry! On a closer look this is quite plausible, as JavaScript has become a language of choice for many kinds of full-stack development. The next entry is blank. Yes, blank, meaning no language at all! These are probably the accounts of users who haven't started coding yet, or whose work is all in private repositories. The community awaits them! The rest of the chart needs little explanation.

3. Are there duplicate users?

In [ ]:
# No.of unique user_ids
pd.DataFrame(user_lang_df.user_id.unique()).count()
In [ ]:
# Check for the shape whether it matches the count in any way?
user_lang_df.shape

This means there aren't any duplicate users

4. What are the most-used (top 100) languages (claims) that are not on the valid language list?

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
# Create the figure before plotting so that the size applies to this chart
fig = plt.figure(figsize =(20, 20))
ax = sns.scatterplot(data=invalid_existing_languages_in_projects_df.head(100), x="count", y="mean", hue="language")

plt.legend(title='top invalid languages: count vs mean', loc='upper right', bbox_to_anchor=(2.00, 2.00), ncol=1, labels=invalid_existing_languages_in_projects_df.head(100).language)

plt.show()

5. Does there seem to be a close relation between the languages used by tech giants like Google, Facebook or WordPress and those used by programmers on GitHub?

In [ ]:
from matplotlib import pyplot as plt 
import numpy as np 
%matplotlib inline
  
big_tech_languages = big_tech_language_freq_dict.keys()
  
big_tech_languages_frequencies = big_tech_language_freq_dict.values()
  
# Creating plot 
fig_tech_giant = plt.figure(figsize =(12, 12)) 
plt.pie(big_tech_languages_frequencies, labels = big_tech_languages) 
  
# show plot 
plt.show() 
In [ ]:
github_usage_with_big_tech_languages = github_usage_with_big_tech_languages_df.language
github_usage_with_big_tech_languages_frequencies = github_usage_with_big_tech_languages_df['count'].astype(int)
  
fig_github_pie = plt.figure(figsize =(12, 12)) 
plt.pie(github_usage_with_big_tech_languages_frequencies, labels = github_usage_with_big_tech_languages) 
  
# show plot 
plt.show() 

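The two pie charts above look broadly similar, which suggests (but does not prove) a relationship between the languages used by the tech giants and those used on GitHub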
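
One way to put a number on the visual comparison of the two pie charts is to turn both sets of counts into shares and measure how far apart they are. The sketch below uses invented counts (not the real results) and the total variation distance, which is 0 for identical distributions and 1 for completely disjoint ones.

```python
# Hypothetical counts; the real values come from the two dataframes above.
big_tech_counts = {"JavaScript": 10, "Java": 6, "Python": 4}
github_counts = {"JavaScript": 500, "Java": 200, "Python": 300}


def to_shares(counts):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {language: value / total for language, value in counts.items()}


big_tech_shares = to_shares(big_tech_counts)
github_shares = to_shares(github_counts)

languages = sorted(set(big_tech_shares) | set(github_shares))
total_variation = 0.5 * sum(
    abs(big_tech_shares.get(lang, 0.0) - github_shares.get(lang, 0.0))
    for lang in languages
)
print(round(total_variation, 3))
```

Note that the two sources count different things (websites on one side, users on the other), so this only compares the shape of the two distributions, not absolute usage.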

In [ ]:
import jovian
In [ ]:
jovian.commit()

Inferences and Conclusion

The inferences are the following:

1. There's a huge gap between the usage of the first and the second languages, which may mean JavaScript is used almost twice as much as Python in real projects.

2. The list of user-language combos shows which combinations of languages may land you a job, provided you know them well.

3. The languages that are not on the valid list seem to show that if a language doesn't play a big role (mean) in a user's average project, it can't be counted on in times of crisis.

4. The languages used by the tech giants are among the most relevant in the industry. The pie charts above suggest a clear relationship with the languages people use to write software on GitHub.

5. The difference observed between the tech giants' Python share and GitHub's is probably due to the advent of data science, because we've taken only the front-end and back-end languages from the tech giants' dataset.

We can conclude that our analysis has been successful in highlighting the trend in usage of these programming languages, and which languages we can pick up or brush up on at this point in time.

In [ ]:
import jovian
In [ ]:
jovian.commit()

References and Future Work

TODO

In [ ]:
import jovian
In [ ]:
jovian.commit()