Data for the month of May 2019

Exploratory data analysis and prediction using simple scikit-learn models are carried out.

Libraries such as NumPy, Pandas, Matplotlib and Seaborn are imported to handle the data and extract statistical parameters.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

The daily .jsonl files are read in from the system, one file per day of May 2019.

In [4]:
df_array = []
for i in range(1, 32):
    # zero-pad the day so 1 May reads '2019-05-01.jsonl'
    df_array.append(pd.read_json(f'2019-05-{i:02d}.jsonl', lines=True))
df = pd.concat(df_array, ignore_index=True, sort=True)
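
As an aside, the same loading can be written more compactly with glob (a sketch, assuming the daily .jsonl files sit in the working directory):

# Sketch: load all daily dumps at once via a filename pattern
import glob
files = sorted(glob.glob('2019-05-*.jsonl'))
df = pd.concat((pd.read_json(f, lines=True) for f in files),
               ignore_index=True, sort=True)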

The merged DataFrame can be written out to a single .csv file; df.head() gives a first look at the data.

In [5]:
df.to_csv('may_2019.csv')  # output filename is illustrative
df.head()
Out[5]:

df has 409,083 rows, and none of its columns has missing values. The dataset describes the courses undertaken by users in May 2019: for each entry it records the provider, the course name (spec) and a timestamp.

In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 409083 entries, 0 to 409082
Data columns (total 6 columns):
provider     409083 non-null object
schema       409083 non-null object
spec         409083 non-null object
status       409083 non-null object
timestamp    409083 non-null datetime64[ns]
version      409083 non-null int64
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 18.7+ MB

The total number of unique projects undertaken is 6,180.

In [14]:
df.spec.nunique()
Out[14]:
6180

Finding the most popular project/course: the ipython/ipython-in-depth course was launched 183,362 times, followed by the JupyterLab demo with 34,704 entries.

In [13]:
most_popular_projects = df.groupby(['spec'])['spec'].count()
most_popular_projects.nlargest(10)
Out[13]:
spec
ipython/ipython-in-depth/master              183362
jupyterlab/jupyterlab-demo/master             34704
DS-100/textbook/master                        21773
ines/spacy-io-binder/live                     18498
bokeh/bokeh-notebooks/master                   8916
ines/spacy-course/binder                       6108
binder-examples/r/master                       5703
binder-examples/requirements/master            5402
rationalmatter/juno-demo-notebooks/master      5153
QuantStack/xeus-cling/stable                   4512
Name: spec, dtype: int64
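
The same ranking can be obtained in one step with value_counts, which sorts in descending order by default:

# Equivalent one-liner for ranking projects by launch count
df['spec'].value_counts().head(10)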

Finding the different providers of the courses.

In [43]:
df['provider'].unique()
Out[43]:
array(['GitHub', 'GitLab', 'Git', 'Gist'], dtype=object)

How many projects were done from each particular source?

In [17]:
source = df.groupby(['provider', 'spec'])['spec'].count()
source.head()
Out[17]:
provider  spec                                                  
Gist      AllenDowney/3e0ee50e828cb3a4bc2a720797bb303c/master       1
          AustinRochford/505e6a3647c57dbe4bd55a4c311a2a95/master    2
          AustinRochford/62c283a3f0fae90b5e39/master                1
          BadreeshShetty/bf9cb1dced8263ef997bcb2c3926569b/master    3
          BenLangmead/6513059/master                                1
Name: spec, dtype: int64

GitHub is by far the most common provider, with 404,286 entries; Git is the least used.

In [19]:
popular_source = df.groupby(['provider'])['provider'].count()
df_source = pd.DataFrame(popular_source)
df_source
Out[19]:

Distribution of sources/providers

In [13]:
# data visualization using matplotlib bar graph for the no. of projects from different sources
plt.bar(df_source.index, df_source['provider'], color = 'g')
plt.yscale('log')
plt.title('Distribution of Sources')
plt.xlabel('Source Name')
plt.ylabel('Projects Undertaken (log scale)')
Out[13]:
Text(0,0.5,'Projects Undertaken (log scale)')
Notebook Image

For a more in-depth analysis, the timestamp is split into separate date and time columns.

In [27]:
# splitting timestamp into date and time in different columns
df['new_date'] = [d.date() for d in df['timestamp']]
df['new_time'] = [d.time() for d in df['timestamp']]
df.head()
Out[27]:
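
Since timestamp is a datetime64 column, the same split can also be done with the vectorised .dt accessor, which avoids the Python-level loops:

# Vectorised alternative to the list comprehensions above
df['new_date'] = df['timestamp'].dt.date
df['new_time'] = df['timestamp'].dt.time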

The maximum number of projects was done on the 22nd of May, 2019.

In [44]:
# top 5 days when the maximum no. of projects were done
most_popular_date = df.groupby(['new_date'])['new_date'].count()
most_popular_date.nlargest()
Out[44]:
new_date
2019-05-22    18627
2019-05-08    17393
2019-05-14    17309
2019-05-09    16657
2019-05-13    16635
Name: new_date, dtype: int64

The average number of projects done every day is about 13,196.

In [56]:
df_date = pd.DataFrame(most_popular_date)
df_date.mean()
Out[56]:
new_date    13196.225806
dtype: float64

The most common times at which a project is started fall between 1300 and 1500 hours.

In [48]:
# most common time when the projects start
most_popular_time = df.groupby(['new_time'])['new_time'].count()
most_popular_time.nlargest(5)
Out[48]:
new_time
14:50:00    497
13:41:00    481
14:40:00    479
09:30:00    477
13:42:00    477
Name: new_time, dtype: int64

The average number of projects undertaken every minute on a given day is about 9.

In [59]:
df_time = pd.DataFrame(most_popular_time)
print(df_time.mean())
print(df_time.mean()/31)
new_time    284.085417
dtype: float64
new_time    9.164046
dtype: float64
In [30]:
# splitting day and time further into year, month, date; hour, minute and second
df['time_str'] = df['new_time'].astype(str)
df['date_str'] = df['new_date'].astype(str)
df[['hour', 'minute', 'seconds']] = df.time_str.str.split(':', expand = True).astype(float)
df[['year', 'month', 'date']] = df.date_str.str.split('-', expand = True).astype(float)
df.head()
Out[30]:

Finding the most common hour to start a project: 1300 hours.

In [54]:
# most common hour when the project starts
most_popular_hour = df.groupby(['hour'])['hour'].count()
most_popular_hour.nlargest(5)
Out[54]:
hour
13.0    24603
14.0    24324
12.0    22764
9.0     22713
15.0    22112
Name: hour, dtype: int64

The average number of projects done every hour is around 550.

In [60]:
df_hour = pd.DataFrame(most_popular_hour)
df_hour.mean()/31
Out[60]:
hour    549.842742
dtype: float64
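
As a quick sanity check, the same figure follows directly from the total row count divided by the number of hours in the month:

# 409,083 launches spread over 31 days x 24 hours
len(df) / (31 * 24)   # ≈ 549.84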

Representing the number of projects undertaken in each hour across the whole month.

In [31]:
# data visualisation on the basis of the hour in which the project was done
df.hist(column = 'hour', bins = 24, rwidth = 0.9, color = 'y')
plt.xlabel('Time (in hours)')
plt.ylabel('No. of projects undertaken')
plt.title('Distribution on the basis of time')
Out[31]:
Text(0.5,1,'Distribution on the basis of time')
Notebook Image

The maximum number of projects was done between 1300 and 1400 hours.

In [32]:
# most common hour
df['hour'].mode()
Out[32]:
0    13.0
dtype: float64

The maximum traffic was on the 22nd of May, 2019.

In [33]:
# most popular date of May, 2019
df['date'].mode()
Out[33]:
0    22.0
dtype: float64

Representing the number of projects undertaken on each date across the whole month.

In [34]:
#data visualisation on the basis of the date of May, 2019
df.hist(column = 'date', bins = 30, rwidth = 0.9, color = 'y')
plt.xlabel('Date')
plt.ylabel('No. of Projects undertaken')
plt.title('Distribution on the basis of date')
Out[34]:
Text(0.5,1,'Distribution on the basis of date')
Notebook Image

Another way to represent the number of projects, using the Seaborn library.

In [37]:
# using seaborn tools to visualise the data on the basis of time of the day
sns.distplot(df.hour)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x2851fb9ae48>
Notebook Image
In [38]:
# using seaborn tools to visualise the data on the basis of the date of the month
sns.distplot(df.date)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2852068fd30>
Notebook Image

To extract statistical parameters from the data, the string values of provider and spec need to be converted into integer codes.

In [61]:
# Encode provider as integer codes; keep the original labels for decoding later
df['provider'], provider_levels = pd.factorize(df['provider'])
In [62]:
# Encode spec the same way
df['spec'], spec_levels = pd.factorize(df['spec'])
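
The uniques returned by pd.factorize double as a code-to-label lookup, so integer codes can be translated back to the original strings when reading results later (a minimal sketch):

# Decode an integer code back to its original spec string
spec_levels[df.loc[0, 'spec']]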

There is not much variation in the distribution across particular dates and hours.

In [80]:
pd.crosstab(df.hour, df.date, margins = True).style.background_gradient(cmap='autumn_r')
Out[80]:

Testing simple Decision Tree and Random Forest models from scikit-learn.

In [82]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from math import sqrt

First we define the target and the features: the hour is predicted, keeping spec, provider and date as features. (Y, which also includes hour, is used later for the correlation matrix.)

In [76]:
y = df.hour
features = ['spec', 'provider', 'date']
X = df[features]
features_1 = ['spec', 'provider', 'date', 'hour']
Y = df[features_1]

We apply a train/test split to get a first overview of model performance; the model predicts the hour at which a project from the test data will be started.

In [77]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
df_model = RandomForestRegressor(random_state = 1)
df_model.fit(train_X, train_y)
df_predictions = df_model.predict(val_X)
print(df_predictions)
[ 9.33153356 5.15238095 11.48847717 ... 10.41370788 19.045 11.897042 ]

The accuracy of the model is assessed by computing the mean absolute error, which comes out to about 4.73, and the RMSE, about 5.83.

In [84]:
df_mae = mean_absolute_error(df_predictions, val_y)
print(df_mae)
df_mse = mean_squared_error(df_predictions, val_y)
print(df_mse)
rmse = sqrt(df_mse)
print(rmse)
4.727974998281183
33.988401476446334
5.829957244821469
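
To judge whether these errors are meaningful, it helps to compare against a trivial baseline that always predicts the mean training hour; if the forest barely beats this constant predictor, the features carry little signal about the hour (a sketch reusing the variables above):

# Baseline: always predict the mean hour observed in training
baseline = np.full(len(val_y), train_y.mean())
print(mean_absolute_error(val_y, baseline))
print(sqrt(mean_squared_error(val_y, baseline)))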

The correlation matrix confirms that there is not much correlation between the hour and the other parameters of the data.

In [39]:
print(Y.corr())
              spec  provider      date      hour
spec      1.000000  0.098441  0.146765  0.042911
provider  0.098441  1.000000 -0.048344  0.011894
date      0.146765 -0.048344  1.000000 -0.023311
hour      0.042911  0.011894 -0.023311  1.000000

The Decision Tree model predicts the hour with a mean absolute error and RMSE similar to those of the Random Forest model.

In [86]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
df_DT_model = DecisionTreeRegressor(random_state = 1)
df_DT_model.fit(train_X, train_y)
df_DT_predictions = df_DT_model.predict(val_X)
print(df_DT_predictions)
[ 9.23809524 5. 11.49863636 ... 10.40383704 20. 11.88690023]
In [87]:
df_DT_mae = mean_absolute_error(df_DT_predictions, val_y)
print(df_DT_mae)
df_DT_mse = mean_squared_error(df_DT_predictions, val_y)
print(df_DT_mse)
rmse_DT = sqrt(df_DT_mse)
print(rmse_DT)
4.747081046709708
34.65872756782684
5.887166344501133

Summary of the Data:

  1. Most common provider - GitHub
  2. Most common project - ipython/ipython-in-depth/master
  3. Average no. of projects done every day - about 13,196
  4. Average no. of projects done every hour - about 550
  5. The best time for a source to offer a course, so as to reach the maximum number of users, is between 1300 and 1400 hours
  6. The hour at which a course is started can be predicted with the Random Forest Regressor, with a root mean square error of about 5.83 (5.89 for the Decision Tree)