Note: This is a mirror of a project by leadingindia interns to showcase Jovian.

Intro to the project:

Video Introduction: Stock Market Prediction (Video Link)
Project Description (By Original Author):

The project performs sentiment analysis on the Yelp Academic Dataset to predict stock market prices.

Benefits of using Jovian:
  • Requirements: The original code repository doesn't list the requirements needed to run the code. Setting up Keras and the other libraries can take several iterations of guesswork and a lot of effort.

  • Setup issues: Different framework versions can conflict with the current server setup. In this case, Keras required CUDA 9.x while the system was set up with the latest CUDA 10.x, which required a lot of debugging and eventually a complete re-installation. Once the code is pushed to Jovian, all of the dependencies are handled by Jovian and the complete installation is a one-click process.

  • Experiments: The GitHub repository shows several different notebooks, which can be confusing, and the authors ran multiple experiments that aren't well documented. Jovian allows hosting multiple versions of the experiments and comparing their results, which also helps communicate the effort that went into the project. (A single simple notebook, for example, may not convey the work required for a 1-month project.)

  • Replication: Jovian also lets us host the dataset along with the output pickle files from the experiment. This saves the time required to re-train the model: one can simply run the notebook and perform inference (see the sketch below).
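For example, if the trained model and the fitted vectorizer are hosted alongside the notebook as pickle files, inference becomes a simple load-and-predict step. This is only a sketch; the file names 'vectorizer.pkl' and 'model.pkl' are assumptions, not the actual artefacts hosted with this project:

import pickle

# Load the previously fitted vectorizer and classifier (hypothetical file names).
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Run inference directly, without re-training.
print(model.predict(vectorizer.transform(["this product was a great video game"])))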

The setup and how-to instructions are given below:

System setup

Jovian makes it easy to share Jupyter notebooks on the cloud by running a single command directly within Jupyter. It also captures the Python environment and libraries required to run your notebook, so anyone (including you) can reproduce your work.

Option 1: Run Online:
  • At the top of the notebook you can find one-click "run online" buttons for:
    • Run on MyBinder
    • Run on Colab
    • Run on Kaggle Kernels
Option 2: Run on Local Machine:

Here's what you need to do to get started:

Install Anaconda by following the instructions given here. You might also need to add the Anaconda binaries to your system PATH to be able to run the conda command-line tool. Install the jovian Python library by running the following command (without the $) in your Mac/Linux terminal or Windows command prompt:

pip install jovian --upgrade

Download the notebook for this tutorial using the jovian clone command:

$ jovian clone <notebook_id>

(You can get the notebook_id by clicking the 'Clone' button at the top of this page on https://jvn.io)

Running the clone command creates a directory StockMarketPredictions containing a Jupyter notebook and an Anaconda environment file.

$ ls StockMarketPredictions

Now we can enter the directory and install the required Python libraries (Jupyter, PyTorch etc.) with a single command using jovian:

$ cd StockMarketPredictions
$ jovian install

Jovian reads the environment.yml file, identifies the right dependencies for your operating system, creates a virtual environment with the name given in the file, and installs all the required libraries inside that environment to avoid modifying your system-wide installation of Python. It uses conda internally. If you face issues with jovian install, try running conda env update instead.
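For reference, an environment.yml file looks roughly like the sketch below. This is only an illustration of the format jovian reads; the actual file cloned with this project may list different packages and versions:

name: StockMarketPredictions
channels:
  - defaults
dependencies:
  - python=3.7
  - jupyter
  - pandas
  - numpy
  - scikit-learn
  - pip:
      - jovian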

We can activate the virtual environment by running

$ conda activate <env_name>

For older installations of conda, you might need to run the command: source activate <env_name>

Once the virtual environment is active, we can start Jupyter by running

$ jupyter notebook

You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser.

Getting the Dataset:

Download from Dropbox

In [12]:
#Uncomment these commands to download the dataset:
#!wget https://www.dropbox.com/s/cf3e09xinznbgrb/data.zip?dl=0
#!mv data.zip?dl=0 data.zip
#!unzip data.zip
In [11]:
import jovian
import json as j
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer

Read and transform the JSON file

In [5]:
json_data = None
with open('yelp_academic_dataset_review.json') as data_file:
    # Each line of the file is a separate JSON object, so join the lines
    # with commas and wrap them in brackets to form a single JSON array.
    lines = data_file.readlines()
    joined_lines = "[" + ",".join(lines) + "]"

    json_data = j.loads(joined_lines)

data = pd.DataFrame(json_data)
print(data.head())
              business_id        date               review_id  stars  \
0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5
1  ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA      5
2  6oRAC4uyJCsJl1X0WZpVSA  2012-06-14  IESLBzqUCLdSzSqm0eCSxQ      4
3  _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA      5
4  6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw      5

                                                text    type  \
0  My wife took me here on my birthday for breakf...  review
1  I have no idea why some people give bad review...  review
2  love the gyro plate. Rice is so good and I als...  review
3  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review
4  General Manager Scott Petello is a good egg!!!...  review

                  user_id                                 votes
0  rLtl8ZkDX5vH5nAx9C3q5Q  {'funny': 0, 'useful': 5, 'cool': 2}
1  0a2KyEL0d3Yb1V6aivbIuQ  {'funny': 0, 'useful': 0, 'cool': 0}
2  0hT2KtfLiobPvh6cDC8JQg  {'funny': 0, 'useful': 1, 'cool': 0}
3  uZetl9T0NcROGOyFfughhg  {'funny': 0, 'useful': 2, 'cool': 1}
4  vYmM4KTsC8ZfQBg-j5MWkw  {'funny': 0, 'useful': 0, 'cool': 0}
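As a side note, the same result can be obtained more directly, because the Yelp dump is newline-delimited JSON, which pandas can parse natively. This is an alternative, not the approach used in this notebook:

# Alternative: let pandas parse the newline-delimited JSON file directly.
data = pd.read_json('yelp_academic_dataset_review.json', lines=True)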

Prepare the data

In [6]:
# Drop neutral (3-star) reviews, then label 4- and 5-star reviews as positive sentiment.
data = data[data.stars != 3]
data['sentiment'] = data['stars'] >= 4
print(data.head())
              business_id        date               review_id  stars  \
0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5
1  ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA      5
2  6oRAC4uyJCsJl1X0WZpVSA  2012-06-14  IESLBzqUCLdSzSqm0eCSxQ      4
3  _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA      5
4  6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw      5

                                                text    type  \
0  My wife took me here on my birthday for breakf...  review
1  I have no idea why some people give bad review...  review
2  love the gyro plate. Rice is so good and I als...  review
3  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review
4  General Manager Scott Petello is a good egg!!!...  review

                  user_id                                 votes  sentiment
0  rLtl8ZkDX5vH5nAx9C3q5Q  {'funny': 0, 'useful': 5, 'cool': 2}       True
1  0a2KyEL0d3Yb1V6aivbIuQ  {'funny': 0, 'useful': 0, 'cool': 0}       True
2  0hT2KtfLiobPvh6cDC8JQg  {'funny': 0, 'useful': 1, 'cool': 0}       True
3  uZetl9T0NcROGOyFfughhg  {'funny': 0, 'useful': 2, 'cool': 1}       True
4  vYmM4KTsC8ZfQBg-j5MWkw  {'funny': 0, 'useful': 0, 'cool': 0}       True
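Dropping the 3-star reviews leaves a binary sentiment label. A quick check of the resulting class balance can be done as follows (an optional sanity check, not part of the original notebook):

# Count positive (stars >= 4) vs. negative (stars <= 2) reviews.
print(data['sentiment'].value_counts())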

Build the model

In [7]:
# Hold out 20% of the reviews for evaluation.
X_train, X_test, y_train, y_test = train_test_split(data, data.sentiment, test_size=0.2)

# Convert the review text into token counts, then re-weight them with TF-IDF.
count = CountVectorizer()
temp = count.fit_transform(X_train.text)

tdif = TfidfTransformer()
temp2 = tdif.fit_transform(temp)

# Fit a logistic regression classifier on the TF-IDF features.
text_regression = LogisticRegression()
model = text_regression.fit(temp2, y_train)

# Apply the already-fitted transformations to the test set (no re-fitting).
prediction_data = tdif.transform(count.transform(X_test.text))

predicted = model.predict(prediction_data)
/home/init27/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
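The FutureWarning above is raised because scikit-learn is changing its default solver; it can be silenced by specifying a solver explicitly when constructing the classifier, for example:

# Specify the solver explicitly to silence the FutureWarning.
text_regression = LogisticRegression(solver='lbfgs')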

Alternative approach:

Instead of doing all these steps separately, one could also use the following pipeline:

from sklearn.pipeline import Pipeline
text_regression = Pipeline([('count', CountVectorizer()),
                            ('tfidf', TfidfTransformer()),
                            ('reg', LogisticRegression())])
model = text_regression.fit(X_train.text, y_train)
predicted = model.predict(X_test.text)

Generate Predictions

In [9]:
# Accuracy on the held-out test set.
print(np.mean(predicted == y_test))

# Predict the sentiment of a new, unseen piece of text.
print(model.predict(tdif.transform(count.transform(["this product was a great video game"]))))
0.9424811740214346
[ True]
In [ ]:
jovian.commit()
[jovian] Saving notebook..