
Using Machine Learning Tools Assignment 1

Overview

In this assignment, you will apply some popular machine learning techniques to the problem of predicting bike rental demand. A data set has been provided containing records of bike rentals in Seoul, collected during 2017-18.

The main aims of the prac are:

  • to practice using tools for loading and viewing data sets;
  • to visualise data in several ways and check for common pitfalls;
  • to plan a simple experiment and prepare the data accordingly;
  • to run your experiment and to report and interpret your results clearly and concisely.

This assignment relates to the following ACS CBOK areas: abstraction, design, hardware and software, data and information, HCI and programming.

General instructions

This assignment is divided into several tasks. Use the spaces provided in this notebook to answer the questions posed in each task. Note that some questions require writing a small amount of code, some require graphical results, and some require comments or analysis as text. It is your responsibility to make sure your responses are clearly labelled and your code has been fully executed (with the correct results displayed) before submission!

Do not manually edit the data set file we have provided! For marking purposes, it's important that your code is written to run correctly on the original data file.

When creating graphical output, label it clearly, with titles, x-axis labels and y-axis labels as appropriate.
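For example, a minimal sketch of a clearly labelled figure (the data and label text below are placeholders, not taken from the assignment data set):

# Minimal labelling sketch using placeholder data (not the assignment data)
import matplotlib.pyplot as plt

hours = list(range(24))                      # placeholder x values
counts = [abs(12 - h) * 10 for h in hours]   # placeholder y values

plt.figure()
plt.plot(hours, counts)
plt.title('Example: rentals by hour')        # descriptive title
plt.xlabel('Hour of day')                    # x-axis label
plt.ylabel('Rented bike count')              # y-axis label
plt.show()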

Most of the tasks in this assignment only require writing a few lines of code! One goal of the assignment is to explore sklearn, pandas, matplotlib and other libraries that you will find useful throughout the course, so feel free to use the functions they provide. You are expected to search for and carefully read the documentation of any functions you use, to ensure you are using them correctly.

Chapter 2 of the reference book follows a similar workflow to this prac, so you may look there for further background and ideas. You may also use any other relevant general resources on the internet, but do not use material that directly addresses these questions with this dataset (which would normally only be found in someone else's assignment answers). If you take a large portion of code or text from the internet, you should reference where it was taken from, but we do not expect references for small pieces of code, such as snippets from documentation, blogs or tutorials. Taking and adapting small portions of code is expected and is common practice when solving real problems.

The following code imports some of the essential libraries that you will need. You should not need to modify it, but you are expected to import other libraries as needed.

# Python ≥3.5 is required
import sys

import numpy
from sklearn.impute import SimpleImputer
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

assert sys.version_info >= (3, 5)

import sklearn
assert sklearn.__version__ >= "0.20"

import pandas as pd
assert pd.__version__ >= "1.0"

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Step 1: Loading and initial processing of the dataset (20%)

Download the data set from MyUni using the link provided on the assignment page. A paper that describes one related version of this dataset is: Sathishkumar V E, Jangwoo Park, and Yongyun Cho. 'Using data mining techniques for bike sharing demand prediction in metropolitan city.' Computer Communications, Vol.153, pp.353-366, March, 2020. Feel free to look at this if you want more information about the dataset.

The data is stored in a CSV (comma-separated values) file and contains the following information:

  • Date: year-month-day
  • Rented Bike Count: Count of bikes rented at each hour
  • Hour: Hour of the day
  • Temperature: Temperature in Celsius
  • Humidity: %
  • Windspeed: m/s
  • Visibility: 10m
  • Dew point temperature: Celsius
  • Solar radiation: MJ/m2
  • Rainfall: mm
  • Snowfall: cm
  • Seasons: Winter, Spring, Summer, Autumn
  • Holiday: Holiday/No holiday
  • Functioning Day: NoFunc (non-functional hours), Fun (functional hours)

Load the data set from the csv file into a DataFrame, and summarise it with at least two appropriate pandas functions.

### Your code here
# Load the data using a pandas function
dataset = pd.read_csv('SeoulBikeData.csv')

# Summarise the data with two pandas functions; display() is used so that the
# describe() table is shown even though it is not the last expression in the cell
display(dataset.describe())
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature (C)            8760 non-null   float64
 4   Humidity (%)               8760 non-null   int64  
 5   Wind speed (m/s)           8759 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature (C)  8759 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8758 non-null   object 
 10  Snowfall (cm)              8760 non-null   object 
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object 
dtypes: float64(4), int64(4), object(6)
memory usage: 958.2+ KB
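The non-null counts above show that a few columns contain missing entries. As an illustrative follow-up (a hypothetical extra check, not part of the provided code), the number of missing values in each column could be counted with pandas, for example:

# Hypothetical extra check: count missing values in each column
print(dataset.isnull().sum())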