Linear Regression with Scikit Learn - Machine Learning with Python

The following topics are covered in this tutorial:

  • A typical problem statement for machine learning
  • Downloading and exploring a dataset for machine learning
  • Linear regression with one variable using Scikit-learn
  • Linear regression with multiple variables
  • Using categorical features for machine learning
  • Regression coefficients and feature importance
  • Other models and techniques for regression using Scikit-learn
  • Applying linear regression to other datasets

Problem Statement

This tutorial takes a practical and coding-focused approach. We'll define the terms machine learning and linear regression in the context of a problem, and later generalize their definitions. We'll work through a typical machine learning problem step-by-step:

QUESTION: ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. As the lead data scientist at ACME, you're tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, number of children, smoking habits and region of residence.

Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer. Due to regulatory requirements, you must be able to explain why your system outputs a certain prediction.

You're given a CSV file containing verified historical data, consisting of the aforementioned information and the actual medical charges incurred by over 1300 customers.

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets

Downloading the Data

To begin, let's download the data using the urlretrieve function from urllib.request.

medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'
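
As a minimal sketch of the download step, the helper below wraps urlretrieve so that the file is fetched only once. The helper name download_if_missing and the local filename medical.csv are assumptions made for illustration, not names fixed by the dataset or by urllib.

```python
from pathlib import Path
from urllib.request import urlretrieve

medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists.

    Hypothetical helper: returns the local path as a pathlib.Path.
    """
    path = Path(filename)
    if not path.exists():
        # urlretrieve(url, filename) copies the remote resource to a local file
        urlretrieve(url, str(path))
    return path


# Uncomment to fetch the dataset (requires network access).
# 'medical.csv' is an assumed local filename.
# download_if_missing(medical_charges_url, 'medical.csv')
```

Skipping the download when the file is already present keeps repeated notebook runs from fetching the same CSV again.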