
DATA ANALYSIS : Automobile Dataset



Problem

Let's say we have a friend named Tom, and Tom wants to sell his car. The problem is that he doesn't know how much he should sell it for. Tom wants to get as much as he can, but he also wants to set the price reasonably so that someone will want to purchase it. The price he sets should therefore represent the value of the car.

How can we help Tom determine the best price for his car? Let's think like data scientists and clearly define some of his problems. For example, is there data on the prices of other cars and their characteristics? What features of cars affect their prices? Color? Brand? Does horsepower also affect the selling price, or perhaps something else? As data analysts or data scientists, these are some of the questions we can start thinking about. To answer them, we're going to need some data.

The final model has an efficiency of 84%; its performance graph is shown below.

DistributionPlot(y_test, yhat, "Actual Values (Test)", "Predicted Values (Test)", Title)
/opt/conda/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).
  warnings.warn(msg, FutureWarning)
[Figure: distribution plot of actual values (test) versus predicted values (test)]

TABLE OF CONTENTS


  1. Data Acquisition
  2. Identify and handle missing values
  3. Data Standardization
  4. Data Normalization
  5. Binning
  6. Analyzing Individual Feature Patterns using Visualization
  7. Model Development
    • Linear Regression and Multiple Linear Regression
    • Model Evaluation using Visualization
    • Polynomial Regression and Pipeline
    • Measures for Insample Evaluation
    • Prediction and Decision Making
  8. Model Evaluation and Refinement
  9. Conclusion
  10. Reference

1. Data Acquisition

Datasets come in various formats, such as .csv, .json, and .xlsx, and they can be stored in different places: on your local machine or online. In our case, the Automobile Dataset comes from an online source and is in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.


The pandas library is a useful tool that enables us to read various datasets into a data frame, so all we need to do is import pandas. If the import raises an error, install pandas first.

We use the pandas.read_csv() function to read the CSV file. Inside the parentheses, we put the file path in quotation marks so that pandas reads the file at that address into a data frame. The file path can be either a URL or a local file address.
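As a minimal sketch of that call, the example below writes a few made-up rows to a temporary local file and reads them back with read_csv(). The rows, the variable names, and the temporary file are assumptions for illustration only; they are not the real Automobile Dataset or its address. A URL string would be passed in the same position as the local path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for the real dataset (no header row).
sample_rows = "3,alfa-romero,gas,13495\n1,audi,gas,13950\n2,bmw,gas,16430\n"

# Write the sample to a temporary local file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(sample_rows)
    file_path = f.name

# The path is passed as a quoted string; a URL string is passed the same way.
df_default = pd.read_csv(file_path)
os.remove(file_path)  # clean up the temporary file

# With the default settings, the first data row is consumed as column names.
print(df_default.columns.tolist())
print(df_default.shape)
```

Note that with the default settings only two of the three rows end up as data, because the first row is treated as the header.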

Because the data does not include headers, we can pass the argument header=None to read_csv() so that pandas does not automatically use the first row as the header.

You can assign the resulting data frame to any variable you create.
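The following sketch shows what header=None does, assuming a small in-memory sample in place of the real file. The sample rows and the variable name df are hypothetical; io.StringIO is used only so the example runs without a file or a network connection.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real dataset (no header row).
sample_rows = "3,alfa-romero,gas,13495\n1,audi,gas,13950\n2,bmw,gas,16430\n"

# header=None tells pandas that the file has no header row, so every line
# is kept as data and the columns receive integer labels 0, 1, 2, ...
df = pd.read_csv(io.StringIO(sample_rows), header=None)

print(df.shape)             # (3, 4): all three rows are kept as data
print(df.columns.tolist())  # [0, 1, 2, 3]
print(df.head())
```

Here the data frame is assigned to df, but any variable name works.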