
Exploratory Analysis - Walmart's Dataset

This notebook aims to apply everything needed for data preprocessing, a task that consumes on average about 60% of the time a data scientist dedicates to a Machine Learning project.

The main tools (libraries) required to clean the data, explore it, and extract some insights are:

  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn

I will be using these libraries throughout the analysis. I will also do my best to describe in detail what is happening at each step, making the code easy to read for people starting in the field.
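As a quick sketch, a typical first code cell imports these libraries under their conventional aliases (assuming they are all installed in your environment):

```python
import numpy as np               # numerical arrays and vectorized math
import pandas as pd              # tabular data loading and manipulation
import matplotlib.pyplot as plt  # low-level plotting
import seaborn as sns            # statistical plots built on Matplotlib

# A common notebook setup step: apply Seaborn's default theme to all plots.
sns.set_theme()
```

These aliases (`np`, `pd`, `plt`, `sns`) are community conventions, not requirements, but using them makes the code instantly familiar to other readers.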

Regarding the data, it was part of a recruiting competition to get hired by Walmart. Therefore, if you are a data science enthusiast, following the notebook will give you a good sense of what it takes to jump into the field.

It is time to give the data as much context as possible, which is crucial to understand its real meaning and how to work with it. Hence, let's add some context:

1) The dataset comprises 45 stores from different regions.

2) It has 4 files, each in tabular Comma-Separated Values ('.csv') format: stores.csv, train.csv, test.csv, and features.csv.

3) What does each file store?

a. stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

b. train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

    - Store - the store number
    - Dept - the department number
    - Date - the week
    - Weekly_Sales -  sales for the given department in the given store
    - IsHoliday - whether the week is a special holiday week
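Since the Date field holds week-ending dates, it pays to parse it as a real datetime when loading. A hedged sketch of what that looks like, using a tiny inline sample in place of the real train.csv (the rows here are illustrative):

```python
import io
import pandas as pd

# A small inline sample standing in for train.csv (illustrative rows).
sample = io.StringIO(
    "Store,Dept,Date,Weekly_Sales,IsHoliday\n"
    "1,1,2010-02-05,24924.50,False\n"
    "1,1,2010-02-12,46039.49,True\n"
)

# parse_dates turns the Date column into datetime64, which makes
# week-level operations (sorting, resampling, holiday lookups) easy.
train = pd.read_csv(sample, parse_dates=["Date"])
print(train.dtypes)
```

With the real file you would simply pass the path, e.g. `pd.read_csv("train.csv", parse_dates=["Date"])`.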
    

c. test.csv (we won't be using this file this time)

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
  

d. features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

    - Store - the store number
    - Date - the week
    - Temperature - average temperature in the region
    - Fuel_Price - cost of fuel in the region
    - MarkDown1-5 - anonymized data related to promotional markdowns that
    Walmart is running. MarkDown data is only available after Nov 2011, and
    is not available for all stores all the time. Any missing value is
    marked with an NA.
    - CPI - the consumer price index
    - Unemployment - the unemployment rate
    - IsHoliday - whether the week is a special holiday week
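Since the MarkDown columns are only available after Nov 2011 and missing values are marked with NA, they need explicit handling at load time. One common approach, sketched below on a hypothetical two-row slice of features.csv, is to fill missing markdowns with 0 under the assumption that a missing value means no promotion ran that week (an assumption on my part, not something the dataset description guarantees):

```python
import io
import pandas as pd

# Hypothetical slice of features.csv with a missing MarkDown value.
sample = io.StringIO(
    "Store,Date,Temperature,Fuel_Price,MarkDown1,CPI,Unemployment,IsHoliday\n"
    "1,2010-02-05,42.31,2.572,NA,211.096,8.106,False\n"
    "1,2011-11-11,59.11,3.297,6766.44,218.467,7.866,False\n"
)

# read_csv treats "NA" as a missing value (NaN) by default.
features = pd.read_csv(sample, parse_dates=["Date"])

# Fill missing markdowns with 0, i.e. "no promotion that week".
features["MarkDown1"] = features["MarkDown1"].fillna(0)
```

Whether 0 is the right fill value depends on the model you plan to feed the data into; keeping the NaNs and letting the model handle them is a valid alternative.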
    
For convenience, the four holidays fall within the following weeks in 
the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
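Because these holiday dates are fixed, flagging holiday weeks in the data reduces to a membership check against the Date column. A minimal sketch for the Super Bowl dates listed above (the other three holidays work the same way):

```python
import pandas as pd

# The Super Bowl week-ending dates listed above, as pandas timestamps.
super_bowl = pd.to_datetime(
    ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"]
)

# Given a column of week-ending dates, isin() flags Super Bowl weeks.
dates = pd.Series(pd.to_datetime(["2010-02-12", "2010-02-19"]))
is_super_bowl = dates.isin(super_bowl)
print(is_super_bowl.tolist())  # [True, False]
```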

**The above is based on the description from the original source, which you can find** [**here**](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data)

As you might notice, we have more than one source to get data from; that is an inkling that we are going to merge data in order to get more in-depth insights. In case you are not familiar with what merging means, don't worry, I'm going to expound on it later on.
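As a preview of that merging, here is a hedged sketch using tiny stand-in frames (the values are made up, but the keys match the real schema): training rows are keyed by Store, Dept, and Date; store metadata joins on Store alone; and the regional features join on Store and Date together.

```python
import pandas as pd

# Tiny stand-ins for the three tables (hypothetical values).
train = pd.DataFrame({"Store": [1, 1], "Dept": [1, 2],
                      "Date": pd.to_datetime(["2010-02-05"] * 2),
                      "Weekly_Sales": [24924.50, 50605.27]})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})
features = pd.DataFrame({"Store": [1],
                         "Date": pd.to_datetime(["2010-02-05"]),
                         "Temperature": [42.31]})

# Left merges keep every training row while attaching store metadata
# (keyed on Store) and weekly regional data (keyed on Store and Date).
merged = (train
          .merge(stores, on="Store", how="left")
          .merge(features, on=["Store", "Date"], how="left"))
print(merged.shape)  # 2 rows, 7 columns
```

The `how="left"` choice matters: it guarantees no training rows are dropped even if a store or week is missing from the other files.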

A great resource for building a foundation in the libraries I mentioned before is the course.

I think we now have enough context about the data, so let's begin exploring it!

```python
project_name = "from-zero-to-pandas"
!pip install jovian --upgrade -q
import jovian
jovian.commit(project=project_name)
```

```
[jovian] Attempting to save notebook..
[jovian] Updating notebook "emilio-garcia-ie/from-zero-to-pandas" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/emilio-garcia-ie/from-zero-to-pandas
```