Learn data science and machine learning by building real-world projects on Jovian
In [1]:
from PIL import Image
Pandas=Image.open('/Users/Laptop Point/Desktop/jovian_image/pandas.png')
Pandas
Out[1]:
Notebook Image

Playing With Pandas

This notebook is dedicated to the pandas library of python. The main objective of creating this notebook is to learn the basics of pandas library and slowly and steadily move to advanced level of pandas so as to get the skills of data manipulation and further analysing so as to visualise it with the other libraries of python such as Matplotlib and seaborn.

Let's begin with importing a simple dataset

During importing a dataset the data set might be in csv, xlsx or any other format. Here, we will see how to import csv and xlsx format.

First begin with importing pandas

In [2]:
import pandas as pd

Importing and reading .csv datafile into pandas dataframe

To import and read a csv data, Use the fuction pd.read_csv() with the directory of the data as an argument in string.

In [3]:
data_csv=pd.read_csv('/Users/Laptop Point/Desktop/data.csv')

Above pd.read_csv function converts the data into pandas dataframe and store in given variable as in here it is data_csv that allows the manupulations and analysis of the data using pandas Library.

In [4]:
data_csv
Out[4]:

Importing and reading .xlsx data file into pandas dataframe

To import and read a xlsx data, use the funtion pd.read_excel() function and use the file directory as an argument in string. But before using pd.read_excel() function, you might need to install and import xlrd.

In [5]:
# Uncomment the below line to import xlrd.
# pip install xlrd
# import xlrd
data_excel=pd.read_excel('/Users/Laptop Point/Desktop/data.xlsx')
In [6]:
data_excel
Out[6]:

To get general information on dataset/dataframe

Checking the datatype and shape of the dataframe

Let's check the data type and shape of the dataframe using .shape() and type() functions

In [7]:
type(data_csv)
Out[7]:
pandas.core.frame.DataFrame
In [8]:
data_csv.shape
Out[8]:
(7, 2)

Getting some basic info out of the dataframe

We can also see some basic information using .info() method

In [9]:
data_csv.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7 entries, 0 to 6 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Day 7 non-null object 1 Number 7 non-null int64 dtypes: int64(1), object(1) memory usage: 240.0+ bytes

Using .info method as shown above, basic info of the data can be clearly seen as there are 7 rows and 2 columns, the columns names are Day and Number. Non-null counts are shown along with data type. As shown here, the datatype of entries of Day column is object and Number column is integer.

For the coumn with numerical entries, we can also view some statistical information like mean, standard deviation, minimum/maximum values and number non-empty values using .describe() method.

In [10]:
data_csv.describe()
Out[10]:

Getting the list of columns of the dataframe

To get the list of the columns of the dataframe, use the proprty .columns

In [11]:
data_csv.columns
Out[11]:
Index(['Day', 'Number'], dtype='object')

Retrieving data from a dataframe

To know how to retrieve a single dataset from the whole dataframe is also sometimes required. This targeted dataset might be a single row or column.

1. Retrieving with '[ ]' indexing notation

To retrieve one specific columns

To retrieve a single column from the dataframe, we can use indexing notation as shown below.

In [12]:
data_csv
Out[12]:
In [13]:
data_csv['Day']
Out[13]:
0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
Name: Day, dtype: object

As we can see above, the column named Day has been obtained as series. We can check the datatype of data obtained.

In [14]:
type(data_csv['Day'])
Out[14]:
pandas.core.series.Series

To retrieve more than one specific column

Now if we have a dataframe with many number of columns and we want to obtain some specific columns, then for this purpose, we can pass the name of the columns as a list in the indexing notationas shown below.

In [15]:
data_csv[['Day', 'Number']]
Out[15]:
In [16]:
type(data_csv[['Day', 'Number']])
Out[16]:
pandas.core.frame.DataFrame

The data obtained from passing the list of specific columns is a dataframe as we can see the datatype obtained above.

To retrieve one specific row

To retrieve one specific row, pandas provides .loc[ ] which gives the data of a specific row as a pandas series method as shown below.

In [17]:
data_csv.loc[1]
Out[17]:
Day       Tuesday
Number          2
Name: 1, dtype: object
In [18]:
data_csv.loc[2]
Out[18]:
Day       Wednesday
Number            3
Name: 2, dtype: object
In [19]:
type(data_csv.loc[2])
Out[19]:
pandas.core.series.Series

To retrieve data from a specific row and column

Now suppose if we want to obtain a data from a given row and column, this can be done by providing the number of the row and name of the column in different indexing notations as given below.

In [20]:
data_csv['Day'][4]
Out[20]:
'Friday'
In [21]:
data_csv['Number'][4]
Out[21]:
5

For the same purpose shown above, pandas also provide a method .at[ ], as shown below.

In [22]:
data_csv.at[4, 'Day']
Out[22]:
'Friday'
In [23]:
data_csv.at[4, 'Number']
Out[23]:
5
In [24]:
data_csv.at[5, 'Day']
Out[24]:
'Saturday'

2. Retrieving with . Notation

Instead of accessing a spcific column with [ ] indexing notation, pandas also allow to access a specific column with . notation. However, this method only works for accessing columns whose names do not contain spaces or special characters.

In [25]:
data_csv.Day
Out[25]:
0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
Name: Day, dtype: object
In [26]:
data_csv.Number
Out[26]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
Name: Number, dtype: int64
In [27]:
type(data_csv.Day)
Out[27]:
pandas.core.series.Series

Changing index and/or columns' name of a dataframe

Before diving further into the matter and changing the index or columns or both, let's copy the data into another valrable so that the original dataset or dataframe remain intact and we can use whenever required.

So, for copying a dataframe into another variable, pandas provides .copy() method

In [28]:
data_copy=data_csv.copy()
In [29]:
data_copy
Out[29]:

Copying the dataframe into another variable is important such that any change in the copied dataframe will not result in the change of original dataframe.

Changing a specific column name

In [30]:
data_copy=data_copy.rename(columns={'Number':'Count'})
In [31]:
data_copy
Out[31]:

In the above example, data_copy has been replaced with the changed column name dataframe. The same can be done with an e3xtra argument inplace set to True, as shown below

In [32]:
data_copy.rename(columns={'Count':'Counting'}, inplace=True)
In [33]:
data_copy
Out[33]:

But note that inplace is set to False by default. So, if we don't provide inplace argument then it will not change the original dataframe, as shown below

In [34]:
data_copy.rename(columns={'Counting':'Nums'})
Out[34]:

Now, let's check what happened to original data

In [35]:
data_copy
Out[35]:

So, there is no change in the original data

Now, let's change it back to the Number

In [36]:
data_copy.rename(columns={'Counting':'Number'}, inplace=True)
In [37]:
data_copy
Out[37]:

Changing the index of the dataframe

Generally, what happens while reading a csv or xlsx file into pandas dataframe is, pandas provides the dataframe of its own index number starting from 0 as its first indexing numnber.

Now, for providing a column that is already in the dataframe as the new index, pandas provide .set_index method

Let's change the index of dataframe data_copy to the Day

In [38]:
data_copy.set_index('Day')
Out[38]:

But remember since here, we did not provide inplace=True that means our original dataframe will remain unaffected

In [39]:
data_copy
Out[39]:
In [40]:
data_copy.set_index('Day', inplace=True)
In [41]:
data_copy
Out[41]:

Note: We can also set multiple columns as index just by passing the name of the column in .set_index() method as above

Now suppose if we want to reset the index, we can use .reset_index() method

In [42]:
data_copy.reset_index()
Out[42]:

Now again if we want to reset the index in original data, we can pass the argument inplace=True

In [43]:
data_copy.reset_index(inplace=True)
In [44]:
data_copy
Out[44]:

Let's import bigger data comparison to previous one for learning more pandas functions and methods to analyse data. And, let's read the imported data into pandas dataframe.

Since the data is xlsx so, let's read the data using pd.read_excel() function.

In [3]:
data_df=pd.read_excel('/Users/Laptop Point/Desktop/data1.xlsx')
data_df
Out[3]:

Now, let's use some prelearned pandas function and methods on recently imported data.

  1. type()
  2. .info()
  3. .describe()
  4. .set_index()
  5. .reset_index()
  6. .loc[]
  7. .at[]
  8. .rename()

1. Checking the data type

In [4]:
type(data_df)
Out[4]:
pandas.core.frame.DataFrame

2. Some basic info about the dataframe

In [5]:
data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sample 6 non-null object 1 data 6 non-null int64 2 for 6 non-null int64 3 pandas 6 non-null int64 dtypes: int64(3), object(1) memory usage: 320.0+ bytes

3. Info about numerical values in dataframe

In [6]:
data_df.describe()
Out[6]:

4. Changing Index

In [7]:
data_df
Out[7]:
In [8]:
data_df.set_index('sample')
Out[8]:
In [9]:
data_df.set_index('data')
Out[9]:

Setting Multiple Index

In [10]:
data_df.set_index(['sample', 'data'])
Out[10]:

5. Restting the Index

In [11]:
data_df.reset_index(drop=True)
Out[11]:

6. Locating a row

In [12]:
data_df.loc[3]
Out[12]:
sample     p
data       4
for       10
pandas     9
Name: 3, dtype: object
In [13]:
data_df.loc[5]
Out[13]:
sample     e
data       6
for       12
pandas     2
Name: 5, dtype: object

7. Loacating a value with specific row and column

In [15]:
data_df.at[3, 'data']
Out[15]:
4
In [16]:
data_df.at[4, 'pandas']
Out[16]:
12

8. Renaming a column

In [18]:
data_df.rename(columns={'data':'numbers', 'for':'in'})
Out[18]:
In [19]:
data_df
Out[19]:

Getting general sample from dataframe

1. Using .sample( ) method

.sample( ) method gives a subset random rows from dataframe. The number of rows to get is given as an argument in .sample( ) method as given below.

In [20]:
data_df.sample(3)
Out[20]:
In [21]:
data_df.sample(3)
Out[21]:

2. Using .head( ) and .tail( ) method

.head() method is used to obtain the sample from top of the dataframe in respective order as in dataframe. The number of rows required is given as argument in the method as shown below

In [22]:
data_df.head(3)
Out[22]:

.tail() method is used to obtain the sample from below in respective order as in dataframe. The number of rows required is given as argument in the method as shown below

In [23]:
data_df.tail(2)
Out[23]:

Quering and Sorting Rows

In [24]:
data_df
Out[24]:

Now suppose if we want to obtain a dataframe only consisting of the data with the 'pandas' column less than 8. So, it can be done as given below

In [25]:
data_df.pandas<8
Out[25]:
0     True
1     True
2    False
3    False
4    False
5     True
Name: pandas, dtype: bool
In [26]:
data_df[data_df.pandas<8]
Out[26]:
In [27]:
data_df['pandas']<8
Out[27]:
0     True
1     True
2    False
3    False
4    False
5     True
Name: pandas, dtype: bool
In [28]:
data_df[data_df['pandas']<8]
Out[28]:
In [29]:
import jovian
In [ ]:
jovian.commit()
[jovian] Attempting to save notebook..

Sorting the rows