This tutorial series is a beginner-friendly introduction to programming and data analysis using the Python programming language. These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. Check out the full series here:
This tutorial covers the following topics:
This tutorial is an executable Jupyter notebook hosted on Jovian. You can run this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your computer.
The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Binder. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.
To run the code on your computer locally, you'll need to set up Python, download the notebook and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.
Jupyter Notebooks: This tutorial is a Jupyter notebook - a document made of cells. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.
The "data" in Data Analysis typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The Numpy library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.
Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters) & average relative humidity (in percentage) as a linear equation.
yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity
We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.
Based on some statistical analysis of historical data, we might come up with reasonable values for the weights w1, w2, and w3. Here's an example set of values:
w1, w2, w3 = 0.3, 0.2, 0.5
Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:
To begin, we can define some variables to record climate data for a region.
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43
We can now substitute these variables into the linear equation to predict the yield of apples.
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples
56.8
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))
The expected yield of apples in Kanto region is 56.8 tons per hectare.
To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]
The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively.
We can also represent the set of weights used in the formula as a vector.
weights = [w1, w2, w3]
We can now write a function crop_yield to calculate the yield of apples (or any other crop) given the climate data and the respective weights.
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result
crop_yield(kanto, weights)
56.8
crop_yield(johto, weights)
76.9
crop_yield(unova, weights)
74.9
The calculation performed by crop_yield (element-wise multiplication of two vectors, followed by a sum of the results) is also called the dot product. Learn more about the dot product here: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length .
The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.
Let's install the Numpy library using the pip package manager.
!pip install numpy --upgrade --quiet
Next, let's import the numpy module. It's common practice to import numpy with the alias np.
import numpy as np
We can now use the np.array function to create Numpy arrays.
kanto = np.array([73, 67, 43])
kanto
array([73, 67, 43])
weights = np.array([w1, w2, w3])
weights
array([0.3, 0.2, 0.5])
Numpy arrays have the type ndarray.
type(kanto)
numpy.ndarray
type(weights)
numpy.ndarray
Just like lists, Numpy arrays support the indexing notation [].
weights[0]
0.3
kanto[2]
43
We can now compute the dot product of the two vectors using the np.dot function.
np.dot(kanto, weights)
56.8
We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.
(kanto * weights).sum()
56.8
The * operator performs an element-wise multiplication of two arrays if they have the same size. The sum method calculates the sum of numbers in an array.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr1 * arr2
array([ 4, 10, 18])
arr2.sum()
15
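Numpy arrays offer the following benefits over Python lists for operating on numerical data:
Ease of use: You can write small, concise expressions like (kanto * weights).sum() rather than using loops & custom functions like crop_yield.
Performance: Numpy operations run in optimized, compiled code, which makes them much faster than equivalent Python loops.
Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.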
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))
# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result
CPU times: user 226 ms, sys: 0 ns, total: 226 ms
Wall time: 225 ms
833332333333500000
%%time
np.dot(arr1_np, arr2_np)
CPU times: user 3.25 ms, sys: 0 ns, total: 3.25 ms
Wall time: 1.79 ms
833332333333500000
As you can see, using np.dot is over 100 times faster than using a for loop in this run (about 1.8 ms versus 225 ms of wall time). This makes Numpy especially useful while working with really large datasets with tens of thousands or millions of data points.
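The %%time magic only works inside Jupyter. As a rough sketch of the same comparison that also runs as a plain script, the standard timeit module can be used instead. The helper name loop_dot and the variable names below are illustrative assumptions, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

n = 1_000_000
# int64 is requested explicitly so the result (about 8.3e17) fits on every platform
a_np = np.arange(n, dtype=np.int64)
b_np = np.arange(n, 2 * n, dtype=np.int64)
a_list = a_np.tolist()
b_list = b_np.tolist()

def loop_dot(xs, ys):
    """Dot product of two equal-length sequences using a plain Python loop."""
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# Take the best of three single runs for each approach
loop_seconds = min(timeit.repeat(lambda: loop_dot(a_list, b_list), number=1, repeat=3))
numpy_seconds = min(timeit.repeat(lambda: np.dot(a_np, b_np), number=1, repeat=3))

print(f"Python loop: {loop_seconds:.4f} s")
print(f"np.dot:      {numpy_seconds:.4f} s")
```

Both approaches return the same number; only the time taken differs.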
Let's save our work before continuing.
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.
climate_data = np.array([[73, 67, 43],
[91, 88, 64],
[87, 134, 58],
[102, 43, 37],
[69, 96, 70]])
climate_data
If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.
Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the .shape property of an array.
# 2D array (matrix)
climate_data.shape
weights
# 1D array (vector)
weights.shape
# 3D array
arr3 = np.array([
[[11, 12, 13],
[13, 14, 15]],
[[15, 16, 17],
[17, 18, 19.5]]])
arr3.shape
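As a small self-contained sketch, the three arrays above can be rebuilt and inspected together. Besides .shape, it prints .ndim (the number of dimensions) and .size (the total number of elements), which are standard array attributes not discussed in the text above.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])
arr3 = np.array([[[11, 12, 13], [13, 14, 15]],
                 [[15, 16, 17], [17, 18, 19.5]]])

# Print the number of dimensions, the shape, and the element count of each array
for name, arr in [("climate_data", climate_data), ("weights", weights), ("arr3", arr3)]:
    print(name, arr.ndim, arr.shape, arr.size)
```

The shape of a 1D array is a tuple with a single entry, which is why it is written with a trailing comma, e.g. (3,).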
All the elements in a numpy array have the same data type. You can check the data type of an array using the .dtype property.
weights.dtype
climate_data.dtype
If an array contains even a single floating point number, all the other elements are also converted to floats.
arr3.dtype
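The conversion described above can be checked with a minimal sketch. The array names are illustrative, and the exact integer type (for example int64 or int32) depends on the platform, so the example only checks that it is some integer type.

```python
import numpy as np

ints = np.array([1, 2, 3])      # only integers: an integer dtype is chosen
mixed = np.array([1, 2, 3.5])   # one float is enough to make every element a float

print(ints.dtype)
print(mixed.dtype)

# An existing array can be converted explicitly with astype, which returns a new array
as_float = ints.astype(np.float64)
print(as_float.dtype)
```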
We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between climate_data (a 5x3 matrix) and weights (a vector of length 3). The result is a vector of length 5, with one predicted yield per region.
You can learn about matrices and matrix multiplication by watching the first 3-4 videos of this playlist: https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0&index=1 .
We can use the np.matmul function or the @ operator to perform matrix multiplication.
np.matmul(climate_data, weights)
climate_data @ weights
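To see what the matrix multiplication computes, here is a short sketch that compares it with taking the dot product of each row separately. It reuses the five regions and the weights from earlier in this tutorial; the names by_matmul and by_rows are illustrative.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One matrix-vector product for all regions at once
by_matmul = climate_data @ weights

# The same numbers, computed one region (row) at a time
by_rows = np.array([np.dot(row, weights) for row in climate_data])

print(by_matmul)
print(np.allclose(by_matmul, by_rows))
```

Each entry of the result is the dot product of one row of climate_data with weights, so the first entry matches the value computed for Kanto earlier.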
Numpy also provides helper functions for reading from & writing to files. Let's download a file climate.txt, which contains 10,000 climate measurements (temperature, rainfall & humidity) in the following format:
temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
84.00,63.00,38.00
66.00,50.00,52.00
41.00,94.00,77.00
91.00,57.00,96.00
49.00,96.00,99.00
67.00,20.00,28.00
...
This format of storing data is known as comma-separated values or CSV.
CSVs: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)
To read this file into a numpy array, we can use the genfromtxt function.
import urllib.request
urllib.request.urlretrieve(
'https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv',
'climate.txt')
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)
climate_data
climate_data.shape
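If the download is not available, the same parsing step can be tried without a network connection. This sketch assumes a tiny in-memory CSV made of the first three sample rows shown above and passes it to np.genfromtxt through io.StringIO; the variable names are illustrative.

```python
import io
import numpy as np

# A small in-memory stand-in for climate.txt (header plus three records)
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 drops the column names so that only numbers are parsed
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(sample)
print(sample.shape, sample.dtype)
```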
We can now perform a matrix multiplication using the @ operator to predict the yield of apples for the entire dataset using a given set of weights.
weights = np.array([0.3, 0.2, 0.5])
yields = climate_data @ weights
yields
yields.shape
Let's add the yields to climate_data as a fourth column using the np.concatenate function.
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)
climate_results
There are a couple of subtleties here:
Since we wish to add new columns, we pass the argument axis=1 to np.concatenate. The axis argument specifies the dimension for concatenation.
The arrays should have the same number of dimensions, and the same length along each except the dimension used for concatenation. We use the reshape method (equivalent to the np.reshape function) to change the shape of yields from (10000,) to (10000,1).
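Hard-coding the number of rows works only for a dataset of exactly that size. As a sketch of one alternative, passing -1 to reshape asks Numpy to infer that dimension from the array's length. The three-row array below is an illustrative stand-in for the full dataset.

```python
import numpy as np

data = np.array([[25.0, 76.0, 99.0],
                 [39.0, 65.0, 70.0],
                 [59.0, 45.0, 77.0]])
weights = np.array([0.3, 0.2, 0.5])

yields = data @ weights            # shape (3,)

# -1 lets Numpy work out the number of rows, giving shape (3, 1)
column = yields.reshape(-1, 1)

# Both arrays are now 2D with the same number of rows, so they can be joined along axis=1
results = np.concatenate((data, column), axis=1)

print(column.shape)
print(results)
```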
Here's a visual explanation of np.concatenate along axis=1 (can you guess what axis=0 results in?):
The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments & return values. Use the cells below to experiment with np.concatenate and np.reshape.
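As a starting point for experimenting, here is a small sketch with two 2x2 arrays (chosen only for illustration) that shows the difference between the two axes, followed by a simple reshape.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# axis=0 adds rows: b is placed below a
stacked_rows = np.concatenate((a, b), axis=0)

# axis=1 adds columns: b is placed to the right of a
stacked_cols = np.concatenate((a, b), axis=1)

print(stacked_rows.shape)
print(stacked_cols.shape)

# reshape rearranges the same six elements into two rows and three columns
flat = np.arange(6)
grid = flat.reshape(2, 3)
print(grid)
```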
Let's write the final results from our computation above back to a file using the np.savetxt function.
climate_results
np.savetxt('climate_results.txt',
climate_results,
fmt='%.2f',
delimiter=',',
header='temperature,rainfall,humidity,yield_apples',
comments='')
The results are written back in the CSV format to the file climate_results.txt.
temperature,rainfall,humidity,yield_apples
25.00,76.00,99.00,72.20
39.00,65.00,70.00,59.70
59.00,45.00,77.00,65.20
84.00,63.00,38.00,56.80
...
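A quick way to check a file written this way is to read it back. The sketch below assumes a two-row array and a temporary file named results_demo.txt (both illustrative), writes it with np.savetxt, and loads it again with np.genfromtxt.

```python
import os
import tempfile
import numpy as np

results = np.array([[25.0, 76.0, 99.0, 72.2],
                    [39.0, 65.0, 70.0, 59.7]])
header = 'temperature,rainfall,humidity,yield_apples'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results_demo.txt')

    # comments='' stops Numpy from prefixing the header line with '# '
    np.savetxt(path, results, fmt='%.2f', delimiter=',', header=header, comments='')

    with open(path) as f:
        first_line = f.readline().strip()

    # Read the numbers back, skipping the header line
    loaded = np.genfromtxt(path, delimiter=',', skip_header=1)

print(first_line)
print(loaded)
```

Because fmt='%.2f' keeps two decimal places, values that need more precision would be rounded in the file.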
Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:
Mathematics: np.sum, np.exp, np.round, arithmetic operators
Array manipulation: np.reshape, np.stack, np.concatenate, np.split
Linear algebra: np.matmul, np.dot, np.transpose, np.linalg.eigvals
Statistics: np.mean, np.median, np.std, np.max
How to find the function you need? The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to this tutorial on array concatenation.
You can find a full list of array functions here: https://numpy.org/doc/stable/reference/routines.html
Whether you're running this Jupyter notebook online or on your computer, it's essential to save your work from time to time. You can continue working on a saved notebook later or share it with friends and colleagues to let them execute your code. Jovian offers an easy way of saving and sharing your Jupyter notebooks online.
# Install the library
!pip install jovian --upgrade --quiet
import jovian
jovian.commit()
The first time you run jovian.commit, you'll be asked to provide an API Key to securely upload the notebook to your Jovian account. You can get the API key from your Jovian profile page after logging in / signing up.
jovian.commit uploads the notebook to your Jovian account, captures the Python environment, and creates a shareable link for your notebook, as shown above. You can use this link to share your work and let anyone (including you) run your notebooks and reproduce your work.
Numpy arrays support arithmetic operators like +, -, *, etc. You can perform an arithmetic operation with a single number (also called scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.
arr2 = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 1, 2, 3]])
arr3 = np.array([[11, 12, 13, 14],
[15, 16, 17, 18],
[19, 11, 12, 13]])
# Adding a scalar
arr2 + 3
# Element-wise subtraction
arr3 - arr2
# Division by scalar
arr2 / 2
# Element-wise multiplication
arr2 * arr3
# Modulus with scalar
arr2 % 4
Numpy arrays also support broadcasting, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.
arr2 = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 1, 2, 3]])
arr2.shape
arr4 = np.array([4, 5, 6, 7])
arr4.shape
arr2 + arr4
When the expression arr2 + arr4 is evaluated, arr4 (which has the shape (4,)) is replicated three times to match the shape (3, 4) of arr2. Numpy performs the replication without actually creating three copies of the smaller array, which improves performance and uses less memory.
Broadcasting only works if one of the arrays can be replicated to match the other array's shape.
arr5 = np.array([7, 8])
arr5.shape
arr2 + arr5
In the above example, even if arr5 is replicated three times, it will not match the shape of arr2. Hence arr2 + arr5 cannot be evaluated successfully. Learn more about broadcasting here: https://numpy.org/doc/stable/user/basics.broadcasting.html .
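The rule Numpy applies is to compare shapes starting from the last dimension: two dimensions are compatible when they are equal or when one of them is 1. The following sketch illustrates this with a row, a column, and an incompatible array. The names row, col, and failed are illustrative, and np.broadcast_to is used only to display what the replicated row would look like.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])                 # shape (3, 4)
row = np.array([4, 5, 6, 7])                   # shape (4,): matches the last dimension
col = np.array([10, 20, 30]).reshape(3, 1)     # shape (3, 1): replicated across columns

print((arr2 + row).shape)
print((arr2 + col).shape)

# A read-only view of row as if it had been repeated three times
print(np.broadcast_to(row, arr2.shape))

# Shape (2,) matches neither 4 nor 1, so broadcasting fails with a ValueError
failed = False
try:
    arr2 + np.array([7, 8])
except ValueError as err:
    failed = True
    print("Broadcasting failed:", err)
```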
Numpy arrays also support comparison operations like ==, !=, >, etc. The result is an array of booleans.
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])
arr1 == arr2
arr1 != arr2
arr1 >= arr2
arr1 < arr2
Array comparison is frequently used to count the number of equal elements in two arrays using the sum method. Remember that True evaluates to 1 and False evaluates to 0 when booleans are used in arithmetic operations.
(arr1 == arr2).sum()
Numpy extends Python's list indexing notation using [] to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.
arr3 = np.array([
[[11, 12, 13, 14],
[13, 14, 15, 19]],
[[15, 16, 17, 21],
[63, 92, 36, 18]],
[[98, 32, 81, 23],
[17, 18, 19.5, 43]]])
arr3.shape
# Single element
arr3[1, 1, 2]
# Subarray using ranges
arr3[1:, 0:1, :2]
# Mixing indices and ranges
arr3[1:, 1, 3]
# Mixing indices and ranges
arr3[1:, 1, :3]
# Using fewer indices
arr3[1]
# Using fewer indices
arr3[:2, 1]
# Using too many indices
arr3[1,3,2,1]
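One way to reason about the results above is to look at their shapes: a single index removes a dimension, while a range keeps it. This sketch applies the same index expressions to a stand-in array of the same shape (3, 2, 4), filled with the numbers 0 to 23 so that each element is easy to locate.

```python
import numpy as np

# Stand-in array with the same shape as arr3 above
arr3 = np.arange(24).reshape(3, 2, 4)

print(arr3[1, 1, 2])             # three indices select a single element
print(arr3[1:, 0:1, :2].shape)   # three ranges keep all three dimensions
print(arr3[1:, 1, 3].shape)      # two indices remove two dimensions
print(arr3[1:, 1, :3].shape)     # one index removes one dimension
print(arr3[1].shape)             # missing trailing indices select everything
print(arr3[:2, 1].shape)

# Four indices on a 3D array raise an IndexError
too_many = False
try:
    arr3[1, 3, 2, 1]
except IndexError as err:
    too_many = True
    print("IndexError:", err)
```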
The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. Use the cells below to try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:
Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the official documentation or use the help function to learn more.
# All zeros
np.zeros((3, 2))
# All ones
np.ones([2, 2, 3])
# Identity matrix
np.eye(3)
# Random vector
np.random.rand(5)
# Random matrix
np.random.randn(2, 3) # rand vs. randn - what's the difference?
# Fixed value
np.full([2, 3], 42)
# Range with start, end and step
np.arange(10, 90, 3)
# Equally spaced numbers in a range
np.linspace(3, 27, 9)
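Two pairs of functions above are easy to confuse. As a brief sketch: np.random.rand draws from a uniform distribution over [0, 1), while np.random.randn draws from a standard normal distribution, so it also produces negative values. np.arange takes a step size and excludes the end value, while np.linspace takes a count and includes the end value. The seed and sample size below are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)                  # fixed seed so repeated runs give the same numbers

uniform = np.random.rand(1000)     # uniform samples in [0, 1)
normal = np.random.randn(1000)     # standard normal samples (mean 0, variance 1)
print(uniform.min(), uniform.max())
print(normal.min(), normal.max())

stepped = np.arange(10, 90, 3)     # start, end (excluded), step
spaced = np.linspace(3, 27, 9)     # start, end (included), number of points
print(stepped)
print(spaced)
```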
Let's record a snapshot of our work using jovian.commit.
# Install the library
!pip install jovian --upgrade --quiet
import jovian
jovian.commit()
Try the following exercises to become familiar with Numpy arrays and practice your skills:
With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this tutorial:
Check out the following resources for learning more about Numpy:
You are ready to move on to the next tutorial: Analyzing Tabular Data using Pandas.
Try answering the following questions to test your understanding of the topics covered in this notebook:
... numpy module?
... numpy?
What is the @ operator used for in Numpy?
... axis argument of np.concatenate?
... np.reshape function?
... np.random.rand and np.random.randn? Illustrate with examples.
... np.arange and np.linspace? Illustrate with examples.