
## The Not So Famous Five of Numpy

• histogram - Computes the histogram of a set of data
• corrcoef - Return Pearson product-moment correlation coefficients
• count_nonzero - Counts the number of non-zero values in the array a.
• ravel - Return a contiguous flattened array.
• clip - Limits the values in an array to a given interval.

The recommended way to run this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on mybinder.org, a free online service for running Jupyter notebooks.

In [3]:
import jovian
In [4]:
jovian.commit(project='numpy-array-operations')
[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Updating notebook "abhinavyadav7/numpy-array-operations" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/abhinavyadav7/numpy-array-operations

Let's begin by importing Numpy and listing out the functions covered in this notebook.

In [5]:
import numpy as np
In [9]:
# List of functions explained
function1 = np.histogram
function2 = np.corrcoef
function3 = np.count_nonzero
function4 = np.ravel
function5 = np.clip
In [10]:
jovian.commit(project='numpy-array-operations')

### Function 1 - np.histogram()

Computes the histogram of a set of data, i.e. it counts how many values fall into each of a series of bins.

In [14]:
arr = np.random.randint(11, size = 50)
np.histogram(arr)
Out[14]:
(array([ 4,  2,  2,  4, 12,  4,  5,  5,  4,  8]),
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]))

The example defines arr as 50 random integers between 0 and 10 (np.random.randint(11) excludes the upper bound 11). np.histogram returns two arrays: the number of values in each bin and the bin edges. With the default of 10 equal-width bins, the last bin covers both 9 and 10. The function is helpful for getting an idea of how the data is distributed.
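The two returned arrays can be unpacked and checked against each other. The sketch below assumes a seeded generator (np.random.default_rng, not used in the original notebook) so that the run is repeatable; the names counts and edges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumption: seeded so the run is repeatable
arr = rng.integers(0, 11, size=50)      # 50 integers from 0 to 10 inclusive

counts, edges = np.histogram(arr)       # default: 10 equal-width bins

print(counts)
print(edges)

# There is always one more edge than there are bins,
# and every value lands in exactly one bin.
print(len(edges) == len(counts) + 1)
print(counts.sum() == arr.size)
```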

In [21]:
# Example 2 - working
arr = np.random.randint(100, size = 50)
np.histogram(arr, bins = [0,10,20,30,40,50,60,70,80,90,100])
Out[21]:
(array([ 7, 12,  4,  8,  3,  1,  2,  4,  4,  5]),
array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100]))

This example draws 50 random integers between 0 and 99 and passes explicit bin edges through the bins parameter, so each bin spans ten values (0-9, 10-19, and so on). The first array in the output is the count in each bin and the second array repeats the bin edges that were passed in.
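With explicit edges, every bin is half-open (it includes its left edge but not its right one), except the last bin, which also includes its right edge. A small sketch with hand-picked data (the array data is made up for illustration) makes this visible:

```python
import numpy as np

# Hand-picked values that sit exactly on bin edges
data = np.array([0, 9, 10, 10, 55, 99, 100])
edges = np.arange(0, 101, 10)           # 0, 10, 20, ..., 100

counts, _ = np.histogram(data, bins=edges)
print(counts)
# 0 and 9 fall in the first bin, both 10s fall in the second bin,
# and 100 is counted in the last bin together with 99.
```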

In [23]:
# Example 3 - breaking (to illustrate when it breaks)
arr = np.random.randint(100, size = 50)
np.histogram(arr, bin = [0,10,20,30,40,50,60,70,80,90,100])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-5b080845ad2c> in <module>
      1 # Example 3 - breaking (to illustrate when it breaks)
      2 arr = np.random.randint(100, size = 50)
----> 3 np.histogram(arr, bin = [0,10,20,30,40,50,60,70,80,90,100])

<__array_function__ internals> in histogram(*args, **kwargs)

TypeError: _histogram_dispatcher() got an unexpected keyword argument 'bin'

The parameter names of the histogram function must be written carefully: the keyword is bins, not bin.
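A minimal sketch of the same mistake, caught with try/except so that the script keeps running. The exact error message may differ between NumPy versions, but the exception type is TypeError.

```python
import numpy as np

arr = np.arange(50)                      # deterministic stand-in for the random data
edges = np.arange(0, 101, 10)

try:
    np.histogram(arr, bin=edges)         # wrong keyword: 'bin'
except TypeError as err:
    print("TypeError:", err)

counts, _ = np.histogram(arr, bins=edges)    # correct keyword: 'bins'
print(counts)
```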

Use this function when the frequency of occurrence of values needs to be calculated, for example as the input to a histogram plot.

In [24]:
jovian.commit()

### Function 2 - np.corrcoef

Return Pearson product-moment correlation coefficients in the form of a matrix.

In [25]:
# Example 1 - calculating the correlation between height in inches and weight in kg
height = np.array([72,75,76,78,82,73,85])
weight = np.array([68,69,69,73,75,65,78])
np.corrcoef(height,weight)
Out[25]:
array([[1.       , 0.9569077],
[0.9569077, 1.       ]])

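The matrix gives the pairwise correlation coefficients: the diagonal entries are each variable's correlation with itself (always 1) and the off-diagonal entries are the correlation between height and weight. A value of about 0.96 shows that height and weight are strongly positively correlated: as height increases, weight increases in an almost linear fashion.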
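To see where the off-diagonal value comes from, the Pearson coefficient can be computed by hand from the deviations of each variable from its mean and compared with np.corrcoef. This is a sketch of the textbook formula, not of NumPy's internal implementation; the names dx, dy and r_manual are illustrative.

```python
import numpy as np

height = np.array([72, 75, 76, 78, 82, 73, 85])
weight = np.array([68, 69, 69, 73, 75, 65, 78])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations from the respective means.
dx = height - height.mean()
dy = weight - weight.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_numpy = np.corrcoef(height, weight)[0, 1]   # off-diagonal entry of the matrix
print(r_manual, r_numpy)
```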

In [30]:
# Example 2 - calculating the correlation between weight in kg and number of hours exercised
weight = np.array([68,69,69,73,75,65,78])
hours = np.array([3,2,1,0,0,1,0])
np.corrcoef(weight, hours)
Out[30]:
array([[ 1.        , -0.67219362],
[-0.67219362,  1.        ]])

The off-diagonal entries give the correlation between weight and the number of hours exercised. A value of about -0.67 indicates a moderate negative correlation: in this data, people who exercise more tend to weigh less.

In [31]:
weight = np.array([68,69,69,73,75,65,78])
hours = np.array([3,2,1,0,0,1])
np.corrcoef(weight, hours)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-f8d8c518cd10> in <module>
      1 weight = np.array([68,69,69,73,75,65,78])
      2 hours = np.array([3,2,1,0,0,1])
----> 3 np.corrcoef(weight, hours)

<__array_function__ internals> in corrcoef(*args, **kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/lib/function_base.py in corrcoef(x, y, rowvar, bias, ddof)
   2549         warnings.warn('bias and ddof have no effect and are deprecated',
   2550                       DeprecationWarning, stacklevel=3)
-> 2551     c = cov(x, y, rowvar)
   2552     try:
   2553         d = diag(c)

<__array_function__ internals> in cov(*args, **kwargs)

/srv/conda/envs/notebook/lib/python3.8/site-packages/numpy/lib/function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
   2413         if not rowvar and y.shape[0] != 1:
   2414             y = y.T
-> 2415         X = np.concatenate((X, y), axis=0)
   2416
   2417     if ddof is None:

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 7 and the array at index 1 has size 6

The example breaks because the two arrays have different lengths (7 and 6). np.corrcoef needs the same number of observations for each variable, so make sure the arrays are the same size before calling it.
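One way to get a clearer error message is to compare the shapes before calling np.corrcoef. The helper safe_corrcoef below is hypothetical (it is not part of NumPy) and is only a sketch of that check.

```python
import numpy as np

def safe_corrcoef(x, y):
    """Hypothetical helper: check the shapes before calling np.corrcoef."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape != y.shape:
        raise ValueError(f"shape mismatch: {x.shape} vs {y.shape}")
    return np.corrcoef(x, y)[0, 1]

weight = np.array([68, 69, 69, 73, 75, 65, 78])
hours = np.array([3, 2, 1, 0, 0, 1])     # one observation is missing

try:
    safe_corrcoef(weight, hours)
except ValueError as err:
    print("Cannot correlate:", err)
```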

Use this function when trying to establish relationships between datasets. It gives a general indication of what to expect from the data.

In [32]:
jovian.commit()

### Function 3 - count_nonzero

Counts the number of non-zero values in the array a.

In [41]:
# Example 1 - working
arr1 = np.random.randint(5 , size = (4,4))
arr1
Out[41]:
array([[1, 1, 3, 2],
[2, 2, 0, 0],
[0, 2, 2, 2],
[0, 2, 4, 4]])
In [43]:
np.count_nonzero(arr1)
Out[43]:
12

np.count_nonzero(arr1) gives the number of non-zero elements in the array, which for this matrix is 12 (the remaining 4 of the 16 elements are zeros).
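np.count_nonzero also accepts an axis argument, and because True counts as non-zero it can count the elements that satisfy a condition. The sketch below rebuilds the matrix from the output above by hand instead of drawing it at random.

```python
import numpy as np

arr1 = np.array([[1, 1, 3, 2],
                 [2, 2, 0, 0],
                 [0, 2, 2, 2],
                 [0, 2, 4, 4]])

total = np.count_nonzero(arr1)               # whole array
per_row = np.count_nonzero(arr1, axis=1)     # one count per row
per_col = np.count_nonzero(arr1, axis=0)     # one count per column
zeros = np.count_nonzero(arr1 == 0)          # count the entries matching a condition

print(total, per_row, per_col, zeros)
```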

In [46]:
height = np.array([72,75,76,78,82,73,85])
weight = np.array([68,69,69,73,75,65,78])
len(height),np.count_nonzero(height),len(weight),np.count_nonzero(weight)

Out[46]:
(7, 7, 7, 7)
In [47]:
np.corrcoef(height,weight)
Out[47]:
array([[1.       , 0.9569077],
[0.9569077, 1.       ]])

In some surveys, missing values (NaN) or questions that people chose not to answer are erroneously coded as 0. This biases the data set and throws off the calculations. Here, before computing the correlation coefficient, we check that both arrays contain only non-zero elements, since neither height nor weight can be zero.

In [49]:
# Example 3 - breaking (to illustrate when it breaks)
height = np.array([72,75,76,78,82,73,85])
weight = np.array([68,69,69,73,0,65,0])
np.corrcoef(height,weight)
Out[49]:
array([[ 1.        , -0.86761515],
[-0.86761515,  1.        ]])

np.count_nonzero was not used here to check for zeros, so the two zero weights went unnoticed and the correlation coefficient flipped from about 0.96 in the previous example to about -0.87. Hence it is necessary to weed out zeros from data that is not expected to be zero, such as height, weight or age.

This function is useful in exploratory data analysis for sanitising data: it reveals zeros in variables that should never be zero, such as height, weight or age, so that they can be removed.
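A sketch of that clean-up step, assuming a zero weight stands for a missing answer: count the zeros first, then use a boolean mask to drop the affected observations from both arrays before correlating. The names valid, r_raw and r_clean are illustrative.

```python
import numpy as np

height = np.array([72, 75, 76, 78, 82, 73, 85])
weight = np.array([68, 69, 69, 73, 0, 65, 0])    # 0 stands in for a missing value

n_missing = weight.size - np.count_nonzero(weight)
print("suspicious zeros:", n_missing)

valid = weight != 0                              # boolean mask of usable observations
r_raw = np.corrcoef(height, weight)[0, 1]
r_clean = np.corrcoef(height[valid], weight[valid])[0, 1]
print(r_raw, r_clean)
```

Dropping observations is only one option; whether it is appropriate depends on why the values are missing.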

In [50]:
jovian.commit()

### Function 4 - ravel

Return a contiguous flattened array.

In [56]:
# Example 1 - working
arr1 = np.random.randint(5 , size = (4,4))
flattened_arr1 = np.ravel(arr1)
flattened_arr1
Out[56]:
array([3, 1, 4, 4, 1, 1, 3, 3, 2, 2, 2, 4, 1, 2, 2, 3])

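np.ravel returns a flattened, one-dimensional version of the array and leaves the shape of the original array unchanged. Note that it returns a view of the original data where possible and makes a copy only when needed, so writing to the result can change the original; use arr1.flatten() if an independent copy is required.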
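Whether np.ravel returned a view or a copy can be checked with np.shares_memory. The sketch below uses a small deterministic array rather than the random one above.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

flat_view = np.ravel(arr)        # contiguous input: a view, no copy needed
flat_copy = arr.flatten()        # flatten always returns a copy
flat_t = np.ravel(arr.T)         # transposed input: a copy is needed here

print(np.shares_memory(arr, flat_view))   # True
print(np.shares_memory(arr, flat_copy))   # False
print(np.shares_memory(arr, flat_t))      # False

flat_view[0] = 99                # writes through to arr
print(arr[0, 0])                 # 99
```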

In [58]:
# Example 2 - working
arr2 = np.random.randint( 15, size = (8,8))
flattened_arr2 = np.ravel(arr2)
flattened_arr2
Out[58]:
array([ 1,  1, 14,  6, 10,  0, 12, 14,  4, 13,  8,  1, 14,  1,  4,  1,  9,
10,  7,  9,  0,  4, 10, 10, 11,  7, 13,  2,  5,  1,  4,  6,  9,  6,
6,  4, 13,  3,  7, 12,  2, 10, 10, 12,  4, 11,  0, 14,  6,  1,  1,
0,  8,  6,  0, 12,  3,  0, 11,  1,  7,  1,  8,  0])

The 8x8 array is flattened into a one-dimensional array of 64 elements in the same way as in Example 1; the original array keeps its shape.

In [62]:
# Example 3 - breaking (to illustrate when it breaks)
arr2 = np.random.randint( 15, size = (8,8))
arr2.ravel
Out[62]:
<function ndarray.ravel>

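Without parentheses, arr2.ravel only refers to the method and does not call it, so the output is the function object rather than a flattened array. Calling arr2.ravel() with parentheses works and gives the same result as np.ravel(arr2).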
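A short sketch of the difference between referring to the method and calling it, again with a small deterministic array:

```python
import numpy as np

arr2 = np.arange(6).reshape(2, 3)

method = arr2.ravel              # no parentheses: only a reference to the method
print(callable(method))          # True

flat = arr2.ravel()              # with parentheses: the flattened array
print(flat)
print(np.array_equal(flat, np.ravel(arr2)))   # True
```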

Use this function to flatten large matrices into a one-dimensional array so that functions expecting 1-D input can be applied to them.

In [63]:
jovian.commit()

### Function 5 - clip

clip is used to keep values in an array within an interval

In [66]:
# Example 1 - working
arr3 = np.arange(15)
print(arr3)
np.clip(arr3,3,12)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
Out[66]:
array([ 3,  3,  3,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 12, 12])

In this example clip restricts the values to the interval from 3 to 12: everything below the minimum of 3 becomes 3 and everything above the maximum of 12 becomes 12. Values inside the interval are left as they are.
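The same result can be written as an element-wise maximum followed by an element-wise minimum, which may help to show what clip does. The sketch also shows that np.clip returns a new array and leaves its input untouched.

```python
import numpy as np

arr3 = np.arange(15)

clipped = np.clip(arr3, 3, 12)
manual = np.minimum(np.maximum(arr3, 3), 12)   # raise to at least 3, then cap at 12

print(np.array_equal(clipped, manual))         # True
print(arr3)                                    # the original array is unchanged
```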

In [72]:
# Example 2 - working
weight = np.array([68,69,69,73,75,65,120])
hours = np.array([10,2,1,0,1,0,0])
print(np.corrcoef(weight, hours))
# np.clip returns a new array, so keep the result
clipped_weight = np.clip(weight, 30, 100)
clipped_hours = np.clip(hours, 0, 5)
print(np.corrcoef(clipped_weight, clipped_hours))

[[ 1.         -0.27863692]
 [-0.27863692  1.        ]]
[[ 1.         -0.35346722]
 [-0.35346722  1.        ]]

Clip is another useful function for sanitising data, because it caps outliers such as an excessive weight of 120 kg or an excessive 10 hours of exercise. After clipping the outliers, the correlation coefficient changes from about -0.28 to about -0.35, which shows how much a couple of extreme values can influence the result and how clipping can give a more realistic picture.

In [74]:
# Example 3 - breaking (to illustrate when it breaks)
weight = np.array([50,60,70,70,75,85,120])
hours = np.array([10,5,6,5,1,0,0])
print(np.corrcoef(weight, hours))
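# np.clip returns a new array, so keep the result
clipped_weight = np.clip(weight, 60, 100)   # the lower bound of 60 also alters the valid weight of 50
clipped_hours = np.clip(hours, 0, 5)
print(np.corrcoef(clipped_weight, clipped_hours))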

[[ 1.         -0.79801789]
 [-0.79801789  1.        ]]
[[ 1.         -0.85870876]
 [-0.85870876  1.        ]]

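Care must be taken with how clip and its bounds are used. Here the lower bound of 60 changes a plausible weight of 50 kg, and the upper bound of 5 changes 6 hours of exercise as well as 10. If the bounds are not chosen carefully, clip can overwrite valid data and introduce additional bias.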
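Before trusting clipped data, it can help to count how many values the chosen bounds actually change. The sketch below compares the array before and after clipping; the names low, high and changed are illustrative.

```python
import numpy as np

weight = np.array([50, 60, 70, 70, 75, 85, 120])
low, high = 60, 100

clipped = np.clip(weight, low, high)
changed = weight != clipped                  # True where clip altered a value

print(np.count_nonzero(changed), "of", weight.size, "values changed")
print(weight[changed], "->", clipped[changed])
```

If many values change, the bounds are probably too tight for the data.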

Clip is a useful tool for exploratory data analysis: when used carefully, it offers an easy and fast way to handle outliers.

In [ ]:
jovian.commit()
[jovian] Attempting to save notebook..

### Conclusion

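This notebook covered five lesser-known NumPy functions: np.histogram for counting values in bins, np.corrcoef for measuring linear correlation, np.count_nonzero for counting non-zero entries, np.ravel for flattening arrays, and np.clip for limiting values to an interval. Each one is a small building block for exploratory data analysis.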