Jovian
⭐️
Sign In
In [1]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/horse-or-human.zip \
    -O /tmp/horse-or-human.zip
--2020-02-09 11:19:39-- https://storage.googleapis.com/laurencemoroney-blog.appspot.com/horse-or-human.zip Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.126.128, 2a00:1450:4013:c05::80 Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.126.128|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 149574867 (143M) [application/zip] Saving to: ‘/tmp/horse-or-human.zip’ /tmp/horse-or-human 100%[===================>] 142.65M 181MB/s in 0.8s 2020-02-09 11:19:40 (181 MB/s) - ‘/tmp/horse-or-human.zip’ saved [149574867/149574867]
In [2]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip \
    -O /tmp/validation-horse-or-human.zip
--2020-02-09 11:19:42-- https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.69.128, 2a00:1450:4013:c04::80 Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.69.128|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 11480187 (11M) [application/zip] Saving to: ‘/tmp/validation-horse-or-human.zip’ /tmp/validation-hor 100%[===================>] 10.95M 37.1MB/s in 0.3s 2020-02-09 11:19:43 (37.1 MB/s) - ‘/tmp/validation-horse-or-human.zip’ saved [11480187/11480187]

The following python code will use the OS library to use Operating System libraries, giving you access to the file system, and the zipfile library allowing you to unzip the data.

In [ ]:
import os
import zipfile

local_zip = '/tmp/horse-or-human.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp/horse-or-human')
local_zip = '/tmp/validation-horse-or-human.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp/validation-horse-or-human')
zip_ref.close()

The contents of the .zip are extracted to the base directory /tmp/horse-or-human, which in turn each contain horses and humans subdirectories.

In short: The training set is the data that is used to tell the neural network model that 'this is what a horse looks like', 'this is what a human looks like' etc.

One thing to pay attention to in this sample: We do not explicitly label the images as horses or humans. If you remember with the handwriting example earlier, we had labelled 'this is a 1', 'this is a 7' etc. Later you'll see something called an ImageGenerator being used -- and this is coded to read images from subdirectories, and automatically label them from the name of that subdirectory. So, for example, you will have a 'training' directory containing a 'horses' directory and a 'humans' one. ImageGenerator will label the images appropriately for you, reducing a coding step.

Let's define each of these directories:

In [ ]:
# Directory with our training horse pictures
train_horse_dir = os.path.join('/tmp/horse-or-human/horses')

# Directory with our training human pictures
train_human_dir = os.path.join('/tmp/horse-or-human/humans')

# Directory with our training horse pictures
validation_horse_dir = os.path.join('/tmp/validation-horse-or-human/horses')

# Directory with our training human pictures
validation_human_dir = os.path.join('/tmp/validation-horse-or-human/humans')

Now, let's see what the filenames look like in the horses and humans training directories:

In [5]:
train_horse_names = os.listdir(train_horse_dir)
print(train_horse_names[:10])

train_human_names = os.listdir(train_human_dir)
print(train_human_names[:10])

validation_horse_hames = os.listdir(validation_horse_dir)
print(validation_horse_hames[:10])

validation_human_names = os.listdir(validation_human_dir)
print(validation_human_names[:10])
['horse35-8.png', 'horse17-7.png', 'horse12-1.png', 'horse44-6.png', 'horse07-9.png', 'horse32-2.png', 'horse03-3.png', 'horse01-2.png', 'horse33-0.png', 'horse47-2.png'] ['human11-24.png', 'human07-16.png', 'human12-02.png', 'human08-19.png', 'human02-16.png', 'human07-20.png', 'human14-04.png', 'human14-20.png', 'human10-22.png', 'human08-01.png'] ['horse5-100.png', 'horse3-326.png', 'horse5-065.png', 'horse5-303.png', 'horse5-103.png', 'horse5-342.png', 'horse5-402.png', 'horse3-055.png', 'horse3-397.png', 'horse1-298.png'] ['valhuman02-21.png', 'valhuman03-07.png', 'valhuman05-06.png', 'valhuman04-17.png', 'valhuman03-21.png', 'valhuman05-27.png', 'valhuman03-10.png', 'valhuman04-02.png', 'valhuman04-03.png', 'valhuman04-19.png']

Let's find out the total number of horse and human images in the directories:

In [6]:
print('total training horse images:', len(os.listdir(train_horse_dir)))
print('total training human images:', len(os.listdir(train_human_dir)))
print('total validation horse images:', len(os.listdir(validation_horse_dir)))
print('total validation human images:', len(os.listdir(validation_human_dir)))
total training horse images: 500 total training human images: 527 total validation horse images: 128 total validation human images: 128

Now let's take a look at a few pictures to get a better sense of what they look like. First, configure the matplot parameters:

In [ ]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

# Index for iterating over images
pic_index = 0

Now, display a batch of 8 horse and 8 human pictures. You can rerun the cell to see a fresh batch each time:

In [8]:
# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols * 4, nrows * 4)

pic_index += 8
next_horse_pix = [os.path.join(train_horse_dir, fname) 
                for fname in train_horse_names[pic_index-8:pic_index]]
next_human_pix = [os.path.join(train_human_dir, fname) 
                for fname in train_human_names[pic_index-8:pic_index]]

for i, img_path in enumerate(next_horse_pix+next_human_pix):
  # Set up subplot; subplot indices start at 1
  sp = plt.subplot(nrows, ncols, i + 1)
  sp.axis('Off') # Don't show axes (or gridlines)

  img = mpimg.imread(img_path)
  plt.imshow(img)

plt.show()