Jovian
⭐️
Sign In

If you are totally new to tensorflow like me, this may help you.

I tried to add those basic detail in this tutorial.

This tutorial is from TF2.0 official website.

The sample is about when you have a CSV with specific data structure.

How can you handle your data?

The dataset were provided by TF2.0 tutorial, which were an online dataset.

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import tensorflow as tf

print(tf.__version__)

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
2.0.0-alpha0
In [2]:
URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
data = pd.read_csv(URL)
data.head()

train, test = train_test_split(data, test_size=0.2)
train, val  = train_test_split(train, test_size=0.2)

print(len(train), ' ', len(val), ' ', len(test))
193 49 61

Here the sample introduce you an API calls tf.data it allows you build complex input pipelines.

there are 2 ways to create this kind of dataset:

  1. ataset.from_tensor_slices()

this method constructs a dataset from Tensor objects.

  1. ataset.batch()

this method constructs a dataset from Dataset objects.

when extract elements from a dataset, use

data.Iterator you can use Iterator.get_next() to get next objs.

find detail on their official website.

The convient about tf.data is helping you transfor & normalize your data structure.

for example, when classification, you won't use id directly, you will use one-hot for your items.

In [3]:
print(train[:10])
age sex cp trestbps chol fbs restecg thalach exang oldpeak \ 134 42 1 3 120 240 1 0 194 0 0.8 251 57 1 0 130 131 0 1 115 1 1.2 224 57 1 2 154 232 0 2 164 0 0.0 300 65 1 4 135 254 0 2 127 0 2.8 205 70 1 2 156 245 0 2 143 0 0.0 75 56 0 4 200 288 1 2 133 1 4.0 106 58 1 4 125 300 0 2 171 0 0.0 264 44 1 4 112 290 0 2 153 0 0.0 122 64 1 4 145 212 0 2 132 0 2.0 16 48 1 2 110 229 0 0 168 0 1.0 slope ca thal target 134 3 0 reversible 0 251 1 1 normal 0 224 1 1 normal 0 300 2 1 reversible 1 205 1 0 normal 0 75 3 2 reversible 1 106 1 2 reversible 0 264 1 1 normal 1 122 2 2 fixed 1 16 3 0 reversible 0

first, we transfer the structure from CSV into tf.data

In [4]:
def Transfer2TfData(data, shuffle=True, batch_size=32):
    labels = data.copy().pop('target')
    ds_data = tf.data.Dataset.from_tensor_slices((dict(data), labels))
    if shuffle:
        ds_data.shuffle(buffer_size = len(data))
    ds_data = ds_data.batch(batch_size)
    return ds_data

to understand what the function did, here is the sample of how dict works

In [5]:
print(train[:2])
print("--------------")
print(dict(train[:2]))
age sex cp trestbps chol fbs restecg thalach exang oldpeak \ 134 42 1 3 120 240 1 0 194 0 0.8 251 57 1 0 130 131 0 1 115 1 1.2 slope ca thal target 134 3 0 reversible 0 251 1 1 normal 0 -------------- {'age': 134 42 251 57 Name: age, dtype: int64, 'sex': 134 1 251 1 Name: sex, dtype: int64, 'cp': 134 3 251 0 Name: cp, dtype: int64, 'trestbps': 134 120 251 130 Name: trestbps, dtype: int64, 'chol': 134 240 251 131 Name: chol, dtype: int64, 'fbs': 134 1 251 0 Name: fbs, dtype: int64, 'restecg': 134 0 251 1 Name: restecg, dtype: int64, 'thalach': 134 194 251 115 Name: thalach, dtype: int64, 'exang': 134 0 251 1 Name: exang, dtype: int64, 'oldpeak': 134 0.8 251 1.2 Name: oldpeak, dtype: float64, 'slope': 134 3 251 1 Name: slope, dtype: int64, 'ca': 134 0 251 1 Name: ca, dtype: int64, 'thal': 134 reversible 251 normal Name: thal, dtype: object, 'target': 134 0 251 0 Name: target, dtype: int64}
In [6]:
# here is the result of data type & shape
ds_train = Transfer2TfData(train, batch_size = 5)
print(ds_train)
<BatchDataset shapes: ({age: (None,), sex: (None,), cp: (None,), trestbps: (None,), chol: (None,), fbs: (None,), restecg: (None,), thalach: (None,), exang: (None,), oldpeak: (None,), slope: (None,), ca: (None,), thal: (None,), target: (None,)}, (None,)), types: ({age: tf.int32, sex: tf.int32, cp: tf.int32, trestbps: tf.int32, chol: tf.int32, fbs: tf.int32, restecg: tf.int32, thalach: tf.int32, exang: tf.int32, oldpeak: tf.float64, slope: tf.int32, ca: tf.int32, thal: tf.string, target: tf.int32}, tf.int32)>
In [7]:
# get one and check the value
for feature_batch, label_batch in ds_train.take(1):
      print('Every feature:', list(feature_batch.keys()))
      print('A batch of ages:', feature_batch['age'])
      print('A batch of targets:', label_batch )
Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'] A batch of ages: tf.Tensor([42 57 57 65 70], shape=(5,), dtype=int32) A batch of targets: tf.Tensor([0 0 0 1 0], shape=(5,), dtype=int32)

we will start to transfer our data into some structure that model needs

what we have in this structure are

Column Description Feature Type Data Type
Age Age in years Numerical integer
Sex (1 = male; 0 = female) Categorical integer
CP Chest pain type (0, 1, 2, 3, 4) Categorical integer
Trestbpd Resting blood pressure (in mm Hg on admission to the hospital) Numerical integer
Chol Serum cholestoral in mg/dl Numerical integer
FBS (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) Categorical integer
RestECG Resting electrocardiographic results (0, 1, 2) Categorical integer
Thalach Maximum heart rate achieved Numerical integer
Exang Exercise induced angina (1 = yes; 0 = no) Categorical integer
Oldpeak ST depression induced by exercise relative to rest Numerical integer
Slope The slope of the peak exercise ST segment Numerical float
CA Number of major vessels (0-3) colored by flourosopy Numerical integer
Thal 3 = normal; 6 = fixed defect; 7 = reversable defect Categorical string
Target Diagnosis of heart disease (1 = true; 0 = false) Classification integer

before move on, here is the introduct of next, iter

In [8]:
sample = iter('abcd')
print(next(sample))
print(next(sample))
print(next(sample))
print(next(sample))
a b c d
In [9]:
print("from above you will find the value were same, the different were their type.")
print("type of next(iter(ds_train))    is ", type(next(iter(ds_train))))
print("type of next(iter(ds_train))[0] is ", type(next(iter(ds_train))[0]))

print("-----------")
print(next(iter(ds_train)))
print("-----------")
print(next(iter(ds_train))[0])
from above you will find the value were same, the different were their type. type of next(iter(ds_train)) is <class 'tuple'> type of next(iter(ds_train))[0] is <class 'dict'> ----------- ({'age': <tf.Tensor: id=90, shape=(5,), dtype=int32, numpy=array([42, 57, 57, 65, 70])>, 'sex': <tf.Tensor: id=98, shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 1])>, 'cp': <tf.Tensor: id=93, shape=(5,), dtype=int32, numpy=array([3, 0, 2, 4, 2])>, 'trestbps': <tf.Tensor: id=103, shape=(5,), dtype=int32, numpy=array([120, 130, 154, 135, 156])>, 'chol': <tf.Tensor: id=92, shape=(5,), dtype=int32, numpy=array([240, 131, 232, 254, 245])>, 'fbs': <tf.Tensor: id=95, shape=(5,), dtype=int32, numpy=array([1, 0, 0, 0, 0])>, 'restecg': <tf.Tensor: id=97, shape=(5,), dtype=int32, numpy=array([0, 1, 2, 2, 2])>, 'thalach': <tf.Tensor: id=102, shape=(5,), dtype=int32, numpy=array([194, 115, 164, 127, 143])>, 'exang': <tf.Tensor: id=94, shape=(5,), dtype=int32, numpy=array([0, 1, 0, 0, 0])>, 'oldpeak': <tf.Tensor: id=96, shape=(5,), dtype=float64, numpy=array([0.8, 1.2, 0. , 2.8, 0. ])>, 'slope': <tf.Tensor: id=99, shape=(5,), dtype=int32, numpy=array([3, 1, 1, 2, 1])>, 'ca': <tf.Tensor: id=91, shape=(5,), dtype=int32, numpy=array([0, 1, 1, 1, 0])>, 'thal': <tf.Tensor: id=101, shape=(5,), dtype=string, numpy= array([b'reversible', b'normal', b'normal', b'reversible', b'normal'], dtype=object)>, 'target': <tf.Tensor: id=100, shape=(5,), dtype=int32, numpy=array([0, 0, 0, 1, 0])>}, <tf.Tensor: id=104, shape=(5,), dtype=int32, numpy=array([0, 0, 0, 1, 0])>) ----------- {'age': <tf.Tensor: id=124, shape=(5,), dtype=int32, numpy=array([42, 57, 57, 65, 70])>, 'sex': <tf.Tensor: id=132, shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 1])>, 'cp': <tf.Tensor: id=127, shape=(5,), dtype=int32, numpy=array([3, 0, 2, 4, 2])>, 'trestbps': <tf.Tensor: id=137, shape=(5,), dtype=int32, numpy=array([120, 130, 154, 135, 156])>, 'chol': <tf.Tensor: id=126, shape=(5,), dtype=int32, numpy=array([240, 131, 232, 254, 245])>, 'fbs': <tf.Tensor: id=129, shape=(5,), dtype=int32, numpy=array([1, 0, 0, 0, 0])>, 'restecg': <tf.Tensor: id=131, shape=(5,), dtype=int32, numpy=array([0, 1, 2, 2, 2])>, 'thalach': <tf.Tensor: id=136, shape=(5,), dtype=int32, numpy=array([194, 115, 164, 127, 143])>, 'exang': <tf.Tensor: id=128, shape=(5,), dtype=int32, numpy=array([0, 1, 0, 0, 0])>, 'oldpeak': <tf.Tensor: id=130, shape=(5,), dtype=float64, numpy=array([0.8, 1.2, 0. , 2.8, 0. ])>, 'slope': <tf.Tensor: id=133, shape=(5,), dtype=int32, numpy=array([3, 1, 1, 2, 1])>, 'ca': <tf.Tensor: id=125, shape=(5,), dtype=int32, numpy=array([0, 1, 1, 1, 0])>, 'thal': <tf.Tensor: id=135, shape=(5,), dtype=string, numpy= array([b'reversible', b'normal', b'normal', b'reversible', b'normal'], dtype=object)>, 'target': <tf.Tensor: id=134, shape=(5,), dtype=int32, numpy=array([0, 0, 0, 1, 0])>}

in TF, you will need to handle your data like this

  1. a feature layer to build model (params with dict type)
  2. a feature column to describe your data column

let start with basic one, use int directly.

numeric_column
In [10]:
# we first get a dict type of data
dict_train = next(iter(ds_train))[0]

# create definition
age = feature_column.numeric_column("age")
# create layer
feature_layer = layers.DenseFeatures(age)
# add data into layer
mydata = feature_layer(dict_train)
# result
print(mydata[:5])
WARNING: Logging before flag parsing goes to stderr. W0505 21:03:09.582941 14316 deprecation.py:323] From F:\Python36\lib\site-packages\tensorflow\python\feature_column\feature_column_v2.py:2758: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead.
tf.Tensor( [[42.] [57.] [57.] [65.] [70.]], shape=(5, 1), dtype=float32)

for those numeric column,

some times you may need to category them and create one-hot vector to represent them

Bucketized columns
In [11]:
# this is how you can do with pre-originized your data
age_boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
age_buckets = feature_column.bucketized_column(age, boundaries = age_boundaries)

# below were the same
feature_layer = layers.DenseFeatures(age_buckets)
mydata = feature_layer(dict_train)
print(mydata[:5])
W0505 21:03:09.598928 14316 deprecation.py:323] From F:\Python36\lib\site-packages\tensorflow\python\feature_column\feature_column_v2.py:2902: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead.
tf.Tensor( [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]], shape=(5, 11), dtype=float32)
Categorical columns
In [12]:
# for those need to do one-hot vector (ex. Thal)

# create the column definition first
thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
# make it a one-hot feature column
thal_one_hot = feature_column.indicator_column(thal)

# below were the same
feature_layer = layers.DenseFeatures(thal_one_hot)
mydata = feature_layer(dict_train)
print(mydata[:5])
W0505 21:03:09.625905 14316 deprecation.py:323] From F:\Python36\lib\site-packages\tensorflow\python\feature_column\feature_column_v2.py:4307: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version. Instructions for updating: The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead. W0505 21:03:09.628903 14316 deprecation.py:323] From F:\Python36\lib\site-packages\tensorflow\python\feature_column\feature_column_v2.py:4362: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version. Instructions for updating: The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
tf.Tensor( [[0. 0. 1.] [0. 1. 0.] [0. 1. 0.] [0. 0. 1.] [0. 1. 0.]], shape=(5, 3), dtype=float32)

categorical_column_with_vocabulary_list is not working when you get plenty of words. You can't list it.

Instead, use

categorical_column_with_hash_bucket(
    key,
    hash_bucket_size,
    dtype=tf.string
)

it will get the words ferenqnecy and try to put them into buckets. You will need to consider the bucket_size.

Embedding columns

embedding column is use the hanlding Analog signal / data. it will normalize your data into Gaussian distribution and create vector

In [13]:
thal_embedding = feature_column.embedding_column(thal, dimension=3)

# below were the same
feature_layer = layers.DenseFeatures(thal_embedding)
mydata = feature_layer(dict_train)
print(mydata[:5])

# if you take closer look to, the distrubution should be the same on with 
# categorical_column_with_hash_bucket since we now have dim = 3
tf.Tensor( [[ 0.32983354 -0.6892836 -0.00527199] [-0.18253493 0.54448396 0.9803278 ] [-0.18253493 0.54448396 0.9803278 ] [ 0.32983354 -0.6892836 -0.00527199] [-0.18253493 0.54448396 0.9803278 ]], shape=(5, 3), dtype=float32)
Crossed feature columns

oen of the most important is Crosee feature to find the relation between defferent column.

ERROR

below got error and some of report said it was related to x86 x64 environment

that cause "Python int too large to convert to C long."

After tring on Colab and check data structure & sys.maxsize, everything were fine.

I aspect this is a bug for TF2.0 and skip this one now.

In [14]:
print(age_buckets)
print(thal)
print("=======")

crossed_feature = feature_column.crossed_column(keys = [age_buckets, thal], hash_bucket_size=1000)
print(type(crossed_feature))

# make it a one-hot feature column
crossed_feature_one_hot = feature_column.indicator_column(crossed_feature)
print(type(crossed_feature_one_hot))

# below were the same
#feature_layer = layers.DenseFeatures(crossed_feature_one_hot)
#print(type(dict_train))

#mydata = feature_layer(dict_train)
#print(mydata[:5])
BucketizedColumn(source_column=NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)) VocabularyListCategoricalColumn(key='thal', vocabulary_list=('fixed', 'normal', 'reversible'), dtype=tf.string, default_value=-1, num_oov_buckets=0) ======= <class 'tensorflow.python.feature_column.feature_column_v2.CrossedColumn'> <class 'tensorflow.python.feature_column.feature_column_v2.IndicatorColumn'>

Why sometimes we need to transfer feature column into another one before sending into layers?

__init__(
    feature_columns,
    trainable=True,
    name=None,
    **kwargs
)

feature_columns: An iterable containing the FeatureColumns to use as inputs to your model. All items should be instances of classes derived from DenseColumn such as - numeric_column, - embedding_column, - bucketized_column, - indicator_column.

If you have categorical features,
you can wrap them with an embedding_column or indicator_column.

ok, till now, we found the structure of TF were the same and we shouldn't create function for each of them.

Let's redo this job in another way.

Column Description Feature Type Data Type
Age Age in years Numerical integer
Sex (1 = male; 0 = female) Categorical integer
CP Chest pain type (0, 1, 2, 3, 4) Categorical integer
Trestbpd Resting blood pressure (in mm Hg on admission to the hospital) Numerical integer
Chol Serum cholestoral in mg/dl Numerical integer
FBS (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) Categorical integer
RestECG Resting electrocardiographic results (0, 1, 2) Categorical integer
Thalach Maximum heart rate achieved Numerical integer
Exang Exercise induced angina (1 = yes; 0 = no) Categorical integer
Oldpeak ST depression induced by exercise relative to rest Numerical integer
Slope The slope of the peak exercise ST segment Numerical float
CA Number of major vessels (0-3) colored by flourosopy Numerical integer
Thal 3 = normal; 6 = fixed defect; 7 = reversable defect Categorical string
Target Diagnosis of heart disease (1 = true; 0 = false) Classification integer
In [15]:
feature_columns = []

# 1. handle those use raw int data 
numerical_int_headers = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']
for column_name in numerical_int_headers:
    feature_columns.append(feature_column.numeric_column(column_name))

#2. any other bool values
bool_headers = ['sex','fbs','exang']
for column_name in bool_headers:
    feature_columns.append(feature_column.numeric_column(column_name))

# 3. indicator cols
cp = feature_column.categorical_column_with_vocabulary_list(
      'cp', [0,1,2,3,4])
cp_one_hot = feature_column.indicator_column(cp)
feature_columns.append(cp_one_hot)

restecg = feature_column.categorical_column_with_vocabulary_list(
      'restecg', [0,1,2])
restecg_one_hot = feature_column.indicator_column(restecg)
feature_columns.append(restecg_one_hot)

thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# 4. embedding with another dimention
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

#5. bucketized age
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)
In [27]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
In [28]:
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

ds_train = Transfer2TfData(train, True)
ds_val   = Transfer2TfData(val, True)
ds_test  = Transfer2TfData(test, True)

model.fit(ds_train,
          validation_data=ds_val,
          epochs=10)
W0505 21:04:34.736920 14316 training_utils.py:1353] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
Epoch 1/10 7/7 [==============================] - 1s 117ms/step - loss: 3.6258 - accuracy: 0.5932 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 2/10 7/7 [==============================] - 0s 64ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 3/10 7/7 [==============================] - 0s 64ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 4/10 7/7 [==============================] - 0s 64ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 5/10 7/7 [==============================] - 0s 65ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 6/10 7/7 [==============================] - 0s 66ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 7/10 7/7 [==============================] - 0s 64ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 8/10 7/7 [==============================] - 0s 66ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 9/10 7/7 [==============================] - 0s 64ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347 Epoch 10/10 7/7 [==============================] - 0s 65ms/step - loss: 3.7229 - accuracy: 0.7562 - val_loss: 3.9838 - val_accuracy: 0.7347
Out[28]:
<tensorflow.python.keras.callbacks.History at 0x21af7d56ba8>
In [29]:
loss, accuracy = model.evaluate(ds_test)
print("Accuracy", accuracy)
2/2 [==============================] - 0s 38ms/step - loss: 4.2219 - accuracy: 0.7213 Accuracy 0.72131145
In [ ]:
import jovian
jovian.commit()
[jovian] Saving notebook..