Working with tf.data (TF 2)

Introduction to tf.data (TF2)
H2O Meetup
Galvanize San Francisco
02/19/2020
Oswald Campesato
ocampesato@yahoo.com

Highlights/Overview
What is tf.data?
Working with TF 2 tf.data.Dataset
Intermediate operators
Terminal operators
filter() and map()
zip() and batch()
Working with TF 2 generators

tf.data: TF Input Pipeline
An input pipeline is useful:
  for streaming data
  when data is too big to fit in memory
  when data requires preprocessing
  when you need to shuffle large data
  when the work must scale to multiple hosts
=> ETL (extract, transform, load) functionality
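The ETL idea can be sketched on a tiny in-memory array. This is a minimal illustration, not a production pipeline: the array `raw`, the doubling step, and the buffer and batch sizes are arbitrary choices made for the example.

```python
import numpy as np
import tensorflow as tf

# Extract: wrap an in-memory array (a stand-in for files or a stream)
raw = np.arange(10, dtype=np.int64)
ds = tf.data.Dataset.from_tensor_slices(raw)

# Transform: preprocess each element, shuffle, and group into batches
ds = ds.map(lambda v: v * 2)
ds = ds.shuffle(buffer_size=10, seed=42)
ds = ds.batch(4)

# Load: prefetch so the next batch is prepared while the current one is used
ds = ds.prefetch(1)

batches = [b.numpy() for b in ds]
for b in batches:
    print("batch:", b)
```

Ten elements with a batch size of 4 produce three batches (4, 4, and 2 elements); the order inside them depends on the shuffle.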

What are tf.data.Datasets
Simple example:
  1) define a NumPy array of numbers
  2) create a TF Dataset ds
  3) iterate through the dataset ds

What are TF Datasets
import tensorflow as tf  # tf-dataset1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
# iterate through the elements:
for value in ds.take(len(x)):
    print(value)

What are Lambda Expressions
a lambda expression is an anonymous function
use lambda expressions to define local functions
pass lambda expressions as arguments
return them as the value of function calls
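The four points above can be seen in plain Python, with no TensorFlow involved. The helper names `apply_twice` and `make_adder` are hypothetical and exist only for this sketch.

```python
# pass a lambda expression as an argument
def apply_twice(fn, value):
    return fn(fn(value))

# return a lambda expression as the value of a function call
def make_adder(n):
    return lambda x: x + n

add_one = lambda x: x + 1   # an anonymous function bound to a local name
add_five = make_adder(5)    # a lambda returned from a function call

result1 = apply_twice(add_one, 10)               # add_one applied twice
result2 = add_five(10)
squares = list(map(lambda x: x * x, [1, 2, 3]))  # lambda passed to map()

print(result1, result2, squares)
```

The same pattern, a small function passed as an argument, is what `ds.map(...)` and `ds.filter(...)` expect.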

Some tf.data “lazy operators”
map()
filter()
flat_map()
batch()
take()
zip()
flatten()
=> Combined via “method chaining”

tf.data “lazy operators”
filter():
  uses Boolean logic to "filter" the elements of a dataset, keeping only the elements that satisfy the Boolean condition
map(): a projection
  this operator "applies" a lambda expression to each input element
flat_map():
  maps a single element of the input dataset to a Dataset of elements, then flattens the results into one dataset
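A small sketch of the three operators side by side may help. The input values and the variable names are arbitrary; the `flat_map()` step turns each element into a two-element dataset purely for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])

# filter(): keep only the elements that satisfy the condition
evens = ds.filter(lambda x: tf.equal(x % 2, 0))

# map(): apply a lambda expression to every element
doubled = ds.map(lambda x: x * 2)

# flat_map(): each element x becomes a small dataset (x, x),
# and the small datasets are flattened into one stream
repeated = ds.take(3).flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(2))

evens_list = [int(v.numpy()) for v in evens]
doubled_list = [int(v.numpy()) for v in doubled]
repeated_list = [int(v.numpy()) for v in repeated]
print(evens_list, doubled_list, repeated_list)
```

Nothing is computed when the operators are chained; the work happens only when the datasets are iterated in the last lines.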

tf.data “lazy operators”
batch(n):
  groups n consecutive elements into a single "batch" element
repeat(n):
  repeats its input values n times
take(n):
  "takes" at most the first n input values
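The behavior of these three operators is easy to check on a five-element dataset. This is a minimal sketch; the sizes are chosen only so that the last batch is incomplete and `take()` asks for more than exists.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4

# batch(2): the last batch holds the single leftover element
batched = [b.numpy().tolist() for b in ds.batch(2)]

# repeat(2): the whole sequence is produced twice
repeated = [int(v.numpy()) for v in ds.repeat(2)]

# take(3): only the first three elements
taken = [int(v.numpy()) for v in ds.take(3)]

# take(20): asking for more than exists just yields everything
over = [int(v.numpy()) for v in ds.take(20)]

print(batched, repeated, taken, over)
```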

tf.data.Dataset.from_tensors()
import tensorflow as tf
# combine the input into one element
t1 = tf.constant([[1, 2], [3, 4]])
ds1 = tf.data.Dataset.from_tensors(t1)
# output: [[1, 2], [3, 4]]

tf.data.Dataset.from_tensor_slices()
import tensorflow as tf
# separate element for each item
t2 = tf.constant([[1, 2], [3, 4]])
ds1 = tf.data.Dataset.from_tensor_slices(t2)
# output: [1, 2], [3, 4]
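To make the difference concrete, the sketch below builds both datasets from the same tensor and collects their elements. The variable names `whole` and `rows` are illustrative only.

```python
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])

# from_tensors(): the whole tensor becomes ONE dataset element
whole = tf.data.Dataset.from_tensors(t)

# from_tensor_slices(): the tensor is sliced along its first axis,
# so each row becomes its own element
rows = tf.data.Dataset.from_tensor_slices(t)

whole_items = [v.numpy().tolist() for v in whole]
row_items = [v.numpy().tolist() for v in rows]

print(len(whole_items), whole_items)
print(len(row_items), row_items)
```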

TF2 Datasets: code sample
import tensorflow as tf
import numpy as np
x = np.arange(0, 10)
# create a dataset from a NumPy array
ds = tf.data.Dataset.from_tensor_slices(x)

TF filter() operator: ex #1
import tensorflow as tf  # tf2_filter1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
print("First iteration:")
for value in ds:
    print("value:",value)

First iteration:
value: tf.Tensor(1, shape=(), dtype=int64)

import tensorflow as tf  # tf2_filter2.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)

# "tf.math.equal(x, y)" is required
# for equality comparison
def filter_fn(x):
    return tf.math.equal(x, 1)
ds = ds.filter(filter_fn)
print("Second iteration:")
for value in ds:
    print("value:",value)

Second iteration:

What are Lambda Expressions
a lambda expression takes an input variable
performs an operation on that variable
a "bare bones" lambda expression:
  lambda x: x + 1
=> this adds 1 to an input variable x

import tensorflow as tf  # tf2_filter3.py
ds = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])
ds = ds.filter(lambda x: x < 4)  # [1,2,3]

 # "tf.math.equal(x, y)" is required
 # for equality comparison
 return tf.math.equal(x, 1)
 print("Second iteration:")

 Second iteration:

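# keep only the even numbers; the two predicates are equivalent
# (function names are placeholders; the originals are not shown)
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
def filter_even1(x):
    return tf.equal(x % 2, 0)
def filter_even2(x):
    return tf.reshape(tf.not_equal(x % 2, 1), [])
ds = ds.filter(filter_even1)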

TF map() operator: ex #1
import tensorflow as tf  # tf2-map.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.map(lambda x: x*2)
for value in ds:
    print("value:",value)

value: tf.Tensor([2], shape=(1,), dtype=int64)

TF map() and filter() operators
import tensorflow as tf  # tf2_map_filter.py
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
# ds1: filter first, then square
ds1 = ds.filter(lambda x: tf.equal(x % 4, 0))
ds1 = ds1.map(lambda x: x*x)
# ds2: square first, then filter
ds2 = ds.map(lambda x: x*x)
ds2 = ds2.filter(lambda x: tf.equal(x % 4, 0))

for value1 in ds1:
    print("value1:",value1)
for value2 in ds2:
    print("value2:",value2)

value1: tf.Tensor(16, shape=(), dtype=int64)
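The order of `filter()` and `map()` matters, which is easier to see when the results are collected into lists. The sketch below follows the same logic as the slide; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

x = np.arange(1, 11)  # 1 .. 10
ds = tf.data.Dataset.from_tensor_slices(x)

# filter first (multiples of 4), then square what is left
filter_then_map = (ds.filter(lambda v: tf.equal(v % 4, 0))
                     .map(lambda v: v * v))

# square first, then keep the squares that are multiples of 4
map_then_filter = (ds.map(lambda v: v * v)
                     .filter(lambda v: tf.equal(v % 4, 0)))

list1 = [int(v.numpy()) for v in filter_then_map]
list2 = [int(v.numpy()) for v in map_then_filter]
print(list1)  # squares of 4 and 8
print(list2)  # squares of every even number
```

Filtering first keeps only 4 and 8, while squaring first lets every even number through, because the square of any even number is divisible by 4.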

import tensorflow as tf  # tf2-map2.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
# METHOD #1: THE LONG WAY
# a lambda expression to double each value
#ds = ds.map(lambda x: x*2)
# a lambda expression to add one to each value
#ds = ds.map(lambda x: x+1)
# a lambda expression to cube each value
#ds = ds.map(lambda x: x**3)

# METHOD #2: A SHORTER WAY
# an example of "Method Chaining"
ds = ds.map(lambda x: x*2).map(lambda x: x+1).map(lambda x: x**3)
for value in ds:
    print("value:",value)

TF take() operator: ex #1
import tensorflow as tf  # tf2-take.py
import numpy as np
ds = tf.data.Dataset.from_tensor_slices(tf.range(8))
ds = ds.take(5)
# take(20) yields only the 5 elements that remain
for value in ds.take(20):
    print("value:",value)

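import tensorflow as tf  # tf2_take.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
# make a ds from a numpy array
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.map(lambda x: x*2).map(lambda x: x+1).map(lambda x: x**3)
# the take() count is chosen for illustration
for value in ds.take(2):
    print("value:",value)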

TF 2 map() and take(): output

TF zip() operator: ex #1
import tensorflow as tf  # tf2_zip1.py
dx = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dy = tf.data.Dataset.from_tensor_slices([1,1,2,3,5])
# zip the two datasets together
d2 = tf.data.Dataset.zip((dx, dy))
for value in d2:
    print("value:",value)

value:
(<tf.Tensor: id=11, shape=(), dtype=int32, numpy=0>,
<tf.Tensor: id=12, shape=(), dtype=int32, numpy=1>)
value:
value:
=> Plus two more rows of output

import tensorflow as tf  # tf2_zip_take.py
import numpy as np
x = np.arange(0, 10)
y = np.arange(1, 11)
dx = tf.data.Dataset.from_tensor_slices(x)
dy = tf.data.Dataset.from_tensor_slices(y)
# zip the two datasets together, 3 pairs per batch
d2 = tf.data.Dataset.zip((dx, dy)).batch(3)
# only 4 batches exist, so take(8) yields 4
for value in d2.take(8):
    print("value:",value)

value: (<tf.Tensor: id=11, shape=(), dtype=int32, numpy=0>, <tf.Tensor: id=12, shape=(), dtype=int32, numpy=1>)
value: (..., numpy=1>)
value: (..., numpy=2>)
value: (..., numpy=3>)
value: (..., numpy=5>)

import tensorflow as tf  # tf2_zip_batch.py
ds1 = tf.data.Dataset.range(100)
ds2 = tf.data.Dataset.range(0, -100, -1)
ds3 = tf.data.Dataset.zip((ds1, ds2))
ds4 = ds3.batch(4)
for value in ds4.take(4):
    print("value:",value)

value: (<tf.Tensor: id=21, shape=(4,), dtype=int64, numpy=array([0, 1, 2, 3])>, <tf.Tensor: id=22, shape=(4,), dtype=int64, numpy=array([ 0, -1, -2, -3])>)
value: (..., shape=(4,), dtype=int64, numpy=array([-4, -5, -6, -7])>)
value: (..., numpy=array([ 8, 9, 10, 11])>, <tf.Tensor: id=30, shape=(4,), dtype=int64, numpy=array([ -8, -9, -10, -11])>)
value: (..., shape=(4,), dtype=int64, numpy=array([-12, -13, -14, -15])>)

TF 2 generators
Python functions
Containing your custom code
Specified in the dataset definition
Invoked when data is requested
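The last point, that the generator runs only when data is requested, can be checked with a small sketch. The `calls` list is a hypothetical bookkeeping aid added for this example; `output_types` matches the TF 2.0-era style used in this deck (newer TensorFlow releases also accept `output_signature`).

```python
import tensorflow as tf

calls = []  # records every value the generator actually produces

def gener():
    for i in range(5):
        calls.append(i)
        yield i

# defining the dataset does not run the generator
ds = tf.data.Dataset.from_generator(gener, output_types=tf.int64)
calls_before = len(calls)

# iterating the dataset is what invokes the generator
first_three = [int(v.numpy()) for v in ds.take(3)]

# a second iteration starts the generator again from the beginning
all_values = [int(v.numpy()) for v in ds]

print(calls_before, first_three, all_values)
```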

Generator Functions (1)
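import tensorflow as tf  # tf2_generator1.py
import numpy as np
x = np.arange(0, 10)  # sample input (assumed; not shown on the slide)
def gener():
    i = 0
    while(i < len(x)):
        yield i
        i += 1

ds = tf.data.Dataset.from_generator(gener,(tf.int64))
# the generator yields only len(x) values, so take(size) stops early
size = 2*len(x)
for value in ds.take(size):
    print("value:",value)
# value: tf.Tensor(0, shape=(), dtype=int64)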

import tensorflow as tf  # tf2-timesthree.py
import numpy as np
x = np.arange(0, 5)  # 0, 1, 2, 3, 4
def gener():
    for i in x:
        yield (3*i)
ds = tf.data.Dataset.from_generator(gener, (tf.int64))
for value in ds.take(len(x)):
    print("1value:",value)
for value in ds.take(2*len(x)):
    print("2value:",value)

1value: tf.Tensor(0, shape=(), dtype=int64)

import tensorflow as tf  # tf2_generator3.py
import numpy as np
x = np.arange(0, 12)  # sample input (assumed; not shown on the slide)
def gener():
    i = 0
    while(i < len(x)):
        yield (i, i+1, i+2)
        i += 3

ds = tf.data.Dataset.from_generator(
    gener,
    (tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)
#value:
#(<tf.Tensor: id=35, shape=(),dtype=int64,numpy=0>,
# <tf.Tensor: id=36, shape=(),dtype=int64,numpy=1>,
# <tf.Tensor: id=37, shape=(),dtype=int64,numpy=2>)
#value:
#(<tf.Tensor: id=41, shape=(),dtype=int64,numpy=3>,
# <tf.Tensor: id=42, shape=(),dtype=int64,numpy=4>,

Processing Text Files (1)
define a TF Dataset with lines in file.txt
skip lines that start with a “#” character
then display only the first two lines

Contents of file.txt
#this is file line #1
this is file line #3
this is file line #5

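import tensorflow as tf  # tf2_flatmap_filter.py
filenames = ["file.txt"]
ds = tf.data.Dataset.from_tensor_slices(filenames)

# read each file line by line, skip its first line,
# and drop lines that start with "#"
ds = ds.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line:
            tf.not_equal(tf.strings.substr(line,0,1),"#"))))

# display only the first two lines
for value in ds.take(2):
    print("value:",value)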

Text Output (first two lines)
('value:', <tf.Tensor: id=16, shape=(5,), dtype=string, numpy=array(['this', 'is', 'file', 'line', '#3'], dtype=object)>)
('value:', <tf.Tensor: id=18, shape=(5,), dtype=string, numpy=array(['this', 'is', 'file', 'line', '#5'], dtype=object)>)
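The output above shows each line already split into five words, a step the code on the slide does not include. One plausible way to get that shape, assuming a `map()` with `tf.strings.split` was applied to each line, is sketched here on two in-memory strings rather than on file.txt.

```python
import tensorflow as tf

# two lines standing in for the filtered contents of the text file
lines = tf.data.Dataset.from_tensor_slices(
    ["this is file line #3", "this is file line #5"])

# assumed step: split each line on whitespace into a 1-D string tensor
words = lines.map(lambda line: tf.strings.split(line))

# decode the byte strings so the result prints as ordinary text
word_lists = [[w.decode("utf-8") for w in v.numpy()] for v in words]
for wl in word_lists:
    print("value:", wl)
```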

Tokenizers and tf.text
import tensorflow as tf  # NB: requires TF 2
import tensorflow_text as text
# pip3 install -q tensorflow-text
docs = tf.data.Dataset.from_tensor_slices(
    [['Chicago Pizza'],
     ["how are you"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(
    lambda x: tokenizer.tokenize(x))

Tokenizers and tf.text
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
# output:
# [[b'Chicago', b'Pizza']]
# [[b'how', b'are', b'you']]

tf.data and MNIST
import tensorflow as tf  # tf2_mnist.py
train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train
mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
for value in mnist_ds.take(2):
    print("value:",value)

tf.data and MNIST
value: tf.Tensor(
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 3 18 18 18 126 136 175 26 166 255 247 127 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 30 36 94 154 170 253 253 253 253 253 225 172 253 242 195 64 0 0 0 0]
 [ 0 0 0 0 0 0 0 49 238 253 253 253 253 253 253 253 253 251 93 82 82 56 39 0 0 0 0 0]

TF2 generator Example
import tensorflow as tf  # tf2_generator2.py
import numpy as np
x = np.arange(0, 12)
def gener():
    i = 0
    while(i < len(x)):
        yield (i, i+1, i+2)  # three integers at a time
        i += 3
ds = tf.data.Dataset.from_generator(gener, (tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)

TF2 generator Example
value:
(<tf.Tensor: id=35, shape=(), dtype=int64, numpy=0>,
<tf.Tensor: id=36, shape=(), dtype=int64, numpy=1>,
..., numpy=2>)
value:
(<tf.Tensor: id=38, shape=(), dtype=int64, numpy=3>,
..., numpy=4>,

TF 2 tf.data.TFRecordDataset
# tf_records: a list of TFRecord file names
# preprocess: your parsing/pre-processing function
ds = tf.data.TFRecordDataset(tf_records)
ds = ds.map(preprocess)
ds = ds.batch(batch_size=32)
# OR:
ds = (tf.data.TFRecordDataset(tf_records)
      .map(preprocess)
      .batch(batch_size=32))
model = ...  # a Keras model
model.fit(ds, epochs=20)
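For readers without TFRecord files at hand, the read, map, batch pattern can be tried end to end with a small round trip. This is a sketch under stated assumptions: the file name, the single int64 feature called "value", and the `preprocess` function are all invented for the example.

```python
import os
import tempfile
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")

# write six records, each holding one int64 feature named "value"
with tf.io.TFRecordWriter(path) as writer:
    for i in range(6):
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[i]))}))
        writer.write(example.SerializeToString())

# describes how to decode one serialized record
feature_spec = {"value": tf.io.FixedLenFeature([], tf.int64)}

def preprocess(serialized):
    # parse the record, then apply a toy transformation
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["value"] * 10

ds = (tf.data.TFRecordDataset(path)
      .map(preprocess)
      .batch(batch_size=4))

batches = [b.numpy().tolist() for b in ds]
print(batches)
```

A dataset built this way can be passed directly to `model.fit(ds, ...)` once `preprocess` returns (features, label) pairs.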

Use prefetch() for Performance
.prefetch(buffer_size=X)

Parallelize Data Transformations
.map(preprocess,num_parallel_calls=Y)
=> uses background threads & internal buffer

Parallelize Data “Readers”
ds = (tf.data.TFRecordDataset(tf_records,
          num_parallel_reads=Z)
      .map(preprocess,num_parallel_calls=Y))
=> for sharded data

Parallelize Data “Readers”
Selecting optimal values:
  tf.data.experimental.AUTOTUNE
  Uses Reinforcement Learning
  To tune values during data input
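In practice the constant is passed wherever a buffer size or a degree of parallelism is expected. The sketch below uses a toy `preprocess` function on a small range dataset; both are assumptions made for illustration. Newer TensorFlow releases also expose the same constant as `tf.data.AUTOTUNE`.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def preprocess(x):
    # stand-in for real per-element work (decoding, augmentation, ...)
    return x * x

ds = (tf.data.Dataset.range(8)
      .map(preprocess, num_parallel_calls=AUTOTUNE)  # parallel transform
      .batch(4)
      .prefetch(buffer_size=AUTOTUNE))               # overlap with consumer

batches = [b.numpy().tolist() for b in ds]
print(batches)
```

Element order is preserved by default even though the map runs in parallel, so the batches come out in the same order as a sequential pipeline.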

tf.data.Options
Setting global options:
  Deterministic/non-deterministic
  Statistics
  Optimizations (ex: autotuning)
  Threading (ex: private thread pool)

tf.data.Options
op = tf.data.Options()
op.experimental_optimization.map_parallelization = True
ds = ds.with_options(op)
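A complete, runnable version of this pattern might look as follows. It is a minimal sketch: `map_parallelization` is used as one example of an optimization flag, and the dataset and lambda are placeholders. Options are attached with `with_options()` and change how the pipeline is executed, not what it produces.

```python
import tensorflow as tf

options = tf.data.Options()
# let tf.data parallelize eligible map() transformations automatically
options.experimental_optimization.map_parallelization = True

ds = tf.data.Dataset.range(5).map(lambda x: x + 1)
ds = ds.with_options(options)

values = [int(v.numpy()) for v in ds]
print(values)
```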

TF 2: Built-in Datasets
tf.keras.datasets.boston_housing
tf.keras.datasets.cifar10
tf.keras.datasets.cifar100
tf.keras.datasets.fashion_mnist
tf.keras.datasets.imdb
tf.keras.datasets.mnist
tf.keras.datasets.reuters

About Me: Recent Books
1) Python3 and Machine Learning (2020)
2) Angular 9 and Deep Learning (2020)
3) Angular 8 & Machine Learning (2020)
4) AI/ML/DL: Concepts and Code (2020)
5) Bash Programming on Mac (2020)
6) TensorFlow 2 Pocket Primer (2019)
7) TensorFlow 1.x Pocket Primer (2019)
8) Python for TensorFlow (2019)
9) C Programming Pocket Primer (2019)

About Me: Less Recent Books
10) RegEx Pocket Primer (2018)
11) Data Cleaning Pocket Primer (2018)
12) Angular Pocket Primer (2017)
13) Android Pocket Primer (2017)
14) CSS3 Pocket Primer (2016)
15) SVG Pocket Primer (2016)
16) Python Pocket Primer (2015)
17) D3 Pocket Primer (2015)
18) HTML5 Mobile Pocket Primer (2014)

About Me: Older Books
19) jQuery, CSS3, and HTML5 (2013)
20) HTML5 Pocket Primer (2013)
21) jQuery Pocket Primer (2013)
22) HTML5 Canvas (2012)
23) Flash on Android (2011)
24) Web 2.0 Fundamentals (2010)
25) MS Silverlight Graphics (2008)
26) Fundamentals of SVG (2003)
27) Java Graphics Library (2002)

ML/DL/NLP/DRL Classes
ML/DL/NLP/RL Instructor at UCSC (Santa Clara):
Deep Learning (TF2/Keras) (01/30/2020) 10 weeks
NLP (Transformer/BERT/etc) (02/21/2020) 10 weeks
ML (ML, NLP, RL) (05/01/2020) 10 weeks
DRL (PPO/A3C/SAC/etc) (05/07/2020) 10 weeks
ML (ML, NLP, RL) (06/16/2020) 10 weeks
NLP (Transformer/BERT/etc) (06/29/2020) 10 weeks
UCSC link:
https://www.ucsc-extension.edu/certificate-program/offering/deep-learning-and-artificial-intelligence-tensorflow
