Working with tf.data (TF 2)
- 1. Introduction to tf.data (TF2)
H2O Meetup
Galvanize San Francisco
02/19/2020
Oswald Campesato
ocampesato@yahoo.com
- 3. tf.data: TF Input Pipeline
An input pipeline is useful:
for streaming data
when data is too big to fit in memory
when data requires preprocessing
when you need to shuffle large data
when the pipeline must scale to multiple hosts
=> ETL functionality
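The ETL idea above can be sketched as follows. This is a minimal illustration, assuming a small in-memory NumPy array stands in for the real data source; the names `raw` and `batches` and the specific buffer and batch sizes are hypothetical choices, not part of the original slides.

```python
import numpy as np
import tensorflow as tf

# Extract: build a dataset from in-memory data
# (a stand-in for files or a stream)
raw = np.arange(10, dtype=np.int64)
ds = tf.data.Dataset.from_tensor_slices(raw)

# Transform: preprocess, shuffle, and batch
ds = ds.map(lambda x: x * 2)
ds = ds.shuffle(buffer_size=10, seed=0)
ds = ds.batch(4)

# Load: iterate over the batches (or pass ds to model.fit)
batches = [batch.numpy() for batch in ds]
for batch in batches:
    print(batch)
```

Each stage is lazy: nothing is read or transformed until the dataset is iterated.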
- 4. What are tf.data.Datasets
Simple example:
1) define a Numpy array of numbers
2) create a TF Dataset ds
3) iterate through the dataset ds
- 5. What are TF Datasets
import tensorflow as tf # tf-dataset1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
# iterate through the elements:
for value in ds.take(len(x)):
    print(value)
- 6. What are Lambda Expressions
a lambda expression is an anonymous function
use lambda expressions to define local functions
pass lambda expressions as arguments
return them as the value of function calls
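The three uses listed above can be sketched in plain Python. The names `add_one`, `make_multiplier`, and `triple` are hypothetical and chosen only for illustration.

```python
# define a local function with a lambda expression
add_one = lambda x: x + 1

# pass a lambda expression as an argument
doubled = list(map(lambda x: x * 2, [1, 2, 3]))

# return a lambda expression as the value of a function call
def make_multiplier(n):
    return lambda x: x * n

triple = make_multiplier(3)

print(add_one(4), doubled, triple(5))
```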
- 7. Some tf.data “lazy operators”
map()
filter()
flat_map()
batch()
take()
zip()
flatten()
=> Combined via “method chaining”
- 8. tf.data “lazy operators”
filter():
keeps only the elements of a dataset that satisfy a
Boolean condition
map(): a projection
this operator "applies" a lambda expression (or a named function)
to each input element
flat_map():
maps each element of the input dataset to a Dataset of
elements, then flattens the results into a single dataset
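A small sketch of flat_map() combined with filter() may help here, since the later slides only show flat_map() with text files. The input values and the names `flat`, `evens`, `flat_values`, and `even_values` are assumptions made for illustration.

```python
import tensorflow as tf

# three elements, each a vector of two integers
ds = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4], [5, 6]])

# flat_map: turn each vector into its own dataset of scalars,
# then concatenate those datasets into a single flat dataset
flat = ds.flat_map(lambda row: tf.data.Dataset.from_tensor_slices(row))

# filter: keep only the even scalars
evens = flat.filter(lambda x: tf.equal(x % 2, 0))

flat_values = [int(v.numpy()) for v in flat]
even_values = [int(v.numpy()) for v in evens]
print(flat_values)
print(even_values)
```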
- 9. tf.data “lazy operators”
batch(n):
processes a "batch" of n elements during each
iteration
repeat(n):
repeats its input values n times
take(n):
operator "takes" n input values
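The three operators above can be chained, as in this minimal sketch. The sizes are arbitrary and the name `result` is hypothetical.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)   # 0, 1, 2, 3, 4
ds = ds.repeat(2)               # 0..4 followed by 0..4 again
ds = ds.batch(3)                # groups of up to 3 elements
ds = ds.take(3)                 # only the first three batches

result = [batch.numpy().tolist() for batch in ds]
print(result)
```

Note that the order matters: batching after repeat() lets a batch span the boundary between two passes over the data.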
- 12. TF2 Datasets: code sample
import tensorflow as tf
import numpy as np
x = np.arange(0, 10)
# create a dataset from a Numpy array
ds = tf.data.Dataset.from_tensor_slices(x)
- 13. TF filter() operator: ex #1
import tensorflow as tf # tf2_filter1.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
print("First iteration:")
for value in ds:
    print("value:",value)
- 14. TF filter() operator: ex #1
First iteration:
value: tf.Tensor(1, shape=(), dtype=int64)
value: tf.Tensor(2, shape=(), dtype=int64)
value: tf.Tensor(3, shape=(), dtype=int64)
value: tf.Tensor(4, shape=(), dtype=int64)
value: tf.Tensor(5, shape=(), dtype=int64)
- 15. TF filter() operator: ex #2
import tensorflow as tf # tf2_filter2.py
import numpy as np
x = np.array([1,2,3,4,5])
ds = tf.data.Dataset.from_tensor_slices(x)
print("First iteration:")
for value in ds:
    print("value:",value)
- 16. TF filter() operator: ex #2
# "tf.math.equal(x, y)" is required
# for equality comparison
def filter_fn(x):
    return tf.math.equal(x, 1)
ds = ds.filter(filter_fn)
print("Second iteration:")
for value in ds:
    print("value:",value)
- 17. TF filter() operator: ex #2
First iteration:
value: tf.Tensor(1, shape=(), dtype=int64)
value: tf.Tensor(2, shape=(), dtype=int64)
value: tf.Tensor(3, shape=(), dtype=int64)
value: tf.Tensor(4, shape=(), dtype=int64)
value: tf.Tensor(5, shape=(), dtype=int64)
Second iteration:
value: tf.Tensor(1, shape=(), dtype=int64)
- 18. What are Lambda Expressions
a lambda expression takes an input variable
performs an operation on that variable
A "bare bones" lambda expression:
lambda x: x + 1
=> this adds 1 to an input variable x
- 19. TF filter() operator: ex #3
import tensorflow as tf # tf2_filter3.py
import numpy as np
ds = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])
ds = ds.filter(lambda x: x < 4) # [1,2,3]
print("First iteration:")
for value in ds:
    print("value:",value)
- 20. TF filter() operator: ex #3
# "tf.math.equal(x, y)" is required
# for equality comparison
def filter_fn(x):
    return tf.math.equal(x, 1)
ds = ds.filter(filter_fn)
print("Second iteration:")
for value in ds:
    print("value:",value)
- 21. TF filter() operator: ex #3
First iteration:
value: tf.Tensor(1, shape=(), dtype=int32)
value: tf.Tensor(2, shape=(), dtype=int32)
value: tf.Tensor(3, shape=(), dtype=int32)
Second iteration:
value: tf.Tensor(1, shape=(), dtype=int32)
- 22. TF filter() operator: ex #4
import tensorflow as tf # tf2_filter5.py
import numpy as np
def filter_fn(x):
    # keep only the even values
    return tf.equal(x % 2, 0)
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.filter(filter_fn)
for value in ds:
    print("value:",value)
- 23. TF filter() operator: ex #4
value: tf.Tensor(2, shape=(), dtype=int64)
value: tf.Tensor(4, shape=(), dtype=int64)
value: tf.Tensor(6, shape=(), dtype=int64)
value: tf.Tensor(8, shape=(), dtype=int64)
value: tf.Tensor(10, shape=(), dtype=int64)
- 24. TF filter() operator: ex #5
import tensorflow as tf # tf2_filter5.py
import numpy as np
def filter_fn(x):
    # keep values whose remainder mod 2 is not 1 (the even values)
    return tf.reshape(tf.not_equal(x % 2, 1), [])
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.filter(filter_fn)
for value in ds:
    print("value:",value)
- 25. TF filter() operator: ex #5
value: tf.Tensor(2, shape=(), dtype=int64)
value: tf.Tensor(4, shape=(), dtype=int64)
value: tf.Tensor(6, shape=(), dtype=int64)
value: tf.Tensor(8, shape=(), dtype=int64)
value: tf.Tensor(10, shape=(), dtype=int64)
- 26. TF map() operator: ex #1
import tensorflow as tf # tf2-map.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
ds = ds.map(lambda x: x*2)
for value in ds:
    print("value:",value)
- 27. TF map() operator: ex #1
value: tf.Tensor([2], shape=(1,), dtype=int64)
value: tf.Tensor([4], shape=(1,), dtype=int64)
value: tf.Tensor([6], shape=(1,), dtype=int64)
value: tf.Tensor([8], shape=(1,), dtype=int64)
- 28. TF map() and filter() operators
import tensorflow as tf # tf2_map_filter.py
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
ds = tf.data.Dataset.from_tensor_slices(x)
ds1 = ds.filter(lambda x: tf.equal(x % 4, 0))
ds1 = ds1.map(lambda x: x*x)
ds2 = ds.map(lambda x: x*x)
ds2 = ds2.filter(lambda x: tf.equal(x % 4, 0))
- 29. TF map() and filter() operators
for value1 in ds1:
    print("value1:",value1)
for value2 in ds2:
    print("value2:",value2)
- 30. TF map() and filter() operators
value1: tf.Tensor(16, shape=(), dtype=int64)
value1: tf.Tensor(64, shape=(), dtype=int64)
value2: tf.Tensor(4, shape=(), dtype=int64)
value2: tf.Tensor(16, shape=(), dtype=int64)
value2: tf.Tensor(36, shape=(), dtype=int64)
value2: tf.Tensor(64, shape=(), dtype=int64)
value2: tf.Tensor(100, shape=(), dtype=int64)
- 31. TF map() operator: ex #2
import tensorflow as tf # tf2-map2.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
ds = tf.data.Dataset.from_tensor_slices(x)
# METHOD #1: THE LONG WAY
# a lambda expression to double each value
#ds = ds.map(lambda x: x*2)
# a lambda expression to add one to each value
#ds = ds.map(lambda x: x+1)
# a lambda expression to cube each value
#ds = ds.map(lambda x: x**3)
- 32. TF map() operator: ex #2
# METHOD #2: A SHORTER WAY
ds = ds.map(lambda x: x*2).map(lambda x: x+1).map(lambda x: x**3)
for value in ds:
    print("value:",value)
# an example of “Method Chaining”
- 33. TF map() operator: ex #2
value: tf.Tensor([27], shape=(1,), dtype=int64)
value: tf.Tensor([125], shape=(1,), dtype=int64)
value: tf.Tensor([343], shape=(1,), dtype=int64)
value: tf.Tensor([729], shape=(1,), dtype=int64)
- 34. TF take() operator: ex #1
import tensorflow as tf # tf2-take.py
import numpy as np
ds = tf.data.Dataset.from_tensor_slices(tf.range(8))
ds = ds.take(5)
for value in ds.take(20):
    print("value:",value)
- 35. TF take() operator: ex #1
value: tf.Tensor(0, shape=(), dtype=int32)
value: tf.Tensor(1, shape=(), dtype=int32)
value: tf.Tensor(2, shape=(), dtype=int32)
value: tf.Tensor(3, shape=(), dtype=int32)
value: tf.Tensor(4, shape=(), dtype=int32)
- 36. TF take() operator: ex #2
import tensorflow as tf # tf2_take.py
import numpy as np
x = np.array([[1],[2],[3],[4]])
# make a ds from a numpy array
ds = tf.data.Dataset.from_tensor_slices(x)
# parentheses allow the method chain to span multiple lines
ds = (ds.map(lambda x: x*2)
        .map(lambda x: x+1).map(lambda x: x**3))
for value in ds.take(4):
    print("value:",value)
- 37. TF 2 map() and take(): output
value: tf.Tensor([27], shape=(1,), dtype=int64)
value: tf.Tensor([125], shape=(1,), dtype=int64)
value: tf.Tensor([343], shape=(1,), dtype=int64)
value: tf.Tensor([729], shape=(1,), dtype=int64)
- 38. TF zip() operator: ex #1
import tensorflow as tf # tf2_zip1.py
import numpy as np
dx = tf.data.Dataset.from_tensor_slices([0,1,2,3,4])
dy = tf.data.Dataset.from_tensor_slices([1,1,2,3,5])
# zip the two datasets together
d2 = tf.data.Dataset.zip((dx, dy))
for value in d2:
    print("value:",value)
- 39. TF zip() operator: ex #1
value:
(<tf.Tensor: id=11, shape=(), dtype=int32, numpy=0>,
<tf.Tensor: id=12, shape=(), dtype=int32, numpy=1>)
value:
(<tf.Tensor: id=13, shape=(), dtype=int32, numpy=1>,
<tf.Tensor: id=14, shape=(), dtype=int32, numpy=1>)
value:
(<tf.Tensor: id=15, shape=(), dtype=int32, numpy=2>,
<tf.Tensor: id=16, shape=(), dtype=int32, numpy=2>)
=> Plus two more rows of output
- 40. TF zip() operator: ex #2
import tensorflow as tf # tf2_zip_take.py
import numpy as np
x = np.arange(0, 10)
y = np.arange(1, 11)
dx = tf.data.Dataset.from_tensor_slices(x)
dy = tf.data.Dataset.from_tensor_slices(y)
# zip the two datasets together
d2 = tf.data.Dataset.zip((dx, dy)).batch(3)
for value in d2.take(8):
    print("value:",value)
- 41. TF zip() operator: ex #2
value: (<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>,
 <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 2, 3])>)
value: (<tf.Tensor: shape=(3,), dtype=int64, numpy=array([3, 4, 5])>,
 <tf.Tensor: shape=(3,), dtype=int64, numpy=array([4, 5, 6])>)
value: (<tf.Tensor: shape=(3,), dtype=int64, numpy=array([6, 7, 8])>,
 <tf.Tensor: shape=(3,), dtype=int64, numpy=array([7, 8, 9])>)
value: (<tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([10])>)
- 42. TF zip() operator: ex #3
import tensorflow as tf # tf2_zip_batch.py
import numpy as np
ds1 = tf.data.Dataset.range(100)
ds2 = tf.data.Dataset.range(0, -100, -1)
ds3 = tf.data.Dataset.zip((ds1, ds2))
ds4 = ds3.batch(4)
for value in ds4.take(4):
    print("value:",value)
- 43. TF zip() operator: ex #3
value: (<tf.Tensor: id=21, shape=(4,), dtype=int64, numpy=array([0, 1, 2, 3])>,
 <tf.Tensor: id=22, shape=(4,), dtype=int64, numpy=array([ 0, -1, -2, -3])>)
value: (<tf.Tensor: id=25, shape=(4,), dtype=int64, numpy=array([4, 5, 6, 7])>,
 <tf.Tensor: id=26, shape=(4,), dtype=int64, numpy=array([-4, -5, -6, -7])>)
value: (<tf.Tensor: id=29, shape=(4,), dtype=int64, numpy=array([ 8,  9, 10, 11])>,
 <tf.Tensor: id=30, shape=(4,), dtype=int64, numpy=array([ -8,  -9, -10, -11])>)
value: (<tf.Tensor: id=33, shape=(4,), dtype=int64, numpy=array([12, 13, 14, 15])>,
 <tf.Tensor: id=34, shape=(4,), dtype=int64, numpy=array([-12, -13, -14, -15])>)
- 44. TF 2 generators
Python functions
Containing your custom code
Specified in the dataset definition
Invoked when data is requested
- 45. Generator Functions (1)
import tensorflow as tf # tf2_generator1.py
import numpy as np
x = np.arange(0, 7)
def gener():
    i = 0
    while i < len(x):
        yield i
        i += 1
- 46. Generator Functions (1)
ds = tf.data.Dataset.from_generator(gener, (tf.int64))
# take() stops early when the generator is exhausted
size = 2*len(x)
for value in ds.take(size):
    print("value:",value)
# value: tf.Tensor(0, shape=(), dtype=int64)
# value: tf.Tensor(1, shape=(), dtype=int64)
# value: tf.Tensor(2, shape=(), dtype=int64)
# value: tf.Tensor(3, shape=(), dtype=int64)
# value: tf.Tensor(4, shape=(), dtype=int64)
# value: tf.Tensor(5, shape=(), dtype=int64)
# value: tf.Tensor(6, shape=(), dtype=int64)
- 47. Generator Functions (2)
import tensorflow as tf # tf2-timesthree.py
import numpy as np
x = np.arange(0, 5) # 0, 1, 2, 3, 4
def gener():
    for i in x:
        yield (3*i)
ds = tf.data.Dataset.from_generator(gener, (tf.int64))
for value in ds.take(len(x)):
    print("1value:",value)
# the generator yields only len(x) values, so this loop prints the same five
for value in ds.take(2*len(x)):
    print("2value:",value)
- 48. Generator Functions (2)
1value: tf.Tensor(0, shape=(), dtype=int64)
1value: tf.Tensor(3, shape=(), dtype=int64)
1value: tf.Tensor(6, shape=(), dtype=int64)
1value: tf.Tensor(9, shape=(), dtype=int64)
1value: tf.Tensor(12, shape=(), dtype=int64)
2value: tf.Tensor(0, shape=(), dtype=int64)
2value: tf.Tensor(3, shape=(), dtype=int64)
2value: tf.Tensor(6, shape=(), dtype=int64)
2value: tf.Tensor(9, shape=(), dtype=int64)
2value: tf.Tensor(12, shape=(), dtype=int64)
- 49. Generator Functions (3)
import tensorflow as tf # tf2_generator3.py
import numpy as np
x = np.arange(0, 7)
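def gener():
    i = 0
    # len(x/3) equals len(x), so the simpler form is used here
    while i < len(x):
        yield (i, i+1, i+2)  # three integers at a time
        i += 3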
- 50. Generator Functions (3)
ds = tf.data.Dataset.from_generator(
gener,
(tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)
- 51. Generator Functions (3)
#value:
#(<tf.Tensor: id=35, shape=(),dtype=int64,numpy=0>,
# <tf.Tensor: id=36, shape=(),dtype=int64,numpy=1>,
# <tf.Tensor: id=37, shape=(),dtype=int64,numpy=2>)
#value:
#(<tf.Tensor: id=41, shape=(),dtype=int64,numpy=3>,
# <tf.Tensor: id=42, shape=(),dtype=int64,numpy=4>,
# <tf.Tensor: id=43, shape=(),dtype=int64,numpy=5>)
- 52. Processing Text Files (1)
define a TF Dataset with lines in file.txt
skip lines that start with a “#” character
then display only the first two lines
- 53. Contents of file.txt
#this is file line #1
#this is file line #2
this is file line #3
#this is file line #4
this is file line #5
#this is file line #6
- 54. Processing Text Files (2)
import tensorflow as tf # tf2_flatmap_filter.py
filenames = ["file.txt"]
ds = tf.data.Dataset.from_tensor_slices(filenames)
- 55. Processing Text Files (3)
ds = ds.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)  # skip the first line of the file
        .filter(lambda line:
            tf.not_equal(tf.strings.substr(line,0,1),"#"))))
for value in ds.take(2):
    print("value:",value)
- 56. Text Output (first two lines)
value: tf.Tensor(b'this is file line #3', shape=(), dtype=string)
value: tf.Tensor(b'this is file line #5', shape=(), dtype=string)
- 57. Tokenizers and tf.text
import tensorflow as tf # NB: requires TF 2
import tensorflow_text as text
# pip3 install -q tensorflow-text
docs = tf.data.Dataset.from_tensor_slices(
[['Chicago Pizza'],
["how are you"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(
    lambda x: tokenizer.tokenize(x))
- 58. Tokenizers and tf.text
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
# [[b'Chicago', b'Pizza']]
# [[b'how', b'are', b'you']]
- 59. Tf.data and MNIST
import tensorflow as tf # tf2_mnist.py
train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train
mnist_ds=tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
for value in mnist_ds.take(2):
    print("value:",value)
- 60. Tf.data and MNIST
value: tf.Tensor(
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 3 18
18 18 126 136
175 26 166 255 247 127 0 0 0 0]
[ 0 0 0 0 0 0 0 0 30 36 94 154 170 253
253 253 253 253
225 172 253 242 195 64 0 0 0 0]
[ 0 0 0 0 0 0 0 49 238 253 253 253 253 253
253 253 253 251
93 82 82 56 39 0 0 0 0 0]
- 61. TF2 generator Example
import tensorflow as tf # tf2_generator2.py
import numpy as np
x = np.arange(0, 12)
def gener():
    i = 0
    # len(x/3) equals len(x), so the simpler form is used here
    while i < len(x):
        yield (i, i+1, i+2) # three integers at a time
        i += 3
ds = tf.data.Dataset.from_generator(gener, (tf.int64,tf.int64,tf.int64))
third = int(len(x)/3)
for value in ds.take(third):
    print("value:",value)
- 62. TF2 generator Example
value:
(<tf.Tensor: id=35, shape=(), dtype=int64,
numpy=0>,
<tf.Tensor: id=36, shape=(), dtype=int64,
numpy=1>,
<tf.Tensor: id=37, shape=(), dtype=int64,
numpy=2>)
value:
(<tf.Tensor: id=38, shape=(), dtype=int64,
numpy=3>,
<tf.Tensor: id=39, shape=(), dtype=int64,
numpy=4>,
- 63. TF 2 tf.data.TFRecordDataset
import tensorflow as tf
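# tfrecord_files and preprocess are placeholders for your
# TFRecord file paths and your preprocessing function
ds = tf.data.TFRecordDataset(tfrecord_files)
ds = ds.map(preprocess)
ds = ds.batch(batch_size=32)
# OR, with method chaining:
ds = (tf.data.TFRecordDataset(tfrecord_files)
      .map(preprocess)
      .batch(batch_size=32))
model = ...  # [Keras] your model goes here
model.fit(ds, epochs=20)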
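The outline above leaves the records and the preprocessing step abstract. The following is one possible concrete sketch, assuming a tiny TFRecord file written to a temporary directory with a single int64 feature named "x"; the file name, feature name, and `preprocess` body are all hypothetical, and the Keras model is omitted.

```python
import os
import tempfile
import tensorflow as tf

# write ten small records to a temporary TFRecord file
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(10):
        feature = {"x": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[i]))}
        example = tf.train.Example(
            features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

# parse one serialized record and double its value
def preprocess(record):
    spec = {"x": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["x"] * 2

ds = (tf.data.TFRecordDataset([path])
      .map(preprocess)
      .batch(4))

record_batches = [batch.numpy().tolist() for batch in ds]
print(record_batches)
```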
- 64. Use prefetch() for Performance
import tensorflow as tf
# tfrecord_files, preprocess and X are placeholders
ds = (tf.data.TFRecordDataset(tfrecord_files)
      .map(preprocess)
      .batch(batch_size=32)
      .prefetch(buffer_size=X))
model = ...  # [Keras] your model goes here
model.fit(ds, epochs=20)
- 65. Parallelize Data Transformations
import tensorflow as tf
# tfrecord_files, preprocess, X and Y are placeholders
ds = (tf.data.TFRecordDataset(tfrecord_files)
      .map(preprocess, num_parallel_calls=Y)
      .batch(batch_size=32)
      .prefetch(buffer_size=X))
model = ...  # [Keras] your model goes here
model.fit(ds, epochs=20)
=> uses background threads & internal buffer
- 66. Parallelize Data “Readers”
import tensorflow as tf
# tfrecord_files, preprocess, X, Y and Z are placeholders
ds = (tf.data.TFRecordDataset(tfrecord_files,
                              num_parallel_reads=Z)
      .map(preprocess, num_parallel_calls=Y)
      .batch(batch_size=32)
      .prefetch(buffer_size=X))
model = ...  # [Keras] your model goes here
model.fit(ds, epochs=20)
=> for sharded data
- 67. Parallelize Data “Readers”
Selecting optimal values:
tf.data.experimental.AUTOTUNE
Uses Reinforcement Learning
To tune values during data input
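As a sketch of how the autotuning constant can replace hand-picked values, the example below passes it to both map() and prefetch(). It assumes a small in-memory range dataset rather than TFRecord files; the names `AUTOTUNE` and `sizes` are hypothetical.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

ds = (tf.data.Dataset.range(100)
      # let tf.data choose the level of parallelism
      .map(lambda x: x * x, num_parallel_calls=AUTOTUNE)
      .batch(32)
      # let tf.data choose the prefetch buffer size
      .prefetch(AUTOTUNE))

sizes = [int(batch.shape[0]) for batch in ds]
print(sizes)
```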
- 68. tf.data.Options
Setting global options:
Deterministic/non-deterministic
Statistics
Optimizations (ex: autotuning)
Threading (ex: private thread pool)
- 69. tf.data.Options
import tensorflow as tf
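# tfrecord_files, preprocess and X are placeholders
ds = (tf.data.TFRecordDataset(tfrecord_files)
      .map(preprocess)
      .batch(batch_size=32)
      .prefetch(buffer_size=X))
op = tf.data.Options()
# map_parallelization is one of the available optimization flags
op.experimental_optimization.map_parallelization=True
ds = ds.with_options(op)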
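A self-contained sketch of attaching options to a dataset is shown below, assuming `Dataset.with_options()` and the `map_parallelization` flag as the option to enable. The range dataset and the names `options` and `values` are hypothetical.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(6).map(lambda x: x + 1).batch(2)

# create an Options object and enable one optimization flag
options = tf.data.Options()
options.experimental_optimization.map_parallelization = True

# attach the options; the elements produced are unchanged
ds = ds.with_options(options)

values = [batch.numpy().tolist() for batch in ds]
print(values)
```

Options affect how the pipeline runs, not what it produces, so the output is the same with or without them.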
- 70. TF 2: Built-in Datasets
tf.keras.datasets.boston_housing
tf.keras.datasets.cifar10
tf.keras.datasets.cifar100
tf.keras.datasets.fashion_mnist
tf.keras.datasets.imdb
tf.keras.datasets.mnist
tf.keras.datasets.reuters
- 71. About Me: Recent Books
1) Python3 and Machine Learning (2020)
2) Angular 9 and Deep Learning (2020)
3) Angular 8 & Machine Learning (2020)
4) AI/ML/DL: Concepts and Code (2020)
5) Bash Programming on Mac (2020)
6) TensorFlow 2 Pocket Primer (2019)
7) TensorFlow 1.x Pocket Primer (2019)
8) Python for TensorFlow (2019)
9) C Programming Pocket Primer (2019)
- 72. About Me: Less Recent Books
10) RegEx Pocket Primer (2018)
11) Data Cleaning Pocket Primer (2018)
12) Angular Pocket Primer (2017)
13) Android Pocket Primer (2017)
14) CSS3 Pocket Primer (2016)
15) SVG Pocket Primer (2016)
16) Python Pocket Primer (2015)
17) D3 Pocket Primer (2015)
18) HTML5 Mobile Pocket Primer (2014)
- 73. About Me: Older Books
19) jQuery, CSS3, and HTML5 (2013)
20) HTML5 Pocket Primer (2013)
21) jQuery Pocket Primer (2013)
22) HTML5 Canvas (2012)
23) Flash on Android (2011)
24) Web 2.0 Fundamentals (2010)
25) MS Silverlight Graphics (2008)
26) Fundamentals of SVG (2003)
27) Java Graphics Library (2002)
- 74. ML/DL/NLP/DRL Classes
ML/DL/NLP/RL Instructor at UCSC (Santa Clara):
Deep Learning (TF2/Keras) (01/30/2020) 10 weeks
NLP (Transformer/BERT/etc) (02/21/2020) 10 weeks
ML (ML, NLP, RL) (05/01/2020) 10 weeks
DRL (PPO/A3C/SAC/etc) (05/07/2020) 10 weeks
ML (ML, NLP, RL) (06/16/2020) 10 weeks
NLP (Transformer/BERT/etc) (06/29/2020) 10 weeks
UCSC link:
https://www.ucsc-extension.edu/certificate-program/offering/deep-learning-and-artificial-intelligence-tensorflow