# AvroTensorDataset

AvroTensorDataset is a `tf.data.Dataset` implementation that consumes records from one or more Avro files. The supported schemas are discussed in more detail in a later section.

AvroTensorDataset loads Avro records from files into a dict of tensors. The output dict has feature names as keys and `tf.Tensor` or `tf.SparseTensor` objects as values. The output tensor values are batched with the user-defined batch size.

## Python API

A minimal example is given below:

```python
>>> import os
>>> import tempfile
>>> import avro.schema
>>> import numpy as np
>>> import tensorflow as tf
>>> from avro.datafile import DataFileWriter
>>> from avro.io import DatumWriter
>>> from tensorflow_io.python.experimental.atds.dataset import ATDSDataset
>>> from tensorflow_io.python.experimental.atds.features import DenseFeature
>>> example_path = os.path.join(tempfile.gettempdir(), "example.avro")
>>> np.random.seed(0)

>>> # Define the Avro schema.
>>> json_schema = '''{
...     "type": "record",
...     "name": "example",
...     "fields": [
...         { "name": "x", "type": "float" },
...         { "name": "y", "type": "float" }
...     ]
... }'''
>>> schema = avro.schema.Parse(json_schema)

>>> # Write the Avro records to a file.
>>> with open(example_path, "wb") as f:
...     writer = DataFileWriter(f, DatumWriter(), schema)
...     for _ in range(3):
...         x, y = np.random.random(), np.random.random()
...         writer.append({"x": x, "y": y})
...     writer.close()

>>> # Read the data back out.
>>> feature_config = {
...     "x": DenseFeature([], dtype=tf.float32),
...     "y": DenseFeature([], dtype=tf.float32)
... }
>>> for batch in ATDSDataset([example_path], batch_size=2,
...                          features=feature_config):
...     print("x = {x}, y = {y}".format(**batch))
x = [0.5488135 0.60276335], y = [0.71518934 0.5448832 ]
x = [0.4236548], y = [0.6458941]
```

The constructor supports the following arguments:

| Argument | Type | Comment |
|---|---|---|
| filenames | tf.string or tf.data.Dataset | A tf.string tensor containing one or more filenames. |
| batch_size | tf.int64 | A tf.int64 scalar representing the number of records to read and parse per iteration. |
| features | Dict[str, Union[<br> DenseFeature, <br> SparseFeature, <br> VarlenFeature]] | A feature configuration dict with feature name as key and feature spec as value. DenseFeature, SparseFeature, and VarlenFeature specs are supported. All of them are named tuples with shape and dtype information. |
| drop_remainder | tf.bool | (Optional) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped when it has fewer than batch_size elements. The default behavior is not to drop the smaller batch. |
| reader_buffer_size | tf.int64 | (Optional) A tf.int64 scalar representing the number of bytes used for file content buffering. Default is 128 * 1024 (128KB). |
| shuffle_buffer_size | tf.int64 | (Optional) A tf.int64 scalar representing the number of records to shuffle together before batching. Default is zero. A zero shuffle buffer size means shuffling is disabled. |
| num_parallel_calls | tf.int64 | (Optional) A tf.int64 scalar representing the maximum number of threads used by the dataset. If greater than one, records in files are processed in parallel. The number is truncated if it exceeds the maximum available parallelism on the host. If tf.data.AUTOTUNE is used, the number of parallel calls is set dynamically based on available CPU and workload. Default is 1. |

At a minimum, the constructor requires the list of files to read, the batch size, and a dict of feature specs. Prefetching is enabled by default, and its behavior can be tuned via reader_buffer_size. Parsing happens automatically within the ATDSDataset operation. Shuffling is supported by configuring shuffle_buffer_size, as in the sketch below.
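
For instance, a call exercising the optional arguments might look like the following sketch (the file names are hypothetical, and feature_config is the dict from the minimal example above):

```python
dataset = ATDSDataset(
    filenames=["part-00000.avro", "part-00001.avro"],  # hypothetical paths
    batch_size=128,
    features=feature_config,
    drop_remainder=True,            # drop the final partial batch
    reader_buffer_size=256 * 1024,  # 256KB of file content buffering
    shuffle_buffer_size=1024,       # shuffle records before batching
)
```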

## Supported Avro Schemas

Although Avro supports many complex types (unions, maps, etc.), AvroTensorDataset only supports records of primitives and nested arrays. These types cover most TensorFlow use cases, and supporting only this subset of complex types yields a big performance boost (more on that later).

AvroTensorDataset supports dense features, sparse features, and varlen features, built from the TensorFlow primitive types that Avro can represent. They are represented in Avro via the following:

### Primitive Types

All Avro primitive types are supported, and map to the following TensorFlow dtypes:

| Avro data type | TF data type |
|---|---|
| int | tf.int32 |
| long | tf.int64 |
| float | tf.float32 |
| double | tf.float64 |
| string | tf.string |
| boolean | tf.bool |
| bytes | tf.string |

### Dense Features

Dense features are represented as nested arrays in Avro. For example, a doubly nested array represents a dense feature with rank 2. Some examples of Avro schemas representing dense features:
```json
"fields": [
    {
        "name": "scalar_double_feature",
        "type": "double"
    },
    {
        "name": "1d_double_feature",
        "type": {
            "type": "array",
            "items": "double"
        }
    },
    {
        "name": "2d_float_feature",
        "type": {
            "type": "array",
            "items": {
                "type": "array",
                "items": "float"
            }
        }
    },
    {
        "name": "3d_int_feature",
        "type": {
            "type": "array",
            "items": {
                "type": "array",
                "items": {
                    "type": "array",
                    "items": "int"
                }
            }
        }
    }
]
```

Dense features are parsed into dense tensors. For the above, the features argument to ATDSDataset might be:
```python
{
    "scalar_double_feature": DenseFeature(shape=[], dtype=tf.float64),
    "1d_double_feature": DenseFeature(shape=[128], dtype=tf.float64),
    "2d_float_feature": DenseFeature(shape=[16, 100], dtype=tf.float32),
    "3d_int_feature": DenseFeature(shape=[8, 10, 20], dtype=tf.int32),
}
```
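
As a rough sketch of the batching behavior described earlier (an illustration, not verified output), with batch size B the output dict for this config would contain:

```python
# Hypothetical output shapes for the feature config above, with batch size B:
#   batch["scalar_double_feature"]  # tf.Tensor, shape [B],            tf.float64
#   batch["1d_double_feature"]      # tf.Tensor, shape [B, 128],       tf.float64
#   batch["2d_float_feature"]       # tf.Tensor, shape [B, 16, 100],   tf.float32
#   batch["3d_int_feature"]         # tf.Tensor, shape [B, 8, 10, 20], tf.int32
```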

### Sparse Features

Sparse features are represented as a flat list of arrays in Avro. For a sparse feature with rank N, the Avro schema contains N+1 arrays: arrays named "indices0", "indices1", ..., "indices(N-1)", and an array named "values". All N+1 arrays must have the same length. For example, this is the schema for a sparse feature with dtype float and rank 2:
```json
"fields": [
    {
        "name": "2d_float_sparse_feature",
        "type": {
            "type": "record",
            "name": "2d_float_sparse_feature",
            "fields": [
                {
                    "name": "indices0",
                    "type": {
                        "type": "array",
                        "items": "long"
                    }
                },
                {
                    "name": "indices1",
                    "type": {
                        "type": "array",
                        "items": "long"
                    }
                },
                {
                    "name": "values",
                    "type": {
                        "type": "array",
                        "items": "float"
                    }
                }
            ]
        }
    }
]
```

Sparse features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:
```python
{
    "2d_float_sparse_feature": SparseFeature(shape=[16, 10], dtype=tf.float32),
}
```

The i-th indices array holds the indices for dimension i, i.e. the Avro representation of a sparse tensor is in coordinate format. For example, the sparse tensor tf.sparse.SparseTensor(indices=[[0, 1], [2, 4], [6, 5]], values=[1.0, 2.0, 3.0], dense_shape=[8, 10]) would be represented in Avro via the following:
```json
{
    "indices0": [0, 2, 6],
    "indices1": [1, 4, 5],
    "values": [1.0, 2.0, 3.0]
}
```
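
Such a record can be written with the same avro writer setup as in the Python API example; a minimal sketch, assuming `writer` is a DataFileWriter whose schema contains the 2d_float_sparse_feature field above:

```python
# Nested Avro records are appended as nested Python dicts; the three
# parallel arrays together encode the sparse tensor in coordinate format.
writer.append({
    "2d_float_sparse_feature": {
        "indices0": [0, 2, 6],
        "indices1": [1, 4, 5],
        "values": [1.0, 2.0, 3.0],
    }
})
```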

### Varlen Features

Varlen features are similar to dense features in that they are also represented as nested arrays in Avro, but they can have dimensions of unknown length (indicated by -1). Some examples of Avro schemas representing varlen features:
```json
"fields": [
    {
        "name": "1d_bool_varlen_feature",
        "type": {
            "type": "array",
            "items": "boolean"
        }
    },
    {
        "name": "2d_long_varlen_feature",
        "type": {
            "type": "array",
            "items": {
                "type": "array",
                "items": "long"
            }
        }
    },
    {
        "name": "3d_int_varlen_feature",
        "type": {
            "type": "array",
            "items": {
                "type": "array",
                "items": {
                    "type": "array",
                    "items": "int"
                }
            }
        }
    }
]
```

Dimensions with length -1 can be variable length; hence, varlen features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:
```python
{
    "1d_bool_varlen_feature": VarlenFeature(shape=[-1], dtype=tf.bool),
    "2d_long_varlen_feature": VarlenFeature(shape=[2, -1], dtype=tf.int64),
    "3d_int_varlen_feature": VarlenFeature(shape=[-1, 10, -1], dtype=tf.int32),
}
```

Here, 2d_long_varlen_feature has variable length in dimension 1; for example, an object with values [[1, 2, 3], [4, 5]] would be parsed as tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]], values=[1, 2, 3, 4, 5], dense_shape=[2, 3]), as sketched below.
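
Since varlen features come back as sparse tensors, a common follow-up is to pad them into a dense tensor; a minimal sketch for the example above:

```python
import tensorflow as tf

# The parsed result from the example above.
st = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
    values=[1, 2, 3, 4, 5],
    dense_shape=[2, 3],
)
# Padding the shorter row with the default value 0 yields
# [[1, 2, 3],
#  [4, 5, 0]].
dense = tf.sparse.to_dense(st)
```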

## Input files

### Single or multiple input files

AvroTensorDataset can process a single input file or multiple input files together. If a file path has no scheme, e.g. `hdfs://`, it is treated as a local file. For example:

```python
# Single local input file
input_file = "path/to/file"
dataset = ATDSDataset(input_file, batch_size=2, features=feature_config)

# Single input file on HDFS
input_file = "hdfs://default/path/to/file"
dataset = ATDSDataset(input_file, batch_size=2, features=feature_config)

# Multiple input files
input_files = ["path/to/file1", "path/to/file2", ...]
dataset = ATDSDataset(input_files, batch_size=2, features=feature_config)
```

### Inputs with file pattern

To read inputs with a specific file pattern, users can use `tf.data.Dataset.list_files` to expand the file glob, and process the matched files with AvroTensorDataset. For example:

```python
filenames = tf.data.Dataset.list_files(file_pattern="path/to/files/*.avro")

# Batch all file names so that we can process them together.
file_num = filenames.cardinality()
dataset = filenames.batch(file_num)

dataset = dataset.interleave(
    lambda filename: ATDSDataset(filenames=filename, ...),
    cycle_length=1
)
```

Moreover, users can batch a fixed number of files and leverage `tf.data.Dataset.interleave` to process them in parallel. For example:

```python
filenames = tf.data.Dataset.list_files(file_pattern="path/to/files/*.avro")

# Batch 10 files together so that they can be processed together.
dataset = filenames.batch(10)

# Launch 4 interleave threads; each thread processes 10 files with AvroTensorDataset.
dataset = dataset.interleave(
    lambda filename: ATDSDataset(filenames=filename, ...),
    cycle_length=4,
    num_parallel_calls=4
)
```

## Batch

The output tensor values are always batched with the user-defined batch size. If the remaining records cannot fill the last batch, whatever remains is batched at a smaller size. Users can drop this last, smaller batch by setting drop_remainder to true.

```python
# Drop the last small batch so that every batch has the same batch size.
dataset = ATDSDataset(filenames, batch_size=128, features=feature_config, drop_remainder=True)
```

## Shuffle
297+
298+
Shuffle can be enabled before batching by configuring shuffle buffer size. The shuffle buffer size dictates the
299+
elements *in addition* to the batch size that would be read and sampled. For example,
300+
301+
```python
# Shuffle records in a buffer of size 1024 before batching.
dataset = ATDSDataset(filenames, batch_size=128, features=feature_config, shuffle_buffer_size=1024)
```

Shuffle is disabled by default (shuffle_buffer_size equal to 0).

Internally, AvroTensorDataset keeps collecting Avro blocks (a block is a sequence of Avro records) until the total number of unread records exceeds shuffle_buffer_size + batch_size, and then randomly samples a block from the collected blocks. An Avro record from the sampled block is parsed and batched into the output tensors.

For instance, assume your dataset contains 5 blocks with 100 records each. When the batch size is 32 and the shuffle buffer size is 128, the dataset collects two blocks, since two blocks contain more than 128 + 32 = 160 unread records, and then samples a block from these two 32 times. Each time a block is sampled, one record from it is read and batched into the output tensor dict, until all records in that block have been read. If only one block fits within batch_size + shuffle_buffer_size, records in that block are read sequentially without shuffling. Users can increase the shuffle buffer size, or apply dataset unbatch, shuffle, and batch for better shuffling, as sketched below.
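
A minimal sketch of that record-level reshuffle (the buffer sizes here are illustrative):

```python
# Undo the ATDSDataset batching, shuffle individual records, then re-batch.
# This shuffles across block boundaries, at the cost of extra processing.
dataset = ATDSDataset(filenames, batch_size=128, features=feature_config)
dataset = dataset.unbatch()
dataset = dataset.shuffle(buffer_size=4096)
dataset = dataset.batch(128)
```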

## Parallel computing

Batching, shuffling, and record parsing can be done in parallel by configuring num_parallel_calls in AvroTensorDataset. num_parallel_calls controls the number of threads used to process the input files. For example, to do batching, shuffling, and parsing in parallel with four threads, configure AvroTensorDataset like this:

```python
dataset = ATDSDataset(
    filenames,
    batch_size=128,
    features=feature_config,
    shuffle_buffer_size=1024,
    num_parallel_calls=4  # Process data in parallel with 4 threads within the op.
)
```

This is different from the parallelism we have seen with `interleave`: num_parallel_calls here controls the number of threads used within one AvroTensorDataset. Hence, if one uses 4 interleave nodes and each interleave node runs an AvroTensorDataset with 2 internal threads, the total number of launched threads is 4 * 2 = 8. For example:

```python
filenames = tf.data.Dataset.list_files(file_pattern="path/to/files/*.avro")

# Batch 10 files together so that they can be processed together.
dataset = filenames.batch(10)

# Launch 4 interleave threads; each thread processes 10 files with AvroTensorDataset.
dataset = dataset.interleave(
    lambda filename: ATDSDataset(
        filenames=filename,
        batch_size=128,
        features=feature_config,
        shuffle_buffer_size=1024,
        num_parallel_calls=2),  # 2 threads for each interleave node.
    cycle_length=4,
    num_parallel_calls=4
)
```

By default, AvroTensorDataset uses all available CPU cores on the host as its num_parallel_calls value.
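
Alternatively, as noted in the argument table, tf.data.AUTOTUNE lets the runtime set the parallelism dynamically; a minimal sketch:

```python
import tensorflow as tf

# Let tf.data pick the number of parallel calls based on available CPU
# and workload (see the num_parallel_calls row in the argument table).
dataset = ATDSDataset(
    filenames,
    batch_size=128,
    features=feature_config,
    num_parallel_calls=tf.data.AUTOTUNE,
)
```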