
Commit cd90c60

Include labels when training an Avro dataset (#1571)
1 parent e8ed07d commit cd90c60

File tree

1 file changed (+33, -27 lines)


docs/tutorials/avro.ipynb

@@ -113,7 +113,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {
 "id": "m6KXZuTBWgRm"
 },
@@ -134,7 +134,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": null,
 "metadata": {
 "id": "dX74RKfZ_TdF"
 },
@@ -188,7 +188,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "IGnbXuVnSo8T"
+"id": "jJzE6lMwhY7l"
 },
 "source": [
 "Download the corresponding schema file of the sample Avro file:"
@@ -198,7 +198,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "Tu01THzWcE-J"
+"id": "Cpxa6yhLhY7l"
 },
 "outputs": [],
 "source": [
@@ -238,7 +238,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "upgCc3gXybsB"
+"id": "m7XR0agdhY7n"
 },
 "source": [
 "To read and print an Avro file in a human-readable format:\n"
@@ -276,7 +276,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "z9GCyPWNuOm7"
+"id": "qKgUPm6JhY7n"
 },
 "source": [
 "And the schema of `train.avro` which is represented by `train.avsc` is a JSON-formatted file.\n",
@@ -287,7 +287,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "nS3eTBvjt-O5"
+"id": "D-95aom1hY7o"
 },
 "outputs": [],
 "source": [
@@ -302,7 +302,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "4CfKVmCvwcL7"
+"id": "21szKFY1hY7o"
 },
 "source": [
 "### Prepare the dataset\n"
@@ -311,7 +311,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "z9GCyPWNuOm7"
+"id": "hNeBO9m-hY7o"
 },
 "source": [
 "Load `train.avro` as TensorFlow dataset with Avro dataset API: \n"
@@ -321,7 +321,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "nS3eTBvjt-O5"
+"id": "v-nbLZHKhY7o"
 },
 "outputs": [],
 "source": [
@@ -363,7 +363,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "nS3eTBvjt-O5"
+"id": "bc9vDHyghY7p"
 },
 "outputs": [],
 "source": [
@@ -382,7 +382,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "IF_kYz_o2DH4"
+"id": "x45KolnDhY7p"
 },
 "source": [
 "One can also increase num_parallel_reads to expediate Avro data processing by increasing avro parse/read parallelism.\n"
@@ -392,7 +392,7 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "nS3eTBvjt-O5"
+"id": "Z2x-gPj_hY7p"
 },
 "outputs": [],
 "source": [
@@ -412,7 +412,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "IF_kYz_o2DH4"
+"id": "6V-nwDJGhY7p"
 },
 "source": [
 "For detailed usage of `make_avro_record_dataset`, please refer to <a target=\"_blank\" href=\"https://www.tensorflow.org/io/api_docs/python/tfio/experimental/columnar/make_avro_record_dataset\">API doc</a>.\n"
@@ -421,7 +421,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "4CfKVmCvwcL7"
+"id": "vIOijGlAhY7p"
 },
 "source": [
 "### Train tf.keras models with Avro dataset\n",
@@ -432,7 +432,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "z9GCyPWNuOm7"
+"id": "s7K85D53hY7q"
 },
 "source": [
 "Load `train.avro` as TensorFlow dataset with Avro dataset API: \n"
@@ -442,14 +442,16 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"id": "nS3eTBvjt-O5"
+"id": "VFoeLwIOhY7q"
 },
 "outputs": [],
 "source": [
 "features = {\n",
-" 'features[*]': tfio.experimental.columnar.VarLenFeatureWithRank(dtype=tf.int32)\n",
+" 'features[*]': tfio.experimental.columnar.VarLenFeatureWithRank(dtype=tf.int32),\n",
+" 'label': tf.io.FixedLenFeature(shape=[], dtype=tf.int32, default_value=-100),\n",
 "}\n",
 "\n",
+"\n",
 "schema = tf.io.gfile.GFile('train.avsc').read()\n",
 "\n",
 "dataset = tfio.experimental.columnar.make_avro_record_dataset(file_pattern=['train.avro'],\n",
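The hunk above adds a `label` entry to the feature map with `default_value=-100`. As a plain-Python sketch (no TensorFlow required, and `parse_record` is a hypothetical stand-in for the Avro parser, not tfio API), the effect of that default is that records missing a label fall back to the sentinel:

```python
# Hypothetical stand-in for parsing one Avro record against a feature map
# where 'label' has a FixedLenFeature default of -100 (the sentinel the
# diff uses); records without a 'label' field get the default instead of
# failing to parse.
DEFAULT_LABEL = -100

def parse_record(record):
    return {
        'features[*]': record.get('features[*]', []),
        'label': record.get('label', DEFAULT_LABEL),
    }

labeled = parse_record({'features[*]': [1, 2, 3], 'label': 7})
unlabeled = parse_record({'features[*]': [4, 5]})
```

A record with a label keeps it; an unlabeled record is parsed with `label == -100`, which downstream code can filter or mask.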
@@ -463,17 +465,17 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "z9GCyPWNuOm7"
+"id": "hR2FnIIMhY7q"
 },
 "source": [
 "Define a simple keras model: \n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {
-"id": "m6KXZuTBWgRm"
+"id": "hGV5rHfJhY7q"
 },
 "outputs": [],
 "source": [
@@ -488,27 +490,31 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "4CfKVmCvwcL7"
+"id": "Tuv9n6HshY7q"
 },
 "source": [
 "### Train the keras model with Avro dataset:\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {
-"id": "m6KXZuTBWgRm"
+"id": "lb44cUuWhY7r"
 },
 "outputs": [],
 "source": [
-"model.fit(x=dataset, epochs=1, steps_per_epoch=1, verbose=1)\n"
+"def extract_label(feature):\n",
+" label = feature.pop('label')\n",
+" return tf.sparse.to_dense(feature['features[*]']), label\n",
+"\n",
+"model.fit(x=dataset.map(extract_label), epochs=1, steps_per_epoch=1, verbose=1)\n"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "IF_kYz_o2DH4"
+"id": "7K6qAv5rhY7r"
 },
 "source": [
 "The avro dataset can parse and coerce any avro data into TensorFlow tensors, including records in records, maps, arrays, branches, and enumerations. The parsing information is passed into the avro dataset implementation as a map where \n",
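The `extract_label` helper introduced above pops the label out of the parsed feature dict so that `model.fit` receives `(inputs, labels)` pairs from the dataset. A minimal plain-Python analogue of that mapping (no TensorFlow; `tf.sparse.to_dense` is omitted since plain lists are already dense) shows the split:

```python
# Plain-Python analogue of the extract_label mapping added in the diff:
# remove the label from the feature dict and return the (inputs, label)
# pair that Keras' model.fit expects from a dataset.
def extract_label(feature):
    feature = dict(feature)       # work on a copy; don't mutate the caller's dict
    label = feature.pop('label')  # the label must not remain among the inputs
    return feature['features[*]'], label

records = [
    {'features[*]': [1, 2, 3], 'label': 0},
    {'features[*]': [4, 5, 6], 'label': 1},
]
# what dataset.map(extract_label) would yield, element by element
pairs = [extract_label(r) for r in records]
```

Popping the label before returning is the design point: if it stayed in the feature dict, the model would train on its own target.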
@@ -541,7 +547,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
-"id": "IF_kYz_o2DH4"
+"id": "1PFQPuy5hY7r"
 },
 "source": [
 "A comprehensive set of examples of Avro dataset API is provided within <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/tests/test_parse_avro.py#L437\">the tests</a>.\n"
