Commit ffade8e

docs: Updated notebook for streaming

1 parent 400c041 commit ffade8e

File tree

1 file changed: +97 −41 lines changed

redis-spark-notebook.ipynb

Lines changed: 97 additions & 41 deletions
@@ -29,7 +29,7 @@
 "1. Choose **Maven** as your source and click **Search Packages**\n",
 "1. Enter `redis-spark-connector` and select `com.redis:redis-spark-connector:x.y.z`\n",
 "1. Finalize by clicking **Install** <br/>\n",
-"Want to explore the connector's full capabilities? Check the [detailed documentation](https://redis-field-engineering.github.io/redis-spark)\n"
+"Want to explore the connector's full capabilities? Check the [detailed documentation](https://redis-field-engineering.github.io/redis-spark)"
 ]
 },
 {
@@ -57,9 +57,7 @@
 "## Configuring Spark with Redis Connection Details\n",
 "\n",
 "1. From your Redis Cloud database dashboard, find your connection endpoint under **Connect**. The string follows this pattern: `redis://<user>:<pass>@<host>:<port>`\n",
-"1. In Databricks, open your cluster settings and locate **Advanced Options**. Under **Spark** in the **Spark config** text area, add your Redis connection string as both `spark.redis.read.connection.uri redis://...` and `spark.redis.write.connection.uri redis://...` parameters. This configuration applies to all notebooks using this cluster. Note that it is recommended to use secrets to store sensitive Redis URIs. Refer to the [Redis Spark documentation](https://redis-field-engineering.github.io/redis-spark/#_databricks) for more details.\n",
-"\n",
-"\n"
+"1. In Databricks, open your cluster settings and locate **Advanced Options**. Under **Spark** in the **Spark config** text area, add your Redis connection string as both `spark.redis.read.connection.uri redis://...` and `spark.redis.write.connection.uri redis://...` parameters. This configuration applies to all notebooks using this cluster. Note that it is recommended to use secrets to store sensitive Redis URIs. Refer to the [Redis Spark documentation](https://redis-field-engineering.github.io/redis-spark/#_databricks) for more details."
 ]
 },
 {
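For reference, the secrets recommendation above can also be exercised from a notebook cell. A minimal sketch, assuming a hypothetical secret scope `redis` and key `redis-uri`; `dbutils.secrets.get` is the standard Databricks API, but whether the connector honors session-level conf in addition to the cluster-level Spark config is an assumption here:

```python
# Hedged sketch: read the Redis URI from a Databricks secret scope rather than
# pasting it into the cluster config. Scope and key names are hypothetical.
redis_uri = dbutils.secrets.get(scope="redis", key="redis-uri")

# Assumes the connector also picks these up from session conf; the documented
# approach is the cluster-level Spark config described above.
spark.conf.set("spark.redis.read.connection.uri", redis_uri)
spark.conf.set("spark.redis.write.connection.uri", redis_uri)
```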
@@ -77,7 +75,7 @@
 "source": [
 "## Reading from Redis\n",
 "\n",
-"To read data from Redis run the following code:\n"
+"To read data from Redis, use the following line."
 ]
 },
 {
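The read cell itself is untouched by this commit, so only its `display(df)` context appears below. For orientation, it presumably amounts to something like this minimal sketch; the exact shape is an assumption, and with no options given the connector would fall back to the connection URI configured above:

```python
# Minimal read sketch (assumed shape of the unchanged cell in this notebook).
df = spark.read.format("redis").load()
display(df)
```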
@@ -102,6 +100,45 @@
 "display(df)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {},
+"inputWidgets": {},
+"nuid": "63658b45-2684-440f-9d2d-b0d84bb4af8f",
+"showTitle": false,
+"tableResultSettingsMap": {},
+"title": ""
+}
+},
+"source": [
+"## Writing to Redis\n",
+"\n",
+"Let's use the `df` data we imported earlier and write it back to Redis as JSON. Refresh **Redis Insight** and notice the new JSON keys prefixed with `spark:nobel`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 0,
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
+"inputWidgets": {},
+"nuid": "37786ad3-b0ec-49a9-8839-75ff907e4cae",
+"showTitle": false,
+"tableResultSettingsMap": {},
+"title": ""
+}
+},
+"outputs": [],
+"source": [
+"df.write.format(\"redis\").option(\"type\", \"json\").option(\"keyspace\", \"spark:nobel\").option(\"key\", \"id\").mode(\"append\").save()"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {
@@ -117,7 +154,7 @@
 "source": [
 "## Reading from Redis in Streaming Mode\n",
 "\n",
-"The following code reads data from a Redis stream, appending data to a streaming in-memory dataframe."
+"The following code reads data from the Redis stream `nobels`, appending data to a streaming in-memory dataframe."
 ]
 },
 {
@@ -138,31 +175,31 @@
 },
 "outputs": [],
 "source": [
+"streamDf = spark.readStream.format(\"redis\").option(\"type\", \"stream\").option(\"streamKey\", \"nobels\").load()\n",
+"query = streamDf.writeStream.format(\"memory\").queryName(\"nobels\").outputMode(\"append\").trigger(processingTime=\"1 second\").start()\n",
+"\n",
 "import time\n",
-"streamDf = spark.readStream.format(\"redis\").option(\"type\", \"stream\").option(\"streamKey\", \"stream:nobels\").load()\n",
-"query = streamDf.writeStream.format(\"memory\").trigger(continuous=\"1 second\").queryName(\"nobels\").start()\n",
 "time.sleep(3)\n",
-"streamDs = spark.sql(\"select * from nobels\")\n",
-"display(streamDs)"
+"display(spark.sql(\"SELECT * FROM nobels\"))"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
-"nuid": "63658b45-2684-440f-9d2d-b0d84bb4af8f",
+"nuid": "db0205c2-3764-42c7-b90b-008afa288a89",
 "showTitle": false,
 "tableResultSettingsMap": {},
 "title": ""
 }
 },
 "source": [
-"## Writing to Redis\n",
-"\n",
-"1. Let's use the `df` data we imported earlier and write it back to Redis as JSON.\n",
-"1. Refresh **Redis Insight** and notice the new JSON keys prefixed with `spark:write`."
+"With the previously created streaming dataframe still running, we can add data to the stream and watch the dataframe receive the new data. In **Redis Insight**, change the \"All Key Types\" filter to show only keys of **Stream** type. Double-click the `nobels` stream and click `New Entry`. Add the following fields: `category`: `physics`, `id`: `123`, `share`: `1`, `year`: `2025`, plus values of your choice for `firstName`, `lastName`, and `motivation`. Hit `Save` and run the query again. You should now see your entry at the bottom of the table:"
 ]
 },
 {
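The new markdown cell above walks through adding a stream entry by hand in Redis Insight. The same step can be scripted, which makes the demo repeatable; a hedged redis-py sketch (the client choice and the name/motivation values are illustrative, not part of this commit):

```python
# Scripted equivalent of Redis Insight's "New Entry" on the `nobels` stream.
import redis

r = redis.from_url("redis://<user>:<pass>@<host>:<port>")  # placeholder URI
r.xadd("nobels", {
    "category": "physics",
    "id": "123",
    "share": "1",
    "year": "2025",
    "firstName": "Jane",          # illustrative values
    "lastName": "Doe",
    "motivation": "example entry",
})
```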
@@ -175,22 +212,25 @@
 "rowLimit": 10000
 },
 "inputWidgets": {},
-"nuid": "37786ad3-b0ec-49a9-8839-75ff907e4cae",
+"nuid": "551e6a88-a9f8-48ec-82b7-34f7f8326569",
 "showTitle": false,
 "tableResultSettingsMap": {},
 "title": ""
 }
 },
 "outputs": [],
 "source": [
-"df.write.format(\"redis\").option(\"type\", \"json\").option(\"keyspace\", \"spark:write:nobel\").option(\"key\", \"id\").mode(\"append\").save()"
+"display(spark.sql(\"SELECT * FROM nobels\"))"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "f6ba51ee-705b-4e92-89e6-8530978ba987",
 "showTitle": false,
@@ -202,8 +242,7 @@
 "## Writing to Redis in Streaming Mode\n",
 "\n",
 "We can also write to Redis in streaming mode.\n",
-"1. Replace `<catalog>`, `<schema>`, and `<volume>` with names for a Unity Catalog volume and run the code below.\n",
-"2. In **Redis Insight** refresh the database and notice the new hash keys prefixed with `spark:writeStream`\n"
+"1. Replace `<catalog>`, `<schema>`, and `<volume>` with names for a Unity Catalog volume."
 ]
 },
 {
@@ -222,44 +261,61 @@
 "title": ""
 }
 },
-"outputs": [
-{
-"data": {
-"text/plain": [
-"<pyspark.sql.streaming.query.StreamingQuery at 0x7fd64dc7a4e0>"
-]
-},
-"execution_count": 78,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
+"outputs": [],
 "source": [
 "catalog = \"<catalog>\"\n",
 "schema = \"<schema>\"\n",
-"volume = \"<volume>\"\n",
+"volume = \"<volume>\"\n",
 "\n",
 "path_volume = f\"/Volumes/{catalog}/{schema}/{volume}\"\n",
-"dbutils.fs.cp(\"http://storage.googleapis.com/jrx/nobels.jsonl\", f\"{path_volume}/nobels.json\")\n",
-"\n",
-"checkpoint_dir = f\"{path_volume}/checkpoint\"\n",
+"checkpoint_dir = f\"{path_volume}/mycp\"\n",
 "dbutils.fs.mkdirs(checkpoint_dir)\n",
 "\n",
-"spark.readStream.schema(\"id INT, year INT, category STRING, share INT, firstName STRING, lastName STRING, motivation STRING\").json(f\"{path_volume}/*.json\") \\\n",
-" .writeStream.format(\"redis\").outputMode(\"append\") \\\n",
+"streamDf.writeStream.format(\"redis\").outputMode(\"append\") \\\n",
 " .option(\"type\", \"hash\") \\\n",
-" .option(\"keyspace\", \"spark:writeStream:nobel\") \\\n",
+" .option(\"keyspace\", \"spark:nobel\") \\\n",
 " .option(\"key\", \"id\") \\\n",
 " .option(\"checkpointLocation\", checkpoint_dir) \\\n",
 " .start()\n"
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {},
+"inputWidgets": {},
+"nuid": "b494d5a2-3187-4f8a-8632-33fee60d6492",
+"showTitle": false,
+"tableResultSettingsMap": {},
+"title": ""
+}
+},
+"source": [
+"In **Redis Insight**, select keys with the pattern `spark:nobel:*`. You should see hashes corresponding to the entries in the `nobels` stream that we used previously. If you add other entries to the stream as we did in the *Reading from Redis in Streaming Mode* section, you will see them reflected in the `spark:nobel` keyspace."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {},
+"inputWidgets": {},
+"nuid": "49fa26c8-cd66-49f8-a5d4-b0c29098ffc6",
+"showTitle": false,
+"tableResultSettingsMap": {},
+"title": ""
+}
+},
+"source": []
 }
 ],
 "metadata": {
 "application/vnd.databricks.v1+notebook": {
 "computePreferences": null,
 "dashboards": [],
 "environmentMetadata": null,
+"inputWidgetPreferences": null,
 "language": "python",
 "notebookMetadata": {
 "pythonIndentUnit": 4
@@ -273,4 +329,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}
