diff --git a/docs/lakebridge/docs/transpile/pluggable_transpilers/switch/index.mdx b/docs/lakebridge/docs/transpile/pluggable_transpilers/switch/index.mdx
index d98af046a..849b88e77 100644
--- a/docs/lakebridge/docs/transpile/pluggable_transpilers/switch/index.mdx
+++ b/docs/lakebridge/docs/transpile/pluggable_transpilers/switch/index.mdx
@@ -18,6 +18,8 @@ This LLM-powered approach excels at converting complex SQL code and business log
 than syntactic transformation. While generated notebooks may require manual
 adjustments, they provide a valuable foundation for Databricks migration.
 
+Switch can also convert ETL workloads into Spark Declarative Pipelines, generating either Python or SQL output. Refer to the sections below for usage instructions.
+
 ---
 
 ## How Switch Works
@@ -36,6 +38,7 @@ Switch runs entirely within the Databricks workspace. You can find details about
 - **Jobs API**: Executes as scalable Databricks Jobs for batch processing
 - **Model Serving**: Direct integration with Databricks LLM endpoints, with concurrent processing for multiple files
 - **Delta Tables**: Tracks conversion progress and results
+- **Pipelines API**: Creates and executes Spark Declarative Pipelines to validate converted pipelines
 
 ### 3. Flexible Output Formats
 - **Notebooks**: Python notebooks containing Spark SQL (primary output)
@@ -98,6 +101,14 @@ Convert non-SQL files to notebooks or other formats.
 | `scala` | Scala Code → Databricks Python Notebook |
 | `airflow` | Airflow DAG → Databricks Jobs YAML + Operator conversion guidance (SQL→sql_task, Python→notebook, etc.) |
 
+### Built-in Prompts: ETL Sources
+
+Convert ETL workloads to Spark Declarative Pipelines (SDP) in Python or SQL.
+
+| Source Technology | Source → Target |
+|--------------|-----------------|
+| `pyspark` | PySpark ETL → Databricks Notebook for SDP (Python or SQL) |
+
 ### Custom Prompts: Any Source Format
 
 Switch's LLM-based architecture supports additional conversion types through custom YAML conversion prompts, making it extensible beyond built-in options.
@@ -186,7 +197,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y
 
 | Parameter | Description | Default Value | Available Options |
 |-----------|-------------|---------------|-------------------|
-| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats. See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file` |
+| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats, `sdp` for converting ETL workloads to Spark Declarative Pipelines (SDP). See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file`, `sdp` |
 | `source_format` | Source file format type. `sql` performs SQL comment removal and whitespace compression preprocessing before conversion. `generic` processes files as-is without preprocessing. Preprocessing affects token counting and conversion quality. See [analyze_input_files](#analyze_input_files) for preprocessing details. | `sql` | `sql`, `generic` |
 | `comment_lang` | Language for generated comments. | `English` | `English`, `Japanese`, `Chinese`, `French`, `German`, `Italian`, `Korean`, `Portuguese`, `Spanish` |
 | `log_level` | Logging verbosity level. | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
@@ -197,6 +208,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y
 | `output_extension` | File extension for output files when `target_type=file`. Required for non-notebook output formats like YAML workflows or JSON configurations. See [File Conversion Flow](#file-conversion-flow) for usage examples. | `null` | Any extension (e.g., `.yml`, `.json`) |
 | `sql_output_dir` | (Experimental) When specified, triggers additional conversion of Python notebooks to SQL notebook format. This optional post-processing step may lose some Python-specific logic. See [convert_notebook_to_sql](#convert_notebook_to_sql-optional) for details on the SQL conversion process. | `null` | Full workspace path |
 | `request_params` | Additional request parameters passed to the model serving endpoint. Use for advanced configurations like extended thinking mode or custom token limits. See [LLM Configuration](/docs/transpile/pluggable_transpilers/switch/customizing_switch#llm-configuration) for configuration examples including Claude's extended thinking mode. | `null` | JSON format string (e.g., `{"max_tokens": 64000}`) |
+| `sdp_language` | Language of the generated Spark Declarative Pipeline code when `target_type=sdp`. | `python` | `python`, `sql` |
 
 ---
 
@@ -316,7 +328,7 @@ flowchart TD
 
 ### Notebook Conversion Flow
 
-For `target_type=notebook`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline:
+For `target_type=notebook` or `target_type=sdp`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline:
 
 ```mermaid
 flowchart TD
@@ -325,8 +337,15 @@ flowchart TD
     subgraph processing ["Notebook Processing Workflow"]
         direction TB
         analyze[analyze_input_files] e2@==> convert[convert_with_llm]
-        convert e3@==> validate[validate_python_notebook]
-        validate e4@==> fix[fix_syntax_with_llm]
+
+        %% Branch: decide validation path
+        convert e8@==>|if SDP| validate_sdp[validate_sdp]
+        convert e3@==>|if NOT SDP| validate_nb[validate_python_notebook]
+
+        %% Downstream connections - both validations flow to fix_syntax
+        validate_nb e4@==> fix[fix_syntax_with_llm]
+        validate_sdp e9@==> fix
+
         fix e5@==> split[split_code_into_cells]
         split e6@==> export[export_to_notebook]
         export -.-> sqlExport["convert_notebook_to_sql<br/>(Optional)"]
(Optional)"] @@ -340,6 +359,13 @@ flowchart TD export e7@==> notebooks[Python Notebooks] sqlExport -.-> sqlNotebooks["SQL Notebooks
(Optional Output)"] + + %% SDP validation pipeline operations + validate_sdp -.-> export_notebook[export notebook] + export_notebook -.-> create_pipeline[create pipeline] + create_pipeline -.-> update_pipeline[update pipeline for validation] + update_pipeline -.-> delete_pipeline[delete pipeline] + e1@{ animate: true } e2@{ animate: true } e3@{ animate: true } @@ -347,6 +373,8 @@ flowchart TD e5@{ animate: true } e6@{ animate: true } e7@{ animate: true } + e8@{ animate: true } + e9@{ animate: true } ``` ### File Conversion Flow @@ -394,6 +422,10 @@ Loads conversion prompts (built-in or custom YAML) and sends file content to the ### validate_python_notebook Performs syntax validation on the generated code. Python syntax is checked using `ast.parse()`, while SQL statements within `spark.sql()` calls are validated using Spark's `EXPLAIN` command. Any errors are recorded in the result table for potential fixing in the next step. +### validate_sdp +Performs Spark Declarative Pipeline validation on the generated code. A real pipeline is created and validation-only update is performed. +Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored. + ### fix_syntax_with_llm Attempts automatic error correction when syntax issues are detected. Sends error context back to the model serving endpoint, which suggests corrections. The validation and fix process repeats up to `max_fix_attempts` times (default: 1) until errors are resolved or the retry limit is reached.