# Update switch documentation for Spark Declarative Pipeline conversion #2174
@@ -18,6 +18,8 @@ This LLM-powered approach excels at converting complex SQL code and business log
than syntactic transformation. While generated notebooks may require manual adjustments, they provide a valuable foundation
for Databricks migration.

Switch can also convert ETL workloads into Spark Declarative Pipelines, supporting both Python and SQL. Refer to the sections below for usage instructions.

---

## How Switch Works

@@ -36,6 +38,7 @@ Switch runs entirely within the Databricks workspace. You can find details about
- **Jobs API**: Executes as scalable Databricks Jobs for batch processing
- **Model Serving**: Direct integration with Databricks LLM endpoints, with concurrent processing for multiple files
- **Delta Tables**: Tracks conversion progress and results
- **Pipelines API**: Creates and executes Spark Declarative Pipelines for pipeline conversion

### 3. Flexible Output Formats
- **Notebooks**: Python notebooks containing Spark SQL (primary output)

@@ -98,6 +101,14 @@ Convert non-SQL files to notebooks or other formats.
| `scala` | Scala Code → Databricks Python Notebook |
| `airflow` | Airflow DAG → Databricks Jobs YAML + Operator conversion guidance (SQL→sql_task, Python→notebook, etc.) |

### Built-in Prompts: ETL Sources

Convert ETL workloads to Spark Declarative Pipelines (SDP) in Python or SQL, as illustrated in the sketch after the table below.

| Source Technology | Source → Target |
|--------------|-----------------|
| `pyspark` | PySpark ETL → Databricks Notebook in Python or SQL for SDP |
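
As a rough illustration of what this conversion involves (a hand-written sketch, not actual Switch output; the table and path names are hypothetical), a small PySpark batch job maps onto SDP in Python along these lines:

```python
# Hypothetical source: a plain PySpark batch job.
#
#   df = spark.read.format("json").load("/raw/orders")
#   df.filter("status = 'shipped'").write.mode("overwrite").saveAsTable("orders_shipped")
#
# A possible SDP (Python) equivalent using the Databricks `dlt` module:

import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Shipped orders loaded from raw JSON")
def orders_shipped():
    # SDP materializes the returned DataFrame as a managed table, so the
    # explicit write/saveAsTable call from the source job is no longer needed.
    # `spark` is provided by the pipeline runtime.
    return (
        spark.read.format("json")
        .load("/raw/orders")
        .filter(F.col("status") == "shipped")
    )
```

With `sdp_language=sql` (see the configuration parameters below), the same logic would instead be emitted as SDP SQL.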

### Custom Prompts: Any Source Format

Switch's LLM-based architecture supports additional conversion types through custom YAML conversion prompts, making it extensible beyond built-in options.
@@ -186,7 +197,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y

| Parameter | Description | Default Value | Available Options |
|-----------|-------------|---------------|-------------------|
| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats, `sdp` for converting ETL workloads to Spark Declarative Pipelines (SDP). See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file`, `sdp` |
| `source_format` | Source file format type. `sql` performs SQL comment removal and whitespace compression preprocessing before conversion. `generic` processes files as-is without preprocessing. Preprocessing affects token counting and conversion quality. See [analyze_input_files](#analyze_input_files) for preprocessing details. | `sql` | `sql`, `generic` |
| `comment_lang` | Language for generated comments. | `English` | `English`, `Japanese`, `Chinese`, `French`, `German`, `Italian`, `Korean`, `Portuguese`, `Spanish` |
| `log_level` | Logging verbosity level. | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |

@@ -197,6 +208,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y
| `output_extension` | File extension for output files when `target_type=file`. Required for non-notebook output formats like YAML workflows or JSON configurations. See [File Conversion Flow](#file-conversion-flow) for usage examples. | `null` | Any extension (e.g., `.yml`, `.json`) |
| `sql_output_dir` | (Experimental) When specified, triggers additional conversion of Python notebooks to SQL notebook format. This optional post-processing step may lose some Python-specific logic. See [convert_notebook_to_sql](#convert_notebook_to_sql-optional) for details on the SQL conversion process. | `null` | Full workspace path |
| `request_params` | Additional request parameters passed to the model serving endpoint. Use for advanced configurations like extended thinking mode or custom token limits. See [LLM Configuration](/docs/transpile/pluggable_transpilers/switch/customizing_switch#llm-configuration) for configuration examples including Claude's extended thinking mode. | `null` | JSON format string (e.g., `{"max_tokens": 64000}`) |
| `sdp_language` | Language of the converted SDP code when `target_type=sdp`. | `python` | `python`, `sql` |

---

@@ -316,7 +328,7 @@ flowchart TD

### Notebook Conversion Flow

For `target_type=notebook` or `target_type=sdp`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline:

```mermaid
flowchart TD

    %% @@ -325,8 +337,15 @@ flowchart TD
    subgraph processing ["Notebook Processing Workflow"]
        direction TB
        analyze[analyze_input_files] e2@==> convert[convert_with_llm]

        %% Branch: decide validation path
        convert e8@==>|if SDP| validate_sdp[validate_sdp]
        convert e3@==>|if NOT SDP| validate_nb[validate_python_notebook]

        %% Downstream connections - both validations flow to fix_syntax
        validate_nb e4@==> fix[fix_syntax_with_llm]
        validate_sdp e9@==> fix

        fix e5@==> split[split_code_into_cells]
        split e6@==> export[export_to_notebook]
        export -.-> sqlExport["convert_notebook_to_sql<br>(Optional)"]
    %% @@ -340,13 +359,22 @@ flowchart TD
        export e7@==> notebooks[Python Notebooks]
        sqlExport -.-> sqlNotebooks["SQL Notebooks<br>(Optional Output)"]

        %% SDP validation pipeline operations
        validate_sdp -.-> export_notebook[export notebook]
        export_notebook -.-> create_pipeline[create pipeline]
        create_pipeline -.-> update_pipeline[update pipeline for validation]
        update_pipeline -.-> delete_pipeline[delete pipeline]
    e1@{ animate: true }
    e2@{ animate: true }
    e3@{ animate: true }
    e4@{ animate: true }
    e5@{ animate: true }
    e6@{ animate: true }
    e7@{ animate: true }
    e8@{ animate: true }
    e9@{ animate: true }
```

> **Contributor comment on lines +363 to +367:**
>
> **Suggestion: Move SDP pipeline operations detail to Processing Steps section**
>
> The Notebook Conversion Flow diagram has become quite lengthy after adding the SDP validation pipeline operations. The current diagram includes four extra nodes (export notebook, create pipeline, update pipeline for validation, delete pipeline). Suggestion: remove the SDP pipeline operation details from the main diagram and document them in the `validate_sdp` processing step instead. Proposed change:
>
> ### validate_sdp
>
> Performs Spark Declarative Pipeline validation on the generated code. The validation process executes these steps sequentially:
>
> | Step | Description |
> |------|-------------|
> | Export Notebook | Writes the converted code to a temporary notebook in workspace |
> | Create Pipeline | Creates a temporary Spark Declarative Pipeline referencing the notebook |
> | Update Pipeline | Runs a validation-only update to check for SDP syntax errors |
> | Delete Pipeline | Cleans up the temporary pipeline after validation |
>
> Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored during validation.
>
> This approach keeps the main conversion flow diagram concise while preserving the full detail in the Processing Steps section.

### File Conversion Flow

@@ -394,6 +422,10 @@ Loads conversion prompts (built-in or custom YAML) and sends file content to the

### validate_python_notebook
Performs syntax validation on the generated code. Python syntax is checked using `ast.parse()`, while SQL statements within `spark.sql()` calls are validated using Spark's `EXPLAIN` command. Any errors are recorded in the result table for potential fixing in the next step.
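
As a rough sketch of this mechanism (not Switch's actual implementation), a validator in this style could combine `ast.parse()` with `EXPLAIN`:

```python
import ast


def validate_notebook_code(code: str, spark) -> list[str]:
    """Return a list of error descriptions found in the generated code."""
    # 1. Check Python syntax with ast.parse()
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"Python syntax error: {e}"]

    errors = []
    # 2. Validate string literals passed to spark.sql() via EXPLAIN
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "sql"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            stmt = node.args[0].value
            try:
                spark.sql(f"EXPLAIN {stmt}")  # analysis errors surface here
            except Exception as e:
                errors.append(f"SQL error in statement {stmt[:50]!r}: {e}")
    return errors
```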

### validate_sdp
Performs Spark Declarative Pipeline validation on the generated code. A real pipeline is created and a validation-only update is performed.
Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored.
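
For illustration, the create/validate/delete cycle can be sketched with the Databricks Python SDK. This is an assumption about one way to implement such a check, not Switch's internal code, and the pipeline and notebook names are hypothetical:

```python
import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.pipelines import NotebookLibrary, PipelineLibrary

w = WorkspaceClient()

# Create a temporary pipeline referencing the exported notebook
created = w.pipelines.create(
    name="switch-sdp-validation-tmp",
    libraries=[PipelineLibrary(notebook=NotebookLibrary(path="/tmp/converted_sdp"))],
)
try:
    # Validation-only update: checks the SDP graph without materializing data
    update = w.pipelines.start_update(
        pipeline_id=created.pipeline_id, validate_only=True
    )
    # Poll until the validation update reaches a terminal state
    while True:
        info = w.pipelines.get_update(
            pipeline_id=created.pipeline_id, update_id=update.update_id
        )
        if info.update.state.value in ("COMPLETED", "FAILED", "CANCELED"):
            break
        time.sleep(10)
finally:
    # Always clean up the temporary pipeline
    w.pipelines.delete(pipeline_id=created.pipeline_id)
```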

### fix_syntax_with_llm
Attempts automatic error correction when syntax issues are detected. Sends error context back to the model serving endpoint, which suggests corrections. The validation and fix process repeats up to `max_fix_attempts` times (default: 1) until errors are resolved or the retry limit is reached.
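
Conceptually, the loop looks like this (a minimal sketch; `validate` and `ask_llm_to_fix` are hypothetical stand-ins for the steps above):

```python
def fix_with_retries(code: str, max_fix_attempts: int = 1) -> tuple[str, list[str]]:
    errors = validate(code)                  # e.g., validate_python_notebook or validate_sdp
    attempts = 0
    while errors and attempts < max_fix_attempts:
        code = ask_llm_to_fix(code, errors)  # send error context to the model serving endpoint
        errors = validate(code)              # re-validate the corrected code
        attempts += 1
    return code, errors                      # unresolved errors remain recorded in the result table
```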

> **Contributor comment:**
>
> Could we add a brief note about `unknown_etl`? It would help clarify that ETL types other than those listed above also transpile to Databricks Notebook in Python or SQL for SDP, for example via `unknown_etl`.
>
> Also, just a thought: would it make more sense to rename `unknown_etl` to something like `other_etl`? It might be clearer for users. I understand this would require changes on the Switch side as well, so it's just a suggestion.