@@ -18,6 +18,8 @@ This LLM-powered approach excels at converting complex SQL code and business log
than syntactic transformation. While generated notebooks may require manual adjustments, they provide a valuable foundation
for Databricks migration.

Switch can also convert ETL workloads into Spark Declarative Pipelines, supporting both Python and SQL. Refer to the sections below for usage instructions.

---

## How Switch Works
@@ -36,6 +38,7 @@ Switch runs entirely within the Databricks workspace. You can find details about
- **Jobs API**: Executes as scalable Databricks Jobs for batch processing
- **Model Serving**: Direct integration with Databricks LLM endpoints, with concurrent processing for multiple files (see the sketch after this list)
- **Delta Tables**: Tracks conversion progress and results
- **Pipelines API**: Creates and executes Spark Declarative Pipelines for pipeline conversion
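
A minimal sketch of the concurrent model-serving pattern mentioned above, assuming the Databricks Python SDK. This is illustrative only, not Switch's actual implementation; the endpoint name, prompt, and inputs are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

w = WorkspaceClient()

def convert_one(source_text: str) -> str:
    # Send one source file to a chat-style model serving endpoint and
    # return the generated notebook code from the first choice.
    response = w.serving_endpoints.query(
        name="databricks-claude-sonnet-4",  # hypothetical endpoint name
        messages=[ChatMessage(
            role=ChatMessageRole.USER,
            content=f"Convert this SQL to a Databricks Python notebook:\n{source_text}",
        )],
        max_tokens=4000,
    )
    return response.choices[0].message.content

# Placeholder inputs; Switch reads real source files from its input directory.
files = {"job1.sql": "SELECT 1", "job2.sql": "SELECT 2"}
with ThreadPoolExecutor(max_workers=4) as pool:
    converted = dict(zip(files, pool.map(convert_one, files.values())))
```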

### 3. Flexible Output Formats
- **Notebooks**: Python notebooks containing Spark SQL (primary output)
@@ -98,6 +101,14 @@ Convert non-SQL files to notebooks or other formats.
| `scala` | Scala Code → Databricks Python Notebook |
| `airflow` | Airflow DAG → Databricks Jobs YAML + Operator conversion guidance (SQL→sql_task, Python→notebook, etc.) |

### Built-in Prompts: ETL Sources

Convert ETL workloads to Spark Declarative Pipeline (SDP) in Python or SQL.

| Source Technology | Source → Target |
|--------------|-----------------|
| `pyspark` | PySpark ETL → Databricks Notebook in Python or SQL for SDP |
**Reviewer comment (Contributor):**

Could we add a brief note about `unknown_etl`? It would help clarify that ETL types other than those listed above also transpile to Databricks Notebook in Python or SQL for SDP.

For example:

| ETL Type | Output |
|----------|--------|
| `unknown_etl` | Any other ETL → Databricks Notebook in Python or SQL for SDP |

Also, just a thought: would it make more sense to rename `unknown_etl` to something like `other_etl`? It might be clearer for users. I understand this would require changes on the Switch side as well, so it's just a suggestion.


### Custom Prompts: Any Source Format

Switch's LLM-based architecture supports additional conversion types through custom YAML conversion prompts, making it extensible beyond built-in options.
@@ -186,7 +197,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y

| Parameter | Description | Default Value | Available Options |
|-----------|-------------|---------------|-------------------|
| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats. See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file` |
| `target_type` | Output format type. `notebook` for Python notebooks with validation and error fixing, `file` for generic file formats, `sdp` for converting ETL workloads to Spark Declarative Pipelines (SDP). See [Conversion Flow Overview](#conversion-flow-overview) for processing differences. | `notebook` | `notebook`, `file`, `sdp` |
| `source_format` | Source file format type. `sql` performs SQL comment removal and whitespace compression preprocessing before conversion. `generic` processes files as-is without preprocessing. Preprocessing affects token counting and conversion quality. See [analyze_input_files](#analyze_input_files) for preprocessing details. | `sql` | `sql`, `generic` |
| `comment_lang` | Language for generated comments. | `English` | `English`, `Japanese`, `Chinese`, `French`, `German`, `Italian`, `Korean`, `Portuguese`, `Spanish` |
| `log_level` | Logging verbosity level. | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
@@ -197,6 +208,7 @@ Additional conversion parameters are managed in the Switch configuration file. Y
| `output_extension` | File extension for output files when `target_type=file`. Required for non-notebook output formats like YAML workflows or JSON configurations. See [File Conversion Flow](#file-conversion-flow) for usage examples. | `null` | Any extension (e.g., `.yml`, `.json`) |
| `sql_output_dir` | (Experimental) When specified, triggers additional conversion of Python notebooks to SQL notebook format. This optional post-processing step may lose some Python-specific logic. See [convert_notebook_to_sql](#convert_notebook_to_sql-optional) for details on the SQL conversion process. | `null` | Full workspace path |
| `request_params` | Additional request parameters passed to the model serving endpoint. Use for advanced configurations like extended thinking mode or custom token limits. See [LLM Configuration](/docs/transpile/pluggable_transpilers/switch/customizing_switch#llm-configuration) for configuration examples including Claude's extended thinking mode. | `null` | JSON format string (e.g., `{"max_tokens": 64000}`) |
| `sdp_language` | Language of the converted SDP code when `target_type=sdp`. | `python` | `python`, `sql` |

---

@@ -316,7 +328,7 @@ flowchart TD

### Notebook Conversion Flow

For `target_type=notebook`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline:
For `target_type=notebook` or `target_type=sdp`, the `orchestrate_to_notebook` orchestrator executes a comprehensive 7-step processing pipeline:

```mermaid
flowchart TD
@@ -325,8 +337,15 @@ flowchart TD
subgraph processing ["Notebook Processing Workflow"]
direction TB
analyze[analyze_input_files] e2@==> convert[convert_with_llm]
convert e3@==> validate[validate_python_notebook]
validate e4@==> fix[fix_syntax_with_llm]

%% Branch: decide validation path
convert e8@==>|if SDP| validate_sdp[validate_sdp]
convert e3@==>|if NOT SDP| validate_nb[validate_python_notebook]

%% Downstream connections - both validations flow to fix_syntax
validate_nb e4@==> fix[fix_syntax_with_llm]
validate_sdp e9@==> fix

fix e5@==> split[split_code_into_cells]
split e6@==> export[export_to_notebook]
export -.-> sqlExport["convert_notebook_to_sql<br>(Optional)"]
@@ -340,13 +359,22 @@ flowchart TD

export e7@==> notebooks[Python Notebooks]
sqlExport -.-> sqlNotebooks["SQL Notebooks<br>(Optional Output)"]

%% SDP validation pipeline operations
validate_sdp -.-> export_notebook[export notebook]
export_notebook -.-> create_pipeline[create pipeline]
create_pipeline -.-> update_pipeline[update pipeline for validation]
update_pipeline -.-> delete_pipeline[delete pipeline]
**Reviewer comment (Contributor) on lines +363 to +367:**

Suggestion: Move SDP pipeline operations detail to Processing Steps section

The Notebook Conversion Flow diagram has become quite lengthy after adding validate_sdp support. The dotted lines showing the pipeline operations (export_notebook → create_pipeline → update_pipeline → delete_pipeline) make the diagram harder to follow at a glance.

Suggestion: Remove the SDP pipeline operation details from the main diagram and document them in the ### validate_sdp section under Processing Steps instead.

Current diagram includes:

%% SDP validation pipeline operations
validate_sdp -.-> export_notebook[export notebook]
export_notebook -.-> create_pipeline[create pipeline]
create_pipeline -.-> update_pipeline[update pipeline for validation]
update_pipeline -.-> delete_pipeline[delete pipeline]

Proposed change:

  1. Remove these 4 dotted lines from the Notebook Conversion Flow diagram
  2. Expand the ### validate_sdp section in Processing Steps with a table:
### validate_sdp
Performs Spark Declarative Pipeline validation on the generated code. The validation process executes these steps sequentially:

| Step | Description |
|------|-------------|
| Export Notebook | Writes the converted code to a temporary notebook in workspace |
| Create Pipeline | Creates a temporary Spark Declarative Pipeline referencing the notebook |
| Update Pipeline | Runs a validation-only update to check for SDP syntax errors |
| Delete Pipeline | Cleans up the temporary pipeline after validation |

Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored during validation.

This approach:

- Keeps the main flow diagram readable
- Documents SDP details in a logical location (under the `validate_sdp` processing step)
- Uses a simple table format instead of an additional mermaid diagram


e1@{ animate: true }
e2@{ animate: true }
e3@{ animate: true }
e4@{ animate: true }
e5@{ animate: true }
e6@{ animate: true }
e7@{ animate: true }
e8@{ animate: true }
e9@{ animate: true }
```

### File Conversion Flow
@@ -394,6 +422,10 @@ Loads conversion prompts (built-in or custom YAML) and sends file content to the
### validate_python_notebook
Performs syntax validation on the generated code. Python syntax is checked using `ast.parse()`, while SQL statements within `spark.sql()` calls are validated using Spark's `EXPLAIN` command. Any errors are recorded in the result table for potential fixing in the next step.
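
As a rough illustration of that two-stage check (not Switch's actual code), the sketch below parses the generated source with `ast.parse()` and dry-runs each literal `spark.sql()` argument through `EXPLAIN`. The function name is hypothetical and `spark` is assumed to be an active SparkSession.

```python
import ast

def validate_generated_code(code: str, spark) -> list[str]:
    """Return a list of validation errors for generated notebook code."""
    try:
        tree = ast.parse(code)  # Python syntax check
    except SyntaxError as e:
        return [f"Python syntax error: {e}"]

    errors = []
    for node in ast.walk(tree):
        # Only literal string arguments to spark.sql(...) can be checked statically.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "sql"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            try:
                # EXPLAIN asks Spark to analyze the statement without running it.
                spark.sql(f"EXPLAIN {node.args[0].value}")
            except Exception as e:
                errors.append(f"SQL validation error: {e}")
    return errors
```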

### validate_sdp
Performs Spark Declarative Pipeline validation on the generated code. An actual pipeline is created and a validation-only update is run against it.
Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored during validation.
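
A minimal sketch of that validation sequence, assuming the Databricks Python SDK (`databricks-sdk`); it is not Switch's actual implementation, and the pipeline name and notebook path are hypothetical.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.pipelines import NotebookLibrary, PipelineLibrary

w = WorkspaceClient()

# Create a throwaway pipeline that references the exported notebook.
created = w.pipelines.create(
    name="switch-sdp-validation-tmp",  # hypothetical temporary name
    libraries=[PipelineLibrary(
        notebook=NotebookLibrary(path="/Workspace/tmp/converted_sdp"),  # hypothetical path
    )],
    development=True,
    serverless=True,
)
try:
    # A validation-only update checks the SDP graph and syntax without
    # materializing any tables.
    update = w.pipelines.start_update(
        pipeline_id=created.pipeline_id,
        validate_only=True,
    )
    # ...poll w.pipelines.get_update(pipeline_id=created.pipeline_id,
    # update_id=update.update_id) and collect any errors, skipping
    # TABLE_OR_VIEW_NOT_FOUND as noted above...
finally:
    # Clean up the temporary pipeline after validation.
    w.pipelines.delete(pipeline_id=created.pipeline_id)
```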

### fix_syntax_with_llm
Attempts automatic error correction when syntax issues are detected. Sends error context back to the model serving endpoint, which suggests corrections. The validation and fix process repeats up to `max_fix_attempts` times (default: 1) until errors are resolved or the retry limit is reached.
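
A minimal sketch of that retry loop (not Switch's actual code); `validate` and `ask_llm` stand in for the validation step above and a model serving call, and the prompt wording is hypothetical.

```python
def fix_with_llm(code: str, validate, ask_llm, max_fix_attempts: int = 1) -> str:
    """Re-prompt the LLM with error context until validation passes or attempts run out."""
    for _ in range(max_fix_attempts):
        errors = validate(code)  # e.g. validate_generated_code(code, spark)
        if not errors:
            break
        error_report = "\n".join(errors)
        code = ask_llm(
            "Fix the following errors in this generated Databricks notebook code.\n"
            f"Errors:\n{error_report}\n\nCode:\n{code}"
        )
    return code
```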
