Update switch documentation for Spark Declarative Pipeline conversion #2174
base: main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests. Coverage unchanged at 63.56% (100 files, 8503 lines, 885 branches; 5405 hits, 2931 misses, 167 partials).
✅ 51/51 passed, 4 flaky, 5m6s total (running from acceptance #3193)
| Source Technology | Source → Target |
|-------------------|-----------------|
| `pyspark` | PySpark ETL → Databricks Notebook in Python or SQL for SDP |
Could we add a brief note about `unknown_etl`? It would help clarify that ETL types other than those listed above also transpile to Databricks Notebook in Python or SQL for SDP.
For example:
| ETL Type | Output |
|---|---|
| `unknown_etl` | Any other ETL → Databricks Notebook in Python or SQL for SDP |
Also, just a thought: would it make more sense to rename `unknown_etl` to something like `other_etl`? It might be clearer for users. I understand this would require changes on the Switch side as well, so it's just a suggestion.
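To make the fallback behavior concrete, here is a minimal illustrative sketch; the dictionary and function names are hypothetical and are not Switch's actual implementation:

```python
# Hypothetical sketch of the fallback described above: any source technology without a
# dedicated conversion target falls back to the generic unknown_etl behavior, which still
# produces a Databricks Notebook in Python or SQL for SDP. Not Switch's actual code.
CONVERSION_TARGETS = {
    "pyspark": "PySpark ETL -> Databricks Notebook in Python or SQL for SDP",
}
UNKNOWN_ETL_TARGET = "Any other ETL -> Databricks Notebook in Python or SQL for SDP"


def conversion_target(source_technology: str) -> str:
    """Return the documented Source -> Target mapping, defaulting to unknown_etl."""
    return CONVERSION_TARGETS.get(source_technology, UNKNOWN_ETL_TARGET)


assert conversion_target("airflow") == UNKNOWN_ETL_TARGET
```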
%% SDP validation pipeline operations
validate_sdp -.-> export_notebook[export notebook]
export_notebook -.-> create_pipeline[create pipeline]
create_pipeline -.-> update_pipeline[update pipeline for validation]
update_pipeline -.-> delete_pipeline[delete pipeline]
Suggestion: Move SDP pipeline operations detail to Processing Steps section
The Notebook Conversion Flow diagram has become quite lengthy after adding validate_sdp support. The dotted lines showing the pipeline operations (export_notebook → create_pipeline → update_pipeline → delete_pipeline) make the diagram harder to follow at a glance.
Suggestion: Remove the SDP pipeline operation details from the main diagram and document them in the ### validate_sdp section under Processing Steps instead.
Current diagram includes:
%% SDP validation pipeline operations
validate_sdp -.-> export_notebook[export notebook]
export_notebook -.-> create_pipeline[create pipeline]
create_pipeline -.-> update_pipeline[update pipeline for validation]
update_pipeline -.-> delete_pipeline[delete pipeline]
Proposed change:
- Remove these 4 dotted lines from the Notebook Conversion Flow diagram
- Expand the `### validate_sdp` section in Processing Steps with a table:
### validate_sdp
Performs Spark Declarative Pipeline validation on the generated code. The validation process executes these steps sequentially:
| Step | Description |
|------|-------------|
| Export Notebook | Writes the converted code to a temporary notebook in workspace |
| Create Pipeline | Creates a temporary Spark Declarative Pipeline referencing the notebook |
| Update Pipeline | Runs a validation-only update to check for SDP syntax errors |
| Delete Pipeline | Cleans up the temporary pipeline after validation |
Note: `TABLE_OR_VIEW_NOT_FOUND` errors are ignored during validation.

This approach:
- Keeps the main flow diagram readable
- Documents SDP details in a logical location (under the validate_sdp processing step)
- Uses a simple table format instead of an additional mermaid diagram
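For readers who want to see what those four steps look like in practice, here is a minimal sketch using the Databricks Python SDK. The notebook path, pipeline name, and error handling are illustrative assumptions, not the actual Switch implementation:

```python
# Illustrative sketch of the export -> create -> update (validate-only) -> delete sequence.
# Names and paths are assumptions; Switch's real implementation may differ.
import base64

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.pipelines import NotebookLibrary, PipelineLibrary
from databricks.sdk.service.workspace import ImportFormat, Language

w = WorkspaceClient()
notebook_path = "/Workspace/tmp/sdp_validation_notebook"  # hypothetical temporary path
converted_code = "import dlt\n# ... converted source goes here ..."

# Export Notebook: write the converted code to a temporary workspace notebook
w.workspace.import_(
    path=notebook_path,
    content=base64.b64encode(converted_code.encode("utf-8")).decode("ascii"),
    format=ImportFormat.SOURCE,
    language=Language.PYTHON,
    overwrite=True,
)

# Create Pipeline: a temporary Spark Declarative Pipeline referencing the notebook
created = w.pipelines.create(
    name="tmp-sdp-validation",  # hypothetical pipeline name
    libraries=[PipelineLibrary(notebook=NotebookLibrary(path=notebook_path))],
    development=True,
)

try:
    # Update Pipeline: validation-only update to surface SDP syntax errors
    w.pipelines.start_update(created.pipeline_id, validate_only=True)
finally:
    # Delete Pipeline: clean up the temporary pipeline after validation
    w.pipelines.delete(created.pipeline_id)
```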
hiroyukinakazato-db left a comment
In docs/lakebridge/docs/transpile/pluggable_transpilers/switch/customizing_switch.mdx, the Conversion Result Table Schema is missing a column for SDP validation errors. Please add result_sdp_errors similar to the existing result_python_parse_error and result_sql_parse_errors columns.
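As a hedged illustration of how that column could be consumed once added: the result table name and the `input_file_number` column below are placeholders, and only `result_sdp_errors`, `result_python_parse_error`, and `result_sql_parse_errors` come from the discussion above.

```python
# Hypothetical query against the Switch conversion result table; the table name and the
# input_file_number column are placeholders, not documented identifiers.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

results = spark.table("main.switch.conversion_results")  # hypothetical table name
sdp_failures = results.filter(F.col("result_sdp_errors").isNotNull())
sdp_failures.select("input_file_number", "result_sdp_errors").show(truncate=False)
```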
Changes
Update the switch documentation for conversion to Spark Declarative Pipeline (SDP).
This depends on the following PR, which changes switch: https://github.com/databrickslabs/switch/pull/46
What does this PR do?
Relevant implementation details
This is a documentation update.
Caveats/things to watch out for when reviewing:
Linked issues
Resolves #..
Functionality
databricks labs lakebridge ...
Tests