Description
Is there an existing issue for this?
- I have searched the existing issues
Category of Bug / Issue
Reconcile bug
Current Behavior
In the reconciliation process, the hash calculation for the source (T-SQL/Synapse) environment is inconsistent with the hash calculation for the target (Databricks) environment, leading to reconciliation mismatches for certain rows.
- T-SQL/Synapse Source Hashing: The query generator explicitly truncates the concatenated string of column values to 256 characters before calculating the hash. This happens in src/databricks/labs/lakebridge/reconcile/query_builder/expression_generator.py at line 298, where the function string is defined as:
  func="CONVERT(VARCHAR(256), HASHBYTES('SHA2_256', CONVERT(VARCHAR(256),{})), 2)"
  If a row's concatenated string of values exceeds 256 characters, the calculated T-SQL hash is based on only the first 256 characters.
- Databricks Target Hashing: The target system (Databricks) calculates the hash over the full, untruncated concatenated string.
This disparity means that for any record where the concatenated string exceeds 256 characters, the hash calculated in T-SQL will not match the hash calculated in Databricks, even if the row data is identical. This results in the reconciliation process incorrectly reporting these rows as being different.
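The effect can be illustrated with a short Python sketch (hypothetical data; it simulates the two hashing strategies with hashlib rather than invoking T-SQL or Databricks):

```python
import hashlib

def tsql_style_hash(concatenated: str) -> str:
    # Mimics CONVERT(VARCHAR(256), ...) on the hash input:
    # only the first 256 characters are hashed.
    truncated = concatenated[:256]
    return hashlib.sha256(truncated.encode("utf-8")).hexdigest().upper()

def databricks_style_hash(concatenated: str) -> str:
    # Databricks hashes the full, untruncated string.
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

# A short row agrees in both systems (ignoring hex case)...
short_row = "customer_42|abc"
assert tsql_style_hash(short_row).lower() == databricks_style_hash(short_row)

# ...but a row whose concatenated values exceed 256 characters does not,
# because the T-SQL side hashed only the first 256 characters.
long_row = "customer_42|" + "x" * 300
assert tsql_style_hash(long_row).lower() != databricks_style_hash(long_row)
```
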
Expected Behavior
The hash calculation in the T-SQL/Synapse query should use the full, untruncated concatenated string of column values so that the hash uniquely identifies the row data, consistent with the Databricks calculation.
The VARCHAR(256) limit on the inner CONVERT should be removed or significantly increased (e.g., to VARCHAR(8000), or VARCHAR(MAX) where supported) to avoid truncation and ensure consistent hashing across both source and target.
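One possible fix, as a sketch only (not a tested patch): widen just the inner CONVERT in the format string in expression_generator.py, leaving the outer CONVERT (which formats the digest as hex) unchanged. Whether HASHBYTES accepts VARCHAR(MAX) input depends on the SQL Server / Synapse version, so VARCHAR(8000) may be the safer floor on older engines.

```python
# Hypothetical revision: only the inner CONVERT changes, so the value being
# hashed is no longer truncated at 256 characters.
func = "CONVERT(VARCHAR(256), HASHBYTES('SHA2_256', CONVERT(VARCHAR(MAX),{})), 2)"

# Example of how the query builder might interpolate a column expression:
print(func.format("CONCAT_WS('|', col_a, col_b)"))
```
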
Steps To Reproduce
- Define a source table where the concatenation of all column values for at least one row exceeds 256 characters.
- Ensure this data is correctly loaded into both the T-SQL/Synapse source and the Databricks target.
- Execute the Lakebridge reconcile process for this table.
- Observe that the reconciliation process incorrectly reports a mismatch for the row(s) whose concatenated values exceeded 256 characters, due to the T-SQL hash being based on the truncated string.
Relevant log output or Exception details
No explicit exception is thrown. The symptom is a reconciliation failure (mismatch) for rows where the concatenated values exceed 256 characters, despite the row data being identical in both systems.
Logs Confirmation
- I ran the command line with --debug
- I have attached the lsp-server.log under USER_HOME/.databricks/labs/remorph-transpilers/<converter_name>/lib/lsp-server.log
Sample Query
Operating System
Windows
Version
latest via Databricks CLI