#2166 databricks direct loading #2219
base: devel
Conversation
donotpush commented on Jan 15, 2025
- Resolves #2166: Load data into databricks without external staging and auth.
On top of this review: we should be able to run databricks without configured staging, so we can enable many tests that do exactly that. We have the destination configs in the destinations_configs test util:
- remove databricks here:
destination_configs += [
    DestinationTestConfiguration(destination_type=destination)
    for destination in SQL_DESTINATIONS
    if destination
    not in ("athena", "synapse", "databricks", "dremio", "clickhouse", "sqlalchemy")
]
- add staging to this one and move it in with the other staging databricks setups:
destination_configs += [
    DestinationTestConfiguration(
        destination_type="databricks",
        file_format="parquet",
        bucket_url=AZ_BUCKET,
        extra_info="az-authorization",
    )
]
you can also ping me and I can prepare a valid commit
# databricks authentication: get context config
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
The code is correct, but you must handle the situation when default credentials do not exist (i.e. outside of a notebook). I get this exception in that case:
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
Just skip the code that assigns the values.
Makes sense. I haven't tested on a notebook yet, and you are also right about that exception context - I have to catch that exception and add some tests.
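A minimal sketch of the handling discussed above, assuming the databricks-sdk WorkspaceClient; the helper name and the returned keys are illustrative, not the PR's actual code:

from databricks.sdk import WorkspaceClient


def notebook_context_config() -> dict:
    """Best-effort read of host/token from the Databricks notebook context."""
    try:
        w = WorkspaceClient()
    except ValueError:
        # "default auth: cannot configure default credentials ..." is raised
        # when no ambient credentials exist (e.g. outside a notebook);
        # skip the assignments and rely on explicit configuration instead
        return {}
    return {
        # note: w.config.host may include the https:// scheme
        "server_hostname": w.config.host,
        # notebook API token, only available inside a notebook context
        "access_token": w.dbutils.notebook.entry_point.getDbutils()
        .notebook()
        .getContext()
        .apiToken()
        .getOrElse(None),
    }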
else:
    return "", file_name

volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}"
How IMO we should handle volumes (see the sketch after this list):
- Allow defining staging_volume_name in DatabricksClientConfiguration. This should be (I think) a fully qualified name.
- If staging_volume_name is empty: create a volume named _dlt_temp_load_volume here (ad hoc).
- We do not need to handle volumes on the level of sql_client. You can drop the additional method you added.
- We do not need to care about dropping _dlt_temp_load_volume. It belongs to the current schema, so if the schema is dropped, the volume will be dropped as well (I hope!).
Why do we need time_ns?
Agree about the volume_name. I'll answer the path (time_ns) and file_name questions in another comment.
return "", file_name | ||
|
||
volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}" | ||
volume_file_name = ( # replace file_name for random hex code - databricks loading fails when file_name starts with - or . |
Why can't the same name as in file_name be used here? It will be unique (it already contains the file_id part, which is uniq_id()).
file_name = FileStorage.get_file_name_from_file_path(local_file_path)
file_format = ""
if file_name.endswith(".parquet"):
I do not think that you need to know the file format here. Just upload the file we have; it has the proper extension. Also keep the file_name, as mentioned above.
@@ -63,6 +63,7 @@ def iter_df(self, chunk_size: int) -> Generator[DataFrame, None, None]:

class DatabricksSqlClient(SqlClientBase[DatabricksSqlConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = databricks_lib
    volume_name: str = "_dlt_temp_load_volume"
Move to DatabricksConfiguration (as mentioned above).
@@ -102,6 +103,18 @@ def close_connection(self) -> None:
        self._conn.close()
        self._conn = None

    def create_volume(self) -> None:
        self.execute_sql(f"""
create volume ad hoc before uploading file
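A sketch of what "ad hoc" could look like inside the load job instead of a dedicated sql_client method; the function name is illustrative and the actual upload call is omitted:

from uuid import uuid4


def stage_file(sql_client, volume: str, local_file_path: str) -> str:
    """Create the staging volume ad hoc and return the directory to upload into."""
    # IF NOT EXISTS makes this idempotent, so it is cheap to run right before the upload
    sql_client.execute_sql(f"CREATE VOLUME IF NOT EXISTS {volume}")
    catalog, schema, volume_name = volume.split(".")
    # one unique directory per file, since COPY INTO reads a whole directory
    return f"/Volumes/{catalog}/{schema}/{volume_name}/{uuid4().hex}"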
dlt/destinations/job_client_impl.py
@@ -176,6 +176,7 @@ def initialize_storage(self, truncate_tables: Iterable[str] = None) -> None:
            self.sql_client.create_dataset()
        elif truncate_tables:
            self.sql_client.truncate_tables(*truncate_tables)
        self.sql_client.create_volume()
you can drop all of those
tests/.dlt/config.toml
@@ -1,4 +1,6 @@
[runtime]
log_level="DEBUG"
remember to remove before final push
" and the server_hostname." | ||
) | ||
|
||
self.direct_load = True |
I do not think we need this. If we have a local file, we do a direct load; we do not need to be in a notebook context to do it. Only the default access token needs the notebook.
I have managed to run tests without staging this way:
Answering (1) CREATE/DROP volume and (2) file name issues.

(1) CREATE/DROP Volume

I agree that we can implement your proposed approach for CREATE/DROP VOLUME. However, it will be less efficient, as the volume will remain until either the user decides to clean it up or we implement a cleanup process (e.g., deleting files after the COPY statement).

The reason I introduced the create_volume and drop_volume methods is that the DatabricksLoadJob.run method is executed for every single file in parallel, treating each file as a separate COPY statement. This approach is inefficient for "direct loading", since we could load multiple files into a single table using one COPY statement. Additionally, the COPY statement from a volume isn't file-based: the path must refer to a directory that contains multiple files. This design constraint necessitates creating a unique directory for each file in the current implementation.

Your proposal to run multiple CREATE VOLUME statements might not heavily impact ODBC performance, but it was a problem when I was creating the volume with the SDK (API rate limits). It might not be an issue with ODBC, since it supports CREATE VOLUME IF NOT EXISTS; it will execute a lot of redundant statements, but ODBC will handle it.

Another reason for introducing these methods was to allow the DROP VOLUME statement to be executed outside the run_pool or worker context, which is not feasible otherwise.

That said, I'm flexible. If you prefer, I can remove these methods, execute CREATE VOLUME as many times as there are files in the pipeline, and leave the volumes as is, relying on the user for cleanup.

(2) File name issues

I spent an entire day troubleshooting this issue:
After extensive debugging, I discovered that Databricks runs everything on Spark, and files with names starting with _ or . are treated as hidden files. This caused the COPY command to fail. As a result, we need a way to rename files, ensuring file names never depend on the exact table name. For instance, it always failed on the dlt table _dlt_pipeline_state (file _dlt_pipeline_state.1b87dba623.0.parquet).

Since the COPY statement requires a directory (not a file path) and each directory can contain 1 to N files, our current design creates a directory per file. I've been using time.time_ns() to create unique directories inside the volume and a random UID as a hex code for the file name. Changing the file name also required determining the correct file extension or format.

To summarize: I'm open to adjusting the path format to something more standardized or continuing with random UIDs. Also, let me know your thoughts on deleting files from the volume after loading. Should we leave them for the user to clean up, or implement automated cleanup?
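A condensed sketch of the naming scheme described above; the helper name is illustrative, and dlt's uniq_id() could replace time_ns(), as discussed later in this thread:

import os
import time
from typing import Tuple
from uuid import uuid4


def build_volume_target(volume_root: str, local_file_path: str) -> Tuple[str, str]:
    """Return (volume_directory, staged_file_name) for a single COPY INTO.

    COPY INTO reads a directory rather than a single file, so each file gets
    its own unique directory. The file itself is renamed to a random hex name
    (keeping its extension), because Spark treats files starting with "_" or
    "." as hidden, e.g. "_dlt_pipeline_state.1b87dba623.0.parquet".
    """
    _, ext = os.path.splitext(local_file_path)
    volume_dir = f"{volume_root}/{time.time_ns()}"  # unique directory per file
    staged_file_name = f"{uuid4().hex}{ext}"
    return volume_dir, staged_file_name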
@rudolfix I've implemented most of your recommendations, but I ran into an issue with enabling the non-staging tests. Unfortunately, your suggested approach didn't work as expected: the destination gets added, but the files are not local. I'm unsure whether it's possible to achieve this by only changing the configuration, without modifying how dlt.destination.databricks is instantiated in the tests.

That said, I did add a new test that uses the local file successfully. Let me know if you have further suggestions or ideas on this!
I enabled no-staging mode as the default for databricks.
- please update the documentation
- is it hard to delete staging files from databricks volumes? I bet you can do that from SQL... we have similar functionality for snowflake. IMO it should be pretty easy
otherwise LGTM! good job
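On the cleanup question, a possible SQL-based sketch; it assumes Databricks SQL's file-management REMOVE command works on volume paths (not verified here), with the SDK files API (WorkspaceClient().files.delete) as the fallback:

def delete_staged_file(sql_client, volume_dir: str, staged_file_name: str) -> None:
    # assumption: the SQL warehouse supports REMOVE on Unity Catalog volume paths;
    # run this after a successful COPY INTO to clean up the staged file
    sql_client.execute_sql(f"REMOVE '{volume_dir}/{staged_file_name}'")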
w = WorkspaceClient()
self.access_token = w.dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)  # type: ignore[union-attr]

# pick the first warehouse on the list
Please take the list of warehouses only when server_hostname or http_path is missing, and then set only the empty value. Configuration takes precedence, and if we have both values we skip getting the warehouses altogether.
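A sketch of that lookup, assuming the databricks-sdk warehouses API (the odbc_params attribute names are best-effort assumptions, not taken from this PR):

from typing import Tuple
from databricks.sdk import WorkspaceClient


def fill_warehouse_defaults(server_hostname: str, http_path: str) -> Tuple[str, str]:
    # explicit configuration takes precedence; if both values are set,
    # the warehouses API is never called
    if server_hostname and http_path:
        return server_hostname, http_path
    warehouses = list(WorkspaceClient().warehouses.list())
    if warehouses:
        odbc = warehouses[0].odbc_params  # first warehouse on the list
        server_hostname = server_hostname or odbc.hostname
        http_path = http_path or odbc.path
    return server_hostname, http_path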
done
fully_qualified_volume_name = self._job_client.config.staging_volume_name
volume_catalog, volume_database, volume_name = fully_qualified_volume_name.split(".")

self._sql_client.execute_sql(f"""
Create only unnamed volumes; if staging_volume_name is set, assume that it exists. Please also document this, and mention that if you go for production it is better to use a named volume.
done
    CREATE VOLUME IF NOT EXISTS {fully_qualified_volume_name}
""")

volume_path = f"/Volumes/{volume_catalog}/{volume_database}/{volume_name}/{time.time_ns()}"
Maybe you should take uniq_id() from dlt common instead of time? For example, on Windows you get a timer resolution of ~20 ms, so you may get clashing names.
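For reference, a minimal sketch of the swap (assuming uniq_id lives in dlt.common.utils, which is where it is commonly imported from):

from dlt.common.utils import uniq_id  # assumed import path

# instead of time.time_ns(), which can clash under coarse timer resolution:
volume_path = f"/Volumes/{volume_catalog}/{volume_database}/{volume_name}/{uniq_id()}"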
done
Thanks for the review, and for helping out with enabling the tests. I've implemented all your recommendations.
LGTM! thanks for the tests and docs.