-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add lazy loading of tables #3
base: main
Are you sure you want to change the base?
Conversation
@@ -247,7 +248,7 @@ def __init__( | |||
This can decrease latency if there are many files in the log since the last checkpoint, | |||
but will also increase memory usage. Possible rate limits of the storage backend should | |||
also be considered for optimal performance. Defaults to 4 * number of cpus. | |||
|
|||
:param lazy_load: when true the table metadata isn't loaded | |||
""" | |||
self._storage_options = storage_options | |||
self._table = RawDeltaTable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table and metadata initialization are already below, I think we should remove them from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and updates docstring and functions
a1cd31d
to
e77089f
Compare
@dsandesari do you think we should add unit tests for this? On rust side and/or python side? |
6f9704c
to
b59a982
Compare
python/src/lib.rs
Outdated
@@ -121,6 +121,43 @@ impl RawDeltaTable { | |||
}) | |||
} | |||
|
|||
#[classmethod] | |||
#[pyo3(signature = (table_uri, version = None, storage_options = None, without_files = false))] | |||
fn load_lazy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between load_lazy
and new
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_lazy
uses builder.build()
, while new
uses builder.load()
. builder.load()
build the DeltaTable and load its state. builder.build()
only build DeltaTable. See details here: https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/table/builder.rs#L269-L293
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need to include all the options in load_lazy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add a test with options and without options for load_lazy to ensure it works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
builder.build
doesn't use version option, so I removed it. https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/table/builder.rs#L269-L280
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PengLiVectra for links like the line links, make sure you anchor GitHub to a tag so the stanza is consistent. RN the highlight highlights something else instead of build
and load
. Looks good otherwise though. We could consider DRY by pulling out the first part of each function (before we call build
vs load
). Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should consider it, most of the code seems repetitive.
How is this work related to github/delta-io/delta-rs/issues/1361? |
Do we know why the tests are not passing? |
9ca9531
to
c717f5d
Compare
python/deltalake/table.py
Outdated
log_buffer_size=log_buffer_size, | ||
) | ||
self._metadata = None | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a test, using lazy_load.
python/src/lib.rs
Outdated
@@ -121,6 +121,43 @@ impl RawDeltaTable { | |||
}) | |||
} | |||
|
|||
#[classmethod] | |||
#[pyo3(signature = (table_uri, version = None, storage_options = None, without_files = false))] | |||
fn load_lazy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add a test with options and without options for load_lazy to ensure it works as expected.
ad082dd
to
e4e074e
Compare
d127edf
to
b6fba58
Compare
83478f5
to
6ba7634
Compare
6ba7634
to
b5acd83
Compare
Description
Add lazy loading of tables. When preforming streaming operations we don't need any version of the table loaded. Large tables are slow to load, so we see a huge performance boost by avoiding CPU time spent loading the table metadata.
Related Issue(s)
Partly with delta-io#1361
Testing
Breaking Change
Not a breaking change.
Documentation