Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoDB: Improve support for reading JSON/BSON files #261

Merged
merged 10 commits into from
Sep 13, 2024
Merged

Conversation

amotl
Copy link
Member

@amotl amotl commented Sep 11, 2024

About

A few features that have not been ready (lacked software tests) to be included into GH-255.

Details

  • Unlock processing multiple collections, either from server database, or from filesystem directory
  • Unlock processing JSON files from HTTP resource, using https+bson://
  • Optionally filter server collection using MongoDB query expression

Documentation

https://cratedb-toolkit--261.org.readthedocs.build/io/mongodb/loader.html

Install

pip install --upgrade 'cratedb-toolkit[mongodb] @ git+https://github.com/crate/cratedb-toolkit.git@amo/mongodb-more-2'

@amotl amotl force-pushed the amo/mongodb-more-2 branch 2 times, most recently from 414e128 to e8001c0 Compare September 11, 2024 11:57
@amotl amotl marked this pull request as ready for review September 11, 2024 12:00
@amotl amotl mentioned this pull request Sep 11, 2024
6 tasks

This comment was marked as resolved.

@bmunkholm

This comment was marked as resolved.

@cla-bot cla-bot bot added the cla-signed label Sep 11, 2024

This comment was marked as resolved.

progress=True,
)

elif source_url_obj.scheme.startswith("http"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this up to if source_url_obj.scheme.startswith("file") or source_url_obj.scheme.startswith("http") as the body of both branches looks the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Significantly improved with fb70a36. Thanks!

tasks.append(
MongoDBFullLoad(
mongodb_url=str(mongodb_uri_effective),
cratedb_url=str(cratedb_uri_effective),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the URL->String conversion here needed as the MongoDBFullLoad will convert the String back to an URL anyhow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved with 6bbda4d. Thanks again!

@amotl amotl requested a review from seut September 12, 2024 15:08
Comment on lines 140 to 146
mongodb_uri = source_url
cratedb_uri = target_url
# What the hack?
if (
mongodb_uri.scheme.startswith("mongodb")
and Path(mongodb_uri.path).is_absolute()
and mongodb_uri.path[-1] != "/"
):
mongodb_uri.path += "/"
if cratedb_uri.path[-1] != "/":
cratedb_uri.path += "/"
mongodb_query_parameters = mongodb_uri.query_params
mongodb_adapter = mongodb_adapter_factory(mongodb_uri)
address_pair_root = AddressPair(source_url=source_url, target_url=target_url)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hackery has now also been refactored and wrapped away into an AddressPair object, in order to separate concerns...

Comment on lines 153 to 161
for collection_path in collections:
mongodb_uri_effective = mongodb_uri.navigate(Path(collection_path).name)
mongodb_uri_effective.query_params = mongodb_query_parameters
cratedb_uri_effective = cratedb_uri.navigate(Path(collection_path).stem)
address_pair = address_pair_root.navigate(
source_path=Path(collection_path).name,
target_path=Path(collection_path).stem,
)
tasks.append(
MongoDBFullLoad(
mongodb_url=mongodb_uri_effective,
cratedb_url=cratedb_uri_effective,
mongodb_url=address_pair.source_url,
cratedb_url=address_pair.target_url,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and to minimize its API surface. Now, you are just invoking .navigate() on that composite object instance and it will adjust its managed URL instances correspondingly.

Comment on lines 144 to 145
source_url_query_parameters = self.source_url.query_params
target_url_query_parameters = self.target_url.query_params
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the query params are not changed in this method, why do we need to store (and copy) them separately?

Copy link
Member Author

@amotl amotl Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fundamental .navigate() method on the URL object implicitly gets rid of them, but we want to propagate them, so we need to store and forward them explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most probably, it makes more sense to not use .navigate() at all, because it apparently provides so many obstacles that need workarounds, and just manipulate the .path attribute directly instead.

Copy link
Member Author

@amotl amotl Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

94b52a8 stops using the .navigate() method of the designated URL library, and uses standard urljoin() instead to compute and directly set the .path attribute.

The result looks more streamlined and compact than before, as it does not need to work around the obstacles of .navigate() any longer, so we save the need to explicitly store+forward the URL query parameters. Thanks!

source_url_query_parameters = self.source_url.query_params
target_url_query_parameters = self.target_url.query_params

source_url = URL(str(self.source_url))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is no better API at the URL class to avoid string generation + parsing only to copy the instance, right?

Copy link
Member Author

@amotl amotl Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like standard deepcopy is just fine, as already applied on other spots in this file. Improved with 2a5df04. Thanks!

@amotl amotl requested a review from seut September 13, 2024 09:47
- Do not use the fundamental `.navigate()` method, as it needs too many
  workarounds.
- Do not store and copy query parameters, because the implementation
  does not use `.navigate()` any longer.
- Manipulate the `.path` property directly instead, computing it using
  the canonical `urljoin` function.
- Adjustments about missing trailing slashes still need to take place.
Copy link
Member

@seut seut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thx

@amotl amotl merged commit a25127e into main Sep 13, 2024
26 checks passed
@amotl amotl deleted the amo/mongodb-more-2 branch September 13, 2024 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants