Added support for MinIO and B2 buckets #620

Merged
merged 5 commits into from
Mar 12, 2025

TaperChipmunk32
Collaborator

@TaperChipmunk32 TaperChipmunk32 commented Jan 7, 2025

-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2

-Kept in support for AWS temporarily

-Updated readme and other documentation to show instructions on MinIO and B2 bucket setup

With this PR, everyone can start using Backblaze B2 or MinIO once they have their environment set up.



Collaborator

@mshannon-sil mshannon-sil left a comment

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)


bucket_setup.md line 7 at r1 (raw file):

### Note For MinIO setup

In order to access the MinIO bucket locally, you must have a VPN connected to its network.

It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."


bucket_setup.md line 13 at r1 (raw file):

**Windows**

The following will mount /silnlp on your B drive or /nlp-research on your M drive and allow you to explore, read and write.

What's the motivation behind the B drive folder and M drive folder having different names?


silnlp/common/environment.py line 33 at r1 (raw file):

        self.bucket_service = os.getenv("BUCKET_SERVICE", "").lower()

        self.set_s3_bucket()

I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.


silnlp/common/environment.py line 183 at r1 (raw file):

        # TEMPORARY: This allows users to still connect to AWS S3 if they have not set up MinIO or B2 yet. This will be removed in the future.
        if self.bucket_service == "aws" or (os.getenv("MINIO_ACCESS_KEY") is None and os.getenv("B2_KEY_ID") is None):
            LOGGER.warning("Support for AWS S3 will soon be removed. Please set up MinIO and/or B2 credentials.")

This is a great idea to include!
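
The fallback condition quoted above can be isolated as a small predicate. This is a minimal sketch, not the code in `environment.py`: the function name `should_fall_back_to_aws` and its `env` parameter are hypothetical, added only so the logic can be exercised without touching real environment variables.

```python
import logging
import os
from typing import Mapping, Optional

LOGGER = logging.getLogger(__name__)


def should_fall_back_to_aws(bucket_service: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the legacy AWS S3 path should be used.

    Hypothetical helper mirroring the condition quoted in the review:
    AWS is used when it is requested explicitly, or when neither MinIO
    nor B2 credentials are present.
    """
    env = os.environ if env is None else env
    no_new_credentials = env.get("MINIO_ACCESS_KEY") is None and env.get("B2_KEY_ID") is None
    use_aws = bucket_service == "aws" or no_new_credentials
    if use_aws:
        LOGGER.warning("Support for AWS S3 will soon be removed. Please set up MinIO and/or B2 credentials.")
    return use_aws
```

Passing a plain dictionary as `env` makes it possible to check each combination of credentials in isolation.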


silnlp/common/environment.py line 212 at r1 (raw file):

                self.bucket_service = "minio"
            except Exception as e:
                LOGGER.info(e)

It's probably best to upgrade this to a warning level, since it does involve a failure to connect.


silnlp/common/environment.py line 226 at r1 (raw file):

                self.bucket_service = "b2"
            except Exception as e:
                LOGGER.info(e)

Same as above.
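
The pattern discussed in these two comments (try one service, log the failure, then try the next) might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `connect_first_available` and the caller-supplied connect functions are hypothetical, and the only point shown is that a failed connection is reported at warning level.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

LOGGER = logging.getLogger(__name__)


def connect_first_available(
    candidates: Sequence[Tuple[str, Callable[[], object]]],
) -> Optional[str]:
    """Try each (service name, connect function) pair in order.

    Returns the name of the first service that connects, or None if all fail.
    The connect functions are placeholders supplied by the caller.
    """
    for name, connect in candidates:
        try:
            connect()
            return name
        except Exception as e:
            # A failed connection is logged at warning level, not info.
            LOGGER.warning("Could not connect to %s: %s", name, e)
    return None
```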

Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Reviewable status: 8 of 12 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.

Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.
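
The rule described in this reply can be sketched as follows. This is a minimal illustration, assuming the behavior stated above (local storage only when `SIL_NLP_DATA_PATH` is set and `BUCKET_SERVICE` is not); the helper name `pick_local_data_dir` and the `env` parameter are hypothetical and do not come from the PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def pick_local_data_dir(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return a local data directory, or None if a bucket should be used.

    Hypothetical helper: local storage is chosen only when SIL_NLP_DATA_PATH
    is set, BUCKET_SERVICE is not, and the path is an existing directory.
    """
    env = os.environ if env is None else env
    data_path = env.get("SIL_NLP_DATA_PATH")
    bucket_service = env.get("BUCKET_SERVICE", "").lower()
    if data_path and not bucket_service:
        candidate = Path(data_path)
        if candidate.is_dir():
            return candidate
    return None
```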


silnlp/common/environment.py line 212 at r1 (raw file):

Previously, mshannon-sil wrote…

It's probably best to upgrade this to a warning level, since it does involve a failure to connect.

Done.


silnlp/common/environment.py line 226 at r1 (raw file):

Previously, mshannon-sil wrote…

Same as above.

Done.


bucket_setup.md line 7 at r1 (raw file):

Previously, mshannon-sil wrote…

It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."

Done.


bucket_setup.md line 13 at r1 (raw file):

Previously, mshannon-sil wrote…

What's the motivation behind the B drive folder and M drive folder having different names?

This allows both drives to be mounted at the same time, if anyone ever wants to.

@TaperChipmunk32 TaperChipmunk32 linked an issue Jan 8, 2025 that may be closed by this pull request
@TaperChipmunk32 TaperChipmunk32 removed a link to an issue Jan 8, 2025
Collaborator

@mshannon-sil mshannon-sil left a comment

Reviewed 2 of 4 files at r2, all commit messages.
Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, TaperChipmunk32 (Matthew Beech) wrote…

Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.

Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?


silnlp/common/environment.py line 145 at r1 (raw file):

                else:
                    raise Exception(
                        f"The path defined by environment variable data_path ({data_path}) is not a "

Same as above.

Collaborator

@ddaspit ddaspit left a comment

Reviewed 8 of 12 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil and @TaperChipmunk32)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?

I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.
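
One way a single `SIL_NLP_DATA_PATH` value could cover both local folders and a user's own bucket is to classify the value before resolving it. The sketch below is an assumption-laden illustration: the thread does not say how bucket paths are written, so the `s3://bucket/prefix` form and the helper name `split_data_path` are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_data_path(data_path: str) -> Tuple[str, str, str]:
    """Classify a data path as ("local", "", path) or ("s3", bucket, prefix).

    Assumes bucket paths use the s3://bucket/prefix URI form; anything
    else is treated as a local filesystem path and returned unchanged.
    """
    parsed = urlparse(data_path)
    if parsed.scheme == "s3":
        return "s3", parsed.netloc, parsed.path.lstrip("/")
    return "local", "", data_path
```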


README.md line 143 at r2 (raw file):

   B2_KEY_ID=xxxxxxxx
   B2_APPLICATION_KEY=xxxxxxxx
   MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000

Will most users only need to set up B2 and not MinIO?

Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


README.md line 143 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Will most users only need to set up B2 and not MinIO?

Syncing between the two buckets using rclone takes 8 minutes, even when there are no transfers needed, so we will likely need to have users connect to MinIO.


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.

Done.


silnlp/common/environment.py line 145 at r1 (raw file):

Previously, mshannon-sil wrote…

Same as above.

Done.

Collaborator

@mshannon-sil mshannon-sil left a comment

LGTM, but let's hold off on merging this until we know what we're doing with the VPN/syncing issue.

Reviewed 2 of 4 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

Collaborator Author

TaperChipmunk32 commented Mar 6, 2025

@ddaspit @mshannon-sil I have successfully run jobs on AQuA, Cheetah, and GCP using MinIO. One note for GCP is that checkpoint uploads to the MinIO bucket take 10-15 minutes. This PR is ready to be reviewed again and hopefully merged soon.

Once this PR is merged, I will do the following:

  1. Send out MinIO/B2 keys and VPN instructions to everyone who needs them
  2. Submit a ticket to LT TechOps for VPN configurations for everyone who needs one
  3. Ask everyone to clean out the AWS S3 bucket of any files, especially checkpoints, that they do not need anymore to reduce transfer costs
  4. Update the wiki

Then, once everyone has MinIO set up and the S3 bucket is cleaned up, I will:

  1. Transfer all the remaining data from S3 to MinIO
  2. Sync MinIO and B2
  3. Delete the research data from the S3 bucket

Collaborator

@mshannon-sil mshannon-sil left a comment

Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit)


silnlp/nmt/experiment.py line 63 at r3 (raw file):

            raise RuntimeError(f"ERROR: Config file does not exist in experiment folder {exp_dir}.")
        SIL_NLP_ENV.copy_experiment_from_bucket(self.name)
        if self.config.has_parent:

Is this related to MinIO/B2 support? It seems like it might be part of the multilingual experiments changes, and if so it should go in a separate PR. Could you review the files and remove any changes that should be in a separate PR?


silnlp/nmt/hugging_face_config.py line 193 at r3 (raw file):

def get_parent_last_checkpoint(model_dir: Path) -> Path:

Seems like it belongs to a different PR as mentioned earlier.


silnlp/nmt/hugging_face_config.py line 699 at r3 (raw file):

            categories_set: Optional[Set[str]] = None if categories is None else set(categories)

            if terms_config["include_glosses"]:

Seems like it belongs to a different PR as mentioned earlier.

Collaborator

@ddaspit ddaspit left a comment

Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @TaperChipmunk32)

Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)


silnlp/nmt/experiment.py line 63 at r3 (raw file):

Previously, mshannon-sil wrote…

Is this related to MinIO/B2 support? It seems like it might be part of the multilingual experiments changes, and if so it should go in a separate PR. Could you review the files and remove any changes that should be in a separate PR?

I am not seeing these changes listed for this PR. The changes you are referring to are from a different, earlier PR.


silnlp/nmt/hugging_face_config.py line 193 at r3 (raw file):

Previously, mshannon-sil wrote…

Seems like it belongs to a different PR as mentioned earlier.

These changes are also from the previous PR mentioned earlier.


silnlp/nmt/hugging_face_config.py line 699 at r3 (raw file):

Previously, mshannon-sil wrote…

Seems like it belongs to a different PR as mentioned earlier.

This change is also from a previous PR, but not the same as the others. I am not seeing any of these in the "Files Changed" section of this PR.

Collaborator

@mshannon-sil mshannon-sil left a comment

:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @TaperChipmunk32)


silnlp/nmt/experiment.py line 63 at r3 (raw file):

Previously, TaperChipmunk32 (Matthew Beech) wrote…

I am not seeing these changes listed for this PR. The changes you are referring to are from a different, earlier PR.

Weird, not sure why they're showing up in Reviewable for me. I checked the files in GitHub and you're right, those changes aren't there. Maybe it has something to do with the reverted changes, or maybe I have to change something in my Reviewable settings. Either way, I'll dismiss this now.

@TaperChipmunk32 TaperChipmunk32 merged commit e7c0c4d into master Mar 12, 2025
1 check passed
AmeWenJ pushed a commit that referenced this pull request Mar 19, 2025
* Added support for MinIO and B2 buckets

-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2

-Kept in support for AWS temporarily

-Updated readme and other documentation to show instructions on MinIO and B2 bucket setup

* Updated clean_s3 to support MinIO

* Made 'minio' the default bucket_service
AmeWenJ pushed a commit that referenced this pull request Apr 11, 2025
@TaperChipmunk32 TaperChipmunk32 deleted the minio_b2 branch May 6, 2025 13:49