Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2418

aishwaryaraimule21

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

google-oss-robot and others added 12 commits January 7, 2025 02:55
…2377)

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
* Add MNIST example with SPMD for JAX

Illustrate how to use JAX's `pmap` to express and execute
single-program multiple-data (SPMD) programs for data parallelism
along a batch dimension

Signed-off-by: Sandipan Panda <[email protected]>

* Update CONTRIBUTING.md

Use --server-side to install the latest local changes of the Training
Operator control plane

Signed-off-by: Sandipan Panda <[email protected]>

* Add JAXJob output

Signed-off-by: Sandipan Panda <[email protected]>

* Update JAXJob CI images

Signed-off-by: Sandipan Panda <[email protected]>

* Adjust jaxjob spmd example batch size

Signed-off-by: Sandipan Panda <[email protected]>

* Add JAX Example Docker Image Build in CI

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix script name typo

Signed-off-by: sailesh duddupudi <[email protected]>

* Update script permissions

Signed-off-by: sailesh duddupudi <[email protected]>

* Add KIND_CLUSTER env var

Signed-off-by: sailesh duddupudi <[email protected]>

* Increase timeouts

Signed-off-by: sailesh duddupudi <[email protected]>

* Test higher resources

Signed-off-by: sailesh duddupudi <[email protected]>

* Increase Timeout

Signed-off-by: sailesh duddupudi <[email protected]>

* remove resource reqs

Signed-off-by: sailesh duddupudi <[email protected]>

* test low batch size

Signed-off-by: sailesh duddupudi <[email protected]>

* test small batch size

Signed-off-by: sailesh duddupudi <[email protected]>

* Hardcode number of batches

Signed-off-by: sailesh duddupudi <[email protected]>

---------

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: sailesh duddupudi <[email protected]>
)

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.30.0 to 0.33.0.
- [Commits](golang/net@v0.30.0...v0.33.0)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: ChristianZaccaria <[email protected]>
Co-authored-by: ChristianZaccaria <[email protected]>
…beflow#2417)

This commit adds jaxjobs to the aggregation ClusterRole for Kubeflow,
which allows Kubeflow Profiles to have edit and admin rights over this CR.

Fixes kubeflow#2416

Signed-off-by: Daniela Plascencia <[email protected]>

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


4 participants