Extend ci suite #1080
Sorry, all reviewers: please feel free to remove yourselves. I meant to open this as a draft PR for now.
Helpful for unit tests because it allows use of a randomly initialised model
Primary version lives in `tests/model/test_fused_kernels.py`
Resolves `Cannot re-initialize CUDA in forked subprocess` error when running distributed unit tests
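For context on the last commit message: PyTorch raises this error when a process that has already initialised CUDA forks a child that then touches CUDA again, and the error text itself recommends the `spawn` start method. The sketch below shows that general remedy using only the standard library. It is an illustration under that assumption, not necessarily how this PR fixes the tests, and the helper names `get_spawn_context` and `run_in_fresh_process` are hypothetical.

```python
import multiprocessing as mp


def get_spawn_context():
    """Return a multiprocessing context whose workers start via 'spawn'.

    Spawned children begin in a fresh interpreter, so they do not inherit
    any CUDA state the parent process may already have initialised.
    """
    return mp.get_context("spawn")


def run_in_fresh_process(target, *args):
    """Run target(*args) in a spawned child process and return its exit code.

    `target` must be a picklable, module-level function, because the child
    re-imports the module that defines it instead of copying parent memory.
    """
    ctx = get_spawn_context()
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join()
    return proc.exitcode
```

A distributed test would call `run_in_fresh_process(worker_fn, rank, world_size)` from inside an `if __name__ == "__main__":` guard or from a pytest test function, where `worker_fn` is a placeholder for whatever module-level function sets up the process group.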
A clean CI run of the CPU-only tests is available here: https://github.com/mkerin/gpt-neox/actions/runs/6954696176
We're still missing test coverage for a couple of things in Stella's initial list, but I think this is worth merging. @Quentin-Anthony, if you could take a look when you have time, it would be greatly appreciated. The major categories of test coverage that we're still missing are:
I won't be online much over the next couple of weeks, but I intend to take another look at these when I get back. Some other issues that I encountered whilst working on this which are worth flagging:
Thanks for this work! I'll review over the next couple of days.
My original request I believe @zphang has used our library with the |
Thanks @StellaAthena. To clarify, I believe that training on one and two nodes corresponds to training with world_size=1 or world_size=2 (equivalent to one or two cores on a GPU). So to test the first case of training on one GPU, we want to set the host file such that world_size=n, where n is all cores available on that GPU?
In addition, one of the prerequisites of gpt-neox (best-download) is currently broken on PyPI. I believe all that is required to fix it is to update the PyPI release of best-download. It would be great if you could bump the PyPI release so that we don't need to point to the git-latest in the requirements file.
The & I confirmed that installing from source is now broken (as stated on Discord). I updated the PR to use PyPI as the source for best-download.
Can confirm this is all working for me. Great work!
Will hopefully close #957 when done.
Breakdown of requested new unit test coverage