Extend ci suite #1080

Merged: 8 commits merged into EleutherAI:main from mkerin:extend_ci_suite on Dec 4, 2023

Conversation

@mkerin (Contributor) commented Nov 15, 2023

Will hopefully close #957 when done.

Breakdown of requested new unit test coverage

| Description | Has Test | Test Function | Comment |
|---|---|---|---|
| **Data Processing** | | | |
| Download scripts run | y | test_url_accessibility | Unavailable subsets of ThePile are marked as skipped |
| Preprocessing with each supported tokenizer works | y | test_preprocess_data | 5/6 tokenizers pass; SentencePiece failing |
| Training a new tokenizer | y | test_train_tokenizer | Passes |
| **Primary Functions** | | | |
| Launcher scripts | y | test_train_launcher | Passes |
| Training (on one GPU, one node, and one pod) | partial | test_train_launcher | Only covers running on 1 node |
| Finetuning (especially loading and training without optimizer states) | y | test_finetune | Passes |
| Inference | y | test_generate | Passes |
| Evaluation | y | test_evaluate | Passes |
| **Optimizations and Parallelizations** | | | |
| ZeRO works and memory usage is within prescribed limits | n | | |
| fp16 and bf16 | y | test_model_training_options | bf16 failing; possibly due to incompatible options in the unit test setup |
| Optimizer types | n | | |
| Various MP and PP values | n | | |
| Flash Attention | n | | Out of scope |
| **Model Options** | | | |
| GPT-J residual | y | test_model_training_options | |
| LLaMA MLP | y | test_model_training_options | |
| Positional embeddings | y | test_model_training_options | |
| Sparse attention | y | test_model_training_options | |
| Dropout and weight decay | y | test_model_training_options | |
| Kernel fusions | n | | |
| With / without bias terms | y | test_model_options | |
| **Conversion Scripts** | | | |
| NeoX -> HF transformers library | y | test_gpt_neox_to_huggingface | Failing; unclear how to resolve |
| NeoX -> Megatron-DS | n | | Script doesn't exist |
| NeoX -> SafeTensors | n | | Script doesn't exist |
| NeoX V1 -> NeoX V2 | n | | Script doesn't exist |
| **Misc Features** | | | |
| Library installs correctly and packages don't have conflicts | partial | test_dependencies.py | |
| MuP (currently bugged, see #956) | n | | Out of scope |
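
As the table notes, unavailable or known-broken cases are marked as skipped or expected failures rather than allowed to hard-fail the suite. A minimal sketch of the pytest idioms this implies (illustrative only: `subset_available`, the subset name, and the test bodies are hypothetical, not code from this PR):

```python
import pytest

def subset_available(name: str) -> bool:
    """Hypothetical helper: report whether a ThePile subset is still downloadable."""
    return False  # stub; a real check would probe the subset's download URL

@pytest.mark.skipif(
    not subset_available("pile_subset"),  # hypothetical subset name
    reason="ThePile subset unavailable",
)
def test_url_accessibility():
    ...

@pytest.mark.xfail(reason="SentencePiece preprocessing currently failing")
def test_preprocess_data_sentencepiece():
    ...
```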

@mkerin mkerin requested a review from a team as a code owner November 15, 2023 18:21
@mkerin mkerin marked this pull request as draft November 15, 2023 18:21
@mkerin (Contributor, Author) commented Nov 15, 2023

Sorry, all reviewers: please feel free to remove yourselves. I meant to open this as a draft PR for now.

@mkerin force-pushed the extend_ci_suite branch 2 times, most recently from ca0d758 to ff0983f on November 21, 2023 22:21
@mkerin force-pushed the extend_ci_suite branch 4 times, most recently from 3123768 to 8c63f72 on November 22, 2023 07:48
Commit notes from the pushed branch:

  • Helpful for unit tests because it allows use of a randomly initialised model
  • Primary version lives in `tests/model/test_fused_kernels.py`
  • Resolves the `Cannot re-initialize CUDA in forked subprocess` error when running distributed unit tests (see the sketch below)
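
For context (not necessarily the exact change in this PR), the usual cause of that error is a parent process creating a CUDA context before forking test workers; the standard remedies are to defer all CUDA calls into the children and/or use the `spawn` start method, which gives each child a fresh interpreter. A minimal sketch:

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # CUDA is touched only inside the child process, never in the parent,
    # so there is no inherited CUDA context to conflict with.
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    print(f"rank {rank} using {device}")

if __name__ == "__main__":
    # "fork" would reproduce the re-initialisation error if the parent had
    # already created a CUDA context; torch's spawn helper avoids that.
    mp.spawn(worker, nprocs=2)
```
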
@mkerin (Contributor, Author) commented Nov 22, 2023

A clean CI run of the CPU-only tests is available here: https://github.com/mkerin/gpt-neox/actions/runs/6954696176

@mkerin mkerin marked this pull request as ready for review November 22, 2023 08:42
@mkerin (Contributor, Author) commented Nov 22, 2023

We're still missing test coverage for a couple of things in Stella's initial list, but I think this is worth merging. @Quentin-Anthony, if you could take a look when you have time it would be greatly appreciated.

The major categories of test coverage that we're still missing are:

  • I have only added test coverage for MP=1 and PP=1
  • Tests to check that all supported optimizers run
  • Tests for flash attention

I won't be online much over the next couple of weeks, but I intend to take another look at these when I get back.

Some other issues that I encountered whilst working on this which are worth flagging:

  • Running tests in parallel with `pytest --forked tests` is currently broken. I fixed one blocker (CUDA being initialised too early), but hit another: the master_socket is hardcoded, and if it's busy (e.g. taken by the first test to run) then subsequent tests that expect it to be available will fail (see the sketch after this list). Running tests in serial with `pytest tests` is fine, though.
  • We have a test for `convert_sequential_to_hf.py`, but it's failing for reasons that are unclear to me. I've marked it as an expected failure and left some notes about the error.
  • I think we should avoid downloading data for the CPU CI run and instead rely on (small) test data stored in `tests/data/`.
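
For the hardcoded-socket problem, one common workaround (a sketch of a standard-library pattern, not something this PR implements) is to ask the OS for a free port per test:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# e.g. in a pytest fixture, so each test gets its own rendezvous port:
# os.environ["MASTER_PORT"] = str(find_free_port())
```

This is slightly racy (the port could be taken between the probe and its use), but it is usually enough to stop parallel test workers from colliding on a fixed port.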

@Quentin-Anthony (Member) commented

> We're still missing test coverage for a couple of things in Stella's initial list, but I think this is worth merging. […]

Thanks for this work! I'll review over the next couple of days.

@StellaAthena (Member) commented

My original request, "Training (on one GPU, one node, and one pod)", contains a typo; I meant to write "Training (on one GPU, one node, and two nodes)".

I believe @zphang has used our library with the SentencePieceTokenizer and may be able to identify why it's failing.

@mkerin (Contributor, Author) commented Nov 26, 2023

Thanks @StellaAthena.

To clarify, I believe that training on one or two nodes corresponds to training with world_size=1 or world_size=2 (equivalent to one or two cores on a GPU). So to test the first case of training on one GPU, do we want to set the hostfile such that world_size=n, where n is the number of cores available on that GPU?
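
For reference, in the usual torch.distributed convention, world_size counts processes (conventionally one per GPU), not cores: a single 8-GPU node launched with one process per GPU runs at world_size=8, and two such nodes at world_size=16. A minimal sketch with hypothetical values:

```python
import os
import torch.distributed as dist

# One process per GPU: world_size = num_nodes * gpus_per_node.
num_nodes, gpus_per_node = 2, 8          # hypothetical two-node setup
world_size = num_nodes * gpus_per_node   # -> 16

dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
    world_size=world_size,
    rank=int(os.environ["RANK"]),        # globally unique per process
)
```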

@mkerin (Contributor, Author) commented Nov 26, 2023

In addition, one of the prerequisites of gpt-neox, best-download, is currently broken on PyPI. I believe all that is required to fix it is to update the PyPI release of best-download:
EleutherAI/best-download#6

It would be great if you could bump the PyPI release of best-download so that we don't need to point to the latest git commit in the requirements file.

@mkerin (Contributor, Author) commented Nov 29, 2023

The best-download dependency has been updated (screenshot of the new PyPI release omitted).

I confirmed that installing from source is now broken (as stated on Discord):
https://github.com/mkerin/gpt-neox/actions/runs/7030214909
but installing from PyPI is fixed (for this GitHub Actions run, I dropped the commit that installs best-download from the latest GitHub source rather than PyPI):
https://github.com/mkerin/gpt-neox/actions/runs/7030248294

I updated the PR to use PyPI as the source for best-download.

@Quentin-Anthony (Member) commented
Can confirm this is all working for me. Great work!

@Quentin-Anthony Quentin-Anthony merged commit 3be59a4 into EleutherAI:main Dec 4, 2023
2 checks passed
Linked issue: Robust testing suite (#957)