Extend ci suite #1080

mkerin · 2023-11-15T18:21:01Z

Will hopefully close #957 when done.

Breakdown of requested new unit test coverage

| Description | Has Test | Test Function | Comment |
|---|---|---|---|
| Data Processing | NA | SECTION-HEADER | |
| Download scripts run | y | test_url_accessibility | Unavailable subsets of ThePile are marked as skipped |
| Preprocessing with each supported tokenizer works | y | test_preprocess_data | 5/6 tokenizers pass, SentencePiece failing |
| Training a new tokenizer | y | test_train_tokenizer | Passes |
| Primary Functions | NA | SECTION-HEADER | |
| Launcher scripts | y | test_train_launcher | Passes |
| Training (on one GPU, one node, and one pod) | partial | test_train_launcher | Only covers running on 1 node |
| Finetuning (especially loading and training without optimizer states) | y | test_finetune | Passes |
| Inference | y | test_generate | Passes |
| Evaluation | y | test_evaluate | Passes |
| Optimizations and Parallelizations | NA | SECTION-HEADER | |
| ZeRO works and memory usage is within prescribed limits | n | | |
| fp16 and bf16 | y | test_model_training_options | bf16 failing, possibly due to incompatible options in the unit test setup |
| Optimizer types | n | | |
| Various MP and PP values | n | | |
| Flash Attention | n | | Out of scope |
| Model Options | NA | SECTION-HEADER | |
| GPT-J residual | y | test_model_training_options | |
| LLaMA MLP | y | test_model_training_options | |
| Positional embeddings | y | test_model_training_options | |
| Sparse attention | y | test_model_training_options | |
| Dropout and Weight decay | y | test_model_training_options | |
| Kernel fusions | n | | |
| With / without bias terms | y | test_model_options | |
| Conversion Scripts | NA | SECTION-HEADER | |
| NeoX -> HF transformers library | y | test_gpt_neox_to_huggingface | Failing; unclear how to resolve |
| NeoX -> Megatron-DS | n | | Script doesn't exist |
| NeoX -> SafeTensors | n | | Script doesn't exist |
| NeoX V1 -> NeoX V2 | n | | Script doesn't exist |
| Misc Features | NA | SECTION-HEADER | |
| Library installs correctly and packages don’t have conflicts | partial | test_dependencies.py | |
| MuP (currently bugged, see #956) | n | | Out of scope |
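
Several rows in the table describe tests that are skipped when a resource is unavailable or that are known to fail for now. As a minimal sketch of how those two states differ from an ordinary failure, the example below uses the standard library's `unittest` so that it runs without extra dependencies; the suite in this PR runs under pytest, where the analogous markers are `pytest.mark.skip` and `pytest.mark.xfail`. The class name, test names, and the `DATASET_AVAILABLE` flag are hypothetical and are not taken from the PR.

```python
import unittest

# Hypothetical flag standing in for "is this dataset subset reachable?".
DATASET_AVAILABLE = False


class ExampleSuite(unittest.TestCase):
    @unittest.skipUnless(DATASET_AVAILABLE, "dataset subset is unavailable")
    def test_dataset_reachable(self):
        # Only runs when the resource is available; otherwise reported as skipped.
        self.assertTrue(DATASET_AVAILABLE)

    @unittest.expectedFailure
    def test_known_broken_conversion(self):
        # Placeholder for a check that is known to fail for now.
        self.assertEqual(1, 2)

    def test_passing(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the example suite and return the collected result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleSuite)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Skipped and expected-failure tests are recorded separately from failures, so the run as a whole still counts as successful.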

mkerin · 2023-11-15T18:27:22Z

Sorry - all reviewers please feel free to remove yourselves - I meant to open this as a draft PR for now.

mkerin · 2023-11-22T08:40:16Z

Clean CI run of CPU only tests is available here: https://github.com/mkerin/gpt-neox/actions/runs/6954696176

mkerin · 2023-11-22T08:52:16Z

We're still missing test coverage for a couple of things in Stella's initial list, but I think this is worth merging. @Quentin-Anthony if you could take a look when you have time it would be greatly appreciated.

The major categories of test coverage that we're still missing are:

- Various MP and PP values (I have only added test coverage for MP=1 and PP=1)
- Tests to check that all supported optimizers run
- Tests for flash attention

I won't be online much over the next couple of weeks, but I intend to take another look at these when I get back.

Some other issues that I encountered whilst working on this which are worth flagging:

- Running tests in parallel with `pytest --forked tests` is currently broken. I fixed one blocker for this (CUDA being initialised too early) but hit another: the master_socket is hardcoded, so if it is busy (e.g. held by the first test to run), subsequent tests that expect it to be available will fail. Running tests in serial with `pytest tests` is fine though.
- We have a test for `convert_sequential_to_hf.py`, but it is failing for reasons that are unclear to me. I've marked it as an expected failure and left some notes about the error.
- I think we should avoid downloading data for the CPU CI run, and instead rely on (small) test data stored in `tests/data/`.
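
One possible way around the hardcoded master_socket problem flagged above is to let the operating system choose a free port for each test. The sketch below is not what this PR implements. It assumes the tests could pass the port through the `MASTER_ADDR` and `MASTER_PORT` environment variables that `torch.distributed` reads for `env://` initialisation, and the helper names `find_free_port` and `distributed_env` are hypothetical.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def distributed_env(port=None):
    """Build the environment variables for one test's process group.

    Assumption: the test harness forwards these variables to the launcher.
    """
    chosen = find_free_port() if port is None else port
    return {"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": str(chosen)}
```

There is a small race in this approach: the port is released before the process group binds it, so another process could claim it in between. For a test suite that is usually acceptable, but it is not a guarantee.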

Quentin-Anthony · 2023-11-22T19:31:46Z

Thanks for this work! I'll review over the next couple of days.

StellaAthena · 2023-11-25T16:31:33Z

My original request "Training (on one GPU, one node, and one pod)" contains a typo; I meant to write "Training (on one GPU, one node, and two nodes)".

I believe @zphang has used our library with the SentencePieceTokenizer and may be able to identify why it's failing.

mkerin · 2023-11-26T12:10:45Z

Thanks @StellaAthena.

To clarify, I believe that training on one and two nodes corresponds to training with world_size=1 or world_size=2 (equivalent to one or two cores on a GPU). So to test the first case of training on one GPU, we want to set the host file such that world_size=n, where n is all cores available on that GPU?
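
As background for the question above, here is a rough sketch of how a world size can be derived from a hostfile. It assumes the DeepSpeed-style hostfile format, in which each line reads `hostname slots=N`, and it assumes one process per slot (typically one per GPU). The function name `world_size_from_hostfile` is hypothetical and is not part of gpt-neox.

```python
def world_size_from_hostfile(text: str) -> int:
    """Sum the slots over all hosts listed in a DeepSpeed-style hostfile.

    Assumption: each non-empty line looks like "hostname slots=N".
    """
    total = 0
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        _host, slots = line.split()
        key, _, value = slots.partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {raw_line!r}")
        total += int(value)
    return total
```

Under these assumptions, a single host with `slots=1` gives a world size of 1, and two hosts with `slots=8` each give a world size of 16.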

mkerin · 2023-11-26T12:11:57Z

In addition, one of the prerequisites of gpt-neox (best-download) is currently broken on PyPI. I believe all that is required to fix it is to update the PyPI release of best-download:
EleutherAI/best-download#6

It would be great if you could bump the PyPI release of best-download so that we don't need to point to the git-latest in the requirements file.

mkerin · 2023-11-29T09:17:07Z

The best-download dependency has been updated.

I confirmed that installing from source is now broken (as stated on Discord):
https://github.com/mkerin/gpt-neox/actions/runs/7030214909
but installing from PyPI is fixed (for this GitHub run I dropped the commit that installs best-download from the latest GitHub source instead of PyPI):
https://github.com/mkerin/gpt-neox/actions/runs/7030248294

I updated the PR to use PyPI as the source for best-download.

Quentin-Anthony · 2023-12-04T08:56:06Z

Can confirm this is all working for me. Great work!

mkerin requested a review from a team as a code owner November 15, 2023 18:21

mkerin requested review from Quentin-Anthony and ShivanshuPurohit November 15, 2023 18:21

mkerin marked this pull request as draft November 15, 2023 18:21

mkerin force-pushed the extend_ci_suite branch 2 times, most recently from ca0d758 to ff0983f November 21, 2023 22:21

mkerin added 2 commits November 22, 2023 05:44

Use .yml extensions in README to reflect extensions used in `config…

246e298

…s/` folder

Rename save_interval -> checkpoint_factor

47282ee

mkerin force-pushed the extend_ci_suite branch 4 times, most recently from 3123768 to 8c63f72 November 22, 2023 07:48

mkerin added 6 commits November 22, 2023 08:28

Mark expected failures in existing tests

9ce3942

Fix minor typos

634d25c

Allow creation of checkpoint at iteration 0 when do_train=False

edb3ef5

Helpful for unit tests because it allows use of a randomly initialised model

Delete duplicated test_fused_kernels.py

3211520

Primary version lives in `tests/model/test_fused_kernels.py`

Avoid initializing CUDA whenever megatron is imported

cce30fe

Resolves `Cannot re-initialize CUDA in forked subprocess` error when running distributed unit tests
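
The general idea behind a fix like this is to defer expensive, process-specific setup until it is first needed, so that importing a module has no side effects and a forked child can initialise on its own. The sketch below illustrates that pattern only. It does not reproduce the actual change to megatron, it uses no CUDA calls, and `get_device_context`, `run_step`, and `INIT_CALLS` are hypothetical names.

```python
import functools

INIT_CALLS = []  # records each time the expensive setup actually runs


@functools.lru_cache(maxsize=None)
def get_device_context():
    """Hypothetical stand-in for expensive setup such as device initialisation.

    Nothing happens at import time; the body runs on the first call only.
    """
    INIT_CALLS.append("initialised")
    return {"device": "placeholder"}


def run_step():
    # The setup is triggered here, at first use, and not when the module loads.
    context = get_device_context()
    return context["device"]
```

Because the setup is tied to the first call, a parent process that only imports the module never triggers it, and each child that calls `run_step` performs its own initialisation.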

Extend suite of unit tests

6b227bc

mkerin force-pushed the extend_ci_suite branch from 8c63f72 to be57aef November 22, 2023 08:28

mkerin marked this pull request as ready for review November 22, 2023 08:42

Quentin-Anthony self-assigned this Nov 29, 2023

mkerin force-pushed the extend_ci_suite branch from d5237d7 to 6b227bc November 29, 2023 09:03

Quentin-Anthony reviewed Dec 4, 2023

megatron/tokenizer/train_tokenizer.py (resolved)

Quentin-Anthony approved these changes Dec 4, 2023

Quentin-Anthony merged commit 3be59a4 into EleutherAI:main Dec 4, 2023
2 checks passed

Quentin-Anthony mentioned this pull request Dec 4, 2023

Patch coverity scan #1090

Merged
