
Ifeval: Download punkt_tab on rank 0 #2267

Open
wants to merge 4 commits into base: main
Conversation

baberabb
Contributor

Closes #2266. Also removed the pkg_resources dependency, as that's deprecated.
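As a hedged aside on the pkg_resources removal: the standard-library importlib.metadata is the usual replacement when pkg_resources was only used to look up installed distribution versions. A minimal sketch under that assumption (installed_version is an illustrative helper, not code from this PR):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent.

    Replaces pkg_resources.get_distribution(package).version without
    pulling in setuptools at runtime.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

If pkg_resources was instead used for bundled data files, importlib.resources would be the analogous stdlib replacement.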

nltk.download("punkt_tab")
print("Downloaded punkt_tab")
else:
time.sleep(5)
Contributor Author

baberabb commented Aug 30, 2024


This isn't really necessary, as the code runs at the beginning (before generations), but it couldn't hurt.


Suggestion: if the sleep is not needed, we should not add it.


al093 commented Aug 30, 2024

Sharing a slightly more verbose version.
I would do this:

import logging
import os

import nltk
import torch

logger = logging.getLogger(__name__)


def download_nltk_resources_guarded() -> None:
    """Download the 'punkt_tab' tokenizer on local rank 0 only.

    Guard the download with a distributed barrier; otherwise a race
    condition can occur when multiple processes try to download the
    same resource at once.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if local_rank == 0:
        try:
            nltk.data.find("tokenizers/punkt_tab")
        except LookupError:
            logger.info(f"Local rank {local_rank}: Downloading NLTK 'punkt_tab' resource.")
            nltk.download("punkt_tab")
            logger.info(f"Local rank {local_rank}: Downloaded NLTK 'punkt_tab' resource.")

    if torch.distributed.is_initialized():
        torch.distributed.barrier()
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        logger.error(
            f"Local rank {local_rank}: NLTK 'punkt_tab' resource not found. "
            "This should have been downloaded by local rank 0."
        )
        raise

al093 commented Aug 30, 2024

Suggestion: I would rethink the current download-on-import behaviour.

@baberabb
Contributor Author

@al093, thanks very much for your suggestion. I'm hesitant to add a torch dependency here, as none of the tasks currently require it. I removed time.sleep() as you suggested, and we do have a barrier later in the evaluation loop. As an extra layer, we could add a condition asking the user to manually download and cache the data: please run python -c "import nltk; nltk.download('punkt_tab')". Thoughts @haileyschoelkopf ?
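For the torch-free direction discussed above, one possible sketch is a cross-process guard built on an atomic mkdir lock, so no distributed framework is required. The helper below (ensure_resource is an illustrative name, not code from this PR) runs the download callable in exactly one process and makes the others poll until the resource exists:

```python
import os
import time
from typing import Callable


def ensure_resource(
    check: Callable[[], bool],
    download: Callable[[], None],
    lock_dir: str,
    timeout: float = 60.0,
    poll: float = 0.1,
) -> None:
    """Run `download` in exactly one process; others wait until `check` passes.

    Uses os.mkdir as a cross-process lock: mkdir is atomic, so exactly
    one process succeeds and performs the download. No torch dependency.
    """
    if check():
        return
    try:
        os.mkdir(lock_dir)  # atomic: only one process gets here
    except FileExistsError:
        # Another process holds the lock; poll until the resource appears.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if check():
                return
            time.sleep(poll)
        raise TimeoutError("resource did not appear before the timeout")
    try:
        download()
    finally:
        os.rmdir(lock_dir)
```

For this PR the callables could be, e.g., a check wrapping nltk.data.find("tokenizers/punkt_tab") in a try/except LookupError, and download=lambda: nltk.download("punkt_tab"). Note the lock directory would need to live on a filesystem shared by all local processes.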


Successfully merging this pull request may close these issues.

IFEval fails when multiple gpus are used (for DDP)
2 participants