
Advice on training a language model? #89

@Mattias421

Hello,

I am trying to train a DFM LM from a uniform distribution with the KLD loss. I am training on the text data from LibriSpeech, which contains around 130,000 sentences. Since the LibriSpeech text data is much smaller than webtext, I have changed the config to reflect the difference in dataset size. My choices have mostly been guided by the hyperparameters and methods typically used for training autoregressive LMs on LibriSpeech.
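
For context, my training step looks roughly like the sketch below. This is my own simplification, not the repo's code: the `model(xt, t)` signature, the linear schedule, and the per-token cross-entropy weighting are all approximations of what I'm doing.

```python
import torch
import torch.nn.functional as F

def dfm_training_step(model, x1, vocab_size):
    """One simplified DFM training step: corrupt the clean tokens x1 toward a
    uniform source along a mixture path, then train the model to predict x1
    with a cross-entropy (KLD) objective on the corrupted sequence."""
    B, L = x1.shape
    t = torch.rand(B, device=x1.device)                           # random time per sequence
    keep = torch.rand(B, L, device=x1.device) < t[:, None]        # linear schedule: keep x1 with prob t
    x0 = torch.randint(0, vocab_size, (B, L), device=x1.device)   # sample from the uniform source
    xt = torch.where(keep, x1, x0)                                # mixture-path corruption
    logits = model(xt, t)                                         # (B, L, vocab_size) prediction of x1
    return F.cross_entropy(logits.reshape(-1, vocab_size), x1.reshape(-1))
```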

I trained the model for about 350k steps with a batch size of 256, which achieves a KLD loss of 2.8 on the validation set. Surprisingly, the ELBO for this model is over 600,000, which is much larger than I was expecting. Looking at the model outputs, they read like the output of a 3-gram language model.

This leads to my question: are there any obvious design considerations for training a DFM LM on a dataset much smaller than webtext?

Additionally, I am slightly confused about the correct way to report ELBO: the eval script prints `ELBO: {torch.exp(elbo / num_elements)}`, but that looks more like perplexity to me than an ELBO. Although I am only just beginning to wrap my head around the ELBO presented in the paper XD
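
To make my confusion concrete, here is how I read those two quantities. The numbers below are toy values purely to illustrate the units question, and `num_elements` is the only name I'm taking from the eval script:

```python
import torch

# Toy numbers, not from my actual run:
elbo = torch.tensor(280000.0)   # bound accumulated over the whole eval set, in nats
num_elements = 100000           # total number of tokens it was accumulated over

nats_per_token = elbo / num_elements   # 2.8 nats/token, comparable across dataset sizes
print(torch.exp(nats_per_token))       # ~16.4, which is what the script prints as "ELBO";
                                       # exp(average nats per token) is how perplexity is usually defined
```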

Any tips and discussion would be greatly appreciated :)
