Description
Hello,
I am trying to train a DFM LM from a uniform distribution with the KLD loss. I am training on the text data from LibriSpeech, which contains around 130,000 sentences. Since the LibriSpeech text data is much smaller than webtext, I have changed the config to reflect the difference in dataset size. My choices have mainly been guided by the hyperparameters and methods used for training autoregressive LMs on LibriSpeech.
I trained the model for about 350k steps with a batch size of 256, reaching a KLD loss of 2.8 on the validation set. Surprisingly, the ELBO for this model is over 600,000, which is far larger than I expected. Looking at the model outputs, they read like those of a 3-gram language model.
This leads to my question: are there any obvious design considerations for training a DFM LM on a dataset much smaller than webtext?
Additionally, I am slightly confused about the correct way to report the ELBO. The eval script logs `ELBO: {torch.exp(elbo / num_elements)}`, but this looks more like perplexity than an ELBO to me. Admittedly, I am only just beginning to wrap my head around the ELBO presented in the paper XD
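For reference, here is a minimal sketch of how I currently understand the relationship between the summed ELBO, the per-token bound, and the number the eval script prints. This assumes `elbo` is the summed negative-log-likelihood bound in nats over the eval set and `num_elements` is the total token count; the values and variable names below are my own illustration, not taken from the repo:

```python
import math
import torch

# Hypothetical example values -- not from my actual eval run.
elbo = torch.tensor(600000.0)          # summed NLL upper bound over the eval set, in nats
num_elements = torch.tensor(250000.0)  # total number of evaluated tokens (assumed)

# Per-token bound in nats (and bits), which is how I would expect an ELBO to be reported.
nats_per_token = (elbo / num_elements).item()
bits_per_token = nats_per_token / math.log(2.0)

# What the eval script appears to print: the exponential of the per-token bound,
# which matches the usual definition of (an upper bound on) perplexity.
perplexity_bound = torch.exp(elbo / num_elements).item()

print(f"nats/token: {nats_per_token:.3f}, "
      f"bits/token: {bits_per_token:.3f}, "
      f"perplexity bound: {perplexity_bound:.3f}")
```

If that reading is correct, the printed quantity is effectively a perplexity bound rather than the ELBO itself, which is what prompted my confusion.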
Any tips and discussion would be greatly appreciated :)