
WordPiece tokenizer vs Full Tokenizer #6

Open
chiehminwei opened this issue Apr 20, 2019 · 2 comments

@chiehminwei

I noticed that in data.py for BERT you use the wordpiece tokenizer instead of the full tokenizer. I tried switching to the full tokenizer, but strangely it gave me much worse results. I suspect there may be a bug somewhere in the embedding alignment, because when I instead averaged the subword embeddings myself while computing the embeddings, the drop in performance went away (i.e. the embedding file I create is already word-level, so I don't have to rely on your code to do the averaging).
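Roughly what I did, as a minimal sketch with pytorch-pretrained-bert (the model name, layer choice, and function name are just illustrative, not my exact script): tokenize each pre-tokenized word with the wordpiece tokenizer, keep a piece-to-word map, then average the piece vectors per word.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-cased')
model.eval()

def word_level_embeddings(words, layer_index=-1):
    """One vector per pre-tokenized word: the mean of its wordpiece vectors."""
    pieces, piece_to_word = ['[CLS]'], [None]
    for i, word in enumerate(words):
        wps = tokenizer.wordpiece_tokenizer.tokenize(word)
        pieces.extend(wps)
        piece_to_word.extend([i] * len(wps))
    pieces.append('[SEP]')
    piece_to_word.append(None)

    ids = torch.tensor([tokenizer.convert_tokens_to_ids(pieces)])
    with torch.no_grad():
        layers, _ = model(ids)          # list of (1, num_pieces, hidden) tensors, one per layer
    hidden = layers[layer_index][0]     # (num_pieces, hidden)

    word_vectors = []
    for i in range(len(words)):
        idxs = [j for j, w in enumerate(piece_to_word) if w == i]
        word_vectors.append(hidden[idxs].mean(dim=0))
    return torch.stack(word_vectors)    # (num_words, hidden)
```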

By the way, it would be nice to mention in the README.md that the ELMo config files let you use word-level (instead of subword-level) embeddings. This is useful because if you pre-compute word-level embeddings (rather than saving subword-level ones) when you prepare your BERT embedding file and just use the ELMo config files, you don't have to waste time aligning embeddings every time you start a new experiment.

Also, I think a better workflow might be to generate one embedding file per BERT layer, instead of one big file with all the layers plus many config files each specifying a different layer index; a sketch of what I mean is below. When I tried the original workflow on a bigger dataset (the Czech UD treebank) I ran into out-of-memory issues, and it also makes preparing the data for training really slow.
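For example, something like this with h5py (the file layout assumed here, sentence-index keys mapping to (num_layers, num_tokens, dim) arrays, is illustrative and may not match the repo's exact format):

```python
import h5py

def split_by_layer(all_layers_path, out_template, num_layers):
    """Split one all-layers HDF5 file into one file per layer."""
    with h5py.File(all_layers_path, 'r') as fin:
        outs = [h5py.File(out_template.format(layer=l), 'w') for l in range(num_layers)]
        try:
            for key in fin.keys():            # assumed: one dataset per sentence
                arr = fin[key][...]           # (num_layers, num_tokens, dim)
                for l in range(num_layers):
                    outs[l].create_dataset(key, data=arr[l])
        finally:
            for f in outs:
                f.close()

# e.g. split_by_layer('ptb_train.bert-large.hdf5',
#                     'ptb_train.bert-large.layer{layer}.hdf5', 24)
```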

@john-hewitt
Owner

Hi; thanks for your notes!

The decision to use only the wordpiece tokenizer is an assumption baked into the alignment function; it explicitly uses the "##" wordpiece marker when deciding how alignments are made, so I'm not surprised the full tokenizer gives bad results. Also, because the treebanks are pre-tokenized, I don't think running pytorch-pretrained-bert's simple splitting tokenizer over already-tokenized text makes much sense.
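To illustrate the mismatch (a rough example, not the exact code in data.py; the exact pieces depend on the vocab):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
words = ["It", "doesn't", "converge", "."]

# Wordpiece tokenizer applied per pre-tokenized word: "##" marks exactly the
# continuation pieces, so the piece-to-word alignment is recoverable.
per_word = [tokenizer.wordpiece_tokenizer.tokenize(w) for w in words]

# Full tokenizer re-does basic tokenization (punctuation splitting, etc.)
# without adding "##", so the "##"-based aligner over-counts words.
full = tokenizer.tokenize(" ".join(words))

print(per_word)  # e.g. [['It'], ['doesn', "##'", '##t'], ['converge'], ['.']]
print(full)      # e.g. ['It', 'doesn', "'", 't', 'converge', '.']
```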

However, there's also the problem that the treebank tokenization doesn't exactly match what you'd get by natively tokenizing the original string, so BERT is dealing with somewhat out-of-domain tokenization either way. Ah well.

Yes, ELMo provides token embeddings, not subword embeddings; I'll add a note that people can use that route to handle arbitrary token-level embeddings! I agree that alignment takes an annoyingly long time; I suppose I didn't pre-align in case I wanted to change the alignment logic during development, but then I never did.

Hmm, yes, there is a lot of IO time from loading everything off disk, so saving each layer in a different file would make sense; we do already have a separate config for each layer. Unfortunately, though, this may not help the RAM issues, since we don't keep the other layers around once we've loaded each one. Maybe the hdf5 file tries to keep everything in RAM while it's open? I doubt it, though, since the BERT-large embeddings for PTB train are >100GB, which is more RAM than I devote to each job.
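For what it's worth, h5py reads lazily, so indexing a single layer per sentence should only pull that slice into memory (a quick sketch; the key/shape layout is assumed for illustration):

```python
import h5py

layer_index = 16
with h5py.File('ptb_train.bert-large.hdf5', 'r') as f:
    for key in f.keys():                          # assumed: one dataset per sentence
        single_layer = f[key][layer_index, :, :]  # (num_tokens, dim); only this slice is read
        # ...hand single_layer to the data loader...
```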

@chiehminwei
Author

Using the "##" assumption to do alignment breaks down on a language like Chinese (which is what I tried), where tokenization is character-level. Interestingly, I got slightly higher numbers (Spearman, root accuracy, etc.) on English when I used the full tokenizer and did the averaging myself.

I guess you're right about the RAM issue. And I'll still need many config files anyway, one per layer (changing the dataset directory instead of the layer index), but it should speed up IO a bit.

In any case, really nice work! Thanks a lot.
