
WordPiece tokenizer vs Full Tokenizer #6

Open
chiehminwei opened this issue Apr 20, 2019 · 2 comments

@chiehminwei

I noticed that in data.py for BERT you use the wordpiece tokenizer instead of the full tokenizer. I tried switching to the full tokenizer, but strangely it gave me much worse results. I suspect there may be a bug somewhere in the embedding alignment, because when I instead averaged the subword embeddings myself while computing the embeddings, the drop in performance went away (i.e. the embedding file I create is already word-level, so I don't have to rely on your code to do the averaging).
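Roughly what I did, as a minimal sketch with pytorch-pretrained-bert (the model name, layer choice, and function name are just illustrative, not my exact script): tokenize each pre-tokenized word with the wordpiece tokenizer, keep a piece-to-word map, then average the piece vectors per word.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-cased')
model.eval()

def word_level_embeddings(words, layer_index=-1):
    """One vector per pre-tokenized word: the mean of its wordpiece vectors."""
    pieces, piece_to_word = ['[CLS]'], [None]
    for i, word in enumerate(words):
        wps = tokenizer.wordpiece_tokenizer.tokenize(word)
        pieces.extend(wps)
        piece_to_word.extend([i] * len(wps))
    pieces.append('[SEP]')
    piece_to_word.append(None)

    ids = torch.tensor([tokenizer.convert_tokens_to_ids(pieces)])
    with torch.no_grad():
        layers, _ = model(ids)          # list of (1, num_pieces, hidden) tensors, one per layer
    hidden = layers[layer_index][0]     # (num_pieces, hidden)

    word_vectors = []
    for i in range(len(words)):
        idxs = [j for j, w in enumerate(piece_to_word) if w == i]
        word_vectors.append(hidden[idxs].mean(dim=0))
    return torch.stack(word_vectors)    # (num_words, hidden)
```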

By the way, it would be nice to mention in the README.md that the ELMo config files let you use word-level (instead of subword-level) embeddings. This is useful because if you pre-compute word-level embeddings (rather than saving subword-level ones) when you prepare your BERT embedding file and just use the ELMo config files, you don't have to waste time aligning embeddings every time you start a new experiment.

Also, I think a better workflow might be to generate one embedding file per BERT layer, instead of one big file with all the layers plus many config files each specifying a different layer index; a sketch of what I mean is below. When I tried the original workflow on a bigger dataset (the Czech UD treebank) I ran into out-of-memory issues, and it also makes preparing the data for training really slow.
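For example, something like this with h5py (the file layout assumed here, sentence-index keys mapping to (num_layers, num_tokens, dim) arrays, is illustrative and may not match the repo's exact format):

```python
import h5py

def split_by_layer(all_layers_path, out_template, num_layers):
    """Split one all-layers HDF5 file into one file per layer."""
    with h5py.File(all_layers_path, 'r') as fin:
        outs = [h5py.File(out_template.format(layer=l), 'w') for l in range(num_layers)]
        try:
            for key in fin.keys():            # assumed: one dataset per sentence
                arr = fin[key][...]           # (num_layers, num_tokens, dim)
                for l in range(num_layers):
                    outs[l].create_dataset(key, data=arr[l])
        finally:
            for f in outs:
                f.close()

# e.g. split_by_layer('ptb_train.bert-large.hdf5',
#                     'ptb_train.bert-large.layer{layer}.hdf5', 24)
```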

@john-hewitt
Owner

Hi; thanks for your notes!

The decision to use only the wordpiece tokenizer is an assumption baked into the alignment function; it explicitly uses the "##" wordpiece marker when deciding how alignments are made, so I'm not surprised the full tokenizer gives bad results. Also, because the treebanks are pre-tokenized, I don't think running pytorch-pretrained-bert's simple splitting tokenizer over already-tokenized text makes much sense.
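To illustrate the mismatch (a rough example, not the exact code in data.py; the exact pieces depend on the vocab):

```python
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
words = ["It", "doesn't", "converge", "."]

# Wordpiece tokenizer applied per pre-tokenized word: "##" marks exactly the
# continuation pieces, so the piece-to-word alignment is recoverable.
per_word = [tokenizer.wordpiece_tokenizer.tokenize(w) for w in words]

# Full tokenizer re-does basic tokenization (punctuation splitting, etc.)
# without adding "##", so the "##"-based aligner over-counts words.
full = tokenizer.tokenize(" ".join(words))

print(per_word)  # e.g. [['It'], ['doesn', "##'", '##t'], ['converge'], ['.']]
print(full)      # e.g. ['It', 'doesn', "'", 't', 'converge', '.']
```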

However, there's also the problem that the treebank tokenization doesn't exactly match what you'd get by natively tokenizing the original string, so BERT is dealing with somewhat out-of-domain tokenization either way. Ah well.

Yes, ELMo provides token embeddings, not subword embeddings; I'll add a note that people can use that route to handle arbitrary token-level embeddings! I agree that alignment takes an annoyingly long time; I suppose I didn't pre-align in case I wanted to change the alignment logic during development, but then I never did.

Hmm, yes, there is a lot of IO time from loading everything off disk, so saving each layer in a different file would make sense; we do already have a separate config for each layer. Unfortunately, though, this may not help the RAM issues, since we don't keep the other layers around once we've loaded each one. Maybe the hdf5 file tries to keep everything in RAM while it's open? I doubt it, though, since the BERT-large embeddings for PTB train are >100GB, which is more RAM than I devote to each job.
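For what it's worth, h5py reads lazily, so indexing a single layer per sentence should only pull that slice into memory (a quick sketch; the key/shape layout is assumed for illustration):

```python
import h5py

layer_index = 16
with h5py.File('ptb_train.bert-large.hdf5', 'r') as f:
    for key in f.keys():                          # assumed: one dataset per sentence
        single_layer = f[key][layer_index, :, :]  # (num_tokens, dim); only this slice is read
        # ...hand single_layer to the data loader...
```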

@chiehminwei
Author

Using the "##" assumption to do alignment breaks down on a language like Chinese (which is what I tried), where tokenization is character-level. Interestingly, I got slightly higher numbers (Spearman, root accuracy, etc.) on English when I used the full tokenizer and did the averaging myself.

I guess you're right about the RAM issue. And I'll still need many config files anyway, one per layer (changing the dataset directory instead of the layer index), but it should speed up IO a bit.

In any case, really nice work! Thanks a lot.
