diff --git a/README.md b/README.md
index da8f792a0..14ee8077e 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ Label: NotNextSentence
 ```
 
 We then train a large model (12-layer to 24-layer Transformer) on a large corpus
-(Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M
+(Wikipedia + [BookCorpus](https://paperswithcode.com/dataset/bookcorpus)) for a long time (1M
 update steps), and that's BERT.
 
 Using BERT has two stages: *Pre-training* and *fine-tuning*.