diff --git a/README.md b/README.md
index da8f792a0..14ee8077e 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ Label: NotNextSentence
 ```
 
 We then train a large model (12-layer to 24-layer Transformer) on a large corpus
-(Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M
+(Wikipedia + [BookCorpus](https://paperswithcode.com/dataset/bookcorpus)) for a long time (1M
 update steps), and that's BERT.
 
 Using BERT has two stages: *Pre-training* and *fine-tuning*.