We provide the preprocessing scripts and datasets used in our paper.
At this time we are still preparing most of the (cross-lingual) task data for public release. If you would like to receive a preliminary (undocumented) version of the data, please write us an e-mail.
As part of our work we trained word embeddings (BIVCD) and (re-)mapped others with the method described in the appendix of our paper.
- en-de word embeddings: BIVCD, AttractRepel, Fasttext (300K), Fasttext (Full)
- en-fr word embeddings: BIVCD, AttractRepel, Fasttext (300K), Fasttext (Full)
Fasttext 300K contains only the 300K most frequent tokens of each language. The full versions are mapped variants of the full pre-trained fastText embeddings. Use the full versions to reproduce our results.
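Below is a minimal sketch of how the mapped embeddings could be loaded and queried for cross-lingual nearest neighbours. It assumes the files are in plain word2vec text format (one token followed by its vector per line); the file names in the usage example are placeholders, not the actual released file names.

```python
import numpy as np

def load_vectors(path, limit=None):
    """Load a word2vec-style text file: one '<token> <floats...>' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            if len(parts) < 3:  # skip a possible "<count> <dim>" header line
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def nearest(query_vec, vectors, k=5):
    """Return the k tokens with the highest cosine similarity to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = [(float(q @ v / np.linalg.norm(v)), tok) for tok, v in vectors.items()]
    return sorted(scored, reverse=True)[:k]

# Usage (file names are placeholders):
# en = load_vectors("fasttext.en.mapped.300k.txt")
# de = load_vectors("fasttext.de.mapped.300k.txt")
# print(nearest(en["house"], de))  # German neighbours of the English word "house"
```

Because the embeddings of both languages are mapped into a shared space, nearest-neighbour queries across languages work directly on cosine similarity without any further transformation.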
We trained our cross-lingual adaptations of InferSent on (machine-)translated cross-lingual variants of SNLI:
These datasets contain SNLI with all possible language combinations of the sentence pairs (en-en, en-de, de-en, de-de) and are therefore four times as large as the original.
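The following sketch illustrates how the four language combinations can be assembled from line-aligned English and (machine-)translated German premise/hypothesis files. The file names and the tab-separated layout are assumptions for illustration, not the released format.

```python
import csv

def read_pairs(path):
    """Read tab-separated 'premise<TAB>hypothesis<TAB>label' lines."""
    with open(path, encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="\t")]

def cross_lingual_combinations(en_rows, de_rows):
    """Yield all four premise/hypothesis language combinations per example."""
    assert len(en_rows) == len(de_rows), "files must be line-aligned"
    for (p_en, h_en, label), (p_de, h_de, _) in zip(en_rows, de_rows):
        yield p_en, h_en, label  # en-en
        yield p_en, h_de, label  # en-de
        yield p_de, h_en, label  # de-en
        yield p_de, h_de, label  # de-de

# Usage (file names are placeholders):
# en = read_pairs("snli_train.en.tsv")
# de = read_pairs("snli_train.de.tsv")
# combined = list(cross_lingual_combinations(en, de))  # four times as many examples
```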
We plan to release translated SNLI corpora in additional languages (de, fr, es, ar) soon.
Please read LICENSE.txt and NOTICE.txt in the project root. We distribute derived data under the same license as the original.