fastText provides pretrained word embeddings for 157 languages, trained on both Common Crawl and Wikipedia. You can download the embeddings from the fastText website. Note that this featurizer is a dense featurizer. Beware that these embedding files tend to be big: about 6-7 GB each.
In order to use this tool you'll need to ensure the correct dependencies are installed.
pip install "rasa_nlu_examples[fasttext] @ git+https://github.com/RasaHQ/rasa-nlu-examples.git"
- cache_dir: pass it the name of the directory where you've downloaded the embeddings
- file: pass it the name of the file that contains the word embeddings
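To make the two settings concrete, here is a minimal sketch of how they could be combined to load the embeddings directly with the `fasttext` Python package. It assumes that the embedding file lives at `cache_dir/file`; the helper names `embedding_path` and `load_vectors` are hypothetical and are not part of rasa_nlu_examples.

```python
from pathlib import Path


def embedding_path(cache_dir: str, file: str) -> Path:
    """Join the two settings into a single path (assumed layout: cache_dir/file)."""
    return Path(cache_dir) / file


def load_vectors(cache_dir: str, file: str):
    """Check that the embedding file exists, then load it with fasttext."""
    path = embedding_path(cache_dir, file)
    if not path.is_file():
        raise FileNotFoundError(f"No embedding file found at {path}")
    # Imported late so the path check above also works without fasttext installed.
    import fasttext

    return fasttext.load_model(str(path))
```

With the Dutch embeddings in place, `load_vectors("downloaded/beforehand", "cc.nl.300.bin")` should return a model whose `get_word_vector` method gives one dense vector per word.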
The configuration file below demonstrates how you might use the fasttext embeddings. In this example
we're building a pipeline for the Dutch language and we're assuming that the embeddings have been
downloaded beforehand and saved in the directory given as cache_dir.
language: nl
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.FastTextFeaturizer
  cache_dir: downloaded/beforehand
  file: cc.nl.300.bin
- name: DIETClassifier
  epochs: 100
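Because the embedding files are large and downloaded separately, it can help to verify the configuration before starting a long training run. The sketch below is one way to do that, with the pipeline written out as a plain Python dict rather than parsed from YAML. The function `missing_embedding_files` is a hypothetical helper, and it assumes the file is expected at `cache_dir/file`.

```python
from pathlib import Path

FASTTEXT_NAME = "rasa_nlu_examples.featurizers.dense.FastTextFeaturizer"

# The same settings as the configuration file, written as a Python dict.
config = {
    "language": "nl",
    "pipeline": [
        {"name": "WhitespaceTokenizer"},
        {"name": "LexicalSyntacticFeaturizer"},
        {
            "name": "CountVectorsFeaturizer",
            "analyzer": "char_wb",
            "min_ngram": 1,
            "max_ngram": 4,
        },
        {
            "name": FASTTEXT_NAME,
            "cache_dir": "downloaded/beforehand",
            "file": "cc.nl.300.bin",
        },
        {"name": "DIETClassifier", "epochs": 100},
    ],
}


def missing_embedding_files(config: dict) -> list:
    """Return the embedding paths that fastText pipeline steps point at but that do not exist."""
    missing = []
    for step in config.get("pipeline", []):
        if step.get("name") != FASTTEXT_NAME:
            continue  # only fastText featurizer steps carry cache_dir/file
        path = Path(step["cache_dir"]) / step["file"]
        if not path.is_file():
            missing.append(path)
    return missing
```

An empty list from `missing_embedding_files(config)` means every fastText step points at a file that exists on disk.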