This page discusses some properties of the GensimFeaturizer. Note that this featurizer is a dense featurizer.
Gensim is a popular Python library that makes it relatively easy to train your own word vectors. This can be useful if your corpus is very different from what most popular embeddings are trained on. We'll give a short guide on how to train your own embeddings here, but you can also read the guide in the gensim docs.
Training Your Own
Training your own gensim model can be done in a few lines of code. A demonstration is shown below.
```python
from gensim.models import Word2Vec

# Gensim needs a list of lists to represent tokens in a document.
# In real life you'd read a text file and turn it into lists here.
text = ["this is a sentence", "so is this", "and we're all talking"]
tokens = [t.split(" ") for t in text]

# This is where we train new word embeddings.
# Note: gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`.
model = Word2Vec(sentences=tokens, size=10, window=3,
                 min_count=1, iter=5, workers=2)

# This is where they are saved to disk.
model.wv.save("wordvectors.kv")
```
The wordvectors.kv file should contain all the vectors that you've trained. This is the
file that you pass on to this component.
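
The training example above uses a hard-coded list of sentences. As a minimal sketch of the "read a text file" step, the helper below turns a plain-text corpus into the list of token lists that gensim expects. It assumes one sentence per line; the name read_tokens is hypothetical and not part of gensim or this library. Unlike the example above, it splits on any run of whitespace rather than on single spaces.

```python
from pathlib import Path


def read_tokens(path):
    """Return one list of whitespace-separated tokens per non-empty line."""
    tokens = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            words = line.split()
            if words:  # skip blank lines
                tokens.append(words)
    return tokens
```

The returned list can be passed as the sentences argument in the training example above.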
- cache_dir: pass it the name of the directory where you've downloaded/saved the embeddings
- file: pass it the name of the .kv file that contains the word embeddings
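
Before training a pipeline, it can help to confirm that the two settings point at an existing file. The sketch below assumes the component looks for file inside cache_dir; that is an assumption about how the two settings combine, and resolve_embeddings is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def resolve_embeddings(cache_dir, file):
    """Join cache_dir and file, and fail early if the result is not a file."""
    # Assumption: the embeddings live at <cache_dir>/<file>.
    path = Path(cache_dir) / file
    if not path.is_file():
        raise FileNotFoundError(f"No embeddings file found at {path}")
    return path
```

For example, resolve_embeddings("saved/beforehand", "filename.kv") returns the joined path if the file exists and raises FileNotFoundError otherwise.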
The configuration file below demonstrates how you might use the gensim embeddings. In this example
we're building a pipeline for the English language and we're assuming that you've trained your
own embeddings, which have been saved upfront as saved/beforehand/filename.kv.
```yaml
language: en

pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.GensimFeaturizer
  cache_dir: saved/beforehand
  file: filename.kv
- name: DIETClassifier
  epochs: 100
```