This featurizer is a dense featurizer. If you're interested in learning how these embeddings work, you might appreciate reading the original article. Credit goes to Benjamin Heinzerling and Michael Strube for making them available.
Two main features of these embeddings are that they are relatively lightweight and that they are available in many languages. BytePair embeddings pretrained on Wikipedia exist for 277 languages. A multi-language variant is also available.
More information on these embeddings can be found here. When you scroll down you will notice a large list of available languages, each with a detailed view of its available vectors.
The featurizer supports the following settings:
- lang: specifies the language that you'll use, default =
- dim: specifies the dimension of the subword embeddings, default =
- vs: specifies the vocabulary size of the segmentation model, default =
- vs_fallback: if set to True and the given vocabulary size can't be loaded for the given model, the closest size is chosen, default=
- cache_dir: specifies the folder in which downloaded BPEmb files will be cached, default =
- model_file: specifies the path to a custom model file, default=
- emb_file: specifies the path to a custom embedding file, default=
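The `vs_fallback` behaviour can be pictured with a small sketch. This is not the BPEmb implementation: the function name `closest_vocab_size` and the list of available sizes are hypothetical, and resolving ties towards the smaller size is an assumption made only for this example.

```python
from typing import Sequence


def closest_vocab_size(requested: int, available: Sequence[int], *, vs_fallback: bool) -> int:
    """Pick the vocabulary size to load (illustrative, not the BPEmb source)."""
    if not available:
        raise ValueError("no vocabulary sizes available")
    if requested in available:
        # The exact size exists, so no fallback is needed.
        return requested
    if not vs_fallback:
        raise ValueError(f"vocabulary size {requested} is not available")
    # Closest by absolute difference. Sorting first makes ties resolve to the
    # smaller size, which is an assumption of this sketch.
    return min(sorted(available), key=lambda size: abs(size - requested))


# Hypothetical list of vocabulary sizes published for one language.
sizes = [1000, 3000, 5000, 10000]
print(closest_vocab_size(1200, sizes, vs_fallback=True))
```

With `vs_fallback=False` the same call raises an error instead of silently switching sizes, which may be preferable when you need reproducible features.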
The configuration file below demonstrates how you might use the BytePair embeddings. In this example we don't specify a cache folder, so the library will automatically download the correct embeddings for you and save them in `~/.cache`. Both the embeddings and a model file will be saved.
```yaml
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 1000
  dim: 25
- name: DIETClassifier
  epochs: 100
```
If you're using pre-downloaded embedding files (in Docker you might have these on a mounted disk), you can prevent the download by pointing cache_dir at the folder that holds them. The example below does that.
```yaml
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
  lang: en
  vs: 10000
  dim: 100
  cache_dir: "tests/data"
- name: DIETClassifier
  epochs: 100
```
Note that in this case we expect two files to be present under the configured `cache_dir`: the embeddings and the model file.
You can also overwrite the names of these files via the `model_file` and `emb_file` settings, but it is preferable to stick to the library naming convention. Also note that if you use the `model_file` and `emb_file` settings, you must provide full file paths and `cache_dir` will be ignored. It is still considered good practice to manually specify the `vs` parameter in this situation.
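The file lookup rules described above can be sketched in plain Python. This is an illustration, not the featurizer's code: the function `resolve_bpemb_files` is hypothetical, the file naming pattern and the per-language subfolder are assumptions about the library's convention, and requiring both custom files together is a simplification of this sketch.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_bpemb_files(
    lang: str,
    vs: int,
    dim: int,
    cache_dir: Optional[str] = None,
    model_file: Optional[str] = None,
    emb_file: Optional[str] = None,
) -> Tuple[Path, Path]:
    """Return (model_path, embedding_path) under the assumptions stated above."""
    if model_file or emb_file:
        # Simplification of this sketch: custom files must be given as a pair.
        if not (model_file and emb_file):
            raise ValueError("provide both model_file and emb_file")
        paths = Path(model_file), Path(emb_file)
        if not all(p.is_absolute() for p in paths):
            raise ValueError("model_file and emb_file must be full file paths")
        # cache_dir is ignored once custom files are given.
        return paths
    if cache_dir is None:
        raise ValueError("cache_dir is required when no custom files are given")
    # ASSUMPTION: naming pattern and per-language subfolder; verify against
    # the files you actually downloaded.
    folder = Path(cache_dir) / lang
    model = folder / f"{lang}.wiki.bpe.vs{vs}.model"
    emb = folder / f"{lang}.wiki.bpe.vs{vs}.d{dim}.w2v.bin"
    return model, emb


# Mirrors the cache_dir example above and reports whether the files exist.
for path in resolve_bpemb_files("en", vs=10000, dim=100, cache_dir="tests/data"):
    print(path, "found" if path.exists() else "missing")
```

A check like this can be run before training to confirm that a mounted folder really contains the files, so that a missing file shows up early instead of triggering a download.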