TfIdfFeaturizer
This featurizer is a sparse featurizer. It builds on the scikit-learn implementation to convert text into sparse features that take the frequency of words into account. If we were to feed the direct count data directly to a classifier very frequent terms might shadow the frequencies of rarer, but potentially more interesting words.
Configurable Variables¶
- analyzer: determines how tokens are split. possible choices are
word
,char
andchar_wb
, default isword
. - min_ngram: the lower boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
- max_ngram: the upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
Base Usage¶
The configuration file below demonstrates how you might use the TfIdfFeaturizer featurizer.
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
analyzer: char_wb
min_ngram: 1
max_ngram: 4
- name: rasa_nlu_examples.featurizers.sparse.TfIdfFeaturizer
min_ngram: 1
max_ngram: 2
- name: DIETClassifier
epochs: 100