TfIdfFeaturizer
This featurizer is a sparse featurizer. It builds on the scikit-learn implementation to convert text into sparse features that take the frequency of words into account. If we fed the raw count data directly to a classifier, very frequent terms might overshadow the frequencies of rarer but potentially more interesting words.
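The reweighting idea can be sketched in plain Python. This is only an illustration, not the featurizer's actual code: the `tfidf` helper and the example sentences are hypothetical, tokens are split on whitespace, and the formula assumed here is the smoothed idf with L2 normalisation that scikit-learn's `TfidfVectorizer` uses by default.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get a larger value.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalise so that every document vector has unit length.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({t: v / norm for t, v in raw.items()})
    return weighted


docs = [
    "i want to book a flight",
    "i want to cancel a flight",
    "i want to order a pizza",
]
weights = tfidf(docs)
# "want" occurs in every sentence, "pizza" in only one,
# so "pizza" ends up with the larger weight in the third sentence.
print(weights[2]["pizza"] > weights[2]["want"])
```

With raw counts, "want" and "pizza" would both contribute a count of 1 to the third sentence. After reweighting, the term that appears in fewer documents carries more weight.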
Configurable Variables
- analyzer: determines how tokens are split. Possible choices are word, char and char_wb; the default is word.
- min_ngram: the lower boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
- max_ngram: the upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted.
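To make the three analyzer choices and the n-gram bounds more concrete, here is a simplified sketch in plain Python. The `ngrams` function is hypothetical and only approximates what scikit-learn does internally (for example, it splits on whitespace and ignores scikit-learn's token pattern and its handling of very short words). It also assumes that min_ngram and max_ngram act as an inclusive range, as scikit-learn's `ngram_range` does.

```python
def ngrams(text, analyzer="word", min_ngram=1, max_ngram=1):
    """Extract n-grams under the three analyzer modes (simplified sketch)."""
    text = " ".join(text.lower().split())  # normalise whitespace
    out = []
    if analyzer == "word":
        # n-grams are sequences of whole tokens.
        tokens = text.split()
        for n in range(min_ngram, max_ngram + 1):
            out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    elif analyzer == "char":
        # n-grams are character windows over the whole text, spaces included.
        for n in range(min_ngram, max_ngram + 1):
            out += [text[i:i + n] for i in range(len(text) - n + 1)]
    elif analyzer == "char_wb":
        # Pad each word with spaces so no n-gram spans a word boundary.
        for word in text.split():
            padded = f" {word} "
            for n in range(min_ngram, max_ngram + 1):
                out += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    else:
        raise ValueError(f"unknown analyzer: {analyzer}")
    return out


print(ngrams("book flight", "word", 1, 2))
print(ngrams("hi you", "char", 3, 3))
print(ngrams("hi you", "char_wb", 3, 3))
```

The difference between char and char_wb shows up at word edges: char produces windows such as "i y" that straddle two words, while char_wb only produces windows inside a single padded word.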
 
Base Usage
The configuration file below demonstrates how you might use the TfIdfFeaturizer in a pipeline.
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.sparse.TfIdfFeaturizer
  min_ngram: 1
  max_ngram: 2
- name: DIETClassifier
  epochs: 100