HashingFeaturizer
This page discusses some properties of the `HashingFeaturizer`. Note that this featurizer is a sparse featurizer.
The featurizer is a wrapper around scikit-learn's `HashingVectorizer`. It uses the "hashing trick" to transform input texts into sparse vectors by mapping each token to a column index using a fixed hash function. The featurizer has no state and cannot be trained. For a small number of columns (defined by the `n_features` parameter), hash collisions become more likely, meaning that two different words can get mapped to the same index.
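To make the collision behaviour concrete, here is a minimal sketch that uses scikit-learn's `HashingVectorizer` (the class this featurizer wraps) directly, with a deliberately tiny feature space:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Tiny feature space so collisions are easy to provoke.
vectorizer = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)

X = vectorizer.transform(["hello world"])
print(X.shape)         # (1, 8): one sparse row vector
print(X.nonzero()[1])  # column indices the tokens were hashed to;
                       # if both tokens share an index, they collided
```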
Configurable Variables
- `n_features`: the number of columns the input is mapped to.
- `analyzer`: determines how tokens are split. Possible choices are `word`, `char` and `char_wb`.
- `lowercase`: convert input strings to lowercase.
- `strip_accents`: remove accents using one of the methods `ascii` or `unicode`.
- `stop_words`: filter tokens by a list of stop words.
- `min_ngram`: the lower boundary of the range of n-values for the word or character n-grams to be extracted.
- `max_ngram`: the upper boundary of the range of n-values for the word or character n-grams to be extracted.
- `norm`: the normalization applied to each row vector (options are `l1`, `l2` or `null`).
- `binary`: if `True`, all non-zero elements are set to `1` instead of absolute counts.
- `alternate_sign`: alternate the sign of the hashed features in order to reduce the effect of hash collisions.
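These variables largely mirror the parameters of scikit-learn's `HashingVectorizer`, with `min_ngram` and `max_ngram` folded into its `ngram_range` tuple. A sketch of that mapping, assuming you call the underlying vectorizer directly:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Sketch: the configurable variables expressed as HashingVectorizer
# parameters; min_ngram/max_ngram become the ngram_range tuple.
vectorizer = HashingVectorizer(
    n_features=1024,
    analyzer="char_wb",       # character n-grams padded at word boundaries
    lowercase=True,
    strip_accents="unicode",
    ngram_range=(2, 3),       # min_ngram=2, max_ngram=3
    norm="l2",
    binary=True,              # non-zero counts become 1
    alternate_sign=False,
)

X = vectorizer.transform(["Hello wörld"])
print(X.nnz)  # number of non-zero entries in the sparse row
```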
Base Usage
The configuration file below demonstrates how you might use the hashing featurizer.
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: rasa_nlu_examples.featurizers.sparse.HashingFeaturizer
  n_features: 1024
- name: DIETClassifier
  epochs: 100
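As with any Rasa pipeline, you can place this configuration in your project's `config.yml` and train it with `rasa train nlu`.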
Combining several hashing featurizers
In order to use a combination of several hash functions, multiple `HashingFeaturizer` instances can be added to the pipeline. However, note that since the hash function is deterministic, one needs to set a different `n_features` value for each instance. Otherwise one would end up with the same sparse vector being concatenated multiple times. See the discussion here of how this, in combination with the `DIETClassifier`, relates to Bloom embeddings.
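The determinism is easy to verify with scikit-learn directly: two vectorizers with the same `n_features` produce identical vectors, while a different `n_features` changes the modulus of the hash and therefore the indices. A minimal sketch:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Same n_features: the fixed hash yields identical sparse vectors, so
# concatenating two such featurizers would add no information.
a = HashingVectorizer(n_features=1024, norm=None).transform(["hello world"])
b = HashingVectorizer(n_features=1024, norm=None).transform(["hello world"])
print((a != b).nnz == 0)  # True: the two vectors are identical

# Different n_features: tokens land on different indices, so the
# concatenated vectors carry complementary information. This is roughly
# the Bloom-embedding idea: a token is identified by the combination of
# its indices across several hash spaces.
c = HashingVectorizer(n_features=1021, norm=None).transform(["hello world"])
print(c.nonzero()[1])  # indices differ from those in the 1024-column space
```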
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: rasa_nlu_examples.featurizers.sparse.HashingFeaturizer
  n_features: 1021
- name: rasa_nlu_examples.featurizers.sparse.HashingFeaturizer
  n_features: 1022
- name: rasa_nlu_examples.featurizers.sparse.HashingFeaturizer
  n_features: 1023
- name: rasa_nlu_examples.featurizers.sparse.HashingFeaturizer
  n_features: 1024
- name: DIETClassifier
  epochs: 100