def get_features_by_wordbag(): global max_features x_train, x_test, y_train, y_test=load_all_files() vectorizer = CountVectorizer( decode_error='ignore', strip_accents='ascii', max_features=max_features, stop_words='english', max_df=1.0, min_df=1 ) print vectorizer x_train=vectorizer.fit_transform(x_train) x_train=x_train.toarray() vocabulary=vectorizer.vocabulary_ vectorizer = CountVectorizer( decode_error='ignore', strip_accents='ascii', vocabulary=vocabulary, stop_words='english', max_df=1.0, min_df=1 ) print vectorizer x_test=vectorizer.fit_transform(x_test) x_test=x_test.toarray() return x_train, x_test, y_train, y_test
词袋模型示例:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
... ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)