The source code is available at http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 documents
4 categories

Extracting features from the training dataset using a sparse vectorizer
done in 2.980000s
n_samples: 3387, n_features: 10000

Clustering sparse data with MiniBatchKMeans(batch_size=1000, compute_labels=True, init='k-means++',
        init_size=1000, max_iter=100, max_no_improvement=10, n_clusters=4,
        n_init=1, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=False)
done in 0.514s

Homogeneity: 0.506
Completeness: 0.576
V-measure: 0.539
Adjusted Rand-Index: 0.477
Silhouette Coefficient: 0.006

Top terms per cluster:
Cluster 0: hst nasa mission jpl ___ gov baalke access orbit __
Cluster 1: space henry nasa access toronto com alaska digex pat sky
Cluster 2: god com people sandvik keith don jesus article say think
Cluster 3: graphics com university thanks posting image host nntp computer ac
1.
TfidfVectorizer
HashingVectorizer
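To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The four documents and the choice of 2**10 hash columns are assumptions for illustration only; they are not taken from the original example, which runs on the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "nasa launched a new space mission",
    "the orbit of the space shuttle",
    "graphics card renders the image",
    "image processing and computer graphics",
]

# TfidfVectorizer learns a vocabulary from the corpus during fit,
# so each column can be mapped back to a term.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)

# HashingVectorizer is stateless: each token is hashed into one of a
# fixed number of columns, so no vocabulary is stored and no fit is needed.
hasher = HashingVectorizer(
    n_features=2**10,      # assumed size, chosen small for the toy corpus
    stop_words="english",
    alternate_sign=False,  # keep all feature values non-negative
)
X_hash = hasher.transform(docs)

print("TF-IDF matrix shape: ", X_tfidf.shape)
print("Hashing matrix shape:", X_hash.shape)
print("TF-IDF vocabulary size:", len(tfidf.vocabulary_))
```

The TF-IDF matrix has one column per learned term, while the hashed matrix always has n_features columns regardless of the corpus. The trade-off is that hashed columns cannot be mapped back to terms, which is why a "top terms per cluster" listing needs the vocabulary-based vectorizer.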
2.
Two algorithms are demoed: ordinary k-means and its more scalable cousin, minibatch k-means.
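As a rough sketch of how the two estimators can be swapped for one another, the snippet below fits both on a tiny TF-IDF matrix and scores each against known labels. The eight documents, their labels, and the parameter values (n_clusters=2, n_init=5, batch_size=4, random_state=0) are assumptions chosen for this toy case; they differ from the settings shown in the output above.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import v_measure_score

# Toy corpus with two obvious topics (hypothetical, for illustration only).
docs = [
    "nasa space mission orbit",
    "space shuttle launch orbit",
    "nasa launch mission moon",
    "moon orbit space station",
    "graphics image rendering software",
    "computer graphics image format",
    "image processing software library",
    "rendering library for computer graphics",
]
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = TfidfVectorizer().fit_transform(docs)

# Both estimators share the same interface; MiniBatchKMeans updates the
# centroids from small random batches instead of the full data set.
models = [
    ("KMeans", KMeans(n_clusters=2, n_init=5, random_state=0)),
    ("MiniBatchKMeans",
     MiniBatchKMeans(n_clusters=2, n_init=5, batch_size=4, random_state=0)),
]

results = {}
for name, model in models:
    labels = model.fit_predict(X)
    results[name] = v_measure_score(true_labels, labels)
    print(f"{name}: V-measure = {results[name]:.3f}")
```

On a corpus this small the two algorithms take about the same time; the speed advantage of the minibatch variant only shows up on larger data such as the 3387 documents used in the original example.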
(To be continued)