• Understanding t-SNE (Python)


    t-SNE (t-distributed Stochastic Neighbor Embedding) is currently one of the most popular dimensionality reduction algorithms for high-dimensional data.

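    t-SNE rests on the following assumption: the datasets we observe in the real world have, in essence, a low intrinsic dimensionality, even though they are embedded in a high-dimensional space. One can go further and say that high-dimensional data reveals its essential structure more clearly once it has been reduced to a low-dimensional representation. This is also the basic idea behind manifold learning.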
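    The idea of low intrinsic dimensionality can be illustrated with a small NumPy sketch. It assumes the simplest possible case, a linear embedding of 2-D points into 64 dimensions (the sizes and the `mixing` matrix are made up for illustration); real data usually lies on a curved manifold, which is why nonlinear methods such as t-SNE are used instead of a plain rank check.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 points that truly live in 2 dimensions.
latent = rng.normal(size=(500, 2))

# Embed them linearly into a 64-dimensional space (hypothetical mixing matrix).
mixing = rng.normal(size=(2, 64))
high_dim = latent @ mixing

# Singular values of the centered data show how many directions carry variance.
centered = high_dim - high_dim.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
intrinsic_dim = int(np.sum(singular_values > 1e-8 * singular_values[0]))

print(high_dim.shape)   # (500, 64)
print(intrinsic_dim)    # 2
```

    Although every sample has 64 coordinates, only two directions carry any variance, so a 2-D representation loses nothing in this toy case.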

    For the original paper, see: paper link (pdf)
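    As background for the walkthrough, the objective described in the t-SNE paper can be sketched in a few lines of NumPy: Gaussian similarities P in the high-dimensional space, Student-t similarities Q in the low-dimensional space, and the KL divergence between them as the cost. This is only a sketch under a loud assumption: it uses one fixed `sigma` for all points, whereas t-SNE tunes a bandwidth per point from the perplexity. The function names are illustrative and are not sklearn's internals.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def gaussian_joint(X, sigma=1.0):
    """High-dimensional similarities P, using one fixed sigma (a simplification)."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)              # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)        # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * X.shape[0])    # symmetrized p_ij

def student_t_joint(Y):
    """Low-dimensional similarities Q from a Student-t kernel (1 degree of freedom)."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 10))   # hypothetical high-dimensional data
Y_demo = rng.normal(size=(50, 2))    # an arbitrary, untrained 2-D layout

P = gaussian_joint(X_demo)
Q = student_t_joint(Y_demo)
print(P.sum(), Q.sum())        # both are 1 up to rounding error
print(kl_divergence(P, Q))     # the cost t-SNE would lower by moving Y_demo
```

    t-SNE minimizes this cost by gradient descent on the low-dimensional coordinates; the heavy tails of the Student-t kernel let dissimilar points sit far apart in the map without a large penalty.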

    1. Simulation with sklearn

    • Import the required libraries:

      import numpy as np
      from numpy import linalg
      from numpy.linalg import norm
      from scipy.spatial.distance import squareform, pdist
      
      
      # We import sklearn.
      
      import sklearn
      from sklearn.manifold import TSNE
      from sklearn.datasets import load_digits
      from sklearn.preprocessing import scale
      
      
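      # We'll hack a bit with the t-SNE code in sklearn 0.15.2.
      # Note: the underscore-prefixed names are private and tied to that release;
      # these imports may fail on other sklearn versions, and nothing in this
      # section uses them.
      
      from sklearn.metrics.pairwise import pairwise_distances
      from sklearn.manifold.t_sne import (_joint_probabilities,
                                          _kl_divergence)
      from sklearn.utils.extmath import _ravel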
      
      # Random state.
      
      RS = 20150101
      
      
      # We'll use matplotlib for graphics.
      
      import matplotlib.pyplot as plt
      import matplotlib.patheffects as PathEffects
      import matplotlib
      # IPython/Jupyter magic; remove this line when running as a plain script.
      %matplotlib inline
      
      
      # We import seaborn to make nice plots.
      
      import seaborn as sns
      sns.set_style('darkgrid')
      sns.set_palette('muted')
      sns.set_context("notebook", font_scale=1.5,
                      rc={"lines.linewidth": 2.5})
      
      
      # We'll generate an animation with matplotlib and moviepy.
      
      from moviepy.video.io.bindings import mplfig_to_npimage
      import moviepy.editor as mpy
    • Load the dataset

      digits = load_digits()
              # digits.data.shape => (1797, 64)
    • Call the t-SNE class from the sklearn toolbox

      X = np.vstack([digits.data[digits.target==i]
                     for i in range(10)])
      y = np.hstack([digits.target[digits.target==i]
                     for i in range(10)])
      digits_proj = TSNE(random_state=RS).fit_transform(X)
              # digits_proj: ndarray of shape (1797, 2)
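      The `np.vstack` / `np.hstack` lines above reorder the samples so that all 0s come first, then all 1s, and so on. A minimal sketch of what that does, using small made-up arrays in place of `digits.data` and `digits.target`:

```python
import numpy as np

# Toy stand-ins for digits.data and digits.target (hypothetical values).
data = np.arange(12).reshape(6, 2)
target = np.array([2, 0, 1, 0, 2, 1])

# Same pattern as above: stack the rows of class 0, then class 1, then class 2.
X = np.vstack([data[target == i] for i in range(3)])
y = np.hstack([target[target == i] for i in range(3)])

# Equivalent one-liner: a stable sort keeps the original order inside each class.
order = np.argsort(target, kind="stable")

print(y)                                # [0 0 1 1 2 2]
print(np.array_equal(X, data[order]))   # True
```

      The reordering does not change which points are embedded; it only makes rows and labels contiguous per class, which is convenient for plotting.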
    • Visualization

      def scatter(x, colors):
          # We choose a color palette with seaborn.
          palette = np.array(sns.color_palette("hls", 10))
      
          # We create a scatter plot.
          f = plt.figure(figsize=(8, 8))
          ax = plt.subplot(aspect='equal')
          sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40,
                          c=palette[colors.astype(int)])
          plt.xlim(-25, 25)
          plt.ylim(-25, 25)
          ax.axis('off')
          ax.axis('tight')
      
          # We add the labels for each digit.
          txts = []
          for i in range(10):
              # Position of each label.
              xtext, ytext = np.median(x[colors == i, :], axis=0)
              txt = ax.text(xtext, ytext, str(i), fontsize=24)
              txt.set_path_effects([
                  PathEffects.Stroke(linewidth=5, foreground="w"),
                  PathEffects.Normal()])
              txts.append(txt)
      
          return f, ax, sc, txts
      scatter(digits_proj, y)
      plt.savefig('images/digits_tsne-generated.png', dpi=120)
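      The `scatter` function places each digit label at the per-class median of the embedded points. A small sketch of why the median is a reasonable choice, using made-up 2-D points in which one class contains a stray outlier:

```python
import numpy as np

# Hypothetical 2-D embedding: class 0 contains one far-away outlier.
points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [50.0, 50.0],   # class 0
                   [10.0, -10.0], [11.0, -11.0], [12.0, -12.0]])        # class 1
labels = np.array([0, 0, 0, 0, 1, 1, 1])

medians = {}
means = {}
for i in range(2):
    cluster = points[labels == i]
    medians[i] = np.median(cluster, axis=0)   # what scatter() uses for the label
    means[i] = cluster.mean(axis=0)           # shown only for comparison

print(medians[0], means[0])   # [1.5 1.5] [13.25 13.25]
print(medians[1], means[1])   # [ 11. -11.] [ 11. -11.]
```

      For class 0 the mean is dragged toward the outlier, while the median stays inside the bulk of the cluster, so the label lands on the group it describes.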

    An illustrated introduction to the t-SNE algorithm

  • Original article: https://www.cnblogs.com/mtcnn/p/9423129.html