• Integrating Hub with a scikit-learn MNIST example


    The scikit-learn MNIST example

    https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html

    The following example uses logistic regression to build a handwritten-digit classifier on the MNIST data.

    Here we fit a multinomial logistic regression with L1 penalty on a subset of the MNIST digits classification task. We use the SAGA algorithm for this purpose: this is a solver that is fast when the number of samples is significantly larger than the number of features and is able to finely optimize non-smooth objective functions, which is the case with the l1-penalty. Test accuracy reaches > 0.8, while the weight vectors remain sparse and therefore more easily interpretable.

    Note that the accuracy of this l1-penalized linear model is significantly below what can be reached by an l2-penalized linear model or a non-linear multi-layer perceptron model on this dataset.

    [Figure: "Classification vector for..." — the per-class weight images plotted by the code below]

    Out:

    Sparsity with L1 penalty: 79.95%
    Test score with L1 penalty: 0.8322
    Example run in 36.916 s
    
     
    import time
    import matplotlib.pyplot as plt
    import numpy as np
    
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import check_random_state
    
    print(__doc__)
    
    # Author: Arthur Mensch <arthur.mensch@m4x.org>
    # License: BSD 3 clause
    
    # Turn down for faster convergence
    t0 = time.time()
    train_samples = 5000
    
    # Load data from https://www.openml.org/d/554
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    
    random_state = check_random_state(0)
    permutation = random_state.permutation(X.shape[0])
    X = X[permutation]
    y = y[permutation]
    X = X.reshape((X.shape[0], -1))
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_samples, test_size=10000)
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Turn up tolerance for faster convergence
    clf = LogisticRegression(
        C=50. / train_samples, penalty='l1', solver='saga', tol=0.1
    )
    clf.fit(X_train, y_train)
    sparsity = np.mean(clf.coef_ == 0) * 100
    score = clf.score(X_test, y_test)
    # print('Best C % .4f' % clf.C_)
    print("Sparsity with L1 penalty: %.2f%%" % sparsity)
    print("Test score with L1 penalty: %.4f" % score)
    
    coef = clf.coef_.copy()
    plt.figure(figsize=(10, 5))
    scale = np.abs(coef).max()
    for i in range(10):
        l1_plot = plt.subplot(2, 5, i + 1)
        l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                       cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
        l1_plot.set_xticks(())
        l1_plot.set_yticks(())
        l1_plot.set_xlabel('Class %i' % i)
    plt.suptitle('Classification vector for...')
    
    run_time = time.time() - t0
    print('Example run in %.3f s' % run_time)
    plt.show()

    Changing the data source

    In the example above, the data is fetched from the OpenML site, like this:

    # Load data from https://www.openml.org/d/554
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

    Exploring the data source

    We need to replace this part with loading from Hub; the dataset is named activeloop/mnist.

    Enter activeloop/mnist in the search box on the following site to find this dataset:

    https://app.activeloop.ai/datasets/popular

    Click the "Visualize" button in the upper-right corner to view the dataset's metadata:

    The dataset contains 70,000 entries in total.

    It has two fields: image and label.

    Clicking on each field shows its shape.

    On the right you can preview the training images.
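
    The same information can also be checked from Python instead of the web UI. A minimal sketch, using only the Dataset and compute calls shown in the next section (the label column is small enough to materialize in full just to count the samples):

    from hub import Dataset

    mnist = Dataset("activeloop/mnist")    # lazy handle; nothing is downloaded yet

    # the 70,000 labels are tiny, so computing them all is a cheap way to count samples
    labels = mnist["label"].compute()
    print(labels.shape)                    # (70000,)

    # a single image is enough to confirm the per-sample shape shown by Visualize
    first_image = mnist["image"][0:1].compute()
    print(first_image.shape)               # (1, 28, 28, 1)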

    Loading the data source

    https://github.com/activeloopai/hub#access-public-data-fast

    from hub import Dataset
    from pprint import pprint
    
    mnist = Dataset("activeloop/mnist")  # loading the MNIST data lazily
    # saving time with *compute* to retrieve just the necessary data
    
    print(type(mnist))
    print(dir(mnist))
    #pprint(mnist.__dict__)
    #
    print("-----------------------------------")
    
    
    data = mnist["image"][0:1000].compute()
    
    print(dir(data))
    
    print(type(data))
    print(data.shape)
    print(data[0])
    
    
    data = mnist["label"][0:1000].compute()
    
    print(dir(data))
    
    print(type(data))
    print(data.shape)
    print(data[0])

    Output

    <class 'hub.api.dataset.Dataset'>
    -----------------------------------
    <class 'numpy.ndarray'>
    (1000, 28, 28, 1)
    <class 'numpy.ndarray'>
    (1000,)

    If you do not want to limit the slice by index, i.e. you want to load all of the data, simply drop the index:

    X = mnist["image"].compute()
    y = mnist["label"].compute()

    https://docs.activeloop.ai/en/latest/concepts/dataset.html#load-the-data

    The docs also show another interface for loading data; judging by the API description, it matches compute, so the two should be functionally identical.

    ds = Dataset('username/basic')
    
    # Use .numpy() to get the numpy array of the element
    print(ds["image"][0].numpy())
    print(ds["label", 100:110].numpy())

    Looking at the source code confirms this guess.

    The two names reflect two different perspectives:

    1) compute: from the lazy-loading perspective, compute is the interface that actually materializes the data.

    2) numpy: from the user's perspective, the numpy-like data fetched from the remote store is converted into the numpy arrays that the user, and the model, need.

        def compute(self, label_name=False):
            """Gets the value from different tensorview objects in the datasetview schema
    
            Parameters
            ----------
            label_name: bool, optional
                If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names
                instead of the label encoded integers, otherwise this parameter is ignored.
            """
            return self.numpy(label_name=label_name)
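
    Since compute simply forwards to numpy, the two calls below should return identical arrays. A quick sanity check, built only from the slicing calls already shown above:

    a = mnist["image"][0:10].compute()
    b = mnist["image"][0:10].numpy()
    assert (a == b).all()   # same data, two names for the same operation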

    Result of the modification

    https://github.com/fanqingsong/code_snippet/blob/master/machine_learning/hub/mnist_sklearn.py

    """
    Test mnist learning on sklearn with hub data
    """
    
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import check_random_state
    
    from hub import Dataset
    
    print(__doc__)
    
    # Author: Arthur Mensch <arthur.mensch@m4x.org>
    # License: BSD 3 clause
    
    # Turn down for faster convergence
    train_samples = 5000
    
    print("loading mnist data from hub...")
    mnist = Dataset("activeloop/mnist")  # loading the MNIST data lazily
    
    X = mnist["image"].compute()
    y = mnist["label"].compute()
    
    print("------- X.shape", X.shape)
    print("------- y.shape", y.shape)
    
    
    print("now train mnist model...")
    
    random_state = check_random_state(0)
    permutation = random_state.permutation(X.shape[0])
    
    print("---------- permutation")
    print(permutation)
    
    X = X[permutation]
    y = y[permutation]
    
    print("------- X.shape", X.shape)
    print("------- y.shape", y.shape)
    
    X = X.reshape((X.shape[0], -1))
    
    print("------- X.shape", X.shape)
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_samples, test_size=10000)
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Turn up tolerance for faster convergence
    clf = LogisticRegression(
        C=50. / train_samples, penalty='l1', solver='saga', tol=0.1
    )
    clf.fit(X_train, y_train)
    
    sparsity = np.mean(clf.coef_ == 0) * 100
    score = clf.score(X_test, y_test)
    
    # print('Best C % .4f' % clf.C_)
    print("Sparsity with L1 penalty: %.2f%%" % sparsity)
    print("Test score with L1 penalty: %.4f" % score)

    Output

    root@xxx:~/win10/mine/code-snippet/machine_learning/hub# python mnist_sklearn.py

    Test mnist learning on sklearn with hub data

    loading mnist data from hub...
    ------- X.shape (70000, 28, 28, 1)
    ------- y.shape (70000,)
    now train mnist model...
    ---------- permutation
    [10840 56267 14849 ... 42613 43567 68268]
    ------- X.shape (70000, 28, 28, 1)
    ------- y.shape (70000,)
    ------- X.shape (70000, 784)
    Sparsity with L1 penalty: 82.72%
    Test score with L1 penalty: 0.8358

    Here, reshape is used to flatten the multi-dimensional image data to one dimension per sample:

    X = X.reshape((X.shape[0], -1))

    Original shape:

    ------- X.shape (70000, 28, 28, 1)

    Shape after reshaping:

    ------- X.shape (70000, 784)
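
    The same flattening can be verified on a toy array; the -1 tells numpy to infer the remaining dimension (28 * 28 * 1 = 784):

    import numpy as np

    X = np.zeros((5, 28, 28, 1))            # five fake MNIST-style images
    X_flat = X.reshape((X.shape[0], -1))    # keep the sample axis, flatten everything else
    print(X_flat.shape)                     # (5, 784)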

    Follow-up experiments

    (1) Install the OpenCV tools and convert arbitrary handwritten images into data that matches the MNIST format.

    Reference:

    https://www.cnblogs.com/ningskyer/articles/7606174.html

    import cv2
    import numpy as np
    from fractions import Fraction

    def image2Digit(image):
        # Resize to 8x8 (the feature size of sklearn's digits training data);
        # scipy.misc.imresize has been removed from SciPy, so cv2.resize is used instead
        im_resized = cv2.resize(image, (8, 8))
        # Convert RGB (3 channels) to grayscale (1 channel)
        im_gray = cv2.cvtColor(im_resized, cv2.COLOR_BGR2GRAY)
        # Rescale pixel values to 0-16 (the range used by the digits training data): 16/255
        im_hex = Fraction(16, 255) * im_gray
        # Invert the image (the digits training data is white digits on a black background)
        im_reverse = 16 - im_hex
        return im_reverse.astype(int)
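
    The snippet above targets scikit-learn's 8x8 digits format (pixel values 0-16). For the 28x28 MNIST format used in this post, a rough sketch along the same lines (the function name image_to_mnist is my own, and a real model may need extra preprocessing such as centering the digit):

    import cv2
    import numpy as np

    def image_to_mnist(image_bgr):
        """Convert an arbitrary BGR image to a 28x28 MNIST-style array."""
        im_gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)   # drop the colour channels
        im_small = cv2.resize(im_gray, (28, 28))                # MNIST resolution
        im_reverse = 255 - im_small                             # MNIST digits are white on black
        return im_reverse.astype(np.uint8)                      # pixel range 0-255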

    (2) Integrating with deep-learning frameworks

    https://docs.activeloop.ai/en/latest/concepts/dataset.html#convert-to-pytorch

    ds = ds.to_pytorch()
    ds = torch.utils.data.DataLoader(
        ds,
        batch_size=8,
        num_workers=2,
    )
    
    # Iterate over the data
    for batch in ds:
        print(batch["image"], batch["label"])

    https://docs.activeloop.ai/en/latest/concepts/dataset.html#convert-to-tensorflow

    ds = ds.to_tensorflow().batch(8)
    
    # Iterate over the data
    for batch in ds:
        print(batch["image"], batch["label"])

    https://github.com/JaimeTang/PyTorch-and-mnist/blob/master/pytorch-and-mnist.ipynb

    https://github.com/wxs/keras-mnist-tutorial/blob/master/MNIST%20in%20Keras.ipynb
