• Machine Learning Note


    <Chapter1-Chapter10, to be updated> 

    Chapter11 - Advice for Applying Machine Learning

    1) What should we do when a machine learning system does not work as expected?

    • A diagnostic tells you which options are likely to help and which are not, i.e. what to do next. The following items describe these diagnostics.

    2) Evaluating a hypothesis: split the data set into a training set (70%) and a test set (30%), then look at the test error.
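    A minimal sketch of this split, assuming a feature matrix X and target vector y are already loaded (placeholder data is used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data; substitute any real X (features) and y (targets).
X, y = np.random.rand(100, 3), np.random.rand(100)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Test error: how well the hypothesis generalizes to unseen data.
test_error = mean_squared_error(y_test, model.predict(X_test))
print(test_error)
```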

    3) Model selection problem: how to choose the degree of the polynomial, the size of the regularization parameter lambda, etc.

    • Fit the parameters of each candidate model on the training set, evaluate it on the validation set, and choose the model with the lowest validation error.
    • Note: use the validation set to evaluate and compare models, not the test set. Reason: the test set is reserved for estimating the generalization error (previous section).
    • data set -> training/validation/test set (60%/20%/20%). 
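    A small sketch of this model selection procedure, choosing the polynomial degree by the lowest validation error (the data here is a placeholder; any real X and y can be substituted):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = np.random.rand(200, 1), np.random.rand(200)   # placeholder data

# 60% training / 20% validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_val_error = None, np.inf
for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    val_error = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    if val_error < best_val_error:
        best_degree, best_val_error = degree, val_error

# Report the generalization error of the chosen model on the untouched test set.
poly = PolynomialFeatures(best_degree)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
test_error = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
print(best_degree, test_error)
```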

     4) Diagnosing bias and variance: determine whether your hypothesis is suffering from underfitting (high bias) or overfitting (high variance).

    • high bias: training error and validation error are both high.
    • high variance: training error is low, but validation error is high.

    5) Regularization and bias/variance - choosing lambda.
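    A minimal sketch of choosing lambda, reusing the training/validation split from the sketch above and scikit-learn's Ridge, whose alpha parameter plays the role of lambda:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_val, y_val are assumed to come from the 60/20/20 split above.
lambdas = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

val_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)   # alpha plays the role of lambda
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

best_lambda = lambdas[int(np.argmin(val_errors))]    # lowest validation error wins
print(best_lambda)
```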

    6) Learning curve - how the error changes as the number of training examples increases.

    • Good model vs. high bias vs. high variance (getting more training examples helps in the high-variance case).
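    A sketch of plotting a learning curve, assuming X_train, y_train, X_val, y_val from the split above. High bias shows both curves flattening out at a high error; high variance shows a persistent gap between them:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_val, y_val are assumed to exist (see the split above).
sizes = range(2, len(X_train) + 1, 5)
train_errors, val_errors = [], []
for m in sizes:
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(list(sizes), train_errors, label="training error")
plt.plot(list(sizes), val_errors, label="validation error")
plt.xlabel("number of training examples")
plt.ylabel("error")
plt.legend()
plt.show()
```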

    7) Revisit - Summary

     NN: a large network that overfits plus regularization usually works better than a small network that underfits. The cost is computational expense. 

    Chapter12 Machine learning system design

    1, The recommended approach to solving machine learning problems is to:

    • Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
    • Plot learning curves to decide if more data, more features, etc. are likely to help.
    • Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

    It is difficult to tell in advance which of the options will be most helpful; a good way to decide is error analysis.

    2, Error analysis 

    3,  Handling skewed classes

    What skewed classes are: one class has many more examples than the other, as in the cancer-classification example.

    Why this matters: a classifier that always outputs the majority class can still achieve high accuracy.

    A different evaluation metric: precision/recall. Precision = TP / (TP + FP), recall = TP / (TP + FN).

    4, In many applications, we want to control the trade-off between precision and recall.

    The way to control it: adjust the threshold of the classification algorithm.

    Is there a way to choose the threshold automatically? A related question: how do we compare different precision/recall pairs?

    • The answer is the F1 score, F1 = 2 * P * R / (P + R): a single number that combines precision and recall.
    • Choose the threshold that gives the highest F1 score.
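    A minimal sketch of choosing the threshold by F1 score on the cross validation set; y_val (true labels) and probs (predicted probabilities from any classifier) are assumed to exist:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# y_val: true labels on the cross validation set; probs: predicted probabilities.
best_threshold, best_f1 = 0.5, 0.0
for threshold in np.arange(0.05, 1.0, 0.05):
    _, _, f1 = precision_recall_f1(y_val, (probs >= threshold).astype(int))
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1
print(best_threshold, best_f1)
```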

    5, The importance of large data: in the lecture's comparison, very different algorithms all keep improving as the training-set size grows, so the amount of data can matter more than the choice of algorithm.

    Why is that? - The large-data rationale.

     To make good use of a large data set, two things need to hold: 

    • We train an algorithm with a large number of parameters (so that it can learn a complex function).
    • The features contain sufficient information to predict the result correctly.

    Chapter13 Clustering - An Unsupervised Learning Algorithm 

    1, K-means algorithm.

    Steps (repeated until convergence): 1) Cluster assignment; 2) Move centroid.
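    A sketch of one iteration of these two steps in plain numpy; kmeans_step is a name chosen here for illustration:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: cluster assignment, then move centroid."""
    # Cluster assignment: index of the closest centroid for each example.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(distances, axis=1)
    # Move centroid: mean of the points assigned to each cluster.
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[assignments == k]
        if len(members) > 0:               # keep the old centroid if a cluster is empty
            new_centroids[k] = members.mean(axis=0)
    return assignments, new_centroids
```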

     2, Optimization objective - the cost function, also called the distortion function (distortion of the K-means algorithm): J = (1/m) * sum_i ||x(i) - mu_c(i)||^2, the average squared distance from each example to the centroid of its assigned cluster.

    Note what each of the two steps does to this cost: the cluster-assignment step minimizes J with respect to the assignments c(i) (holding the centroids fixed), and the move-centroid step minimizes J with respect to the centroids mu_k (holding the assignments fixed).

    What the optimization objective can be used for:

    • Checking whether K-means is working correctly (the cost J should never increase between iterations).
    • Helping to avoid local optima (by comparing the cost across different random initializations, see below).

    3, Random initialization

    • The recommended way: randomly pick K training examples and set the initial centroids equal to those examples.
    • What can be done if K-means gets stuck in a local optimum:

      • Run random initialization many times (sketched below). 

      • Choose the run that gives the lowest cost.
      • Note: when the number of clusters is small, a few tries (2-5) usually give good results.
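    A sketch of random initialization with multiple restarts, reusing kmeans_step from the sketch above and keeping the run with the lowest distortion:

```python
import numpy as np

def distortion(X, centroids, assignments):
    # Average squared distance from each example to its assigned centroid.
    return np.mean(np.sum((X - centroids[assignments]) ** 2, axis=1))

def kmeans_with_restarts(X, K, n_restarts=50, n_iters=100):
    best = None
    for _ in range(n_restarts):
        # Recommended random initialization: pick K distinct training examples.
        centroids = X[np.random.choice(len(X), K, replace=False)]
        for _ in range(n_iters):
            assignments, centroids = kmeans_step(X, centroids)  # from the sketch above
        cost = distortion(X, centroids, assignments)
        if best is None or cost < best[0]:
            best = (cost, centroids, assignments)
    return best
```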

    4, Choosing the number of the clusters

    • Mostly, it is chosen by hand (human insight).
    • Sometimes the elbow method works; otherwise, choose the number based on the downstream purpose.
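    A small sketch of the elbow method using scikit-learn's KMeans (X is assumed to be the data matrix); look for the K where the curve bends:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Plot distortion against K and look for an "elbow".
ks = range(1, 11)
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), distortions, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("distortion")
plt.show()
```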

    Chapter14 Dimensionality Reduction

    1, Motivation: data compression and visualization.

    • Data visualization: reduce the dimension from N (N > 3) dimensions to k dimensions (k <= 3).
    • Save memory and speed up the algorithms.

    2, The most popular tool/algorithm for dimensionality reduction: PCA (Principal Component Analysis). 

    • What PCA is: PCA finds a lower-dimensional surface onto which to project the data so that the sum of squared projection errors (the distances from each point to its projection) is minimized.
    • Difference from linear regression: PCA minimizes the orthogonal projection error and treats all features symmetrically; linear regression minimizes the vertical error in predicting y.

    3, Steps of the PCA algorithm
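    A sketch of these steps in numpy: mean normalization, covariance matrix, SVD, and projection onto the first k components (X is assumed to be an m x n data matrix with no constant features; k = 2 is just an example value):

```python
import numpy as np

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # feature normalization

Sigma = (X_norm.T @ X_norm) / len(X_norm)       # covariance matrix (n x n)
U, S, _ = np.linalg.svd(Sigma)                  # singular value decomposition

k = 2                                           # target dimension (example value)
U_reduce = U[:, :k]                             # first k principal components
Z = X_norm @ U_reduce                           # projected data: m x k
```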

    4, Reconstruction from compressed representation: x_approx = U_reduce * z maps the k-dimensional z back to an approximation of the original x.

    5, How to choose K?

    • Choose the smallest k that retains at least 99% of the variance, i.e. the average squared projection error divided by the total variation in the data is <= 0.01.
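    Using the singular values S from the SVD in the sketch above, the smallest such k can be found like this:

```python
import numpy as np

# Fraction of variance retained by the first k components, for every k.
retained = np.cumsum(S) / np.sum(S)
k = int(np.argmax(retained >= 0.99)) + 1   # smallest k retaining >= 99% of the variance
```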

     6, Applying PCA

    • For speeding up learning algorithms (e.g. supervised learning such as linear regression) and for saving memory.
    • Bad use case of PCA: to avoid overfitting.
      • Why? Because PCA throws away dimensions of your data without looking at the values of y, it may also throw away valuable information. The right way to address overfitting is regularization.
    • Do not use PCA until you have to (try the original data first).

     Chapter15 Anomaly Detection

    1, What is anomaly detection? - A very common concept used in my work.

    • We define a "model" p(x) that tells us the probability the example is not anomalous.
    • We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not.

    2, Gaussian distribution and anomaly detection algorithms.

    • Gaussian distribution

    • Evaluating an anomaly detection system and determining the threshold (ϵ): 
      • To evaluate our learning algorithm, we take some labeled data, categorized into anomalous and non-anomalous examples.
      • Split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets.
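    A minimal sketch of the whole procedure, assuming X_train (non-anomalous examples), X_cv and y_cv (labeled CV set, y = 1 for anomalies) from the split above, and reusing the precision_recall_f1 helper from the skewed-classes section:

```python
import numpy as np

def fit_gaussian(X_train):
    # Estimate a per-feature Gaussian from (assumed non-anomalous) training data.
    return X_train.mean(axis=0), X_train.var(axis=0)

def p(X, mu, var):
    # Product of independent univariate Gaussian densities over the features.
    return np.prod(np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var), axis=1)

mu, var = fit_gaussian(X_train)
p_cv = p(X_cv, mu, var)

# Pick epsilon by the best F1 score on the cross validation set.
best_eps, best_f1 = 0.0, 0.0
for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
    preds = (p_cv < eps).astype(int)            # flag as anomaly when p(x) < epsilon
    _, _, f1 = precision_recall_f1(y_cv, preds)  # helper from the F1 section above
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
```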

    3, Anomaly Detection vs. supervised learning - how to choose?

    • Cases with a large number of both positive and negative examples - supervised learning.
    • Cases with very few positive examples - anomaly detection.

     4, Choosing what features to use.

    • Check whether each feature looks Gaussian: if yes, use it as-is; if not, transform it (e.g. take a log or a power) so that it looks more Gaussian (see the sketch after this list).
    • One common problem is when p(x) is similar for both types of examples (large for both positive and negative cases).
    • Error analysis for anomaly detection:
      • Choose features that might take on unusually large or small values in the event of an anomaly.
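    A small sketch of checking and transforming a single feature x (assumed to be a non-negative, right-skewed column):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.hist(x, bins=50)            # inspect the original histogram
plt.show()

x_transformed = np.log(x + 1)   # try log(x + c), x**0.5, x**0.1, ... until it looks Gaussian
plt.hist(x_transformed, bins=50)
plt.show()
```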

    5, Multivariate Gaussian distribution - for the case where the assumption of independent features does not hold; it uses a full covariance matrix so correlations between features are captured.

    A very good note: https://www.cnblogs.com/newbyang/p/10338697.html 

    <Chapter16, skipped for now>

     Chapter17 Large Scale Machine Learning

    1, Guideline of this chapter:

    Why large scale dataset: higher accuracy.

    The problem of large scale: computational load.

    What large-scale machine learning needs: cheaper algorithms, or other ways to save computation time.

    • Two ideas: stochastic gradient descent, and map reduce.

     2, Stochastic gradient descent

    • Concept: shuffle the training set, then update the parameters using the gradient of a single example at a time, instead of summing over all m examples per step (see the sketch after this list).

    • A derived variant: mini-batch gradient descent.
      • Concept: use b examples per update (e.g. b = 10), sitting between batch GD (all m examples per step) and stochastic GD (1 example per step).
      • It can be faster than both batch GD and stochastic GD:

      • Faster than batch GD: straightforward, since each step is much cheaper.

      • Faster than stochastic GD: with an appropriate vectorized implementation, mini-batch GD can take advantage of batched computation.  
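    A sketch of both ideas for linear regression; batch_size = 1 gives stochastic gradient descent, and a larger batch_size (e.g. 10) gives mini-batch gradient descent that can exploit vectorization:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10, batch_size=1):
    """Linear-regression SGD; batch_size > 1 gives mini-batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)                  # shuffle the training examples
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)    # gradient on this (mini-)batch
            theta -= alpha * grad
    return theta
```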

    3, Stochastic gradient descent convergence: plot the cost averaged over the last ~1000 examples and check that it is (noisily) trending down; if it is too noisy or diverging, decrease the learning rate, optionally letting it shrink over time.

    4,Online learning 

    • Online learning means: for a deployed model, each time a new example arrives we update the parameters with that example and then discard it. Because new data keeps arriving, we never update the parameters with the whole data set at once.
    • The algorithm is really very similar to stochastic gradient descent; instead of scanning through a fixed training set, we get one example from a user, learn from that example, then discard it and move on.
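    A minimal sketch of this idea as a logistic-regression learner over a stream of examples; example_stream is a hypothetical source of (features, label) pairs:

```python
import numpy as np

def online_logistic_learner(example_stream, n_features, alpha=0.01):
    """Update theta from one example at a time, then discard the example."""
    theta = np.zeros(n_features)
    for x, y in example_stream:                  # e.g. (features, click/no-click) pairs
        h = 1.0 / (1.0 + np.exp(-x @ theta))     # sigmoid hypothesis
        theta -= alpha * (h - y) * x             # one gradient step on this example only
    return theta
```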

     5, Map-Reduce

    • Split the tasks/data set across multiple machines, or across multiple cores of one machine, to speed up the computation.
    • Note: when parallelizing with a multi-core machine, some numerical linear algebra libraries can automatically parallelize their linear algebra operations across multiple cores within the machine.
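    A conceptual sketch of the map-reduce idea for a linear-regression gradient, splitting the summation across worker processes on one machine:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # "Map" step: each worker computes the gradient sum over its own slice of the data.
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient(X, y, theta, n_workers=4):
    chunks = [(Xc, yc, theta) for Xc, yc in zip(np.array_split(X, n_workers),
                                                np.array_split(y, n_workers))]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient, chunks)
    return sum(partials) / len(X)                # "Reduce" step: combine and average
```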

     Chapter18 Photo OCR

    A good note on Photo OCR: http://www.fanyeong.com/2017/08/08/machine-learning-photo-ocr/

    More details:

     1, Pipeline

    • A pipeline is a system with several components, several of which are themselves machine learning problems.
    • A typical Photo OCR pipeline:
      • text detection -> character segmentation -> character classification.
      • text detection: see the next.
      • character segmentation: a supervised learning method; the classifier decides whether a sliding-window patch can be split down the middle (i.e., whether it shows the gap between two characters). 
      • character classification: as learned in the previous chapters; the method could be a NN, an SVM, etc.

    2, Text detection method: a supervised learning (sliding-window) method.

    • Try different sliding-window sizes with the sliding-window classifier (see the sketch after this list).
      • Resize (re-sample) the patches to a fixed size as the data set before feeding them into the sliding window classifier.
    • Mark out the regions classified as containing text, then expand each marked region to include nearby pixels (an expansion step).
    • Filter out regions with a strange aspect ratio. Then, done!
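    A sketch of the multi-scale sliding-window detector; classifier and resize are assumed helpers (a trained text/no-text classifier and a patch re-sampler such as skimage.transform.resize):

```python
import numpy as np

def sliding_window_detect(image, classifier, resize,
                          window=(32, 32), step=8, scales=(1.0, 1.5, 2.0)):
    """Slide windows of several sizes over the image and collect text detections.

    `classifier(patch)` returns 1 if the fixed-size patch contains text;
    `resize(patch, shape)` re-samples a patch to the classifier's input size.
    Both are assumed helpers, not part of the lecture's code.
    """
    hits = []
    for scale in scales:
        h, w = int(window[0] * scale), int(window[1] * scale)
        for row in range(0, image.shape[0] - h + 1, step):
            for col in range(0, image.shape[1] - w + 1, step):
                patch = resize(image[row:row + h, col:col + w], window)
                if classifier(patch) == 1:
                    hits.append((row, col, h, w))   # top-left corner and window size
    return hits
```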

    3, Artificial data synthesis

    • Creating data from scratch.
    • Creating data by adding distortions.

    4, Ceiling analysis: decide how to spend your resources.
