• Machine Learning Note


    <Chapter1-Chapter10, to be updated> 

    Chapter11 - Advice for Applying Machine Learning

    1) What should we do when a machine learning system does not work as expected?

    • A diagnostic tells you which options are likely to help and which are not, i.e. what to do next. The following items describe these diagnostics.

    2) Evaluating a hypothesis: split the data set into a training set (70%) and a test set (30%), then look at the test error.
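    A minimal sketch of this split, assuming a feature matrix X and target vector y are already loaded (placeholder data is used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder data; substitute any real X (features) and y (targets).
X, y = np.random.rand(100, 3), np.random.rand(100)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Test error: how well the hypothesis generalizes to unseen data.
test_error = mean_squared_error(y_test, model.predict(X_test))
print(test_error)
```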

    3) Model selection problem: how to choose the degree of the polynomial, the size of the regularization parameter lambda, etc.

    • Fit the parameters of each candidate model on the training set, evaluate it on the validation set, and choose the model with the lowest validation error.
    • Note: use the validation set to evaluate and compare models, not the test set. Reason: the test set is reserved for estimating the generalization error (previous section).
    • data set -> training/validation/test set (60%/20%/20%). 
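    A small sketch of this model selection procedure, choosing the polynomial degree by the lowest validation error (the data here is a placeholder; any real X and y can be substituted):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = np.random.rand(200, 1), np.random.rand(200)   # placeholder data

# 60% training / 20% validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_val_error = None, np.inf
for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    val_error = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    if val_error < best_val_error:
        best_degree, best_val_error = degree, val_error

# Report the generalization error of the chosen model on the untouched test set.
poly = PolynomialFeatures(best_degree)
model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
test_error = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
print(best_degree, test_error)
```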

     4) Diagnosing bias and variance: determine whether your hypothesis is suffering from underfitting (high bias) or overfitting (high variance).

    • high bias: training error and validation error are both high.
    • high variance: training error is low, but validation error is high.

    5) Regularization and bias/variance - choosing lambda.
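    A minimal sketch of choosing lambda, reusing the training/validation split from the sketch above and scikit-learn's Ridge, whose alpha parameter plays the role of lambda:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_val, y_val are assumed to come from the 60/20/20 split above.
lambdas = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]

val_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)   # alpha plays the role of lambda
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

best_lambda = lambdas[int(np.argmin(val_errors))]    # lowest validation error wins
print(best_lambda)
```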

    6) Learning curve - how the error changes as the number of training examples increases.

    • Good model vs. high bias vs. high variance (getting more training examples helps in the high-variance case).
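    A sketch of plotting a learning curve, assuming X_train, y_train, X_val, y_val from the split above. High bias shows both curves flattening out at a high error; high variance shows a persistent gap between them:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_val, y_val are assumed to exist (see the split above).
sizes = range(2, len(X_train) + 1, 5)
train_errors, val_errors = [], []
for m in sizes:
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(list(sizes), train_errors, label="training error")
plt.plot(list(sizes), val_errors, label="validation error")
plt.xlabel("number of training examples")
plt.ylabel("error")
plt.legend()
plt.show()
```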

    7) Revisit - Summary

     NN: a large network that overfits plus regularization usually works better than a small network that underfits. The cost is computational expense. 

    Chapter12 Machine learning system design

    1, The recommended approach to solving machine learning problems is to:

    • Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
    • Plot learning curves to decide if more data, more features, etc. are likely to help.
    • Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

    It is difficult to tell in advance which of the options will be most helpful; a good way to decide is error analysis.

    2, Error analysis 

    3,  Handling skewed classes

    What skewed classes are: one class has many more examples than the other, as in the cancer-classification example.

    Why this matters: a classifier that always outputs the majority class can still achieve high accuracy.

    A different evaluation metric: precision/recall. Precision = TP / (TP + FP), recall = TP / (TP + FN).

    4, In many applications, we want to control the trade-off between precision and recall.

    The way to control it: adjust the threshold of the classification algorithm.

    Is there a way to choose the threshold automatically? A related question: how do we compare different precision/recall pairs?

    • The answer is the F1 score, F1 = 2 * P * R / (P + R): a single number that combines precision and recall.
    • Choose the threshold that gives the highest F1 score.
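    A minimal sketch of choosing the threshold by F1 score on the cross validation set; y_val (true labels) and probs (predicted probabilities from any classifier) are assumed to exist:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# y_val: true labels on the cross validation set; probs: predicted probabilities.
best_threshold, best_f1 = 0.5, 0.0
for threshold in np.arange(0.05, 1.0, 0.05):
    _, _, f1 = precision_recall_f1(y_val, (probs >= threshold).astype(int))
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1
print(best_threshold, best_f1)
```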

    5, The importance of large data: in the lecture's comparison, very different algorithms all keep improving as the training-set size grows, so the amount of data can matter more than the choice of algorithm.

    Why is that? - The large-data rationale.

     To make good use of a large data set, two things need to hold: 

    • We train an algorithm with a large number of parameters (so that it can learn a complex function).
    • The features contain sufficient information to predict the result correctly.

    Chapter13 Clustering - An Unsupervised Learning Algorithm 

    1, K-means algorithm.

    Steps (repeated until convergence): 1) Cluster assignment; 2) Move centroid.
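    A sketch of one iteration of these two steps in plain numpy; kmeans_step is a name chosen here for illustration:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: cluster assignment, then move centroid."""
    # Cluster assignment: index of the closest centroid for each example.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignments = np.argmin(distances, axis=1)
    # Move centroid: mean of the points assigned to each cluster.
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[assignments == k]
        if len(members) > 0:               # keep the old centroid if a cluster is empty
            new_centroids[k] = members.mean(axis=0)
    return assignments, new_centroids
```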

     2, Optimization objective - the cost function, also called the distortion function (distortion of the K-means algorithm): J = (1/m) * sum_i ||x(i) - mu_c(i)||^2, the average squared distance from each example to the centroid of its assigned cluster.

    Note what each of the two steps does to this cost: the cluster-assignment step minimizes J with respect to the assignments c(i) (holding the centroids fixed), and the move-centroid step minimizes J with respect to the centroids mu_k (holding the assignments fixed).

    What the optimization objective can be used for:

    • Checking whether K-means is working correctly (the cost J should never increase between iterations).
    • Helping to avoid local optima (by comparing the cost across different random initializations, see below).

    3, Random initialization

    • The recommended way: randomly pick K training examples and set the initial centroids equal to those examples.
    • What can be done if K-means gets stuck in a local optimum:

      • Run random initialization many times (sketched below). 

      • Choose the run that gives the lowest cost.
      • Note: when the number of clusters is small, a few tries (2-5) usually give good results.
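    A sketch of random initialization with multiple restarts, reusing kmeans_step from the sketch above and keeping the run with the lowest distortion:

```python
import numpy as np

def distortion(X, centroids, assignments):
    # Average squared distance from each example to its assigned centroid.
    return np.mean(np.sum((X - centroids[assignments]) ** 2, axis=1))

def kmeans_with_restarts(X, K, n_restarts=50, n_iters=100):
    best = None
    for _ in range(n_restarts):
        # Recommended random initialization: pick K distinct training examples.
        centroids = X[np.random.choice(len(X), K, replace=False)]
        for _ in range(n_iters):
            assignments, centroids = kmeans_step(X, centroids)  # from the sketch above
        cost = distortion(X, centroids, assignments)
        if best is None or cost < best[0]:
            best = (cost, centroids, assignments)
    return best
```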

    4, Choosing the number of the clusters

    • Mostly, it is chosen by hand (human insight).
    • Sometimes the elbow method works; otherwise, choose the number based on the downstream purpose.
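    A small sketch of the elbow method using scikit-learn's KMeans (X is assumed to be the data matrix); look for the K where the curve bends:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Plot distortion against K and look for an "elbow".
ks = range(1, 11)
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), distortions, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("distortion")
plt.show()
```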

    Chapter14 Dimensionality Reduction

    1, Motivation: data compression and visualization.

    • Data visualization: reduce the dimension from N (N > 3) dimensions to k dimensions (k <= 3).
    • Save memory and speed up the algorithms.

    2, The most popular tool/algorithm for dimensionality reduction: PCA (Principal Component Analysis). 

    • What PCA is: PCA finds a lower-dimensional surface onto which to project the data so that the sum of squared projection errors (the distances from each point to its projection) is minimized.
    • Difference from linear regression: PCA minimizes the orthogonal projection error and treats all features symmetrically; linear regression minimizes the vertical error in predicting y.

    3, Steps of the PCA algorithm
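    A sketch of these steps in numpy: mean normalization, covariance matrix, SVD, and projection onto the first k components (X is assumed to be an m x n data matrix with no constant features; k = 2 is just an example value):

```python
import numpy as np

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # feature normalization

Sigma = (X_norm.T @ X_norm) / len(X_norm)       # covariance matrix (n x n)
U, S, _ = np.linalg.svd(Sigma)                  # singular value decomposition

k = 2                                           # target dimension (example value)
U_reduce = U[:, :k]                             # first k principal components
Z = X_norm @ U_reduce                           # projected data: m x k
```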

    4, Reconstruction from compressed representation: x_approx = U_reduce * z maps the k-dimensional z back to an approximation of the original x.

    5, How to choose K?

    • Choose the smallest k that retains at least 99% of the variance, i.e. the average squared projection error divided by the total variation in the data is <= 0.01.
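    Using the singular values S from the SVD in the sketch above, the smallest such k can be found like this:

```python
import numpy as np

# Fraction of variance retained by the first k components, for every k.
retained = np.cumsum(S) / np.sum(S)
k = int(np.argmax(retained >= 0.99)) + 1   # smallest k retaining >= 99% of the variance
```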

     6, Applying PCA

    • For speeding up learning algorithms (e.g. supervised learning such as linear regression) and for saving memory.
    • Bad use case of PCA: to avoid overfitting.
      • Why? Because PCA throws away dimensions of your data without looking at the values of y, it may also throw away valuable information. The right way to address overfitting is regularization.
    • Do not use PCA until you have to (try the original data first).

     Chapter15 Anomaly Detection

    1, What is anomaly detection? - A very common concept used in my work.

    • We define a "model" p(x) that tells us the probability the example is not anomalous.
    • We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not.

    2, Gaussian distribution and anomaly detection algorithms.

    • Gaussian distribution

    • Evaluating an anomaly detection system and determining the threshold (ϵ): 
      • To evaluate our learning algorithm, we take some labeled data, categorized into anomalous and non-anomalous examples.
      • Split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets.
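    A minimal sketch of the whole procedure, assuming X_train (non-anomalous examples), X_cv and y_cv (labeled CV set, y = 1 for anomalies) from the split above, and reusing the precision_recall_f1 helper from the skewed-classes section:

```python
import numpy as np

def fit_gaussian(X_train):
    # Estimate a per-feature Gaussian from (assumed non-anomalous) training data.
    return X_train.mean(axis=0), X_train.var(axis=0)

def p(X, mu, var):
    # Product of independent univariate Gaussian densities over the features.
    return np.prod(np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var), axis=1)

mu, var = fit_gaussian(X_train)
p_cv = p(X_cv, mu, var)

# Pick epsilon by the best F1 score on the cross validation set.
best_eps, best_f1 = 0.0, 0.0
for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
    preds = (p_cv < eps).astype(int)            # flag as anomaly when p(x) < epsilon
    _, _, f1 = precision_recall_f1(y_cv, preds)  # helper from the F1 section above
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
```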

    3, Anomaly Detection vs. supervised learning - how to choose?

    • Cases with a large number of both positive and negative examples - supervised learning.
    • Cases with very few positive examples - anomaly detection.

     4, Choosing what features to use.

    • Check whether each feature looks Gaussian: if yes, use it as-is; if not, transform it (e.g. take a log or a power) so that it looks more Gaussian (see the sketch after this list).
    • One common problem is when p(x) is similar for both types of examples (large for both positive and negative cases).
    • Error analysis for anomaly detection:
      • Choose features that might take on unusually large or small values in the event of an anomaly.
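    A small sketch of checking and transforming a single feature x (assumed to be a non-negative, right-skewed column):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.hist(x, bins=50)            # inspect the original histogram
plt.show()

x_transformed = np.log(x + 1)   # try log(x + c), x**0.5, x**0.1, ... until it looks Gaussian
plt.hist(x_transformed, bins=50)
plt.show()
```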

    5, Multivariate Gaussian distribution - for the case where the assumption of independent features does not hold; it uses a full covariance matrix so correlations between features are captured.

    A very good note: https://www.cnblogs.com/newbyang/p/10338697.html 

    <Chapter16, skipped for now>

     Chapter17 Large Scale Machine Learning

    1, Guideline of this chapter:

    Why large scale dataset: higher accuracy.

    The problem of large scale: computational load.

    What large-scale machine learning needs: cheaper algorithms, or other ways to save computation time.

    • Two ideas: stochastic gradient descent, and map reduce.

     2, Stochastic gradient descent

    • Concept: shuffle the training set, then update the parameters using the gradient of a single example at a time, instead of summing over all m examples per step (see the sketch after this list).

    • A derived variant: mini-batch gradient descent.
      • Concept: use b examples per update (e.g. b = 10), sitting between batch GD (all m examples per step) and stochastic GD (1 example per step).
      • It can be faster than both batch GD and stochastic GD:

      • Faster than batch GD: straightforward, since each step is much cheaper.

      • Faster than stochastic GD: with an appropriate vectorized implementation, mini-batch GD can take advantage of batched computation.  
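    A sketch of both ideas for linear regression; batch_size = 1 gives stochastic gradient descent, and a larger batch_size (e.g. 10) gives mini-batch gradient descent that can exploit vectorization:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10, batch_size=1):
    """Linear-regression SGD; batch_size > 1 gives mini-batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)                  # shuffle the training examples
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)    # gradient on this (mini-)batch
            theta -= alpha * grad
    return theta
```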

    3, Stochastic gradient descent convergence: plot the cost averaged over the last ~1000 examples and check that it is (noisily) trending down; if it is too noisy or diverging, decrease the learning rate, optionally letting it shrink over time.

    4,Online learning 

    • Online learning means: for a deployed model, each time a new example arrives we update the parameters with that example and then discard it. Because new data keeps arriving, we never update the parameters with the whole data set at once.
    • The algorithm is really very similar to stochastic gradient descent; instead of scanning through a fixed training set, we get one example from a user, learn from that example, then discard it and move on.
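    A minimal sketch of this idea as a logistic-regression learner over a stream of examples; example_stream is a hypothetical source of (features, label) pairs:

```python
import numpy as np

def online_logistic_learner(example_stream, n_features, alpha=0.01):
    """Update theta from one example at a time, then discard the example."""
    theta = np.zeros(n_features)
    for x, y in example_stream:                  # e.g. (features, click/no-click) pairs
        h = 1.0 / (1.0 + np.exp(-x @ theta))     # sigmoid hypothesis
        theta -= alpha * (h - y) * x             # one gradient step on this example only
    return theta
```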

     5, Map-Reduce

    • Split the tasks/data set across multiple machines, or across multiple cores of one machine, to speed up the computation.
    • Note: when parallelizing with a multi-core machine, some numerical linear algebra libraries can automatically parallelize their linear algebra operations across multiple cores within the machine.
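    A conceptual sketch of the map-reduce idea for a linear-regression gradient, splitting the summation across worker processes on one machine:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # "Map" step: each worker computes the gradient sum over its own slice of the data.
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient(X, y, theta, n_workers=4):
    chunks = [(Xc, yc, theta) for Xc, yc in zip(np.array_split(X, n_workers),
                                                np.array_split(y, n_workers))]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient, chunks)
    return sum(partials) / len(X)                # "Reduce" step: combine and average
```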

     Chapter18 Photo OCR

    A good note on Photo OCR: http://www.fanyeong.com/2017/08/08/machine-learning-photo-ocr/

    More details:

     1, Pipeline

    • A pipeline is a system with several components, several of which are themselves machine learning problems.
    • A typical Photo OCR pipeline:
      • text detection -> character segmentation -> character classification.
      • text detection: see the next.
      • character segmentation: a supervised learning method; the classifier decides whether a sliding-window patch can be split down the middle (i.e., whether it shows the gap between two characters). 
      • character classification: as learned in the previous chapters; the method could be a NN, an SVM, etc.

    2, Text detection method: a supervised learning (sliding-window) method.

    • Try different sliding-window sizes with the sliding-window classifier (see the sketch after this list).
      • Resize (re-sample) the patches to a fixed size as the data set before feeding them into the sliding window classifier.
    • Mark out the regions classified as containing text, then expand each marked region to include nearby pixels (an expansion step).
    • Filter out regions with a strange aspect ratio. Then, done!
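    A sketch of the multi-scale sliding-window detector; classifier and resize are assumed helpers (a trained text/no-text classifier and a patch re-sampler such as skimage.transform.resize):

```python
import numpy as np

def sliding_window_detect(image, classifier, resize,
                          window=(32, 32), step=8, scales=(1.0, 1.5, 2.0)):
    """Slide windows of several sizes over the image and collect text detections.

    `classifier(patch)` returns 1 if the fixed-size patch contains text;
    `resize(patch, shape)` re-samples a patch to the classifier's input size.
    Both are assumed helpers, not part of the lecture's code.
    """
    hits = []
    for scale in scales:
        h, w = int(window[0] * scale), int(window[1] * scale)
        for row in range(0, image.shape[0] - h + 1, step):
            for col in range(0, image.shape[1] - w + 1, step):
                patch = resize(image[row:row + h, col:col + w], window)
                if classifier(patch) == 1:
                    hits.append((row, col, h, w))   # top-left corner and window size
    return hits
```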

    3, Artificial data synthesis

    • Creating data from scratch.
    • Creating data by adding distortions.

    4, Ceiling analysis: decide how to spend your resources.
