Clustering
K-means:
The basic idea: first randomly pick K points (one per desired cluster), then assign ("color") each training example by the nearest of those points; whichever point an example is closest to determines its cluster. Then compute the mean of each cluster and move its centroid to that mean. Repeat the coloring and mean-update steps until the clustering converges.
![](https://images2018.cnblogs.com/blog/1336977/201809/1336977-20180909105030043-1469479833.png)
![](https://www.evernote.com/shard/s334/res/0f8b1a4e-c2b7-4392-bec3-83152c405b0f.png?search=mean)
![](https://img2020.cnblogs.com/blog/1336977/202004/1336977-20200424160755638-269751883.png)
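A minimal MATLAB/Octave sketch of the loop described above. The function name `run_kmeans` and the choice to initialize centroids by sampling K training examples are my own illustrative assumptions, not code from the course:

```matlab
function [centroids, idx] = run_kmeans(X, K, max_iters)
  % X: m x n data matrix, K: number of clusters
  m = size(X, 1);
  % initialize centroids to K randomly chosen training examples
  perm = randperm(m);
  centroids = X(perm(1:K), :);
  for iter = 1:max_iters
    % assignment step: "color" each example by its nearest centroid
    idx = zeros(m, 1);
    for i = 1:m
      dists = sum(bsxfun(@minus, centroids, X(i, :)).^2, 2);
      [~, idx(i)] = min(dists);
    end
    % update step: move each centroid to the mean of its points
    for k = 1:K
      pts = X(idx == k, :);
      if ~isempty(pts)
        centroids(k, :) = mean(pts, 1);
      end
    end
  end
end
```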
To keep K-means from converging to a local optimum, run it multiple times with different random initializations and keep the run that yields the lowest cost J.
![](https://www.evernote.com/shard/s334/res/e5c5d778-7282-4948-afd0-36f6acbd9143.png?search=mean)
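A sketch of the random-restart idea, assuming the `run_kmeans` function from the sketch above; the restart count of 100 is just an example:

```matlab
best_J = Inf;
for t = 1:100   % e.g. 100 random initializations
  [centroids, idx] = run_kmeans(X, K, 10);
  % distortion J: mean squared distance from each point to its assigned centroid
  J = mean(sum((X - centroids(idx, :)).^2, 2));
  if J < best_J   % keep the clustering with the lowest cost J
    best_J = J;
    best_centroids = centroids;
    best_idx = idx;
  end
end
```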
One way to choose K is the elbow method: plot the cost J against K and pick the K at the "elbow", where the curve's decrease levels off.
![](https://www.evernote.com/shard/s334/res/6b22a298-ba9d-4a78-942b-ed2a304838fe.png?search=mean)
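A sketch of the elbow plot, again assuming the `run_kmeans` function from above (in practice each K should also use random restarts):

```matlab
% try K = 1..10 and plot the distortion J against K;
% the "elbow" where the curve flattens suggests a reasonable K
Ks = 1:10;
Js = zeros(size(Ks));
for k = Ks
  [centroids, idx] = run_kmeans(X, k, 10);
  Js(k) = mean(sum((X - centroids(idx, :)).^2, 2));
end
plot(Ks, Js, '-o');
xlabel('K (number of clusters)');
ylabel('cost J');
```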
Dimensionality Reduction
Dimensionality reduction has two main uses: 1. data compression, to save memory/disk space and speed up computation; 2. visualization, by reducing the data to 2 or 3 dimensions so it can be plotted.
![](https://www.evernote.com/shard/s334/res/7b907dfd-7792-4971-b66d-848c0dbc0d7b.png?search=mean)
The most commonly used dimensionality-reduction algorithm is PCA (Principal Component Analysis).
![](https://www.evernote.com/shard/s334/res/c56f577f-914b-44b4-8492-df8873bc6752.png?search=mean)
![](https://www.evernote.com/shard/s334/res/37537556-4efe-4f2a-9135-4256011d04c5.png?search=mean)
The first step of the PCA algorithm is data preprocessing: mean normalization, plus feature scaling if the features are on very different scales.
![](https://www.evernote.com/shard/s334/res/8a3ea808-6caa-4de9-b0b5-60571f9b8555.png?search=mean)
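A sketch of the preprocessing step (variable names `mu`, `sigma`, `X_norm` are my own; `bsxfun` is used so it runs on older MATLAB as well as Octave):

```matlab
mu = mean(X);                              % 1 x n vector of feature means
X_norm = bsxfun(@minus, X, mu);            % mean normalization
sigma = std(X_norm);                       % optional feature scaling, when
X_norm = bsxfun(@rdivide, X_norm, sigma);  % features have different ranges
```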
PCA algorithm in MATLAB:
![](https://images2018.cnblogs.com/blog/1336977/201809/1336977-20180910222456591-1884245424.png)
![](https://www.evernote.com/shard/s334/res/8341ca1d-12b4-4641-b43e-c7ed7b3c2a6e.png?search=mean)
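For reference as text, a sketch of the SVD-based recipe shown in the screenshots above (X is assumed m x n and already mean-normalized; k is the target dimension):

```matlab
m = size(X, 1);               % number of training examples
k = 100;                      % e.g. compress 1000-dim data down to 100-dim
Sigma = (1/m) * (X' * X);     % n x n covariance matrix
[U, S, V] = svd(Sigma);       % columns of U are the principal components
Ureduce = U(:, 1:k);          % keep the top k components
Z = X * Ureduce;              % project: m x k compressed representation
```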
How to decompress back from the 100-dimensional representation to the original 1000-dimensional space
![](https://www.evernote.com/shard/s334/res/ee120fd5-1705-400a-815f-fb2e85ed621a.png?search=mean)
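The reconstruction is just the reverse projection, reusing `Z` and `Ureduce` from the sketch above:

```matlab
% map back from k dimensions to the original n dimensions;
% only an approximation, since the discarded components are lost
X_approx = Z * Ureduce';      % m x n approximation of the original X
```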
How to choose the parameter K
![](https://www.evernote.com/shard/s334/res/2eb09783-d4a0-484b-b100-c350676061b2.png?search=mean)
![](https://images2018.cnblogs.com/blog/1336977/201809/1336977-20180910223936967-548571484.png)
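A sketch of the variance-retained rule for picking k, using the singular values `S` returned by the `svd` call in the PCA sketch above (the 0.99 threshold is the usual "99% of variance retained" example):

```matlab
sv = diag(S);                 % singular values of the covariance matrix
for k = 1:length(sv)
  retained = sum(sv(1:k)) / sum(sv);
  if retained >= 0.99         % smallest k keeping 99% of the variance
    break;
  end
end
```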
Advice for using PCA: PCA is often used for data compression and visualization. It is a bad idea to use it to try to prevent overfitting; use regularization instead.