【2】Using Faiss, Facebook's Large-Scale Search Library

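Faiss is built on a handful of basic algorithms, each with a highly efficient implementation: k-means clustering, PCA, and PQ encoding/decoding.

Clustering

To cluster a set of vectors stored in a given 2-D tensor x, proceed as follows: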
```
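import faiss

# x is assumed to be a float32 NumPy array of shape (n, d)
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
# niter and verbose are passed as keyword arguments
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)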
```
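The resulting centroids are stored in kmeans.centroids. The value of the objective function at each iteration (for k-means, the total squared error) is stored in kmeans.obj.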

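After training, to map the vectors in x to their cluster centers, call D, I = kmeans.index.search(x, 1). For each row of x, I gives the index of the nearest centroid, and D holds the corresponding squared L2 distance.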
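
To make the meaning of D and I concrete, here is a brute-force NumPy sketch of the same nearest-centroid assignment. The helper name assign_to_centroids and the toy data are hypothetical; the sketch only illustrates what the search returns and is not how Faiss implements it.

```python
import numpy as np

def assign_to_centroids(x, centroids):
    """Return (D, I): squared L2 distance to, and index of, the nearest centroid."""
    # Pairwise squared distances, shape (n, k).
    diff = x[:, None, :] - centroids[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    I = dist2.argmin(axis=1)
    D = dist2[np.arange(len(x)), I]
    # Mirror the (n, 1) shape returned by search(x, 1).
    return D[:, None], I[:, None]

centroids = np.array([[0.0, 0.0], [10.0, 10.0]], dtype="float32")
x = np.array([[1.0, 0.0], [9.0, 10.0], [0.0, 2.0]], dtype="float32")
D, I = assign_to_centroids(x, centroids)
print(I.ravel())  # [0 1 0]
print(D.ravel())  # [1. 1. 4.]
```
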
To find the 15 points in x nearest to each centroid, build a new index over x:
```
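# Exact (brute-force) L2 index over the original vectors
index = faiss.IndexFlatL2(d)
index.add(x)
# For each centroid, search for its 15 nearest vectors in x
D, I = index.search(kmeans.centroids, 15)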
```
I holds, for each centroid, the indices of its nearest neighbors in x; its shape is (ncentroids, 15).

Clustering on one or more GPUs requires a small adjustment to the index object.

Computing a PCA

To reduce vectors from 40 dimensions to 10:
```
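import numpy as np
import faiss

# 1000 random 40-dimensional vectors
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# Print the energy of each output column; the values should be decreasing
print((tr ** 2).sum(0))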
```
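In C++, the corresponding method is apply. The Python wrapper names it apply_py because the name apply is considered reserved in Python (it was a built-in function in Python 2).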
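
For intuition about what the trained transform computes, here is a plain NumPy sketch of PCA via the singular value decomposition. It is not Faiss's implementation, and details such as the sign of each component may differ from PCAMatrix output; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mt = rng.random((1000, 40), dtype=np.float32)

# Center the data, then take the top 10 right singular vectors as the projection.
mean = mt.mean(axis=0)
_, s, vt = np.linalg.svd(mt - mean, full_matrices=False)
components = vt[:10]                 # shape (10, 40)
tr = (mt - mean) @ components.T      # shape (1000, 10)

# Per-column energy matches the squared singular values, so it is non-increasing.
energy = (tr ** 2).sum(axis=0)
print(energy)
```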

PQ encoding/decoding

A ProductQuantizer can be used to encode vectors into codes and decode them back:
```
import numpy as np
import faiss

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```
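
To show the idea behind these calls, below is a toy product quantizer in NumPy: each vector is split into M sub-vectors, each sub-vector is quantized with its own small k-means codebook, and the code is the list of centroid indices. Everything here (the function names, the tiny codebook size, the simple k-means loop) is an illustrative assumption, not Faiss's implementation.

```python
import numpy as np

def train_codebook(xs, k, niter, rng):
    """Plain Lloyd k-means on sub-vectors xs; returns centroids of shape (k, dsub)."""
    centroids = xs[rng.choice(len(xs), size=k, replace=False)].copy()
    for _ in range(niter):
        dist2 = ((xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist2.argmin(axis=1)
        for j in range(k):
            members = xs[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_encode(x, codebooks):
    """Encode each row of x as M centroid indices (one uint8 per sub-vector)."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.concatenate([cb[codes[:, m]] for m, cb in enumerate(codebooks)], axis=1)

rng = np.random.default_rng(0)
d, M, k = 8, 4, 16  # toy sizes; the example above uses d=32, 4 sub-quantizers, 8 bits each
xt = rng.random((2000, d), dtype=np.float32)
codebooks = [train_codebook(s, k, 10, rng) for s in np.split(xt, M, axis=1)]

x = rng.random((500, d), dtype=np.float32)
codes = pq_encode(x, codebooks)
x2 = pq_decode(codes, codebooks)
print(codes.shape, codes.dtype)  # (500, 4) uint8

# Relative reconstruction error, computed the same way as in the Faiss example.
rel_err = ((x - x2) ** 2).sum() / (x ** 2).sum()
print(rel_err)
```

With 8 bits per sub-quantizer, as in the Faiss example, each sub-vector index fits in one byte, so a 32-dimensional float32 vector (128 bytes) is stored in 4 bytes.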
Original article: https://www.cnblogs.com/imagezy/p/8328220.html
