Jordan Lecture Note-12: Kernel典型相关分析(Kernel Canonical Correlation Analysis, KCCA).

Jordan Lecture Note-12: Kernel典型相关分析(Kernel Canonical Correlation Analysis, KCCA).
Kernel典型相关分析

（一）KCCA

    同样，我们可以引入Kernel函数，通过非线性的坐标变换达到之前CCA所寻求的目标。首先，假设映射$Phi_X: x ightarrow Phi_X(x), Phi_Y: y ightarrow Phi_Y(y)$，记$mathbf{Phi_X}=(Phi_X(x_1),Phi_X(x_2),cdots,Phi_X(x_p))^prime, mathbf{Phi_Y}=(Phi_Y(y_1),Phi_Y(y_2),cdots,Phi_Y(y_q))^prime$。我们要寻找典型变量$u,v$使相关系数最大，其中$u=langle a,Phi_X(x) angle=Phi_X^prime a, v=langle b,Phi_Y(y) angle=Phi_Y^prime b$，$a,b$的维度为映射后的空间。根据上一个笔记的分析，我们应该优化如下模型：

egin{align}mathop{max}&quad a^primemathbf{Phi_X}^primemathbf{Phi_Y}b onumber\mathop{s.t.}&quad a^primemathbf{Phi_X}^primemathbf{Phi_X}a=1 onumber\&quad b^primemathbf{Phi_Y}^primemathbf{Phi_Y}b=1label{model:koriginal}end{align}

此时，如果我们直接优化上面的模型的话，就无法引进Kernel函数，因为我们凑不出$Phi_X(x)^primePhi_X(y)$这种形式。这样的话，我们就得知道映射$Phi$的具体形式。但实际上，这里的$a,b$其实是可以表示成数据$Phi_X(x_1),cdots,Phi_X(x_n)$以及数据$Phi_Y(y_1),cdots,Phi_Y(y_n)$的线性组合。原因蛮复杂的，大概是当映射后的Hilbert空间的维度很大，那么这里的$a,b$就一定在数据张成的空间里。具体可以参见一下两篇论文(Nonlinear component analysis as a kernel eigenvalue problem以及Kernel independent component analysis)。另外在KCCA刚提出的那篇论文里（A kernel method for canonical correlation analysis），没有从那么深奥的理论去解释，但他是直接从正则化的KCCA那边出发去解释的，这里也稍微说明一下。正则化的Lagrange函数为：

egin{equation}L(a,b,lambda_1,lambda_2)=a^primemathbf{Phi_X}^primemathbf{Phi_Y} b-frac{lambda_1}{2}(a^primemathbf{Phi_X}^primemathbf{Phi_X}a-1)-frac{lambda_2}{2}(b^primemathbf{Phi_Y}^primemathbf{Phi_Y}b-1)+frac{eta}{2}(|a|^2+|b|^2)end{equation}

将Lagrange函数对$a$求导并令导数为零得：

egin{equation}frac{partial L}{partial a}=mathbf{Phi_X}^primemathbf{Phi_Y}b-lambda_1mathbf{Phi_X}^primemathbf{Phi_X}a+eta a=0end{equation}

故$a=frac{mathbf{Phi_X}^prime(lambda_1mathbf{Phi_X}a-mathbf{Phi_Y}b)}{eta}$，其中我们可以把$frac{lambda_1mathbf{Phi_X}a-mathbf{Phi_Y}b}{eta}$记作向量$c$，也就是说$a$可以表示成$a=mathbf{Phi_X}^prime c$。同理，$b$也可表示成$b=mathbf{Phi_Y}^prime d$。接下去我们就用$mathbf{Phi_X}^prime c, mathbf{Phi_Y}d$来替换$a,b$，先从KCCA开始，然后在引入正则化的KCCA。

    利用以上结果模型 ef{model:koriginal}可转化为：

egin{align}mathop{max}&quad cmathbf{Phi_X}^primemathbf{Phi_X}mathbf{Phi_Y}^primemathbf{Phi_Y}d onumber\mathop{s.t.}&quad cmathbf{Phi_X}^primemathbf{Phi_X}mathbf{Phi_X}^primemathbf{Phi_X}c=1 onumber\&quad dmathbf{Phi_Y}^primemathbf{Phi_Y}mathbf{Phi_Y}^primemathbf{Phi_Y}d=1label{model:k}end{align}

我们用核函数$mathbf{K_X}=mathbf{Phi_X}^primemathbf{Phi_X}$和$mathbf{K_Y}=mathbf{Phi_Y}^primemathbf{Phi_Y}$代替上述模型得：

egin{align}mathop{max}&quad cmathbf{K_X}mathbf{K_Y}d onumber\mathop{s.t.}&quad cmathbf{K_X}mathbf{K_X}d=1 onumber\&quad dmathbf{K_Y}mathbf{K_Y}d=1label{model:kernel}end{align}

对应的Lagrange函数为：

egin{equation}L(c,d,lambda_1,lambda_2)=c^primemathbf{K_X}mathbf{K_Y}d-frac{lambda_1}{2}(c^primemathbf{K_X}^2c-1)-frac{lambda_2}{2}(dmathbf{K_Y}^2d-1)end{equation}

对Lagrange函数求导得：

egin{align}&frac{partial L}{partial c}=mathbf{K_X}mathbf{K_Y}d-lambda_1mathbf{K_X}^2d=0label{equ:partialc}\&frac{partial L}{partial d}=mathbf{K_Y}mathbf{K_X}c-lambda_2mathbf{K_Y}^2d=0label{equ:partiald}end{align}

由$c^prime$左乘式子 ef{equ:partialc}，$d^prime$左乘式子 ef{equ:partiald}，相减得：

egin{equation}lambda_2d^primemathbf{K_Y}^2d-lambda_1c^primemathbf{K_X}^2c=0Longrightarrow lambda_1=lambda_2 riangleq lambdaend{equation}

再结合$c^prime mathbf{K_X}mathbf{K_Y}d-lambda_1cmathbf{K_X}^2c=c^primemathbf{K_X}mathbf{K_Y}d-lambda_1=0$可得：

egin{equation}lambda=cmathbf{K_X}mathbf{K_Y}dend{equation}

即$lambda$就是典型变量$u,v$的相关系数。接下去同样采用Note-11中的推导方法，也能得到一个求矩阵特征值的方程。这里就不再赘述了。

现在我们来看一个问题，当$mathbf{K_X},mathbf{K_Y}$可逆的情况下，由式子 ef{equ:partiald}可得：$d=frac{mathbf{K_Y}^{-2}mathbf{K_Y}mathbf{K_X}d}{lambda}=frac{mathbf{K_Y}^{-1}mathbf{K_X}c}{lambda}$，将其代入 ef{equ:partialc}得：

egin{equation}mathbf{K_XK_YK_Y}^{-1}mathbf{K_X}c-lambda^2mathbf{K_X}^2c=0Longrightarrow lambda=1end{equation}

也就是说，对于任何一个$d$总存在$c$使$u,v$的相关系数为1，显然这样的结果表明了模型出现了过拟合现象。事实上在有些论文中也提到了，当映射后的Hilbert空间的维数很大时，Lagrange函数会产生ill-posed，此时我们引入二次正则项得到一个well-posed Lagrange。也就是正则化的KCCA。

（二）正则化的KCCA

这里介绍两种正则化的KCCA，一种是对Lagrange加入正则项，一种是对限制条件做一下修改。

1) 引入正则项的Lagrange函数为：

egin{equation}L=c^primemathbf{K_XK_Y}d-frac{lambda_1}{2}(c^primemathbf{K_X}^2c-1)-frac{lambda_2}{2}(d^primemathbf{K_Y}^2d-1)+frac{eta}{2}(|c|^2+|d|^2)end{equation}

同样令Lagrange函数的导数为0，得：

egin{align}&mathbf{K_XK_Y}d=lambda(frac{lambda_1}{lambda}mathbf{K_X}^2-frac{eta}{lambda}mathbf{I})c\&mathbf{K_YK_X}c=lambda(frac{lambda_2}{lambda}mathbf{K_Y}^2-frac{eta}{lambda}mathbf{I})dend{align}

记$mathbf{K_O}=left[egin{array}&0&mathbf{K_XK_Y}\mathbf{K_Yk_X}&0end{array} ight]$,$mathbf{K_D}=left[egin{array}&frac{lambda_1}{lambda}mathbf{K_X}^2+frac{eta}{lambda}mathbf{I}&0\0&frac{lambda_2}{lambda}mathbf{K_Y}^2+frac{eta}{lambda}mathbf{I}end{array} ight]$,$gamma=left[egin{array}&c\dend{array} ight]$，则：

egin{equation}mathbf{K_O}gamma=lambdamathbf{K_D}gammaLongrightarrowmathbf{K_D}^{-1}mathbf{K_O}gamma=lambdagammaend{equation}

得到一个求特征值的问题。

2）对限制条件作修改后的模型为：

egin{align}mathop{max}&quad c^primemathbf{K_X}mathbf{K_Y}d onumber\mathop{s.t.}&quad(1- au)c^primemathbf{K_X}^2c+ au c^primemathbf{K_X}c=1 onumber\&quad(1- au)d^primemathbf{K_Y}^2d+ au dmathbf{K_Y}d=1end{align}

其Lagrange函数为：

egin{equation}L=c^primemathbf{K_XK_Y}d-frac{lambda_1}{2}[(1- au)c^primemathbf{K_X}^2c+ au c^primemathbf{K_X}c-1]-frac{lambda_2}{2}[(1- au)d^primemathbf{K_Y}^2d+ au d^primemathbf{K_Y}d-1]end{equation}

求导：

egin{align}&frac{partial L}{partial c}=mathbf{K_XK_Y}d-lambda_1[(1- au)mathbf{K_X}^2c+ aumathbf{K_X}c]label{equ:partialconc}\&frac{partial L}{partial d}=mathbf{K_YK_X}c-lambda_2[(1- au)mathbf{K_Y}^2d+ aumathbf{K_Y}d]label{equ:partialcond}end{align}

$c^prime imes$ ef{equ:partialconc}，$d^prime imes$ ef{equ:partialcond}，相减得到：

egin{align*}&lambda_1c^prime[(1- au)mathbf{K_X}^2c+ aumathbf{K_X}c]=lambda_2d^prime[(1- au)mathbf{K_Y}^2d+ aumathbf{K_Y}d]\Longrightarrow&lambda_1=lambda_2=lambda=c^primemathbf{K_XK_Y}dend{align*}

同样记$mathbf{K_O}=left[egin{array}&0&mathbf{K_XK_Y}\mathbf{K_YK_X}&0end{array} ight]$，$mathbf{K_D}=left[egin{array}&(1- au)mathbf{K_X}^2+ aumathbf{K_X}&0\0&(1- au)mathbf{K_Y}^2+ aumathbf{K_Y}end{array} ight]$，$gamma=left[egin{array}&c\dend{array} ight]$，则$mathbf{K_O}gamma=lambdamathbf{K_D}gamma$，得到一个求特征值的问题。

（三）Cholesky Decomposition

    由于解上述特征值问题时会遇到效率与精度问题，故可对上述的矩阵进行SVD分解，或对上述正定矩阵进行完全Cholesky分解，或者不完全Cholesky分解。

1）完全Cholesky分解

   a) 对于一个正定矩阵$mathbf{A}$，我们总可以将$mathbf{A}$分解成三角矩阵相乘的形式，即$mathbf{A}=mathbf{LL}^prime$，其中$mathbf{L}$为下三角矩阵。由$mathbf{A}=mathbf{LL}^prime$可知：

egin{equation*}a_{ij}=sum_{k=1}^{j}l_{ik}l_{jk}=sum_{k=1}^{j-1}l_{ik}l_{jk}+l_{ij}l_{jj},quad j<iend{equation*}

所以，由$a_{jj}=sum_{k=1}^{j-1}l_{jk}^2+l_{jj}^2Longrightarrow l_{jj}^2=a_{jj}-sum_{k=1}^{j-1}l_{jk}^2$，继而$l_{ij}=frac{1}{l_{jj}}(a_{ij}-sum_{k=1}^{j-1}l_{ik}l_{jk})$。从上述方程可以看出$mathbf{L}$的计算过程从左到右，逐列计算，称为left-looking cholesky。

   b) 由$mathbf{A}=mathbf{LL}^prime=left[egin{array}&lambda_{11}&0\l_1& ilde{mathbf{L}}end{array} ight]left[egin{array}&lambda_{11}&0\l_1& ilde{mathbf{L}}end{array} ight]^prime=left[egin{array}&a_{11}&a_1^prime\a_1& ilde{mathbf{A}}end{array} ight]$。所以：

$$a_{11}=lambda_{11}^2Longrightarrow lambda_{11}=sqrt{a_{11}}$$

$$a_1=lambda_{11}l_1Longrightarrow l_1=frac{a_1}{sqrt{a_{11}}}$$

$$ ilde{mathbf{A}}=l_1l_1^prime+ ilde{mathbf{L}} ilde{mathbf{L}}^primeLongrightarrow ilde{mathbf{L}} ilde{mathbf{L}}^prime= ilde{mathbf{A}}-frac{a_1a_1^prime}{a_{11}}$$

可以用递归的方式不断的计算出$ ilde{mathbf{L}}$，称为right-looking cholesky。

2）不完全cholesky分解

    与完全cholesky分解不同，不完全cholesky分解的目标是寻找一个矩阵$ ilde{mathbf{G}}_{N imes M}$（其中$M<<N$），使$mathbf{A}approx ilde{mathbf{G}} ilde{mathbf{G}}^prime$，即$|mathbf{A}- ilde{mathbf{G}} ilde{mathbf{G}}^prime|_F<eta$，这里的$ ilde{mathbf{G}}$也是下三角矩阵。其中算法的主要原理是：如果对角元素$l_{ii}$很小的话，那么我们可以把该列去掉，去掉该列之后，在其后的列都必须进行更新。算法从最大的对角元素开始，然后把该列交换到最前面，然后进行更新。算法流程：

Input: $N imes N$正定矩阵$mathbf{K}$，和$eta$.
1. 初始化：$i=1,mathbf{ar{K}}=mathbf{K},mathbf{P}=mathbf{I},mathbf{G}=0$，$G_{jj}=K_{jj}, j=1,cdots,N$.
2. while $sum_{j=i}^NG_{jj}>eta$
  1. 找出最大的$G_{jj}$: $j^*=mathop{argmax}_{jin[i,N]} G_{jj}$.
  2. 更新交换矩阵$mathbf{P}$:
    $Pnext = mathbf{I}$, $Pnext_{ii}=0$, $Pnext_{j^*j^*}=0$, $Pnext_{ij^*}=1$, $Pnext_{j^*i}=1$（Pnext为交换$i,j^*$行或列的交换矩阵），$mathbf{P}=mathbf{P}Pnext$
  3. 对$mathbf{ar{K}}$进行行列交换：$mathbf{ar{K}}=Pnextmathbf{ar{K}}Pnext$.
  4. 交换矩阵$mathbf{G}$中第$i$行和第$j^*$行，注意只交换前$i$个元素，$j^*geq i$: $mathbf{G}_{i,1:i}longleftrightarrowmathbf{G}_{j^*,1:i}$.
  5. $mathbf{G}_{ii}=sqrt{mathbf{ar{K}}_{ii}}$.
  6. 更新第$i$列$mathbf{G}$: $mathbf{G}_{i+1:n,i}=frac{1}{mathbf{G}_{ii}}(mathbf{ar{K}}_{i+1:N,i}-sum_{i=1}^{i-1}mathbf{G}_{i+1:N,j}mathbf{G}_{ij})$.
  7. 更新$i$后面的对角元素：for $jin[i+1,N], mathbf{G}_{jj}=mathbf{K}_{jj}-sum_{k=1}^imathbf{G}_{jk}^2$.
  8. $i=i+1$.
3. 得到$mathbf{P},mathbf{G}$，$M=i-1$.
Output: 一个$N imes M$的矩阵$mathbf{G}$，和交换矩阵$mathbf{P}$，使$|mathbf{PKP}^prime-mathbf{GG}^prime|leqeta$.
相关阅读:
浅谈Semaphore类
 Python浅谈requests三方库
 191104
191103
191102
191101
191031
191030
191029
191028
原文地址：https://www.cnblogs.com/boostable/p/lec_kernel_canonical_correlation_analysis.html