This paper gives a simple, elementary proof for the AdaBoost algorithm in machine learning; the only mathematical tool required is Calculus I.
AdaBoost is a powerful algorithm for building predictive models. However, a major disadvantage is that AdaBoost may over-fit in the presence of noise. Freund, Y. \& Schapire, R. E. (1997) proved that the training error of the ensemble is bounded by the following expression:
\begin{equation}\label{ada1}
e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
where $\epsilon_t$ is the error rate of each base classifier $t$. If the error rate is less than $0.5$, we can write $\epsilon_t=0.5-\gamma_t$, where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). The bound on the training error of the ensemble becomes
\begin{equation}\label{ada2}
e_{ensemble}\le \prod_{t}\sqrt{1-4{\gamma_t}^2}\le e^{-2\sum_{t}{\gamma_t}^2}
\end{equation}
Thus, if each base classifier is slightly better than random, so that $\gamma_t>\gamma$ for some $\gamma>0$, then the training error drops exponentially fast. Nevertheless, because of its tendency to focus on training examples that are misclassified, the AdaBoost algorithm can be quite susceptible to over-fitting. We will give a new, simple proof of \ref{ada1} and \ref{ada2}; additionally, we explain why the parameter $\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$ is chosen in the boosting algorithm.
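As a quick numerical illustration of \ref{ada2} (with round numbers chosen here for illustration, not taken from the original analysis): if every base classifier attains $\gamma_t\ge 0.1$, i.e.\ an error rate of at most $0.4$, then after $T$ rounds
\[e_{ensemble}\le e^{-2\sum_{t}{\gamma_t}^2}\le e^{-0.02\,T},\]
so about $T\approx 350$ rounds already push the bound below $10^{-3}$.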
AdaBoost Algorithm:
Recall that the boosting algorithm is as follows:
Given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i\in X$, $y_i\in Y=\{-1, +1\}$.
Initialize $$D_1(i)=\frac{1}{m}.$$ For $t=1, 2, \ldots, T$: Train a weak learner using distribution $D_t$.
Get weak hypothesis $h_t: X\rightarrow \{-1, +1\}$ with error \[\epsilon_t=\Pr_{i\sim D_t}[h_t(x_i)\ne y_i].\] If $\epsilon_t>0.5$, then the weights $D_t(i)$ are reverted back to their original uniform values $\frac{1}{m}$.
Choose
\begin{equation}\label{boost3}
\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}
\end{equation}
Update
\begin{equation}\label{boost4}
D_{t+1}(i)=\frac{D_{t}(i)}{Z_t}\times
\left\{\begin{array}{c c}
e^{-\alpha_t} & \quad\textrm{if $h_t(x_i)=y_i$}\\
e^{\alpha_t} & \quad\textrm{if $h_t(x_i)\ne y_i$}
\end{array}\right.
\end{equation}
where $Z_t$ is a normalization factor.
Output the final hypothesis:
\[H(x)=\text{sign}\left(\sum_{t=1}^{T}\alpha_t\cdot h_t(x)\right)\]
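For readers who prefer code, here is a minimal sketch of the algorithm above in Python (using NumPy), with one-dimensional threshold stumps as weak learners. The helper names \texttt{train\_stump} and \texttt{ada\_boost} are our own illustration and not part of any standard library; a production implementation would differ in many details.
\begin{verbatim}
import numpy as np

def train_stump(X, y, D):
    # Exhaustively pick (feature, threshold, polarity) minimizing the
    # weighted error under the current distribution D.
    m, n = X.shape
    best_err, best_rule = np.inf, None
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best_err, best_rule = err, (j, thr, sign)
    j, thr, sign = best_rule
    return best_err, lambda Z: sign * np.where(Z[:, j] <= thr, 1, -1)

def ada_boost(X, y, T=20):
    # X: (m, n) NumPy array of features; y: NumPy array of +/-1 labels.
    m = len(y)
    D = np.full(m, 1.0 / m)            # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for _ in range(T):
        eps, h = train_stump(X, y, D)
        if eps > 0.5:                  # revert weights to uniform values
            D = np.full(m, 1.0 / m)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # alpha_t
        D = D * np.exp(-alpha * y * h(X))                   # reweighting rule
        D = D / D.sum()                                     # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    # Final hypothesis: H(x) = sign( sum_t alpha_t * h_t(x) )
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
\end{verbatim}
Running \texttt{ada\_boost} on a small two-class data set and evaluating the returned hypothesis on the training points gives an empirical check of the bound \ref{ada1} proved below.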
Proof:
Firstly, we prove \ref{ada1}. Note that $D_{t+1}(i)$ is a distribution, so its sum $\sum_{i}D_{t+1}(i)$ equals $1$; hence
\[Z_t=\sum_{i}D_{t+1}(i)\cdot Z_t=\sum_{i}D_t(i)\times
\left\{\begin{array}{c c}
e^{-\alpha_t} & \quad\textrm{if $h_t(x_i)=y_i$}\\
e^{\alpha_t} & \quad\textrm{if $h_t(x_i)\ne y_i$}
\end{array}\right.\]
\[=\sum_{i: h_t(x_i)=y_i}D_t(i)\cdot e^{-\alpha_t}+\sum_{i: h_t(x_i)\ne y_i}D_t(i)\cdot e^{\alpha_t}\]
\[=e^{-\alpha_t}\cdot \sum_{i: h_t(x_i)=y_i}D_t(i)+e^{\alpha_t}\cdot \sum_{i: h_t(x_i)\ne y_i}D_t(i)\]
\begin{equation}\label{boost5}
=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t
\end{equation}
In order to find $\alpha_t$, we minimize $Z_t$ by setting its first-order derivative with respect to $\alpha_t$ equal to $0$:
\[{[e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t]}^{'}=-e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=0\]
\[\Rightarrow \alpha_t=\frac{1}{2}\cdot \log\frac{1-\epsilon_t}{\epsilon_t},\]
which is \ref{boost3} in the boosting algorithm; since the second derivative $e^{-\alpha_t}\cdot(1-\epsilon_t)+e^{\alpha_t}\cdot\epsilon_t$ is positive, this stationary point is indeed a minimum. Then we put $\alpha_t$ back into \ref{boost5}:
\[Z_t=e^{-\alpha_t}\cdot (1-\epsilon_t)+e^{\alpha_t}\cdot \epsilon_t=e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot (1-\epsilon_t)+e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t\]
\begin{equation}\label{boost6}
=2\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
On the other hand, from \ref{boost4} we have
\[D_{t+1}(i)=\frac{D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}}{Z_t}=\frac{D_t(i)\cdot e^{K_t}}{Z_t},\]
where $K_t=-\alpha_t\cdot y_i\cdot h_t(x_i)$; this is equivalent to \ref{boost4}, since the product $y_i\cdot h_t(x_i)$ is $1$ if $h_t(x_i)=y_i$ and $-1$ if $h_t(x_i)\ne y_i$. Thus we can write down all of the equations
\[D_1(i)=\frac{1}{m}\]
\[D_2(i)=\frac{D_1(i)\cdot e^{K_1}}{Z_1}\]
\[D_3(i)=\frac{D_2(i)\cdot e^{K_2}}{Z_2}\]
\[\cdots\cdots\]
\[D_{t+1}(i)=\frac{D_t(i)\cdot e^{K_t}}{Z_t}\]
Multiplying all of the equalities above, we obtain
\[D_{t+1}(i)=\frac{1}{m}\cdot\frac{e^{-y_i\cdot f(x_i)}}{\prod_{t}Z_t},\]
where $f(x_i)=\sum_{t}\alpha_t\cdot h_t(x_i)$. Thus
\begin{equation}\label{boost7}
\frac{1}{m}\cdot \sum_{i}e^{-y_i\cdot f(x_i)}=\sum_{i}D_{t+1}(i)\cdot\prod_{t}Z_t=\prod_{t}Z_t
\end{equation}
Note that if $\epsilon_t>0.5$, the weights are reset to uniform and the round is repeated until $\epsilon_t\le0.5$; in other words, the parameter $\alpha_t\ge0$ in each valid iteration. The training error of the ensemble can be expressed as
\[e_{ensemble}=\frac{1}{m}\cdot\sum_{i}\left\{\begin{array}{c c}
1 & \quad\textrm{if $y_i\ne H(x_i)$}\\
0 & \quad\textrm{if $y_i=H(x_i)$}
\end{array}\right.
=\frac{1}{m}\cdot \sum_{i}\left\{\begin{array}{c c}
1 & \quad\textrm{if $y_i\cdot f(x_i)\le0$}\\
0 & \quad\textrm{if $y_i\cdot f(x_i)>0$}
\end{array}\right.\]
\begin{equation}\label{boost8}
\le\frac{1}{m}\cdot\sum_{i}e^{-y_i\cdot f(x_i)}=\prod_{t}Z_t
\end{equation}
The inequality holds because the indicator of $y_i\cdot f(x_i)\le0$ never exceeds $e^{-y_i\cdot f(x_i)}$, and the last step follows from \ref{boost7}. According to \ref{boost6} and \ref{boost8}, we have proved \ref{ada1}:
\begin{equation}\label{boost9}
e_{ensemble}\le \prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}
\end{equation}
In order to prove \ref{ada2}, we first prove the following inequality:
\begin{equation}\label{boost10}
1+x\le e^x
\end{equation}
or, equivalently, $e^x-x-1\ge0$. Let $g(x)=e^x-x-1$; then
\[g^{'}(x)=e^x-1=0\Rightarrow x=0.\]
Since $g^{''}(x)=e^x>0$,
\[{g(x)}_{min}=g(0)=0\Rightarrow e^x-x-1\ge0,\]
which is the desired inequality. Now we go back to \ref{boost9} and let
\[\epsilon_t=\frac{1}{2}-\gamma_t,\]
where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). Based on \ref{boost10}, we have
\[e_{ensemble}\le\prod_{t}2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}\]
\[=\prod_{t}\sqrt{1-4\gamma_t^2}\]
\[=\prod_{t}[1+(-4\gamma_t^2)]^{\frac{1}{2}}\]
\[\le\prod_{t}(e^{-4\gamma_t^2})^{\frac{1}{2}}=\prod_{t}e^{-2\gamma_t^2}\]
\[=e^{-2\cdot\sum_{t}\gamma_t^2},\]
as desired.
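The key computation in the proof, namely that $\alpha_t=\frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$ minimizes $Z_t$ and that the minimum equals $2\sqrt{\epsilon_t\cdot(1-\epsilon_t)}$, can also be checked numerically. The short Python sketch below (our own illustration, using only NumPy) compares a brute-force minimization of $Z_t(\alpha)=e^{-\alpha}\cdot(1-\epsilon_t)+e^{\alpha}\cdot\epsilon_t$ over a fine grid of $\alpha$ values with the closed-form expressions derived above.
\begin{verbatim}
import numpy as np

# For several error rates eps, minimize Z(alpha) = e^{-alpha}(1-eps) + e^{alpha} eps
# over a dense grid of alpha values and compare with the closed forms
# alpha* = 0.5 * log((1-eps)/eps) and Z* = 2 * sqrt(eps * (1-eps)).
alphas = np.linspace(-5.0, 5.0, 200001)
for eps in (0.10, 0.25, 0.40, 0.49):
    Z = np.exp(-alphas) * (1 - eps) + np.exp(alphas) * eps
    k = np.argmin(Z)
    alpha_closed = 0.5 * np.log((1 - eps) / eps)
    Z_closed = 2 * np.sqrt(eps * (1 - eps))
    print(f"eps={eps:.2f}  grid alpha={alphas[k]:+.4f}  closed alpha={alpha_closed:+.4f}"
          f"  grid Z={Z[k]:.6f}  closed Z={Z_closed:.6f}")
\end{verbatim}
For each $\epsilon_t$ the grid minimizer agrees with $\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, and the minimal value agrees with $2\sqrt{\epsilon_t(1-\epsilon_t)}$ up to the grid resolution.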