Paper Notes (2): "Adaptive Federated Optimization"


    Intuition

    The authors demonstrated that the gap between centralized and federated performance has two causes: 1) client drift, and 2) a lack of adaptivity.

    Unlike variance-reduction methods, they extend federated learning with adaptive optimizers such as Adam.

    They rewrote the update rule of FedAvg

    \[ x_{t+1} = x_t - \frac{1}{\vert\mathcal{S}\vert}\sum_{i\in\mathcal{S}}(x_t - x_i^t). \]

    Let \(\Delta_i^t = x_i^t - x_t\), where \(x_i^t\) denotes the model of client \(i\) after local training.
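    A one-line rearrangement in terms of \(\Delta_i^t\) shows the rule is just model averaging:

    \[ x_{t+1} = x_t + \frac{1}{\vert\mathcal{S}\vert}\sum_{i\in\mathcal{S}}\Delta_i^t = \frac{1}{\vert\mathcal{S}\vert}\sum_{i\in\mathcal{S}} x_i^t. \]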

    In FedAvg, the server learning rate \(\eta\) is \(1\) with SGD as the server optimizer, and \(\Delta_i^t\) acts as a kind of pseudo-gradient. They proposed that, apart from SGD, the server optimizer could use adaptive methods to update the server model \(x\). Their framework is as follows:

    [Image: the FedOpt framework pseudocode from the paper]
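    Below is a minimal sketch of one communication round under this framework, assuming an Adam-style server optimizer (the FedAdam case); `grad_fn`, the hyperparameter values, and the in-memory client loop are my illustrative assumptions, not the authors' implementation:

    ```python
    import numpy as np

    def server_round(x, clients, m, v, eta=1e-2, beta1=0.9, beta2=0.99,
                     tau=1e-3, eta_l=1e-2, local_steps=5):
        """One FedOpt round with an Adam-style server optimizer (sketch)."""
        # Each sampled client runs local SGD and reports Delta_i = x_i - x.
        deltas = []
        for grad_fn in clients:  # grad_fn(w) returns a stochastic gradient at w
            x_i = x.copy()
            for _ in range(local_steps):
                x_i -= eta_l * grad_fn(x_i)
            deltas.append(x_i - x)
        delta = np.mean(deltas, axis=0)  # pseudo-gradient Delta_t

        # The server feeds Delta_t into Adam-style moment updates.
        m = beta1 * m + (1 - beta1) * delta
        v = beta2 * v + (1 - beta2) * delta ** 2
        x = x + eta * m / (np.sqrt(v) + tau)  # "+" because Delta_t points downhill
        return x, m, v
    ```

    Replacing the adaptive step with plain server-side SGD at \(\eta = 1\) recovers FedAvg, matching the rewritten update rule above.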

    Convergence

    Multi-step local updates, concretely the fact that \(\mathbb{E}[\Delta_i^t] \neq K\nabla F(x_t)\), obstruct the convergence analysis. In my opinion, they offered a fairly rough bound on the resulting error.
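    A two-step example makes the bias concrete: with exact gradients and \(K=2\) local steps on client \(i\) (taking \(\eta_l\) as the client learning rate of the standard local-SGD update),

    \[ \Delta_i^t = -\eta_l\Big(\nabla F_i(x_t) + \nabla F_i\big(x_t - \eta_l\nabla F_i(x_t)\big)\Big), \]

    where the second gradient is evaluated at a drifted point, so \(\Delta_i^t\) is not \(2\) copies of \(-\eta_l\nabla F_i(x_t)\) unless \(\nabla F_i\) is constant.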

    I'll only give my personal analysis of their proof of Theorem 1 and thoughts of Theorem 2 are similar.

    First, we build a relationship between \(x_{t+1}\) and \(x_t\). By the update rule of the adaptive method and the \(L\)-smoothness assumption, we have

    [Image: descent-style inequality obtained from \(L\)-smoothness]
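    The standard \(L\)-smoothness bound behind this step is

    \[ f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1}-x_t\rangle + \frac{L}{2}\Vert x_{t+1}-x_t\Vert^2, \]

    into which the server step \(x_{t+1}-x_t\) is substituted.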

    Furthermore, as in the Adagrad analysis, we have

    [Image: decomposition of the inner-product term into \(T_1\) and \(T_2\)]
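    For reference, the server update being analyzed here (the FedAdagrad instance of the framework, as I recall it) is

    \[ \Delta_t = \frac{1}{\vert\mathcal{S}\vert}\sum_{i\in\mathcal{S}}\Delta_i^t,\qquad v_t = v_{t-1} + \Delta_t^2,\qquad x_{t+1} = x_t + \eta\,\frac{\Delta_t}{\sqrt{v_t}+\tau}, \]

    with the square and square root taken elementwise and \(\tau\) controlling the degree of adaptivity.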

    Now we need to bound the two terms \(T_1\) and \(T_2\).

    No local-training quantities appear in \(T_2\), so we can follow the same process as in the Adagrad analysis:

    [Images: the chain of inequalities bounding \(T_2\), following the Adagrad analysis]

    To bound \(T_1\), they link \(\Delta_t\), which contains the local updates, with \(\nabla f(x_t)\).

    As mentioned above, \(\Delta_t\) is a kind of pseudo-gradient standing in for \(\nabla f(x_t)\):

    [Image: expansion of \(\Delta_t\) in terms of the local stochastic gradients \(g_{i,k}^t\)]

    Again, note that \(x_{i,k}^t\) is the \(k\)-th model during local training on client \(i\), and \(x_t\) is the server model at round \(t\).
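    Concretely, if each client runs \(K\) local SGD steps \(x_{i,k}^t = x_{i,k-1}^t - \eta_l\, g_{i,k-1}^t\) starting from \(x_{i,0}^t = x_t\) (this indexing convention is my assumption), unrolling gives

    \[ \Delta_i^t = -\eta_l\sum_{k=0}^{K-1} g_{i,k}^t, \qquad \Delta_t = -\frac{\eta_l}{\vert\mathcal{S}\vert}\sum_{i\in\mathcal{S}}\sum_{k=0}^{K-1} g_{i,k}^t. \]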

    [Image: bound on \(T_1\) in terms of \(\Vert x_{i,k}^t - x_t\Vert\)]

    In my opinion, how they bound \(x_{i,k}^t - x_t\) is the most impressive part of the whole paper.

    [Image: recursive bound on \(\Vert x_{i,k}^t - x_t\Vert^2\)]

    Honestly, the local gradient \(g_{i,k}^t\) builds the bridge between \(x_{i,k}^t\) and \(x_t\), even though \(\mathbb{E}[\nabla F_i(x_{i,k-1}^t)] \neq \nabla F_i(x_t)\).
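    The bridge is just the local SGD recursion itself:

    \[ x_{i,k}^t - x_t = \big(x_{i,k-1}^t - x_t\big) - \eta_l\, g_{i,k-1}^t, \]

    so \(\Vert x_{i,k}^t - x_t\Vert^2\) can be bounded recursively: each step adds one more gradient term to the accumulated drift.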

    [Image: the chain of inequalities completing the bound on the local drift]

    The second inequality is quite rough. They used \(\mathbb{E}[\Vert z_1+z_2+\dots+z_r\Vert^2] \leq r\,\mathbb{E}[\Vert z_1\Vert^2+\Vert z_2\Vert^2+\dots+\Vert z_r\Vert^2]\).
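    This is the standard relaxation via convexity of \(\Vert\cdot\Vert^2\) (equivalently, Cauchy–Schwarz):

    \[ \Big\Vert\sum_{j=1}^{r} z_j\Big\Vert^2 = r^2\,\Big\Vert\frac{1}{r}\sum_{j=1}^{r} z_j\Big\Vert^2 \le r\sum_{j=1}^{r}\Vert z_j\Vert^2, \]

    and taking expectations on both sides gives the stated bound.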
