[Paper Notes] Harnessing Smoothness to Accelerate Distributed Optimization: Gradient Tracking

Preface

Original paper: Harnessing Smoothness to Accelerate Distributed Optimization.

This post contains my notes on Qu & Li 2018 [1], summarizing the convergence proof of the discrete-time distributed gradient tracking algorithm.

Problem Statement and Algorithm

Problem statement

$$\min_{x\in\mathbb R^d} f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$$

Assumption 1: each $f_i$ is $\alpha$-strongly convex and $\beta$-smooth, that is,

$$\begin{aligned} f_i(y)-f_i(x)&\geq \nabla f_i(x)^T(y-x)+\frac{\alpha}{2}\|y-x\|^2,\\ \|\nabla f_i(x)-\nabla f_i(y)\|&\leq \beta\|x-y\|. \end{aligned}$$

Discrete-time distributed optimization algorithm

$$\begin{aligned} x_i(t+1)&=\sum_{j=1}^n w_{ij}x_j(t)-\eta s_i(t)\\ s_i(t+1)&=\sum_{j=1}^n w_{ij}s_j(t)+\nabla f_i(x_i(t+1))-\nabla f_i(x_i(t)) \end{aligned}\qquad (1)$$

where $x_i(t)\in \mathbb R^{1\times d}$ and $s_i(t)\in \mathbb R^{1\times d}$ are written as row vectors, and the initialization satisfies $s_i(0)=\nabla f_i(x_i(0))$. As we will see, $s_i(t)$ tracks the average gradient $\frac{1}{n}\sum_{j=1}^n\nabla f_j(x_j(t))$. This algorithm is known as distributed gradient tracking (DGT).
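As a sanity check, iteration (1) can be simulated on a toy problem. The sketch below is illustrative, not from the paper: the quadratic local objectives, the ring-graph mixing matrix, and the step size are all my own choices. It runs DGT and compares every agent's iterate against the centralized minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3  # number of agents, dimension

# Illustrative local objectives f_i(x) = 0.5 x^T Q_i x - b_i^T x, with Q_i
# diagonal and positive definite so Assumption 1 holds
# (alpha = smallest diagonal entry, beta = largest).
Q = [np.diag(rng.uniform(1.0, 4.0, d)) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]

def grad(i, x):
    return Q[i] @ x - b[i]

# Doubly stochastic mixing matrix W: lazy weights on a ring graph.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

# DGT iteration (1); row i of x and s belongs to agent i.
eta = 0.01
x = rng.standard_normal((n, d))
s = np.vstack([grad(i, x[i]) for i in range(n)])  # s_i(0) = grad f_i(x_i(0))
for _ in range(5000):
    x_new = W @ x - eta * s
    s = W @ s + np.vstack([grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)])
    x = x_new

# Centralized minimizer of f = (1/n) sum_i f_i, for comparison.
x_star = np.linalg.solve(sum(Q), sum(b))
err = np.max(np.abs(x - x_star))  # every agent's row approaches x*
```

With a constant step size, all rows of `x` converge to the same minimizer `x_star`; plain distributed gradient descent with a constant step would only reach a neighborhood.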

Convergence Proof

Assume $W$ is a doubly stochastic matrix. In matrix form, algorithm (1) reads

$$\begin{aligned} x(t+1)&=W x(t)-\eta s(t)\\ s(t+1)&=W s(t)+\nabla(t+1)-\nabla(t) \end{aligned}\qquad (2)$$

where $s(0)=\nabla(0)$; $\nabla(t)\in\mathbb R^{n\times d}$ is the matrix whose rows are the gradients $\nabla f_i(x_i(t))$, and likewise $x(t)\in\mathbb R^{n\times d}$ and $s(t)\in\mathbb R^{n\times d}$ stack the row vectors $x_i(t)$ and $s_i(t)$.

Define $\bar x(t)=(1/n)1_n^T x(t)$, $\bar s(t)=(1/n)1_n^T s(t)$, and $g(t)=(1/n)1_n^T \nabla(t)$.

Lemma 1 (Lemma 7 in [1]): the following identities hold:

  • $\bar s(t+1)=\bar s(t)+g(t+1)-g(t)=g(t+1)$
  • $\bar x(t+1)=\bar x(t)-\eta \bar s(t)=\bar x(t)-\eta g(t)$

Proof: multiplying both sides of (2) by $(1/n)1_n^T$ gives $\bar s(t+1)=\bar s(t)+g(t+1)-g(t)$; since $s(0)=\nabla(0)$ implies $\bar s(0)=g(0)$, induction yields $\bar s(t)=g(t)=(1/n)1_n^T s(t)$ for all $t$. Similarly, $\bar x(t+1)=\bar x(t)-\eta\bar s(t)=\bar x(t)-\eta g(t)$.
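Lemma 1 is easy to confirm numerically: along the matrix iteration (2), the row average of $s(t)$ coincides with the average gradient $g(t)$ at every step. A minimal sketch (the toy quadratic objectives and the complete-graph $W$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 2
Q = [np.diag(rng.uniform(1.0, 3.0, d)) for _ in range(n)]  # toy f_i Hessians
b = [rng.standard_normal(d) for _ in range(n)]
nabla = lambda x: np.vstack([Q[i] @ x[i] - b[i] for i in range(n)])  # stacked gradients

W = np.full((n, n), 1.0 / n)  # complete graph: W = (1/n) 1 1^T, doubly stochastic
eta = 0.05
x = rng.standard_normal((n, d))
s = nabla(x)  # s(0) = nabla(0)

for t in range(20):
    g = nabla(x).mean(axis=0)     # g(t) = (1/n) 1^T nabla(t)
    s_bar = s.mean(axis=0)        # s_bar(t)
    assert np.allclose(s_bar, g)  # Lemma 1: s_bar(t) = g(t) at every t
    x_new = W @ x - eta * s
    s = W @ s + nabla(x_new) - nabla(x)
    x = x_new
```

The identity holds exactly (up to floating point) because $W$ being doubly stochastic preserves the column averages of $s$.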

We now prove convergence. The strategy is to bound the gradient tracking error, the consensus error, and the optimality error separately, and finally combine them to establish convergence of the overall error.

Step 1: consider the gradient tracking error $\|s(k)-1_ng(k)\|$. We have

$$s(k)-1_ng(k)=[W s(k-1)-1_ng(k-1)]+[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)].\qquad (3)$$

Let $\sigma\in(0,1)$ be the spectral norm of $W-(1/n)1_n1_n^T$. For any $\omega\in \mathbb R^n$, we have

$$\|W\omega-(1/n)1_n1_n^T \omega\|=\|(W-(1/n)1_n1_n^T)(\omega-(1/n)1_n1_n^T \omega)\|\leq \sigma\|\omega-(1/n)1_n1_n^T \omega\|.\qquad (4)$$
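The first equality in (4) uses $W\cdot(1/n)1_n1_n^T=(1/n)1_n1_n^T$ for doubly stochastic $W$. A quick numerical check of (4), with a ring-graph $W$ as an illustrative choice:

```python
import numpy as np

n = 6
# Doubly stochastic W: lazy weights on a ring graph (illustrative choice).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

J = np.ones((n, n)) / n           # (1/n) 1_n 1_n^T
sigma = np.linalg.norm(W - J, 2)  # spectral norm; in (0, 1) for a connected graph

rng = np.random.default_rng(0)
for _ in range(100):
    w = rng.standard_normal(n)
    # Inequality (4): ||W w - J w|| <= sigma * ||w - J w||
    assert np.linalg.norm(W @ w - J @ w) <= sigma * np.linalg.norm(w - J @ w) + 1e-12
```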

Taking norms in (3) and applying (4) and Lemma 1 gives

$$\begin{aligned} \|s(k)-1_ng(k)\|&\leq \|Ws(k-1)-1_ng(k-1)\|+\|[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)]\|\\ &\leq \sigma\|s(k-1)-1_ng(k-1)\|+\|[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)]\|. \end{aligned}\qquad (5)$$

For the second term in (5), note that $1_n[g(k)-g(k-1)]=(1/n)1_n1_n^T(\nabla(k)-\nabla(k-1))$ is an orthogonal projection of $\nabla(k)-\nabla(k-1)$, so

$$\begin{aligned} \|[\nabla(k)-\nabla(k-1)]-1_n[g(k)-g(k-1)]\|^2&=\|\nabla(k)-\nabla(k-1)\|^2-\|(1/n)1_n1_n^T(\nabla (k)-\nabla(k-1))\|^2\\ &\leq \|\nabla(k)-\nabla(k-1)\|^2. \end{aligned}$$

Substituting this into (5) and using the $\beta$-smoothness in Assumption 1 gives

$$\|s(k)-1_ng(k)\|\leq \sigma\|s(k-1)-1_ng(k-1)\|+\beta\|x(k)-x(k-1)\|.\qquad (6)$$

Step 2: similarly, by (4) and Lemma 1 the consensus error satisfies

$$\begin{aligned} \|x(k)-1_n\bar x(k)\|&= \|Wx(k-1)-\eta s(k-1)-1_n\bar x(k-1)+\eta1_n g(k-1)\|\\ &\leq \sigma\|x(k-1)-1_n\bar x(k-1)\|+\eta\|s(k-1)-1_ng(k-1)\|. \end{aligned}\qquad (7)$$

Define $f=\frac{1}{n}\sum_{i=1}^n f_i$, and let $h(t)=\nabla f(\bar x(t))\in\mathbb R^{1\times d}$ denote the gradient of $f$ at $\bar x(t)$.

Step 3: next consider the optimality error $\|\bar x(k)-x^*\|$. By Lemma 1,

$$\bar x(k)=\bar x(k-1)-\eta h(k-1)-\eta[g(k-1)-h(k-1)].\qquad (8)$$

Lemma 2 (Lemma 3.11 in [2]): if $f:\mathbb R^d\to\mathbb R$ satisfies Assumption 1, then for all $x,y\in\mathbb R^d$,

$$(\nabla f(x)-\nabla f(y))^T(x-y)\geq \frac{\alpha\beta}{\alpha+\beta}\|x-y\|^2+\frac{1}{\alpha+\beta}\|\nabla f(x)-\nabla f(y)\|^2.$$
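Lemma 2 strengthens plain convexity's $(\nabla f(x)-\nabla f(y))^T(x-y)\geq 0$ by combining strong convexity and smoothness. A numerical spot-check on a quadratic whose Hessian eigenvalues lie in $[\alpha,\beta]$ (an illustrative choice; for $f(x)=\frac12 x^TQx$ the gradient is simply $Qx$):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.0, 4.0
# f(x) = 0.5 x^T Q x with eigenvalues of Q in [alpha, beta], so f is
# alpha-strongly convex and beta-smooth.
Q = np.diag(rng.uniform(alpha, beta, 5))

for _ in range(200):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    gx, gy = Q @ x, Q @ y  # gradients at x and y
    lhs = (gx - gy) @ (x - y)
    rhs = (alpha * beta / (alpha + beta)) * np.linalg.norm(x - y) ** 2 \
        + (1.0 / (alpha + beta)) * np.linalg.norm(gx - gy) ** 2
    assert lhs >= rhs - 1e-9  # Lemma 2 holds for every sampled pair
```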

Lemma 3: for any $x\in\mathbb R^d$, define $x^+=x-\eta \nabla f(x)$ with $0<\eta<\frac{2}{\beta}$. If $f$ satisfies Assumption 1, then

$$\|x^+-x^*\|\leq \lambda \|x-x^*\|,$$

where $\lambda =\max(|1-\eta \alpha|,|1-\eta \beta|)$.

Proof: suppose $0<\eta\leq \frac{2}{\alpha+\beta}$ and $\alpha<\beta$. Then $\frac{2}{\eta}-\alpha\geq\beta$ and $|1-\eta\alpha|\geq |1-\eta\beta|$, so $\lambda=1-\eta\alpha$. Since $f$ is $\beta$-smooth, it is also $(\frac{2}{\eta}-\alpha)$-smooth, and applying Lemma 2 with $y=x^*$ (where $\nabla f(x^*)=0$) gives

$$\begin{aligned} \|x-x^*-\eta \nabla f(x)\|^2&=\|x-x^*\|^2+\eta^2\|\nabla f(x)\|^2-2\eta \nabla f(x)^T(x-x^*)\\ &\leq \left(1-2\eta\frac{\alpha(\frac{2}{\eta}-\alpha)}{\alpha+(\frac{2}{\eta}-\alpha)}\right)\|x-x^*\|^2+\left(\eta^2-\frac{2\eta}{\alpha+(\frac{2}{\eta}-\alpha)}\right)\|\nabla f(x)\|^2\\ &=(1-\eta \alpha)^2\|x-x^*\|^2\\ &=\lambda^2\|x-x^*\|^2. \end{aligned}$$

If instead $\frac{2}{\alpha+\beta}<\eta<\frac{2}{\beta}$ (or $\alpha\geq\beta$), the same analysis applies with the roles of $\alpha$ and $\beta$ exchanged, yielding $\lambda=|1-\eta\beta|$.
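Lemma 3 is the classical contraction property of a gradient step; for a quadratic it can be checked directly (the diagonal Hessian and the step sizes below are illustrative choices):

```python
import numpy as np

alpha, beta = 1.0, 4.0
Q = np.diag([alpha, 2.0, beta])  # Hessian of f(x) = 0.5 x^T Q x, minimizer x* = 0
x = np.array([1.0, -2.0, 3.0])

for eta in [0.1, 2 / (alpha + beta), 0.45]:  # all inside (0, 2/beta) = (0, 0.5)
    lam = max(abs(1 - eta * alpha), abs(1 - eta * beta))
    x_plus = x - eta * Q @ x                 # one gradient step x+ = x - eta*grad f(x)
    # Lemma 3: ||x+ - x*|| <= lam * ||x - x*||
    assert np.linalg.norm(x_plus) <= lam * np.linalg.norm(x) + 1e-12
```

The contraction factor $\lambda$ is minimized at $\eta=\frac{2}{\alpha+\beta}$, where $|1-\eta\alpha|=|1-\eta\beta|=\frac{\beta-\alpha}{\beta+\alpha}$.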

By Lemma 3 and (8),

$$\begin{aligned} \|\bar x(k)-x^*\|&\leq \lambda \|\bar x(k-1)-x^*\|+\eta \|g(k-1)-h(k-1)\|\\ &\leq \lambda \|\bar x(k-1)-x^*\|+(\eta\beta/\sqrt{n}) \|x(k-1)-1_n\bar x(k-1)\|, \end{aligned}\qquad(9)$$

where $\lambda =\max(|1-\eta \alpha|,|1-\eta \beta|)$ as in Lemma 3, and the second inequality uses the $\beta$-smoothness of each $f_i$.

Step 4: finally, bound $\|x(k)-x(k-1)\|$. By Assumption 1 and $\nabla f(x^*)=0$,

$$\|h(k-1) \|=\|\nabla f(\bar x(k-1)) \|\leq \beta\|\bar x(k-1)-x^* \|.$$

Using the same decomposition as in (9),

$$\begin{aligned} \|s(k-1)\|&\leq \|s(k-1)-1_ng(k-1) \|+\|1_ng(k-1)-1_n h(k-1) \|+\|1_n h(k-1)\|\\ &\leq \|s(k-1)-1_ng(k-1) \|+\beta\|x(k-1)-1_n \bar x(k-1) \|+\beta \sqrt{n}\|\bar x(k-1)-x^*\|. \end{aligned}$$

Therefore,

$$\begin{aligned} \|x(k)-x(k-1) \|&=\|Wx(k-1)-x(k-1)-\eta s(k-1) \|\\ &=\|(W-I)(x(k-1)-1_n\bar x(k-1))-\eta s(k-1) \|\\ &\leq 2\|x(k-1)-1_n\bar x(k-1)\|+\eta\|s(k-1) \|\\ &\leq \eta\|s(k-1)-1_ng(k-1) \|+(\eta\beta+2)\|x(k-1)-1_n \bar x(k-1) \|+\eta\beta \sqrt{n}\|\bar x(k-1)-x^*\|. \end{aligned}$$

Substituting this into (6) gives

$$\begin{aligned} \|s(k)-1_n g(k)\|&\leq (\sigma+\beta \eta)\|s(k-1)-1_n g(k-1) \|\\ &\quad +\beta(\eta \beta+2)\|x(k-1)-1_n \bar x(k-1) \|+\eta\beta^2 \sqrt{n}\|\bar x(k-1)-x^*\|. \end{aligned}\qquad (10)$$

Combining (7), (9) (multiplied by $\sqrt{n}$), and (10), and defining

$$z(k)=\begin{bmatrix}\|s(k)-1_ng(k)\|\\ \|x(k)-1_n\bar x(k)\|\\ \sqrt{n}\,\|\bar x(k)-x^*\|\end{bmatrix},\qquad G(\eta)=\begin{bmatrix}\sigma+\eta\beta & \beta(\eta\beta+2) & \eta\beta^2\\ \eta & \sigma & 0\\ 0 & \eta\beta & \lambda\end{bmatrix},$$

we obtain the entrywise inequality $z(k)\leq G(\eta)z(k-1)$. Since $z(k)$ and $G(\eta)$ are nonnegative, iterating directly gives

$$z(k)\leq G(\eta)^kz(0).$$

Since $G(\eta)$ is nonnegative and $G(\eta)^2$ is entrywise positive, every entry of $G(\eta)^k$ is $O(\rho(G(\eta))^k)$ [3], so every entry of $z(k)$ converges to zero at the linear rate $\rho(G(\eta))^k$, provided $\eta$ is small enough that the spectral radius satisfies $\rho(G(\eta))<1$.
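One can assemble $G(\eta)$ from the three inequalities (row 1 from (10), row 2 from (7), row 3 from (9) scaled by $\sqrt n$) and check its spectral radius numerically. The constants below ($\sigma$, $\alpha$, $\beta$, $\eta$) are illustrative hand-picked values, not from the paper:

```python
import numpy as np

sigma, alpha, beta = 0.75, 1.0, 4.0  # illustrative network/function constants
eta = 0.002                          # a small step size
lam = max(abs(1 - eta * alpha), abs(1 - eta * beta))

# G(eta) acting on z(k) = (||s - 1g||, ||x - 1 x_bar||, sqrt(n)||x_bar - x*||).
G = np.array([
    [sigma + eta * beta, beta * (eta * beta + 2), eta * beta ** 2],
    [eta,                sigma,                   0.0],
    [0.0,                eta * beta,              lam],
])
rho = np.max(np.abs(np.linalg.eigvals(G)))  # spectral radius rho(G(eta))
```

For this choice of constants, `rho` falls strictly below 1, so the bound $z(k)\leq G(\eta)^k z(0)$ certifies linear convergence; taking $\eta$ too large pushes `rho` above 1 and the bound becomes vacuous.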


  1. Qu, G., & Li, N. (2018). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260. https://doi.org/10.1109/TCNS.2017.2698261 ↩︎

  2. Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4), 231–357. https://doi.org/10.1561/2200000050. ↩︎

  3. Horn, R. A., & Johnson, C. R. (2012). Matrix Analysis (2nd ed.). Cambridge, U.K.: Cambridge Univ. Press. ↩︎