Subgradient Methods in 10 Minutes
Now that we’ve covered topics from convex optimization in a very shallow manner, it’s time to go deeper.
Subgradient methods
What is a subgradient? It’s a substitute for the gradient at points where the gradient doesn’t exist.
Convexity Review
To understand subgradients, first we need to review convexity. The Taylor series of a real-valued function \(f\) that is infinitely differentiable at a real number \(x\) is given by
\[f(y) \approx \sum\limits_{k=0}^{\infty} \frac{f^{(k)}(x)}{k!}(y-x)^k\]Writing out the first few terms, the series is
\[f(y) \approx f(x) + \frac{f^{\prime}(x)}{1!}(y-x) + \frac{f^{\prime\prime}(x)}{2!}(y-x)^2 + \frac{f^{\prime\prime\prime}(x)}{3!}(y-x)^3 + \cdots.\]If a function is convex, its first-order Taylor expansion (a tangent line) is always a global underestimator:
\[f(y) \geq f(x) + \frac{f^{\prime}(x)}{1!}(y-x)\]If \(y=x + \delta\), then:
\[\begin{aligned} f(x+ \delta) &\geq f(x) + \nabla f(x)^T (x+\delta-x) \\ f(x+ \delta) &\geq f(x) + \nabla f(x)^T \delta \end{aligned}\](written with the gradient \(\nabla f(x)\), which covers the multivariate case). Geometrically, the tangent line (an affine function) never lies above a convex function, such as a quadratic.
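As a quick numerical sanity check, here is a minimal sketch assuming the convex quadratic \(f(x) = x^2\) (an illustrative choice, not part of the text above); it verifies that the tangent line formed at a point never exceeds the function:

```python
# Minimal sanity check: for a convex f, the first-order Taylor expansion at x
# is a global underestimator, i.e. f(y) >= f(x) + f'(x) * (y - x) for all y.
import numpy as np

def f(x):
    return x ** 2          # convex quadratic (illustrative choice)

def fprime(x):
    return 2 * x           # its derivative

x = 1.5                                  # point where we form the tangent line
ys = np.linspace(-5.0, 5.0, 101)         # grid of points at which to test the bound
tangent = f(x) + fprime(x) * (ys - x)    # first-order Taylor expansion at x

assert np.all(f(ys) >= tangent - 1e-12)  # the affine lower bound holds on the grid
print("tangent line never exceeds f on the sampled grid")
```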
Subgradients
Consider a function \(f\) with a kink, i.e., a point \(x_2\) at which the function is not differentiable.
At the kink, the derivative jumps from the left-hand slope to the right-hand slope, and any slope in between is a valid subgradient. Formally, \(g\) is a subgradient of \(f\) at \(x\) if \(f(y) \geq f(x) + g^T(y-x)\) for all \(y\); geometrically, a subgradient defines a supporting hyperplane to the epigraph of the function. The set of all subgradients of \(f\) at \(x\) is called the subdifferential \(\partial f(x)\) (a point-to-set mapping). To determine whether a function is subdifferentiable at \(x_0\), we ask, “is there a global affine lower bound on this function that is tight at this point?”.
- If a function is convex, it is subdifferentiable (has at least one subgradient) at every point in the relative interior of its domain.
- If a function is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\) (a singleton set).
For example, at a kink \(x_0\) the subdifferential is the interval \([f_{-}^{\prime}(x_0), f_{+}^{\prime}(x_0)]\) between the left and right derivatives, i.e., the range of possible slopes (subgradients); for the absolute value function at \(x_0 = 0\), this interval is \([-1, 1]\).
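As a small illustration in code, the following sketch (the helper name subgradient_abs is my own, not from any reference above) returns one valid subgradient of the absolute value function:

```python
# One valid subgradient of f(x) = |x|.  Away from 0 the gradient is unique;
# at the kink x = 0, any slope t in the interval [-1, 1] is a valid subgradient.
def subgradient_abs(x, t=0.0):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return t  # any choice in [-1, 1] works at x = 0

print(subgradient_abs(3.0))   # 1.0  -- the ordinary gradient
print(subgradient_abs(0.0))   # 0.0  -- one of infinitely many subgradients at the kink
```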
Basic Subgradient Method: Negative Subgradient Update
From [1]: consider a convex function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\), not necessarily differentiable. The subgradient method is just like gradient descent, but with subgradients in place of gradients: initialize \(x^{(0)}\), then repeat
\[\begin{array}{ll} x^{(k+1)} = x^{(k)} - \alpha_k \cdot g^{(k)}, & k = 0, 1, 2, 3, \cdots \end{array}\]where \(g^{(k)}\) is any subgradient of \(f\) at \(x^{(k)}\), and \(\alpha_k > 0\) is the \(k\)th step size.
Unlike gradient descent, in the negative subgradient update it’s entirely possible that \(-g^{(k)}\) is not a descent direction for \(f\) at \(x^{(k)}\). In such cases \(f(x^{(k+1)}) \geq f(x^{(k)})\), i.e., an iteration of the subgradient method can increase the objective function, so the method is not a descent method. To deal with this, we keep track of the best iterate \(x_{best}^{(k)}\) among \(x^{(1)}, \cdots, x^{(k)}\):
\[f(x_{best}^{(k)}) = \min_{i=1,\cdots,k} f(x^{(i)})\]For the update, there are at least five common ways to select the step size \(\alpha_k\), each with different convergence properties:
- Constant step size
- Constant step length
- Square summable but not summable
- Nonsummable diminishing
- Nonsummable diminishing step lengths
See [2] for details regarding each of these possible step size choices and the associated guarantees on convergence.
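To make the method concrete, here is a minimal sketch in Python, assuming an illustrative objective \(f(x) = \|Ax - b\|_1\) and the nonsummable diminishing step size \(\alpha_k = 1/(k+1)\) (neither choice is prescribed by [1] or [2]); it performs the negative subgradient update and tracks the best iterate:

```python
# Subgradient method sketch on f(x) = ||A x - b||_1 with a nonsummable
# diminishing step size alpha_k = 1 / (k + 1).  Problem data is random and
# purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def f(x):
    return np.abs(A @ x - b).sum()

def subgradient(x):
    # A^T sign(A x - b) is a subgradient of ||A x - b||_1
    # (taking sign(0) = 0 at the kinks).
    return A.T @ np.sign(A @ x - b)

x = np.zeros(5)
x_best, f_best = x.copy(), f(x)

for k in range(500):
    g = subgradient(x)
    alpha = 1.0 / (k + 1)                 # nonsummable diminishing step size
    x = x - alpha * g                     # negative subgradient update
    if f(x) < f_best:                     # the objective can go up between
        x_best, f_best = x.copy(), f(x)   # iterations, so track the best iterate

print(f"best objective value after 500 iterations: {f_best:.4f}")
```

Note that the reported quantity is \(f(x_{best}^{(k)})\), not \(f(x^{(k)})\), since individual iterations may increase the objective.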
Subgradient methods for constrained problems
Primal-dual subgradient methods
Stochastic subgradient method
Mirror descent and variable metric methods
Localization methods
Localization and cutting-plane methods
Analytic center cutting-plane method
Ellipsoid method
Decomposition and distributed optimization
Primal and dual decomposition
Decomposition applications
Distributed optimization via circuits
Proximal and operator splitting methods
Proximal algorithms
Monotone operators
Monotone operator splitting methods
Alternating direction method of multipliers (ADMM)
Conjugate gradients
Conjugate-gradient method
Truncated Newton methods
Nonconvex problems
\(l_1\) methods for convex-cardinality problems
\(l_1\) methods for convex-cardinality problems, part II
Sequential convex programming
Branch-and-bound methods
References
[1] Gordon, Geoff. CMU 10-725 Optimization Fall 2012 Lecture Slides, Lecture 6.
[2] Boyd, Stephen. Subgradient Methods: Notes for EE 364B, January 2007.