# Subgradient Methods in 10 Minutes

Now that we’ve covered topics from convex optimization in a very shallow manner, it’s time to go deeper.

## Subgradient methods

What is a subgradient? It’s a substitute for the gradient, when the gradient doesn’t exist.

### Convexity Review

To understand subgradients, first we need to review convexity. The Taylor series of a real-valued function that is infinitely differentiable at a real number \(x\) is given by

\[f(y) \approx \sum\limits_{k=0}^{\infty} \frac{f^{(k)}(x)}{k!}(y-x)^k\]

Expanding terms, the series reads

\[f(y) \approx f(x) + \frac{f^{\prime}(x)}{1!}(y-x) + \frac{f^{\prime\prime}(x)}{2!}(y-x)^2 + \frac{f^{\prime\prime\prime}(x)}{3!}(y-x)^3 + \cdots.\]

If a function is convex, its first-order Taylor expansion (a tangent line) will always be a global underestimator:

\[f(y) \geq f(x) + f^{\prime}(x)(y-x)\]

In the multivariate case the tangent line becomes a tangent hyperplane, \(f(y) \geq f(x) + \nabla f(x)^T (y-x)\). Setting \(y = x + \delta\):

\[\begin{aligned} f(x+ \delta) &\geq f(x) + \nabla f(x)^T (x+\delta-x) \\ f(x+ \delta) &\geq f(x) + \nabla f(x)^T \delta \end{aligned}\]

As the figure illustrates, an affine function (the tangent) is always a global underestimator of this quadratic (hence convex) function.
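As a quick numerical sanity check of the first-order condition, the snippet below (a toy sketch; the choice \(f(x) = x^2\) and the tangency point \(x = 1\) are arbitrary, not from the text) verifies that the tangent line never exceeds \(f\) on a grid of points:

```python
import numpy as np

# Toy check: for the convex function f(x) = x**2, the tangent line at x
# is a global underestimator: f(y) >= f(x) + f'(x) * (y - x) for all y.
f = lambda y: y ** 2
fprime = lambda y: 2 * y

x = 1.0                                  # point of tangency (arbitrary)
ys = np.linspace(-5.0, 5.0, 101)
tangent = f(x) + fprime(x) * (ys - x)    # first-order Taylor expansion at x

# True whenever the tangent never exceeds f (up to floating-point slack)
underestimates = bool(np.all(f(ys) >= tangent - 1e-12))
```

Here \(f(y) - \text{tangent} = (y-1)^2 \geq 0\), so the check passes exactly, not just numerically.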

## Subgradients

Consider a function \(f\) with a kink that renders it non-differentiable at a point \(x_2\).

The derivative jumps up by some amount at this kink, leaving an interval of admissible slopes; any slope within that interval is a valid subgradient. Each subgradient also defines a supporting hyperplane to the epigraph of the function. The set of subgradients of \(f\) at \(x\) is called the subdifferential \(\partial f(x)\) (a point-to-set mapping). To determine whether a function is subdifferentiable at \(x_0\), we must ask, “is there a global affine lower bound on this function that is tight at this point?”.

- If a function is convex, it has at least one subgradient at every point in the relative interior of its domain.
- If a function is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\) (a singleton set).

For example, for the absolute value function \(f(x) = |x|\), the interval \( [\nabla f_{\text{left}}(x_0), \nabla f_{\text{right}}(x_0)] \) between the left and right derivatives gives the range of valid slopes (subgradients); at the kink \(x_0 = 0\), this interval is \([-1, 1]\).
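A minimal sketch of this idea in code: the helper `subdiff_abs` below (a hypothetical name, for illustration only) returns the subdifferential of \(f(x) = |x|\) as an interval \((\text{lo}, \text{hi})\):

```python
def subdiff_abs(x, tol=1e-12):
    """Subdifferential of f(x) = |x|, returned as an interval (lo, hi).

    Away from 0 the function is differentiable, so the set is the
    singleton {sign(x)}; at the kink x = 0 every slope in [-1, 1]
    defines a valid global affine underestimator of |x|.
    """
    if x > tol:
        return (1.0, 1.0)
    if x < -tol:
        return (-1.0, -1.0)
    return (-1.0, 1.0)
```

For instance, `subdiff_abs(0.0)` returns the whole interval `(-1.0, 1.0)`, while `subdiff_abs(2.0)` collapses to the singleton `(1.0, 1.0)`.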

### Basic Subgradient Method: Negative Subgradient Update

From [1]: given a convex function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\), not necessarily differentiable, the subgradient method is just like gradient descent, but with subgradients in place of gradients. That is, initialize \(x^{(0)}\), then repeat

\[x^{(k+1)} = x^{(k)} - \alpha_k \cdot g^{(k)}, \quad k = 0, 1, 2, 3, \cdots\]

where \(g^{(k)}\) is **any** subgradient of \(f\) at \(x^{(k)}\), and \(\alpha_k > 0\) is the \(k\)’th step size.

Unlike gradient descent, in the negative subgradient update it’s entirely possible that \(-g^{(k)}\) is not a descent direction for \(f\) at \(x^{(k)}\). In such cases the step cannot decrease the objective, so \(f(x^{(k+1)}) \geq f(x^{(k)})\): an iteration of the subgradient method can increase the objective function. To resolve this issue, we keep track of the best iterate \(x_{\text{best}}^{(k)}\) among \(x^{(1)}, \cdots, x^{(k)}\):

\[f(x_{\text{best}}^{(k)}) = \underset{i=1,\cdots ,k}{\min} f(x^{(i)})\]

There are at least five common ways to select the step size \(\alpha_k\), all with different convergence properties:

- Constant step size
- Constant step length
- Square summable but not summable
- Nonsummable diminishing
- Nonsummable diminishing step lengths

See [2] for details regarding each of these possible step size choices and the **associated guarantees on convergence**.
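To make the update and the best-iterate bookkeeping concrete, here is a small sketch on a synthetic problem. The data, the choice \(f(x) = \|Ax - b\|_1\), and the use of Polyak’s step size \(\alpha_k = (f(x^{(k)}) - f^\star)/\|g^{(k)}\|_2^2\) (also covered in [2], and applicable here because the data is constructed so that \(f^\star = 0\)) are all illustrative choices, not prescribed by the text:

```python
import numpy as np

# Minimize f(x) = ||Ax - b||_1, which is convex but non-differentiable
# wherever a residual is zero. A valid subgradient is g = A^T sign(Ax - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = A @ np.array([1.0, -2.0, 0.5])       # built so the optimal value f* = 0

f = lambda x: np.abs(A @ x - b).sum()

x = np.zeros(3)
f_best, x_best = f(x), x.copy()
for k in range(3000):
    g = A.T @ np.sign(A @ x - b)         # any subgradient of f at x will do
    gnorm2 = g @ g
    if gnorm2 < 1e-20:                   # g = 0 certifies optimality
        break
    x = x - (f(x) / gnorm2) * g          # Polyak step (f* = 0 here)
    if f(x) < f_best:                    # subgradient steps need not descend,
        f_best, x_best = f(x), x.copy()  # so track the best iterate seen
```

Swapping the Polyak step for one of the diminishing rules above (e.g. \(\alpha_k = 1/k\)) still converges, just without needing \(f^\star\); the best-iterate tracking is what guarantees \(f(x_{\text{best}}^{(k)})\) is monotone even though individual steps are not.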

### Subgradient methods for constrained problems

### Primal-dual subgradient methods

### Stochastic subgradient method

### Mirror descent and variable metric methods

## Localization methods

### Localization and cutting-plane methods

### Analytic center cutting-plane method

### Ellipsoid method

## Decomposition and distributed optimization

### Primal and dual decomposition

### Decomposition applications

### Distributed optimization via circuits

## Proximal and operator splitting methods

### Proximal algorithms

### Monotone operators

### Monotone operator splitting methods

### Alternating direction method of multipliers (ADMM)

## Conjugate gradients

### Conjugate-gradient method

### Truncated Newton methods

## Nonconvex problems

### \(l_1\) methods for convex-cardinality problems

### \(l_1\) methods for convex-cardinality problems, part II

### Sequential convex programming

## Branch-and-bound methods

## References

[1] Gordon, Geoff. CMU 10-725 Optimization Fall 2012 Lecture Slides, Lecture 6.

[2] Boyd, Stephen. Subgradient Methods: Notes for EE 364B, January 2007.