
First-Order Optimization Techniques

Some notes on first-order optimization techniques

ML


Introduction

This section details fundamental optimization algorithms used to tackle machine learning problems. First-order optimality conditions codify how the first derivatives characterize the minima of functions. We’ll explore fundamental concepts related to hyperplanes and, in particular, the first-order Taylor series approximation.

First-Order Optimality Condition

When the derivative of a function is zero at a point, that point is a potential minimum. Analogously, for multi-input functions, any $N$-dimensional point $\mathbf{v}$ where every partial derivative of $g$ is zero is a potential minimum. This system of $N$ equations is referred to as the first-order system of equations:

$$\frac{\partial g}{\partial w_1}(\mathbf{v}) = 0, \quad \frac{\partial g}{\partial w_2}(\mathbf{v}) = 0, \quad \ldots, \quad \frac{\partial g}{\partial w_N}(\mathbf{v}) = 0$$

We can also write the first-order system more compactly using gradient notation as $\nabla g(\mathbf{v}) = \mathbf{0}_{N \times 1}$.

This useful characterization of minimum points is the first-order analog to the zero-order condition for optimality. However, there are two problems with the first-order characterization of minima:

  • It is virtually impossible to solve a general function’s first-order system of equations “by hand” (algebraically for closed-form solutions)
  • It captures not only minima but also the maxima and saddle points of nonconvex functions; only for convex functions (like quadratics) does it exclusively identify global minima. These minima, maxima, and saddle points are collectively referred to as stationary or critical points of a function.
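
As a quick illustration, consider the convex quadratic $g(w_1, w_2) = w_1^2 + w_2^2 + 2w_1$ (an example chosen here for concreteness). Its first-order system

$$\frac{\partial g}{\partial w_1}(\mathbf{v}) = 2v_1 + 2 = 0, \qquad \frac{\partial g}{\partial w_2}(\mathbf{v}) = 2v_2 = 0$$

can be solved by hand, giving the single stationary point $\mathbf{v} = (-1, 0)$, which is the global minimum since $g$ is convex.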

Coordinate Descent and the First-Order Optimality Condition

While solving the first-order system simultaneously is often impossible, it is sometimes possible to solve it sequentially. This approach is called coordinate descent and is effective when each of the $N$ equations can be solved in closed form.

To solve the first-order system sequentially, we initialize an input $\mathbf{w}^0$ and begin by updating the first coordinate by solving

$$\frac{\partial g}{\partial w_1}(\mathbf{w}^0) = 0$$

for the optimal weight $w_1^*$. All other weights are kept at their initial values. We then update the first coordinate of the vector $\mathbf{w}^0$ with the solution $w_1^*$. Continuing this pattern, we update the $n$-th weight by solving

$$\frac{\partial g}{\partial w_n}(\mathbf{w}^{n-1}) = 0$$

for $w_n^*$. After sweeping through all $N$ weights a single time, we can refine the solution by sweeping through the weights again. At the $k$-th such sweep, we update the $n$-th weight by solving

$$\frac{\partial g}{\partial w_n}(\mathbf{w}^{N(k-1)+n-1}) = 0$$
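
As a minimal sketch (not part of the original notes), here is coordinate descent applied to a simple quadratic $g(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{Q}\mathbf{w} + \mathbf{c}^T\mathbf{w}$, where each equation in the first-order system can be solved in closed form; the matrix $\mathbf{Q}$ and vector $\mathbf{c}$ below are illustrative choices.

import numpy as np

# coordinate descent for the quadratic g(w) = 0.5 * w^T Q w + c^T w (Q symmetric),
# whose n-th first-order equation  Q[n, :] @ w + c[n] = 0  has a closed-form solution
def coordinate_descent(Q, c, w, num_sweeps=10):
    N = len(w)
    for k in range(num_sweeps):        # k-th sweep over all coordinates
        for n in range(N):             # update the n-th weight, holding the rest fixed
            w[n] = -(Q[n, :] @ w - Q[n, n] * w[n] + c[n]) / Q[n, n]
    return w

# illustrative example with a small positive definite Q
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([-1.0, 1.0])
w = coordinate_descent(Q, c, np.zeros(2))
print(w, np.linalg.solve(Q, -c))       # coordinate descent result vs. exact minimizer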

Geometry of First-Order Taylor Series

Here, we describe characteristics of hyperplanes including directions of steepest ascent and steepest descent. We then study the first-order Taylor series approximation to a function.

Anatomy of Hyperplanes

A general $N$-dimensional hyperplane is characterized as

$$h(w_1, w_2, w_3, \ldots, w_N) = a + b_1 w_1 + b_2 w_2 + \cdots + b_N w_N$$

where $a$ as well as $b_1$ through $b_N$ are all scalar parameters. We can rewrite $h$ more compactly as

$$h(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w}$$

where $\mathbf{b}$ and $\mathbf{w}$ are vectors.

Notice that the hyperplane is an $N$-dimensional object living in an $(N+1)$-dimensional ambient space: its input space is $N$-dimensional, and the output adds one more dimension.

Steepest Ascent/Descent Directions

When $N = 1$, there are only two directions in which $w$ can 'move'. Moving right or left yields an ascent (or descent) direction for any arbitrary hyperplane $h$. In the case where $N > 1$, however, there are infinitely many directions to move in, including some that preserve the value of $h$. Thus, it is logical to ask whether we can find the direction that produces the largest ascent (or descent), commonly referred to as the direction of steepest ascent/descent.

We aim to find the unit direction $\mathbf{d}$ such that the value of $h(\mathbf{w}^0 + \mathbf{d})$ is maximal, where $\mathbf{w}^0$ is a given point. In other words, we aim to solve

$$\max_{\mathbf{d}} \, h(\mathbf{w}^0 + \mathbf{d}) \quad \text{subject to} \quad \|\mathbf{d}\| = 1$$

Note that $h(\mathbf{w}^0 + \mathbf{d})$ can be written as

$$h(\mathbf{w}^0 + \mathbf{d}) = a + \mathbf{b}^T(\mathbf{w}^0 + \mathbf{d}) = a + \mathbf{b}^T\mathbf{w}^0 + \mathbf{b}^T\mathbf{d}$$

where the first two terms on the right-hand side are constant with respect to $\mathbf{d}$.

Therefore, maximizing the value of $h(\mathbf{w}^0 + \mathbf{d})$ is equivalent to maximizing $\mathbf{b}^T\mathbf{d}$, which can be written using the inner product as

$$\mathbf{b}^T\mathbf{d} = \|\mathbf{b}\| \, \|\mathbf{d}\| \cos(\theta)$$

where $\theta$ is the angle between $\mathbf{b}$ and $\mathbf{d}$.

Since $\mathbf{b}$ does not change with respect to $\mathbf{d}$ and $\|\mathbf{d}\| = 1$, we aim to solve $\max_{\mathbf{d}} \cos(\theta)$.

The maximum occurs when $\theta = 0$, or equivalently, when $\mathbf{d} = \frac{\mathbf{b}}{\|\mathbf{b}\|}$. Similarly, the minimum occurs when $\mathbf{d} = -\frac{\mathbf{b}}{\|\mathbf{b}\|}$.
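
As a quick numerical check (a sketch, not from the original notes), we can compare $\mathbf{b}^T\mathbf{d}$ for $\mathbf{d} = \frac{\mathbf{b}}{\|\mathbf{b}\|}$ against a large batch of random unit directions; the specific vector $\mathbf{b}$ here is arbitrary.

import numpy as np

# confirm numerically that d = b / ||b|| maximizes b^T d over unit-length directions
rng = np.random.default_rng(0)
b = rng.normal(size=5)                       # slope vector of an arbitrary hyperplane
d_star = b / np.linalg.norm(b)               # claimed steepest ascent direction

random_dirs = rng.normal(size=(1000, 5))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

print(b @ d_star)                   # equals ||b||, the largest possible value
print((random_dirs @ b).max())      # no random unit direction exceeds ||b||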

The Gradient

A multi-input function $g(\mathbf{w})$ can be approximated locally around a given point $\mathbf{w}^0$ by its first-order Taylor series approximation, the hyperplane

$$h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0)$$

This can be rewritten in the form $h(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w}$, where:

  • $a = g(\mathbf{w}^0) - \nabla g(\mathbf{w}^0)^T\mathbf{w}^0$
  • $\mathbf{b} = \nabla g(\mathbf{w}^0)$

This hyperplane is tangent to $g$ at the point $\mathbf{w}^0$. Because $h$ is designed to approximate $g$ near $\mathbf{w}^0$, its steepest ascent and descent directions also tell us which direction to travel to increase or decrease the value of the underlying function $g$ itself near $\mathbf{w}^0$.

The gradient $\nabla g(\mathbf{w}^0)$ therefore points in the direction of steepest ascent, while $-\nabla g(\mathbf{w}^0)$ points in the direction of steepest descent.
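
To make this concrete, here is a small sketch (assuming the autograd library discussed below; the function $g$ and the point $\mathbf{w}^0$ are illustrative, not from the original notes) that builds the tangent hyperplane and compares it to $g$ at a nearby point.

import autograd.numpy as np
from autograd import grad

# an example function and the point at which we form the tangent hyperplane
g = lambda w: np.sin(w[0]) + w[1]**2
w0 = np.array([1.0, 0.5])

nabla_g = grad(g)                  # gradient function of g
a = g(w0) - nabla_g(w0) @ w0       # a = g(w0) - grad g(w0)^T w0
b = nabla_g(w0)                    # b = grad g(w0)

h = lambda w: a + b @ w            # tangent hyperplane h(w) = a + b^T w

w = w0 + 0.01                      # a nearby point
print(g(w), h(w))                  # h(w) closely approximates g(w) near w0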

Computing Gradients Efficiently

We can compute the derivative of relatively simple functions like $g(w) = \sin(w^2)$ easily by applying differentiation rules and derivatives of elementary functions. However, for complicated functions common in machine learning, we need automatic differentiation tools as an alternative to manual computation.

For Python, popular automatic differentiation libraries include:

  • autograd - Simple and lightweight automatic differentiation
  • JAX - High-performance automatic differentiation with GPU support
  • PyTorch - Deep learning framework with built-in autodiff
  • TensorFlow - Another deep learning framework with automatic differentiation

These tools compute gradients using the chain rule systematically, enabling efficient optimization of complex multi-layer functions.
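
For instance, the derivative of the simple function above can be computed with autograd and checked against the hand-derived result $2w\cos(w^2)$ (a brief sketch, not from the original notes):

import autograd.numpy as np
from autograd import grad

g = lambda w: np.sin(w**2)       # the simple example function from above
dg = grad(g)                     # automatically differentiated derivative of g

w = 1.5
print(dg(w))                     # autograd's derivative at w
print(2 * w * np.cos(w**2))      # hand-computed derivative 2w cos(w^2)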

Gradient Descent

We have established that the negative gradient of a function computed at a particular point always defines a valid descent direction at that point. We can construct a local optimization method consisting of steps of the general form

$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \mathbf{d}^k$$

By employing the negative gradient as the descent direction $\mathbf{d}^k = -\nabla g(\mathbf{w}^{k-1})$, the sequence of steps takes the form

$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g(\mathbf{w}^{k-1})$$

Provided an appropriate step size $\alpha$, this iterative process converges to a point near a local minimum of the target function $g$. This is known as the gradient descent algorithm.

Gradient descent is often far superior to local zero-order optimization methods and is the most popular optimization algorithm used in machine learning. This superiority stems from the fact that the descent direction (via the gradient) is almost always easier to compute than seeking out a descent direction at random, particularly as the dimension of the input space increases.

Step Length Choices

As with all local optimization methods, the step length or learning rate parameter $\alpha$ needs to be carefully chosen. For gradient descent, the most common choices include:

  1. Fixed step size: Using a constant $\alpha$ value for each step of a gradient descent run
  2. Diminishing step size: Using a decreasing step size such as $\alpha = \frac{1}{k}$ at the $k$-th step

In both cases, the aim is to choose $\alpha$ to induce the most rapid minimization possible.

Mathematical considerations for step size:

  • Too large: $\alpha > \frac{2}{\lambda_{\max}}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian (for quadratic functions)
  • Too small: Convergence becomes unnecessarily slow
  • Optimal: $\alpha = \frac{2}{\lambda_{\min} + \lambda_{\max}}$ for quadratic functions (see the sketch below)
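
As a small sketch (the quadratic and settings below are illustrative, not from the original notes), the following compares a conservative fixed step size, the eigenvalue-based optimal fixed step size for a quadratic, and a diminishing $\alpha = \frac{1}{k}$ schedule.

import numpy as np

# quadratic g(w) = 0.5 * w^T Q w with gradient Q w and eigenvalues 1 and 2
Q = np.diag([1.0, 2.0])
grad_g = lambda w: Q @ w
w_init = np.array([10.0, 1.0])
lam_min, lam_max = 1.0, 2.0

def run(step_size_fn, num_steps=50):
    w = w_init.copy()
    for k in range(1, num_steps + 1):
        w = w - step_size_fn(k) * grad_g(w)   # gradient descent step with alpha(k)
    return 0.5 * w @ Q @ w                    # final cost

print(run(lambda k: 0.05))                         # small fixed step: slow progress
print(run(lambda k: 2 / (lam_min + lam_max)))      # optimal fixed step for this quadratic
print(run(lambda k: 1 / k))                        # diminishing step size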

Oscillation in Cost Function History

In practice, we use the cost function history plot to tune the step length parameter $\alpha$. When analyzing the cost function history, it is not ultimately important that the plot is strictly decreasing. What is critically important is finding a value of $\alpha$ that allows the algorithm to find the lowest value efficiently.

The best choice of $\alpha$ for a given minimization problem may cause the algorithm to 'hop around' and not induce a strict descent at every step, but still converge to the optimal solution faster than a more conservative choice.

Convergence Criteria

In principle, we can wait for gradient descent to get sufficiently close to a stationary point by ensuring the magnitude of the gradient is sufficiently small: $\|\nabla g(\mathbf{w}^k)\| < \epsilon$

Other formal convergence criteria include:

  1. Step size criterion: Halt when steps no longer make sufficient progress: $\|\mathbf{w}^k - \mathbf{w}^{k-1}\| < \epsilon$

  2. Function value criterion: Stop when function evaluations no longer differ substantially: $|g(\mathbf{w}^k) - g(\mathbf{w}^{k-1})| < \epsilon$

  3. Maximum iterations: Set a maximum number of iterations to prevent infinite loops

The tolerance $\epsilon$ and maximum iteration count are typically set based on computational constraints and desired precision.
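
A sketch of how these checks might be wired into a gradient descent loop (the helper name, tolerances, and example function here are illustrative, not from the original notes); it uses the gradient-magnitude, step-size, and maximum-iterations criteria.

import numpy as np

def gradient_descent_with_stopping(grad_g, w, alpha=0.1, max_its=1000, eps=1e-6):
    for k in range(max_its):                   # maximum-iterations criterion
        grad_eval = grad_g(w)
        if np.linalg.norm(grad_eval) < eps:    # gradient-magnitude criterion
            break
        w_new = w - alpha * grad_eval
        if np.linalg.norm(w_new - w) < eps:    # step-size criterion
            w = w_new
            break
        w = w_new
    return w

# example: minimize g(w) = w_1^2 + 2 w_2^2, whose gradient is (2 w_1, 4 w_2)
grad_g = lambda w: np.array([2 * w[0], 4 * w[1]])
print(gradient_descent_with_stopping(grad_g, np.array([3.0, -2.0])))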

Python Implementation

# import automatic differentiator to compute the gradient
from autograd import grad

# gradient descent function
def gradient_descent(g, alpha, max_its, w):

    # compute the gradient function of g using autograd
    gradient = grad(g)

    # gradient descent loop
    weight_history = [w]    # weight history container
    cost_history = [g(w)]   # cost function history container
    for k in range(max_its):

        # evaluate the gradient
        grad_eval = gradient(w)

        # take a gradient descent step
        w = w - alpha * grad_eval

        # record weight and cost
        weight_history.append(w)
        cost_history.append(g(w))

    return weight_history, cost_history
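
A possible usage sketch (the function, step size, and starting point here are illustrative, not from the original notes):

import autograd.numpy as np

# minimize g(w) = w_1^2 + w_2^2 + 2 w_1 starting from (2, 2)
g = lambda w: w[0]**2 + w[1]**2 + 2 * w[0]
weight_history, cost_history = gradient_descent(g, alpha=0.1, max_its=100,
                                                w=np.array([2.0, 2.0]))
print(weight_history[-1])   # approaches the minimizer (-1, 0)
print(cost_history[-1])     # approaches the minimum value -1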

Two Natural Weaknesses of Gradient Descent

Like any vector, the negative gradient consists of both direction and magnitude. Depending on the function being minimized, either one of these attributes, or both, can present significant challenges:

1. Oscillating Direction Problem

The direction of the negative gradient can rapidly oscillate during optimization, often producing zig-zag steps that take considerable time to reach minima. This occurs when:

  • The condition number $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ is large
  • The function has elongated contours (elliptical level sets)

Mathematical insight: For a quadratic function $g(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{Q}\mathbf{w}$, the convergence rate is governed by

$$\left\|\mathbf{w}^k - \mathbf{w}^*\right\| \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \left\|\mathbf{w}^0 - \mathbf{w}^*\right\|$$
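
A small sketch of the zig-zag behaviour on a poorly conditioned quadratic (the matrix and starting point are illustrative, not from the original notes):

import numpy as np

# ill-conditioned quadratic g(w) = 0.5 * w^T Q w with condition number kappa = 50
Q = np.diag([1.0, 50.0])
grad_g = lambda w: Q @ w

w = np.array([50.0, 1.0])
alpha = 2 / (1.0 + 50.0)            # optimal fixed step size for this quadratic
for k in range(5):
    w = w - alpha * grad_g(w)
    print(w)                        # the second coordinate flips sign each step (zig-zag)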

2. Vanishing Gradient Problem

The magnitude of the gradient can vanish rapidly near stationary points, leading to gradient descent crawling slowly near minima and saddle points. This manifests as

$$\|\nabla g(\mathbf{w}^k)\| \to 0 \ \text{ as } \ \mathbf{w}^k \to \mathbf{w}^*$$

Practical implications: These problems are particularly prevalent in machine learning because many objective functions have:

  • Long flat regions where contours become increasingly parallel
  • High-dimensional parameter spaces with poor conditioning
  • Complex loss landscapes with multiple local minima and saddle points

Advanced Topics and Extensions

Convergence Analysis for Convex Functions

For strongly convex functions with Lipschitz continuous gradients, gradient descent enjoys guaranteed convergence rates. Specifically, if $g$ is $\mu$-strongly convex and $L$-smooth, then (with step size $\alpha = \frac{1}{L}$)

$$g(\mathbf{w}^k) - g(\mathbf{w}^*) \leq \left(1 - \frac{\mu}{L}\right)^k \left[g(\mathbf{w}^0) - g(\mathbf{w}^*)\right]$$

This shows linear convergence with rate $\rho = 1 - \frac{\mu}{L}$.
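
A quick numerical check of this rate on a quadratic, where $\mu$ and $L$ are the smallest and largest eigenvalues of the Hessian (a sketch assuming the step size $\alpha = \frac{1}{L}$; the specific numbers are illustrative, not from the original notes):

import numpy as np

# quadratic with Hessian eigenvalues mu = 1 and L = 4; minimum value 0 at w* = 0
mu, L = 1.0, 4.0
Q = np.diag([mu, L])
g = lambda w: 0.5 * w @ Q @ w
grad_g = lambda w: Q @ w

w = np.array([1.0, 1.0])
g0 = g(w)
for k in range(1, 11):
    w = w - (1 / L) * grad_g(w)      # gradient descent with step size 1/L
    bound = (1 - mu / L)**k * g0     # theoretical linear-rate bound
    print(g(w) <= bound + 1e-12)     # the bound holds at every step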

Line Search Methods

Instead of using a fixed step size, line search methods adaptively choose $\alpha^k$ at each iteration by (exactly or approximately) solving

$$\alpha^k = \arg\min_{\alpha > 0} \, g\left(\mathbf{w}^{k-1} - \alpha \nabla g(\mathbf{w}^{k-1})\right)$$

Popular line search conditions include:

  • Armijo condition: $g(\mathbf{w}^k) \leq g(\mathbf{w}^{k-1}) - c_1 \alpha^k \|\nabla g(\mathbf{w}^{k-1})\|^2$ (see the sketch below)
  • Wolfe conditions: Armijo condition plus the curvature condition $\nabla g(\mathbf{w}^k)^T \mathbf{d}^k \geq c_2 \nabla g(\mathbf{w}^{k-1})^T \mathbf{d}^k$
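
A minimal backtracking sketch based on the Armijo condition (the constant $c_1$, the shrink factor, and the test quadratic are illustrative choices, not from the original notes):

import numpy as np

def backtracking_step(g, grad_g, w, alpha0=1.0, c1=1e-4, shrink=0.5):
    # shrink alpha until the Armijo sufficient-decrease condition holds
    grad_eval = grad_g(w)
    alpha = alpha0
    while g(w - alpha * grad_eval) > g(w) - c1 * alpha * grad_eval @ grad_eval:
        alpha *= shrink
    return w - alpha * grad_eval

# example on the quadratic g(w) = 0.5 * w^T Q w
Q = np.diag([1.0, 10.0])
g = lambda w: 0.5 * w @ Q @ w
grad_g = lambda w: Q @ w

w = np.array([3.0, 1.0])
for k in range(10):
    w = backtracking_step(g, grad_g, w)
print(w, g(w))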

Momentum and Acceleration

Classical momentum modifies the update rule to

$$\mathbf{v}^k = \beta \mathbf{v}^{k-1} - \alpha \nabla g(\mathbf{w}^{k-1}), \qquad \mathbf{w}^k = \mathbf{w}^{k-1} + \mathbf{v}^k$$

Nesterov acceleration provides even better convergence rates by evaluating the gradient at a look-ahead point:

$$\mathbf{w}^k = \mathbf{w}^{k-1} + \beta\left(\mathbf{w}^{k-1} - \mathbf{w}^{k-2}\right) - \alpha \nabla g\left(\mathbf{w}^{k-1} + \beta\left(\mathbf{w}^{k-1} - \mathbf{w}^{k-2}\right)\right)$$
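
A sketch of the classical momentum update on an ill-conditioned quadratic (the settings $\alpha$, $\beta$, and the matrix are illustrative, not from the original notes):

import numpy as np

# classical momentum on a poorly conditioned quadratic g(w) = 0.5 * w^T Q w
Q = np.diag([1.0, 10.0])
grad_g = lambda w: Q @ w

alpha, beta = 0.05, 0.8
w = np.array([10.0, 1.0])
v = np.zeros_like(w)
for k in range(100):
    v = beta * v - alpha * grad_g(w)   # velocity update
    w = w + v                          # weight update
print(w)                               # ends up close to the minimizer at the origin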