Introduction
This section details fundamental optimization algorithms used to tackle machine learning problems. First-order optimality conditions codify how the first derivatives characterize the minima of functions. We’ll explore fundamental concepts related to hyperplanes and, in particular, the first-order Taylor series approximation.
First-Order Optimality Condition
When the derivative of a function is zero at a point, that point is a potential minimum. Analogously, for a multi-input function $g$, any $N$-dimensional point $\mathbf{w}$ where every partial derivative of $g$ is zero is a potential minimum. This system of $N$ equations is referred to as the first-order system of equations:

$$\frac{\partial}{\partial w_n} g(\mathbf{w}) = 0, \quad n = 1, \ldots, N.$$

We can also write the first-order system more compactly using gradient notation as:

$$\nabla g(\mathbf{w}) = \mathbf{0}_{N \times 1}.$$
This useful characterization of minimum points is the first-order analog to the zero-order condition for optimality. However, there are two problems with the first-order characterization of minima:
- It is virtually impossible to solve a general function’s first-order system of equations “by hand” (algebraically for closed-form solutions)
- It only defines global minima for convex functions (like quadratics), and also captures maxima and saddle points of nonconvex functions. These minima, maxima, and saddle points are referred to as stationary or critical points of a function.
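As a concrete illustration of the convex case, the first-order system of a simple quadratic can be solved in closed form. Below is a minimal sketch assuming sympy is available (the specific function is an illustrative choice, not one from the text above):

```python
import sympy as sp

# an illustrative convex quadratic: g(w1, w2) = w1^2 + 2*w2^2 + w1*w2
w1, w2 = sp.symbols("w1 w2")
g = w1**2 + 2 * w2**2 + w1 * w2

# the first-order system: every partial derivative of g set to zero
first_order_system = [sp.diff(g, w1), sp.diff(g, w2)]

# solvable simultaneously here precisely because g is quadratic
print(sp.solve(first_order_system, [w1, w2]))  # {w1: 0, w2: 0}
```

Because the Hessian of this $g$ is positive definite, the stationary point recovered here is its unique global minimum.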
Coordinate Descent and the First-Order Optimality Condition
While solving the first-order system simultaneously is often impossible, it is sometimes possible to solve it sequentially. This approach is called coordinate descent and is effective when each of the equations can be solved in closed form.
To solve the first-order system sequentially, we initialize at an input point $\mathbf{w}^0$ and begin by updating the first coordinate by solving:

$$\frac{\partial}{\partial w_1} g(\mathbf{w}) = 0$$

for the optimal first weight $w_1^\star$. All other weights are kept at their initial values. We then update the first coordinate of the vector $\mathbf{w}^0$ with the solution $w_1^\star$. Continuing this pattern, we update the $n$-th weight by solving:

$$\frac{\partial}{\partial w_n} g(\mathbf{w}) = 0$$

for $w_n^\star$, keeping all other weights fixed at their current values. After sweeping through all $N$ weights a single time, we can refine the solution by sweeping through the weights again: at the $k$-th such sweep, we update the $n$-th weight by solving the same single equation with all other weights fixed at their most recent values, as in the sketch below.
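As a hedged sketch of these sweeps, consider the convex quadratic $g(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T Q \mathbf{w} - \mathbf{b}^T \mathbf{w}$, for which each single-coordinate equation has a closed-form solution (the particular $Q$ and $\mathbf{b}$ below are illustrative assumptions):

```python
import numpy as np

def coordinate_descent(Q, b, w, num_sweeps=50):
    """Minimize g(w) = 0.5 * w^T Q w - b^T w by coordinate sweeps.

    Setting the n-th partial derivative to zero while fixing the other
    coordinates gives the closed-form update
    w_n = (b_n - sum_{m != n} Q_{nm} w_m) / Q_{nn}.
    """
    N = len(w)
    for k in range(num_sweeps):
        for n in range(N):
            # subtract off the n-th coordinate's own contribution to Q[n] @ w
            w[n] = (b[n] - Q[n] @ w + Q[n, n] * w[n]) / Q[n, n]
    return w

# illustrative positive definite Q and vector b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = coordinate_descent(Q, b, np.zeros(2))
print(w, np.linalg.solve(Q, b))  # the sweeps converge to Q^{-1} b
```

For a positive definite $Q$, this is exactly the classical Gauss-Seidel iteration applied to the first-order system $Q\mathbf{w} = \mathbf{b}$.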
Geometry of First-Order Taylor Series
Here, we describe characteristics of hyperplanes including directions of steepest ascent and steepest descent. We then study the first-order Taylor series approximation to a function.
Anatomy of Hyperplanes
A general $N$-dimensional hyperplane is characterized as:

$$h(w_1, w_2, \ldots, w_N) = b + a_1 w_1 + a_2 w_2 + \cdots + a_N w_N$$

where $b$ as well as $a_1$ through $a_N$ are all scalar parameters. We can rewrite $h$ more compactly as:

$$h(\mathbf{w}) = b + \mathbf{a}^T \mathbf{w}$$

where $\mathbf{a} = \begin{bmatrix} a_1 & a_2 & \cdots & a_N \end{bmatrix}^T$ and $\mathbf{w} = \begin{bmatrix} w_1 & w_2 & \cdots & w_N \end{bmatrix}^T$ are $N \times 1$ vectors.

Notice that the hyperplane $h$ is an $N$-dimensional object living in an $(N+1)$-dimensional ambient space, whose input space is $N$-dimensional.
Steepest Ascent/Descent Directions
When $N = 1$, there are only two directions for the input to 'move' in. Moving right or left yields an ascent (or descent) direction for any arbitrary hyperplane $h$. In the case where $N \geq 2$, however, there are infinitely many directions to move in, including some that preserve the value of $h$. Thus, it is logical to ask whether we can find the direction that produces the largest ascent (or descent), commonly referred to as the direction of steepest ascent/descent.

We aim to find the unit direction $\mathbf{d}$ such that the value of $h(\mathbf{w} + \mathbf{d})$ is maximal. In other words, we aim to solve:

$$\max_{\mathbf{d}} \; h(\mathbf{w} + \mathbf{d}) \quad \text{subject to} \quad \left\| \mathbf{d} \right\|_2 = 1.$$

Note that $h(\mathbf{w} + \mathbf{d})$ can be written as:

$$h(\mathbf{w} + \mathbf{d}) = b + \mathbf{a}^T \mathbf{w} + \mathbf{a}^T \mathbf{d}$$

where the first two terms on the right-hand side are constant with respect to $\mathbf{d}$.

Therefore, maximizing the value of $h(\mathbf{w} + \mathbf{d})$ is equivalent to maximizing $\mathbf{a}^T \mathbf{d}$, which can be written using the inner product rule as:

$$\mathbf{a}^T \mathbf{d} = \left\| \mathbf{a} \right\|_2 \left\| \mathbf{d} \right\|_2 \cos(\theta)$$

where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{d}$.

Note that $\left\| \mathbf{a} \right\|_2$ does not change with respect to $\mathbf{d}$, and $\left\| \mathbf{d} \right\|_2 = 1$. Thus, we aim to maximize $\cos(\theta)$.

It is clear that the maximality condition occurs when $\theta = 0$, or equivalently, when $\mathbf{d} = \frac{\mathbf{a}}{\left\| \mathbf{a} \right\|_2}$. Similarly, the minimality condition occurs when $\mathbf{d} = -\frac{\mathbf{a}}{\left\| \mathbf{a} \right\|_2}$.
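This conclusion is easy to sanity-check numerically; a small sketch with an arbitrary, illustrative slope vector $\mathbf{a}$:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, -1.0, 0.5])  # arbitrary hyperplane slope vector

# candidate steepest ascent direction: the normalized slope vector
d_star = a / np.linalg.norm(a)

# many random unit directions for comparison
D = rng.normal(size=(10000, 3))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# no random unit direction beats a^T d_star = ||a||_2
print(np.all(D @ a <= a @ d_star + 1e-12))  # True
```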
The Gradient
A multi-input function $g(\mathbf{w})$ can be approximated locally around a given point $\mathbf{w}^0$ by the hyperplane:

$$h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T \left( \mathbf{w} - \mathbf{w}^0 \right).$$

This can be rewritten in the form $h(\mathbf{w}) = b + \mathbf{a}^T \mathbf{w}$, where:

$$b = g(\mathbf{w}^0) - \nabla g(\mathbf{w}^0)^T \mathbf{w}^0, \qquad \mathbf{a} = \nabla g(\mathbf{w}^0).$$

This hyperplane is tangent to $g$ at the point $\mathbf{w}^0$. Because $h$ is designed to approximate $g$ near $\mathbf{w}^0$, its steepest ascent and descent directions also tell us the direction to travel to increase or decrease the value of the underlying function itself at/near $\mathbf{w}^0$.

The gradient $\nabla g(\mathbf{w}^0)$ therefore points in the direction of steepest ascent, while $-\nabla g(\mathbf{w}^0)$ points in the direction of steepest descent.
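Here is a short sketch of building this tangent hyperplane with autograd, the library used in the implementation later in this post (the function $g$ and the point $\mathbf{w}^0$ are illustrative choices):

```python
import autograd.numpy as np
from autograd import grad

# an illustrative multi-input function
g = lambda w: np.sin(w[0]) + w[1] ** 2

w0 = np.array([1.0, 0.5])
a = grad(g)(w0)  # the gradient at w0 is the slope vector of the hyperplane

# first-order Taylor approximation: h(w) = g(w0) + a^T (w - w0)
h = lambda w: g(w0) + a @ (w - w0)

print(h(w0) == g(w0))           # tangency: h agrees with g at w0
print(h(w0 + 0.1 * a) > h(w0))  # stepping along +a increases h (steepest ascent)
```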
Computing Gradients Efficiently
We can compute the derivatives of relatively simple functions easily by applying differentiation rules and the known derivatives of elementary functions. However, for the complicated functions common in machine learning, we need automatic differentiation tools as an alternative to manual computation.
For Python, popular automatic differentiation libraries include:
- autograd: Simple and lightweight automatic differentiation
- JAX: High-performance automatic differentiation with GPU support
- PyTorch: Deep learning framework with built-in autodiff
- TensorFlow: Another deep learning framework with automatic differentiation
These tools compute gradients using the chain rule systematically, enabling efficient optimization of complex multi-layer functions.
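For instance, with autograd (the library used in the implementation below), obtaining a gradient function takes a single call; the function being differentiated here is an illustrative assumption:

```python
import autograd.numpy as np
from autograd import grad

# a function that would be tedious to differentiate by hand
g = lambda w: np.tanh(w[0] * w[1]) + np.exp(-w[0] ** 2)

nabla_g = grad(g)  # grad returns a *function* that evaluates the gradient
print(nabla_g(np.array([1.0, 2.0])))
```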
Gradient Descent
We have established that the negative gradient of a function computed at a particular point always defines a valid descent direction at that point. We can construct a local optimization method consisting of steps of the general form:

$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \, \mathbf{d}^{k-1}.$$

By employing the negative gradient $-\nabla g(\mathbf{w}^{k-1})$ as the descent direction $\mathbf{d}^{k-1}$, the sequence of steps takes the form:

$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \, \nabla g(\mathbf{w}^{k-1}).$$

Provided an appropriate step size $\alpha$, this iterative process will converge to a point near a local minimum of the target function $g$. This is known as the gradient descent algorithm.
Gradient descent is often far superior to local zero-order optimization methods and is the most popular optimization algorithm used in machine learning. This superiority stems from the fact that the descent direction (via the gradient) is almost always easier to compute than seeking out a descent direction at random, particularly as the dimension of the input space increases.
Step Length Choices
As with all local optimization methods, the step length or learning rate parameter $\alpha$ needs to be carefully chosen. For gradient descent, the most common choices include:

- Fixed step size: Using a constant value $\alpha$ for every step of a gradient descent run
- Diminishing step size: Using a decreasing step size like $\alpha = \frac{1}{k}$ at the $k$-th step

In both cases, the aim is to choose $\alpha$ to induce the most rapid minimization possible.
Mathematical considerations for step size (stated for quadratic functions; see the sketch below):

- Too large: $\alpha > \frac{2}{\lambda_{\max}}$ causes the iterates to diverge, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian (for quadratic functions)
- Too small: Convergence becomes unnecessarily slow
- Optimal: $\alpha = \frac{2}{\lambda_{\min} + \lambda_{\max}}$ for quadratic functions, where $\lambda_{\min}$ is the smallest eigenvalue of the Hessian
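For a quadratic, these thresholds can be read off the Hessian's eigenvalues directly; a minimal sketch (the matrix is an illustrative assumption):

```python
import numpy as np

# Hessian of an illustrative quadratic g(w) = 0.5 * w^T Q w
Q = np.array([[10.0, 0.0], [0.0, 1.0]])

eigvals = np.linalg.eigvalsh(Q)  # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

print("diverges for alpha >", 2 / lam_max)             # too large
print("optimal fixed step:", 2 / (lam_min + lam_max))  # fastest contraction
```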
Oscillation in Cost Function History
In practice, we use the cost function history plot to tune the step length parameter $\alpha$. When analyzing the cost function history, it is not ultimately important that the plot is strictly decreasing. What is critically important is finding a value of $\alpha$ that allows the algorithm to find the lowest function value efficiently.

The best choice of $\alpha$ for a given minimization problem may cause the algorithm to 'hop around' and not induce a strict descent at every step, yet still converge to the optimal solution faster than a more conservative choice.
Convergence Criteria
In principle, we can wait for gradient descent to get sufficiently close to a stationary point by ensuring the magnitude of the gradient is sufficiently small:

$$\left\| \nabla g(\mathbf{w}^k) \right\|_2 < \epsilon.$$
Other formal convergence criteria include:

- Step size criterion: Halt when steps no longer make sufficient progress, i.e., when $\left\| \mathbf{w}^k - \mathbf{w}^{k-1} \right\|_2 < \epsilon$
- Function value criterion: Stop when function evaluations no longer differ substantially, i.e., when $\left| g(\mathbf{w}^k) - g(\mathbf{w}^{k-1}) \right| < \epsilon$
- Maximum iterations: Set a maximum number of iterations to prevent infinite loops
The tolerance $\epsilon$ and maximum iteration count are typically set based on computational constraints and desired precision.
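A sketch of how these criteria might be combined in code (the tolerance default is illustrative; compare with the full implementation in the next subsection):

```python
import numpy as np

def converged(w_new, w_old, g_new, g_old, grad_eval, eps=1e-6):
    """Return True once any of the formal stopping criteria is met."""
    if np.linalg.norm(grad_eval) < eps:      # gradient magnitude criterion
        return True
    if np.linalg.norm(w_new - w_old) < eps:  # step size criterion
        return True
    if abs(g_new - g_old) < eps:             # function value criterion
        return True
    return False
```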
Python Implementation
```python
# import automatic differentiator to compute gradient module
from autograd import grad

# gradient descent function
def gradient_descent(g, alpha, max_its, w):
    # compute the gradient function of g using autograd
    gradient = grad(g)

    # containers for the weight and cost function histories
    weight_history = [w]
    cost_history = [g(w)]

    # gradient descent loop
    for k in range(max_its):
        # evaluate the gradient at the current weights
        grad_eval = gradient(w)

        # take a gradient descent step
        w = w - alpha * grad_eval

        # record weight and cost
        weight_history.append(w)
        cost_history.append(g(w))

    return weight_history, cost_history
```
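A usage sketch on a simple convex quadratic (the cost function, step size, and initialization are illustrative choices):

```python
import autograd.numpy as np

# illustrative cost: g(w) = w_1^2 + w_2^2, minimized at the origin
g = lambda w: np.sum(w ** 2)

weight_history, cost_history = gradient_descent(
    g, alpha=0.1, max_its=100, w=np.array([3.0, -2.0])
)

print(weight_history[-1])  # close to [0, 0]
print(cost_history[-1])    # close to 0
```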
Two Natural Weaknesses of Gradient Descent
Like any vector, the negative gradient consists of both direction and magnitude. Depending on the function being minimized, either one of these attributes, or both, can present significant challenges:
1. Oscillating Direction Problem
The direction of the negative gradient can rapidly oscillate during optimization, often producing zig-zag steps that take considerable time to reach a minimum. This occurs when:

- The condition number $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ of the Hessian is large
- The function has elongated contours (elliptical level sets)
Mathematical insight: For a quadratic function $g(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T Q \mathbf{w}$, the convergence rate of gradient descent with the best fixed step size is governed by the contraction factor:

$$\frac{\kappa - 1}{\kappa + 1}$$

which approaches $1$ as the condition number $\kappa$ grows, so progress per step becomes vanishingly small.
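The zig-zagging is easy to reproduce on an ill-conditioned quadratic; a sketch with $\kappa = 100$ (an illustrative choice):

```python
import numpy as np

# g(w) = 0.5 * (100 * w_1^2 + w_2^2), so kappa = 100
grad_g = lambda w: np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
alpha = 2 / (100 + 1)  # optimal fixed step for this quadratic
for k in range(200):
    w = w - alpha * grad_g(w)

# w_1 flips sign every step (zig-zag) while both coordinates contract
# only by a factor of (kappa - 1)/(kappa + 1) per iteration
print(w)
```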
2. Vanishing Gradient Problem
The magnitude of the gradient can vanish rapidly near stationary points, leading to gradient descent crawling slowly near minima and saddle points. This manifests as:

$$\left\| \nabla g(\mathbf{w}^k) \right\|_2 \to 0$$

so the steps $\alpha \, \nabla g(\mathbf{w}^k)$ shrink toward zero even when $\mathbf{w}^k$ is still far from a minimum, for example near a saddle point or in a long flat region.
Practical implications: These problems are particularly prevalent in machine learning because many objective functions have:
- Long flat regions where contours become increasingly parallel
- High-dimensional parameter spaces with poor conditioning
- Complex loss landscapes with multiple local minima and saddle points
Advanced Topics and Extensions
Convergence Analysis for Convex Functions
For strongly convex functions with Lipschitz continuous gradients, gradient descent enjoys guaranteed convergence rates. Specifically, if $g$ is $\mu$-strongly convex and $L$-smooth, then gradient descent with step size $\alpha = \frac{1}{L}$ satisfies:

$$g(\mathbf{w}^k) - g(\mathbf{w}^\star) \leq \left( 1 - \frac{\mu}{L} \right)^k \left( g(\mathbf{w}^0) - g(\mathbf{w}^\star) \right).$$

This shows linear convergence with rate $1 - \frac{\mu}{L}$.
Line Search Methods
Instead of using a fixed step size, line search methods adaptively choose $\alpha$ at each iteration by (approximately) solving:

$$\alpha_k = \underset{\alpha > 0}{\arg\min} \; g\left( \mathbf{w}^{k-1} - \alpha \, \nabla g(\mathbf{w}^{k-1}) \right).$$

Popular line search conditions include (see the backtracking sketch below):

- Armijo condition (sufficient decrease): $g\left( \mathbf{w} - \alpha \nabla g(\mathbf{w}) \right) \leq g(\mathbf{w}) - c_1 \, \alpha \left\| \nabla g(\mathbf{w}) \right\|_2^2$ for some $0 < c_1 < 1$
- Wolfe conditions: Armijo + the curvature condition $\nabla g\left( \mathbf{w} - \alpha \nabla g(\mathbf{w}) \right)^T \nabla g(\mathbf{w}) \leq c_2 \left\| \nabla g(\mathbf{w}) \right\|_2^2$ with $c_1 < c_2 < 1$
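A backtracking sketch that enforces the Armijo condition (the constants $c_1$ and the shrink factor are conventional but illustrative choices):

```python
import numpy as np

def backtracking_step(g, grad_g, w, alpha0=1.0, c1=1e-4, shrink=0.5):
    """Shrink alpha until the Armijo sufficient-decrease condition holds,
    then take one gradient descent step with the accepted alpha."""
    grad_eval = grad_g(w)
    alpha = alpha0
    # Armijo: g(w - alpha * grad) <= g(w) - c1 * alpha * ||grad||^2
    while g(w - alpha * grad_eval) > g(w) - c1 * alpha * (grad_eval @ grad_eval):
        alpha *= shrink
    return w - alpha * grad_eval
```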
Momentum and Acceleration
Classical momentum (the heavy-ball method) modifies the update rule to:

$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \, \nabla g(\mathbf{w}^{k-1}) + \beta \left( \mathbf{w}^{k-1} - \mathbf{w}^{k-2} \right)$$

where $\beta \in [0, 1)$ controls how much of the previous step is carried forward.

Nesterov acceleration provides even better convergence rates by evaluating the gradient at a look-ahead point:

$$\mathbf{v}^k = \mathbf{w}^k + \frac{k-1}{k+2} \left( \mathbf{w}^k - \mathbf{w}^{k-1} \right), \qquad \mathbf{w}^{k+1} = \mathbf{v}^k - \alpha \, \nabla g(\mathbf{v}^k),$$

achieving an $O(1/k^2)$ rate on smooth convex functions versus $O(1/k)$ for plain gradient descent.
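A sketch of classical momentum in the style of the implementation above, written in the equivalent velocity form (the momentum parameter $\beta$ and the test function are illustrative choices):

```python
import autograd.numpy as np
from autograd import grad

def momentum_descent(g, alpha, beta, max_its, w):
    """Gradient descent with classical (heavy-ball) momentum."""
    gradient = grad(g)
    velocity = np.zeros_like(w)
    for k in range(max_its):
        # accumulate an exponentially decaying sum of past gradients
        velocity = beta * velocity + gradient(w)
        w = w - alpha * velocity
    return w

g = lambda w: np.sum(w ** 2)
print(momentum_descent(g, alpha=0.1, beta=0.9, max_its=100, w=np.array([3.0, -2.0])))
```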