Second-Order Optimality Condition
To determine whether a general single-input function $g(w)$ is convex or concave at a point $w^0$, we check its curvature (its second-derivative information) at that point, assuming it is twice differentiable there. If the second derivative satisfies $g''(w^0) \geq 0$, then $g$ is said to be convex at $w^0$; if $g''(w^0) \leq 0$, it is said to be concave at $w^0$. For a multi-input function, the analog is checking whether the Hessian $\nabla^2 g(\mathbf{w}^0)$ is positive semi-definite, negative semi-definite, or indefinite, which is determined by the signs of its eigenvalues.
More specifically, a stationary point of a multi-input function $g$, i.e. a point $\mathbf{w}^\star$ where $\nabla g(\mathbf{w}^\star) = \mathbf{0}$, is classified by the eigenvalues of the Hessian $\nabla^2 g(\mathbf{w}^\star)$ (see the sketch after this list):
- a local (or global) minimum if all eigenvalues of the Hessian are positive
- a local (or global) maximum if all eigenvalues of the Hessian are negative
- a saddle point if the eigenvalues are mixed (some negative, some positive)
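As a minimal sketch of this classification (the helper name classify_stationary_point, the tolerance, and the example Hessian are illustrative assumptions, not from the text), one can inspect the eigenvalue signs of a symmetric Hessian with NumPy:

import numpy as np

def classify_stationary_point(hessian, tol=1e-10):
    """Classify a stationary point from the eigenvalue signs of its symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)            # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "local (or global) minimum"
    if np.all(eigvals < -tol):
        return "local (or global) maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are zero)"

# Hessian of g(w1, w2) = w1^2 - w2^2 at its stationary point (0, 0)
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point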
The Geometry of Second-Order Taylor Series
The general shape of SI quadratic functions
The basic formula for a quadratic function with a single input takes the form $g(w) = a + bw + cw^2$, where $a$, $b$, and $c$ are constant values controlling the shape of the function. In particular, the constant $c$ controls the convexity or concavity of the function: the quadratic is convex when $c \geq 0$ and concave when $c \leq 0$.
The general shape of MI quadratic functions
The multi-input quadratic function takes a form that directly generalizes the single-input case and can be written as
$$g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\,\mathbf{w},$$
where the input $\mathbf{w}$ is $N$-dimensional, $a$ remains a constant, $\mathbf{b}$ is an $N \times 1$ vector, and $\mathbf{C}$ is an $N \times N$ matrix (assumed symmetric for our purposes). Because this quadratic is defined along many dimensions, it can take on a wider variety of shapes than its single-input analog. The generalization of the single-input test for convexity/concavity is no longer whether the value of $c$ is positive or negative, but rather whether the eigenvalues of $\mathbf{C}$ are positive or negative. If the eigenvalues are all nonnegative the function is convex; if all nonpositive, it is concave; if all zero, it reduces to a linear function that is both convex and concave; otherwise (some positive, some negative) it is neither convex nor concave.
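As a small worked check of this eigenvalue test (the matrix $\mathbf{C}$ below is an illustrative choice, not taken from the text), one can inspect the eigenvalues of $\mathbf{C}$ directly:

import numpy as np

# Symmetric matrix C of the quadratic g(w) = a + b^T w + w^T C w
C = np.array([[2.0, 0.0],
              [0.0, -1.0]])

eigvals = np.linalg.eigvalsh(C)
print(eigvals)               # [-1.  2.]
print(np.all(eigvals >= 0))  # False: not convex
print(np.all(eigvals <= 0))  # False: not concave, so this quadratic is saddle-shaped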
Local Curvature and Second-Order Taylor Series
Another way to think about the local convexity/concavity of a function at a point $w^0$ is via its second-order Taylor approximation at that point, which takes the form
$$h(w) = g(w^0) + g'(w^0)(w - w^0) + \frac{1}{2}g''(w^0)(w - w^0)^2.$$
This is a true quadratic built using the first and second derivatives of $g$ at $w^0$. The approximation matches the curvature of the underlying function at the expansion point, and because a quadratic has constant curvature, if the function is convex/concave at that point then the approximation is convex/concave everywhere. This concept holds analogously for multi-input functions as well, since the second-order series approximation is given by
$$h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^0)^T\nabla^2 g(\mathbf{w}^0)(\mathbf{w} - \mathbf{w}^0).$$
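To make this concrete, here is a brief sketch (the function $g(w) = \sin(w)$ and the expansion point $w^0 = 1$ are illustrative choices) that builds the single-input second-order approximation and compares it to $g$ near the point:

import numpy as np

def g(w):
    return np.sin(w)

def g_prime(w):
    return np.cos(w)

def g_double_prime(w):
    return -np.sin(w)

w0 = 1.0

def h(w):
    # Second-order Taylor approximation of g centered at w0
    return g(w0) + g_prime(w0) * (w - w0) + 0.5 * g_double_prime(w0) * (w - w0) ** 2

for w in [0.9, 1.0, 1.1]:
    print(w, g(w), h(w))  # h matches g closely near w0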
Newton’s Method
Since the first order Taylor series approximation yields the local optimization framework of gradient descent, it seems natural that higher order approximations may yield similar descent algorithms.
Descent Direction
We can compute stationary points of quadratics easily using the first-order condition for optimality. For the single-input second-order approximation centered at $w^0$, setting $h'(w) = 0$ shows that to get to the stationary point $w^\star$ we move from $w^0$ in the direction given by $-\frac{g'(w^0)}{g''(w^0)}$. This is formalized as
$$w^\star = w^0 - \frac{g'(w^0)}{g''(w^0)}.$$
The multi-input case is a direct analog of the single-input solution and reduces to it when $N = 1$: to get to the stationary point $\mathbf{w}^\star$, we move from $\mathbf{w}^0$ in the direction given by $-\left(\nabla^2 g(\mathbf{w}^0)\right)^{-1}\nabla g(\mathbf{w}^0)$, i.e.
$$\mathbf{w}^\star = \mathbf{w}^0 - \left(\nabla^2 g(\mathbf{w}^0)\right)^{-1}\nabla g(\mathbf{w}^0).$$
Repeatedly traveling to the stationary point of each successive second-order approximation tends, for convex functions, to lower the original function's evaluation at every step, and this provides an efficient algorithm for minimizing a cost function, known as Newton's method.
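As a quick numerical illustration of this descent step (the function $g(w) = w^4 + w^2$ and the starting point are chosen here purely for illustration), moving to the stationary point of the local quadratic approximation already lowers the function value:

def g(w):
    return w**4 + w**2

def g_prime(w):
    return 4 * w**3 + 2 * w

def g_double_prime(w):
    return 12 * w**2 + 2

w0 = 1.0
# Stationary point of the quadratic approximation centered at w0
w_star = w0 - g_prime(w0) / g_double_prime(w0)

print(w_star)             # ~0.571
print(g(w0), g(w_star))   # 2.0 -> ~0.433: the step lowered g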
Algorithm
Newton's method repeatedly takes steps to stationary points of second-order Taylor approximations of a function. At the $k$-th step of this process for a single-input function, we make a second-order approximation centered at $w^k$,
$$h(w) = g(w^k) + g'(w^k)(w - w^k) + \frac{1}{2}g''(w^k)(w - w^k)^2,$$
and solve for its stationary point to update
$$w^{k+1} = w^k - \frac{g'(w^k)}{g''(w^k)}.$$
It is not hard to determine the multi-input analog:
$$\mathbf{w}^{k+1} = \mathbf{w}^k - \left(\nabla^2 g(\mathbf{w}^k)\right)^{-1}\nabla g(\mathbf{w}^k).$$
Notice that Newton's update formula requires the inversion of an $N \times N$ Hessian matrix, where $N$ is the input dimension. In practice, however, $\mathbf{w}^{k+1}$ is found by solving the equivalent symmetric system of equations
$$\nabla^2 g(\mathbf{w}^k)\,\mathbf{w}^{k+1} = \nabla^2 g(\mathbf{w}^k)\,\mathbf{w}^k - \nabla g(\mathbf{w}^k).$$
Ensuring Numerical Stability
Newton’s method can become unstable when:
- The Hessian is singular or nearly singular (poor conditioning)
- The Hessian has negative eigenvalues (non-convex regions)
Common fixes:
- Regularization: Add $\epsilon \mathbf{I}$ to the Hessian: $\nabla^2 g(\mathbf{w}^k) + \epsilon \mathbf{I}$
- Line search: Use a step size $\alpha \in (0, 1]$: $\mathbf{w}^{k+1} = \mathbf{w}^k - \alpha\left(\nabla^2 g(\mathbf{w}^k)\right)^{-1}\nabla g(\mathbf{w}^k)$ (a sketch combining these two fixes follows this list)
- Trust region methods: Limit step size within a “trusted” region
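Below is a minimal sketch combining the first two fixes, a Hessian regularized by $\epsilon\mathbf{I}$ plus a simple backtracking line search (the helper name damped_newton_step, the parameter values, and the example function are illustrative assumptions, not from the text):

import numpy as np

def damped_newton_step(g, grad, hess, w, eps=1e-4, alpha0=1.0, beta=0.5, max_backtracks=20):
    """One regularized Newton step with backtracking on the step size alpha."""
    H = hess(w) + eps * np.eye(len(w))       # regularization: shifts every eigenvalue up by eps
    direction = np.linalg.solve(H, grad(w))  # Newton direction from the regularized Hessian
    alpha = alpha0
    for _ in range(max_backtracks):
        w_new = w - alpha * direction
        if g(w_new) < g(w):                  # accept the first step size that decreases g
            return w_new
        alpha *= beta                        # otherwise shrink the step size
    return w                                 # no decrease found; stay put

# Example on g(w) = w1^2 + 2*w2^2
g = lambda w: w[0]**2 + 2 * w[1]**2
grad = lambda w: np.array([2 * w[0], 4 * w[1]])
hess = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(damped_newton_step(g, grad, hess, np.array([1.0, 1.0])))  # close to [0, 0]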
Newton’s method as a zero-finding algorithm
Newton's method is fundamentally a root-finding algorithm for solving $f(w) = 0$:
$$w^{k+1} = w^k - \frac{f(w^k)}{f'(w^k)}.$$
For optimization, we apply it to find roots of the gradient, $\nabla g(\mathbf{w}) = \mathbf{0}$:
$$\mathbf{w}^{k+1} = \mathbf{w}^k - \left(\nabla^2 g(\mathbf{w}^k)\right)^{-1}\nabla g(\mathbf{w}^k).$$
This connection shows why Newton's method can converge to saddle points or maxima: it finds any stationary point where $\nabla g(\mathbf{w}) = \mathbf{0}$.
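For instance, the plain root-finding form of the update applied to $f(w) = w^2 - 2$ (an illustrative choice) converges to $\sqrt{2}$:

def f(w):
    return w**2 - 2

def f_prime(w):
    return 2 * w

w = 1.0
for _ in range(6):
    w = w - f(w) / f_prime(w)  # Newton root-finding update
print(w)  # ~1.41421356, a root of f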
Python Implementation
import numpy as np

def newton_optimization(grad_func, hess_func, w0, max_iter=50, tol=1e-6):
    """
    Newton's method for optimization
    """
    w = w0.copy()
    for i in range(max_iter):
        grad = grad_func(w)
        hess = hess_func(w)
        # Check convergence
        if np.linalg.norm(grad) < tol:
            print(f"Converged after {i} iterations")
            break
        # Solve linear system instead of inverting
        try:
            p = np.linalg.solve(hess, grad)
            w = w - p
        except np.linalg.LinAlgError:
            print("Singular Hessian - adding regularization")
            p = np.linalg.solve(hess + 1e-6*np.eye(len(w)), grad)
            w = w - p
    return w

# Example: minimize f(x,y) = x^2 + 2y^2
def grad_example(w):
    return np.array([2*w[0], 4*w[1]])

def hess_example(w):
    return np.array([[2, 0], [0, 4]])

# Usage
w_opt = newton_optimization(grad_example, hess_example, np.array([1.0, 1.0]))
Two Natural Weaknesses of Newton’s Method
Minimizing nonconvex functions
For nonconvex functions, Newton’s method has serious limitations:
- Saddle points: Can converge to points where $\nabla g(\mathbf{w}) = \mathbf{0}$ that are neither minima nor maxima
- Negative curvature: When the Hessian has negative eigenvalues, the Newton direction may be an ascent direction
- Local minima: May get stuck in poor local minima instead of global minimum
Example: for $g(w_1, w_2) = w_1^2 - w_2^2 + \frac{1}{2}w_2^4$ (an illustrative function), initializing near the $w_2 = 0$ axis converges to the saddle point at $(0, 0)$ instead of the global minima at $(0, \pm 1)$.
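A few explicit Newton iterations on this illustrative function (a standalone sketch, reusing only NumPy) show the effect:

import numpy as np

# g(w1, w2) = w1^2 - w2^2 + 0.5*w2^4; saddle at (0, 0), global minima at (0, +/-1)
grad = lambda w: np.array([2 * w[0], -2 * w[1] + 2 * w[1]**3])
hess = lambda w: np.array([[2.0, 0.0], [0.0, -2.0 + 6.0 * w[1]**2]])

w = np.array([1.0, 0.1])  # start near the w2 = 0 axis
for _ in range(10):
    w = w - np.linalg.solve(hess(w), grad(w))
print(w)  # approximately [0, 0]: the saddle point, not a minimum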
Scaling Limitations
Newton’s method becomes computationally prohibitive for large-scale problems:
- Memory: Storing the $N \times N$ Hessian requires $O(N^2)$ memory
- Computation: Forming the Hessian requires $O(N^2)$ second-derivative evaluations per iteration
- Linear solve: Solving the Newton system $\nabla^2 g(\mathbf{w}^k)\,\mathbf{p} = \nabla g(\mathbf{w}^k)$ takes $O(N^3)$ operations
Practical limits:
- Typically feasible only up to roughly $N \approx 10^3$ to $10^4$ variables
- For larger problems, use quasi-Newton methods (BFGS, L-BFGS) that approximate the Hessian with $O(N^2)$ (BFGS) or $O(mN)$ (L-BFGS, with a small history size $m$) storage
Regularized Newton’s Method
Close to a stationary point in a flat region of a function, both the first and the second derivative approach zero, so the Newton step $\frac{g'(w^k)}{g''(w^k)}$ becomes a 0/0 situation. To address this, we add a small $\epsilon > 0$ to the denominator (in the multi-input case, $\epsilon\mathbf{I}$ to the Hessian).
To avoid this division by zero, we use the update
$$\mathbf{w}^{k+1} = \mathbf{w}^k - \left(\nabla^2 g(\mathbf{w}^k) + \epsilon\mathbf{I}\right)^{-1}\nabla g(\mathbf{w}^k).$$
Viewed through the eigendecomposition (SVD) of the Hessian, this addition takes the eigenvalues of the Hessian and shifts each of them up by $\epsilon$. Assuming the Hessian is positive semi-definite, adding $\epsilon$ makes every eigenvalue strictly positive, so the matrix is no longer non-invertible and the division-by-zero problem is avoided; we have regularized the matrix.
Taking the general expression for Newton's method with this $\epsilon$ addition, we can modify the quadratic local approximation in $N$-dimensional space so that it gives the same iteration formula. The modified (regularized) Newton step is obtained by changing the second-order approximation to
$$h(\mathbf{w}) = g(\mathbf{w}^k) + \nabla g(\mathbf{w}^k)^T(\mathbf{w} - \mathbf{w}^k) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^k)^T\nabla^2 g(\mathbf{w}^k)(\mathbf{w} - \mathbf{w}^k) + \frac{\epsilon}{2}\,\|\mathbf{w} - \mathbf{w}^k\|_2^2.$$
Adding this proximity term yields the modified (regularized) Newton update and expresses that $\mathbf{w}$ should stay close to $\mathbf{w}^k$.
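A minimal sketch of this regularized update (the helper name regularized_newton_step, the value of $\epsilon$, and the example functions are illustrative assumptions):

import numpy as np

def regularized_newton_step(grad, hess, w, eps=1e-3):
    """One regularized Newton step: add eps*I to the Hessian before solving."""
    H = hess(w) + eps * np.eye(len(w))  # shifts every eigenvalue up by eps
    return w - np.linalg.solve(H, grad(w))

# Example with a positive semi-definite Hessian that has a zero eigenvalue (singular)
grad = lambda w: np.array([2 * w[0], 0.0])  # gradient of g(w) = w1^2
hess = lambda w: np.array([[2.0, 0.0], [0.0, 0.0]])
print(regularized_newton_step(grad, hess, np.array([1.0, 1.0])))  # plain Newton would fail here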
In general, this reflects a broader idea in minimization: we would like to find the minimum of some $h_1(\mathbf{w})$, but for whatever reason we instead minimize $h_1(\mathbf{w}) + \frac{\epsilon}{2}h_2(\mathbf{w})$, where $h_2$ is a stabilizing/regularizing term. Why regularize? We may want to incorporate knowledge about the solution, or we might want to avoid overfitting.
- Homework: take the gradient of $h_1(\mathbf{w}) + \frac{\epsilon}{2}h_2(\mathbf{w})$, set it to zero, and solve it for $h_1 = w_1^2 - w_2^2$ and $h_2 = w_1^2 + w_2^2$.
Dealing with Scaling
A: Hessian-free methods. Subsample the Hessian, for example keeping only its diagonal values; this decouples the update direction along each coordinate and avoids storing or inverting the full matrix.
B: Quasi-Newton methods (Appendix A). Just as $\frac{f(x_2) - f(x_1)}{x_2 - x_1}$ approximates the derivative, we can do something similar for the Hessian by requiring it to satisfy the secant condition $\mathbf{H}(\mathbf{x}_2 - \mathbf{x}_1) \approx \nabla f(\mathbf{x}_2) - \nabla f(\mathbf{x}_1)$.
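As a rough sketch of the first idea (the function, names, and small eps safeguard below are illustrative), replacing the Hessian by its diagonal gives each coordinate its own scalar Newton-like step:

import numpy as np

def diagonal_newton_step(grad, hess, w, eps=1e-6):
    """Newton-like step using only the diagonal of the Hessian (one scalar step per coordinate)."""
    d = np.diag(hess(w))            # keep only the diagonal curvature values
    return w - grad(w) / (d + eps)  # elementwise division decouples the coordinates

# Example on g(w) = w1^2 + 2*w2^2 (here the Hessian happens to be diagonal already)
grad = lambda w: np.array([2 * w[0], 4 * w[1]])
hess = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(diagonal_newton_step(grad, hess, np.array([1.0, 1.0])))  # close to [0, 0]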