
Zero-Order Optimization Techniques

Some notes on zero-order optimization techniques



Description

Algorithms for finding the minimum of a function, where the order denotes the number of derivatives used.

  • First-order methods: gradient-based.
  • Zero-order methods: derivative-free.

Cost Functions

Let

$$g: \mathbb{R}^n \to \mathbb{R},$$

where $g(\mathbf{w})$ is a scalar and $\mathbf{w} = [w_1, w_2, \dots, w_n]^T$ is an $n$-dimensional vector.

Example:

$$g(\mathbf{w}) = \|\mathbf{w}\|^2 = \sum_{i=1}^n w_i^2 = \mathbf{w}^T \mathbf{w}.$$

Typical optimization problem:

$$\min_{\mathbf{w}} g(\mathbf{w}).$$
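
As a concrete running example, here is a minimal Python sketch of this quadratic cost; the use of NumPy and the function name `g` are choices made for illustration only:

```python
import numpy as np

def g(w):
    """Quadratic cost g(w) = ||w||^2 = w^T w."""
    w = np.asarray(w, dtype=float)
    return float(w @ w)

print(g([0.0, 0.0]))   # 0.0 -- the global minimum
print(g([1.0, -2.0]))  # 5.0
```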

Optimality Conditions

  • $\mathbf{w}^*$ is a global minimum if
    $g(\mathbf{w}^*) \leq g(\mathbf{w}), \quad \forall \mathbf{w}$

  • $\mathbf{w}^*$ is a global maximum if
    $g(\mathbf{w}^*) \geq g(\mathbf{w}), \quad \forall \mathbf{w}$

A Naïve Approach

Evaluate $g(\mathbf{w})$ at many points (uniformly or randomly sampled) and choose the smallest value.

  • Problem: in high dimensions, sampling requires exponentially more points to cover the same volume (the “curse of dimensionality”).
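
A small sketch of this naïve sampling approach on the quadratic cost above; `naive_search`, the sampling box $[-1, 1]^n$, and the sample budget are illustrative assumptions, not part of the notes:

```python
import numpy as np

def g(w):
    return float(w @ w)

def naive_search(g, n_dim, n_samples=10_000, low=-1.0, high=1.0, seed=0):
    """Sample points uniformly in a box and return the best one found."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(low, high, size=(n_samples, n_dim))
    values = np.array([g(w) for w in candidates])
    best = values.argmin()
    return candidates[best], values[best]

# With a fixed sampling budget, the best value found gets worse as the
# dimension grows -- the curse of dimensionality in action.
for n_dim in (2, 10, 50):
    _, best_val = naive_search(g, n_dim)
    print(n_dim, best_val)
```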

General Framework

  1. Start at $\mathbf{w}_0$ (random selection in the domain).
  2. Find a descent direction $\mathbf{d}$.
  3. Update: $\mathbf{w}_{k} = \mathbf{w}_{k-1} + \mathbf{d}_{k-1}$.
  4. Repeat until convergence.
  • Exploration vs. exploitation tradeoff:
    • Start with large steps to explore.
    • Gradually shrink step size to converge.

At step $k$, pick random directions $\{\mathbf{d}^p\}_{p=1}^P$.

  1. Evaluate
     $s = \arg\min_{p} g(\mathbf{w}_{k-1} + \mathbf{d}^p)$

  2. If
     $g(\mathbf{w}_{k-1} + \mathbf{d}^s) < g(\mathbf{w}_{k-1})$
     then update
     $\mathbf{w}_{k} = \mathbf{w}_{k-1} + \mathbf{d}^s$ (sketched in code below).
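
Below is one possible sketch of this random local search, assuming unit-length random directions and a diminishing $1/k$ step size (per the exploration/exploitation note above); all names and parameter values are illustrative:

```python
import numpy as np

def random_search_step(g, w, num_directions=10, step_size=1.0, rng=None):
    """One zero-order step: try P random directions, keep the best if it improves g."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw P random directions and normalize them to unit length
    # (normalization is a common choice, not required by the notes above).
    directions = rng.normal(size=(num_directions, w.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    candidates = w + step_size * directions      # w_{k-1} + alpha * d^p
    values = np.array([g(c) for c in candidates])
    s = values.argmin()                          # s = argmin_p g(w_{k-1} + d^p)

    if values[s] < g(w):                         # accept only a strict improvement
        return candidates[s]
    return w

def random_search(g, w0, num_steps=200, num_directions=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for k in range(1, num_steps + 1):
        # Diminishing step size 1/k: large early steps explore,
        # small later steps exploit (see the tradeoff above).
        w = random_search_step(g, w, num_directions, step_size=1.0 / k, rng=rng)
    return w

g = lambda w: float(w @ w)
print(random_search(g, w0=[3.0, -4.0]))  # should end up near the minimizer [0, 0]
```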

  • Variants: coordinate search vs. descent search.