Unconstrained Optimization


Optimization in Learning


Theory of Unconstrained Optimization

Optimality Conditions

Lemma:
Suppose that $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable at $\bar{x}$. If there is a vector $d \in \mathbb{R}^n$ such that $\nabla f(\bar{x})^T d < 0$, then $d$ is a descent direction of $f$ at $\bar{x}$.

  • sketch of proof: $f(x + \lambda d) \approx f(x) + \lambda \nabla f(x)^T d < f(x)$ for sufficiently small $\lambda > 0$, since $\nabla f(x)^T d < 0$.

This is the basic principle behind Gradient Descent and SGD, which appear later in this series.
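
As a quick numerical sanity check (a minimal sketch; the quadratic $f(x) = x_1^2 + 3x_2^2$, the point $\bar{x}$, and the choice $d = -\nabla f(\bar{x})$ are my own illustrative assumptions), stepping a small amount along a direction with $\nabla f(\bar{x})^T d < 0$ does decrease $f$:

```python
import numpy as np

# Example function and its gradient (chosen only for illustration).
f = lambda x: x[0]**2 + 3 * x[1]**2
grad = lambda x: np.array([2 * x[0], 6 * x[1]])

x_bar = np.array([1.0, 1.0])
d = -grad(x_bar)                    # any d with grad(x_bar) @ d < 0 works

assert grad(x_bar) @ d < 0          # d is a descent direction
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, f(x_bar + lam * d) < f(x_bar))   # True for small lam > 0
```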

Theorem: First-Order Necessary Optimality Conditions (FONC)
Suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable at $x^*$.
If $x^*$ is a local minimum, then $\nabla f(x^*) = 0$.

Theorem: Second-Order Optimality Conditions
Suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is twice differentiable at $x^*$.
[Necessary] If $x^*$ is a local minimum, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite.
[Sufficient] If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, then $x^*$ is a strict local minimum.

Determining Local Optima

  1. Find the critical points of $f(x,y)$ by solving the system of simultaneous equations $f_x = 0$, $f_y = 0$.

  2. Let $D(x,y) = f_{xx} f_{yy} - f_{xy}^2$.

  3. Then

    1. $D(a,b) > 0$ and $f_{xx}(a,b) < 0$ implies that $f(x,y)$ has a local maximum at the point $(a,b)$.

    2. $D(a,b) > 0$ and $f_{xx}(a,b) > 0$ implies that $f(x,y)$ has a local minimum at the point $(a,b)$.

    3. $D(a,b) < 0$ implies that $f(x,y)$ has neither a local maximum nor a local minimum at the point $(a,b)$; instead it has a saddle point.

    4. $D(a,b) = 0$ implies that the test is inconclusive, so some other technique must be used to classify the point. (A worked example follows this list.)
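
A minimal SymPy sketch of this procedure; the function $f(x,y) = x^3 - 3x + y^2$ is my own illustrative choice, not taken from the text:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 - 3*x + y**2          # example function (illustration only)

fx, fy = sp.diff(f, x), sp.diff(f, y)
fxx, fyy, fxy = sp.diff(f, x, 2), sp.diff(f, y, 2), sp.diff(f, x, y)
D = fxx * fyy - fxy**2

# Step 1: critical points from fx = 0, fy = 0
critical_points = sp.solve([fx, fy], [x, y], dict=True)

# Steps 2-3: classify each critical point with the D-test
for pt in critical_points:
    D_val, fxx_val = D.subs(pt), fxx.subs(pt)
    if D_val > 0 and fxx_val > 0:
        label = 'local minimum'
    elif D_val > 0 and fxx_val < 0:
        label = 'local maximum'
    elif D_val < 0:
        label = 'saddle point'
    else:
        label = 'inconclusive'
    print(pt, label)   # (1, 0) is a local minimum, (-1, 0) is a saddle point
```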

Line Search Strategy

Line search is a technique in numerical analysis for finding (approximate) solutions, and it is well suited to complicated differentiable functions.

Starting from a given point $x_k$, it computes a search direction $p_k$ and then moves along that direction by a positive scalar step length $\alpha_k$ to obtain a new point $x_{k+1}$:

$x_{k+1} = x_k + \alpha_k p_k$

Therefore, the key to a line search method is choosing an appropriate search direction and step length.

The step length corresponds to the learning rate in typical learning algorithms.

It is reasonable to choose $p_k$ to be a descent direction, i.e. $p_k^T \nabla f_k < 0$, and in many line search methods $p_k$ takes the following form:

$p_k = -B_k^{-1} \nabla f_k$

where $B_k$ is a symmetric and nonsingular matrix.
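
A minimal sketch of a single iteration under these definitions; the quadratic gradient, the choice $B_k = I$ (which recovers steepest descent), and the fixed step length are my own illustrative assumptions, since in practice $\alpha_k$ would be chosen by a line search such as the Wolfe conditions below:

```python
import numpy as np

def line_search_step(x_k, grad_f, B_k, alpha_k):
    """One iteration x_{k+1} = x_k + alpha_k * p_k with p_k = -B_k^{-1} grad f_k."""
    p_k = -np.linalg.solve(B_k, grad_f(x_k))   # descent direction
    return x_k + alpha_k * p_k

# Illustration: B_k = I recovers steepest descent on a simple quadratic.
grad_f = lambda x: np.array([2 * x[0], 10 * x[1]])   # gradient of x1^2 + 5 x2^2
x = np.array([3.0, 1.0])
for k in range(5):
    x = line_search_step(x, grad_f, B_k=np.eye(2), alpha_k=0.1)
print(x)   # moves toward the minimizer (0, 0)
```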

The Wolfe Conditions

The Wolfe conditions provide criteria for choosing the step length in inexact line search methods; a sketch of checking them numerically follows the list below.

  • [Armijo condition: $f(x_k + \alpha_k d_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k$]
    The function value at the new point must decrease sufficiently compared to the value at the current point.
  • [Curvature condition: $\nabla f(x_k + \alpha_k d_k)^T d_k \geq c_2 \nabla f(x_k)^T d_k$]
    The slope at the new point must be at least $c_2$ times the (negative) slope at the current point, which rules out steps so short that the function is still decreasing steeply.
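
A minimal sketch of checking both conditions for candidate step lengths; the test function, the steepest-descent direction, and the typical constants $c_1 = 10^{-4}$, $c_2 = 0.9$ are my own illustrative assumptions:

```python
import numpy as np

def satisfies_wolfe(f, grad, x_k, d_k, alpha, c1=1e-4, c2=0.9):
    """Return True if alpha satisfies both Wolfe conditions at x_k along d_k."""
    g_k = grad(x_k)
    armijo = f(x_k + alpha * d_k) <= f(x_k) + c1 * alpha * g_k @ d_k
    curvature = grad(x_k + alpha * d_k) @ d_k >= c2 * g_k @ d_k
    return armijo and curvature

# Illustration on a simple quadratic with the steepest-descent direction.
f = lambda x: x[0]**2 + 5 * x[1]**2
grad = lambda x: np.array([2 * x[0], 10 * x[1]])
x_k = np.array([1.0, 1.0])
d_k = -grad(x_k)
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, satisfies_wolfe(f, grad, x_k, d_k, alpha))
```

In this example the largest step fails the Armijo condition (it overshoots), the smallest fails the curvature condition (it stops where the function is still decreasing steeply), and only the middle one satisfies both.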

Line Search Methods

  • Steepest Descent
    • $x_{k+1} = x_k - \alpha_k \nabla f_k$
    • The rate of convergence is linear.
    • Global convergence if $\alpha_k$ satisfies the Wolfe conditions.

This is what we commonly call Gradient Descent.
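
A minimal gradient-descent loop under these definitions; the quadratic gradient, the fixed step length, and the gradient-norm stopping rule are my own illustrative assumptions:

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Gradient descent with a fixed step length (illustration only)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # stop when the gradient is (nearly) zero
            break
        x = x - alpha * g                # x_{k+1} = x_k - alpha_k * grad f_k
    return x

grad_f = lambda x: np.array([2 * x[0], 10 * x[1]])   # gradient of x1^2 + 5 x2^2
print(steepest_descent(grad_f, x0=[3.0, 1.0]))       # approaches (0, 0)
```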

  • Quasi-Newton Method

    • $x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f_k$
    • The rate of convergence is superlinear.
    • Global convergence if $\alpha_k$ satisfies the Wolfe conditions.
    • The BFGS method is the most popular.
  • Newton's Method

    • $x_{k+1} = x_k - (\nabla^2 f_k)^{-1} \nabla f_k$
    • The rate of convergence is quadratic.
    • Local convergence.
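
A minimal sketch of the pure Newton iteration; the Rosenbrock test function and the starting point are my own illustrative choices, and since convergence is only local, a starting point near the minimizer matters:

```python
import numpy as np

def newton(grad_f, hess_f, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x_{k+1} = x_k - (hess f_k)^{-1} grad f_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            return x, k
        x = x - np.linalg.solve(hess_f(x), g)   # solve instead of forming the inverse
    return x, max_iter

# Rosenbrock function f(x) = (1 - x1)^2 + 100 (x2 - x1^2)^2, minimizer at (1, 1).
grad_f = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])
hess_f = lambda x: np.array([
    [2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
    [-400 * x[0], 200.0],
])
x_star, iters = newton(grad_f, hess_f, x0=[1.2, 1.2])
print(x_star, iters)   # converges to (1, 1) from this nearby starting point
```

For a quasi-Newton run in practice, something like scipy.optimize.minimize(f, x0, jac=grad_f, method='BFGS') can be used instead of forming the Hessian explicitly.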
