⨏ Directional Derivatives and the Gradient

Lightman · August 8, 2021

MACHINE LEARNING 🦾


INTRO

Analyzing how a function's output changes as its input changes is a central concern of calculus. For single-variable functions, the ordinary derivative was enough. For the multivariate functions we constantly meet in practice, let's start from the directional derivative and work up to the gradient and the Hessian matrix. (Throughout, scalar-valued functions are assumed.)

[Goal] How does the output $f(x)$ of the objective function change when the multivariate input $x$ changes?
[Conclusion] The gradient tells us.

  • [Reference] Classification of functions
  1. [X Univariate, Y ScalarValued] = INPUT:1, OUTPUT:1
  2. [X Univariate, Y VectorValued] = INPUT:1, OUTPUT:M
  3. [X Multivariate, Y ScalarValued] = INPUT:N, OUTPUT:1
  4. [X Multivariate, Y VectorValued] = INPUT:N, OUTPUT:M

📕 Differentiation of Multivariate Functions

🎢 Directional Derivative

1. Definition: the rate of change (a scalar) of $f(x)$ produced when the vector $x$ moves in the direction of the vector $v$

$\nabla_{\vec{v}}f(\vec{a}) \equiv \lim\limits_{h \rightarrow 0} \frac{f(\vec{a} + h\vec{v}) - f(\vec{a})}{h}$
Alias: $\nabla_{\vec{v}}f(\vec{a}) \equiv D_{\vec v}f(\vec a) \equiv \dfrac{\partial f(\vec a)}{\partial \vec v}$

  • [Interpretation]
    • Moving from $\vec{a}$ in the direction $\vec{v}$, the instantaneous rate of change (slope) of $f$ is $\nabla_{\vec{v}}f$.
    • In other words, the directional derivative represents a Slope!

2. Properties

$\nabla_{\vec{v}}f(\vec{a}) = \nabla f(\vec{a}) \cdot \vec{v}$

  • [Proof]
    • $\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial t} = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial (x_i+tv_i)} \frac{\partial (x_i+tv_i)}{\partial t} = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial(x_i+tv_i)} v_i = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i} v_i = \nabla f(\mathbf{x})\cdot \mathbf{v}$ (evaluated at $t = 0$)
  • [Interpretation] Algebraic interpretation
    • The change produced by moving in the direction $\vec v$ equals the sum of the changes produced by moving by $v_1$ and $v_2$. Here the components $v_1, v_2$ of $\vec v$ are the amounts moved, and the components $\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}$ of the gradient $\nabla f$ are the rates of change when moving in the $x_1, x_2$ directions. Therefore, taking the dot product with the unit vector $\vec v$ gives the rate of change produced by moving in the direction $\vec v$. (A numeric sketch follows below.)
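
As a minimal numeric sketch of this property, the snippet below compares the limit definition of the directional derivative with the dot-product formula $\nabla f(\vec a) \cdot \vec v$. The function `f`, the point `a`, and the direction `v` are arbitrary illustrative choices, not taken from the post.

```python
# Minimal check: the limit-definition directional derivative should match grad(f) . v.
import numpy as np

def f(x):
    # Illustrative function: f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    # Its analytic gradient: [2*x1 + 3*x2, 3*x1]
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

a = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
v = v / np.linalg.norm(v)                  # unit direction vector

h = 1e-6
limit_def = (f(a + h * v) - f(a)) / h      # limit definition with a small finite h
dot_def = grad_f(a) @ v                    # gradient dotted with the direction

print(limit_def, dot_def)                  # both ≈ 7.2; they agree up to O(h)
```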

∇ GRADIENT VECTOR (Scalar-Valued)

1. Definition: the vector of directional derivatives taken along the standard basis vectors

$\nabla f \equiv \frac{\partial f(\vec x)}{\partial \vec x} = \left[\frac{\partial}{\partial x_1}~~\frac{\partial}{\partial x_2}\right]^\top f = \left[\frac{\partial f}{\partial x_1}~~\frac{\partial f}{\partial x_2}\right]^\top$

  • Gradient is a Collection of Partial Derivatives
  • [Note] The gradient is not a slope; it is a vector living in the same space as the input vector.
    • That is why adding it to the input and moving in that direction is a natural operation (a minimal sketch follows below).
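
As a quick illustration of this point, here is a minimal gradient-ascent sketch: the update `x + step * grad_f(x)` is well defined precisely because the gradient has the same shape as the input. The concave quadratic `f` below is an arbitrary illustrative choice.

```python
# The gradient lives in input space, so "x + step * grad" is a well-defined move.
import numpy as np

def f(x):
    # Concave quadratic with its maximum at (1, -2)
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2

def grad_f(x):
    return np.array([-2 * (x[0] - 1.0), -2 * (x[1] + 2.0)])

x = np.zeros(2)
for _ in range(100):
    x = x + 0.1 * grad_f(x)    # move a small step in the gradient direction

print(x, f(x))                 # x converges toward [1, -2], where f is maximal
```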

2. Properties

① The vector that yields $\nabla_{\vec{v}} f(a, b)$ through the operation $\nabla f(a, b) \cdot \vec{v}$
② $\nabla f$ is the Direction of Steepest Ascent & $||\nabla f||$ is the maximum Rate of Change

  • When $\vec{v}$ lies along $\nabla f(a,b)$, $\nabla_{\vec{v}} f(a, b) = \nabla f(a, b) \cdot \vec{v}$ is maximized (a numeric check follows after this list), and
    • the Normalized Gradient $\frac{\nabla f(a,b)}{||\nabla f(a,b)||}$ is the unit vector with the maximum rate of change,
    • the Magnitude of the Gradient $||\nabla f(a,b)||$ is that maximum rate of change!
  • [Interpretation] Let's interpret this dot product as the projection of $\vec{v}$ onto $\nabla f(a, b)$.

    • Geometric interpretation

      • Since only the direction of $\nabla f(a, b)$ matters for the projection, assume without loss of generality that it is a unit vector. Then $\nabla f(a, b) \cdot \vec{v}$ is the length of $\vec{v}$ projected onto $\nabla f(a, b)$, and this length is maximized when the direction of $\vec{v}$ coincides with that of $\nabla f(a, b)$.
    • Algebraic interpretation

      • $\nabla_{\vec{v}} f(a, b) = \nabla f(a, b) \cdot \vec{v} = ||\vec{v}^{||\nabla f(a,b)}|| \cdot ||\nabla f(a,b)||$, where $||\vec{v}^{||\nabla f(a,b)}|| \le ||\vec{v}|| = 1$

      • $\nabla_{\vec{v}} f(a, b)$ is maximized when $||\vec{v}^{||\nabla f(a,b)}||$ is maximized, i.e., when $\vec{v}$ lies along $\nabla f(a,b)$

        • $\argmax\limits_{\vec{v},\ ||\vec{v}|| = 1} ||\vec{v}^{||\nabla f(a,b)}|| = \frac{\nabla f(a,b)}{||\nabla f(a,b)||}$
    • Why consider unit vectors

      • The reason for considering a unit vector is that the gradient is a vector that points in the direction of maximum increase of the function, and has magnitude equal to the maximum rate of change of the function. So, if we take the dot product of the gradient and a unit vector, we obtain the rate of change of the function in the direction of the unit vector.
      • Moreover, by considering a unit vector, the magnitude of the gradient does not affect the result, and we can compare the rate of change of the function in different directions easily, without having to normalize the gradient.
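
A small numeric sanity check of ① and ②: sampling many random unit vectors, the directional derivative $\nabla f \cdot \vec v$ peaks when $\vec v$ is closest to the normalized gradient, and the peak value is $||\nabla f||$. The gradient value below is an arbitrary illustrative choice.

```python
# Over random unit vectors v, grad . v is largest when v is closest to the
# normalized gradient, and the maximum value is ||grad||.
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([8.0, 3.0])                       # an illustrative gradient value

vs = rng.normal(size=(10_000, 2))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # random unit directions

rates = vs @ grad                                 # directional derivatives grad . v
best = vs[np.argmax(rates)]

print(rates.max(), np.linalg.norm(grad))          # both ≈ 8.544 (= sqrt(73))
print(best, grad / np.linalg.norm(grad))          # best direction ≈ normalized gradient
```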

③ Perpendicular to Contour lines

  • Consider several vectors at a point alongside the gradient vector. We said the gradient is the vector that approaches the next contour line fastest.
  • If the gradient is the Steepest Ascent direction, it can only be Perpendicular to the contour lines! (A numeric check follows below.)
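
As a minimal numeric check of the perpendicularity claim, take the illustrative function $f(x, y) = x^2 + y^2$, whose contour lines are circles: at points on a contour, the gradient is orthogonal to the contour's tangent.

```python
# Along a level curve of f(x, y) = x^2 + y^2 (a circle), grad(f) is perpendicular
# to the curve's tangent direction.
import numpy as np

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2
    return np.array([2 * x, 2 * y])

t = np.linspace(0, 2 * np.pi, 8, endpoint=False)
for x, y in zip(2 * np.cos(t), 2 * np.sin(t)):       # points on the contour f = 4
    tangent = np.array([-y, x])                      # tangent of the circle at (x, y)
    print(round(float(grad_f(x, y) @ tangent), 12))  # ≈ 0 at every point
```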

🇯 JACOBIAN MATRIX (Vector-Valued)

1. Definition

$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$

2. Properties

$J \cdot [dx_1, \cdots, dx_n]^\top = [df_1, \cdots, df_m]^\top$

$|J|\,du\,dv = dx\,dy$, where $J$ is the Jacobian of the map from $(u,v)$ to $(x,y)$
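
As a sketch of this change-of-variables property, take the polar map $x = r\cos\theta$, $y = r\sin\theta$, whose Jacobian determinant is $r$: summing $|J|\,dr\,d\theta$ over the rectangle $[0,2]\times[0,2\pi]$ reproduces the area $4\pi$ of the corresponding disk. The grid resolution below is arbitrary.

```python
# |J| du dv = dx dy for the polar map: integrating |J| = r over a (r, theta)
# rectangle gives the area of the corresponding disk in (x, y).
import numpy as np

n = 1000
dr, dtheta = 2.0 / n, 2 * np.pi / n
r_mid = (np.arange(n) + 0.5) * dr        # midpoints of the r-cells

# |J| = r is independent of theta, so the double sum factorizes
area = np.sum(r_mid * dr) * (n * dtheta)

print(area, 4 * np.pi)                   # both ≈ 12.566
```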

🇭 HESSIAN MATRIX

1. Definition

The Hessian matrix is a square matrix that describes the second-order partial derivatives of a scalar-valued function.

$H(f) = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2f}{\partial x_1\partial x_n} \\ \frac{\partial^2f}{\partial x_2\partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \cdots & \frac{\partial^2f}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_n\partial x_1} & \frac{\partial^2f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2f}{\partial x_n^2} \end{bmatrix}$

Algebraically, it represents the rate of change of the gradient of the function, which provides information about the curvature of the function.

Geometrically, the Hessian matrix has a significant meaning in terms of the shape of a surface defined by the function.

  • The eigenvalues of the Hessian matrix indicate the curvature of the surface along different directions in the domain. If all eigenvalues are positive, the surface is locally convex, curving upward like a bowl. If all eigenvalues are negative, it is locally concave, curving downward like an inverted bowl. If the eigenvalues have mixed signs, the point is a saddle, and zero eigenvalues indicate local flatness along the corresponding directions.
  • In two dimensions, the sign of the determinant of the Hessian is also informative: a positive determinant means both eigenvalues share the same sign (a local minimum or maximum), while a negative determinant means they have opposite signs (a saddle point).

In summary, the Hessian matrix provides information about the local behavior of a function and its shape. This information is useful in various applications, such as optimization, where the Hessian matrix is used to determine the type of stationary points (minimum, maximum, or saddle point) and to compute the search direction for optimization algorithms.
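
As a sketch of how this is used in optimization, the snippet below classifies stationary points from the Hessian's eigenvalues; the two example functions $x^2 - y^2$ (saddle) and $x^2 + y^2$ (minimum) and the helper name `classify` are illustrative choices.

```python
# Classify a stationary point from the eigenvalues of the Hessian there.
import numpy as np

def classify(hessian):
    eig = np.linalg.eigvalsh(hessian)            # Hessians are symmetric
    if np.all(eig > 0):
        return "local minimum (locally convex)"
    if np.all(eig < 0):
        return "local maximum (locally concave)"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "degenerate (some zero eigenvalue)"

H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of x^2 - y^2
H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])      # Hessian of x^2 + y^2

print(classify(H_saddle))                        # saddle point
print(classify(H_bowl))                          # local minimum (locally convex)
```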

2. Properties

$[dx_1, \cdots, dx_n] \cdot H \cdot [dx_1, \cdots, dx_n]^\top = d^2f$ (a scalar: the second-order differential of $f$)
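
Read concretely, this quadratic form is the second-order term of the Taylor expansion: $f(x+dx) \approx f(x) + \nabla f \cdot dx + \frac{1}{2}\,dx^\top H\,dx$. A minimal numeric sketch with an arbitrary quadratic $f$ (for which the second-order expansion is exact):

```python
# For small dx: f(x + dx) - f(x) - grad . dx  ≈  0.5 * dx^T H dx.
import numpy as np

def f(x):
    # Illustrative quadratic: x1^2 + 3*x1*x2 + 2*x2^2
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]])

H = np.array([[2.0, 3.0], [3.0, 4.0]])   # constant Hessian of this quadratic

x = np.array([1.0, -1.0])
dx = np.array([1e-3, 2e-3])

second_order = f(x + dx) - f(x) - grad_f(x) @ dx   # change in f beyond first order
quad_form = 0.5 * dx @ H @ dx                      # 0.5 * dx^T H dx

print(second_order, quad_form)                     # equal, since f is exactly quadratic
```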

$\begin{bmatrix} \frac{\partial^2f}{\partial x^2} & \frac{\partial^2f}{\partial x\partial y} \\ \frac{\partial^2f}{\partial y\partial x} & \frac{\partial^2f}{\partial y^2} \end{bmatrix} \begin{bmatrix} \frac{\partial g}{\partial x} \\ \frac{\partial g}{\partial y} \end{bmatrix} = \begin{bmatrix} \frac{\partial h}{\partial x} \\ \frac{\partial h}{\partial y} \end{bmatrix}$

When we apply the Hessian matrix to another function, what we are actually doing is transforming the second function in a way that is informed by the behavior of the first function. In other words, the Hessian matrix of the first function provides information about the local curvature of the function, and this information is used to transform the second function in a way that reflects that curvature.

Here is a simple example to illustrate the idea. Let's say that we have two functions, f(x, y) and g(x, y). We can think of the Hessian matrix of f(x, y) as a linear transformation, H, that operates on g(x, y) to produce a new function h(x, y). The operation can be written as h(x, y) = H(g(x, y)).

In matrix form, we can write the operation as follows:

$\begin{bmatrix} \frac{\partial^2f}{\partial x^2} & \frac{\partial^2f}{\partial x\partial y} \\ \frac{\partial^2f}{\partial y\partial x} & \frac{\partial^2f}{\partial y^2} \end{bmatrix} \begin{bmatrix} \frac{\partial g}{\partial x} \\ \frac{\partial g}{\partial y} \end{bmatrix} = \begin{bmatrix} \frac{\partial h}{\partial x} \\ \frac{\partial h}{\partial y} \end{bmatrix}$

In this example, the Hessian matrix of f(x, y) acts on the gradient of g(x, y) to produce the gradient of the new function h(x, y). The new function h(x, y) will have properties that reflect the curvature of f(x, y), and these properties can be useful in various optimization and machine learning algorithms.

  • I should look for more examples of operations with the Hessian.

~~Here's another example of using the Hessian matrix for a linear transformation:

Suppose we have a function h(x, y) = x^2 + 2y^2, and we want to find the second-order behavior of the function near the point (1, 2). The gradient of the function is given by:

$\nabla h(x, y) = \begin{bmatrix} \frac{\partial h}{\partial x} \\ \frac{\partial h}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x \\ 4y \end{bmatrix}$

The Hessian of h is then given by:

$\mathbf{H}_h(x, y) = \begin{bmatrix} \frac{\partial^2 h}{\partial x^2} & \frac{\partial^2 h}{\partial x \partial y} \\ \frac{\partial^2 h}{\partial y \partial x} & \frac{\partial^2 h}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}$

Evaluating the Hessian at the point (1, 2), we get:

$\mathbf{H}_h(1, 2) = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}$

We can use this Hessian to linearly transform another function g(x, y) = x + y. The gradient of g is given by:

$\nabla g(x, y) = \begin{bmatrix} \frac{\partial g}{\partial x} \\ \frac{\partial g}{\partial y} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$

And the Hessian of g is given by:

$\mathbf{H}_g(x, y) = \mathbf{H}_h(x, y) \cdot \nabla g(x, y) = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}$

This means that the second-order behavior of the function g near the point (1, 2) is determined by the linear transformation described by the Hessian of h. The resulting Hessian of g provides information about the direction of maximum increase or decrease, and the steepness of the function g.

Why use the Hessian of h, which has nothing to do with g, to find the Hessian of g?

In general, the Hessian matrix of one function may be used to perform a linear transformation on another function because the Hessian matrix encodes information about the second-order behavior of the first function, such as the curvature. This information can then be used to derive information about the second-order behavior of the transformed function. For example, if we have a function g which is a composition of another function f and a linear transformation, then the Hessian of g can be related to the Hessian of f and the matrix representing the linear transformation. In this case, using the Hessian of f can provide valuable information about the second-order behavior of g.

Note that the Hessian of h, which has nothing to do with g, is only used in this context if h is used to model some aspect of the function g.~~

😎 Done.


  • References

  • Proof 1 / Proof 2
    Here's a proof of $\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{v}$, where $\nabla f(\mathbf{x})$ is the gradient of the function $f$ at the point $\mathbf{x}$, and $\mathbf{v}$ is a vector:

    Let $f(\mathbf{x} + t\mathbf{v})$ be a scalar function that describes the change in the value of the function $f(\mathbf{x})$ as $\mathbf{x}$ moves in the direction of the vector $\mathbf{v}$. The derivative of this function with respect to the scalar $t$ is given by:

    $\frac{d}{dt}f(\mathbf{x} + t\mathbf{v}) = \nabla_{\mathbf{v}}{f}(\mathbf{x}) = \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial t}$

    Using the chain rule of differentiation, we can expand the partial derivative as:

    $\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial t} = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial (x_i+tv_i)} \frac{\partial (x_i+tv_i)}{\partial t} = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x} + t\mathbf{v})}{\partial(x_i+tv_i)} v_i = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i} v_i = \nabla f(\mathbf{x})\cdot \mathbf{v}$

    where $x_i$ is the $i$-th component of the vector $\mathbf{x}$, and $v_i$ is the $i$-th component of the vector $\mathbf{v}$.

    Since the gradient of the function $f(\mathbf{x})$ is given by $\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\right]$, we can write:

    $\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \sum\limits_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i}\, v_i = \nabla f(\mathbf{x}) \cdot \mathbf{v}$

    This proves the result $\nabla_{\mathbf{v}}{f}(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{v}$.
