[ML] Neural ODE

ball · May 29, 2024

Summary

It is surprising that a Neural ODE can express a differential equation.
I want to talk about how we can solve an IVP (Initial Value Problem) using a neural network.

IVP (Initial Value Problem)

An IVP is a classic problem in calculus.

$$h(T) = h(0) + \int_{0}^{T}\frac{dh(t)}{dt}dt$$

If $h(0)$ and $\frac{dh(t)}{dt}$ are given, then we can calculate the integral and get $h(T)$.

However, computers cannot evaluate this integral exactly in general. We need an algorithm that lets computers calculate an approximate solution of the IVP.

Euler Discretization

Given an IVP

$$h(T) = h(0) + \int_{0}^{T}\frac{dh(t)}{dt}dt$$

let's discretize it with a step size $s \approx 0$.

$$f(h(t)) = \frac{dh(t)}{dt}$$
$$h(t+s) = h(t) + s\cdot f(h(t)) \\ h(t+2s) = h(t+s) + s\cdot f(h(t+s)) \\ \vdots \\ h(T) = h(T-s) + s\cdot f(h(T-s))$$

Given $h(t)$ and $f(h(t))$, evaluating $h(T)$ is a forward problem.

Given data $h(x_1), h(x_2), \dots, h(x_n)$, recovering $f(h(t))$ is a backward problem.
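As a concrete example of the forward problem, here is a minimal Python sketch of Euler Discretization. The toy dynamics $f(h) = -h$, the step size, and the horizon are my own choices for illustration.

```python
import numpy as np

def f(h):
    """Right-hand side dh/dt = f(h). Here f(h) = -h, chosen only as a toy example."""
    return -h

def euler_solve(h0, T, s=1e-3):
    """Approximate h(T) from h(0) by repeating h(t+s) = h(t) + s * f(h(t))."""
    h = h0
    n_steps = int(T / s)
    for _ in range(n_steps):
        h = h + s * f(h)
    return h

# Forward problem: given h(0) and f, evaluate h(T).
h0, T = 1.0, 2.0
approx = euler_solve(h0, T)
exact = h0 * np.exp(-T)   # analytic solution of dh/dt = -h
print(approx, exact)      # the two values should be close for small s
```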

Euler Discretization with ResNet

It is surprising that the Euler Discretization algorithm can be implemented by a ResNet. The following image shows part of a ResNet. A residual block computes $z_{t+1} = z_t + f(z_t)$, which is exactly one Euler step with step size 1, so the Euler Discretization algorithm can be expressed with a ResNet.
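To make the connection concrete, here is a minimal PyTorch sketch of a residual block, assuming a toy two-layer residual function of my own choosing.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One ResNet block: z_{t+1} = z_t + f_theta(z_t), i.e. an Euler step with s = 1."""
    def __init__(self, dim=16):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, z):
        return z + self.f(z)   # skip connection = Euler update

# Stacking N blocks corresponds to taking N Euler steps of the underlying ODE.
blocks = nn.Sequential(*[ResidualBlock() for _ in range(4)])
z0 = torch.randn(8, 16)
zT = blocks(z0)
```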

Alternative ODE Solvers

Euler Discretization is a very simple (and powerful) algorithm for solving IVPs. There are many other ODE solvers, such as the Runge-Kutta methods or the DOPRI (Dormand-Prince) method. They build on the same idea as Euler Discretization, but use more calculations per step to make the result more accurate.
In the DOPRI method, the step size changes adaptively depending on how quickly $h(t)$ changes.
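As an example of an adaptive solver, SciPy's `solve_ivp` uses the Dormand-Prince pair (`method="RK45"`) by default. Here is a small sketch with the same toy dynamics $dh/dt = -h$ as above.

```python
from scipy.integrate import solve_ivp

def f(t, h):
    """dh/dt = f(h); -h is used only as a toy example."""
    return -h

# RK45 is the Dormand-Prince (DOPRI) pair; the solver adapts its step size
# from a local error estimate, taking larger steps where h(t) changes slowly.
sol = solve_ivp(f, t_span=(0.0, 2.0), y0=[1.0], method="RK45")
print(sol.t)          # non-uniform time points chosen by the adaptive solver
print(sol.y[0, -1])   # approximation of h(2), close to exp(-2)
```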

Training Neural ODE

We can train a Neural ODE (viewed as a ResNet) using the normal backpropagation of a ResNet. However, we have a problem.
Assume the total number of steps in the DOPRI solver is 10,000. If we use the normal backpropagation method, we need 10,000 layers, which is impractical. Recent research uses several hundred layers, and that already requires a massive amount of computation and memory for backpropagation.

We need an alternative method for training Neural ODEs.

Normal Backpropagation Method

Let's see how training works if we apply the normal backpropagation method to a Neural ODE.
$\frac{\partial L}{\partial z_{T}}$ is known, where $z_T$ is the output of the last layer.
Let's say the step size is $h \approx 0$.
Let's define $a_t = \frac{\partial L}{\partial z_t}$.

$$z_{t+h} = z_t + h\cdot f(z_t)$$

$$\frac{\partial L}{\partial z_{t}} = \frac{\partial L}{\partial z_{t+h}}\cdot \frac{\partial z_{t+h}}{\partial z_t} = \frac{\partial L}{\partial z_{t+h}}\cdot \left\{1+h\cdot\frac{\partial f(z_t)}{\partial z_t} \right\} = a_{t+h} \cdot \left\{1+h\cdot\frac{\partial f(z_t)}{\partial z_t} \right\}$$
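We can verify this recursion numerically. Below is a small PyTorch sketch with a scalar toy dynamics $f(z) = \sin(z)$ and a toy loss, both my own choices for the check.

```python
import torch

# Check dL/dz_t = a_{t+h} * (1 + h * df/dz_t) for one Euler step with f(z) = sin(z).
h = 1e-3
z_t = torch.tensor(0.7, requires_grad=True)
z_next = z_t + h * torch.sin(z_t)        # one Euler step
L = (z_next - 1.0) ** 2                  # toy loss on z_{t+h}

a_next = torch.autograd.grad(L, z_next, retain_graph=True)[0]    # a_{t+h}
a_t_autograd = torch.autograd.grad(L, z_t)[0]                    # dL/dz_t via autograd
a_t_formula = a_next * (1 + h * torch.cos(z_t))                  # the recursion above

print(a_t_autograd.item(), a_t_formula.item())   # the two values agree
```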

Using the equation above, we can get the gradient for the parameters $\theta_t$ of the layer that maps $z_t$ to $z_{t+h}$.

$$\frac{\partial L}{\partial \theta_t} = \frac{\partial L}{\partial z_{t+h}} \cdot \frac{\partial z_{t+h}}{\partial \theta_t}$$

Then we can derive

$$\frac{\partial L}{\partial \theta_t} = a_{t+h} \cdot \frac {\partial (z_t + h \cdot f(z_t))}{\partial \theta_t} = a_{t+h} \cdot h \cdot \frac {\partial f(z_t)} {\partial \theta_t}$$

This means that we need to keep a layer (and its activations) for every step. This requires a huge amount of computation and memory. We have to think of a better option.
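Here is a rough PyTorch sketch of the problem, assuming a toy dynamics network and a squared-error loss of my own choosing: autograd records every Euler step, so the activation memory grows linearly with the number of steps.

```python
import torch
import torch.nn as nn

dim, steps, s = 16, 1000, 1e-2
f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))  # f_theta(z)

z = torch.randn(8, dim)
target = torch.randn(8, dim)

# Naive backpropagation: every Euler step is recorded in the autograd graph,
# so all `steps` intermediate activations stay in memory until .backward().
for _ in range(steps):
    z = z + s * f(z)

loss = ((z - target) ** 2).mean()
loss.backward()   # unrolls through all 1000 steps, like a 1000-layer network
```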

Adjoint Sensitivity Method

Let's start from a simple equation for the value of $z(t+h)$.

$$z(t+h) = z(t) + \int_{t}^{t+h}f(z(t'))dt'$$

This is the starting point of the adjoint method.
Now let's define $a(t) = \frac{\partial L}{\partial z(t)}$.

$$a(t) = \frac{\partial L}{\partial z(t)} = \frac{\partial L}{\partial z(t+h)} \cdot \frac{\partial z(t+h)}{\partial z(t)} = a(t+h) \cdot \left\{ 1 + \frac{\partial \int_{t}^{t+h}f(z(t'))dt'}{\partial z(t)}\right\}$$

Using the adjoint method, we can easily calculate $a(t)$ through $\frac{da(t)}{dt}$.

$$\frac{da(t)}{dt}= \lim_{\varepsilon \to 0^+} \frac{a(t+\varepsilon) - a(t)}{\varepsilon } = \lim_{\varepsilon \to 0^+} \frac{a(t+\varepsilon) - a(t+\varepsilon) \left\{ 1 + \frac{\partial \int_{t}^{t+\varepsilon}f(z(t'))dt'}{\partial z(t)}\right\} }{\varepsilon }$$

Using the following equation,

$$\int_{t}^{t+\varepsilon}f(z(t'))dt' = \int_{t}^{t+\varepsilon} \frac {\partial z(t')} {\partial t'} dt' = z(t+\varepsilon) - z(t)$$

For small $\varepsilon$, this integral is approximately $\varepsilon \cdot f(z(t))$, so its derivative with respect to $z(t)$ is approximately $\varepsilon \cdot \frac{\partial f(z(t))}{\partial z(t)}$. We can derive the following equation.

$$\frac {da(t)}{dt} = \lim_{\varepsilon \to 0^+} -a(t+\varepsilon) \cdot \frac {\partial f(z(t))} {\partial z(t)} = -a(t) \cdot \frac {\partial f(z(t))} {\partial z(t)}$$

Using $\frac{da(t)}{dt}$ and $a(T)$, we can calculate $a(t')$ for any $t'$ by solving this ODE backward in time.
We can plug this into the normal backpropagation method with far fewer stored layers. We don't need to keep a layer for every step.

$$\frac{\partial L}{\partial \theta_t} = a_{t+h} \cdot \frac {\partial (z_t + h \cdot f(z_t))}{\partial \theta_t} = a_{t+h} \cdot h \cdot \frac {\partial f(z_t)} {\partial \theta_t}$$
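Here is a minimal NumPy sketch of this idea in the discrete (Euler) setting, assuming a toy dynamics $f(z) = \tanh(Wz)$ and a squared-error loss, both my own choices: after the forward pass we keep only $z(T)$, then integrate $z$ and $a$ backward together, accumulating $a_{t+h}\cdot h\cdot \frac{\partial f(z_t)}{\partial \theta}$ along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, h = 4, 100, 1e-2
W = rng.normal(size=(d, d)) * 0.1     # parameters theta of f(z) = tanh(W z)
z0 = rng.normal(size=d)
target = rng.normal(size=d)

def f(z):
    return np.tanh(W @ z)

# Forward pass: only the final state z(T) is kept, not the intermediate z_t.
z = z0.copy()
for _ in range(steps):
    z = z + h * f(z)
zT = z

# a(T) = dL/dz(T) for L = 0.5 * ||z(T) - target||^2
a = zT - target
dL_dW = np.zeros_like(W)

# Backward pass: integrate z and a backward in time with Euler steps,
# using da/dt = -a * df/dz, and accumulate the parameter gradient on the way.
for _ in range(steps):
    z = z - h * f(z)                       # reconstruct z_t from z_{t+h} (approximately)
    y = np.tanh(W @ z)
    gate = 1.0 - y ** 2                    # derivative of tanh
    dL_dW += h * np.outer(a * gate, z)     # a_{t+h} * h * df(z_t)/dW
    a = a + h * (a * gate) @ W             # a_t = a_{t+h} * (1 + h * df(z_t)/dz_t)

print(dL_dW)   # gradient of the loss with respect to W, without storing every z_t
```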

Summary

I was surprised that we can solve differential equations using a ResNet. Machine learning is being integrated into many areas such as physics and mathematics.
