[REVIEW] Meta-Learning with a Geometry-Adaptive Preconditioner

SHIN · June 12, 2023

1. Introduction

  • MAML is one of the most popular optimization-based meta-learning algorithms.
  • Preconditioned MAML
    - Improves MAML by using Preconditioned Gradient Descent (PGD) for inner-loop optimization.
    - Meta-learns not only the initialization parameters of the model but also the meta-parameters of the preconditioner $P$.
    - In prior work, $P$ was adapted either to the inner step $k$ or to the individual task, but not to both, and the Riemannian metric condition was not satisfied.
    • Riemannian metric: a condition under which steepest descent can be achieved on a given parameter space.
  • The paper proposes Geometry-Adaptive Preconditioned gradient descent (GAP), which has two properties that earlier methods did not consider.
    - $P_{GAP}$ is adapted to each individual task and to the optimization path (path dependent, i.e., inner-step dependent).
    - $P_{GAP}$ is a Riemannian metric.

2. Background

2.1. MAML

Inner loop
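
The MAML inner loop is usually written as plain gradient descent on the task's training loss. The step size $\alpha$ and the initialization $\theta_{\tau,0}=\theta$ are standard MAML conventions assumed here for illustration:

$$\theta_{\tau,k+1} = \theta_{\tau,k} - \alpha \nabla_{\theta_{\tau,k}} L_{\tau}^{in}(\theta_{\tau,k}; D^{tr}), \qquad \theta_{\tau,0} = \theta$$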

Outer loop
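
The outer loop updates the shared initialization using the loss of the adapted parameters on held-out data. A common way to write it is shown below; the outer learning rate $\beta$, the adapted parameters $\theta_{\tau}'$ (the result of the last inner step), and the validation set $D^{val}$ are symbols assumed here for illustration:

$$\theta \leftarrow \theta - \beta \nabla_{\theta} E_{\tau}\left[L_{\tau}^{out}(\theta_{\tau}'; D^{val})\right]$$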

        Note that the MAML paper uses $\Sigma_{\tau\sim P(\tau)}L_\tau$ instead of $E_{\tau}[L_{\tau}^{out}]$.
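
To make the inner/outer structure concrete, here is a minimal sketch on a toy problem, not the setup used in the paper. It assumes scalar tasks with loss $0.5(\theta - c_\tau)^2$, a single inner step, and hypothetical names (`inner_step`, `outer_grad`, `maml`); because the loss is quadratic, the exact outer gradient through the inner step can be written by hand.

```python
# Toy MAML with one inner step on scalar quadratic tasks.
# Task tau has loss L_tau(theta) = 0.5 * (theta - c_tau)**2 (hypothetical toy setup).

def inner_step(theta, c, alpha):
    """One inner-loop gradient step on the task loss."""
    grad = theta - c
    return theta - alpha * grad

def outer_grad(theta, c, alpha):
    """Exact gradient of the outer loss w.r.t. the initialization theta."""
    theta_adapted = inner_step(theta, c, alpha)
    # Chain rule: d(theta_adapted)/d(theta) = 1 - alpha
    return (theta_adapted - c) * (1.0 - alpha)

def maml(task_centers, theta=5.0, alpha=0.1, beta=0.5, steps=200):
    """Outer loop: average the outer gradients over the task batch."""
    for _ in range(steps):
        g = sum(outer_grad(theta, c, alpha) for c in task_centers) / len(task_centers)
        theta -= beta * g
    return theta

theta_star = maml([-1.0, 1.0, 3.0])  # converges to the mean task center
```

Averaging over the batch corresponds to the expectation form; using a plain sum over a fixed-size batch only rescales the effective outer learning rate.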

2.2. PGD

Preconditioned gradient update
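
A preconditioned gradient step is typically written as an ordinary gradient step whose gradient is multiplied by a preconditioner matrix $P$. The symbols $\alpha$ (step size) and $L$ (loss) are assumed here for illustration:

$$\theta_{k+1} = \theta_{k} - \alpha P \nabla_{\theta_k} L(\theta_k)$$

With $P = I$ this reduces to plain gradient descent.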


  • PGD often reduces the effect of pathological curvature and speeds up optimization.
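
The effect on curvature can be seen on a small example. The sketch below assumes a hypothetical ill-conditioned quadratic and uses the inverse Hessian as an idealized preconditioner, which is not what GAP learns; the function names are illustrative only.

```python
import numpy as np

# Ill-conditioned quadratic L(x) = 0.5 * x^T A x (hypothetical toy problem).
A = np.diag([1.0, 100.0])

def grad(x):
    return A @ x

def descend(x0, lr, steps, P=None):
    """Run (preconditioned) gradient descent; P=None means plain GD."""
    x = x0.copy()
    P = np.eye(len(x0)) if P is None else P
    for _ in range(steps):
        x = x - lr * (P @ grad(x))
    return x

x0 = np.array([1.0, 1.0])
# Plain GD: lr must stay below 2/100 for stability, so the flat direction crawls.
x_gd = descend(x0, lr=0.01, steps=10)
# PGD with the inverse Hessian as preconditioner: one step reaches the minimum.
x_pgd = descend(x0, lr=1.0, steps=1, P=np.linalg.inv(A))
```

After ten steps plain gradient descent has barely moved along the low-curvature axis, while the preconditioned step rescales both axes to the same curvature.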

2.3. Unfolding: reshaping a tensor into a matrix

       Tensor and Tensor decompositions link
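
As a rough illustration of unfolding, the sketch below turns one axis of a tensor into the rows of a matrix and flattens the rest into columns. The helper names `unfold` and `fold` are hypothetical, and the column order follows NumPy's row-major reshape, which may differ from the textbook definition by a permutation of columns.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding (0-indexed): that axis becomes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor with the given original shape."""
    moved_shape = (shape[mode],) + tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(matrix.reshape(moved_shape), 0, mode)

# Hypothetical conv kernel: (out_channels, in_channels, height, width).
W = np.arange(2 * 3 * 2 * 2, dtype=float).reshape(2, 3, 2, 2)
W1 = unfold(W, mode=0)               # "mode-1" in the 1-indexed convention
W_back = fold(W1, mode=0, shape=W.shape)
```

Here `W1` has shape `(2, 12)`: one row per output channel, and folding restores the original tensor.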

2.4. Riemannian manifold

       Riemannian manifold and metric
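
Why a Riemannian metric is tied to steepest descent can be summarized in one line. Assume the parameter space carries a metric $R(\theta)$, a symmetric positive definite matrix at each point (the symbols $R$, $d$, and $\epsilon$ are introduced here only for illustration). The steepest descent direction is the step of fixed length, measured by that metric, that decreases the loss the most to first order:

$$\arg\min_{d\,:\,d^{T}R(\theta)d=\epsilon^{2}} \nabla L(\theta)^{T}d \;=\; -\epsilon\,\frac{R(\theta)^{-1}\nabla L(\theta)}{\sqrt{\nabla L(\theta)^{T}R(\theta)^{-1}\nabla L(\theta)}}$$

So a preconditioned step in the direction $-P\nabla L$ is a steepest descent step under the metric $P^{-1}$, provided $P$ is symmetric positive definite.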

3. Methodology

Setting

  • $L$ : number of layers of the neural network
  • $\theta = \{\mathsf{W}^1,\cdots,\mathsf{W}^l,\cdots,\mathsf{W}^L\}$ : CNN parameters
  • $\phi = \{\mathbf{M}^1,\cdots,\mathbf{M}^l,\cdots,\mathbf{M}^L\}$ : preconditioner parameters
  • $\mathcal{T}_i\sim p(\mathcal{D})$, $\mathcal{T}_i$ : batch of tasks
  • $\tau\in \mathcal{T}_i$, $\tau$ : a task
  • $K$ : number of samples

3.1. GAP: Geometry-Adaptive Preconditioner

3.1.1 Inner-Loop Optimization

  • $L$-layer neural network $f_\theta(\cdot)$ with parameters $\theta =\{\mathsf{W}^1,\cdots,\mathsf{W}^l,\cdots,\mathsf{W}^L\}$

  • Let the gradient be $\mathsf{G}_{\tau,k}^l=\nabla_{\mathsf{W}_{\tau,k}^l} L_{\tau}^{in}(\theta_{\tau,k};D^{tr})$

  • Reshape the tensor $\mathsf{G}_{\tau,k}^l$ with mode-1 unfolding into the matrix $G_{\tau,k}^l$. (mode-1 performs the best)

  • Define additional meta-parameters $\phi = \{M^l\}_{l=1}^L$, where $M^l$ is a diagonal matrix with entries $Sp(m_i^l)$, $m_i^l \in \mathbb{R}$, and $Sp(x)=\frac{1}{2}\log(1+\exp(2x))$

  • Apply SVD to $G_{\tau,k}^l = U_{\tau,k}^l\Sigma_{\tau,k}^l(V_{\tau,k}^l)^T$,
    and define $\widetilde{G}_{\tau,k}^l = U_{\tau,k}^l(M^l\Sigma_{\tau,k}^l)(V_{\tau,k}^l)^T$

  • Reshape $\widetilde{G}_{\tau,k}^l$ back to its original tensor form $\widetilde{\mathsf{G}}_{\tau,k}^l$

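  • Preconditioned gradient descent of GAP becomes $\theta_{\tau,k+1} = \theta_{\tau,k} - \alpha\,\widetilde{\mathsf{G}}_{\tau,k}$, where $\widetilde{\mathsf{G}}_{\tau,k}$ collects the preconditioned gradients $\widetilde{\mathsf{G}}_{\tau,k}^l$ of all layers and $\alpha$ is the inner-loop learning rate.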
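
The steps above can be sketched for a single layer as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `softplus2` and `gap_precondition`, the tensor shapes, and the learning rate are hypothetical, and the mode-1 unfolding is done with a row-major reshape.

```python
import numpy as np

def softplus2(x):
    """Sp(x) = 0.5 * log(1 + exp(2x)), computed stably; always positive."""
    return 0.5 * np.logaddexp(0.0, 2.0 * x)

def gap_precondition(grad_tensor, m):
    """Scale the singular values of the mode-1 unfolded gradient by Sp(m)."""
    shape = grad_tensor.shape
    G = grad_tensor.reshape(shape[0], -1)              # mode-1 unfolding
    U, S, Vt = np.linalg.svd(G, full_matrices=False)   # G = U diag(S) Vt
    G_tilde = U @ np.diag(softplus2(m) * S) @ Vt       # U (M Sigma) V^T
    return G_tilde.reshape(shape)                      # fold back to a tensor

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3, 2, 2))   # hypothetical conv-layer gradient
m = rng.standard_normal(4)                 # one meta-parameter per singular value
W = rng.standard_normal((4, 3, 2, 2))      # hypothetical layer weights
alpha = 0.1                                # assumed inner-loop learning rate
W_next = W - alpha * gap_precondition(grad, m)
```

Passing `m` through `Sp` keeps every scaling factor positive, so the singular values are rescaled but never flipped in sign.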

3.1.2 Outer-loop Optimization

  • GAP learns two parameter sets, $\theta$ and $\phi$:
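
Both sets can be updated by gradient steps on the expected outer loss. The form below is a sketch; the separate learning rates $\beta_1$ and $\beta_2$, the adapted parameters $\theta_\tau'$ (the result of the last inner step), and the validation set $D^{val}$ are notational assumptions:

$$\theta \leftarrow \theta - \beta_1 \nabla_{\theta} E_{\tau}\left[L_{\tau}^{out}(\theta_{\tau}'; D^{val})\right]$$

$$\phi \leftarrow \phi - \beta_2 \nabla_{\phi} E_{\tau}\left[L_{\tau}^{out}(\theta_{\tau}'; D^{val})\right]$$

Here $\theta_\tau'$ depends on $\phi$ as well as $\theta$, because every inner step uses the preconditioned gradient.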

3.1.3 Training procedure

3.1.4 Desirable properties of GAP

    1. Fully adaptive in both a task-specific ($\tau$) and a path-dependent ($k$) way
    2. Riemannian metric
    • Steepest descent learning
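
Both properties can be checked numerically on a single unfolded gradient. The sketch below relies on the algebraic identity $U(M\Sigma)V^T = (UMU^T)\,U\Sigma V^T$, which holds when $U$ is square and orthogonal, so scaling singular values is the same as left-multiplying the gradient by $P = UMU^T$. The shapes and variable names are hypothetical.

```python
import numpy as np

def softplus2(x):
    """Sp(x) = 0.5 * log(1 + exp(2x)); positive for every real x."""
    return 0.5 * np.logaddexp(0.0, 2.0 * x)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 12))           # hypothetical unfolded gradient (rows <= cols)
m = rng.standard_normal(4)                 # hypothetical meta-parameters

U, S, Vt = np.linalg.svd(G, full_matrices=False)   # U is 4x4 and orthogonal here
M = np.diag(softplus2(m))
P = U @ M @ U.T                            # preconditioner implied by the scaling

G_tilde = U @ (M @ np.diag(S)) @ Vt        # singular-value scaling form
is_symmetric = np.allclose(P, P.T)
min_eig = np.linalg.eigvalsh((P + P.T) / 2).min()
same_update = np.allclose(P @ G, G_tilde)
```

`P` is symmetric with strictly positive eigenvalues, which is the positive-definiteness a Riemannian metric requires. Because `U` comes from the SVD of the current gradient, `P` changes with the task and the inner step, while `M` is shared.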