U_Week_2_Day_10

유영재 · August 13, 2021

Boostcamp AI_Tech


Lecture Notes

Lecture List

[DL Basic] Generative Models 1

  • Learning a Generative Model
    • Generation (sampling) : If we sample $x_{new} \sim p(x)$, $x_{new}$ should look like a dog
    • Density estimation (anomaly detection) : $p(x)$ should be high if $x$ looks like a dog, and low otherwise. Models that can evaluate $p(x)$ this way are also known as explicit models.
    • Unsupervised representation learning (feature learning) : We should be able to learn what these images have in common, e.g., ears, tail, etc.
      >> Then, how can we represent $p(x)$?
  • Basic Discrete Distributions
    • Bernoulli distribution : (biased) coin flip
      • $D = \{\mathrm{Heads}, \mathrm{Tails}\}$
      • Specify $P(X = \mathrm{Heads}) = p$. Then $P(X = \mathrm{Tails}) = 1 - p$.
      • Write : $X \sim \mathrm{Ber}(p)$
    • Categorical distribution : (biased) m-sided dice
      • $D = \{1, ..., m\}$
      • Specify $P(Y=i) = p_{i}$, such that $\sum_{i=1}^m p_{i}=1$
      • Write : $Y \sim \mathrm{Cat}(p_{1}, ..., p_{m})$
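The two distributions above can be sampled with the Python standard library. This is a minimal sketch: the helper names `sample_bernoulli` and `sample_categorical` and the probabilities 0.7 and (0.5, 0.3, 0.2) are illustrative assumptions, not values from the lecture.

```python
import random


def sample_bernoulli(p, rng):
    """Return 'Heads' with probability p, otherwise 'Tails' (X ~ Ber(p))."""
    return "Heads" if rng.random() < p else "Tails"


def sample_categorical(probs, rng):
    """Return a face in {1, ..., m}; face i has probability probs[i - 1]."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    faces = list(range(1, len(probs) + 1))
    return rng.choices(faces, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
coin_flips = [sample_bernoulli(0.7, rng) for _ in range(10_000)]
dice_rolls = [sample_categorical([0.5, 0.3, 0.2], rng) for _ in range(10_000)]

# Empirical frequencies should be close to the specified parameters.
heads_freq = coin_flips.count("Heads") / len(coin_flips)
face1_freq = dice_rolls.count(1) / len(dice_rolls)
print(heads_freq, face1_freq)
```

A Bernoulli distribution needs one number ($p$), and a categorical distribution over $m$ outcomes needs $m-1$ free numbers because the probabilities must sum to 1.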
  • Structure Through Independence
    • If the (binary) variables $X_{1}, ..., X_{n}$ are independent, then $p(x_{1}, ..., x_{n}) = p(x_{1})p(x_{2})...p(x_{n})$
    • How many possible states? ${\color{Blue}2^n}$
    • How many parameters to specify $p(x_{1}, ..., x_{n})$? ${\color{Blue}n}$
    • ${\color{Blue}2^n}$ entries can be described by just ${\color{Blue}n}$ numbers! But this ${\color{Blue}\mathrm{independence}}$ assumption is too strong to model useful distributions.
  • Conditional Independence
    • Three important rules
      • Chain rule : $p(x_{1}, ..., x_{n}) = p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{1},x_{2})...p(x_{n}|x_{1},...,x_{n-1})$
      • Bayes' rule : $p(x|y) = \frac{p(x,y)}{p(y)}=\frac{p(y|x)p(x)}{p(y)}$
      • Conditional independence : If $x \bot y \mid z$, then $p(x|y,z)=p(x|z)$
    • If we use the chain rule, how many parameters do we need?
      • ${\color{Blue}p(x_{1})}$ : 1 parameter
      • ${\color{Blue}p(x_{2}|x_{1})}$ : 2 parameters (one for $p(x_{2}|x_{1} = 0)$ and one for $p(x_{2}|x_{1} = 1)$)
      • ${\color{Blue}p(x_{3}|x_{1}, x_{2})}$ : 4 parameters
      • Hence, $1+2+2^{2}+...+2^{n-1}=2^{n}-1$, which is no saving over the full joint table
      • Now, suppose $X_{i+1} \bot X_{1},...,X_{i-1} \mid X_{i}$ (Markov assumption), then ${\color{Blue}p(x_{1}, ..., x_{n}) = p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{2})...p(x_{n}|x_{n-1})}$
      • How many parameters? ${\color{Blue}2n-1}$
      • Hence, by leveraging the Markov assumption, we get an exponential reduction in the number of parameters.
      • Auto-regressive models leverage this conditional independence.
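The parameter counts above can be checked with a few lines of arithmetic. This sketch assumes binary variables, as in the notes; the function names are made up for illustration.

```python
def params_independent(n):
    """Fully independent binary variables: one number p(x_i = 1) per variable."""
    return n


def params_chain_rule(n):
    """Full chain rule: p(x_i | x_1..x_{i-1}) needs one number for each of
    the 2^(i-1) configurations of the preceding variables."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))


def params_markov(n):
    """Markov assumption: 1 number for p(x_1), then 2 for each p(x_{i+1} | x_i)."""
    return 1 + 2 * (n - 1)


for n in (3, 10, 784):
    print(n, params_independent(n), params_chain_rule(n), params_markov(n))
```

For $n = 10$ this gives 10, 1023, and 19 parameters respectively, so the Markov factorization sits between the two extremes in both cost and expressiveness.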
  • Auto-regressive model
    • Suppose we have $28 \times 28$ binary pixels
    • Our goal is to learn $p(x) = p(x_{1}, ..., x_{784})$ over $x \in \{0,1\}^{784}$
    • How can we parametrize $p(x)$?
      • Let's use the chain rule to factor the joint distribution
      • $p(x_{1:784})=p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{1:2})...$
      • This is called an autoregressive model
      • Note that we need an order of all random variables.
  • NADE : Neural Autoregressive Density Estimator
    • Probability distribution of the $i$-th pixel : $p(x_{i}|x_{1:i-1}) = \sigma(a_{i}h_{i} + b_{i})$ where $h_{i} = \sigma(W_{<i}x_{1:i-1}+c)$
    • NADE is an explicit model that can compute the density of the given inputs
    • How can we compute the density of the given image?
      • Suppose we have an image with 784 binary pixels, $\{x_{1}, x_{2}, ..., x_{784}\}$
      • Then, the joint probability is computed by $p(x_{1}, ..., x_{784}) = p(x_{1})p(x_{2}|x_{1})...p(x_{784}|x_{1:783})$ where each conditional probability $p(x_{i}|x_{1:i-1})$ is computed independently
    • When modeling continuous random variables, a mixture of Gaussians can be used
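The density computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the parameters are random and untrained, the sizes (6 pixels, 4 hidden units) are tiny so that every binary image can be enumerated, and the names `init_nade` and `log_density` are hypothetical.

```python
import itertools
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def init_nade(num_pixels, num_hidden, rng):
    """Random (untrained) parameters; shapes and scale are illustrative assumptions."""
    def gauss():
        return rng.gauss(0.0, 0.5)

    return {
        "W": [[gauss() for _ in range(num_pixels)] for _ in range(num_hidden)],
        "c": [gauss() for _ in range(num_hidden)],
        "a": [[gauss() for _ in range(num_hidden)] for _ in range(num_pixels)],
        "b": [gauss() for _ in range(num_pixels)],
    }


def log_density(x, params):
    """log p(x) = sum_i log p(x_i | x_{1:i-1}) for a binary sequence x."""
    W, c, a, b = params["W"], params["c"], params["a"], params["b"]
    total = 0.0
    for i in range(len(x)):
        # h_i = sigmoid(W_{<i} x_{<i} + c): only pixels before position i are used.
        h = [sigmoid(sum(W[k][j] * x[j] for j in range(i)) + c[k])
             for k in range(len(c))]
        # p(x_i = 1 | x_{<i}) = sigmoid(a_i . h_i + b_i)
        p_one = sigmoid(sum(a[i][k] * h[k] for k in range(len(h))) + b[i])
        total += math.log(p_one if x[i] == 1 else 1.0 - p_one)
    return total


rng = random.Random(0)
params = init_nade(num_pixels=6, num_hidden=4, rng=rng)

# Because each conditional is a valid Bernoulli, the densities of all
# 2^6 binary images should sum to 1.
total_mass = sum(math.exp(log_density(x, params))
                 for x in itertools.product([0, 1], repeat=6))
print(total_mass)
```

Summing over all images is only feasible at this toy size; the point is that an explicit model assigns a normalized probability to any single input without such a sum.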
  • Pixel RNN
    • We can also use RNNs to define an auto-regressive model
    • For example, for an $n \times n$ RGB image, $p(x) = \prod_{i=1}^{n^2} p(x_{i,R}|x_{<i})p(x_{i,G}|x_{<i},x_{i,R})p(x_{i,B}|x_{<i},x_{i,R},x_{i,G})$
    • There are two model architectures in Pixel RNN, based on the ordering of the chain:
      • Row LSTM
      • Diagonal BiLSTM

[DL Basic] Generative Models 2

  • Variational Auto-encoder
    • Variational inference(VI)
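      • The goal of VI is to optimize the variational distribution so that it best matches the posterior distribution
      • Posterior distribution : $p_{\theta}(z|x)$
      • Variational distribution : $q_{\phi}(z|x)$
      • In particular, we want to find the variational distribution that minimizes the KL divergence between itself and the true posterior
      • But how? The true posterior is unknown, so we maximize the evidence lower bound (ELBO) instead: $\ln p_{\theta}(x) = \mathrm{ELBO} + D_{KL}(q_{\phi}(z|x) \| p_{\theta}(z|x))$, and since $\ln p_{\theta}(x)$ does not depend on $\phi$, raising the ELBO lowers the KL divergence
      • ELBO can further be decomposed into a reconstruction term and a prior fitting term: $\mathrm{ELBO} = \mathbb{E}_{q_{\phi}(z|x)}[\ln p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \| p(z))$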
    • Key limitation
      • It is an intractable model (the likelihood is hard to evaluate)
      • The prior fitting term must be differentiable, hence it is hard to use diverse latent prior distributions.
      • In most cases, we use an isotropic Gaussian
        $D_{KL}\left(q_{\phi}(z \mid x) \| \mathcal{N}(0, I)\right)=\frac{1}{2} \sum_{i=1}^{D}\left(\sigma_{z_{i}}^{2}+\mu_{z_{i}}^{2}-\ln \left(\sigma_{z_{i}}^{2}\right)-1\right)$
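The closed-form KL term against an isotropic Gaussian prior can be written out directly. The sketch below assumes a diagonal Gaussian $q_{\phi}(z|x)$ described by per-dimension means and standard deviations, and compares the closed form against a one-dimensional Monte Carlo estimate; the function names and test values are illustrative assumptions.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """Closed form: 0.5 * sum(sigma_i^2 + mu_i^2 - ln(sigma_i^2) - 1)."""
    return 0.5 * sum(s * s + m * m - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))


def kl_monte_carlo_1d(mu, sigma, rng, num_samples=200_000):
    """Estimate E_q[ln q(z) - ln p(z)] with z ~ N(mu, sigma^2), p = N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        log_q = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                 - (z - mu) ** 2 / (2 * sigma ** 2))
        log_p = -0.5 * math.log(2 * math.pi) - z ** 2 / 2
        total += log_q - log_p
    return total / num_samples


kl_same = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])      # q equals the prior
kl_shifted = kl_to_standard_normal([1.0, -2.0], [1.0, 1.0])  # only the means differ

rng = random.Random(0)
kl_closed_1d = kl_to_standard_normal([0.5], [0.8])
kl_sampled_1d = kl_monte_carlo_1d(0.5, 0.8, rng)
print(kl_same, kl_shifted, kl_closed_1d, kl_sampled_1d)
```

Because the expression is a smooth function of $\mu$ and $\sigma$, it can be differentiated directly, which is the property the prior fitting term needs.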
  • Adversarial Auto-encoder
    • It allows us to use any arbitrary latent distribution that we can sample from
  • GAN
    • $\min _{G} \max _{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]$
    • GAN vs VAE
    • GAN Objective
      • A two player minimax game between generator and discriminator
      • For discriminator:
        • $\max _{D} V(G, D)=E_{\mathbf{x} \sim p_{\mathrm{data}}}[\log D(\mathbf{x})]+E_{\mathbf{x} \sim p_{G}}[\log (1-D(\mathbf{x}))]$
        • where the optimal discriminator is $D_{G}^{*}(\mathbf{x})=\frac{p_{\text {data }}(\mathbf{x})}{p_{\text {data }}(\mathbf{x})+p_{G}(\mathbf{x})}$
      • For generator : $\min _{G} V(G, D)=E_{\mathbf{x} \sim p_{\text {data }}}[\log D(\mathbf{x})]+E_{\mathbf{x} \sim p_{G}}[\log (1-D(\mathbf{x}))]$
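      • Plugging in the optimal discriminator, we get $V(G, D_{G}^{*}) = 2 D_{JS}(p_{\text{data}} \| p_{G}) - \log 4$, so the generator minimizes the Jensen-Shannon divergence between $p_{\text{data}}$ and $p_{G}$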
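The optimal discriminator and the Jensen-Shannon form $2 D_{JS}(p_{\text{data}} \| p_{G}) - \log 4$ of the objective can be checked numerically on a toy problem. The sketch assumes discrete distributions over three outcomes so that the expectations become finite sums; the probability values and function names are made up for illustration.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)), evaluated per outcome."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_function(p_data, p_g, d):
    """V(G, D) = E_{x~p_data}[log D(x)] + E_{x~p_G}[log(1 - D(x))]."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.1, 0.4, 0.5]  # toy data distribution (assumption)
p_g = [0.3, 0.3, 0.4]     # toy generator distribution (assumption)

d_star = optimal_discriminator(p_data, p_g)
v_star = value_function(p_data, p_g, d_star)
v_uninformed = value_function(p_data, p_g, [0.5, 0.5, 0.5])  # D always outputs 0.5
print(v_star, 2 * jsd(p_data, p_g) - math.log(4), v_uninformed)
```

The uninformed discriminator scores exactly $-\log 4$, which is also the value reached at $D_{G}^{*}$ when $p_{G} = p_{\text{data}}$ and the divergence is zero.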
  • DCGAN
  • Info-GAN
  • Text2Image
  • Puzzle-GAN
  • CycleGAN
    • Cycle-consistency loss
  • Star-GAN
  • Progressive-GAN


Peer Session Summary

  • Special peer session
  • Wrote the team retrospective

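Reflections

This was a week in which, beyond the NLP I had studied before, I got to study advanced CNN models, GANs, and more.

Two weeks have already passed. I still have not acted on my goal of entering competitions and contests... I think I should find one as soon as possible and treat simply taking part as the point.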
