The Basics: Generative Adversarial Networks (GANs)

taejinjeong · June 28, 2021

The Concepts of GANs

Discriminative Model vs Generative Model


Comparison of the Discriminative Model and the Generative Model.
A discriminative model tries to draw a boundary in the data space that separates the classes.
In contrast, a generative model tries to produce data points that fall close to their real counterparts in the data space.


The Basic Idea of Generative Adversarial Networks (GANs)

A Generative Adversarial Network (GAN) consists of two parts: a Generator and a Discriminator.

The generator tries to create samples that appear to come from the same distribution as the training data, while the discriminator examines samples to determine whether they are real (coming from the training dataset) or fake (coming from the generator), as shown above.
Since the two networks pursue opposing goals without controlling each other's actions (the generator deceiving the discriminator, the discriminator detecting the generator's output), this framework can be interpreted through game theory as a two-player minimax game whose solution is a Nash equilibrium.


Game Theory: Minimax Game

Minimax is a decision rule for minimizing the possible loss in a worst-case scenario.
The solution of the game is to maximize the minimum gain, which is referred to as the maximin value.
The maximin value is the highest value that a player can be sure to get without knowing the actions of the other players, defined as below.

$$v_i = \max_{a_i}\min_{a_{-i}} v_i(a_i, a_{-i})$$

where:

  • $i$ is the index of the player of interest
  • $-i$ denotes all other players except player $i$
  • $a_i$ is the action taken by player $i$
  • $a_{-i}$ denotes the actions taken by all other players
  • $v_i$ is the value function of player $i$

The solution to the game is a Nash equilibrium: a tuple $(\theta^D, \theta^G)$ that is a local minimum of $J^D$ with respect to $\theta^D$ and a local minimum of $J^G$ with respect to $\theta^G$.
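To make the maximin definition above concrete, here is a small illustration that is not from the original post: it computes the maximin value for a made-up 2x2 zero-sum payoff matrix (the numbers are arbitrary).

```python
import numpy as np

# Hypothetical 2x2 zero-sum payoff matrix for the row player:
# rows = the player's actions a_i, columns = the opponent's actions a_{-i}.
payoff = np.array([[3, -1],
                   [0,  2]])

# Maximin: for each of our actions, assume the opponent picks the response
# that hurts us most (inner min), then choose the action whose worst case
# is best (outer max).
worst_case = payoff.min(axis=1)      # min over a_{-i} for each a_i
maximin_value = worst_case.max()     # max over a_i

print(worst_case)       # [-1  0]
print(maximin_value)    # 0 -> the row player can guarantee at least 0
```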


The Theory of GANs

The GAN architecture consists of two networks, a generator $G$ and a discriminator $D$, each of which is differentiable both with respect to its input ($z$ for $G$ and $x$ for $D$) and with respect to its parameters ($\theta^G$ for $G$ and $\theta^D$ for $D$).


Generator


The generator is simply a differentiable function $G$ whose input $z$ is a latent variable with a simple prior distribution.
When $z$ is sampled from that prior distribution, $G(z)$ yields a sample of $x$ drawn from $p_{model}$.
The inputs to the function $G$ do not need to correspond to inputs to the first layer of the deep neural net; inputs may be provided at any point throughout the network.
If we want $p_{model}$ to have full support on $x$ space, the dimension of $z$ must be at least as large as the dimension of $x$, and $G$ must be differentiable, but those are the only requirements.
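A minimal PyTorch sketch of such a generator might look like the following; the layer sizes, activations, and dimensions are arbitrary assumptions, not the original post's architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Differentiable map G: z -> x. Sizes are illustrative assumptions."""
    def __init__(self, z_dim=100, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, x_dim),
            nn.Tanh(),              # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# Sample z from a simple prior (standard normal) and map it through G.
z = torch.randn(16, 100)
G = Generator()
fake_x = G(z)          # a batch of samples from p_model
print(fake_x.shape)    # torch.Size([16, 784])
```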


Discriminator


The discriminator is simply a differentiable classifier function $D$.
Its purpose is to distinguish real data from data created by the generator (ideally $D(x)=1$ and $D(G(z))=0$).
The inputs to the function $D$ come from two sources: examples of $x$ randomly sampled from the training set, and data created by the generator.
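A matching discriminator sketch in the same spirit, again with arbitrary layer sizes that are my assumption rather than the original post's architecture:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Differentiable classifier D: x -> probability that x is real."""
    def __init__(self, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),           # D(x) in (0, 1)
        )

    def forward(self, x):
        return self.net(x)

D = Discriminator()
real_x = torch.randn(16, 784)       # stand-in for a real minibatch
print(D(real_x).shape)              # torch.Size([16, 1])
```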


Cost Functions

To build a cost function for GANs, it is essential to understand maximum likelihood estimation and the minimax problem.


Maximum likelihood estimation

The basic idea of maximum likelihood is to define a model that provides an estimate of a probability distribution, parameterized by parameters $\theta$.
The likelihood is the probability that the model assigns to the training data, written as below.

$\prod_{i=1}^{m} p_{model}(x^i;\theta)$, for a dataset containing $m$ training examples $x^i$

The principle of maximum likelihood simply says to choose the parameters for the model that maximize the likelihood of the training data.

$$\begin{aligned}\theta^* &= \arg\max_\theta \prod_{i=1}^m p_{model}(x^i;\theta)\\ &= \arg\max_\theta \log \prod_{i=1}^m p_{model}(x^i;\theta)\\ &= \arg\max_\theta \sum_{i=1}^m \log p_{model}(x^i;\theta)\end{aligned}$$

Maximum likelihood estimation can also be interpreted as minimizing the Kullback-Leibler divergence (KL divergence) between the data-generating distribution and the model, as below.

$$\theta^* = \arg\min_\theta D_{KL}(p_{data}(x)\,\|\,p_{model}(x;\theta))$$
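To make the argmax above concrete, here is a small sketch of my own (not from the original post), using a unit-variance Gaussian as $p_{model}$: a grid search over $\theta$ for the maximizer of the summed log-likelihood recovers the familiar closed-form MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # m training examples x^i

def log_likelihood(theta, x, sigma=1.0):
    """Sum of log p_model(x^i; theta) for a unit-variance Gaussian model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - theta)**2 / (2 * sigma**2))

# Grid search over theta; the maximizer matches the sample mean,
# the closed-form MLE for this model.
thetas = np.linspace(0.0, 4.0, 401)
best = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(best, data.mean())   # both close to 2.0
```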


Kullback-Leibler Divergence (KL Divergence)

In mathematical statistics, the Kullback–Leibler divergence, $D_{KL}$ (also called relative entropy), is a measure of how one probability distribution differs from a second, reference probability distribution.
Consider two probability distributions $P$ and $Q$: usually $P$ represents the data, the observations, or a measured probability distribution, while $Q$ represents a theory, a model, a description, or an approximation of $P$.
The Kullback–Leibler divergence can then be interpreted as the average difference in the number of bits required to encode samples of $P$ using a code optimized for $Q$ rather than one optimized for $P$.
For probability distributions $P$ and $Q$ defined on the same probability space $X$, the relative entropy or KL divergence from $Q$ to $P$ is defined as below.

$$D_{KL}(P\|Q)=\sum_{x\in X}P(x)\log\left(\frac{P(x)}{Q(x)}\right)\quad \text{for discrete } x$$
$$D_{KL}(P\|Q)=\int_x p(x)\log\left(\frac{p(x)}{q(x)}\right)dx\quad \text{for continuous } x$$

In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken using the probabilities $P$.
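A small numerical sketch of the discrete formula, using made-up distributions, also shows the well-known asymmetry of the KL divergence:

```python
import numpy as np

# Two discrete distributions on the same support (illustrative values).
P = np.array([0.5, 0.3, 0.2])   # "data" distribution
Q = np.array([0.4, 0.4, 0.2])   # "model" distribution

kl_pq = np.sum(P * np.log(P / Q))   # D_KL(P || Q)
kl_qp = np.sum(Q * np.log(Q / P))   # D_KL(Q || P)

print(kl_pq, kl_qp)   # note the asymmetry: D_KL(P||Q) != D_KL(Q||P)
```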


Jensen-Shannon Divergence

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions.
It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it always has a finite value.

$$JSD(P\|Q)=\frac{1}{2}D_{KL}(P\|M)+\frac{1}{2}D_{KL}(Q\|M), \quad \text{where } M=\frac{1}{2}(P+Q)$$
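A corresponding sketch of the JSD, with the same made-up distributions as before, illustrating that it is symmetric:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

print(jsd(P, Q), jsd(Q, P))   # equal: the JSD is symmetric and always finite
```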


Objective Function (Minimax Loss)

Since a GAN consists of two players, the generator and the discriminator, each pursuing its goal without knowing the other's actions, the objective function can be set up as a minimax game as below.
$$\min_G\max_D V(D,G) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_{z}}[\log(1-D(G(z)))]$$
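As a rough sketch of how this value function turns into per-network losses in code (placeholder discriminator outputs, and the literal minimax form rather than the non-saturating trick commonly used in practice):

```python
import torch

def discriminator_value(d_real, d_fake):
    """The value V(D,G) that D tries to maximize:
    E[log D(x)] + E[log(1 - D(G(z)))], estimated on a minibatch."""
    return torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean()

# Placeholder discriminator outputs in (0, 1) on real and generated samples.
d_real = torch.tensor([0.9, 0.8])
d_fake = torch.tensor([0.2, 0.3])

# Each network minimizes a loss: D minimizes -V, while G minimizes
# E[log(1 - D(G(z)))], the only term of V it can influence.
loss_D = -discriminator_value(d_real, d_fake)
loss_G = torch.log(1.0 - d_fake).mean()
print(loss_D.item(), loss_G.item())
```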


Objective Function in the Perspective of Discriminator

First, consider the optimal discriminator $D$ for any given generator $G$.
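For a fixed $G$, the value function can be written as an integral over $x$, and maximizing the integrand pointwise yields the optimal discriminator (Proposition 1 of the original GAN paper):

$$V(G,D) = \int_x p_{data}(x)\log(D(x)) + p_g(x)\log(1-D(x))\,dx$$
$$D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x)+p_g(x)}$$

since, for any $(a,b)\neq(0,0)$, the function $y\mapsto a\log(y)+b\log(1-y)$ attains its maximum on $[0,1]$ at $\frac{a}{a+b}$.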


Objective Function in the Perspective of Generator

Since the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y=y|x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y=1$) or from $p_g$ (with $y=0$), the cost function $C(G)$ can be reformulated as below. With this cost function $C(G)$, the global minimum is achieved when $p_g = p_{data}$.


Detailed information for the proof of Theorem 1:
The value function $V(G,D)$ with the optimal discriminator $D^{*}_{G}(x)$ can be interpreted as the cost function $C(G)$ for the generator, obtained by fixing $D$ at its optimum for the given $G$.
$$\begin{aligned}
C(G) &= \max_D V(G,D) = V(G,D^{*}_{G})\\
&= \int_x p_{data}(x)\log(D^{*}_{G}(x)) + p_g(x)\log(1-D^{*}_{G}(x))\,dx\\
&= \int_x p_{data}(x)\log\left(\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right) + p_g(x)\log\left(\frac{p_g(x)}{p_g(x)+p_{data}(x)}\right)dx\\
&= -\log(4) + \int_x p_{data}(x)\log\left(\frac{p_{data}(x)}{p_{data}(x)+p_g(x)}\right)dx + \log(2)\\
&\qquad\qquad + \int_x p_g(x)\log\left(\frac{p_g(x)}{p_g(x)+p_{data}(x)}\right)dx + \log(2)\\
&= -\log(4) + \int_x p_{data}(x)\log\left(\frac{2p_{data}(x)}{p_{data}(x)+p_g(x)}\right) + p_g(x)\log\left(\frac{2p_g(x)}{p_g(x)+p_{data}(x)}\right)dx\\
&= -\log(4) + \int_x p_{data}(x)\left(\log(p_{data}(x)) - \log\left(\frac{p_{data}(x)+p_g(x)}{2}\right)\right)\\
&\qquad\qquad + p_g(x)\left(\log(p_g(x)) - \log\left(\frac{p_{data}(x)+p_g(x)}{2}\right)\right)dx\\
&= -\log(4) + KL\left(p_{data}\,\Big\|\,\frac{p_{data}+p_g}{2}\right) + KL\left(p_g\,\Big\|\,\frac{p_{data}+p_g}{2}\right)\\
&= -\log(4) + 2\cdot JSD(p_{data}\|p_g)
\end{aligned}$$

Thus, $C(G)$ achieves its global minimum at $p_g = p_{data}$, where it takes the value $-\log(4)$, since the JSD term is nonnegative and vanishes only when $p_g = p_{data}$.
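A quick numerical check of this result, using made-up discrete distributions and the same JSD helper as earlier (an illustration of mine, not part of the original proof):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence for discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.5, 0.3, 0.2])
p_g_far = np.array([0.2, 0.3, 0.5])    # a mismatched generator distribution
p_g_opt = p_data.copy()                # the global optimum p_g = p_data

C = lambda p_g: -np.log(4) + 2 * jsd(p_data, p_g)
print(C(p_g_far))   # strictly greater than -log(4)
print(C(p_g_opt))   # exactly -log(4), about -1.386
```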


Training Process and Convergence of Algorithm



An overview of the training procedure (Algorithm 1 in the original GAN paper) is given below.
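Since the algorithm itself is not reproduced here, the following is a minimal PyTorch sketch of that training loop under my own assumptions: it reuses the Generator and Discriminator sketches above, uses random tensors as a stand-in for real data, and uses the common non-saturating generator loss rather than the literal minimax term.

```python
import torch

# Reuses the Generator / Discriminator sketches above; all hyperparameters
# and the "dataset" below are placeholders.
z_dim, x_dim, batch = 100, 784, 64
G, D = Generator(z_dim=z_dim, x_dim=x_dim), Discriminator(x_dim=x_dim)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = torch.nn.BCELoss()

fake_dataset = torch.randn(640, x_dim)                  # stand-in for real data
loader = torch.utils.data.DataLoader(fake_dataset, batch_size=batch)

for epoch in range(5):
    for real_x in loader:
        n = real_x.size(0)
        ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

        # Discriminator step: push D(x) -> 1 and D(G(z)) -> 0.
        fake_x = G(torch.randn(n, z_dim)).detach()      # no gradient into G here
        loss_D = bce(D(real_x), ones) + bce(D(fake_x), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator step: maximize log D(G(z)) (non-saturating heuristic).
        loss_G = bce(D(G(torch.randn(n, z_dim))), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```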


References

Generative Adversarial Nets (Goodfellow et al., 2014)
Generative Adversarial Networks Tutorial by Ian Goodfellow
Google Developers: GAN
Wikipedia: Minimax
Wikipedia: Kullback-Leibler Divergence
