Generative Adversarial Networks

d9249 · April 21, 2022


Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.

The training procedure for G is to maximize the probability of D making a mistake.

This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere.

In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation.

There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.

Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1. Introduction

The promise of deep learning is to discover rich, hierarchical models that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora.

So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label.

These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units which have a particularly well-behaved gradient.

Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context.

We propose a new generative model estimation procedure that sidesteps these difficulties.

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution.

The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency.

Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of model and optimization algorithm.

In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron.

We refer to this special case as adversarial nets.

In this case, we can train both models using only the highly successful backpropagation and dropout algorithms and sample from the generative model using only forward propagation.

No approximate inference or Markov chains are necessary.

2. Related Work

An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBMs) and their numerous variants.

The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables.

This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods.

Mixing poses a significant problem for learning algorithms that rely on MCMC.

Deep belief networks (DBNs) are hybrid models containing a single undirected layer and several directed layers.

While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching and noise-contrastive estimation (NCE).

Both of these require the learned probability density to be analytically specified up to a normalization constant.

Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density.

Some models such as denoising auto-encoders and contractive autoencoders have learning rules very similar to score matching applied to RBMs.

In NCE, as in this work, a discriminative training criterion is employed to fit a generative model.

However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution.

Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution.

This approach has the advantage that such machines can be designed to be trained by back-propagation.

Prominent recent work in this area includes the generative stochastic network (GSN) framework, which extends generalized denoising auto-encoders: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain.

Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling.

Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units, which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop.

More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes and stochastic backpropagation.

3. Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons.

To learn the generator’s distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$.

We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar.
$D(x)$ represents the probability that $x$ came from the data rather than $p_g$.
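
As a rough illustration of these two definitions, the following sketch builds a tiny generator and discriminator as one-hidden-layer perceptrons in plain Python. The layer sizes, the uniform noise prior, the ReLU hidden units, and all helper names (`init_layer`, `dense`, `generator`, `discriminator`) are assumptions made for the example and are not taken from the paper.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(layer, inputs):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator(theta_g, z):
    """G(z; theta_g): maps a noise vector to a point in data space."""
    hidden = relu(dense(theta_g[0], z))
    return dense(theta_g[1], hidden)

def discriminator(theta_d, x):
    """D(x; theta_d): a single scalar in (0, 1), read as P(x came from the data)."""
    hidden = relu(dense(theta_d[0], x))
    return sigmoid(dense(theta_d[1], hidden)[0])

rng = random.Random(0)
NOISE_DIM, DATA_DIM, HIDDEN = 2, 3, 8  # hypothetical sizes
theta_g = [init_layer(NOISE_DIM, HIDDEN, rng), init_layer(HIDDEN, DATA_DIM, rng)]
theta_d = [init_layer(DATA_DIM, HIDDEN, rng), init_layer(HIDDEN, 1, rng)]

z = [rng.uniform(-1.0, 1.0) for _ in range(NOISE_DIM)]  # z ~ p_z (assumed uniform)
x_fake = generator(theta_g, z)
d_out = discriminator(theta_d, x_fake)
print(x_fake, d_out)
```

Sampling from the generator needs only the forward pass shown here; no parameters have been trained in this sketch.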

We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$.

We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$.
In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G, D)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
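
To make the value function concrete, here is a small Monte Carlo estimate of it on a one-dimensional toy problem. The two Gaussians standing in for the data and for the generator's samples, and the two hand-picked discriminators, are assumptions chosen only for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def value_estimate(d, real_samples, fake_samples):
    """Monte Carlo estimate of E_data[log D(x)] + E_g[log(1 - D(x))]."""
    real_term = sum(math.log(d(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - d(x)) for x in fake_samples) / len(fake_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(2.0, 1.0) for _ in range(2000)]   # stand-in for x ~ p_data
fake = [rng.gauss(-2.0, 1.0) for _ in range(2000)]  # stand-in for G(z), z ~ p_z

# A discriminator that always answers 1/2 cannot tell the two apart.
v_blind = value_estimate(lambda x: 0.5, real, fake)
# A discriminator that separates the two Gaussians scores a higher value.
v_sharp = value_estimate(lambda x: sigmoid(4.0 * x), real, fake)
print(v_blind, v_sharp)
```

The discriminator tries to push this quantity up, while the generator tries to move its samples so that the best achievable value goes down.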

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit.

See Figure 1 for a less formal, more pedagogical explanation of the approach.

In practice, we must implement the game using an iterative, numerical approach.

Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting.

Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$.

This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough.

This strategy is analogous to the way that SML/PCD training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning.

The procedure is formally presented in Algorithm 1.

In practice, equation 1 may not provide sufficient gradient for $G$ to learn well.

Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data.

In this case, $\log(1 - D(G(z)))$ saturates.

Rather than training $G$ to minimize $\log(1 - D(G(z)))$ we can train $G$ to maximize $\log D(G(z))$.

This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
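
The saturation can be seen numerically. The sketch below assumes the discriminator output is a sigmoid of a pre-activation `a` (an assumption for illustration; the function names are hypothetical) and compares the gradient of each objective with respect to `a`.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_one_minus_d(a):
    """log(1 - D) with D = sigmoid(a): the term G minimizes in the minimax game."""
    return math.log(1.0 - sigmoid(a))

def log_d(a):
    """log D with D = sigmoid(a): the term G maximizes in the alternative objective."""
    return math.log(sigmoid(a))

def grad_log_one_minus_d(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_log_d(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(a)

# a = -6 corresponds to D(G(z)) close to 0: the discriminator confidently rejects.
for a in (-6.0, 0.0, 6.0):
    print(a, grad_log_one_minus_d(a), grad_log_d(a))
```

When the sample is confidently rejected, the gradient of $\log(1 - D)$ is close to zero, while the gradient of $\log D$ stays close to one, which is why the second objective gives the generator a stronger signal early on.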


Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_x$ and those of the generative distribution $p_g$ ($G$) (green, solid line).

The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly.

The horizontal line above is part of the domain of $x$.

The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples.

$G$ contracts in regions of high density and expands in regions of low density of $p_g$.

(a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{data}$ and $D$ is a partially accurate classifier.

(b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$.

(c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data.

(d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_g = p_{data}$.
The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.

4. Theoretical Results

The generator $G$ implicitly defines a probability distribution $p_g$ as the distribution of the samples $G(z)$ obtained when $z \sim p_z$.

Therefore, we would like Algorithm 1 to converge to a good estimator of $p_{data}$, if given enough capacity and training time.

The results of this section are done in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for $p_g = p_{data}$.

We will then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.

Algorithm 1

Minibatch stochastic gradient descent training of generative adversarial nets.
The number of steps to apply to the discriminator, $k$, is a hyperparameter.
We used $k = 1$, the least expensive option, in our experiments.
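
The alternating procedure described in Section 3 ($k$ discriminator steps, then one generator step) can be sketched on a toy problem as follows. This is a minimal illustration under stated assumptions, not the setup used in the experiments: the data is a one-dimensional Gaussian, the generator is a learned shift `G(z) = z + b`, the discriminator is a logistic unit `D(x) = sigmoid(w * x + c)`, and the gradients are written out by hand. The learning rate, batch size, and the `train` helper are hypothetical.

```python
import math
import random

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def train(steps, k=1, m=64, lr=0.05, data_mean=4.0, seed=0):
    """Alternate k discriminator ascent steps with one generator descent step."""
    rng = random.Random(seed)
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    b = 0.0          # generator parameter: G(z) = z + b
    for _ in range(steps):
        for _ in range(k):
            zs = [rng.gauss(0.0, 1.0) for _ in range(m)]        # noise minibatch
            xs = [rng.gauss(data_mean, 1.0) for _ in range(m)]  # data minibatch
            gs = [z + b for z in zs]
            # Gradient of (1/m) * sum[log D(x) + log(1 - D(G(z)))] w.r.t. w and c.
            grad_w = (sum((1.0 - sigmoid(w * x + c)) * x for x in xs)
                      - sum(sigmoid(w * g + c) * g for g in gs)) / m
            grad_c = (sum(1.0 - sigmoid(w * x + c) for x in xs)
                      - sum(sigmoid(w * g + c) for g in gs)) / m
            w += lr * grad_w  # the discriminator ascends its gradient
            c += lr * grad_c
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        gs = [z + b for z in zs]
        # Gradient of (1/m) * sum[log(1 - D(G(z)))] w.r.t. b, using dG/db = 1.
        grad_b = sum(-sigmoid(w * g + c) * w for g in gs) / m
        b -= lr * grad_b      # the generator descends its gradient
    return w, c, b

w, c, b = train(steps=2000)
print(w, c, b)
```

With `k=1` each outer iteration costs one discriminator update and one generator update, matching the choice reported above. In this toy setting the shift `b` is pulled toward the data mean, but no convergence guarantee is implied by the sketch.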

4.1 Global Optimality of $p_g = p_{data}$

We first consider the optimal discriminator $D$ for any given generator $G$.

Proposition 1. For $G$ fixed, the optimal discriminator $D$ is

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \tag{2}$$

Proof. The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G, D)$

$$
\begin{aligned}
V(G, D) &= \int_x p_{data}(x) \log(D(x)) \, dx + \int_z p_z(z) \log(1 - D(G(z))) \, dz \\
&= \int_x \left[ p_{data}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx
\end{aligned}
\tag{3}
$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$, the function $y \to a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a+b}$.
Applying this pointwise to the integrand with $a = p_{data}(x)$ and $b = p_g(x)$ gives the stated $D^*_G(x)$.
The discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_g)$, concluding the proof.
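
The closed-form maximizer used in the proof can be sanity-checked with a simple grid search. The values of `a` and `b` below are arbitrary placeholders for the two densities at a single point, and `grid_argmax` is a hypothetical helper.

```python
import math

def objective(y, a, b):
    """The pointwise integrand a*log(y) + b*log(1 - y)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def grid_argmax(a, b, n=10000):
    """Return the grid point y = i/n in (0, 1) that maximizes the objective."""
    candidates = [i / n for i in range(1, n)]
    return max(candidates, key=lambda y: objective(y, a, b))

a, b = 0.3, 0.1  # e.g. the data density and the generator density at some x
best_y = grid_argmax(a, b)
closed_form = a / (a + b)
print(best_y, closed_form)
```

The grid maximizer agrees with $\frac{a}{a+b}$ up to the grid resolution, which is the value the optimal discriminator takes at that point.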

Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y = 1$) or from $p_g$ (with $y = 0$).
The minimax game in Eq. 1 can now be reformulated as:

$$
\begin{aligned}
C(G) &= \max_D V(G, D) \\
&= \mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_G(G(z)))] \\
&= \mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))] \\
&= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]
\end{aligned}
\tag{4}
$$

Theorem 1. The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_g = p_{data}$.
At that point, $C(G)$ achieves the value $-\log 4$.

Proof. For $p_g = p_{data}$, $D^*_G(x) = \frac{1}{2}$ (consider Eq. 2).
Hence, by inspecting Eq. 4 at $D^*_G(x) = \frac{1}{2}$, we find $C(G) = \log \frac{1}{2} + \log \frac{1}{2} = -\log 4$.
To see that this is the best possible value of $C(G)$, reached only for $p_g = p_{data}$, observe that

$$\mathbb{E}_{x \sim p_{data}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4$$

and that by subtracting this expression from $C(G) = V(D^*_G, G)$, we obtain:

$$C(G) = -\log(4) + KL\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right) + KL\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right) \tag{5}$$
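
The step from Eq. 4 to this expression can be spelled out by multiplying and dividing each denominator by 2. The following is a worked version of that algebra, using only the definitions already given:

```latex
\begin{aligned}
C(G) &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right]
      + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \\
     &= \mathbb{E}_{x \sim p_{data}}\left[-\log 2 + \log \frac{p_{data}(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right]
      + \mathbb{E}_{x \sim p_g}\left[-\log 2 + \log \frac{p_g(x)}{\frac{p_{data}(x) + p_g(x)}{2}}\right] \\
     &= -\log 4
      + KL\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right)
      + KL\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right)
\end{aligned}
```

The last line uses the definition $KL(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$.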

where KL is the Kullback–Leibler divergence.
We recognize in the previous expression the Jensen–Shannon divergence between the model’s distribution and the data generating process:

$$C(G) = -\log(4) + 2 \cdot JSD(p_{data} \,\|\, p_g) \tag{6}$$

Since the Jensen–Shannon divergence between two distributions is always non-negative and zero only when they are equal, we have shown that $C^* = -\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_g = p_{data}$, i.e., the generative model perfectly replicating the data generating process.
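
The identity between $C(G)$ and the Jensen–Shannon divergence can be checked numerically on small discrete distributions. The two probability vectors below are arbitrary examples, and the helper names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def c_of_g(p_data, p_g):
    """C(G) evaluated with the optimal discriminator p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

print(c_of_g(p_data, p_g), -math.log(4.0) + 2.0 * jsd(p_data, p_g))
print(c_of_g(p_data, p_data), -math.log(4.0))
```

Both printed pairs agree up to floating-point error: a mismatched generator gives a value above $-\log 4$, and a perfect one attains it.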

4.2 Convergence of Algorithm 1

Proposition 2. If $G$ and $D$ have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve the criterion

$$\mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]$$

then $p_g$ converges to $p_{data}$.

Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$ as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$.

The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained.

In other words, if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$.

This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$.

$\sup_D U(p_g, D)$ is convex in $p_g$ with a unique global optimum as proven in Thm 1, therefore with sufficiently small updates of $p_g$, $p_g$ converges to $p_x$, concluding the proof.

In practice, adversarial nets represent a limited family of $p_g$ distributions via the function $G(z; \theta_g)$, and we optimize $\theta_g$ rather than $p_g$ itself.

Using a multilayer perceptron to define $G$ introduces multiple critical points in parameter space.

However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

5. Experiments

We trained adversarial nets on a range of datasets including MNIST, the Toronto Face Database (TFD), and CIFAR-10.
The generator nets used a mixture of rectifier linear activations and sigmoid activations, while the discriminator net used maxout activations.
Dropout was applied in training the discriminator net.
While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.

We estimate the probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution.

| Model | MNIST | TFD |
| --- | --- | --- |
| DBN | 138 ± 2 | 1909 ± 66 |
| Stacked CAE | 121 ± 1.6 | 2110 ± 50 |
| Deep GSN | 214 ± 1.1 | 1890 ± 29 |
| Adversarial | 225 ± 2 | 2057 ± 26 |

Table 1: Parzen window-based log-likelihood estimates.
The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples.
On TFD, we computed the standard error across folds of the dataset, with a different $\sigma$ chosen using the validation set of each fold.
On TFD, $\sigma$ was cross validated on each fold and the mean log-likelihood on each fold was computed.
For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.

The $\sigma$ parameter of the Gaussians was obtained by cross validation on the validation set.
This procedure was introduced in Breuleux et al. and used for various generative models for which the exact likelihood is not tractable.
Results are reported in Table 1.
This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge.
Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
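
The Parzen window estimate described above can be sketched in a few lines. This version assumes an isotropic Gaussian kernel with a single bandwidth `sigma` and represents points as plain Python lists; the function names and the toy one-dimensional samples are hypothetical and stand in for generated images and test images.

```python
import math
import random

def parzen_log_likelihood(x, samples, sigma):
    """Log-density of x under an equal-weight mixture of Gaussians centred on samples."""
    d = len(x)
    exponents = [-sum((xi - si) ** 2 for xi, si in zip(x, s)) / (2.0 * sigma ** 2)
                 for s in samples]
    top = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    log_norm = math.log(len(samples)) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return log_sum - log_norm

def mean_log_likelihood(test_points, samples, sigma):
    """Average Parzen log-likelihood over a test set."""
    return sum(parzen_log_likelihood(x, samples, sigma) for x in test_points) / len(test_points)

rng = random.Random(0)
generated = [[rng.gauss(0.0, 1.0)] for _ in range(500)]  # stand-in for samples from G
test_set = [[rng.gauss(0.0, 1.0)] for _ in range(100)]   # stand-in for held-out data

for sigma in (0.05, 0.3, 1.0):  # in practice sigma would be chosen on a validation set
    print(sigma, mean_log_likelihood(test_set, generated, sigma))
```

The result depends noticeably on `sigma`, which is one reason the bandwidth is selected by cross validation.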

In Figures 2 and 3 we show samples drawn from the generator net after training.
While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.

Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set.
Samples are fair random draws, not cherry-picked.
Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units.
Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing.
a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator)


Figure 3: Digits obtained by linearly interpolating between coordinates in $z$ space of the full model.
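
Linear interpolation in $z$ space, as used for Figure 3, amounts to the following. The sketch only produces the interpolated noise vectors; each one would then be passed through a trained generator, which is not shown. The `interpolate` helper and the example vectors are hypothetical.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` noise vectors evenly spaced on the segment from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

z_a = [0.0, 2.0]
z_b = [1.0, -2.0]
path = interpolate(z_a, z_b, steps=3)
print(path)
```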

| | Deep directed graphical models | Deep undirected graphical models | Generative autoencoders | Adversarial models |
| --- | --- | --- | --- | --- |
| Training | Inference needed during training. | Inference needed during training. MCMC needed to approximate partition function gradient. | Enforced tradeoff between mixing and power of reconstruction generation | Synchronizing the discriminator with the generator. Helvetica. |
| Inference | Learned approximate inference | Variational inference | MCMC-based inference | Learned approximate inference |
| Sampling | No difficulties | Requires Markov chain | Requires Markov chain | No difficulties |
| Evaluating $p(x)$ | Intractable, may be approximated with AIS | Intractable, may be approximated with AIS | Not explicitly represented, may be approximated with Parzen density estimation | Not explicitly represented, may be approximated with Parzen density estimation |
| Model design | Nearly all models incur extreme difficulty | Careful design needed to ensure multiple properties | Any differentiable function is theoretically permitted | Any differentiable function is theoretically permitted |

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.

6. Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks.
The disadvantages are primarily that there is no explicit representation of $p_g(x)$, and that $D$ must be synchronized well with $G$ during training (in particular, $G$ must not be trained too much without updating $D$, in order to avoid “the Helvetica scenario” in which $G$ collapses too many values of $z$ to the same value of $x$ to have enough diversity to model $p_{data}$), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps.
The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model.
Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.

The aforementioned advantages are primarily computational.
Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator.
This means that components of the input are not copied directly into the generator’s parameters.
Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.

7. Conclusions and future work

This framework admits many straightforward extensions:

  1. A conditional generative model $p(x \mid c)$ can be obtained by adding $c$ as input to both $G$ and $D$.
  2. Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$.
    This is similar to the inference net trained by the wake-sleep algorithm but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
  3. One can approximately model all conditionals $p(x_S \mid x_{\not S})$ where $S$ is a subset of the indices of $x$ by training a family of conditional models that share parameters.
    Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM.
  4. Semi-supervised learning: features from the discriminator or inference net could improve performance of classifiers when limited labeled data is available.
  5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$ or determining better distributions to sample $z$ from during training.
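
For the first extension in the list, one simple way to add $c$ as input to both networks is to append it to the existing input vector. The text does not say how $c$ should be fed in, so the concatenation, the one-hot encoding of a class label, and the helper names below are all assumptions made for illustration.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot condition vector c."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, c):
    """Input to a conditional G: the noise z with the condition c appended."""
    return list(z) + list(c)

def discriminator_input(x, c):
    """Input to a conditional D: the sample x with the same condition c appended."""
    return list(x) + list(c)

c = one_hot(3, num_classes=10)
g_in = generator_input([0.1, -0.7], c)       # fed to G in place of z alone
d_in = discriminator_input([0.0] * 784, c)   # fed to D in place of x alone
print(len(g_in), len(d_in))
```

The first layers of both networks would then need to accept the longer input vectors.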