Blog of 小言_互联网

[Paper Translation] Transferring GANs: generating images from limited data

Abstract.

Transferring knowledge of pre-trained networks to new domains by means of fine-tuning is a widely used practice for applications based on discriminative models. To the best of our knowledge this practice has not been studied within the context of generative deep networks. Therefore, we study domain adaptation applied to image generation with generative adversarial networks. We evaluate several aspects of domain adaptation, including the impact of target domain size, the relative distance between source and target domain, and the initialization of conditional GANs. Our results show that using knowledge from pre-trained networks can shorten the convergence time and can significantly improve the quality of the generated images, especially when target data is limited. We show that these conclusions can also be drawn for conditional GANs even when the pre-trained model was trained without conditioning. Our results also suggest that density is more important than diversity and a dataset with one or few densely sampled classes is a better source model than more diverse datasets such as ImageNet or Places.
Keywords: Generative adversarial networks, transfer learning, domain adaptation, image generation

1 Introduction

Generative Adversarial Networks (GANs) can generate samples from complex image distributions [1]. They consist of two networks: a discriminator, which aims to separate real images from fake (or generated) images, and a generator, which is simultaneously optimized to generate images that the discriminator classifies as real. The theory was later extended to conditional GANs, where the generative process is constrained by a conditioning prior [2] provided as an additional input. GANs have since been widely applied, including to super-resolution [3], 3D object generation and reconstruction [4], human pose estimation [5], and age estimation [6].

Deep neural networks have obtained excellent results for discriminative classification problems for which large datasets exist, for example on the ImageNet dataset, which consists of over 1M images [7]. However, for many problems the amount of labeled data is not sufficient to train the millions of parameters typically present in these networks. Fortunately, it was found that the knowledge contained in a network trained on a large dataset (such as ImageNet) can easily be transferred to other computer vision tasks, either by using these networks as off-the-shelf feature extractors [8] or by adapting them to a new domain by a process called fine tuning [9]. In the latter case, the pre-trained network is used to initialize the weights for a new task (effectively transferring the knowledge learned from the source domain), which are then fine tuned with the training images from the new domain. It has been shown that far fewer images are required to train networks that were initialized with a pre-trained network.

GANs are in general trained from scratch. The procedure of using a pre-trained network for initialization, which is very popular for discriminative networks, is to the best of our knowledge not used for GANs. However, as in the case of discriminative networks, the number of parameters in a GAN is vast; for example, the popular DC-GAN architecture [10] requires 36M parameters to generate a 64x64 image. Especially for domains that lack many training images, the use of pre-trained GANs could significantly improve the quality of the generated images.

Therefore, in this paper, we set out to evaluate the use of pre-trained networks for GANs. The paper makes the following contributions:

  1. We evaluate several transfer configurations, and show that pre-trained networks can effectively accelerate the learning process and provide useful prior knowledge when data is limited.
  2. We study how the relation between source and target domains impacts the results, and discuss the problem of choosing a suitable pre-trained model, which seems more difficult than in the case of discriminative tasks.
  3. We evaluate the transfer from unconditional GANs to conditional GANs for two commonly used methods to condition GANs.

2 Related Work

Transfer learning/domain transfer: Learning how to transfer knowledge from a source domain to a target domain is a well-studied problem in computer vision [11]. In the deep learning era, complex knowledge is extracted during the training stage on large datasets [12,13]. Domain adaptation by means of fine tuning a pre-trained network has become the default approach for many applications with limited training data or slow convergence [14,9].

Several works have investigated transferring knowledge to unsupervised or sparsely labeled domains. Tzeng et al. [15] optimized for domain invariance while transferring task information that is present in the correlation between the classes of the source domain. Ganin et al. [16] proposed to learn domain-invariant features by means of a gradient reversal layer. A network simultaneously trained on these invariant features can be transferred to the target domain. Finally, domain transfer has also been studied for networks that learn metrics [17]. In contrast to these methods, we do not focus on transferring discriminative features, but on transferring knowledge for image generation.

GAN: Goodfellow et al. [1] introduced the first GAN model for image generation. Their architecture uses a series of fully connected layers and is thus limited to simple datasets. For generating real images of higher complexity, convolutional architectures have proven to be a more suitable option. Shortly afterwards, Deep Convolutional GANs (DC-GAN) became the standard GAN architecture for image generation problems [10]. In DC-GAN, the generator sequentially up-samples the input features using fractionally-strided convolutions, whereas the discriminator uses normal convolutions to classify the input images. Recent multi-scale architectures [18,19,20] can effectively generate high resolution images. It was also found that ensembles can be used to improve the quality of the generated distribution [21].

Independently of the type of architecture used, GANs present multiple training challenges, such as convergence properties, stability issues, and mode collapse. Arjovsky et al. [22] showed that the original GAN loss [1] is unable to properly deal with ill-suited distributions such as those with disjoint supports, which are often found during GAN training. Addressing these limitations, the Wasserstein GAN [23] uses the Wasserstein distance as a robust loss, yet requires the critic (discriminator) to be 1-Lipschitz. This constraint was originally enforced by clipping the weights. Alternatively, an even more stable solution is to add a gradient penalty term to the loss (known as WGAN-GP) [24].

cGAN: Conditional GANs (cGANs) [2] are a class of GANs that use a particular attribute as a prior to build conditional generative models. Examples of conditions are class labels [25,26,27], text [28,29], and another image (image translation [30,31] and style transfer [32]).

Most cGAN models [2,29,33,34] apply their condition in both generator and discriminator by concatenating it to the input of the layers, i.e. the noise vector for the first layer or the learned features for the internal layers. Instead, in [32], the conditioning is included in the batch normalization layer. The AC-GAN framework [25] extends the discriminator with an auxiliary decoder to reconstruct class-conditional information. Similarly, InfoGAN [35] reconstructs a subset of the latent variables from which the samples were generated. Miyato et al. [36] propose another modification of the discriminator based on a projection layer that uses the inner product between the conditional information and the intermediate output to compute its loss.

3 Generative Adversarial Networks

3.1 Loss functions

A GAN consists of a generator G and a discriminator D [1]. The aim is to train a generator G which generates samples that are indistinguishable from the real data distribution. The discriminator is optimized to distinguish samples from the real data distribution $p_{data}$ from those of the fake (generated) data distribution $p_g$. The generator takes noise $z \sim p_z$ as input and generates samples $G(z)$ with a distribution $p_g$. The networks are trained with an adversarial objective: the generator is optimized to generate samples which would be classified by the discriminator as belonging to the real data distribution. The minimax game objective is given by:

$$G^* = \arg\min_G \max_D \mathcal{L}_{GAN}(G, D) \tag{1}$$

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{2}$$

In the case of WGAN-GP [24] the two loss functions are:

$$\mathcal{L}_{WGAN\text{-}GP}(D) = -\mathbb{E}_{x \sim p_{data}}[D(x)] + \mathbb{E}_{z \sim p_z}[D(G(z))] + \lambda\, \mathbb{E}_{x \sim p_{data},\, z \sim p_z,\, \alpha \sim (0,1)}\left[\left(\lVert \nabla D(\alpha x + (1 - \alpha) G(z)) \rVert_2 - 1\right)^2\right] \tag{3}$$

$$\mathcal{L}_{WGAN\text{-}GP}(G) = -\mathbb{E}_{z \sim p_z}[D(G(z))] \tag{4}$$
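To make the WGAN-GP losses concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a toy linear critic D(x) = x·w + b, chosen only because its input gradient is w everywhere and can be written analytically; a real critic would need automatic differentiation. The function name, the uniform sampling of α, and the default λ = 10 are illustrative assumptions not stated in the text above.

```python
import numpy as np

def wgan_gp_losses(w, b, real, fake, lam=10.0, rng=None):
    """Critic and generator losses in the form of Eqs. (3) and (4).

    Assumption: the critic is linear, D(x) = x @ w + b, so its gradient
    with respect to the input equals w at every interpolated point.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d_real = real @ w + b          # D(x) for real samples
    d_fake = fake @ w + b          # D(G(z)) for generated samples

    # Interpolate between real and generated samples (alpha assumed uniform).
    alpha = rng.uniform(size=(real.shape[0], 1))
    x_hat = alpha * real + (1.0 - alpha) * fake

    # Gradient of the linear critic at each x_hat is simply w.
    grad = np.broadcast_to(w, x_hat.shape)
    penalty = np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)

    loss_d = -d_real.mean() + d_fake.mean() + lam * penalty
    loss_g = -d_fake.mean()
    return float(loss_d), float(loss_g)
```

With a unit-norm `w` the penalty term vanishes, which illustrates why the penalty pushes the critic towards gradients of norm 1.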

3.2 Evaluation Metrics

Evaluating GANs is notoriously difficult [37] and there is no clear agreed reference metric yet. In general, a good metric should measure the quality and the diversity in the generated data. Likelihood has been shown to not correlate well with these requirements [37]. Better correlation with human perception has been found in the widely used Inception Score [38], but recent works have also shown its limitations [39]. In our experiments we use two recent metrics that show better correlation in recent studies [40,41]. While not perfect, we believe they are satisfactory enough to help us compare the models in our experiments.

Fréchet Inception Distance [42]: The similarity between two sets is measured as their Fréchet distance (also known as Wasserstein-2 distance) in an embedded space. The embedding is computed using a fixed convolutional network (an Inception model) up to a specific layer. The embedded data is assumed to follow a multivariate normal distribution, which is estimated by computing their mean and covariance. In particular, the FID is computed as

$$\mathrm{FID}(X_1, X_2) = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2 (\Sigma_1 \Sigma_2)^{1/2}\right) \tag{5}$$

Typically, $X_1$ is the full dataset with real images, while $X_2$ is a set of generated samples. We use FID as our primary metric, since it is efficient to compute and correlates well with human perception [42].

Independent Wasserstein (IW) critic [43]: This metric uses an independent critic $\hat{D}$ only for evaluation. This independent critic will approximate the Wasserstein distance [22] between two datasets $X_1$ and $X_2$ as

$$\mathrm{IW}(X_1, X_2) = \mathbb{E}_{x \sim X_1}[\hat{D}(x)] - \mathbb{E}_{x \sim X_2}[\hat{D}(x)] \tag{6}$$

In this case, $X_1$ is typically a validation set, used to train the independent critic. We report IW only in some experiments, due to the larger computational cost of training a network for each measurement.
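As an illustration of the two metrics, here is a small NumPy sketch under stated assumptions. It takes embeddings that have already been computed (the Inception network itself is not included), requires at least two feature dimensions, and evaluates the trace of the matrix square root through the eigenvalues of Σ1Σ2. The function names `fid` and `iw` and the `critic` callable are hypothetical.

```python
import numpy as np

def fid(feats1, feats2):
    """FID between two sets of embeddings (rows = samples).

    Assumes the features were already extracted by a fixed network and
    that there are at least two feature dimensions.
    """
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of S1 S2, which are real and non-negative for
    # positive semi-definite S1 and S2 (clipped for numerical safety).
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

def iw(critic, x1, x2):
    """IW estimate: mean critic score on x1 minus mean critic score on x2.

    `critic` is any callable mapping a batch to one score per sample; in
    the text it is a separately trained network, which is not shown here.
    """
    return float(np.mean(critic(x1)) - np.mean(critic(x2)))
```

Shifting every embedding by a constant leaves the covariances unchanged, so in that case the FID reduces to the squared distance between the means.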

4 Transferring GAN representations

4.1 GAN adaptation

To study the effect of domain transfer for GANs we use the WGAN-GP [24] architecture, which uses ResNet in both generator and discriminator. This architecture has been experimentally demonstrated to be stable and robust against mode collapse [24]. The generator consists of one fully connected layer, four residual blocks and one convolution layer, and the discriminator has the same setting. The same architecture is used for the conditional GAN.

Implementation details: We generate images of 64×64 pixels, using standard values for hyperparameters. The source models are trained with a batch size of 128 images for 50K iterations (except 10K iterations for CelebA) using Adam [44] and a learning rate of 1e-4. For fine tuning we use a batch size of 64 and a learning rate of 1e-4 (except 1e-5 for 1K target samples). Batch normalization and layer normalization are used in the generator and discriminator respectively.
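The hyperparameters listed above could be collected in a small configuration object, as in the sketch below. The class and function names are hypothetical. The number of fine-tuning iterations is left to the caller because this section does not fix it, and the `<= 1_000` threshold is one reading of "1e-5 for 1K target samples".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    learning_rate: float
    iterations: int
    image_size: int = 64      # 64x64 images
    optimizer: str = "adam"

def source_config(dataset: str) -> TrainConfig:
    """Settings for pre-training a source model."""
    # 50K iterations, except 10K for CelebA.
    iterations = 10_000 if dataset == "CelebA" else 50_000
    return TrainConfig(batch_size=128, learning_rate=1e-4, iterations=iterations)

def finetune_config(n_target_samples: int, iterations: int) -> TrainConfig:
    """Settings for fine tuning on a target dataset."""
    # Assumption: the lower learning rate applies to 1K samples or fewer.
    lr = 1e-5 if n_target_samples <= 1_000 else 1e-4
    return TrainConfig(batch_size=64, learning_rate=lr, iterations=iterations)
```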

4.2 Generator/discriminator transfer configuration

The two networks of the GAN (generator and discriminator) can be initialized with either random or pre-trained weights (from the source networks). In a first experiment we consider the four possible combinations using a GAN pre-trained with ImageNet and 100K samples of LSUN Bedrooms as target dataset. The source GAN was trained for 50K iterations. The target GAN was trained for (additional) 40K iterations.

Table 1 shows the results. Interestingly, we found that transferring the discriminator is more critical than transferring the generator. The former helps to improve the results in both FID and IW metrics, while the latter only helps if the discriminator was already transferred, otherwise harming the performance. Transferring both obtains the best result. We also found that training is more stable in this setting. Therefore, in the rest of the experiments we evaluated either training both networks from scratch or pre-training both (henceforth simply referred to as pre-trained).

Figure 1 shows the evolution of FID and IW during the training process with and without transfer. Networks adapted from a pre-trained model can generate images of given scores in significantly fewer iterations. Training from scratch for a long time manages to reduce this gap significantly, but pre-trained GANs can already generate images of good quality with far fewer iterations. Figures 2 and 4 show specific examples that illustrate these conclusions visually.
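The four initialization combinations compared in this section can be enumerated as in the following sketch. It is a plain-Python illustration, not the authors' code: weights are stored as dictionaries of flat lists, and the random initialization (Gaussian with standard deviation 0.02) is an assumption made only for the example.

```python
import copy
import itertools
import random

def init_network(param_sizes, pretrained=None, seed=0):
    """Return a copy of `pretrained` weights, or small random weights."""
    if pretrained is not None:
        return copy.deepcopy(pretrained)
    rng = random.Random(seed)
    # Assumed random initialization; the text does not specify one.
    return {name: [rng.gauss(0.0, 0.02) for _ in range(size)]
            for name, size in param_sizes.items()}

def transfer_configurations(sizes_g, sizes_d, source_g, source_d):
    """Build all (transfer generator?, transfer discriminator?) combinations."""
    configs = {}
    for transfer_g, transfer_d in itertools.product([False, True], repeat=2):
        g = init_network(sizes_g, source_g if transfer_g else None, seed=1)
        d = init_network(sizes_d, source_d if transfer_d else None, seed=2)
        configs[(transfer_g, transfer_d)] = (g, d)
    return configs
```

The key `(True, True)` corresponds to the setting the text calls pre-trained, and `(False, False)` to training from scratch.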

4.3 Size of the target dataset

The number of training images is critical to obtain realistic images, in particular as the resolution increases. Our experimental settings involve generating images of 64×64 pixels, where GANs typically require hundreds of thousands of training images to obtain convincing results. We evaluate our approach in a challenging setting where we use as few as 1000 images from the LSUN Bedrooms dataset, with ImageNet as source dataset. Note that, in general, GANs evaluated on LSUN Bedrooms use the full set of 3M images.

Table 2 shows FID and IW measured for different amounts of training samples of the target domain. As the training data becomes scarce, the training set implicitly becomes less representative of the full dataset (i.e. less diverse). In this experiment, a GAN adapted from the pre-trained model requires roughly between two and five times fewer images to obtain a score similar to that of a GAN trained from scratch. FID and IW are sensitive to this factor, so in order to have a lower bound we also measured the FID between the specific subset used as training data and the full dataset. With 1K images this value is even higher than the value for generated samples after training with 100K and 1M images. Initializing with the pre-trained GAN helps to improve the results in all cases, and the gain is more significant when the target data is more limited. The difference from the lower bound is still large, which suggests that there is still room for improvement in settings with limited data.

Figure 2 shows images generated at different iterations. As in the previous case, pre-trained networks can generate high quality images already in earlier iterations, in particular with sharper and more defined shapes and more realistic fine details. Visually, the difference is also more evident with limited data, where learning to generate fine details is difficult, so adapting pre-trained networks can transfer relevant prior information.
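The lower bound mentioned above (FID between the training subset and the full dataset) can be sketched as follows. This is an illustrative NumPy example on precomputed embeddings, with hypothetical function names and the same FID form as Eq. (5); it is not the authors' evaluation code.

```python
import numpy as np

def fid(feats1, feats2):
    """FID between two sets of precomputed embeddings (rows = samples)."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    # Trace of the matrix square root via eigenvalues of S1 S2.
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

def subset_lower_bound(full_feats, n_samples, seed=0):
    """FID between a random subset of size n_samples and the full set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(full_feats), size=n_samples, replace=False)
    return fid(full_feats[idx], full_feats)
```

Smaller subsets estimate the mean and covariance less accurately, so this value tends to grow as the subset shrinks, which is consistent with the behavior described for the 1K case.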

4.4 Source and target domains

The domain of the source model and its relation with the target domain are also a critical factor. We evaluate different combinations of source domains and target domains (see Table 3 for details). As source datasets we used ImageNet, Places, LSUN Bedrooms and CelebA. Note that both ImageNet and Places cover wide domains, with great diversity in objects and scenes, respectively, while LSUN Bedrooms and CelebA cover a narrow domain more densely. As target we used smaller datasets, including Oxford Flowers, LSUN Kitchens (a subset of 50K out of 2M images), Labeled Faces in the Wild (LFW) and Cityscapes.

We pre-trained GANs for the four source datasets and then trained five GANs for each of the four target datasets (from scratch and initialized with each of the source GANs). The FID and IW after fine tuning are shown in Table 4. Pre-trained GANs achieve significantly better results. Both metrics generally agree, but there are some interesting exceptions. The best source model for Flowers as target is ImageNet, which is not surprising since it also contains flowers, plants and objects in general. It is more surprising that Bedrooms is also competitive according to FID (but not so much according to IW). The most interesting case is perhaps Kitchens, since Places has several thousand kitchens in the dataset, yet also many more classes that are less related. In contrast, bedrooms and kitchens are not the same class yet are still very related visually and structurally, so the much larger set of related images in Bedrooms may be a better choice. Here FID and IW do not agree, with FID clearly favoring Bedrooms, and even the less related ImageNet, over Places, while IW prefers Places by a small margin. As expected, CelebA is the best source for LFW, since both contain faces (at different scales, though), but Bedrooms is surprisingly close in performance in both metrics. For Cityscapes all methods have similar results (within a similar range), with both high FID and IW, perhaps due to the large distance to all source domains.

4.5 Selecting the pre-trained model

Selecting a pre-trained model for a discriminative task (e.g. classification) reduces to simply selecting either ImageNet, for object-centric domains, or Places, for scene-centric ones. The target classifier or fine tuning will simply learn to ignore non-related features and filters of the source network.

However, this simple rule of thumb does not seem to apply so clearly in our GAN transfer setting, because generation is a much more complex task than discrimination. Results in Table 4 show that sometimes unrelated datasets may perform better than others that are apparently more related. The large number of unrelated classes may be an important factor, since narrow yet dense domains also seem to perform better even when they are not so related (e.g. Bedrooms). There are also non-trivial biases in the datasets that may explain this behavior. Therefore, given a collection of pre-trained GANs, a way to estimate the most suitable model for a given target dataset is desirable.

Perhaps the simplest way is to measure the distance between the source and target domains. We evaluated the FID between the (real) images in the target and the source datasets (results included in the supplementary material). While showing some correlation with the FID of the target generated data, it has the limitation of not considering whether the actual pre-trained model is able to accurately sample from the real distribution. A more helpful metric is the distance between the target data and the samples generated by the pre-trained model. In this way, the quality of the model is taken into account. We also estimate this distance using FID. In general, this distance seems to roughly correlate with the final FID results for target generated data (compare Tables 4 and 5). Nevertheless, it is surprising that Places is estimated to be a good source dataset but does not live up to the expectation. The opposite occurs for Bedrooms, which seems to deliver better results than expected. This may suggest that density is more important than diversity for a good transferable model, even for apparently unrelated target domains.

In our opinion, the FID between source generated and target real data is a rough indicator of suitability rather than an accurate metric. It should be taken into account jointly with other factors (e.g. quality of the source model) to decide which model is best for a given target dataset.
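The selection heuristic discussed above, ranking candidate source models by the FID between their generated samples and the real target data, might look like the sketch below. It assumes embeddings have already been extracted for the target images and for samples from each pre-trained generator; `rank_sources` and its arguments are hypothetical names. As the text stresses, the resulting ranking is only a rough indicator.

```python
import numpy as np

def fid(feats1, feats2):
    """FID between two sets of precomputed embeddings (rows = samples)."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    # Trace of the matrix square root via eigenvalues of S1 S2.
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

def rank_sources(target_feats, generated_feats_by_source):
    """Sort candidate source models by FID to the real target data.

    `generated_feats_by_source` maps a source name to embeddings of
    samples drawn from that source's pre-trained generator.
    Returns (name, fid) pairs, lowest FID first.
    """
    scores = {name: fid(target_feats, feats)
              for name, feats in generated_feats_by_source.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```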

4.6 Visualizing the adaptation process

One advantage of the image generation setting is that the process of shifting from the source domain towards the target domain can be visualized by sampling images at different iterations, in particular during the initial ones. Figure 4 shows some examples for the target domain Kitchens and different source domains (iterations are sampled on a logarithmic scale).

Trained from scratch, the generated images simply start with noisy patterns that evolve slowly, and after 4000 iterations the model manages to reproduce the global layout and color, but still fails to generate convincing details. Both the GANs pre-trained with Places and with ImageNet fail to generate realistic enough source images and often sample from unrelated source classes (see iteration 0). During the initial adaptation steps, the GAN tries to generate kitchen-like patterns by matching and slightly modifying the source pattern, therefore preserving global features such as colors and global layout, at least during a significant number of iterations, then slowly changing them to more realistic ones. Nevertheless, the textures and edges are sharper and more realistic than when training from scratch.

The GAN pre-trained with Bedrooms can already generate very convincing bedrooms, which share a lot of features with kitchens. The larger number of training images in Bedrooms helps to learn transferable fine-grained details that other datasets cannot provide. The adaptation mostly preserves the layout, colors and perspective of the source generated bedroom, and slowly transforms it into kitchens by changing fine-grained details, resulting in more convincing images than with the other source datasets. Despite being a completely unrelated domain, CelebA also manages to help speed up the learning process by providing useful priors. Different parts such as face, hair and eyes are transformed into different parts of the kitchen. Rather than the face itself, the most predominant feature remaining from the source generated image is the background color and shape, which influence the layout and colors that the generated kitchens will have.
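Sampling iterations on a logarithmic scale, as done for the visualization described above, can be sketched in a few lines. The helper name and the choice to always include iteration 0 (the unmodified source model) are assumptions for illustration; `n_points` must be at least 2.

```python
def log_spaced_iterations(max_iter, n_points):
    """Iteration 0 plus n_points iterations evenly spaced in log scale.

    The spaced points run from 1 to max_iter; duplicates produced by
    rounding are removed.
    """
    steps = {0}
    for i in range(n_points):
        steps.add(round(max_iter ** (i / (n_points - 1))))
    return sorted(steps)
```

Denser sampling of the early iterations is useful here because, according to the text, most of the visible shift from source to target happens in the initial steps.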

5 Transferring to conditional GANs

Here we study transferring the representation learned by a pre-trained unconditional GAN to a cGAN [2]. cGANs allow us to condition the generative model on particular information such as classes, attributes, or even other images. Let $y$ be a conditioning variable. The discriminator $D(x, y)$ aims to distinguish pairs of real data $x$ and $y$ sampled from the joint distribution $p_{data}(x, y)$ from pairs of generated outputs $G(z, y')$ conditioned on samples $y'$ from $y$'s marginal $p_{data}(y)$.

5.1 Conditional GAN adaptation

For the current study, we adopt the Auxiliary Classifier GAN (AC-GAN) framework of [25]. In this formulation, the discriminator has an 'auxiliary classifier' that outputs a probability distribution over classes $P(C = y \mid x)$ conditioned on the input $x$. The objective function is then composed of the conditional version of the GAN loss $\mathcal{L}_{GAN}$ (eq. (2)) and the log-likelihood of the correct class. The final loss functions for generator and discriminator are:

$$\mathcal{L}_{AC\text{-}GAN}(G) = \mathcal{L}_{GAN}(G) - \alpha_G\, \mathbb{E}\left[\log P(C = y' \mid G(z, y'))\right] \tag{7}$$

$$\mathcal{L}_{AC\text{-}GAN}(D) = \mathcal{L}_{GAN}(D) - \alpha_D\, \mathbb{E}\left[\log P(C = y \mid x)\right] \tag{8}$$

respectively. The parameters $\alpha_G$ and $\alpha_D$ weight the contribution of the auxiliary classifier loss with respect to the GAN loss for the generator and discriminator. In our implementation, we use ResNet-18 [50] for both G and D, and the WGAN-GP loss from equations (3) and (4) as the GAN loss. Overall, the implementation details (batch size, learning rate) are the same as introduced in section 4.1.

In AC-GAN, the conditioning is performed only on the generator by appending the class label to the input noise vector. We call this variant 'Cond Concat'. We randomly initialize the weights which are connected to the conditioning prior. We also used another variant following [32], in which the conditioning prior is embedded in the batch normalization layers of the generator (referred to as 'Cond BNorm'). In this case, there are different batch normalization parameters for each class. We initialize these parameters by copying the values from the unconditional GAN to all classes.
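The auxiliary classifier term of Eqs. (7) and (8) and the 'Cond BNorm' initialization can be illustrated with the NumPy sketch below. It is a simplified example under assumptions: the classifier output is given as raw logits, the GAN loss is passed in as a precomputed number, and batch normalization parameters are plain arrays. All function names are hypothetical.

```python
import numpy as np

def aux_classifier_loss(logits, labels):
    """Negative mean log-likelihood of the correct class, -E[log P(C = y | x)]."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def ac_gan_loss(gan_loss, logits, labels, alpha):
    """L_GAN - alpha * E[log P(C = y | x)], the form shared by Eqs. (7) and (8)."""
    return gan_loss + alpha * aux_classifier_loss(logits, labels)

def init_cond_bnorm(gamma, beta, n_classes):
    """Copy unconditional batch-norm scale and shift to every class.

    Returns arrays of shape (n_classes, n_features), one row per class,
    all starting from the same pre-trained values.
    """
    return np.tile(gamma, (n_classes, 1)), np.tile(beta, (n_classes, 1))
```

Because every class starts from identical batch normalization parameters, the conditional generator initially behaves like the unconditional one and only diverges per class during fine tuning.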
5.2 Results
We use Places [13] as the source domain and consider all ten classes of the LSUN dataset [45] as the target domain. We train the AC-GAN with 10K images per class for 25K iterations. The weights of the conditional GAN can be transferred from the pre-trained unconditional GAN (see section 3.1) or initialized at random. The performance is assessed in terms of the FID score between the target domain and the generated images. The FID is computed class-wise, averaging over all classes, and also considering the dataset as a whole (class-agnostic case). The classes in the target domain have been generated uniformly. The results are presented in table 6, where we show the performance of the AC-GAN whose weights have been transferred from the pre-trained network vs. an AC-GAN initialized randomly. We computed the FID for 250, 2500 and 25000 iterations. At the beginning of the learning process, there is a significant difference between the two cases. The gap is reduced towards the end of the learning process, but a significant performance gain still remains for pre-trained networks. We also consider the case with fewer images per class. The results after 25000 iterations for 100 and 1K images per class are provided in the last column of table 7. We can observe how the difference between networks trained from scratch or from pre-trained weights is more significant for smaller sample sizes. This confirms the trend observed in section 4.3: transferring the pre-trained weights is especially advantageous when only limited data is available.
The same behavior can be observed in figure 5 (left), where we compare the performance of the AC-GAN with two unconditional GANs, one pre-trained on the source domain and one trained from scratch, as in section 4.2. The curves correspond to the class-agnostic case (column ‘All’ in table 6). From this plot, we can observe three aspects: (i) the two variants of AC-GAN perform similarly (for this reason, for the remainder of the experiments we consider only ‘Cond BNorm’); (ii) the network initialized with pre-trained weights converges faster than the network trained from scratch, and the overall performance is better; and (iii) AC-GAN performs slightly better than the unconditional GAN.
Next, we evaluate the AC-GAN performance on a classification experiment. We train a reference classifier on the 10 classes of LSUN (10K real images per class). Then, we evaluate the quality of each model trained for 25K iterations by generating 10K images per class and measuring the accuracy of the reference classifier for 100, 1K and 10K images per class. The results show an improvement when using pre-trained models, with higher accuracy and lower FID in all settings, suggesting that they capture the real data distribution of the dataset better than models trained from scratch.
Finally, we perform a psychophysical experiment with images generated by AC-GAN with LSUN as target. Human subjects are presented with two images: pre-trained vs. from scratch (generated from the same condition <class>), and asked ‘Which of these two images of <class> is more realistic?’ Subjects were also given the option to skip a particular pair should they find it very hard to decide for one of them. We require each subject to provide 100 valid assessments. We use 10 human subjects who evaluate image pairs for different settings (100, 1K, 10K images per class). The results (Fig. 5 right) clearly show that the images based on pre-trained GANs are considered to be more realistic in the case of 100 and 1K images per class (e.g. pre-trained is preferred in 67% of cases with 1K images). As expected, the difference is smaller for the 10K case.
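The way a preference percentage can be derived from such pairwise judgments is sketched below. This is only an illustration under assumptions: the function name `preference_rate`, the vote labels, and the example votes are hypothetical and are not the data collected in the experiment. Skipped pairs are excluded, mirroring the requirement that only valid assessments count.

```python
def preference_rate(votes):
    """Fraction of valid votes that prefer the pre-trained model.

    votes: list of 'pretrained', 'scratch' or 'skip' (one entry per pair).
    """
    valid = [v for v in votes if v != "skip"]
    if not valid:
        raise ValueError("no valid assessments")
    return sum(v == "pretrained" for v in valid) / len(valid)


# Made-up votes from one subject, for illustration only.
example_votes = ["pretrained", "scratch", "skip", "pretrained"]
rate = preference_rate(example_votes)
```

A rate near 0.5 would mean subjects cannot tell the two models apart, while values well above 0.5 indicate a preference for the pre-trained model.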

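5 Transferring to conditional GANs
Here we study transferring the representation learned by a pre-trained unconditional GAN to a cGAN [2]. cGANs allow us to condition the generative model on particular information such as classes, attributes, or even other information. Let y be the conditioning variable. The discriminator D(x, y) aims to distinguish pairs of real data x and y sampled from the joint distribution p_data(x, y) from pairs of generated outputs G(z, y′), conditioned on samples y′ drawn from the marginal p_data(y).
5.1 Conditional GAN adaptation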
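For the current study, we adopt the Auxiliary Classifier GAN (AC-GAN) framework of [25]. In this formulation, the discriminator has an ‘auxiliary classifier’ that outputs a probability distribution over classes P(C = y | x) conditioned on the input x. The objective function is then composed of the conditional version of the GAN loss L_GAN (eq. (2)) and the log-likelihood of the correct class. The final loss functions for the generator and discriminator are:

$$L_{AC\text{-}GAN}(G) = L_{GAN}(G) - \alpha_G \, \mathbb{E}\left[\log P(C = y' \mid G(z, y'))\right] \qquad (7)$$

$$L_{AC\text{-}GAN}(D) = L_{GAN}(D) - \alpha_D \, \mathbb{E}\left[\log P(C = y \mid x)\right] \qquad (8)$$

respectively. The parameters α_G and α_D weight the contribution of the auxiliary classifier loss with respect to the GAN loss for the generator and discriminator. In our implementation, we use ResNet-18 [50] for both G and D, and the WGAN-GP loss from equations (3) and (4) as the GAN loss. Overall, the implementation details (batch size, learning rate) are the same as introduced in section 4.1.
In AC-GAN, the conditioning is performed only on the generator by appending the class label to the input noise vector. We call this variant ‘Cond Concat’. We randomly initialize the weights that are connected to the conditioning prior. We also used another variant following [32], in which the conditioning prior is embedded in the batch normalization layers of the generator (referred to as ‘Cond BNorm’). In this case, there are different batch normalization parameters for each class. We initialize these parameters by copying the values from the unconditional GAN to all classes.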
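The weighted auxiliary-classifier objective of this section can be written out as a small numeric sketch. It assumes the GAN loss has already been computed elsewhere and is passed in as a plain number, and that the auxiliary classifier returns one probability vector per sample; the function names and example values are hypothetical, not the paper's code.

```python
import math


def mean_log_likelihood(class_probs, labels):
    """Batch estimate of E[log P(C = y | x)] for the correct labels y."""
    total = sum(math.log(probs[y]) for probs, y in zip(class_probs, labels))
    return total / len(labels)


def ac_gan_loss(gan_loss, class_probs, labels, alpha):
    """GAN loss minus alpha times the log-likelihood of the correct class.

    The same form serves the generator (class_probs computed on generated
    images, labels = conditioning classes) and the discriminator
    (class_probs computed on real images, labels = true classes).
    """
    return gan_loss - alpha * mean_log_likelihood(class_probs, labels)


# Illustrative batch: two samples, two classes.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = ac_gan_loss(gan_loss=1.0, class_probs=probs, labels=labels, alpha=1.0)
```

Setting `alpha` to zero recovers the plain GAN loss, and larger values push the networks harder towards images that the auxiliary classifier assigns to the intended class.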
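5.2 Results
We use Places [13] as the source domain and treat all ten classes of the LSUN dataset [45] as the target domain. We train the AC-GAN with 10K images per class for 25K iterations. The weights of the conditional GAN are either transferred from the pre-trained unconditional GAN (see section 3.1) or initialized at random. Performance is assessed by the FID score between the target domain and the generated images. The FID is computed class-wise and averaged over all classes, and also on the dataset as a whole (the class-agnostic case). The classes in the target domain are generated uniformly. Table 6 compares the AC-GAN whose weights were transferred from the pre-trained network with an AC-GAN initialized randomly. We computed the FID after 250, 2500 and 25000 iterations. At the beginning of the learning process there is a significant difference between the two cases. The gap narrows towards the end of training, but a significant performance gain remains for the pre-trained networks. We also consider the case with fewer images per class. The results after 25000 iterations for 100 and 1K images per class are given in the last column of table 7. The difference between networks trained from scratch and networks trained from pre-trained weights is more significant for smaller sample sizes. This confirms the trend observed in section 4.3: transferring pre-trained weights is especially advantageous when only limited data is available.
The same behavior can be seen in figure 5 (left), where we compare the AC-GAN with two unconditional GANs, one pre-trained on the source domain and one trained from scratch, as in section 4.2. The curves correspond to the class-agnostic case (column ‘All’ in table 6). From this plot we can observe three aspects: (i) the two variants of AC-GAN perform similarly (for this reason we consider only ‘Cond BNorm’ in the remaining experiments); (ii) the network initialized with pre-trained weights converges faster than the network trained from scratch, and its overall performance is better; and (iii) AC-GAN performs slightly better than the unconditional GAN.
Next, we evaluate the AC-GAN in a classification experiment. We train a reference classifier on the 10 classes of LSUN (10K real images per class). We then evaluate each model trained for 25K iterations by generating 10K images per class and measuring the accuracy of the reference classifier for 100, 1K and 10K images per class. The results show an improvement when using pre-trained models, with higher accuracy and lower FID in all settings, which suggests that they capture the real data distribution of the dataset better than training from scratch.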
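Finally, we perform a psychophysical experiment with images generated by the AC-GAN with LSUN as target. Human subjects are shown two images, one from the pre-trained model and one from the model trained from scratch (both generated from the same condition), and are asked which of the two images is more realistic. Subjects could also skip a particular pair if they found it very hard to decide between the two. We require each subject to provide 100 valid assessments. We use 10 human subjects to evaluate image pairs for the different settings (100, 1K, 10K images per class). The results (Fig. 5 right) clearly show that images based on pre-trained GANs are considered more realistic in the case of 100 and 1K images per class (e.g. pre-trained is preferred in 67% of cases with 1K images). As expected, the difference is smaller for the 10K case.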

6 Conclusions

We show how the principles of transfer learning can be applied to generative features for image generation with GANs. GANs, and conditional GANs, benefit from transferring pre-trained models, resulting in lower FID scores and more recognizable images with less training data. Somewhat contrary to intuition, our experiments show that transferring the discriminator is much more critical than the generator (yet transferring both networks is best). However, there are also other important differences with the discriminative scenario. Notably, it seems that a much higher density (images per class) is required to learn good transferable features for image generation than for image discrimination (where diversity seems more critical). As a consequence, ImageNet and Places, while producing excellent transferable features for discrimination, seem not dense enough for generation, and LSUN data seems to be a better choice despite its limited diversity. Nevertheless, poor transferability may also be related to the limitations of current GAN techniques, and better ones could also lead to better transferability. Our experiments evaluate GANs in settings rarely explored in previous works and show that there are many open problems. These settings include GANs and evaluation metrics in the very limited data regime, better mechanisms to estimate the most suitable pre-trained model for a given target dataset, and the design of better pre-trained GAN models.

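Conclusions
We have shown how the principles of transfer learning can be applied to generative features for image generation with GANs. GANs and conditional GANs benefit from transferring pre-trained models, which yields lower FID scores and more recognizable images with less training data. Somewhat contrary to intuition, our experiments show that transferring the discriminator matters much more than transferring the generator (although transferring both networks is best). There are, however, other important differences from the discriminative scenario. Notably, learning good transferable features for image generation seems to require a much higher density (images per class) than image discrimination, where diversity seems more critical. As a consequence, ImageNet and Places, while producing excellent transferable features for discrimination, seem not dense enough for generation, and LSUN data seems to be a better choice despite its limited diversity. Nevertheless, poor transferability may also be related to the limitations of current GAN techniques, and better techniques could lead to better transferability. Our experiments evaluate GANs in settings rarely explored in previous work and show that many problems remain open. These include GANs and evaluation metrics in the very limited data regime, better mechanisms to estimate the most suitable pre-trained model for a given target dataset, and the design of better pre-trained GAN models.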


Reposted from: https://blog.csdn.net/weixin_40262196/article/details/102532150