Conditional Variational Autoencoder (CVAE)

We would like to introduce the conditional variational autoencoder (CVAE) [2][3], a deep generative model, and show our online demonstration (Facial VAE). We believe that the CVAE method is very promising for many fields, such as image generation, anomaly detection, and so on.


The variational autoencoder (VAE) [1] is a generative model which uses deep neural networks to describe the distribution of observed and latent (unobserved) variables. In the VAE model, we assume that the data x is generated by pθ(x|z), where θ denotes the parameters of a deep neural network. Given the data x, we want to maximize the log-likelihood log p(x), and to make this problem tractable we use variational inference. The main principle of variational inference is to introduce an approximate distribution q(z|x) and maximize a lower bound on the log-likelihood instead. In the VAE, the approximate distribution is parameterized as qφ(z|x), where φ denotes the parameters of another deep neural network. We usually call qφ(z|x) the encoder (or the recognition model) and pθ(x|z) the decoder (or the generative model). From this point of view, we can see this model as a different kind of autoencoder, which is why it is called a variational “autoencoder”. Using the training algorithm Stochastic Gradient Variational Bayes (SGVB), both networks of the VAE can be trained jointly by backpropagation. If you want to know the VAE algorithm in detail, please see [1].
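To make the lower bound concrete, here is a minimal NumPy sketch of the two terms SGVB optimizes for one data point: the closed-form KL divergence between a Gaussian qφ(z|x) and the standard-normal prior, and a Bernoulli reconstruction term. The encoder/decoder outputs (mu, log_var, p) are hard-coded toy values standing in for network outputs; in a real VAE they would be produced by the networks with parameters φ and θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoder outputs q_phi(z|x) = N(mu, diag(sigma^2));
# in a real VAE these come from a neural network with parameters phi.
mu = np.array([0.5, -0.3])
log_var = np.array([-1.0, -0.5])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so the sampling step is differentiable with respect to phi.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence between q_phi(z|x) and the prior p(z) = N(0, I),
# in the closed form given in [1].
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Reconstruction term log p_theta(x|z) for binary data (Bernoulli decoder);
# x and p are toy values standing in for data and decoder output.
x = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
recon = np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

# The lower bound (ELBO) that SGVB maximizes by gradient ascent.
elbo = recon - kl
```

Averaging this bound over minibatches and following its gradient with respect to both θ and φ is exactly the SGVB training loop described in [1].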

Like ordinary autoencoders, the VAE can reconstruct its input data. Moreover, the VAE can generate samples from random latent values by using the decoder pθ(x|z). Note that these generated samples do not actually exist in the original training data: the VAE learns the distribution pθ(x|z) that is most likely to have generated the original data, so we can generate new samples which look like the original data.
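The generation step above can be sketched in a few lines. The decoder here is just a fixed random affine map followed by a sigmoid, a hypothetical stand-in for a trained network pθ(x|z); only the sampling procedure (draw z from the prior, then decode) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in decoder: in a trained VAE this would be the network p_theta(x|z).
# Here it is an untrained affine map plus sigmoid, for illustration only.
W = 0.1 * rng.standard_normal((784, 20))
b = np.zeros(784)

def decode(z):
    # Returns Bernoulli means in (0, 1), one per pixel of a 28x28 image.
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))

# To generate a brand-new sample, draw z from the prior N(0, I) and decode.
z = rng.standard_normal(20)
x_new = decode(z)  # a 784-dimensional vector of pixel probabilities
```

With a trained decoder, x_new would resemble the training images even though it corresponds to no particular training example.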



The conditional VAE [2][3] is a VAE architecture that conditions on another description of the data, y. With this model, we can generate samples from the conditional distribution p(x|y). By changing the value of y, such as the digit label in MNIST, we can get corresponding samples x ~ p(x|y).
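A simple way to realize this conditioning, used for example in [2], is to feed y to both networks as an extra input; concatenating a one-hot label onto the network input is a common minimal choice. The sketch below shows only this wiring, with toy arrays standing in for a flattened MNIST image and a latent sample.

```python
import numpy as np

def one_hot(label, num_classes=10):
    # Encode an MNIST digit label as a one-hot vector.
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

x = np.ones(784)   # toy stand-in for a flattened 28x28 MNIST image
y = one_hot(3)     # condition: digit class "3"
z = np.zeros(20)   # toy stand-in for a latent sample

# Both networks receive y as an additional input:
encoder_input = np.concatenate([x, y])  # feeds q_phi(z | x, y)
decoder_input = np.concatenate([z, y])  # feeds p_theta(x | z, y)
```

Changing y while keeping z fixed then yields samples of different classes that share the same latent "style", which is exactly the behavior described above.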


Application : Facial VAE

To demonstrate and explain how the CVAE can be applied to real face images, we created “Facial VAE”. This demo is inspired by [4][5]. In our demo, we can generate face images conditioned on attributes, such as “male”, “young”, and so on. Moreover, we combine the VAE and a Generative Adversarial Network (GAN) [6] into one model in order to generate clearer images. The most significant difference between [4][5] and our demo is the ability to infer attributes from an image given by a user. If you submit your image to Facial VAE, you will get the corresponding attributes and a reconstruction of the image. Moreover, if you change the attributes, you will get a new image generated conditioned on those attributes.


Facial VAE – Demo

Please give our demo a try!

[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Kingma, Diederik P., et al. “Semi-supervised learning with deep generative models.” Advances in Neural Information Processing Systems. 2014.

[3] Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. “Learning Structured Output Representation using Deep Conditional Generative Models.” Advances in Neural Information Processing Systems. 2015.

[4] Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, and Ole Winther. “Autoencoding beyond pixels using a learned similarity metric.” arXiv preprint arXiv:1512.09300 (2015).

[5] Generating Faces with Torch.

[6] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems. 2014.