Generative AI: Intro

Part 1: Introduction to Generative AI
AI
Engineering
GenerativeAI
Author

Gurpreet Johl

Published

February 19, 2024

Introduction to Generative Deep Learning

The 30,000ft view of generative AI.

1. What is Generative Modeling?

Generative modeling is a branch of machine learning that involves training a model to produce new data that is similar to a given dataset.

This means we require some training data which we wish to imitate. The model should be probabilistic rather than deterministic, so that we can sample different variations of the output.

A helpful way to define generative modeling is by contrasting it with its counterpart, discriminative modeling:

  • Discriminative modeling estimates \(p(y|x)\) - the probability of a label \(y\) given some observation \(x\)
  • Generative modeling estimates \(p(x)\) - the probability of seeing an observation \(x\). By sampling from this distribution we can generate new observations.
    • We can also create a conditional generative model to estimate \(p(x|y)\), the probability of seeing an observation \(x\) with label \(y\). For example, we could train a model to generate a given type of fruit. A minimal code sketch contrasting the two approaches is given below.
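
To make the contrast concrete, here is a minimal sketch using scikit-learn (my choice of library, not something this post prescribes). A logistic regression plays the role of the discriminative model \(p(y|x)\), a Gaussian mixture fitted to the observations acts as a generative model \(p(x)\) we can sample from, and fitting a separate density per label gives a crude conditional generative model \(p(x|y)\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy dataset: two types of fruit described by (length_cm, weight_g).
bananas = rng.normal(loc=[18.0, 120.0], scale=[2.0, 15.0], size=(200, 2))
apples = rng.normal(loc=[8.0, 150.0], scale=[1.0, 20.0], size=(200, 2))
X = np.vstack([bananas, apples])
y = np.array([0] * 200 + [1] * 200)  # 0 = banana, 1 = apple

# Discriminative model: estimates p(y|x), i.e. "which fruit is this observation?"
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[17.0, 115.0]]))

# Generative model: estimates p(x); sampling it produces new observations.
gen = GaussianMixture(n_components=2, random_state=0).fit(X)
synthetic_fruit, _ = gen.sample(5)
print(synthetic_fruit)

# Conditional generative model: estimate p(x|y) by fitting one density per label.
banana_model = GaussianMixture(n_components=1, random_state=0).fit(bananas)
print(banana_model.sample(3)[0])  # new banana-like observations
```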

2. The Generative Modeling Framework

The aim:

  • We have training data of observations \(X\)
  • We assume the training data has been generated by an unknown distribution \(p_{data}\)
  • We want to build a generative model \(p_{model}\) that mimics \(p_{data}\)
    • If we succeed, we can sample from \(p_{model}\) to generate synthetic observations, as sketched in the example below
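
A minimal sketch of this framework, assuming scikit-learn and using a kernel density estimate as a stand-in for \(p_{model}\) (any density estimator would do):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)

# Stand-in for p_data: an unknown process we only observe through samples X.
X = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# p_model: a kernel density estimate fitted to the observations.
p_model = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

# If the fit is good, samples from p_model look like samples from p_data.
print(p_model.sample(n_samples=5, random_state=0).ravel())

# score_samples returns log p_model(x): realistic points should score highly.
print(p_model.score_samples([[-2.0], [10.0]]))
```

The `score_samples` call also hints at the accuracy property below: points that look real should receive high values under \(p_{model}\).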

Desirable properties of \(p_{model}\) are:

  • Accuracy - Observations to which \(p_{model}\) assigns a high value should look like real observations; likewise, observations assigned a low value should look fake.
  • Generation - It should be easy to sample from it.
  • Representation - It should be possible to understand how high-level features are represented by the model.

3. Representation Learning

We want to be able to describe things in terms of high-level features. For example, when describing appearance, we might talk about hair colour, length, eye colour, etc. Not RGB values pixel by pixel…

This is the idea behind representation learning. We describe each training data observation using some lower-dimensional latent space. Then we learn a mapping function from the latent space to the original domain.

Encoder-decoder techniques try to transform a high-dimensional nonlinear manifold into a simpler latent space.
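
As an illustrative sketch (using Keras, my choice of framework here, on synthetic data), a small autoencoder learns exactly this pair of mappings: an encoder from the original domain into a 2-D latent space, and a decoder mapping back from the latent space to the original domain.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy high-dimensional data: 100 observed features that are really driven by
# just 2 underlying factors - the kind of structure a latent space can capture.
rng = np.random.default_rng(0)
factors = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 100))
X = np.tanh(factors @ mixing).astype("float32")

latent_dim = 2

# Encoder: original domain -> lower-dimensional latent space.
encoder = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(latent_dim),
])

# Decoder: the learned mapping from the latent space back to the original domain.
decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(100),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# Each 100-dimensional observation now has a 2-dimensional representation.
print(encoder.predict(X[:3], verbose=0))
```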

4. Core Probability Theory

Sample space

The complete set of all values that an observation \(x\) can take.

Probability density function

A function \(p(x)\) that maps a point \(x\) in the sample space to a non-negative number. The integral of \(p(x)\) over the sample space must equal 1 for it to be a well-defined probability distribution.
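
A quick numerical check of this definition, using scipy purely for illustration: density values at individual points are not probabilities and can exceed 1, but the integral over the sample space is 1.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# A narrow Gaussian: its density at a single point can exceed 1...
p = norm(loc=0.0, scale=0.1)
print(p.pdf(0.0))                      # roughly 3.99 - a density, not a probability

# ...but the integral over the whole sample space must equal 1.
total, _ = quad(p.pdf, -np.inf, np.inf)
print(total)                           # roughly 1.0
```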

There is one true data distribution \(p_{data}\) but infinitely many approximations \(p_{model}\) that we can find.

Parametric modeling

We can structure our approach to finding \(p_{model}\) by restricting ourselves to a family of density functions \(p_{\theta}(x)\) which can be described with a finite set of parameters \(\theta\).

For example, restricting our search to a linear model \(y = wx + b\), where we need to find the parameters \(w\) and \(b\).

Likelihood

The likelihood \(L(\theta|x)\) of a parameter set \(\theta\) is a function that measures how plausible \(\theta\) is, given some observed data point \(x\). It is the probability of seeing the data \(x\) if the true data-generating distribution were the model parameterised by \(\theta\).

It is defined as the parametric model density: \[ L(\theta|x) = p_{\theta}(x) \]

If we have a dataset of independent observations \(X\) then the probabilities multiply: \[ L(\theta|X) = \prod_{x \in X} p_{\theta}(x) \]

The product of many small numbers quickly becomes numerically unwieldy, so we usually work with the log-likelihood \(l\), which turns the product into a sum: \[ l(\theta|X) = \sum_{x \in X} \log{p_{\theta}(x)} \]
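
A concrete sketch of these definitions, assuming a Gaussian as the parametric family (numpy and scipy are my choices here, not anything prescribed by the post):

```python
import numpy as np
from scipy.stats import norm

# Observed data X, assumed drawn independently from some Gaussian.
X = np.array([1.8, 2.1, 1.9, 2.4, 2.0])

def log_likelihood(theta, X):
    """l(theta|X): the sum of log p_theta(x) over the dataset."""
    mu, sigma = theta
    return np.sum(norm.logpdf(X, loc=mu, scale=sigma))

# Parameters close to the data are more plausible (higher log-likelihood).
print(log_likelihood((2.0, 0.2), X))   # relatively high: plausible parameters
print(log_likelihood((5.0, 0.2), X))   # far more negative: implausible parameters
```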

The focus of parametric modeling is therefore to find the optimal parameter set \(\hat{\theta}\), which leads to…

Maximum likelihood estimation

A technique to estimate \(\hat{\theta}\), the set of parameters that is most likely to explain the observed data \(X\): \[ \hat{\theta} = \arg\max_{\theta} l(\theta|X) \]

Neural networks often work with a loss function, so this is equivalent to minimising the negative log-likelihood: \[ \hat{\theta} = \arg\min_\theta -l(\theta|X) = \arg\min_\theta -\log{p_{\theta}(X)} \]
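
Continuing the Gaussian sketch above, the same estimate can be recovered numerically by minimising the negative log-likelihood (scipy.optimize is my choice of tool here); for a Gaussian, the result should agree with the closed-form MLE, the sample mean and standard deviation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

X = np.array([1.8, 2.1, 1.9, 2.4, 2.0])

def negative_log_likelihood(theta):
    mu, log_sigma = theta              # optimise log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(X, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# For a Gaussian the MLE has a closed form: the sample mean and standard deviation.
print(mu_hat, sigma_hat)
print(X.mean(), X.std())               # should agree closely
```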

Bringing it back to generative modeling

Generative modeling is a form of MLE where the parameters are the neural network weights. We want to find the weights that maximise the likelihood of observing the training data.

But for high-dimensional problems, \(p_{\theta}(X)\) is intractable: it cannot be calculated directly.

There are different approaches to making the problem tractable. These are summarised in the following section and explored in more detail in subsequent chapters.

5. Generative Model Taxonomy

All types of generative model are seeking to solve the same problem, but they take different approaches to modeling the intractable density function \(p_{\theta}(x)\).

There are three general approaches:

  1. Explicitly model the density function but constrain the model in some way.
  2. Explicitly model a tractable approximation of the density function.
  3. Implicitly model the density function, using a stochastic process that generates the data directly.

flowchart LR

  A(Generative models) --> B1(Explicit density)
  A(Generative models) --> B2(Implicit density)

  B1 --> C1(Approximate density)
  B1 --> C2(Tractable density)

  C1 --> D1(VAE)
  C1 --> D2(Energy-based model)
  C1 --> D3(Diffusion model)

  C2 --> D4(Autoregressive model)
  C2 --> D5(Normalizing flow model)

  B2 --> D6(GAN)

  • Implicit density models do not attempt to estimate the probability density at all; instead they focus on a stochastic process that generates data directly.
  • Tractable density models place constraints on the model architecture so that the density function has a form that can be calculated directly.
  • Approximate density models optimise a tractable approximation to the density function rather than the density itself.

References

  • Chapter 1 of Generative Deep Learning by David Foster.