Continuous Normalizing Flows

June 17, 2025

by Leonardo

1. Continuous Normalizing Flows (CNF)

CNFs are a particular case of Neural ODE networks, with additional tricks to compute the likelihood in order to train them. Given a data point $𝑥^{(𝑖)}$, we want to know its likelihood $𝑝_{1} (𝑥^{(𝑖)})$ under the model density at time $𝑡 = 1$.

Directly computing $𝑝_{1} (𝑥^{(𝑖)})$ is intractable, so we use an approach similar to the Change of Variables formula. We start from the transport equation:

$\frac{\partial 𝑝_{𝑡} (𝑥)}{\partial 𝑡} = - (\nabla \cdot (𝑢_{𝜃} 𝑝_{𝑡})) (𝑥)$

By following the Lagrangian perspective (tracking individual particles)¹, we have:

Instantaneous Change of Variables: Let $𝑥_{𝑡}$ be a finite continuous random variable with probability $𝑝_{𝑡} (𝑥_{𝑡})$ dependent on time. Let $\frac{𝑑 𝑥_{𝑡}}{𝑑 𝑡} = 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡)$ be a differential equation describing a continuous-in-time transformation of $𝑥_{𝑡}$. Assuming that $𝑢_{𝜃}$ is uniformly Lipschitz continuous in $𝑥$ and continuous in $𝑡$, the change in log probability also follows a differential equation:

$\frac{𝑑 \log 𝑝_{𝑡} (𝑥_{𝑡})}{𝑑 𝑡} = - \operatorname{tr} \left( \frac{𝑑 𝑢_{𝜃}}{𝑑 𝑥_{𝑡}} \right) = - \operatorname{tr} (𝐽_{𝑢_{𝜃}} (𝑥_{𝑡})) = - (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡})$
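
As a quick sanity check of this formula, consider a linear field $𝑢_{𝜃} (𝑥, 𝑡) = 𝐴 𝑥$ with a constant matrix $𝐴$. This field is an assumption made purely for illustration; it is not a case treated elsewhere in this post. The Jacobian is $𝐴$ itself, so the formula gives:

$\frac{𝑑 \log 𝑝_{𝑡} (𝑥_{𝑡})}{𝑑 𝑡} = - \operatorname{tr} (𝐴) \quad \Rightarrow \quad \log 𝑝_{𝑡} (𝑥_{𝑡}) = \log 𝑝_{0} (𝑥_{0}) - 𝑡 \operatorname{tr} (𝐴)$

The flow map is $𝑥_{𝑡} = 𝑒^{𝑡 𝐴} 𝑥_{0}$, so the usual (discrete) change of variables yields the same result, because $\det 𝑒^{𝑡 𝐴} = 𝑒^{𝑡 \operatorname{tr} (𝐴)}$:

$\log 𝑝_{𝑡} (𝑥_{𝑡}) = \log 𝑝_{0} (𝑥_{0}) - \log | \det 𝑒^{𝑡 𝐴} | = \log 𝑝_{0} (𝑥_{0}) - 𝑡 \operatorname{tr} (𝐴)$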

Thus, for a given data point $𝑥^{(𝑖)}$ at time $𝑡 = 1$, we can compute its log-likelihood by solving the following system of ODEs backwards in time:

$\begin{cases} \frac{𝑑 𝑥_{𝑡}}{𝑑 𝑡} = 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡) \\ \frac{𝑑 \log 𝑝_{𝑡} (𝑥_{𝑡})}{𝑑 𝑡} = - (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡}) \end{cases}$

Starting from $𝑥_{1} = 𝑥^{(𝑖)}$ and integrating from $𝑡 = 1$ to $𝑡 = 0$, we obtain $\log 𝑝_{0} (𝑥_{0}) = \log 𝑝_{1} (𝑥^{(𝑖)}) + \int_{0}^{1} (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡}) 𝑑 𝑡$, which is equivalent to $\log 𝑝_{1} (𝑥^{(𝑖)}) = \log 𝑝_{0} (𝑥_{0}) - \int_{0}^{1} (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡}) 𝑑 𝑡$.
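
The backward integration can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the 1-D toy field `u(x, t) = a * t * x` stands in for the learned $𝑢_{𝜃}$, the constant `A_COEF` is arbitrary, the base density $𝑝_{0}$ is taken to be a standard normal, and a fixed-step RK4 integrator replaces whatever ODE solver one would use in practice. The toy field is chosen because $𝑝_{1}$ then has a closed form to compare against.

```python
import math

A_COEF = 0.8  # hypothetical constant, chosen only for this illustration


def u(x, t):
    """Toy 1-D velocity field u(x, t) = a * t * x, standing in for u_theta."""
    return A_COEF * t * x


def div_u(x, t):
    """Divergence of u; in 1-D this is simply du/dx = a * t."""
    return A_COEF * t


def log_p0(x):
    """Log-density of the base distribution p_0, assumed standard normal."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def rhs(x, t):
    """Right-hand side of the augmented system (dx/dt, d log p / dt)."""
    return u(x, t), -div_u(x, t)


def log_likelihood(x1, n_steps=200):
    """Return log p_1(x1) by integrating the system from t = 1 to t = 0 (RK4)."""
    h = -1.0 / n_steps  # negative step size: we move backwards in time
    x, t = x1, 1.0
    delta = 0.0  # accumulates log p_0(x_0) - log p_1(x_1)
    for _ in range(n_steps):
        k1x, k1l = rhs(x, t)
        k2x, k2l = rhs(x + 0.5 * h * k1x, t + 0.5 * h)
        k3x, k3l = rhs(x + 0.5 * h * k2x, t + 0.5 * h)
        k4x, k4l = rhs(x + h * k3x, t + h)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        delta += h * (k1l + 2 * k2l + 2 * k3l + k4l) / 6.0
        t += h
    return log_p0(x) - delta


def exact_log_p1(x1):
    """Closed form for this toy field: p_1 is a Gaussian with variance exp(a)."""
    return (-0.5 * x1 * x1 * math.exp(-A_COEF)
            - 0.5 * math.log(2.0 * math.pi) - 0.5 * A_COEF)


print(log_likelihood(1.3), exact_log_p1(1.3))
```

The two printed values should agree closely, since the toy flow maps $𝑥_{0}$ to $𝑥_{1} = 𝑥_{0} 𝑒^{𝑎 / 2}$ and therefore turns the standard normal into a Gaussian with variance $𝑒^{𝑎}$.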

The main benefits of continuous NF are:

The constraints one needs to impose on $𝑢$ are much less stringent than in the discrete case²
Inverting the flow can be achieved by simply solving the ODE in reverse
Computing the likelihood requires neither inverting the flow nor computing a log-determinant; only the trace of the Jacobian is needed, which can be approximated using the Hutchinson trick.³
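
To illustrate the second point, here is a minimal sketch of inverting a flow by integrating the same ODE with the time direction reversed. The field `u` below is a hypothetical smooth 1-D function standing in for $𝑢_{𝜃}$, and the fixed-step RK4 integrator is an assumption; an adaptive solver would be a more typical choice.

```python
import math


def u(x, t):
    """Hypothetical smooth (Lipschitz) 1-D field standing in for u_theta."""
    return math.tanh(x) * (1.0 + t)


def integrate(x, t_start, t_end, n_steps=400):
    """Integrate dx/dt = u(x, t) from t_start to t_end with fixed-step RK4."""
    h = (t_end - t_start) / n_steps  # negative when t_end < t_start
    t = t_start
    for _ in range(n_steps):
        k1 = u(x, t)
        k2 = u(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = u(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = u(x + h * k3, t + h)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += h
    return x


x0 = 0.7
x1 = integrate(x0, 0.0, 1.0)      # forward flow: t = 0 -> t = 1
x0_rec = integrate(x1, 1.0, 0.0)  # inverse flow: same field, reversed time
print(x0, x1, x0_rec)
```

No separate inverse network is needed: the recovered point matches the starting point up to the integrator's discretization error.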

However, training a neural ODE with log-likelihood does not scale well to high-dimensional spaces, and the process tends to be unstable, likely due to numerical approximations and to the (infinite) number of possible probability paths.

¹ Starting from the transport equation in the Eulerian perspective:
$\frac{\partial 𝑝_{𝑡} (𝑥)}{\partial 𝑡} = - (\nabla \cdot (𝑢_{𝜃} 𝑝_{𝑡})) (𝑥)$
Expanding the divergence:
$\frac{\partial 𝑝_{𝑡} (𝑥)}{\partial 𝑡} = - 𝑝_{𝑡} (𝑥) (\nabla \cdot 𝑢_{𝜃}) (𝑥) - 𝑢_{𝜃} (𝑥) \cdot \nabla 𝑝_{𝑡} (𝑥)$
Dividing by $𝑝_{𝑡} (𝑥)$ and using $\nabla 𝑝_{𝑡} / 𝑝_{𝑡} = \nabla \log 𝑝_{𝑡}$:
$\frac{\partial \log 𝑝_{𝑡} (𝑥)}{\partial 𝑡} = - (\nabla \cdot 𝑢_{𝜃}) (𝑥) - 𝑢_{𝜃} (𝑥) \cdot \nabla \log 𝑝_{𝑡} (𝑥)$
For the Lagrangian perspective, we consider the total derivative along a particle trajectory $𝑥_{𝑡}$ satisfying $\frac{𝑑 𝑥_{𝑡}}{𝑑 𝑡} = 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡)$:
$\frac{𝑑}{𝑑 𝑡} \log 𝑝_{𝑡} (𝑥_{𝑡}) = \left. \frac{\partial \log 𝑝_{𝑡}}{\partial 𝑡} \right|_{𝑥_{𝑡}} + \nabla \log 𝑝_{𝑡} (𝑥_{𝑡}) \cdot \frac{𝑑 𝑥_{𝑡}}{𝑑 𝑡}$
Substituting $\frac{𝑑 𝑥_{𝑡}}{𝑑 𝑡} = 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡)$:
$\frac{𝑑}{𝑑 𝑡} \log 𝑝_{𝑡} (𝑥_{𝑡}) = \left. \frac{\partial \log 𝑝_{𝑡}}{\partial 𝑡} \right|_{𝑥_{𝑡}} + \nabla \log 𝑝_{𝑡} (𝑥_{𝑡}) \cdot 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡)$
Using the Eulerian result above, evaluated at $𝑥 = 𝑥_{𝑡}$:
$\begin{aligned} \frac{𝑑}{𝑑 𝑡} \log 𝑝_{𝑡} (𝑥_{𝑡}) & = - (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡}) - 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡) \cdot \nabla \log 𝑝_{𝑡} (𝑥_{𝑡}) + \nabla \log 𝑝_{𝑡} (𝑥_{𝑡}) \cdot 𝑢_{𝜃} (𝑥_{𝑡}, 𝑡) \\ & = - (\nabla \cdot 𝑢_{𝜃}) (𝑥_{𝑡}) \end{aligned}$
The last two terms cancel out, yielding the instantaneous change of variables formula.
² Note that the function $𝑓$ in the discrete case needs to be invertible, which is a strong constraint.
³ The Hutchinson trick estimates the trace of a matrix $𝐴$ by averaging $𝑣^{𝑇} 𝐴 𝑣$ over random vectors $𝑣$ with zero mean and identity covariance (for example standard Gaussian or Rademacher vectors), avoiding explicit computation of the full Jacobian.
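
A minimal sketch of the Hutchinson estimator in plain Python follows. The names `hutchinson_trace` and `make_matvec` and the small matrix `J` are hypothetical, introduced only for this example. The estimator only ever touches the matrix through a matrix-vector product; in a CNF that product would presumably come from automatic differentiation of $𝑢_{𝜃}$ rather than from an explicit Jacobian, which is an assumption about the setup, not something shown here.

```python
import random


def hutchinson_trace(matvec, dim, n_samples=20000, seed=0):
    """Estimate tr(A) as the mean of v^T (A v) over Rademacher vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Entries are +1 or -1 with equal probability:
        # zero mean and identity covariance.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / n_samples


def make_matvec(a):
    """Wrap an explicit matrix (list of rows) as a matrix-vector product."""
    def matvec(v):
        return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]
    return matvec


# Hypothetical 3x3 "Jacobian", used only to check the estimator.
J = [[2.0, 1.0, 0.0],
     [0.5, -1.0, 0.3],
     [0.0, 0.2, 3.0]]
exact = sum(J[i][i] for i in range(3))
estimate = hutchinson_trace(make_matvec(J), dim=3)
print(exact, estimate)
```

The estimate is unbiased but noisy: its error shrinks roughly as one over the square root of the number of samples, and with Rademacher vectors it is exact whenever the matrix is diagonal.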
