Direct Modeling

June 6, 2025

by Leonardo

1. Directly model $𝑝 (𝑥)$

1.1. Hopfield Network

Associative memory networks associate an input with the most similar stored pattern. Their purpose is to store and retrieve patterns.

The energy of a Hopfield network with states $𝑦_{𝑖} \in \{- 1, + 1\}$ is $𝐸 = - 𝐷 = - \sum_{𝑖 < 𝑗} 𝑤_{𝑖 𝑗} 𝑦_{𝑖} 𝑦_{𝑗} - \sum_{𝑖} 𝑏_{𝑖} 𝑦_{𝑖} = - \frac{1}{2} 𝑦^{𝑇} 𝑊 𝑦 - 𝑏^{𝑇} 𝑦$ , where $𝑊$ is symmetric with zero diagonal.

At each time step, each neuron receives a "field" $𝑧_{𝑖} = \sum_{𝑗 \neq 𝑖} 𝑤_{𝑖 𝑗} 𝑦_{𝑗} + 𝑏_{𝑖}$ . If $𝑧_{𝑖} > 0$ , then $𝑦_{𝑖} = 1$ is preferred; if $𝑧_{𝑖} < 0$ , then $𝑦_{𝑖} = - 1$ is preferred. If every neuron agrees with its field, $𝐷$ is at a local maximum (equivalently, $𝐸$ is at a local minimum). Otherwise, we flip the neurons that disagree with their fields.

Flipping a neuron from $𝑦_{𝑖}^{-}$ to $𝑦_{𝑖}^{+} = - 𝑦_{𝑖}^{-}$ so that it agrees with its field changes $𝐷$ by $Δ 𝐷 = 𝑦_{𝑖}^{+} (\sum_{𝑗 \neq 𝑖} 𝑤_{𝑖 𝑗} 𝑦_{𝑗} + 𝑏_{𝑖}) - 𝑦_{𝑖}^{-} (\sum_{𝑗 \neq 𝑖} 𝑤_{𝑖 𝑗} 𝑦_{𝑗} + 𝑏_{𝑖}) = 2 𝑦_{𝑖}^{+} (\sum_{𝑗 \neq 𝑖} 𝑤_{𝑖 𝑗} 𝑦_{𝑗} + 𝑏_{𝑖}) > 0$ . Each such flip strictly increases $𝐷$ , the number of states is finite, and $𝐷 \leq \sum_{𝑖 < 𝑗} | 𝑤_{𝑖 𝑗} | + \sum_{𝑖} | 𝑏_{𝑖} |$ , so the process converges after a finite number of flips.

Training: to store $𝑁_{𝑝}$ patterns, we use the Hebbian learning rule and set $𝑤_{𝑖 𝑗} \leftarrow \frac{1}{𝑁_{𝑝}} \sum_{𝑝} 𝑦_{𝑖}^{𝑝} 𝑦_{𝑗}^{𝑝}$ , which lowers the energy of the stored patterns. However, this may create spurious local optima.
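
The storage and retrieval steps above can be sketched in a few lines. This is a minimal illustration, assuming zero biases, asynchronous updates in a fixed order, and two hand-picked 6-bit patterns; the function names `hebbian_weights`, `energy`, and `retrieve` are hypothetical helpers, not part of any library.

```python
def hebbian_weights(patterns):
    """Hebbian rule: w_ij = (1/N_p) * sum_p y_i^p y_j^p, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / len(patterns)
    return w

def energy(w, b, y):
    """E = -sum_{i<j} w_ij y_i y_j - sum_i b_i y_i."""
    n = len(y)
    pair = sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b[i] * y[i] for i in range(n))

def retrieve(w, b, y, max_sweeps=100):
    """Flip every neuron that disagrees with its field until nothing changes."""
    y = list(y)
    n = len(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            z = sum(w[i][j] * y[j] for j in range(n) if j != i) + b[i]
            if z * y[i] < 0:  # neuron disagrees with its field, so flip it
                y[i] = -y[i]
                changed = True
        if not changed:
            break
    return y

# Two illustrative patterns (an assumption of this sketch).
patterns = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
w = hebbian_weights(patterns)
b = [0.0] * 6
noisy = [-1, -1, 1, -1, 1, -1]  # first pattern with its first bit flipped
recovered = retrieve(w, b, noisy)
```

Here the corrupted input settles back into the first stored pattern, and the energy of the recovered state is lower than that of the noisy input, as the convergence argument predicts.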

How can we prevent spurious local optima? We have to modify the energy landscape. We can use negative samples for contrastive learning: besides minimizing $𝐸$ on the desired patterns $𝑃$ , we maximize $𝐸$ on all non-desired patterns, $𝑊 = \arg \min_{𝑊} \sum_{𝑦 \in 𝑃} 𝐸 (𝑦) - \sum_{𝑦^{'} \notin 𝑃} 𝐸 (𝑦^{'})$ . Since $\nabla_{𝑊} 𝐸 (𝑦) \propto - 𝑦 𝑦^{𝑇}$ , gradient descent on this objective gives $𝑊_{𝑡 + 1} = 𝑊_{𝑡} + 𝜂 (\sum_{𝑦 \in 𝑃} 𝑦 𝑦^{𝑇} - \sum_{𝑦^{'} \notin 𝑃} 𝑦^{'} 𝑦^{' 𝑇})$ , with constant factors absorbed into $𝜂$ .

Only the valleys of non-desired patterns can misdirect the model, so we restrict the negative term to them, lowering the energy of the desired patterns and raising the non-desired valleys: $𝑊_{𝑡 + 1} = 𝑊_{𝑡} + 𝜂 (\sum_{𝑦 \in 𝑃} 𝑦 𝑦^{𝑇} - \sum_{𝑦^{'} \notin 𝑃 \land 𝑦^{'} \in valley} 𝑦^{'} 𝑦^{' 𝑇})$ . The most misleading valleys are those close to the desired ones. To find them, we initialize $𝑦^{'}$ at each desired pattern, add small random noise, and run the network evolution.

But there is another problem: naively raising a valley may hurt the learned representation. Thus we only run the evolution of $𝑦^{'}$ for a few timesteps.

1.2. Boltzmann Machine

Given an energy function $𝐸_{𝑇} (𝑆)$ , a system that follows a proper physical evolution process converges to the Boltzmann distribution $𝑃_{𝑇} (𝑆) = \frac{1}{𝑍} \exp (- \frac{𝐸_{𝑇} (𝑆)}{𝑘 𝑇})$ . We can generate patterns by sampling from $𝑃_{𝑇} (𝑆)$ , which turns the deterministic process into a probabilistic one.

$𝑃 (𝑦_{𝑖} = 1 | 𝑦_{𝑗 \neq 𝑖}) = \frac{𝑒^{- 𝐸_{𝑦_{𝑖} = 1}}}{𝑒^{- 𝐸_{𝑦_{𝑖} = 1}} + 𝑒^{- 𝐸_{𝑦_{𝑖} = - 1}}} = \frac{1}{1 + 𝑒^{- (𝐸_{𝑦_{𝑖} = - 1} - 𝐸_{𝑦_{𝑖} = 1})}} = \frac{1}{1 + \exp (- 2 \sum_{𝑗 \neq 𝑖} 𝑤_{𝑖 𝑗} 𝑦_{𝑗} - 2 𝑏_{𝑖})} = 𝜎 (2 𝑧_{𝑖})$ . Thus the network evolution becomes stochastic: $𝑦_{𝑖}^{𝑡 + 1} = 1$ with probability $𝜎 (2 𝑧_{𝑖} (𝑡))$ and $𝑦_{𝑖}^{𝑡 + 1} = - 1$ otherwise. Retrieving a stored pattern can be done by averaging the final $𝑀$ samples and thresholding: $𝑦_{𝑖} = 1$ if $\frac{1}{𝑀} \sum_{𝑡 = 𝐿 - 𝑀 + 1}^{𝐿} 𝑦_{𝑖} (𝑡) > 0$ and $𝑦_{𝑖} = - 1$ otherwise (we can take $𝑀 = 1$ for simplicity).
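
The stochastic update can be checked on a machine small enough to enumerate. The sketch below assumes $\pm 1$ units at unit temperature, for which the conditional above equals $𝜎 (2 𝑧_{𝑖})$ , and a hypothetical two-neuron network with a single weight of 0.5 and no biases. It compares how often the two neurons agree under sampling with the exact Boltzmann probability.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gibbs_sweep(w, b, y, rng):
    """One pass of stochastic updates over all neurons (modifies y in place)."""
    n = len(y)
    for i in range(n):
        z = sum(w[i][j] * y[j] for j in range(n) if j != i) + b[i]
        # P(y_i = +1 | rest) = sigmoid(2 * z_i) for +/-1 units
        y[i] = 1 if rng.random() < sigmoid(2.0 * z) else -1
    return y

# Hypothetical two-neuron machine: E(y) = -0.5 * y_1 * y_2.
w = [[0.0, 0.5], [0.5, 0.0]]
b = [0.0, 0.0]
rng = random.Random(0)
y = [1, -1]
n_sweeps = 20_000
aligned = 0
for _ in range(n_sweeps):
    gibbs_sweep(w, b, y, rng)
    aligned += (y[0] == y[1])
empirical = aligned / n_sweeps

# Exact Boltzmann probability that the two neurons agree:
# two aligned states with weight e^{0.5}, two anti-aligned with e^{-0.5}.
exact = math.exp(0.5) / (math.exp(0.5) + math.exp(-0.5))
```

With enough sweeps, `empirical` should land close to `exact`, which illustrates that the sampled states follow the Boltzmann distribution rather than settling into a single minimum.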

Training (biases omitted for brevity): $𝑃 (𝑦) = \frac{\exp (\frac{1}{2} 𝑦^{𝑇} 𝑊 𝑦)}{\sum_{𝑦^{'}} \exp (\frac{1}{2} 𝑦^{' 𝑇} 𝑊 𝑦^{'})}$ , with log-likelihood $𝐿 (𝑊) = \frac{1}{𝑁_{𝑝}} \sum_{𝑦 \in 𝑃} \log 𝑃 (𝑦) = \frac{1}{𝑁_{𝑝}} \sum_{𝑦 \in 𝑃} \frac{1}{2} 𝑦^{𝑇} 𝑊 𝑦 - \log \sum_{𝑦^{'}} \exp (\frac{1}{2} 𝑦^{' 𝑇} 𝑊 𝑦^{'})$ . To maximize $𝐿 (𝑊)$ , we use stochastic gradient ascent: $\nabla_{𝑤_{𝑖 𝑗}} 𝐿 = \frac{1}{𝑁_{𝑝}} \sum_{𝑦 \in 𝑃} 𝑦_{𝑖} 𝑦_{𝑗} - \sum_{𝑦^{'}} \frac{\exp (\frac{1}{2} 𝑦^{' 𝑇} 𝑊 𝑦^{'})}{𝑍} 𝑦_{𝑖}^{'} 𝑦_{𝑗}^{'} = \frac{1}{𝑁_{𝑝}} \sum_{𝑦 \in 𝑃} 𝑦_{𝑖} 𝑦_{𝑗} - 𝐸_{𝑦^{'}} [𝑦_{𝑖}^{'} 𝑦_{𝑗}^{'}] \approx \frac{1}{𝑁_{𝑝}} \sum_{𝑦 \in 𝑃} 𝑦_{𝑖} 𝑦_{𝑗} - \frac{1}{| 𝑆 |} \sum_{𝑦^{'} \in 𝑆} 𝑦_{𝑖}^{'} 𝑦_{𝑗}^{'}$ , where $𝑍 = \sum_{𝑦^{'}} \exp (\frac{1}{2} 𝑦^{' 𝑇} 𝑊 𝑦^{'})$ is the partition function and $𝑆$ is a set of samples drawn from the model.

We use the Restricted Boltzmann Machine (RBM) for faster mixing of Gibbs sampling. It has hidden neurons and visible neurons, with no intra-layer connections. Previously we sampled every neuron in turn, and Gibbs sampling guarantees convergence; now we can sample the hidden neurons and the visible neurons alternately. Since there are no connections within the same layer, all neurons in the same layer can be sampled in parallel.

1.2.1. Sampling

Sampling from $𝑃 (𝑥)$ :

Inverse Transform Sampling:

If a random variable $𝑋$ has cumulative distribution function (CDF) $𝐹 (𝑥) = 𝑃 (𝑋 \leq 𝑥)$ , we can sample a value $𝑢$ from $𝑈 (0, 1)$ and set $𝑥 = 𝐹^{- 1} (𝑢)$ . This only works if we can compute the CDF and its inverse.
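
As a small illustration, the exponential distribution has $𝐹 (𝑥) = 1 - 𝑒^{- 𝜆 𝑥}$ and therefore $𝐹^{- 1} (𝑢) = - \ln (1 - 𝑢) / 𝜆$ . The sketch below uses this closed form; the choice of the exponential distribution and the helper name `sample_exponential` are assumptions made for the example.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse transform sampling: x = F^{-1}(u) with F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                   # u in [0, 1)
    return -math.log(1.0 - u) / rate   # 1 - u is in (0, 1], so the log is finite

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)     # should be close to 1 / rate = 0.5
```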

Box-Muller Transform:

This is a specialized and highly efficient method for sampling from the standard normal (Gaussian) distribution. First draw $𝑢_{1}$ and $𝑢_{2}$ independently from the standard uniform distribution $𝑈 (0, 1)$ . Then transform them: $𝑧_{1} = \sqrt{- 2 \ln (𝑢_{1})} \cdot \cos (2 𝜋 𝑢_{2})$ , $𝑧_{2} = \sqrt{- 2 \ln (𝑢_{1})} \cdot \sin (2 𝜋 𝑢_{2})$ . The results $𝑧_{1}$ and $𝑧_{2}$ are independent samples from the standard normal distribution $𝑁 (0, 1)$ .
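
The transform is short enough to write out directly. This sketch uses only the standard library; shifting $𝑢_{1}$ into $(0, 1]$ to avoid $\ln 0$ is an implementation detail added here, not part of the formula above.

```python
import math
import random

def box_muller(rng):
    """Return two independent N(0, 1) samples from two uniform samples."""
    u1 = 1.0 - rng.random()   # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

rng = random.Random(0)
samples = []
for _ in range(50_000):
    samples.extend(box_muller(rng))

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance should be close to 0 and 1 respectively.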

Importance Sampling:

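When we cannot sample from $𝑃 (𝑥)$ directly, we can still estimate expectations under it by sampling from a proposal distribution $𝑞 (𝑥)$ and weighting each sample by the ratio $\frac{𝑃 (𝑥)}{𝑞 (𝑥)}$ : $𝐸_{𝑥 \sim 𝑃} [𝑓 (𝑥)] = 𝐸_{𝑥 \sim 𝑞} [𝑓 (𝑥) \frac{𝑃 (𝑥)}{𝑞 (𝑥)}]$ .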
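
A minimal sketch of the weighting idea, assuming a standard normal target $𝑃$ and a wider normal proposal $𝑞 = 𝑁 (0, 2^{2})$ (both chosen only for illustration). It estimates the second moment of $𝑃$ , whose true value is 1, using samples drawn from $𝑞$ alone.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n, rng):
    """Estimate E_P[f(x)] with P = N(0, 1) using samples from q = N(0, 2^2)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                                      # x ~ q
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # P(x) / q(x)
        total += f(x) * weight
    return total / n

rng = random.Random(0)
second_moment = importance_estimate(lambda x: x * x, 100_000, rng)
```

The proposal should cover the region where $𝑃$ has mass; a proposal with lighter tails than the target can make the weights, and hence the estimate, very noisy.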

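MCMC (Markov chain Monte Carlo): construct a Markov chain whose stationary distribution is the target $𝑃 (𝑥)$ and use its states as samples. The relevant properties of the chain are: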
- Irreducibility: A Markov chain is irreducible if, for any two states $𝑖, 𝑗$ , there exists an integer $𝑛 > 0$ such that: $𝑃_{𝑖 𝑗}^{𝑛} > 0$ . This means it is possible to reach any state from any other state in a finite number of steps.
- Aperiodicity: A Markov chain is aperiodic if, for every state $𝑖$ , the greatest common divisor of the set $\{𝑛 \geq 1 : 𝑃_{𝑖 𝑖}^{𝑛} > 0\}$ is 1: $\gcd \{𝑛 \geq 1 : 𝑃_{𝑖 𝑖}^{𝑛} > 0\} = 1$ . This ensures the chain does not get trapped in cycles.
- Existence of Stationary Distribution: There exists a distribution $𝜋$ such that: $𝜋_{𝑗} = \sum_{𝑖} 𝜋_{𝑖} 𝑃_{𝑖 𝑗}$ or in vector notation, $𝜋 = 𝜋 𝑃$ . This $𝜋$ is called the stationary or invariant distribution.
- Detailed Balance (Reversibility): A stronger condition than stationarity, detailed balance requires that for all states $𝑖, 𝑗$ : $𝜋_{𝑖} 𝑃_{𝑖 𝑗} = 𝜋_{𝑗} 𝑃_{𝑗 𝑖}$ . If this holds, then $𝜋$ is a stationary distribution for $𝑃$ .
1. Metropolis-Hastings (M-H) Algorithm:
For $𝑡 = 0, 1, 2, \dots$ :
- Draw a candidate $𝑥^{'}$ from a proposal distribution $𝑞 (𝑥^{'} | 𝑥_{𝑡})$ .
- Compute the acceptance probability $𝛼 (𝑥^{'} | 𝑥_{𝑡}) = \min (1, \frac{𝑃 (𝑥^{'}) \cdot 𝑞 (𝑥_{𝑡} | 𝑥^{'})}{𝑃 (𝑥_{𝑡}) \cdot 𝑞 (𝑥^{'} | 𝑥_{𝑡})})$ . The formula simplifies to $\min (1, \frac{𝑃 (𝑥^{'})}{𝑃 (𝑥_{𝑡})})$ if the proposal is symmetric, i.e. $𝑞 (𝑥^{'} | 𝑥_{𝑡}) = 𝑞 (𝑥_{𝑡} | 𝑥^{'})$ .
- Draw a uniform random number $𝑢$ from $𝑈 (0, 1)$ .
  - If $𝑢 \leq 𝛼 (𝑥^{'} | 𝑥_{𝑡})$ , then $𝑥_{𝑡 + 1} = 𝑥^{'}$ .
  - Otherwise, $𝑥_{𝑡 + 1} = 𝑥_{𝑡}$ .
2. Gibbs Sampling:
A special case of the M-H algorithm that is particularly efficient for multidimensional distributions, especially when the conditional distribution of each dimension is easy to sample from.
- Randomly initialize $𝑥^{0} = (𝑥_{1}^{0}, 𝑥_{2}^{0}, \dots, 𝑥_{𝑑}^{0})$ .
- For $𝑡 = 1, 2, \dots$ :
  - For $𝑖 = 1, 2, \dots, 𝑑$ :
    - Sample $𝑥_{𝑖}^{𝑡}$ from $𝑃 (𝑥_{𝑖} | 𝑥_{1}^{𝑡}, 𝑥_{2}^{𝑡}, \dots, 𝑥_{𝑖 - 1}^{𝑡}, 𝑥_{𝑖 + 1}^{𝑡 - 1}, \dots, 𝑥_{𝑑}^{𝑡 - 1})$ .
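
Both samplers can be sketched on toy targets. The code below is an illustration under stated assumptions: the M-H example uses a symmetric Gaussian random-walk proposal and an unnormalized standard normal target, and the Gibbs example uses a zero-mean, unit-variance bivariate normal with correlation 0.8, whose conditionals are $𝑥_{1} | 𝑥_{2} \sim 𝑁 (𝜌 𝑥_{2}, 1 - 𝜌^{2})$ and symmetrically for $𝑥_{2}$ . The function names are hypothetical.

```python
import math
import random

def metropolis_hastings(log_p, x0, n_steps, step, rng):
    """Random-walk M-H with the symmetric proposal q(x' | x) = N(x, step^2)."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)        # draw a candidate
        log_ratio = log_p(x_new) - log_p(x)     # the symmetric q cancels
        alpha = math.exp(min(0.0, log_ratio))   # acceptance probability
        if rng.random() <= alpha:
            x = x_new                           # accept the candidate
        chain.append(x)                         # on rejection, x_t is repeated
    return chain

def gibbs_bivariate_normal(rho, n_steps, rng):
    """Gibbs sampling: alternate draws from the two exact conditionals."""
    x1, x2, chain = 0.0, 0.0, []
    cond_std = math.sqrt(1.0 - rho * rho)
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, cond_std)      # x1 | x2
        x2 = rng.gauss(rho * x1, cond_std)      # x2 | x1, using the fresh x1
        chain.append((x1, x2))
    return chain

rng = random.Random(0)

# Target known only up to a constant: log P(x) = -x^2 / 2 + const.
mh_chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 50_000, 2.0, rng)
mh_mean = sum(mh_chain) / len(mh_chain)
mh_var = sum(x * x for x in mh_chain) / len(mh_chain) - mh_mean ** 2

gibbs_chain = gibbs_bivariate_normal(0.8, 50_000, rng)
gibbs_cross_moment = sum(a * b for a, b in gibbs_chain) / len(gibbs_chain)
```

Note that M-H only needs $𝑃$ up to a normalizing constant, since the constant cancels in the ratio. The Gibbs sampler never rejects a move because each coordinate is drawn from its exact conditional.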

1.3. Normalizing Flows

Simplify the generation process: draw $𝑧$ from a simple base distribution and map it through an invertible function, $𝑧 \to 𝑥, 𝑥 = 𝑓 (𝑧)$ .

Change of Variables rule:

$𝑝 (𝑥) = 𝑝 (𝑓^{- 1} (𝑥)) | \det (\frac{\partial 𝑓^{- 1} (𝑥)}{\partial 𝑥}) | = \frac{𝑝 (𝑧)}{| \det (\frac{\partial 𝑓 (𝑧)}{\partial 𝑧}) |}$

For a composition $𝑓 = 𝑓_{𝐾} ⚬ \dots ⚬ 𝑓_{2} ⚬ 𝑓_{1}$ with $𝑥_{0} = 𝑧$ , $𝑥_{𝑘} = 𝑓_{𝑘} (𝑥_{𝑘 - 1})$ and $𝑥_{𝐾} = 𝑥$ : $\log 𝑝 (𝑥) = \log 𝑝 (𝑓^{- 1} (𝑥)) - \sum_{𝑘 = 1}^{𝐾} \log | \det (\frac{\partial 𝑓_{𝑘} (𝑥_{𝑘 - 1})}{\partial 𝑥_{𝑘 - 1}}) |$
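
A one-dimensional worked example may help check the signs. Assume a standard normal base density and the two-step flow $𝑓_{1} (𝑧) = 𝜇 + 𝜎 𝑧$ , $𝑓_{2} (𝑦) = 𝑒^{𝑦}$ with $𝜎 > 0$ (an illustrative choice, not taken from the text above). The resulting $𝑥$ is log-normal, so the flow density can be compared against the known closed form.

```python
import math

def log_standard_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(x, mu, sigma):
    """log p(x) for x = f2(f1(z)), f1(z) = mu + sigma * z, f2(y) = exp(y)."""
    y = math.log(x)                      # x_1 = f2^{-1}(x)
    z = (y - mu) / sigma                 # x_0 = f1^{-1}(x_1)
    log_det_f1 = math.log(abs(sigma))    # d f1 / dz = sigma
    log_det_f2 = y                       # d f2 / dy = exp(y), so log|exp(y)| = y
    # Base log-density minus the log-determinants of the forward maps.
    return log_standard_normal(z) - log_det_f1 - log_det_f2

def lognormal_log_pdf(x, mu, sigma):
    """Closed-form log-density of the log-normal distribution, for comparison."""
    return (-math.log(x * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

from_flow = flow_log_density(2.0, 0.3, 0.5)
closed_form = lognormal_log_pdf(2.0, 0.3, 0.5)
```

The two values agree up to floating-point error, which confirms that the log-determinants of the forward maps are subtracted from the base log-density.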

1.3.1. Planar Flow

$𝑓 (𝑧) = 𝑧 + 𝑢 ℎ (𝑤^{𝑇} 𝑧 + 𝑏)$

where $𝜆 = \{𝑤 \in 𝑅^{𝐷}, 𝑢 \in 𝑅^{𝐷}, 𝑏 \in 𝑅\}$ are free parameters and $ℎ (\cdot)$ is a smooth element-wise non-linearity (e.g. tanh).

$| \det (\frac{\partial 𝑓}{\partial 𝑧}) | = | \det (𝐼 + 𝑢 ℎ^{'} (𝑤^{𝑇} 𝑧 + 𝑏) 𝑤^{𝑇}) | = | 1 + 𝑢^{𝑇} 𝑤 ℎ^{'} (𝑤^{𝑇} 𝑧 + 𝑏) |$
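
The determinant identity can be verified numerically in two dimensions. The sketch below assumes $ℎ = \tanh$ and arbitrary illustrative parameter values; `numerical_log_abs_det_2d` is a hypothetical helper that builds the Jacobian by central finite differences.

```python
import math

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w^T z + b)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return [zi + ui * math.tanh(a) for zi, ui in zip(z, u)]

def planar_log_abs_det(z, u, w, b):
    """log|1 + u^T w * h'(w^T z + b)| with h = tanh, h' = 1 - tanh^2."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h_prime = 1.0 - math.tanh(a) ** 2
    return math.log(abs(1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))))

def numerical_log_abs_det_2d(z, u, w, b, eps=1e-5):
    """Finite-difference Jacobian determinant, for checking the formula only."""
    cols = []
    for j in range(2):
        plus, minus = list(z), list(z)
        plus[j] += eps
        minus[j] -= eps
        f_plus = planar_flow(plus, u, w, b)
        f_minus = planar_flow(minus, u, w, b)
        # cols[j][i] approximates d f_i / d z_j
        cols.append([(p - q) / (2.0 * eps) for p, q in zip(f_plus, f_minus)])
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z, u, w, b = [0.3, -0.2], [0.5, -0.4], [1.0, 2.0], 0.1
analytic = planar_log_abs_det(z, u, w, b)
numeric = numerical_log_abs_det_2d(z, u, w, b)
```

The analytic value costs $𝑂 (𝐷)$ to compute, whereas a general $𝐷 \times 𝐷$ determinant would cost $𝑂 (𝐷^{3})$ .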

The flow defined by the transformation modifies the initial density $𝑞_{0}$ by applying a series of contractions and expansions in the direction perpendicular to the hyperplane $𝑤^{𝑇} 𝑧 + 𝑏 = 0$ , hence we refer to these maps as planar flows.

However, the inverse $𝑓^{- 1}$ does not admit an analytical expression in general, and one must resort to iterative algorithms such as Newton's method to approximate it.

1.3.2. NICE (Non-linear Independent Component Estimation)

We denote $𝑧 = 𝑓 (𝑥)$ and call $𝑓$ the encoder and its inverse $𝑓^{- 1}$ the decoder. With $𝑓^{- 1}$ given, we can generate $𝑥$ from $𝑧$ by ancestral sampling.

We want "easy determinant of the Jacobian" and "easy inverse".

General Coupling Layer: Partition $𝑥$ into two disjoint subsets $𝑥_{𝐼_{1}}$ and $𝑥_{𝐼_{2}}$ . We can define $ℎ = (ℎ_{𝐼_{1}}, ℎ_{𝐼_{2}})$ , where $ℎ_{𝐼_{1}} = 𝑥_{𝐼_{1}}, ℎ_{𝐼_{2}} = 𝑔 (𝑥_{𝐼_{2}}; 𝑚 (𝑥_{𝐼_{1}}))$ , $𝑚$ is a function (e.g. MLP) and $𝑔$ is the coupling law. We consider $𝐼_{1} = [[1, 𝑑]]$ and $𝐼_{2} = [[𝑑 + 1, 𝐷]]$ , then

$\frac{\partial ℎ}{\partial 𝑥} = (\begin{matrix} 𝐼_{𝑑} & 0 \\ \frac{\partial ℎ_{𝐼_{2}}}{\partial 𝑥_{𝐼_{1}}} & \frac{\partial ℎ_{𝐼_{2}}}{\partial 𝑥_{𝐼_{2}}} \end{matrix}), \det \frac{\partial ℎ}{\partial 𝑥} = \det (\frac{\partial ℎ_{𝐼_{2}}}{\partial 𝑥_{𝐼_{2}}})$

For simplicity, we choose additive coupling law $𝑔 (𝑎; 𝑏) = 𝑎 + 𝑏$ . Thus $𝑥_{𝐼_{1}} = ℎ_{𝐼_{1}}, 𝑥_{𝐼_{2}} = ℎ_{𝐼_{2}} - 𝑚 (ℎ_{𝐼_{1}}), \det \frac{\partial ℎ}{\partial 𝑥} = 1$ . Each transformation is simple, so we stack multiple layers to get a complex transformation. $𝑥 = ℎ^{0} \leftrightarrow ℎ^{1} \leftrightarrow \dots \leftrightarrow ℎ^{𝐾} = 𝑧, \det \frac{\partial 𝑧}{\partial 𝑥} = 1$ .

Combining coupling layers: Since a coupling layer leaves part of its input unchanged, we need to exchange the role of the two subsets in the partition in alternating layers, so that the composition of two coupling layers modifies every dimension.
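
A small sketch of stacked additive coupling layers with alternating roles. The coupling function `m` below is a hypothetical fixed non-linearity standing in for an MLP, and the input is split into two halves of size 2; both are assumptions of this example. The point is that the inverse is exact no matter how complicated `m` is, because `m` itself is never inverted.

```python
import math

def m(a):
    """Hypothetical coupling function (stand-in for an MLP); returns 2 values."""
    total = sum(a)
    return [math.sin(total), total * total]

def coupling_forward(x, swap):
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    if swap:  # second half passes through, first half is shifted
        return [a + c for a, c in zip(x1, m(x2))] + x2
    return x1 + [a + c for a, c in zip(x2, m(x1))]

def coupling_inverse(h, swap):
    d = len(h) // 2
    h1, h2 = h[:d], h[d:]
    if swap:
        return [a - c for a, c in zip(h1, m(h2))] + h2
    return h1 + [a - c for a, c in zip(h2, m(h1))]

def encode(x, n_layers=4):
    h = x
    for k in range(n_layers):
        h = coupling_forward(h, swap=(k % 2 == 1))   # alternate the roles
    return h

def decode(z, n_layers=4):
    h = z
    for k in reversed(range(n_layers)):              # undo layers in reverse order
        h = coupling_inverse(h, swap=(k % 2 == 1))
    return h

x = [0.5, -1.0, 2.0, 0.1]
z = encode(x)
x_back = decode(z)
```

After two layers every coordinate has been modified at least once, and `decode(encode(x))` recovers `x` up to floating-point error.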

Re-Scaling Layer: we include a diagonal scaling matrix $𝑆$ as the top layer to allow non-unit volume transformations. It multiplies the $𝑖$ -th output value by $𝑆_{𝑖 𝑖}$ : ${(𝑥_{𝑖})}_{𝑖 \leq 𝐷} \to {(𝑆_{𝑖 𝑖} 𝑥_{𝑖})}_{𝑖 \leq 𝐷}$ . This allows the learner to give more weight (i.e. model more variation) on some dimensions and less on others. In the limit where $𝑆_{𝑖 𝑖}$ goes to $+ \infty$ for some $𝑖$ , the effective dimensionality of the data has been reduced by 1. We can relate these scaling factors to the eigenspectrum of a PCA, showing how much variation is present in each of the latent dimensions (the larger $𝑆_{𝑖 𝑖}$ is, the less important the dimension $𝑖$ is).

Inpainting: we clamp the observed dimensions ( $𝑥_{𝑂}$ ) to their values and maximize the log-likelihood with respect to the hidden dimensions ( $𝑥_{𝐻}$ ) using projected gradient ascent (to keep the input in its original interval of values) with Gaussian noise and step size $𝛼_{𝑖} = \frac{10}{100 + 𝑖}$ , where $𝑖$ is the iteration, following the stochastic gradient update:

$𝑥_{𝐻, 𝑖 + 1} = 𝑥_{𝐻, 𝑖} + 𝛼_{𝑖} (\frac{\partial \log 𝑝_{𝑋} ((𝑥_{𝑂}, 𝑥_{𝐻, 𝑖}))}{\partial 𝑥_{𝐻, 𝑖}} + 𝜀)$ , where $𝜀$ is the Gaussian noise term.

1.3.3. Real NVP

Real-valued non-volume preserving (Real NVP) transformations.

Affine Coupling Layer: $ℎ_{1 : 𝑑} = 𝑥_{1 : 𝑑}, ℎ_{𝑑 + 1 : 𝐷} = 𝑥_{𝑑 + 1 : 𝐷} ⊙ \exp (𝑠 (𝑥_{1 : 𝑑})) + 𝑡 (𝑥_{1 : 𝑑})$ , where $𝑠$ and $𝑡$ stand for scale and translation. The Jacobian of this transformation is

$\frac{\partial ℎ}{\partial 𝑥^{𝑇}} = (\begin{matrix} 𝐼_{𝑑} & 0 \\ \frac{\partial ℎ_{𝑑 + 1 : 𝐷}}{\partial 𝑥_{1 : 𝑑}^{𝑇}} & diag (\exp (𝑠 (𝑥_{1 : 𝑑}))) \end{matrix})$

And the inverse transformation is $𝑥_{1 : 𝑑} = ℎ_{1 : 𝑑}, 𝑥_{𝑑 + 1 : 𝐷} = (ℎ_{𝑑 + 1 : 𝐷} - 𝑡 (ℎ_{1 : 𝑑})) ⊙ \exp (- 𝑠 (ℎ_{1 : 𝑑}))$ .
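
Because the Jacobian is lower triangular, its log-determinant is the sum of the entries of $𝑠 (𝑥_{1 : 𝑑})$ . The sketch below illustrates the forward pass, this log-determinant, and the inverse. The functions `s` and `t` are hypothetical fixed non-linearities standing in for neural networks, with $𝐷 = 4$ and $𝑑 = 2$ chosen for the example.

```python
import math

def s(a):
    """Hypothetical scale function (stand-in for a network); returns D - d = 2 values."""
    return [math.tanh(sum(a))] * 2

def t(a):
    """Hypothetical translation function; returns D - d = 2 values."""
    return [0.5 * sum(a), -1.0]

def affine_forward(x, d):
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    h2 = [v * math.exp(sc) + sh for v, sc, sh in zip(x2, scale, shift)]
    log_det = sum(scale)   # log det of diag(exp(s(x_{1:d})))
    return x1 + h2, log_det

def affine_inverse(h, d):
    h1, h2 = h[:d], h[d:]
    scale, shift = s(h1), t(h1)   # h1 == x1, so s and t are recomputed exactly
    x2 = [(v - sh) * math.exp(-sc) for v, sc, sh in zip(h2, scale, shift)]
    return h1 + x2

x = [0.2, -0.7, 1.5, 0.3]
h, log_det = affine_forward(x, d=2)
x_back = affine_inverse(h, d=2)
```

As with additive coupling, neither `s` nor `t` is ever inverted, so they can be arbitrarily complex.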

1.3.4. GLOW

NICE exchanges the partitions, Real NVP shuffles them, and GLOW uses learnable, invertible 1x1 convolutions as mixing layers. For an input with spatial size $ℎ \times 𝑤$ (height and width, not the layer output $ℎ$ used earlier) and $𝑐$ channels, with a $𝑐 \times 𝑐$ weight matrix $𝑊$ : $\log | \det \frac{\partial conv2D (ℎ; 𝑊)}{\partial ℎ} | = ℎ \cdot 𝑤 \cdot \log | \det 𝑊 |$ . The cost of computing or differentiating $\det 𝑊$ is $𝑂 (𝑐^{3})$ . Parameterizing $𝑊$ by its LU decomposition reduces this to $𝑂 (𝑐)$ : $𝑊 = 𝑃 𝐿 (𝑈 + diag (𝑠))$ , where $𝑃$ is a permutation matrix, $𝐿$ is a lower triangular matrix with ones on the diagonal, $𝑈$ is an upper triangular matrix with zeros on the diagonal, and $𝑠$ is a vector. Then $\log | \det 𝑊 | = \sum (\log | 𝑠 |)$ .
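
The LU identity is easy to check on a small matrix. The sketch below assumes $𝑐 = 3$ channels and arbitrary illustrative values for $𝑃$ , $𝐿$ , $𝑈$ and $𝑠$ ; `matmul` and `det3` are hypothetical helpers written out so the example needs only the standard library.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det3(mat):
    """Direct 3x3 determinant by cofactor expansion, used only for checking."""
    return (mat[0][0] * (mat[1][1] * mat[2][2] - mat[1][2] * mat[2][1])
            - mat[0][1] * (mat[1][0] * mat[2][2] - mat[1][2] * mat[2][0])
            + mat[0][2] * (mat[1][0] * mat[2][1] - mat[1][1] * mat[2][0]))

P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]                      # permutation matrix
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.3, 0.2, 1.0]]   # unit lower triangular
U = [[0.0, 0.7, -1.1], [0.0, 0.0, 0.4], [0.0, 0.0, 0.0]]   # strictly upper triangular
s = [1.5, -0.8, 2.0]                                       # diagonal entries

upper = [[U[i][j] + (s[i] if i == j else 0.0) for j in range(3)] for i in range(3)]
W = matmul(matmul(P, L), upper)

log_det_lu = sum(math.log(abs(v)) for v in s)   # O(c): just the diagonal
log_det_direct = math.log(abs(det3(W)))         # general determinant, for comparison

# Contribution of one 1x1 convolution over a feature map of the given size.
height, width = 8, 8
layer_log_det = height * width * log_det_lu
```

Since $| \det 𝑃 | = 1$ and $\det 𝐿 = 1$ , only the diagonal $𝑠$ contributes, which is why the two values match.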
