Direct Modeling

1. Directly model 𝑝(π‘₯)

1.1. Hopfield Network

Associative memories networks try to associate an input with its most similar pattern. The purpose is to store and retrieve patterns.

The Energy of Hopfield Network is 𝐸=βˆ’π·=βˆ’βˆ‘π‘–<π‘—π‘€π‘–π‘—π‘¦π‘–π‘¦π‘—βˆ’βˆ‘π‘–π‘π‘–π‘¦π‘–=βˆ’12π‘¦π‘‡π‘Šπ‘¦βˆ’π‘π‘‡π‘¦.

At each time each neuron receives a "field" 𝑧𝑖=βˆ‘π‘—β‰ π‘–π‘€π‘–π‘—π‘¦π‘—+𝑏𝑖. If 𝑧𝑖>0, then 𝑦𝑖=1 is preferred; if 𝑧𝑖<0, then 𝑦𝑖=βˆ’1 is preferred. If everything is preferred, the 𝐷 is maximized. If not, we have to flip some neurons to make it preferred.

Any flip that changes π‘¦π‘–βˆ’ to 𝑦𝑖+ increases 𝐷 by Δ𝐷=𝑦𝑖+(βˆ‘π‘—β‰ π‘–π‘€π‘–π‘—π‘¦π‘—+𝑏𝑖)βˆ’π‘¦π‘–βˆ’(βˆ‘π‘—β‰ π‘–π‘€π‘–π‘—π‘¦π‘—+𝑏𝑖)=2𝑦𝑖+(βˆ‘π‘—β‰ π‘–π‘€π‘–π‘—π‘¦π‘—+𝑏𝑖). Since π·β‰€βˆ‘π‘–<𝑗|𝑀𝑖𝑗|+βˆ‘π‘–|𝑏𝑖|, we know it converges with a finite number of flips.

Training: If we want to store 𝑁𝑝 patterns, we use Hebbian Learning Rule to set 𝑀𝑖𝑗←1π‘π‘βˆ‘π‘π‘¦π‘–π‘π‘¦π‘—π‘ to get lowest possible energy. But this may cause spurious local optima.

How to prevent spurious local optima? We have to modify the energy landscape. We can use negative samples to do contrastive learning, which means we also maximize 𝐸 for all non-desired patterns. π‘Š=argminπ‘Šβˆ‘π‘¦βˆˆπ‘ƒπΈ(𝑦)βˆ’βˆ‘π‘¦β€²βˆ‰π‘ƒπΈ(𝑦′), we can use SGD to optimize π‘Š: π‘Šπ‘‘+1=π‘Šπ‘‘βˆ’πœ‚(βˆ‘π‘¦βˆˆπ‘ƒπ‘¦π‘¦π‘‡βˆ’βˆ‘π‘¦β€²βˆ‰π‘ƒπ‘¦β€²π‘¦β€²π‘‡). Since only those shallow valleys may misdirect model, we try to broaden desired patterns valleys and narrow non-desired patterns valleys. So we update π‘Š by π‘Šπ‘‘+1=π‘Šπ‘‘βˆ’πœ‚(βˆ‘π‘¦βˆˆπ‘ƒπ‘¦π‘¦π‘‡βˆ’βˆ‘π‘¦β€²βˆ‰π‘ƒβˆ§π‘¦β€²βˆˆΒ valley𝑦′𝑦′𝑇). We initialize 𝑦′ by all the desired patterns and run evolution for 𝑦′ with small random noise to get the valley that is close to the desired valley since those are most misleading.

But there's another problem: Naively forcing a valley to raise may hurt the learned representation. Thus we only run evolution for 𝑦′ for a few timesteps.

1.2. Boltzmann Machine

Given an energy function 𝐸𝑇(𝑆), if we follow a proper physical evolution process, the system state will converge to the Boltzmann distribution 𝑃𝑇(𝑆)=1𝑍exp(βˆ’πΈπ‘‡(𝑆)π‘˜π‘‡). We can generate patterns by sampling from 𝑃𝑇(𝑆), which makes a deterministic process to a probabilistic one.

𝑃(𝑦𝑖=1|𝑦𝑗≠𝑖)=π‘’βˆ’πΈπ‘¦π‘–=1π‘’βˆ’πΈπ‘¦π‘–=1+π‘’βˆ’πΈπ‘¦π‘–=βˆ’1=11+π‘’βˆ’(𝐸𝑦𝑖=1βˆ’πΈπ‘¦π‘–=βˆ’1)=11+exp(βˆ’2βˆ‘π‘—π‘€π‘–π‘—π‘¦π‘—βˆ’2𝑏𝑖). Thus the network revolution becomes 𝑦𝑖𝑑+1∼ Bernoulli(𝜎(𝑧𝑖(𝑑))). Retrieval a stored pattern can be done by taking the average of final 𝑀 samples: 𝑦𝑖=𝐼[1π‘€βˆ‘π‘‘=πΏβˆ’π‘€+1𝐿𝑦𝑖(𝑑)>0] (We can take 𝑀=1 for simplicity).

Training: 𝑃(𝑦)=exp(12π‘¦π‘‡π‘Šπ‘¦)βˆ‘π‘¦β€²exp(12π‘¦β€²π‘‡π‘Šπ‘¦β€²), log likelihood 𝐿(π‘Š)=1π‘π‘βˆ‘π‘¦βˆˆπ‘ƒlog𝑃(𝑦)=1π‘π‘βˆ‘π‘¦βˆˆπ‘ƒ12π‘¦π‘‡π‘Šπ‘¦βˆ’logβˆ‘π‘¦β€²exp(12π‘¦β€²π‘‡π‘Šπ‘¦β€²))). To maximize 𝐿(π‘Š), we can use SGD: βˆ‡π‘€π‘–π‘—πΏ=1π‘π‘βˆ‘π‘¦βˆˆπ‘ƒπ‘¦π‘–π‘¦π‘—βˆ’βˆ‘π‘¦β€²exp(12π‘¦β€²π‘‡π‘Šπ‘¦β€²)𝑍𝑦𝑖′𝑦𝑗′=1π‘π‘βˆ‘π‘¦βˆˆπ‘ƒπ‘¦π‘–π‘¦π‘—βˆ’πΈπ‘¦β€²[𝑦𝑖′𝑦𝑗′]=1π‘π‘βˆ‘π‘¦βˆˆπ‘ƒπ‘¦π‘–π‘¦π‘—βˆ’1|𝑆|βˆ‘π‘¦β€²βˆˆπ‘†π‘¦π‘–β€²π‘¦π‘—β€²

We use Restricted Boltzmann Machine (RBM) for faster Gibbs sampling mixing. We have hidden neurons and visible neurons and there's no intra-layer connection. Previously we sample from every neurons and Gibbs sampling gurantees the convergence, but now we can sample from hidden neurons and visible neurons alternatively. Since there are no connections within the same layer, all neurons in the same layer can be sampled in parallel.

1.2.1. Sampling

Sampling from 𝑃(π‘₯):

  1. Inverse Transform Sampling:

If a random variable 𝑋 has a Cumulative Distribution Function (CDF) 𝐹(π‘₯)=𝑃(𝑋≀π‘₯). We can sample a value 𝑒 from π‘ˆ(0,1) and set π‘₯=πΉβˆ’1(𝑒). This only works if we can compute CDF and its inverse.

  1. Box-Muller Transform:

This is a specialized and highly efficient method for sampling from a standard normal (Gaussian) distribution. First draw 𝑒1 and 𝑒2 from the standard uniform distribution π‘ˆ(0,1). Transform them into two independent standard normal samples: 𝑧1=βˆ’2ln(𝑒1)β‹…cos(2πœ‹π‘’2), 𝑧2=βˆ’2ln(𝑒1)β‹…sin(2πœ‹π‘’2). Then Both 𝑧1 and 𝑧2 are independent random variables sampled from the standard normal distribution 𝑁(0,1).

  1. Importance Sampling:

We can sample from 𝑃(π‘₯) by sampling from proposal distribution π‘ž(π‘₯) and then weighting the samples by the ratio 𝑃(π‘₯)π‘ž(π‘₯).

  1. MCMC:

    • Irreducibility: A Markov chain is irreducible if, for any two states 𝑖,𝑗, there exists an integer 𝑛>0 such that: 𝑃𝑖𝑗𝑛>0. This means it is possible to reach any state from any other state in a finite number of steps.
    • Aperiodicity: A Markov chain is aperiodic if, for every state 𝑖, the greatest common divisor of the set {𝑛β‰₯1:𝑃𝑖𝑖𝑛>0} is 1: gcd(𝑛β‰₯1:𝑃𝑖𝑖𝑛>0)=1. This ensures the chain does not get trapped in cycles.
    • Existence of Stationary Distribution: There exists a distribution πœ‹ such that: πœ‹π‘—=βˆ‘π‘–πœ‹π‘–π‘ƒπ‘–π‘— or in vector notation, πœ‹=πœ‹π‘ƒ. This πœ‹ is called the stationary or invariant distribution.
    • Detailed Balance (Reversibility): A stronger condition than stationarity, detailed balance requires that for all states 𝑖,𝑗: πœ‹π‘–π‘ƒπ‘–π‘—=πœ‹π‘—π‘ƒπ‘—π‘–. If this holds, then πœ‹ is a stationary distribution for 𝑃.
    1. Metropolis-Hastings (M-H) Algorithm:

    For 𝑑=0,1,2,…:

    • Draw a candidate π‘₯β€² from a proposal distribution π‘ž(π‘₯β€²|π‘₯𝑑).
    • Compute the acceptance ratio 𝛼(π‘₯β€²|π‘₯𝑑)=min(1,𝑃(π‘₯β€²)β‹…π‘ž(π‘₯𝑑|π‘₯β€²)𝑃(π‘₯𝑑)β‹…π‘ž(π‘₯β€²|π‘₯𝑑)), the formula simplifies to min(1,𝑃(π‘₯β€²)𝑃(π‘₯𝑑)) if π‘ž(π‘₯β€²|π‘₯𝑑)=π‘ž(π‘₯𝑑|π‘₯β€²).
    • Draw a uniform random number 𝑒 from π‘ˆ(0,1).

      • If 𝑒≀𝛼(π‘₯β€²|π‘₯𝑑), then π‘₯𝑑+1=π‘₯β€².
      • Otherwise, π‘₯𝑑+1=π‘₯𝑑.
    1. Gibbs Sampling:

    A special case of the M-H algorithm that is particularly efficient for multidimensional distributions, especially when the conditional distribution of each dimension is easy to sample from.

    • Randomly initialize π‘₯0=(π‘₯10,π‘₯20,…,π‘₯𝑑0).
    • For 𝑑=0,1,2,…:

      • For 𝑖=1,2,…,𝑑:

        • Sample π‘₯𝑖𝑑 from 𝑃(π‘₯𝑖|π‘₯1𝑑,π‘₯2𝑑,…,π‘₯π‘–βˆ’1𝑑,π‘₯𝑖+1π‘‘βˆ’1,…,π‘₯π‘‘π‘‘βˆ’1).

1.3. Normalizing Flows

Simplify the generation process. 𝑧→π‘₯,π‘₯=𝑓(𝑧)

Change of Variables rule:

𝑝(π‘₯)=𝑝(π‘“βˆ’1(π‘₯))|det(βˆ‚π‘“βˆ’1(π‘₯)βˆ‚π‘₯)|=𝑝(𝑧)|det(βˆ‚π‘“(𝑧)βˆ‚π‘§)|

𝑓=π‘“πΎβš¬β€¦βš¬π‘“2βš¬π‘“1, log𝑝(π‘₯)=log𝑝(π‘“βˆ’1(π‘₯))+βˆ‘π‘˜=1𝐾logdet(βˆ‚π‘“π‘˜(π‘₯π‘˜βˆ’1)βˆ‚π‘₯π‘˜βˆ’1)

1.3.1. Planar Flow

𝑓(𝑧)=𝑧+π‘’β„Ž(𝑀𝑇𝑧+𝑏)

where πœ†={π‘€βˆˆπ‘…π·,π‘’βˆˆπ‘…π·,π‘βˆˆπ‘…} are free parameters and β„Ž(β‹…) is a smooth element-wise non-linearity (e.g. tanh, ReLU, etc.).

|det(βˆ‚π‘“βˆ‚π‘§)|=|det(𝐼+π‘’β„Žβ€²(𝑀𝑇𝑧+𝑏)𝑀𝑇)|=|1+π‘’π‘‡π‘€β„Žβ€²(𝑀𝑇𝑧+𝑏)|

The flow defined by the transformation modifies the initial density π‘ž0 by applying a series of contractions and expansions in the direction perpendicular to the hyperplane 𝑀𝑇𝑧+𝑏=0, hence we refer to these maps as planar flows.

However, 𝑓′ does not admit an analytical expression, and one must resort to iterative algorithms such as Newton's method to approximate it.

1.3.2. NICE (Non-linear Independent Component Estimation)

We denote 𝑧=𝑓(π‘₯) and call 𝑓 the encoder and its inverse π‘“βˆ’1 the decoder. With π‘“βˆ’1 given, we can generate π‘₯ from 𝑧 by ancestral sampling.

We want "easy determinant of the Jacobian" and "easy inverse".

General Coupling Layer: Partition π‘₯ into two disjoint subsets π‘₯𝐼1 and π‘₯𝐼2. We can define β„Ž=(β„ŽπΌ1,β„ŽπΌ2), where β„ŽπΌ1=π‘₯𝐼1,β„ŽπΌ2=𝑔(π‘₯𝐼2;π‘š(π‘₯𝐼1)), π‘š is a function (e.g. MLP) and 𝑔 is the coupling law. We consider 𝐼1=[[1,𝑑]] and 𝐼2=[[𝑑+1,𝐷]], then

βˆ‚β„Žβˆ‚π‘₯=(𝐼𝑑0βˆ‚β„ŽπΌ2βˆ‚π‘₯𝐼1βˆ‚β„ŽπΌ2βˆ‚π‘₯𝐼2),detβˆ‚β„Žβˆ‚π‘₯=det(βˆ‚β„ŽπΌ2βˆ‚π‘₯𝐼2)

For simplicity, we choose additive coupling law 𝑔(π‘Ž;𝑏)=π‘Ž+𝑏. Thus π‘₯𝐼1=β„ŽπΌ1,π‘₯𝐼2=β„ŽπΌ2βˆ’π‘š(β„ŽπΌ1),detβˆ‚β„Žβˆ‚π‘₯=1. Each transformation is simple, so we stack multiple layers to get a complex transformation. π‘₯=β„Ž0β†”β„Ž1β†”β€¦β†”β„ŽπΎ=𝑧,detβˆ‚π‘§βˆ‚π‘₯=1.

Combining coupling layers: Since a coupling layer leaves part of its input unchanged, we need to exchange the role of the two subsets in the partition in alternating layers, so that the composition of two coupling layers modifies every dimension.

Re-Scaling Layer: we include a diagonal scaling matrix 𝑆 as the top layer to ensure non-unit volume transformation, which multiplies the i-th ouput value by 𝑆𝑖𝑖: (π‘₯𝑖)𝑖≀𝐷→(𝑆𝑖𝑖π‘₯𝑖)𝑖≀𝐷. This allows the learner to give more weight (i.e. model more variation) on some dimensions and less in others. In the limit where 𝑆𝑖𝑖 goes to +∞ for some 𝑖, the effective dimensionality of the data has been reduced by 1. We can relate these scaling factors to the eigenspectrum of a PCA, showing how much variation is present in each of the latent dimensions (the larger 𝑆𝑖𝑖 is, the less important the dimension 𝑖 is).

Inpainting: For inpainting we clamp the observed dimensions (π‘₯𝑂) to their values and maximize loglikelihood with respect to the hidden dimensions (𝑋𝐻) using projected gradient ascent (to keep the input in its original interval of values) with gaussian noise with step size 𝛼𝑖=10100+𝑖 , where i is the iteration, following the stochastic gradient update:

π‘₯𝐻,𝑖+1=π‘₯𝐻,𝑖+𝛼𝑖((βˆ‚log(𝑝𝑋π‘₯π‘‚βˆ‚π‘₯𝐻,𝑖),π‘₯𝐻,𝑖))+πœ€)

1.3.3. Real NVP

Real-valued non-volume preserving (Real NVP) transformations.

Affine Coupling Layer: β„Ž1:𝑑=π‘₯1:𝑑,β„Žπ‘‘+1:𝐷=π‘₯𝑑+1:π·βŠ™exp(𝑠(π‘₯1:𝑑))+𝑑(π‘₯1:𝑑), where 𝑠 and 𝑑 stand for scale and translation. The Jacobian of this transformation is

βˆ‚β„Žβˆ‚π‘₯𝑇=(𝐼𝑑0βˆ‚β„Žπ‘‘+1:π·βˆ‚π‘₯1:𝑑𝑇diag(exp(𝑠(π‘₯1:𝑑))))

And the inverse transformation is π‘₯1:𝑑=β„Ž1:𝑑,π‘₯𝑑+1:𝐷=(β„Žπ‘‘+1:π·βˆ’π‘‘(β„Ž1:𝑑))βŠ™exp(βˆ’π‘ (β„Ž1:𝑑)).

1.3.4. GLOW

NICE exchanges partitions, Real NVP shuffles, GLOW uses learnable and invertible 1x1 convolutions as mixing layers. log|detβˆ‚Β conv2D(β„Ž;π‘Š)βˆ‚β„Ž|=β„Žβ‹…π‘€β‹…log|detπ‘Š|. The cost of computing or differentiating detπ‘Š is 𝑂(𝑐3). We use LU decomposition to reduce to 𝑂(𝑐). π‘Š=𝑃𝐿(π‘ˆ+Β diag(𝑠)), where 𝑃 is a permutation matrix, 𝐿 is a lower triangular matrix with ones on the diagonal, π‘ˆ is a diagonal matrix with zeros on the diagonal, and 𝑠 is a vector. log|detπ‘Š|=βˆ‘(log|𝑠|).