
I can try to provide a high-level intuition. To the actual experts reading this, please forgive my errors of simplification.

Drawing samples from "simple" distributions like normal distributions is computationally easy - you can do it in microseconds. Sampling an arbitrary 1-D distribution is a bit harder: you invert its cumulative distribution function (CDF), so that a uniform random variable pushed through the inverse CDF gives you samples from the target distribution, and if the CDF can't be inverted in closed form you may fall back on something like rejection sampling.
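
To make that concrete, here's a rough numpy sketch of inverse-CDF sampling. The exponential distribution is just a convenient example because its CDF inverts in closed form; nothing about it is specific to the discussion above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Exponential CDF: F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam
    def sample_exponential(lam, n):
        u = rng.uniform(size=n)        # uniform draws on [0, 1)
        return -np.log(1.0 - u) / lam  # push them through the inverse CDF

    samples = sample_exponential(lam=2.0, n=100_000)
    print(samples.mean())              # should be close to 1 / lam = 0.5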

Sampling a high-D distribution (such as a distribution over images) is even harder - you need to learn a mapping from this high-D space back to "tractable" densities. This imposes some pretty harsh "optimization bottlenecks" when trying to contort the manifold of images onto a normal distribution. The whole point of the exercise is that the transformation respects a valid probability distribution, so you can start from the normal distribution and apply the mapping to get a valid sample from the image distribution. In practice this is pretty hard, and the sample quality tends to be lower than that of other deep generative models that use fewer parameters.
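
Here's a toy illustration of the change-of-variables idea behind that kind of mapping. A single hand-picked affine map stands in for the learned transformation - a real normalizing flow would stack many learned invertible layers with tractable Jacobians:

    import numpy as np

    a, b = 2.0, 1.0                    # parameters of a toy invertible map x = a*z + b

    # Change of variables: log p(x) = log p_z(f^{-1}(x)) - log |det df/dz|.
    # Here the base density p_z is a standard normal and the Jacobian is just |a|.
    def log_prob_x(x):
        z = (x - b) / a                             # invert the map
        log_pz = -0.5 * (z**2 + np.log(2 * np.pi))  # standard normal log-density
        return log_pz - np.log(abs(a))

    def sample_x(n, rng=np.random.default_rng(0)):
        z = rng.standard_normal(n)     # sample the simple base distribution
        return a * z + b               # push forward through the map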

Now, instead of learning such a complex, hard-to-optimize transformation to valid densities, what if we instead learn a function E(x) that outputs a scalar "energy"? The energy is low for "realistic" images and high for unrealistic ones. Kind of like an inverse probability, except it's not normalized - the energy value tells you nothing about likelihood unless you know the energy of every other possible image. This tends to be easier than learning densities, because the functional form of the energy function is unconstrained.
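
A hand-written toy stand-in for E(x), just to show the shape of the idea - in practice E would be a neural network, and the "target" here is purely illustrative:

    import numpy as np

    # Low energy for images near some "realistic" target, high otherwise.
    # Nothing is normalized, so a single energy value says nothing about
    # likelihood on its own - only comparisons between energies mean anything.
    target = np.full((8, 8), 0.5)      # hypothetical "realistic" 8x8 image

    def energy(x):
        return np.sum((x - target) ** 2)

    print(energy(target))                   # low energy: "realistic"
    print(energy(np.random.rand(8, 8)))     # higher energy: less realistic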

Furthermore, not knowing likelihoods doesn't stop you from getting a "realistic" image: all you need to do is descend the gradient, x -= d/dx E(x), which takes you to an image with lower energy (i.e. one that's more realistic). Under certain procedures (e.g. adding some noise to each gradient step, as in Langevin dynamics), this can be thought of as actually equivalent to sampling from a valid probability distribution, even though you can't compute its likelihood analytically.
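
Roughly what that looks like as (unadjusted) Langevin dynamics on the same toy energy from above - the step size and step count here are arbitrary choices, not anything principled:

    import numpy as np

    rng = np.random.default_rng(0)
    target = np.full((8, 8), 0.5)      # same toy energy as before

    def energy_grad(x):
        return 2.0 * (x - target)      # analytic d/dx E(x) for E(x) = ||x - target||^2

    # Gradient descent on the energy plus Gaussian noise scaled as sqrt(2 * step).
    # With a small step size and many iterations, this approximately samples
    # from p(x) proportional to exp(-E(x)).
    def langevin_sample(steps=1000, step=1e-2):
        x = rng.standard_normal((8, 8))          # start from pure noise
        for _ in range(steps):
            noise = rng.standard_normal(x.shape)
            x = x - step * energy_grad(x) + np.sqrt(2 * step) * noise
        return x

    x = langevin_sample()
    print(np.sum((x - target) ** 2))   # lower than at init, near its typical value under exp(-E(x))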

The diffusion probabilistic model you refer to can be thought of as such a model - the more steps you take (i.e. the more compute you spend), the better the quality of the samples.

GANs can be thought of as a one-pass neural network amortization of the end result of the diffusion process, but unlike MCMC methods, they cannot be "iteratively refined" with additional compute. Their sample quality is limited to whatever the generator spits out, no matter how much extra compute you have to spend.



This sounds like you're describing energy-based models, not diffusion models.


Ah, thanks. I had mistakenly assumed the diffusion process here was comparable to / a drop-in replacement for the Langevin dynamics used with energy-based models.


I went and read the paper in more detail. Yeah, my original comment was way off base, unless you draw a parallel with normalizing flows as iterated refinement (similar to iterated denoising) and view the DDPM as a less constrained form.

But at a surface level, there isn't a clear connection between DDPM and energy-based models.



