---
---
References
==========

@misc{albergo_stochastic_2023,
  title = {Stochastic Interpolants: A Unifying Framework for Flows and Diffusions},
  shorttitle = {Stochastic Interpolants},
  url = {http://arxiv.org/abs/2303.08797},
  abstract = {A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in [1], enabling the use of a broad class of continuous-time stochastic processes called ‘stochastic interpolants’ to bridge any two arbitrary probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent probability density function of the stochastic interpolant is shown to satisfy a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion. Upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. Remarkably, we show that minimization of these quadratic objectives leads to control of the likelihood for any of our generative models built upon stochastic dynamics. By contrast, we establish that generative models based upon a deterministic dynamics must, in addition, control the Fisher divergence between the target and the model. We also construct estimators for the likelihood and the cross-entropy of interpolant-based generative models, discuss connections with other stochastic bridges, and demonstrate that such models recover the Schrödinger bridge between the two target densities when explicitly optimizing over the interpolant.},
  language = {en},
  urldate = {2023-07-25},
  publisher = {arXiv},
  author = {Albergo, Michael S. and Boffi, Nicholas M. and Vanden-Eijnden, Eric},
  month = mar,
  year = {2023},
  note = {arXiv:2303.08797 [cond-mat]},
  keywords = {Computer Science - Machine Learning, Condensed Matter - Disordered Systems and Neural Networks, Mathematics - Probability},
}

@misc{liu_flow_2022,
  title = {Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow},
  shorttitle = {Flow Straight and Fast},
  url = {http://arxiv.org/abs/2209.03003},
  abstract = {We present rectified flow, a surprisingly simple approach to learning (neural) ordinary differential equation (ODE) models to transport between two empirically observed distributions π0 and π1, hence providing a unified solution to generative modeling and domain transfer, among various other tasks involving distribution transport. The idea of rectified flow is to learn the ODE to follow the straight paths connecting the points drawn from π0 and π1 as much as possible. This is achieved by solving a straightforward nonlinear least squares optimization problem, which can be easily scaled to large models without introducing extra parameters beyond standard supervised learning. The straight paths are special and preferred because they are the shortest paths between two points, and can be simulated exactly without time discretization and hence yield computationally efficient models. We show that the procedure of learning a rectified flow from data, called rectification, turns an arbitrary coupling of π0 and π1 to a new deterministic coupling with provably non-increasing convex transport costs. In addition, recursively applying rectification allows us to obtain a sequence of flows with increasingly straight paths, which can be simulated accurately with coarse time discretization in the inference phase. In empirical studies, we show that rectified flow performs superbly on image generation, image-to-image translation, and domain adaptation. In particular, on image generation and translation, our method yields nearly straight flows that give high quality results even with a single Euler discretization step.},
  language = {en},
  urldate = {2023-08-03},
  publisher = {arXiv},
  author = {Liu, Xingchao and Gong, Chengyue and Liu, Qiang},
  month = sep,
  year = {2022},
  note = {arXiv:2209.03003 [cs]},
  keywords = {Computer Science - Machine Learning},
}

@article{lipman_flow_2023,
  title = {Flow Matching for Generative Modeling},
  abstract = {We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples—which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.},
  language = {en},
  author = {Lipman, Yaron and Chen, Ricky T. Q. and Ben-Hamu, Heli and Nickel, Maximilian and Le, Matt},
  year = {2023},
}

@misc{dao2023flowmatchinglatentspace,
  title = {Flow Matching in Latent Space},
  author = {Quan Dao and Hao Phung and Binh Nguyen and Anh Tran},
  year = {2023},
  eprint = {2307.08698},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2307.08698},
}

@misc{esser2024scalingrectifiedflowtransformers,
  title = {Scaling Rectified Flow Transformers for High-Resolution Image Synthesis},
  author = {Patrick Esser and Sumith Kulal and Andreas Blattmann and Rahim Entezari and Jonas Müller and Harry Saini and Yam Levi and Dominik Lorenz and Axel Sauer and Frederic Boesel and Dustin Podell and Tim Dockhorn and Zion English and Kyle Lacey and Alex Goodwin and Yannik Marek and Robin Rombach},
  year = {2024},
  eprint = {2403.03206},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2403.03206},
}
---
title: The behaviour of ideal generative flows
layout: post
permalink: /ideal-si/
scholar:
  bibliography: "ideal-si.bib"
---

Stochastic interpolants are a recent innovation that frames generative modelling as building a transport map between distributions. {% cite albergo_stochastic_2023 liu_flow_2022 lipman_flow_2023 %}

We are given two distributions, $p(x)$ and $q(x)$, over the same space $X$. Our goal is to find a vector field $v$ that allows us to map from $p(x)$ to $q(x)$.
We can do so by minimising the following objective:

$$
\begin{aligned}
b(z, t) &= \mathbb{E}\big[\partial_t I(x, y, t) \mid I(x, y, t) = z \big] \\
\mathcal{L}(\theta) &= \int_0^1 \mathbb{E} \big[ \| v(z, t, \theta) - b(z, t) \|_2^2 \big] \, dt
\end{aligned}
$$

where $I(x, y, t)$ is the interpolant function (with $I(x, y, 0) = x \sim p$ and $I(x, y, 1) = y \sim q$), and $v(z, t, \theta)$ is the parameterised vector field.
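In practice the expectation runs over $x \sim p$, $y \sim q$ (coupled independently) and $t \sim \mathcal{U}[0, 1]$, and the conditional expectation can be dropped from the regression target: regressing $v$ directly onto $\partial_t I(x, y, t)$ has the same minimiser. Below is a minimal PyTorch sketch of that training loop, assuming the linear interpolant $I(x, y, t) = (1 - t)x + ty$ (so $\partial_t I = y - x$) and two toy Gaussian samplers; the architecture and hyperparameters are illustrative, not the exact setup used for the figures in this post.

```python
import torch
import torch.nn as nn

# Toy data: p(x) and q(x) are 2D Gaussians (illustrative choices; any two samplers work).
def sample_p(n):
    return torch.randn(n, 2) * 0.5 + torch.tensor([-2.0, 0.0])

def sample_q(n):
    return torch.randn(n, 2) * 1.0 + torch.tensor([+2.0, 0.0])

# v(z, t, theta): a small MLP taking (z, t) and returning a velocity in R^2.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x, y = sample_p(256), sample_q(256)        # independent coupling of p and q
    t = torch.rand(256, 1)
    z = (1 - t) * x + t * y                    # I(x, y, t) = (1 - t) x + t y
    target = y - x                             # d/dt I(x, y, t)
    v = model(torch.cat([z, t], dim=-1))
    loss = ((v - target) ** 2).sum(-1).mean()  # Monte Carlo estimate of L(theta)
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: push x0 ~ p through dz/dt = v(z, t) with Euler steps.
@torch.no_grad()
def transport(x0, n_steps=100):
    z, dt = x0.clone(), 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + dt * model(torch.cat([z, t], dim=-1))
    return z
```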

Here we explore the behaviour of transport maps generated by:

- stochastic interpolants / linear flows
- optimal transport maps
<!-- what about a comparison to NODE or ?? -->

## Stochastic interpolants

Here are a few examples of what stochastic interpolants do.
(These SI transport maps can be calculated exactly for Gaussian distributions, so we can verify that the behaviour we observe is not due to the approximations made by a neural network.)
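To make the exact computation concrete: for the linear interpolant with an independent coupling of $p = \mathcal{N}(\mu_0, \Sigma_0)$ and $q = \mathcal{N}(\mu_1, \Sigma_1)$, the velocity $b(z, t) = \mathbb{E}[\,y - x \mid (1-t)x + ty = z\,]$ is affine in $z$ and can be written down exactly as $b(z, t) = (\mu_1 - \mu_0) + \big(t\Sigma_1 - (1-t)\Sigma_0\big)\big((1-t)^2\Sigma_0 + t^2\Sigma_1\big)^{-1}(z - \mu_t)$ with $\mu_t = (1-t)\mu_0 + t\mu_1$. Here is a small numpy sketch of this; the means and covariances are placeholders, not the parameters used for the figures.

```python
import numpy as np

# Two 2D Gaussians: p = N(mu0, S0), q = N(mu1, S1). (Placeholder parameters.)
mu0, S0 = np.array([-2.0, 0.0]), np.diag([0.25, 1.0])
mu1, S1 = np.array([+2.0, 0.0]), np.diag([1.0, 0.25])

def velocity(z, t):
    """Exact b(z, t) = E[y - x | (1-t)x + ty = z] for independent Gaussian x, y."""
    mu_t = (1 - t) * mu0 + t * mu1
    S_t = (1 - t) ** 2 * S0 + t ** 2 * S1          # covariance of the interpolant at time t
    C = t * S1 - (1 - t) * S0                      # Cov(y - x, z_t)
    return (mu1 - mu0) + (C @ np.linalg.solve(S_t, (z - mu_t).T)).T

def transport(x0, n_steps=200):
    """Euler-integrate dz/dt = b(z, t) from t = 0 to t = 1."""
    z, dt = np.atleast_2d(x0).astype(float), 1.0 / n_steps
    for i in range(n_steps):
        z = z + dt * velocity(z, i * dt)
    return z

samples_p = np.random.multivariate_normal(mu0, S0, size=1000)
samples_q_hat = transport(samples_p)               # should be distributed ~ N(mu1, S1)
```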
### Splitting modes

> __Q:__ If I sample from a mode of $p(x)$, must it map to a mode of $q(x)$? No.



> We have two 2D Gaussian mixture distributions (in blue and cyan). We use SI to learn a map from $p(x)$ (blue) to $q(x)$ (cyan). The learned mapping 'splits' the modes in $p(x)$ when mapping from $p(x)$ to $q(x)$: if we sample from a mode in $p(x)$ (circle or dot) we get 50:50 samples from the modes in $q(x)$. Note that this mapping was approximated using a neural network.
In the more trivial case, if we map from a single Gaussian distribution to a multi-modal Gaussian mixture, then of course the mode of the single Gaussian must be split.
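This trivial case is easy to check numerically: when $p$ is a single Gaussian and $q$ is a Gaussian mixture, the exact velocity is still available in closed form, since conditioning on the mixture component makes everything jointly Gaussian, and the components are re-weighted by their posterior given $z_t$. A 1D sketch with placeholder means and weights, assuming $p = \mathcal{N}(0, 1)$ and $q = \tfrac{1}{2}\mathcal{N}(-3, s^2) + \tfrac{1}{2}\mathcal{N}(+3, s^2)$:

```python
import numpy as np
from scipy.stats import norm

# p = N(0, 1); q = 0.5 N(-3, s^2) + 0.5 N(+3, s^2); x and y drawn independently.
mus, w, s = np.array([-3.0, 3.0]), np.array([0.5, 0.5]), 0.5

def velocity(z, t):
    """Exact b(z, t) = E[y - x | (1-t)x + ty = z] for Gaussian p and mixture q."""
    var_t = (1 - t) ** 2 + t ** 2 * s ** 2          # Var(z_t | component k), same for each k
    # Posterior over mixture components given z_t = z (responsibilities).
    r = w * norm.pdf(z[:, None], loc=t * mus, scale=np.sqrt(var_t))
    r = r / r.sum(axis=1, keepdims=True)
    # E[y - x | z_t = z, component k] is affine in z within each component.
    cond = mus + (t * s ** 2 - (1 - t)) / var_t * (z[:, None] - t * mus)
    return (r * cond).sum(axis=1)

z = np.random.randn(10_000)                         # samples from the single mode of p
dt = 1.0 / 200
for i in range(200):
    z = z + dt * velocity(z, i * dt)

print("fraction mapped to the +3 mode:", (z > 0).mean())   # ~0.5: the mode is split
```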
### Maximum likelihood sample

> __Q:__ If I take the max likelihood sample from p, must it map to a max likelihood sample from q?


> Taking the max likelihood $x$ from $p(x)$, we generate a sample using our mapping from $p(x)$ to $q(x)$. We start with $p(x) \approx 0.8$, and the mapped sample has density $q(y) \approx 0.07$.
So it's possible for a high probability sample from $p(x)$ to map to a low probability sample in $q(x)$!
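This is easy to reproduce in the exact Gaussian setting sketched above (the numbers here are placeholders, not the ones in the figure): with the affine closed-form velocity, the argmax of $p$ (its mean) is transported to the argmax of $q$, but when $q$ is broader than $p$ the density value there is much lower.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A sharp p and a broad q (placeholder parameters).
mu0, S0 = np.zeros(2), 0.3 ** 2 * np.eye(2)
mu1, S1 = np.array([3.0, 0.0]), 2.0 ** 2 * np.eye(2)

def velocity(z, t):
    """Exact b(z, t) for the linear interpolant between two Gaussians."""
    mu_t = (1 - t) * mu0 + t * mu1
    S_t = (1 - t) ** 2 * S0 + t ** 2 * S1
    return (mu1 - mu0) + (t * S1 - (1 - t) * S0) @ np.linalg.solve(S_t, z - mu_t)

z, dt = mu0.copy(), 1.0 / 200         # start at the argmax (mean) of p
for i in range(200):
    z = z + dt * velocity(z, i * dt)  # ends (numerically) at mu1, the argmax of q

print("p at its argmax:", multivariate_normal(mu0, S0).pdf(mu0))  # ~1.77
print("q at the image: ", multivariate_normal(mu1, S1).pdf(z))    # ~0.04
```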

This observation calls into question the validity / reliability of using flow-based approaches to generate solutions to problems with optimal solutions, such as [sudoku](https://arxiv.org/abs/2210.11633), [source separation](https://ieeexplore.ieee.org/document/10095310/), ...

### Mapping the identity

> __Q:__ If I learn to map from $p(x)$ to $q(x)$ where $p(x) \mathop{=}_d q(x)$, do I learn the identity map? No.



> The learned map from $p(x)$ to $q(x)$ when $p(x) \mathop{=}_d q(x)$.
Mapping from $p(x)$ to $p(x)$ learns a non-linear transform.
However, it is possible to 'rectify' the flow. See {% cite liu_flow_2022 %} for more details.

<!-- While it may seem strange to see such a non-uniform mapping, it is the result of the all-to-all pairings. -->

<!-- ### Topology of modes
This should be preserved?! -->


## Optimal transport maps

WIP
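One concrete baseline to build on here: between two Gaussians, the quadratic-cost optimal transport (Monge) map is known in closed form, $T(x) = \mu_1 + A(x - \mu_0)$ with $A = \Sigma_0^{-1/2}\big(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\big)^{1/2}\Sigma_0^{-1/2}$, which can be compared directly against the SI maps above. A small numpy sketch with illustrative parameters:

```python
import numpy as np
from scipy.linalg import sqrtm

mu0, S0 = np.array([-2.0, 0.0]), np.diag([0.25, 1.0])
mu1, S1 = np.array([+2.0, 0.0]), np.diag([1.0, 0.25])

# Closed-form Monge map between Gaussians under the quadratic cost.
S0_half = np.real(sqrtm(S0))
S0_half_inv = np.linalg.inv(S0_half)
A = S0_half_inv @ np.real(sqrtm(S0_half @ S1 @ S0_half)) @ S0_half_inv

def ot_map(x):
    return mu1 + (x - mu0) @ A.T   # A is symmetric here, but keep the general form

x = np.random.multivariate_normal(mu0, S0, size=1000)
y = ot_map(x)   # distributed as N(mu1, S1); the coupling (x, y) minimises E||x - y||^2
```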

<!-- ## Thoughts
Consider the problem of speech enhancement. The -->

<!-- None of this matters for the speech - noisy speech setting since the feature spaces will align. And solving an optimal transport problem should give us good results (since p(x) and p(y) share similar feature interpretations. ie X and Y represent the same state space). and p(y) is (approximately) a slightly higher variance version of p(x) (ie convolved with a blurring gaussian). -->



<!-- - how similar do these spaces need to be? -->

## Alternative setting

The 'unsupervised translation' problem hints at an alternative setting. [@grave_unsupervised_nodate] show it's possible to "infer a bilingual lexicon, without supervised data, by aligning word embeddings trained on monolingual data".

So, imagine a setting where we have two 'similar' distributions on different spaces. We want to find a way to 'align' them.

<!-- want a kind of topology preserving map. can be done by enforcing a cost to local changes? -->

$$
T\, p(x) \to q(y), \quad X \neq Y
$$

Other applications could include:

- unsupervised phoneme translation (or accent 'correction')
- unsupervised

Open questions:

- is it possible to achieve this within the transport framework (with the right cost function)?


## Discussion

More generally, which other properties (modes, max likelihood, ...) of a distribution are (not) invariant to the transport map?

\begin{align}
??
\end{align}

If we are mapping from text to images, then X and Y are clearly different. But if we are mapping from

<!-- HOW DOES HIGH DIMENSIONALITY AFFECT THESE OBSERVATIONS -->

## Bibliography

{% bibliography %}
---
title: "Note on ??"
layout: post
permalink: /opt-init-const/
---

Consider a constrained optimisation problem.
We want to minimise an objective, while also staying close to a specific point $y$ (the constraint).

Is there a rigorous connection between:

1. an optimisation with a penalty term (the typical approach)
2. a finite-step optimisation started from a specific initialisation

$$
\begin{aligned}
x^* &= \arg\min_x f(x) + \lambda D(x, y) \\
D(x, y) &= \tfrac{1}{2} \| x - y \|^2 \\
x_0 &\sim \mathcal{N}(0, \sigma^2)
\end{aligned}
$$

***

We have a fixed budget of $T$ steps to optimise a function $f(x)$, and we start from a specific initialisation $x_0$.

$$
\begin{aligned}
x^* &= \arg\min_x f(x) \\
x_0 &= y
\end{aligned}
$$
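To make the comparison concrete, here is a small numpy sketch that runs both procedures side by side on a toy quadratic $f$ (the objective, step size, $\lambda$ and budget are all illustrative, and the comparison says nothing about the general case):

```python
import numpy as np

def f_grad(x):
    """Gradient of the toy objective f(x) = (x - 5)^2."""
    return 2.0 * (x - 5.0)

y, lam, lr, T = 0.0, 1.0, 0.1, 20
rng = np.random.default_rng(0)

# 1. Penalised objective: minimise f(x) + lam * 0.5 * ||x - y||^2 from a random init.
x = rng.normal(0.0, 1.0)
for _ in range(1000):                        # run (approximately) to convergence
    x = x - lr * (f_grad(x) + lam * (x - y))
x_penalty = x

# 2. A finite budget of T gradient steps on f alone, initialised at x_0 = y.
x = y
for _ in range(T):
    x = x - lr * f_grad(x)
x_finite = x

print("penalty solution:", x_penalty)        # analytic optimum: (2*5 + lam*y) / (2 + lam)
print("T-step-from-y solution:", x_finite)   # 5 * (1 - 0.8**T) for this quadratic
# Both land between y and the unconstrained minimiser; how close depends on lam vs. (lr, T).
```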

***

What's the point of this? / Questions

- (2) will vary dramatically depending on the optimiser used, the learning rate, etc.
- (2) would allow easier worst-case distance bounds?
- relationship to trust-region methods?