The initial distribution is Gaussian: \(\color{#d95f02}{p(x_0)}\)
The transition density is Gaussian: \(\color{#050505}{p(x_k | x_{k-1})}\)
Therefore, the predictive prior is Gaussian: \(\color{#1b9e77}{p(x_k | y_{1:k-1})}\)
For this to hold, the transition must be linear with Gaussian noise.
The data likelihood is Gaussian: \(\color{#050505}{p(y_k | x_k)}\)
Therefore, the posterior is Gaussian: \(\color{#d95f02}{p(x_k | y_{1:k})}\)
For this to hold, the observation model must be linear with Gaussian noise.
Thus, we only need to track a mean and covariance and update them linearly.
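A minimal sketch of that recursion (assuming NumPy and placeholder matrices \(A, H, Q, R\) for the linear transition, linear observation, and the two noise covariances; these names are mine, not from the slides):

```python
import numpy as np

def kalman_step(mu, P, y, A, H, Q, R):
    """One predict/update cycle of the linear-Gaussian Kalman filter.

    mu, P : posterior mean and covariance at step k-1
    y     : observation at step k
    A, H  : linear transition and observation operators (assumed known)
    Q, R  : process and observation noise covariances
    """
    # Predict: push the mean and covariance through the linear model.
    mu_prior = A @ mu
    P_prior = A @ P @ A.T + Q

    # Update: linear correction with the Kalman gain.
    S = H @ P_prior @ H.T + R                # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)     # Kalman gain
    mu_post = mu_prior + K @ (y - H @ mu_prior)
    P_post = (np.eye(len(mu)) - K @ H) @ P_prior
    return mu_post, P_post
```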
However, to properly handle nonlinearity, we need a different viewpoint.
Sample point of view
Instead of probability densities, let’s work in terms of samples from distributions.
Predict: Marginalize over the previous state to get the current state’s probability distribution. \[\begin{align}
\color{#1b9e77}{p(x_k | y_{1:k-1})} = \int \color{#050505}{p(x_k | x_{k-1})} \color{#d95f02}{p(x_{k-1} | y_{1:k-1})} dx_{k-1}
\end{align}\]
We have samples \(\color{#d95f02}{x^{(i)}_{k-1}}\) that are drawn from \(\color{#d95f02}{p(x_{k-1} | y_{1:k-1})}\).
Compute samples of the predictive prior via \(\color{#1b9e77}{\hat x^{(i)}_k} = g(\color{#d95f02}{x^{(i)}_{k-1}}) + \epsilon_g\), with the process noise \(\epsilon_g\) drawn independently for each sample.
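In code, the sample-based predict step is a push-forward of the posterior samples through the transition model (a sketch; `g` and the process-noise covariance `Q` are placeholders for whatever model is in use):

```python
import numpy as np

def predict_samples(x_post, g, Q, rng):
    """Push posterior samples x_{k-1}^{(i)} through the transition model.

    x_post : (N, d) array of samples from p(x_{k-1} | y_{1:k-1})
    g      : transition function acting on a single state vector
    Q      : process-noise covariance for epsilon_g
    """
    N, d = x_post.shape
    noise = rng.multivariate_normal(np.zeros(d), Q, size=N)   # epsilon_g, fresh per sample
    return np.array([g(x) for x in x_post]) + noise           # samples of p(x_k | y_{1:k-1})
```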
Sample point of view
Predict: \(\color{#1b9e77}{\hat x^{(i)}_k} = g(\color{#d95f02}{x^{(i)}_{k-1}}) + \epsilon_g\) for each sample.
Update: Use Bayes’ rule to incorporate measurements. \[\begin{align}
\color{#d95f02}{p(x_k | y_{1:k})} \propto \color{#050505}{p(y_k | x_k)} \color{#1b9e77}{p(x_k | y_{1:k-1})}
\end{align}\]
Compute observation samples from the data likelihood via \(\color{#050505}{\hat y^{(i)}_k} = h(\color{#1b9e77}{\hat x^{(i)}_k}) + \epsilon_h\), with the observation noise \(\epsilon_h\) drawn independently for each sample.
Now we have joint samples \(\color{#1b9e77}{\hat x^{(i)}_k}, \color{#050505}{\hat y^{(i)}_k} \sim p(x_k, y_k | y_{1:k-1})\) but we want to sample from \(\color{#d95f02}{p(x_k | y_{1:k})}\).
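Forming the joint samples is another push-forward, this time through the observation model (a sketch; `h` and the observation-noise covariance `R` are assumptions):

```python
import numpy as np

def predicted_observations(x_prior, h, R, rng):
    """Map predictive-prior samples to predicted observations.

    x_prior : (N, d) samples of p(x_k | y_{1:k-1})
    h       : observation operator acting on a single state vector
    R       : observation-noise covariance for epsilon_h
    Returns (N, m) samples y_hat so that (x_prior[i], y_hat[i]) ~ p(x_k, y_k | y_{1:k-1}).
    """
    y_clean = np.array([h(x) for x in x_prior])
    m = y_clean.shape[1]
    noise = rng.multivariate_normal(np.zeros(m), R, size=len(x_prior))  # epsilon_h per sample
    return y_clean + noise
```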
Sample point of view
But how do we sample from the posterior?
Option 1: keep only the samples whose predicted observations are close to \(y_k\). This discards most of our samples.
Option 2: move the samples so that their predictions land closer to \(y_k\).
How? Normalization.
Tangent: why do we want to normalize?
For a unit normal, the conditional distribution is identical for every conditioning value.
\(p(x, y)\): 2D unit normal
Look at slices of specific \(y\) values.
Each \(p(x, y)\) slice has the same mean, covariance, and shape.
\(p(x, y)\): 1D slices for fixed \(y\)
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we get identical distributions for any value of \(y\).
Tangent: why do we want to normalize?
For a unit normal, the conditional distribution is identical for every conditioning value.
\(p(x | y)\): 2D conditioned unit normal
Look at slices of specific \(y\) values.
Each \(p(x | y)\) slice has the same mean, covariance, and shape.
\(p(x | y)\): 1D slices for fixed \(y\)
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we get identical distributions for any value of \(y\).
Tangent: why do we want to normalize?
Moving joint samples to the posterior is simple in the normal case.
Joint samples \(x, y\)
Conditioned samples \(p(x | y=1.1)\)
Here, I easily transform joint samples of \(p(x, y)\) to samples of \(p(x | y = 1.1)\).
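A quick numerical check of this claim (a NumPy sketch): for a 2D standard normal, the x-coordinates of the joint samples already follow \(p(x | y = 1.1)\), because x and y are independent.

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.standard_normal((100_000, 2))       # joint samples of a 2D unit normal p(x, y)

x_joint = xy[:, 0]                           # just drop y: valid because x and y are independent
# Compare against samples taken near the slice y ~ 1.1.
near_slice = np.abs(xy[:, 1] - 1.1) < 0.05
x_slice = xy[near_slice, 0]

print(x_joint.mean(), x_joint.std())         # ~ 0, 1
print(x_slice.mean(), x_slice.std())         # also ~ 0, 1: same conditional for any y
```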
Tangent: why do we want to normalize?
This does not work for a correlated distribution.
\(p(x, y)\): 2D correlated
Look at slices of specific \(y\) values.
Each \(p(x, y)\) slice is different.
\(p(x, y)\): 1D slices for fixed \(y\)
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we don’t get identical distributions.
Tangent: why do we want to normalize?
This does not work for a correlated distribution.
\(p(x | y)\): 2D conditioned correlated
Look at slices of specific \(y\) values.
Each \(p(x | y)\) slice is different.
\(p(x | y)\): 1D slices for fixed \(y\)
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we don’t get identical distributions.
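The same check for a correlated Gaussian shows why the shortcut fails (a sketch; the correlation 0.8 is an arbitrary choice): the conditional mean now depends on the conditioning value y.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])                       # correlated p(x, y)
xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

for y0 in (-1.0, 0.0, 1.0):
    near = np.abs(xy[:, 1] - y0) < 0.05
    x_slice = xy[near, 0]
    # Conditional mean is rho * y0, so each slice is centered differently.
    print(f"y={y0:+.1f}: mean={x_slice.mean():+.2f}, std={x_slice.std():.2f}")
```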
Conditional normalization
Conditional normalization removes the correlations between \(x\) and \(y\).
Conditional sampling is very easy in the de-correlated space.
Invertible normalization lets us convert conditioned samples back to the correlated space.
Conditional normalization
Find an invertible transformation \(N_k\) such that \(z_k = N_k(x_k; y_k)\) is normal.
Conditional normalizing function: \(N_k\)
Latent variable: \(z_k\)
Goal: sample \(\color{#d95f02}{x^{(i)}_k}\) from the posterior \(\color{#d95f02}{p(x_k | y_{1:k})}\).
Compute latent-space samples \(\hat z^{(i)}_k = N_k(\hat x^{(i)}_k; \hat y^{(i)}_k)\) from the joint samples.
Convert back to real space using observed data: \(\color{#d95f02}{x^{(i)}_k} = N_k^{-1}(\hat z^{(i)}_k; y_k)\)
Then \(\color{#d95f02}{x^{(i)}_k}\) are samples of the posterior.
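A sketch of this three-step recipe, written against a generic pair of functions for \(N_k\) and \(N_k^{-1}\) (placeholders; any invertible conditional normalizer, learned or analytic, can be plugged in):

```python
import numpy as np

def conditional_resample(x_joint, y_joint, y_obs, normalize, denormalize):
    """Move joint samples (x_hat, y_hat) ~ p(x_k, y_k | y_{1:k-1}) to the posterior.

    normalize(x, y)   : N_k, maps a joint sample to latent z (approximately standard normal)
    denormalize(z, y) : N_k^{-1}, inverts the map for a fixed conditioning value y
    y_obs             : the actual observation y_k
    """
    # Step 1: map each joint sample to the de-correlated latent space.
    z = np.array([normalize(x, y) for x, y in zip(x_joint, y_joint)])
    # Step 2: invert the map, but now conditioning on the observed data.
    x_post = np.array([denormalize(zi, y_obs) for zi in z])
    # Step 3: x_post are (approximate) samples of p(x_k | y_{1:k}).
    return x_post
```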
\(\implies\) Let’s see how this relates to classic techniques.
Gaussians can be normalized with the Cholesky factor \(C\) of the conditional covariance, using the covariance blocks \(B_{xy}\) and \(B_y\) of the joint. \[\begin{align}
z = C^{-1} \big( (x - \mu_x) + B_{xy}B_y^{-1} (\mu_y - y) \big)
\end{align}\]
The Kalman filter uses a linear normalizer!
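A sketch of that linear normalizer and its inverse (assuming NumPy; \(B_x\) is the state-covariance block, which I introduce here so the conditional covariance \(B_x - B_{xy}B_y^{-1}B_{xy}^T\) can be formed):

```python
import numpy as np

def make_linear_normalizer(mu_x, mu_y, B_x, B_xy, B_y):
    """Build N(x; y) and its inverse for a joint Gaussian with the given blocks."""
    K = B_xy @ np.linalg.inv(B_y)                    # "gain" B_xy B_y^{-1}
    C = np.linalg.cholesky(B_x - K @ B_xy.T)         # Cholesky of the conditional covariance

    def normalize(x, y):
        # z = C^{-1} ( (x - mu_x) + B_xy B_y^{-1} (mu_y - y) )
        return np.linalg.solve(C, (x - mu_x) + K @ (mu_y - y))

    def denormalize(z, y):
        # Invert: x = mu_x + B_xy B_y^{-1} (y - mu_y) + C z
        return mu_x + K @ (y - mu_y) + C @ z

    return normalize, denormalize
```

These two functions plug directly into the `conditional_resample` sketch above, which is the sense in which the Kalman update is a linear normalizer.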
Major conclusion: Data assimilation via conditional normalizing flow is well-established.
Classic technique (Kalman filter)
Success: Efficient data assimilation that is correct for Gaussians.
Failures:
Linearizes transition and observation operators
Assumes Gaussianity
Stores the full covariance matrix (prohibitively large for high-dimensional states)
Solution: Ensemble Kalman filter
Uses nonlinear transition and observation operators
Still assumes Gaussianity (in the linear update)
Uses a low-rank (ensemble) representation of the covariance matrix
Ensemble Kalman filter
Initialize ensemble at time step \(k = 0\) with \(N\) members: \(\{\color{#d95f02}{x_0^{(1)}}, \ldots, \color{#d95f02}{x_0^{(N)}}\}\).
Predict: Advance each ensemble member to the next time step.
\(\color{#1b9e77}{\hat x_k^{(i)}} = g(\color{#d95f02}{x_{k-1}^{(i)}}) + \epsilon_{g}\) for each sample \(i\)
Important! Each sample carries its own instance of the unknown parameters (e.g., permeability).
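A sketch of the ensemble predict step, carrying per-member parameters alongside the states (the two-argument `g(x, p)` signature and the `params` array are assumptions for illustration):

```python
import numpy as np

def enkf_predict(ensemble, params, g, Q, rng):
    """Advance each ensemble member with its own parameter instance.

    ensemble : (N, d) array of members x_{k-1}^{(i)}
    params   : length-N sequence of per-member parameters (e.g., permeability)
    """
    N, d = ensemble.shape
    noise = rng.multivariate_normal(np.zeros(d), Q, size=N)    # epsilon_g per member
    return np.array([g(x, p) for x, p in zip(ensemble, params)]) + noise
```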
Ensemble Kalman filter
Update: assimilate observation \(y\) at time step \(k\)
\(\color{#050505}{\hat y_k^{(i)}} = h(\color{#1b9e77}{\hat x_k^{(i)}})\) for each sample \(i\), with perturbations \(\epsilon_h \sim \mathcal{N}(0, R)\) drawn per member
Compute the ensemble means \(\mu_x, \mu_y\) and scaled deviations: \(\hat X = \{\color{#1b9e77}{\hat x_k^{(i)}} - \mu_x\} / \sqrt{N - 1}\), \(\hat Y = \{\color{#050505}{\hat y_k^{(i)}} - \mu_y\} / \sqrt{N - 1}\)
Linear update: \(\color{#d95f02}{x_k^{(i)}} = \color{#1b9e77}{\hat x_k^{(i)}} + \hat X \hat Y^T(\hat Y \hat Y^T + R)^{-1}(y - (\color{#050505}{\hat y_k^{(i)}} + \epsilon_h))\)
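A sketch of the stochastic (perturbed-observation) EnKF analysis step, assembled directly from the formulas above (variable names are mine; `h` and `R` are assumed):

```python
import numpy as np

def enkf_update(x_prior, h, y_obs, R, rng):
    """Stochastic EnKF analysis step.

    x_prior : (N, d) forecast ensemble \hat x_k^{(i)}
    h       : observation operator
    y_obs   : observation y at step k
    R       : observation-noise covariance
    """
    N = x_prior.shape[0]
    y_pred = np.array([h(x) for x in x_prior])                      # \hat y_k^{(i)} = h(\hat x_k^{(i)})
    eps = rng.multivariate_normal(np.zeros(len(y_obs)), R, size=N)  # perturbations epsilon_h

    # Scaled deviation matrices (columns are members).
    X = (x_prior - x_prior.mean(axis=0)).T / np.sqrt(N - 1)
    Y = (y_pred - y_pred.mean(axis=0)).T / np.sqrt(N - 1)

    gain = X @ Y.T @ np.linalg.inv(Y @ Y.T + R)                     # \hat X \hat Y^T (\hat Y \hat Y^T + R)^{-1}
    innovations = y_obs - (y_pred + eps)                            # y - (\hat y_k^{(i)} + epsilon_h)
    return x_prior + innovations @ gain.T                           # posterior ensemble x_k^{(i)}
```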
The linear update is identical to using a linear normalizer: