The initial distribution is Gaussian: \(\color{#d95f02}{p(x_0)}\)
The transition density is Gaussian: \(\color{#050505}{p(x_k | x_{k-1})}\)
Therefore, the predictive prior is Gaussian: \(\color{#1b9e77}{p(x_k | y_{1:k-1})}\)
This requires the transition to be linear with additive Gaussian noise.
The data likelihood is Gaussian: \(\color{#050505}{p(y_k | x_k)}\)
Therefore, the posterior is Gaussian: \(\color{#d95f02}{p(x_k | y_{1:k})}\)
This requires the observation operator to be linear with additive Gaussian noise.
Thus, we only need to track a mean and covariance and update them linearly.
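To make that concrete, here is a minimal sketch of the linear mean-and-covariance recursion in NumPy; the matrices \(A\), \(H\), \(Q\), \(R\) and all numbers are illustrative placeholders, not taken from any particular model.

```python
import numpy as np

def kf_predict(mu, P, A, Q):
    """Predictive prior: push the Gaussian mean/covariance through the linear transition."""
    return A @ mu, A @ P @ A.T + Q

def kf_update(mu, P, y, H, R):
    """Posterior: condition the Gaussian on a linear observation of the state."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    mu_post = mu + K @ (y - H @ mu)
    P_post = (np.eye(len(mu)) - K @ H) @ P
    return mu_post, P_post

# Toy 1-D example: random walk observed directly (all numbers illustrative).
A = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[0.5]])
mu, P = np.array([0.0]), np.array([[1.0]])
mu, P = kf_predict(mu, P, A, Q)
mu, P = kf_update(mu, P, np.array([0.8]), H, R)
```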
However, to properly handle nonlinearity, we need a different viewpoint.
Sample point of view
Instead of probability densities, let’s work in terms of samples from distributions.
Predict: Marginalize over the previous state to get the current state’s probability distribution. \[\begin{align}
\color{#1b9e77}{p(x_k | y_{1:k-1})} = \int \color{#050505}{p(x_k | x_{k-1})} \color{#d95f02}{p(x_{k-1} | y_{1:k-1})} dx_{k-1}
\end{align}\]
We have samples \(\color{#d95f02}{x^{(i)}_{k-1}}\) that are drawn from \(\color{#d95f02}{p(x_{k-1} | y_{1:k-1})}\).
Compute samples of the predictive prior by computing \(\color{#1b9e77}{\hat x^{(i)}_k} = g(\color{#d95f02}{x^{(i)}_{k-1}}) + \epsilon_g\) for each sample.
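In code, the sample-based predict step is just "push each posterior sample through the transition and add process noise". A minimal sketch, with an illustrative nonlinear transition \(g\) and Gaussian process noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Illustrative nonlinear transition; stands in for the real dynamics.
    return np.sin(x) + 0.9 * x

def predict_samples(x_prev, q_std):
    """Map posterior samples at time k-1 to predictive-prior samples at time k."""
    eps_g = rng.normal(0.0, q_std, size=x_prev.shape)  # process noise epsilon_g
    return g(x_prev) + eps_g

x_post_prev = rng.normal(0.0, 1.0, size=500)     # samples of p(x_{k-1} | y_{1:k-1})
x_hat = predict_samples(x_post_prev, q_std=0.2)  # samples of p(x_k | y_{1:k-1})
```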
Update: Use Bayes’ rule to incorporate measurements. \[\begin{align}
\color{#d95f02}{p(x_k | y_{1:k})} \propto \color{#050505}{p(y_k | x_k)} \color{#1b9e77}{p(x_k | y_{1:k-1})}
\end{align}\]
Compute samples of the data likelihood by computing \(\color{#050505}{\hat y^{(i)}_k} = h(\color{#1b9e77}{\hat x^{(i)}_k}) + \epsilon_h\) for each sample.
Now we have joint samples \(\color{#1b9e77}{\hat x^{(i)}_k}, \color{#050505}{\hat y^{(i)}_k} \sim p(x_k, y_k | y_{1:k-1})\) but we want to sample from \(\color{#d95f02}{p(x_k | y_{1:k})}\).
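The corresponding code simply pushes each predictive-prior sample through an observation operator and adds observation noise; a self-contained sketch with an illustrative \(h\) and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Illustrative observation operator.
    return x**2

x_hat = rng.normal(0.0, 1.0, size=500)   # stand-in predictive-prior samples
r_std = 0.3                              # stand-in observation-noise level
y_hat = h(x_hat) + rng.normal(0.0, r_std, size=x_hat.shape)  # epsilon_h draws
# The pairs (x_hat[i], y_hat[i]) are joint samples from p(x_k, y_k | y_{1:k-1}).
```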
But how do we sample from the posterior?
One option: keep only the samples whose predicted observations \(\color{#050505}{\hat y^{(i)}_k}\) are close to \(y_k\).
But this discards most of our samples.
A better option: move the samples so that they are consistent with the observed \(y_k\).
How? Normalization.
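To see why the first option (keeping only nearby samples) is wasteful before turning to normalization, here is a quick rejection-style check on toy joint samples; the model, tolerance, and sample count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_hat = rng.normal(0.0, 1.0, size=5000)             # toy predictive-prior samples
y_hat = x_hat**2 + rng.normal(0.0, 0.3, size=5000)  # toy predicted observations

y_obs, tol = 1.1, 0.05
keep = np.abs(y_hat - y_obs) < tol
print(f"kept {keep.sum()} of {x_hat.size} samples")  # typically only a few percent
# x_hat[keep] approximates p(x_k | y_k near y_obs), but most of the work is wasted.
```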
Tangent: why do we want to normalize?
For a unit normal, the conditional distribution is identical for each conditioned value.
Look at slices of specific \(y\) values.
Each \(p(x, y)\) slice has the same mean, covariance, and shape.
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we get identical distributions for any value of \(y\).
Moving joint samples to the posterior is simple in the uncorrelated unit-normal case.
Here, I easily transform joint samples of \(p(x, y)\) to samples of \(p(x | y = 1.1)\).
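A small sketch of that transformation for an uncorrelated unit normal: because \(x\) and \(y\) are independent, conditioning on \(y = 1.1\) leaves the \(x\)-marginal untouched, so the joint samples can simply be placed on the slice:

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.standard_normal((10_000, 2))  # joint samples of an uncorrelated unit normal p(x, y)

y_obs = 1.1
x_cond = xy[:, 0]                       # x | y = 1.1 keeps the same N(0, 1) law
y_cond = np.full_like(x_cond, y_obs)    # simply place every sample on the slice
# Every slice p(x | y) is identical here, so no reweighting or rejection is needed.
```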
This does not work for a correlated distribution.
Look at slices of specific \(y\) values.
Each \(p(x, y)\) slice is different.
So after we normalize it using \(\color{#e7298a}{p(x | y) = p(x, y) / p(y)}\), we don’t get identical distributions.
Conditional normalization
Conditional normalization removes the correlations between \(x\) and \(y\).
Conditional sampling is very easy in the de-correlated space.
Invertible normalization lets us convert conditioned samples back to the correlated space.
Find an invertible transformation \(N_k\) such that \(z_k = N_k(x_k; y_k)\) is normal.
Conditional normalizing function: \(N_k\)
Latent variable: \(z_k\)
Goal: sample \(\color{#d95f02}{x^{(i)}_k}\) from the posterior \(\color{#d95f02}{p(x_k | y_{1:k})}\)
Compute latent-space samples \(\hat z^{(i)}_k = N_k(\hat x^{(i)}_k; \hat y^{(i)}_k)\)
Convert back to real space using observed data: \(\color{#d95f02}{x^{(i)}_k} = N_k^{-1}(\hat z^{(i)}_k; y_k)\)
Then \(\color{#d95f02}{x^{(i)}_k}\) are samples of the posterior.
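As a sketch, the whole update is only a few lines once the normalizer is available; here `N` and `N_inv` are left as abstract callables, placeholders for whatever invertible conditional map is used:

```python
def posterior_samples(x_hat, y_hat, y_obs, N, N_inv):
    """Conditional-normalization update.

    x_hat, y_hat : joint samples from p(x_k, y_k | y_{1:k-1})
    y_obs        : the actual observation y_k
    N, N_inv     : conditional normalizer z = N(x; y) and its inverse in x
    """
    z_hat = N(x_hat, y_hat)     # move the joint samples to the latent space
    return N_inv(z_hat, y_obs)  # move back, conditioning on the observed data
```

The linear Gaussian normalizer described below is one concrete choice for these two callables.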
\(\implies\) Let’s see how this relates to classic techniques.
Gaussians can be normalized with the Cholesky factor \(C\) of the conditional covariance. \[\begin{align}
z = C^{-1} \big( (x - \mu_x) + B_{xy}B_y^{-1} (\mu_y - y) \big)
\end{align}\]
The Kalman filter uses a linear normalizer!
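For concreteness, a sketch of that linear normalizer built from ensemble statistics; everything here (estimating the blocks with `np.cov`, the sample shapes, the plain matrix inverses) is an illustration of the formula above, not a reference implementation:

```python
import numpy as np

def linear_normalizer(x_hat, y_hat):
    """Build z = C^{-1}((x - mu_x) + B_xy B_y^{-1} (mu_y - y)) and its inverse.

    x_hat : (n_samples, n_x) joint state samples
    y_hat : (n_samples, n_y) joint observation samples
    """
    mu_x, mu_y = x_hat.mean(axis=0), y_hat.mean(axis=0)
    B = np.cov(np.hstack([x_hat, y_hat]).T)      # joint sample covariance
    nx = x_hat.shape[1]
    B_x, B_xy, B_y = B[:nx, :nx], B[:nx, nx:], B[nx:, nx:]
    gain = B_xy @ np.linalg.inv(B_y)
    C = np.linalg.cholesky(B_x - gain @ B_xy.T)  # Cholesky factor of the conditional covariance
    C_inv = np.linalg.inv(C)

    def N(x, y):
        # Rows of x, y are samples; returns rows C^{-1}((x - mu_x) + gain (mu_y - y)).
        return ((x - mu_x) + (mu_y - y) @ gain.T) @ C_inv.T

    def N_inv(z, y):
        # y is the observed vector; returns rows mu_x + gain (y - mu_y) + C z.
        return mu_x + gain @ (y - mu_y) + z @ C.T

    return N, N_inv
```

Plugging the returned `N` and `N_inv` into the generic `posterior_samples` sketch above reproduces a Kalman-style linear update on the samples.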
Major conclusion: data assimilation via a conditional normalizing flow is well-established; the Kalman filter is its linear special case.
Classic technique (Kalman filter)
Success: efficient data assimilation that is exact in the linear-Gaussian case.
Failures:
Requires linear (or linearized) transition and observation operators
Assumes Gaussianity
Stores the full covariance matrix
Solution: Ensemble Kalman filter
Uses nonlinear transition and observation operators
Still assumes Gaussianity
Uses a low-rank (ensemble) representation of the covariance
Ensemble Kalman filter
Initialize ensemble at time step \(k = 0\) with \(N\) members: \(\{\color{#d95f02}{x_0^{(1)}}, \ldots, \color{#d95f02}{x_0^{(N)}}\}\).
Predict: Advance each ensemble member to the next time step.
\(\color{#1b9e77}{\hat x_k^{(i)}} = g(\color{#d95f02}{x_{k-1}^{(i)}}) + \epsilon_{g}\) for each sample \(i\)
Important! Each sample has a different instance of unknown parameters (permeability).
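A minimal sketch of the ensemble predict step; the transition operator, noise level, state dimension, and ensemble size are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Illustrative nonlinear transition operator.
    return x + 0.1 * np.sin(x)

N, n_state = 100, 3
ensemble = rng.normal(0.0, 1.0, size=(N, n_state))  # N members, members in rows
q_std = 0.05

# Each member gets its own process-noise draw (and, in practice, its own
# instance of the uncertain parameters, e.g. permeability).
ensemble_pred = g(ensemble) + rng.normal(0.0, q_std, size=ensemble.shape)
```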
Update: assimilate observation \(y\) at time step \(k\)
\(\color{#050505}{\hat y_k^{(i)}} = h(\color{#1b9e77}{\hat x_k^{(i)}})\) for each sample \(i\), with observation-noise draws \(\epsilon_h \sim \mathcal{N}(0, R)\) used in the linear update below
Compute the means and scaled deviation matrices (one column per member): \(\hat X = \{\color{#1b9e77}{\hat x_k^{(i)}} - \mu_x\} / \sqrt{N - 1}\), \(\hat Y = \{\color{#050505}{\hat y_k^{(i)}} - \mu_y\} / \sqrt{N - 1}\)
Linear update: \(\color{#d95f02}{x_k^{(i)}} = \color{#1b9e77}{\hat x_k^{(i)}} + \hat X \hat Y^T(\hat Y \hat Y^T + R)^{-1}(y - (\color{#050505}{\hat y_k^{(i)}} + \epsilon_h))\)
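And a sketch of the corresponding linear update, following the anomaly-matrix formulas above; the observation operator, dimensions, and noise covariance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, nx, ny = 100, 3, 2

x_hat = rng.normal(0.0, 1.0, size=(N, nx))  # predicted ensemble, members in rows
y_hat = x_hat[:, :ny] ** 2                  # illustrative observation operator h
R = 0.25 * np.eye(ny)                       # observation-noise covariance
y_obs = np.array([1.0, 0.5])                # the actual observation y_k

# Scaled anomaly (deviation) matrices, one column per member.
X = (x_hat - x_hat.mean(axis=0)).T / np.sqrt(N - 1)
Y = (y_hat - y_hat.mean(axis=0)).T / np.sqrt(N - 1)

# Perturbed-observation linear update: one epsilon_h draw per member.
eps = rng.multivariate_normal(np.zeros(ny), R, size=N)
K = X @ Y.T @ np.linalg.inv(Y @ Y.T + R)    # ensemble-estimated gain
x_post = x_hat + (y_obs - (y_hat + eps)) @ K.T
```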
The linear update is identical to using a linear normalizer: