--- title: Reliable amortized variational inference with physics-based latent distribution correction runninghead: Reliable amortized variational inference author: | Ali Siahkoohi^1^, Gabrio Rizzuti^2^, Rafael Orozco^1^, and Felix J. Herrmann^1^ \ ^1^School of Computational Science and Engineering, Georgia Institute of Technology\ ^2^Department of Mathematics, Utrecht University bibliography: - paper.bib --- ## Abstract: Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges by pretraining a neural network that approximates the posterior distribution not only for one instance of observed data, but for a distribution of data pertaining to a specific inverse problem. When fed previously unseen data, the neural network---in our case a conditional normalizing flow---provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems because of the Earth's heterogeneous subsurface. In addition, the network is prone to errors if evaluated over data that is not drawn from the training data distribution. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the conditional normalizing flow's latent distribution that improves the approximation to the posterior distribution for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While our approach is generic and applicable to other inverse problems, we show, by means of a linearized seismic imaging example, that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach, given noisy seismic data simulated via linearized Born modeling, provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations. ## Introduction Inverse problems involve the estimation of an unknown quantity based on noisy indirect observations. The problem is typically solved by minimizing the difference between observed and predicted data, where predicted data can be computed by modeling the underlying data generation process through a forward operator. Due to the presence of noise in the data, forward modeling errors, and the inherent nullspace of the forward operator, minimization of the data misfit alone negatively impacts the quality of the obtained solution [@aster2018parameter]. Casting inverse problems into a probabilistic Bayesian framework allows for a more comprehensive description of their solution, where instead of finding one single solution, a distribution of solutions to the inverse problem---known as the posterior distribution---is obtained whose samples are consistent with the observed data [@tarantola2005inverse].
The posterior distribution can be sampled to extract statistical information that allows for quantification of uncertainty, i.e., assessing the variability among the possible solutions to the inverse problem. Uncertainty quantification and Bayesian inference in inverse problems often require high-dimensional posterior distribution sampling, for instance through the use of Markov chain Monte Carlo [MCMC, @robert2004monte; @MartinMcMC2012; @fang2018uqfip; @siahkoohiGEOdbif]. Because of their sequential nature, MCMC sampling methods require a large number of sampling steps to perform accurate Bayesian inference [@gelman2013bayesian], which reduces their applicability to large-scale problems due to the high-dimensionality of the unknown and costs associated with the forward operator [@curtis2001prior; @welling2011bayesian; @MartinMcMC2012; @fang2018uqfip; @herrmann2019NIPSliwcuc; @zhao2019gradient; @kotsi2020; @siahkoohi2020EAGEdlb; @siahkoohi2020SEGuqi]. As an alternative, variational inference methods [@jordan1999introduction; @wainwright2008graphical; @rezende2015variational; @liu2016stein; @rizzuti2020SEGpub; @zhang2020seismic; @tolle2021mean; @li2021multiparameter; @li2021traversing] approximate the posterior distribution with a surrogate and easy-to-sample distribution. By means of this approximation, sampling is turned into an optimization problem, in which the parameters of the surrogate distribution are tuned in order to minimize the divergence between the surrogate and posterior distributions. This surrogate distribution is then used for conducting Bayesian inference. While variational inference methods may have computational advantages over MCMC methods in high-dimensional inverse problems [@blei2017variational; @zhang2021introduction], the resulting approximation to the posterior distribution is typically non-amortized, i.e., it is specific to the observed data used in solving the variational inference optimization problem. Thus, the variational inference optimization problem must be solved again for every new set of observations. Solving this optimization problem may require numerous iterations [@rizzuti2020SEGpub; @zhang2020seismic], which may not be feasible in inverse problems with computationally costly forward operators, such as seismic imaging. On the other hand, amortized variational inference [@pmlrv80kim18e; @baptista2020adaptive; @kruse2021hint; @kovachki2021conditional; @radev2020bayesflow; @siahkoohi2020ABIpto; @ren2021uq; @siahkoohi2021Seglbe; @khorashadizadeh2022conditional; @orozco2021photoacoustic; @siahkoohi2022EAGEweb; @taghvaei2022optimal] reduces Bayesian inference computational costs by incurring an up-front optimization cost for finding a surrogate conditional distribution, typically parameterized by deep neural networks [@kruse2021hint], that approximates the posterior distribution across a family of observed data instead of being specific to a single observed dataset. This supervised learning problem involves maximization of the probability density function (PDF) of the surrogate conditional distribution over existing pairs of model and data [@radev2020bayesflow]. Following optimization, samples from the posterior distribution for previously unseen data may be obtained by sampling the surrogate conditional distribution, which does not require further optimization or MCMC sampling.
While drastically reducing the cost of Bayesian inference, amortized variational inference can only be used for inverse problems where a dataset of model and data pairs is available that sufficiently captures the underlying joint distribution. In reality, such an assumption is rarely true in geophysical applications due to the Earth's strong heterogeneity across geological scenarios and our lack of access to its interior [@siahkoohi2020ABIpto; @sun2021physics; @jin2022unsupervised]. Additionally, the accuracy of Bayesian inference with data-driven amortized variational inference methods degrades as the distribution of the data shifts with respect to pretraining data [@schmitt2021bayesflow]. Among these shifts are changes in the distribution of noise, the number of observed data in multi-source inverse problems, and the distribution of unknowns, in other words, the prior distribution. In this work, we leverage amortized variational inference to accelerate Bayesian inference while building resilience against data distribution shifts through an unsupervised, data-specific, physics-based latent distribution correction method. During this process, the latent distribution of a normalizing-flow-based surrogate conditional distribution [@kruse2021hint] is corrected to minimize the Kullback-Leibler (KL) divergence between the predicted and true posterior distributions. The invertibility of the conditional normalizing flow---a family of invertible neural networks [@dinh2016density]---guarantees the existence of a corrected latent distribution [@pmlr-v119-asim20a] that, when "pushed forward" by the conditional normalizing flow, matches the posterior distribution. During pretraining, the conditional normalizing flow learns to Gaussianize the input model and data joint samples [@kruse2021hint], resulting in a standard Gaussian latent distribution. As a result, for slightly shifted data distributions, the conditional normalizing flow can provide samples from the posterior distribution given an "approximately Gaussian" latent distribution as input [@marzuk2018; @marzouk2018multifidelity]. Motivated by this, and to limit the costs of the latent distribution correction step, we learn a simple diagonal (elementwise) scaling and shift of the latent distribution through a physics-based objective that minimizes the KL divergence between the predicted and true posterior distributions. As with amortized variational inference, after latent distribution correction, we gain cheap access to corrected posterior samples. Besides offering computational advantages, our proposed method implicitly learns the prior distribution during conditional normalizing flow pretraining. As advocated in the literature [@pmlr-v70-bora17a; @pmlr-v119-asim20a], learned priors have the potential to better describe the prior information when compared to generic handcrafted priors that are chosen purely for their simplicity and applicability. A schematic representation of our proposed method is shown in Figure #schematic_correction_simple\. ### Figure: {#schematic_correction_simple .wide} ![](figures/Figure_1.png){width=100%} :Schematic representation of our proposed method. We modify the standard Gaussian latent distribution of a pretrained conditional normalizing flow through a computationally cheap diagonal physics-based correction procedure to mitigate the errors due to data distribution shifts. Upon correction, the new latent samples result in corrected posterior samples when fed into the pretrained conditional normalizing flow.
### Related work In the context of variational inference for inverse problems, @rizzuti2020SEGpub\, @andrle2021invertible\, @zhao2021bayesian\, @zhang2021bayesian2\, and @zhao2022interrogating proposed a non-amortized variational inference approach to approximate posterior distributions through the use of normalizing flows. These methods do not require training data; however, they require choosing a prior distribution and repeated evaluations of the forward operator and the adjoint of its Jacobian. Therefore, the proposed methods may prove costly when applied to inverse problems involving computationally expensive forward operators. To speed up the convergence of non-amortized variational inference, @siahkoohi2020ABIpto introduce a normalizing-flow-based nonlinear preconditioning scheme. In this approach, a pretrained conditional normalizing flow capable of providing a low-fidelity approximation to the posterior distribution is used to warm-start the variational inference optimization procedure. In a related work, @kothari2021trumpets partially address challenges associated with non-amortized variational inference by learning a normalizing-flow-based prior distribution in a learned low-dimensional space via an injective network. In addition to learning a prior, this approach also allows non-amortized variational inference in a lower-dimensional space, which can have computational benefits. Alternatively, amortized variational inference was applied by @adler2018deep\, @kruse2021hint\, @kovachki2021conditional\, @siahkoohi2021Seglbe\, and @khorashadizadeh2022conditional to further reduce the computational costs associated with Bayesian inference. These supervised methods learn an implicit prior distribution from training data and provide posterior samples for previously unseen data for a negligible cost due to the low cost of forward evaluation of neural networks. The success of such techniques hinges on having access to high-quality training data, including pairs of model and data that sufficiently capture the underlying model and data joint distribution. To address this limitation, @siahkoohi2020ABIpto take amortized variational inference a step further by proposing a two-stage multifidelity approach where during the first stage a conditional normalizing flow is trained in the context of amortized variational inference. To account for any potential shift in data distribution, the weights of this pretrained conditional normalizing flow are then further fine-tuned during an optional second stage of physics-based variational inference, which is customized for the specific imaging problem at hand. While limiting the risk of errors caused by shifts in the distribution of data, the second physics-based stage can be computationally expensive due to the high dimensionality of the weight space of conditional normalizing flows. Our work differs from the proposed method in @siahkoohi2020ABIpto in that we learn to correct the latent distribution of the conditional normalizing flow, which typically has a much smaller dimensionality (approximately ``90\times`` smaller in our case) than the dimension of the conditional normalizing flow weight space. The work we present is principally motivated by @pmlr-v119-asim20a\, which demonstrates that normalizing flows---due to their invertibility---can mitigate biases caused by shifts in the data distribution.
This is achieved by reparameterizing the unknown by a pretrained normalizing flow with fixed weights while optimizing over the latent variable in order to fit the data. The reparameterization, together with a Gaussian prior on the latent variable, acts as a regularization, while the invertibility ensures the existence of a latent variable that fits the data. @pmlr-v119-asim20a exploit this property using a normalizing flow that is pretrained to capture the prior distribution associated with an inverse problem. By computing the maximum a posteriori estimate in the latent space, @pmlr-v119-asim20a\, as well as @li2021traversing and @orozco2021photoacoustic\, limit biases originating from data distribution shifts while utilizing the prior knowledge of the normalizing flow. We extend this method by obtaining an approximation to the full posterior distribution of an inverse problem instead of a point estimate, e.g., the maximum a posteriori estimate. Our work is also closely related to the non-amortized variational inference techniques presented by @whang2021composing and @kothari2021trumpets\, in which the latent distribution of a normalizing flow is altered in an unsupervised way in order to perform Bayesian inference. In contrast to our approach, these methods employ a pretrained normalizing flow that approximates the prior distribution. As a result, it is necessary to significantly alter the latent distribution so that the pretrained normalizing flow samples from the posterior distribution. In response, @whang2021composing and @kothari2021trumpets train a second normalizing flow aimed at learning a latent distribution that approximates the posterior distribution after passing through the pretrained normalizing flow. Our study, however, utilizes a conditional normalizing flow, which, before any corrections are applied, already approximates the posterior distribution. We argue that our approach requires a simpler correction in the latent space to mitigate biases caused by shifts in the data distribution. This is crucial when dealing with large-scale inverse problems with computationally expensive forward operators. ### Main contributions The main contribution of our work involves a variational inference formulation for solving probabilistic Bayesian inverse problems that leverages the benefits of data-driven learned posteriors whilst being informed by physics and data. The advantages of this formulation include * Enhancing the solution quality of inverse problems by implicitly learning the prior distribution from the data; * Reliably reducing the cost of uncertainty quantification and Bayesian inference; and * Providing safeguards against data distribution shifts. ### Outline In the sections below, we first formulate multi-source inverse problems mathematically and cast them within a Bayesian framework. We then describe variational inference and examine how existing model and data pairs can be used to obtain an approximation to the posterior distribution that is amortized, i.e., the approximation holds over a distribution of data rather than a specific set of observations. We showcase amortized variational inference on a high-dimensional seismic imaging example in a controlled setting where we assume observed data during inference is drawn from the same distribution as training seismic data.
As a means to mitigate potential errors due to data distribution shifts, we introduce our proposed correction approach to amortized variational inference, which exploits the advantages of learned posteriors while reducing potential errors induced by certain data distribution shifts. Two linearized seismic imaging examples are presented, in which the distribution of the data (simulated via linearized Born modeling) is shifted by altering the forward model and the prior distribution. These numerical experiments are intended to demonstrate the ability of the proposed latent distribution correction method to correct for errors caused by shifts in the distribution of data. Finally, we verify our proposed Bayesian inference method by conducting posterior contraction experiments. ## Theory Our purpose is to present a technique for using deep neural networks to accelerate Bayesian inference for ill-posed inverse problems while ensuring that the inference is robust with respect to data distribution shifts through the use of physics. We begin with an introduction to Bayesian inverse problems and discuss variational inference [@jordan1999introduction] as a probabilistic framework for solving Bayesian inverse problems. ### Inverse problems We are concerned with estimating an unknown multidimensional quantity ``\B{x}^{\ast} \in \mathcal{X}``, often referred to as the unknown model, given ``N`` noisy and indirect observed data (e.g., shot records in seismic imaging) ``\B{y} = \{\B{y}_i\}_{i=1}^N`` with ``\B{y}_i \in \mathcal{Y}``\. Here ``\mathcal{X}`` and ``\mathcal{Y}`` denote the space of unknown models and data, respectively. The underlying physical data generation process is assumed to be encoded in forward modeling operators, ``\mathcal{F}_i:\mathcal{X} \rightarrow \mathcal{Y}``\, which relate the unknown model to the observed data via the forward model ```math #forward_model \B{y}_i = \mathcal{F}_i(\B{x}^{\ast}) + \boldsymbol{\epsilon}_i, \quad i=1,\dots,N. ``` In the above expression, ``\boldsymbol{\epsilon}_i`` is a vector of measurement noise, which might also include errors in the forward modeling operator. Solving ill-posed inverse problems is challenged by noise in the observed data, potential errors in the forward modeling operator, and the intrinsic nontrivial nullspace of the forward operator [@aster2018parameter]. These challenges can lead to non-unique solutions where different estimates of the unknown model may fit the observed data equally well. Under such conditions, the use of a single model estimate ignores the intrinsic variability within inverse problem solutions, which increases the risk of overfitting the data. Therefore, the process of estimating ``\B{x}^{\ast}`` from ``\B{y}`` not only requires regularization, but also calls for a statistical inference framework that allows us to characterize the variability among the solutions by quantifying the solution uncertainty [@tarantola2005inverse]. ### Bayesian inference for solving inverse problems To systematically quantify the uncertainty, we cast the inverse problem into a Bayesian framework [@tarantola2005inverse]. In this framework, instead of having a single estimate of the unknown, the solution is characterized by a probability distribution over the solution space ``\mathcal{X}`` that is conditioned on data, namely the posterior distribution.
This conditional distribution, denoted by ``p_{\text{post}} (\B{x} \mid \B{y})``, can, according to Bayes' rule, be written as follows: ```math #bayes_rule p_{\text{post}} (\B{x} \mid \B{y}) = \frac{p_{\text{like}} (\B{y} \mid \B{x})\, p_{\text{prior}} (\B{x})}{p_{\text{data}} (\B{y})}, ``` which equivalently can be expressed as ```math #bayes_rule_log - \log p_{\text{post}} (\B{x} \mid \B{y}) & = - \sum_{i=1}^{N} \log p_{\text{like}} (\B{y}_i \mid \B{x} ) - \log p_{\text{prior}} (\B{x}) + \log p_{\text{data}} (\B{y}) \\ & = \frac{1}{2 \sigma^2} \sum_{i=1}^{N} \big \| \B{y}_i-\mathcal{F}_i(\B{x}) \big\|_2^2 - \log p_{\text{prior}} (\B{x}) + \text{const}, ``` where the observed data (``\B{y}_i``\) are assumed to be conditionally independent given the unknown model ``\B{x}``. In equations #bayes_rule and #bayes_rule_log\, the likelihood function ``p_{\text{like}} (\B{y} \mid \B{x})`` quantifies how well the predicted data fits the observed data given the PDF of the noise distribution. For simplicity, we assume the distribution of the noise is a zero-mean Gaussian distribution with covariance ``\sigma^2 \B{I}``\, but other choices can be incorporated. The prior distribution ``p_{\text{prior}} (\B{x})`` encodes prior beliefs on the unknown quantity, which can also be interpreted as a regularizer for the inverse problem. Finally, ``p_{\text{data}} (\B{y})`` denotes the data PDF, which is a normalization constant that is independent of ``\B{x}``\. Acquiring statistical information regarding the posterior distribution requires access to samples from the posterior distribution. Sampling the posterior distribution, commonly achieved via MCMC [@robert2004monte] or variational inference techniques [@jordan1999introduction], is computationally costly in high-dimensional inverse problems due to the costs associated with many needed evaluations of the forward operator [@malinverno2004expanded; @tarantola2005inverse; @malinverno2006two; @MartinMcMC2012; @blei2017variational; @chevron2017; @fang2018uqfip; @stuart2019two; @zhao2019gradient; @kotsi2020; @zhang2021introduction]. For multi-source inverse problems, the costs are especially high as evaluating the likelihood function involves ``N`` forward operator evaluations (equation #bayes_rule_log\). Stochastic gradient Langevin dynamics [SGLD; @welling2011bayesian; @li2016preconditioned; @siahkoohiGEOdbif] alleviates the need to evaluate the likelihood for all the ``N`` forward operators by allowing for stochastic approximations to the likelihood, i.e., evaluating the likelihood over randomly selected indices ``i \in \{1, \ldots, N \}``\. While SGLD can provably provide accurate posterior samples with more favorable computational costs [@welling2011bayesian], due to the sequential nature of MCMC methods, SGLD still requires numerous iterations to fully traverse the probability space [@gelman2013bayesian], which is computationally challenging in large-scale multi-source inverse problems. In the next section, we introduce variational inference as an alternative Bayesian inference method that has the potential to scale better than MCMC methods in inverse problems with costly forward operators [@blei2017variational; @zhang2021introduction]. ### Variational inference As an alternative to MCMC-based methods, variational inference methods [@jordan1999introduction] reduce the problem of sampling from the posterior distribution ``p_{\text{post}} (\B{x} \mid \B{y})`` to an optimization problem.
The optimization problem involves approximating the posterior PDF via the PDF of a tractable surrogate distribution ``p_{\phi} (\B{x})`` with parameters ``\Bs{\phi}`` by minimizing a divergence (read "distance") between ``p_{\phi} (\B{x})`` and ``p_{\text{post}}(\B{x} \mid \B{y})`` with respect to ``\Bs{\phi}``\. This optimization problem can be solved approximately, which allows for trading off computational cost for accuracy [@jordan1999introduction]. After optimization, we gain access to samples from the posterior distribution by sampling ``p_{\phi} (\B{x})`` instead, which does not involve forward operator evaluations. Due to its simplicity and connections to the maximum likelihood principle [@bishop2006pattern], we formulate variational inference via the Kullback-Leibler (KL) divergence. The KL divergence can be explained as the cross-entropy of ``p_{\text{post}}(\B{x} \mid \B{y})`` relative to ``p_{\phi}(\B{x})`` minus the entropy of ``p_{\phi}(\B{x})``\. This definition describes the reverse KL divergence, denoted by ``\KL\,\big( p_{\phi}(\B{x}) \mid\mid p_{\text{post}}(\B{x} \mid \B{y})\big)``, which is not equal to the forward KL divergence, ``\KL\,\big( p_{\text{post}}(\B{x} \mid \B{y}) \mid\mid p_{\phi}(\B{x}) \big)``\. This asymmetry of the KL divergence leads to different computational and approximation properties during variational inference, which we describe in detail in the following sections. We will first describe the reverse KL divergence, followed by the forward KL divergence. Finally, we will describe normalizing flows as a way of parameterizing surrogate distributions to facilitate variational inference. #### Non-amortized variational inference The reverse KL divergence is the common choice for formulating variational inference [@rizzuti2020SEGpub; @andrle2021invertible; @zhao2021bayesian; @zhang2021bayesian2; @zhao2022interrogating] in which the physics-informed posterior density guides the optimization over ``\Bs{\phi}``\. The reverse KL divergence can be mathematically stated as ```math #reverse_kl_divergence \KL\,\big( p_{\phi}(\B{x}) \mid\mid p_{\text{post}}(\B{x} \mid \B{y}_{\text{obs}})\big) = \mathbb{E}_{\B{x} \sim p_{\phi}(\B{x})} \big[ - \log p_{\text{post}} (\B{x} \mid \B{y}_{\text{obs}}) + \log p_{\phi}(\B{x}) \big], ``` where ``\B{y}_{\text{obs}} \sim p_{\text{data}} (\B{y})`` refers to a single, specific instance of observed data. ``\B{x}`` on the right-hand side of the expression in equation #reverse_kl_divergence is a random variable obtained by sampling the surrogate distribution ``p_{\phi}(\B{x})``, over which we evaluate the expectation. Variational inference using the reverse KL divergence involves minimizing equation #reverse_kl_divergence with respect to ``\Bs{\phi}`` during which the logarithm of the posterior PDF is approximated by the logarithm of the surrogate PDF, when evaluated over samples from the surrogate distribution. By expanding the negative-log posterior density via Bayes' rule (equation #bayes_rule_log\), we write the non-amortized variational inference optimization problem as ```math #vi_reverse_kl \Bs{\phi}^{\ast} = \argmin_{\phi} \mathbb{E}_{\B{x} \sim p_{\phi}(\B{x})} \left [ \frac{1}{2 \sigma^2} \sum_{i=1}^{N}\big \| \B{y}_{\text{obs}, i}-\mathcal{F}_i (\B{x}) \big\|_2^2 - \log p_{\text{prior}} \left(\B{x} \right) + \log p_{\phi}(\B{x}) \right ]. ``` The expectation in the above equation is approximated with a sample mean over samples drawn from ``p_{\phi}(\B{x})``\.
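For illustration only, the following minimal sketch evaluates a Monte Carlo estimate of the objective in equation #vi_reverse_kl for a diagonal-Gaussian surrogate distribution, a Gaussian prior, and generic linear forward operators; all names, operators, and dimensions are hypothetical stand-ins rather than the operators used in this paper.

```python
# Minimal sketch (PyTorch) of a sample-mean estimate of equation #vi_reverse_kl,
# assuming a diagonal-Gaussian surrogate p_phi(x) = N(mu, diag(s^2)), a zero-mean
# Gaussian prior, and toy linear forward operators F_i(x) = A[i] @ x.
import torch


def reverse_kl_objective(mu, log_s, A, y_obs, sigma, prior_std, n_mc=8):
    """Monte Carlo estimate of the reverse-KL objective for p_phi = N(mu, diag(s^2))."""
    s = torch.exp(log_s)
    q = torch.distributions.Normal(mu, s)
    # Reparameterized samples x = mu + s * z keep the estimate differentiable
    # with respect to the surrogate parameters (mu, log_s).
    x = q.rsample((n_mc,))                                     # shape (n_mc, dim_x)
    # Negative log-likelihood: (1 / 2 sigma^2) * sum_i || y_i - F_i(x) ||^2.
    residual = y_obs[None] - torch.einsum("nij,mj->mni", A, x)
    nll = 0.5 / sigma**2 * (residual**2).sum(dim=(1, 2))
    # Negative log-prior; a zero-mean Gaussian prior is chosen only for illustration.
    nlp = 0.5 / prior_std**2 * (x**2).sum(dim=1)
    # log p_phi(x), the term that penalizes low surrogate entropy.
    log_q = q.log_prob(x).sum(dim=1)
    return (nll + nlp + log_q).mean()


# Toy usage: gradients with respect to (mu, log_s) are obtained by backpropagation.
dim_x, dim_y, N = 64, 32, 4
A = torch.randn(N, dim_y, dim_x)
y_obs = torch.randn(N, dim_y)
mu = torch.zeros(dim_x, requires_grad=True)
log_s = torch.zeros(dim_x, requires_grad=True)
loss = reverse_kl_objective(mu, log_s, A, y_obs, sigma=0.1, prior_std=1.0)
loss.backward()
```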
The optimization problem in equation #vi_reverse_kl can be solved using stochastic gradient descent and its variants [@robbins1951stochastic; @nemirovski2009robust; @rmsprop; @kingma2014adam] where at each iteration the objective function is evaluated over a batch of samples drawn from ``p_{\phi}(\B{x})`` and randomly selected (without replacement) indices ``i \in \{1, \ldots, N \}``\. To solve this optimization problem, there are two considerations to take into account. The first consideration involves the tractable computation of the surrogate PDF and its gradient with respect to ``\Bs{\phi}``. As described in the following sections, normalizing flows [@rezende2015variational], which are a family of specially designed invertible neural networks [@dinh2016density], facilitate the computation of these quantities via the change-of-variable formula in probability distributions [@villani2009optimal]. The second consideration involves differentiating (with respect to ``\Bs{\phi}``\) the expectation (sample mean) operation in equation #vi_reverse_kl\. Evaluating this expectation requires sampling from the surrogate distribution ``p_{\phi}(\B{x})``\, which depends on ``\Bs{\phi}``\. Differentiating through the sampling procedure from the surrogate distribution ``p_{\phi}(\B{x})`` can be facilitated through the reparameterization trick [@Kingma2014]. In this approach, sampling from ``p_{\phi}(\B{x})`` is interpreted as passing latent samples ``\B{z} \in \mathcal{Z}`` from a simple base distribution, such as the standard Gaussian distribution, through a parametric function parameterized by ``\Bs{\phi}`` [@Kingma2014]\. With this interpretation, the expectation over ``p_{\phi}(\B{x})`` can be computed over the latent distribution instead, which does not depend on ``\Bs{\phi}``\, followed by a mapping of latent samples through the parametric function. This process enables computing the gradient of the expression in equation #vi_reverse_kl with respect to ``\Bs{\phi}`` [@Kingma2014]. Following optimization, ``p_{\phi^{\ast}}(\B{x})`` provides unlimited samples from the posterior distribution---virtually for free. While there are indications that this approach can be computationally favorable compared to MCMC sampling methods [@blei2017variational; @zhang2021introduction], each iteration of optimization problem #vi_reverse_kl involves evaluating the forward operator and the adjoint of its Jacobian, which can be computationally costly depending on ``N`` and the number of iterations required to solve problem #vi_reverse_kl\. In addition, and more importantly, this approach is non-amortized---i.e., the resulting surrogate distribution ``p_{\phi^{\ast}}(\B{x})`` approximates the posterior distribution for the specific data ``\B{y}_{\text{obs}}`` that is used to solve optimization problem #vi_reverse_kl\. This necessitates solving the optimization problem again for a new instance of the inverse problem with different data. In the next section, we introduce an amortized variational inference approach that addresses these limitations. #### Amortized variational inference Similar to the reverse KL divergence, the forward KL divergence involves calculating the difference between the logarithms of the surrogate PDF and the posterior PDF. In contrast to the reverse KL divergence, however, the PDFs are evaluated over samples from the posterior distribution rather than samples from the surrogate distribution (see equation #reverse_kl_divergence\).
The forward KL divergence can be written as follows: ```math #forward_kl_divergence \KL\,\big(p_{\text{post}}(\B{x} \mid \B{y}) \mid\mid p_{\phi}(\B{x})\big) = \mathbb{E}_{\B{x} \sim p_{\text{post}}(\B{x} \mid \B{y})} \big[ -\log p_{\phi} (\B{x}) + \log p_{\text{post}}(\B{x} \mid \B{y}) \big]. ``` Following the expression above, it is infeasible to evaluate the forward KL divergence in inverse problems as it requires access to samples from the posterior distribution---the samples that we are ultimately after and do not have access to. However, the average (over data) forward KL divergence can be computed using available model and data pairs in the form of samples from the joint distribution ``p(\B{x}, \B{y})``\. This involves integrating (marginalizing) the forward KL divergence over existing data ``\B{y} \sim p_{\text{data}}(\B{y})``\, where the surrogate distribution is now conditioned on the data, i.e., ``p_{\phi}(\B{x} \mid \B{y})``\: ```math #average_forward_kl & \mathbb{E}_{\B{y} \sim p_{\text{data}}(\B{y})} \Big[ \KL\,\big(p_{\text{post}}(\B{x} \mid \B{y}) \mid\mid p_{\phi}(\B{x} \mid \B{y})\big) \Big] \\ & =\mathbb{E}_{\B{y} \sim p_{\text{data}}(\B{y})} \mathbb{E}_{\B{x} \sim p_{\text{post}}(\B{x} \mid \B{y})} \Big[ -\log p_{\phi} (\B{x} \mid \B{y}) + \underbrace{ \log p_{\text{post}}(\B{x} \mid \B{y})}_{ \text{constant w.r.t. } \Bs{\phi}} \Big] \\ & = \iint \underbrace{p_{\text{data}}(\B{y}) p_{\text{post}}(\B{x} \mid \B{y})}_{= p(\B{x}, \B{y})} \Big[ -\log p_{\phi} (\B{x} \mid \B{y})\Big] \mathrm{d}\B{x}\,\mathrm{d}\B{y} + \text{const} \\ & =\mathbb{E}_{(\B{x}, \B{y}) \sim p(\B{x}, \B{y})} \big[ -\log p_{\phi} (\B{x} \mid \B{y}) \big] + \text{const}. ``` In the above expression ``p_{\phi}(\B{x} \mid \B{y})`` represents a surrogate conditional distribution that approximates the posterior distribution for any data ``\B{y} \sim p_{\text{data}}(\B{y})``\. The third line in equation #average_forward_kl is the result of applying the chain rule of PDFs[^1]. By minimizing the average KL divergence we obtain the following amortized variational inference objective: [^1]: ``p(x, y) = p(x \mid y)\, p(y), \ \forall\, x \in \mathcal{X},\, y \in \mathcal{Y}``. ```math #vi_forward_kl \Bs{\phi}^{\ast} & = \argmin_{\phi}\,\mathbb{E}_{\B{y} \sim p_{\text{data}}(\B{y})} \big[ \KL\,(p_{\text{post}}(\B{x} \mid \B{y}) \mid\mid p_{\phi}(\B{x} \mid \B{y})) \big] \\ & = \argmin_{\phi}\, \mathbb{E}_{(\B{x}, \B{y}) \sim p(\B{x}, \B{y})} \big[ -\log p_{\phi} (\B{x} \mid \B{y}) \big]. ``` The above optimization problem represents a supervised learning framework for obtaining fully-learned posteriors using existing pairs of model and data. The expectation is approximated with a sample mean over available model and data joint samples. Note that this method does not impose any explicit assumption on the noise distribution (see equation #bayes_rule_log\), and the information about the forward model is implicitly encoded in the model and data pairs. As a result, this formulation is an instance of likelihood-free simulation-based inference methods [@cranmer2020frontier; @lavin2021simulation] that allows us to approximate the posterior distribution for previously unseen data as ```math #amortization p_{\phi^{\ast}}(\B{x} \mid \B{y}_{\text{obs}}) \approx p_{\text{post}} (\B{x} \mid \B{y}_{\text{obs}}),\quad \forall\, \B{y}_{\text{obs}} \sim p_{\text{data}} (\B{y}).
``` Equation #amortization holds for previously unseen data drawn from ``p_{\text{data}} (\B{y})`` provided that the optimization problem #vi_forward_kl is solved accurately [@kruse2021hint; @schmitt2021bayesflow], i.e., ``\mathbb{E}_{\B{y} \sim p_{\text{data}}(\B{y})} \big[ \KL\,(p_{\text{post}}(\B{x} \mid \B{y}) \mid\mid p_{\phi^{\ast}}(\B{x} \mid \B{y})) \big] =0``\. Following a one-time upfront cost of training, equation #amortization can be used to sample the posterior distribution with no additional forward operator evaluations. While computationally cheap, the accuracy of the amortized variational inference approach in equation #vi_forward_kl is directly linked to the quality and quantity of model and data pairs used during optimization [@cranmer2020frontier]. This raises questions regarding the reliability of this approach in domains where sufficiently capturing the underlying joint model and data distribution is challenging, e.g., in geophysical applications due to the Earth's strong heterogeneity across geological scenarios and our lack of access to its interior [@siahkoohi2020ABIpto; @sun2021physics; @jin2022unsupervised]. To increase the resilience of amortized variational inference when faced with data distribution shifts, e.g., changes in the forward model or prior distribution, we propose a latent distribution correction to physically inform the inference. Before describing our proposed physics-based latent distribution correction approach, we introduce conditional normalizing flows [@kruse2021hint] to parameterize the surrogate conditional distribution for amortized variational inference. ### Conditional normalizing flows for amortized variational inference To limit the computational cost of amortized variational inference, both during optimization and inference, it is imperative that the surrogate conditional distribution be able to: (1) approximate complex distributions, i.e., it should have a high representation power, which is required to represent possibly multi-modal distributions; (2) support cheap density estimation, which involves computing the density ``p_{\phi}(\B{x} \mid \B{y})`` for given ``\B{x}`` and ``\B{y}``\; and (3) permit fast sampling from ``p_{\phi}(\B{x} \mid \B{y})`` for cheap posterior sampling during inference. These characteristics are provided by conditional normalizing flows [@kruse2021hint], which are a family of invertible neural networks [@dinh2016density] that are capable of approximating complex conditional distributions [@NEURIPS2020_2290a738; @ishikawa2022universal]. A conditional normalizing flow---in the context of amortized variational inference---aims to map input samples ``\B{z}`` from a latent standard multivariate Gaussian distribution ``\mathrm{N}(\B{z} \mid \B{0}, \B{I})`` to samples from the posterior distribution given the observed data ``\B{y} \sim p_{\text{data}}(\B{y})`` as an additional input. This nonlinear mapping can formally be stated as ``f_{\phi}^{-1}(\,\cdot\,; \B{y}): \mathcal{Z} \to \mathcal{X}``\, with ``f_{\phi}^{-1}(\B{z}; \B{y})`` being the inverse of the conditional normalizing flow with respect to its first argument. Due to the low computational cost of evaluating invertible neural networks in reverse [@dinh2016density], using conditional normalizing flows as a surrogate conditional distribution ``p_{\phi}(\B{x} \mid \B{y})`` allows for extremely fast sampling from ``p_{\phi}(\B{x} \mid \B{y})``\.
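To illustrate how these three requirements can be met, the sketch below implements a single conditional affine coupling layer in the spirit of @dinh2016density\; it is a simplified stand-in for the hierarchical conditional normalizing flow of @kruse2021hint used later in this paper, and all names and sizes are illustrative.

```python
# Minimal sketch (PyTorch) of one conditional affine coupling layer, the building
# block behind the invertible architectures cited above; both directions and the
# log-determinant of the Jacobian come at essentially the cost of one network pass.
import torch
import torch.nn as nn


class ConditionalAffineCoupling(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.half = dim_x // 2
        # The scale/shift network sees one half of x together with the data y.
        self.net = nn.Sequential(
            nn.Linear(self.half + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim_x - self.half)),
        )

    def forward(self, x, y):
        # The first half passes through unchanged; the second half is scaled and shifted.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, y], dim=1)).chunk(2, dim=1)
        z2 = x2 * torch.exp(log_s) + t
        logdet = log_s.sum(dim=1)            # log |det Jacobian|, essentially free
        return torch.cat([x1, z2], dim=1), logdet

    def inverse(self, z, y):
        # Inversion reuses the same network evaluation; no linear solve is needed.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z1, y], dim=1)).chunk(2, dim=1)
        x2 = (z2 - t) * torch.exp(-log_s)
        return torch.cat([z1, x2], dim=1)


# Round-trip check on random inputs: the inverse recovers x up to numerical precision.
layer = ConditionalAffineCoupling(dim_x=8, dim_y=4)
x, y = torch.randn(2, 8), torch.randn(2, 4)
z, logdet = layer(x, y)
assert torch.allclose(layer.inverse(z, y), x, atol=1e-5)
```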
In addition to low-cost sampling, the invertibility of conditional normalizing flows permits straightforward and cheap estimation of the density ``p_{\phi}(\B{x} \mid \B{y})``\. This allows for tractable amortized variational inference via equation #vi_forward_kl through the following change-of-variable formula in probability distributions [@villani2009optimal], ```math #density_estimation p_{\phi}(\B{x} \mid \B{y}) = \mathrm{N}\left(f_{\phi}(\B{x}; \B{y}) \,\bigl\vert\, \B{0}, \B{I}\right)\, \Big | \det \nabla_{\B{x}} f_{\phi}(\B{x}; \B{y}) \Big |, \quad \forall\, \B{x}, \B{y} \sim p(\B{x}, \B{y}). ``` In the above formula, ``\mathrm{N}\left(f_{\phi}(\B{x}; \B{y}) \,\bigl\vert\, \B{0}, \B{I}\right)`` represents the PDF for a multivariate standard Gaussian distribution evaluated at ``f_{\phi}(\B{x}; \B{y})``\. Thanks to the special design of invertible neural networks [@dinh2016density], density estimation via equation #density_estimation is cheap since evaluating the conditional normalizing flow and the determinant of its Jacobian ``\det \nabla_{\B{x}} f_{\phi}(\B{x}; \B{y})`` are almost free of cost. Given the expression for ``p_{\phi}(\B{x} \mid \B{y})`` in equation #density_estimation\, we derive the following training objective for amortized conditional normalizing flows: ```math #vi_forward_kl_nf \Bs{\phi}^{\ast} & = \argmin_{\phi}\, \mathbb{E}_{(\B{x}, \B{y}) \sim p(\B{x}, \B{y})} \big[ -\log p_{\phi} (\B{x} \mid \B{y}) \big] \\ & = \argmin_{\phi}\, \mathbb{E}_{(\B{x}, \B{y}) \sim p(\B{x}, \B{y})} \Big[ \frac{1}{2} \left \| f_{\phi}(\B{x}; \B{y}) \right \|^2_2 - \log \Big | \det \nabla_{\B{x}} f_{\phi}(\B{x}; \B{y}) \Big | \Big]. ``` In the above objective, the ``\ell_2``-norm follows from a standard Gaussian distribution assumption on the latent variable, i.e., the output of the normalizing flow. The second term quantifies the relative change of density volume [@papamakarios2021] and can be interpreted as an entropy regularization of ``p_{\phi}(\B{x} \mid \B{y})``\, which prevents the conditional normalizing flow from converging to degenerate solutions, e.g., ``f_{\phi}(\B{x}; \B{y}) := \B{0}``\. Due to the particular design of invertible networks [@dinh2016density; @kruse2021hint], computing the gradient of ``\log \big| \det \nabla_{\B{x}} f_{\phi}(\B{x}; \B{y}) \big|`` has a negligible extra cost. Figure #schematic_pretraining illustrates the pretraining phase as a schematic. ### Figure: {#schematic_pretraining .wide} ![](figures/Figure_2.png){width=100%} :A schematic representation of pretraining conditional normalizing flows in the context of amortized variational inference. During pretraining, model and data joint samples ``\B{x}, \B{y} \sim p(\B{x}, \B{y})`` from the training dataset are fed to the conditional normalizing flow. The training objective (equation #vi_forward_kl_nf\) drives the conditional normalizing flow to Gaussianize its input. After training, given previously unseen observed data ``\B{y}_{\text{obs}} \sim p_{\text{data}}(\B{y})``\, we sample from the posterior distribution using the inverse of the conditional normalizing flow. We achieve this by feeding latent samples ``\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I})`` to the conditional normalizing flow's inverse ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})`` while conditioning on the observed data ``\B{y}_{\text{obs}}``\, ```math #sampling_amortized f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}}) \sim p_{\text{post}} (\B{x} \mid \B{y}_{\text{obs}}), \quad \B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I}).
``` This step is illustrated in Figure #schematic_sampling\. As the process above does not involve forward operator evaluations, sampling with pretrained conditional normalizing flows is fast once an upfront cost of amortized variational inference is incurred. In the next section, we apply the above amortized variational inference to a seismic imaging example in a controlled setting in which we assume no data distribution shifts during inference. ### Figure: {#schematic_sampling .wide} ![](figures/Figure_3.png){width=85%} :A schematic representation of posterior sampling with pretrained conditional normalizing flows. To sample from ``p_{\text{post}} (\B{x} \mid \B{y}_{\text{obs}})``\, the observed data and latent samples are fed to the conditional normalizing flow's inverse ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})``\. Each latent sample realization results in a realization of the posterior distribution. ## Validating amortized variational inference The objective of this example is to apply amortized variational inference to the high-dimensional seismic imaging problem. We show that a well-pretrained conditional normalizing flow within the context of amortized variational inference can be used to provide approximate posterior samples for previously unseen seismic data that is drawn from the same distribution as training seismic data. We begin by introducing seismic imaging and describe challenges with Bayesian inference in this problem. ### Seismic imaging We are concerned with constructing an image of the Earth's subsurface using indirect surface measurements that record the Earth's response to synthetic sources being fired on the surface. The nonlinear relationship between these measurements, known as shot records, and the squared-slowness model of the Earth's subsurface is governed by the wave equation. By linearizing this nonlinear relation, seismic imaging aims to estimate the short-wavelength component of the Earth's subsurface squared-slowness model. In its simplest acoustic form, the linearization with respect to the squared-slowness model---around a known, smooth background squared-slowness model ``\B{m}_0``\---leads to the following linear forward problem: ```math #linear-fwd-op \B{d}_i = \B{J}(\B{m}_0, \B{q}_i) \delta \B{m}^{\ast} + \boldsymbol{\epsilon}_i, \quad i = 1, \dots, N. ``` We invert the above forward model to estimate the ground-truth seismic image ``\delta \B{m}^{\ast}`` from ``N`` processed (linearized) shot records ``\left \{\B{d}_{i}\right \}_{i=1}^{N}``\, where ``\B{J}(\B{m}_0, \B{q}_i)`` represents the linearized Born scattering operator [@gubernatis77]. This operator is parameterized by the source signature ``\B{q}_{i}`` and the smooth background squared-slowness model ``\B{m}_0``\. Noise is denoted by ``\boldsymbol{\epsilon}_i``, and represents measurement noise and linearization errors. While amortized variational inference does not require knowing the closed-form expression of the noise density to simulate pairs of data and model (e.g., it is sufficient to be able to simulate noise instances), for simplicity we assume the noise distribution is a zero-centered Gaussian distribution with known covariance ``\sigma^2 \B{I}``\. Due to the presence of shadow zones and noisy finite-aperture shot data, wave-equation based linearized seismic imaging (in short seismic imaging for the purposes of this paper) corresponds to solving an inconsistent and ill-conditioned linear inverse problem [@lambare1992iterative; @schuster1993least; @nemeth1999least].
To avoid the risk of overfitting the data and to quantify uncertainty, we cast the seismic imaging problem into a Bayesian inverse problem [@tarantola2005inverse]. To address the challenge of Bayesian inference in this high-dimensional inverse problem, we adhere to our amortized variational inference framework. Within this approach, for a one-time upfront cost of training a conditional normalizing flow, we get access to posterior samples for previously unseen observed data that are drawn from the same distribution as the training seismic data. This includes data acquired in areas of the Earth with similar geologies, e.g., in neighboring surveys. In addition, in our framework no explicit prior density function needs to be chosen as the conditional normalizing flow learns the prior distribution during pretraining from the collection of seismic images in the training dataset. The prior distribution implicitly learned by the conditional normalizing flow minimizes the risk of negatively biasing the outcome of Bayesian inference by using overly simplistic priors. In the next section, we describe the setup for our amortized variational inference for seismic imaging. #### Acquisition geometry To mimic the complexity of real seismic images, we propose a "quasi"-real data example in which we generate synthetic data by applying the linearized Born scattering operator to ``4750`` 2D sections with size ``3075\, \mathrm{m} \times 5120\, \mathrm{m}`` extracted from the shallow section of the Kirchhoff migrated [Parihaka-3D] field dataset [@Veritas2005; @WesternGeco2012]. We consider a ``12.5\, \mathrm{m}`` vertical and ``20\, \mathrm{m}`` horizontal grid spacing, and we add an artificial ``125\, \mathrm{m}`` water column on top of these images. We parameterize the linearized Born scattering operator via a fictitious background squared-slowness model, derived from the Kirchhoff migrated images. To ensure good coverage, we simulate ``102`` shot records with a source spacing of ``50\, \mathrm{m}``. Each shot is recorded for two seconds with ``204`` fixed receivers, spaced at ``25\, \mathrm{m}`` and spread on top of the model. The source is a Ricker wavelet with a central frequency of ``30\,\mathrm{Hz}``\. To mimic a more realistic imaging scenario, we add band-limited noise to the shot records, where the noise is obtained by filtering white noise with the source wavelet (Figure #d-obs\). #### Training configuration Casting seismic imaging into amortized variational inference, as described in this paper, is hampered by the high-dimensionality of the data due to the multi-source nature of this inverse problem. To avoid computational complexities associated with directly using ``N`` shot records as input to the conditional normalizing flow, we choose to condition the conditional normalizing flow on the reverse-time migrated image, which can be estimated by applying the adjoint of the linearized Born scattering operator to the shot records, ```math #rtm \delta \B{m}_{\text{RTM}} = \sum_{i=1}^{N} \B{J}(\B{m}_0, \B{q}_i)^{\top} \B{d}_i. ``` While ``\B{d}_i`` in the above expression is defined according to the linearized forward model in equation #linear-fwd-op\, which does not involve linearization errors, our method can handle observed data simulated from wave-equation based nonlinear forward modeling. Conditioning on the reverse-time migrated image and not on the shot records directly may result in learning an approximation to the true posterior distribution [@Adler_2022]\.
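The sketch below illustrates, with random matrices standing in for the linearized Born scattering operators ``\B{J}(\B{m}_0, \B{q}_i)``\, how observed data are simulated via equation #linear-fwd-op and summarized into the reverse-time migrated conditioning input via equation #rtm\; it is a toy illustration and does not reproduce the wave-equation solvers used in our experiments.

```python
# Toy sketch of equations #linear-fwd-op and #rtm, with random matrices J_i acting as
# placeholders for the wave-equation based Born operators J(m0, q_i) of the experiments.
import torch


def simulate_shots(J, dm, sigma):
    """Linearized shot records d_i = J_i dm + eps_i for all sources (equation #linear-fwd-op)."""
    return [J_i @ dm + sigma * torch.randn(J_i.shape[0]) for J_i in J]


def reverse_time_migration(J, d):
    """Stack of migrated (adjoint-applied) shot records (equation #rtm)."""
    return sum(J_i.T @ d_i for J_i, d_i in zip(J, d))


# Toy dimensions: N sources, dim_d data samples per shot, dim_m image grid points.
N, dim_d, dim_m = 4, 128, 256
J = [torch.randn(dim_d, dim_m) for _ in range(N)]     # placeholder Born operators
dm_true = torch.randn(dim_m)                          # ground-truth image perturbation
d_obs = simulate_shots(J, dm_true, sigma=0.1)         # noisy linearized shot records
dm_rtm = reverse_time_migration(J, d_obs)             # conditioning input for the flow
```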
While techniques from statistics involving learned summary functions [@radev2020bayesflow; @schmitt2021bayesflow] can reduce the dimensionality of the observed data, we propose to limit the Bayesian inference bias induced by conditioning on the reverse-time migrated image via our physics-based latent variable correction approach. We leave utilizing summary functions in the context of seismic imaging Bayesian inference to future work. To create training pairs, ``(\delta \B{m}^{(i)}, \delta \B{m}_{\text{RTM}}^{(i)}), \ i=1,\ldots,4750``, we first simulate (see Figure #schematic_pretraining\) noisy seismic data according to the above-mentioned acquisition design for all extracted seismic images ``\delta \B{m}^{(i)}`` from shallow sections of the imaged Parihaka dataset. Next, we compute ``\delta \B{m}_{\text{RTM}}^{(i)}`` by applying reverse-time migration to the observed data for each image ``\delta \B{m}^{(i)}``\. As for the conditional normalizing flow architecture, we follow @kruse2021hint and use hierarchical normalizing flows due to their increased expressiveness when compared to conventional invertible architectures [@dinh2016density]. The expressive power of hierarchical normalizing flows is a result of applying a series of conventional invertible layers [@dinh2016density] to different scales of the input in a hierarchical manner (refer to @kruse2021hint for a schematic representation of the architecture). This leads to an invertible architecture with a dense Jacobian [@kruse2021hint] that is capable of representing complicated bijective transformations. We train this conditional normalizing flow on the pairs ``(\delta \B{m}^{(i)}, \delta \B{m}_{\text{RTM}}^{(i)}), \ i=1,\ldots,4750`` according to the objective function in equation #vi_forward_kl_nf using the Adam stochastic optimization method [@kingma2014adam] with a batch size of ``16`` for one thousand passes over the training dataset (epochs). We use an initial stepsize of ``10^{-4}`` and decrease it after each epoch until reaching the final stepsize of ``10^{-6}``\. To monitor overfitting, we evaluate the objective function at the end of every epoch over random subsets of the validation set, consisting of ``530`` seismic images extracted from the shallow sections of the imaged Parihaka dataset and the associated reverse-time migrated images. As illustrated in Figure #seismic_avi_loss\, the training and validation objective values exhibit a decreasing trend, which suggests no overfitting. We stopped the training after one thousand epochs due to a slowdown in the decrease of the training and validation objective values. ### Figure: {#seismic_avi_loss} ![](figures/Figure_4.png){width=60%} :Training and validation objective values as a function of epochs. The validation objective value is computed over randomly selected batches of the validation set at the end of each epoch. ### Results and observations Following training, the pretrained conditional normalizing flow is able to produce samples from the posterior distribution for seismic data not used in training. These samples resemble different regularized (via the learned prior) least-squares migration images that explain the observed data. To demonstrate this, we simulate seismic data for a previously unseen perturbation model using the forward model in equation #linear-fwd-op with the same noise variance. Figure #data shows an example of a single noise-free (Figure #d-noise-free\) and noisy (Figure #d-obs\) shot record for one of ``102`` sources.
### Figure: {#data .wide} ![](figures/Figure_5a.png){width=35% #d-noise-free} ![](figures/Figure_5b.png){width=35% #d-obs} :A shot record generated from an image extracted from the Parihaka dataset. (a) Noise-free linearized data. (b) Linearized data with bandwidth-limited noise. We perform reverse-time migration to obtain the necessary input for the conditional normalizing flow to produce posterior samples. We show the ground-truth seismic image (to be estimated) and the resulting reverse-time migrated image in Figures #true_model and #rtm\, respectively. Clearly, the reverse-time migrated image has grossly wrong amplitudes, and more importantly, due to limited-aperture shot data, the edges of the image are not well illuminated. ### Figure: {#seismic_avi_setup} ![](figures/Figure_6a.png){width=65% #true_model} \ ![](figures/Figure_6b.png){width=65% #rtm} :Amortized variational inference testing phase setup. (a) High-fidelity ground-truth image. (b) Reverse-time migrated image with SNR ``- 12.17\,\mathrm{dB}``\. We obtain one thousand posterior samples by providing the reverse-time migrated image and latent samples drawn from the standard Gaussian distribution to the pretrained conditional normalizing flow (equation #amortization\). This process is fast as it does not require any forward operator evaluations. To illustrate the variability among the posterior samples, we show six of them in Figure #seismic_avi_samples\. As shown in Figure #seismic_avi_samples\, these image samples have amplitudes in the same range as the ground-truth image and better predict the reflectors at the edges of the image compared to the reverse-time migration image (Figure #rtm\). In addition, the posterior samples indicate improved imaging in deep regions, where imaging is typically more difficult due to the placement of the sources and receivers near the surface. ### Figure: {#seismic_avi_samples} ![](figures/Figure_7a.png){width=50%} ![](figures/Figure_7b.png){width=50%} \ ![](figures/Figure_7c.png){width=50%} ![](figures/Figure_7d.png){width=50%} \ ![](figures/Figure_7e.png){width=50%} ![](figures/Figure_7f.png){width=50%} :Samples drawn from the posterior distribution using the pretrained conditional normalizing flow via equation #amortization with SNRs ranging from ``8.08\,\mathrm{dB}`` to ``8.92\,\mathrm{dB}``\. Samples from the posterior provide access to useful statistical information including approximations to moments of the distribution such as the mean and pointwise standard deviation (Figure #seismic_avi_results\). We compute the mean of the posterior samples to obtain the conditional mean estimate, i.e., the expected value of the posterior distribution. This estimate is depicted in Figure #avi_cm\. From Figure #avi_cm\, we observe that the overall amplitudes are well recovered by the conditional mean estimate, which includes partially recovered reflectors in badly illuminated areas close to the boundaries. Although the reconstructions are not perfect, they significantly improve upon the reverse-time migration estimate. We did not observe a significant increase in the signal-to-noise ratio (SNR) of the conditional mean estimate when more than one thousand samples from the posterior are drawn. We use the one thousand samples to also estimate the pointwise standard deviation (Figure #avi_std\), which serves as an assessment of the uncertainty.
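For reference, the sample statistics reported in this section can be computed as in the sketch below, where `flow_inverse` is a hypothetical handle to the pretrained conditional normalizing flow's inverse; the function names are placeholders and not our actual implementation.

```python
# Minimal sketch of the posterior statistics used in this section, assuming the
# pretrained conditional normalizing flow's inverse is exposed as flow_inverse(z, rtm).
import torch


def posterior_statistics(flow_inverse, rtm, latent_shape, n_samples=1000):
    samples = []
    for _ in range(n_samples):
        z = torch.randn(latent_shape)              # latent sample z ~ N(0, I)
        samples.append(flow_inverse(z, rtm))       # posterior sample f_phi^{-1}(z; rtm)
    samples = torch.stack(samples)
    cond_mean = samples.mean(dim=0)                # conditional (posterior) mean estimate
    point_std = samples.std(dim=0)                 # pointwise standard deviation
    return cond_mean, point_std


# Toy usage with a stand-in flow; in practice flow_inverse wraps the pretrained network.
rtm_img = torch.randn(1, 256, 256)
mean_img, std_img = posterior_statistics(lambda z, rtm: z + rtm, rtm_img, rtm_img.shape, n_samples=8)
```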
To avoid bias from strong amplitudes in the estimated image, we also plot the stabilized division of the standard deviation by the envelope of the conditional mean in Figure #avi_std_normalized\. As expected, the pointwise standard deviations in Figures #avi_std and #avi_std_normalized indicate that we have the most uncertainty in areas of complex geology---e.g., near channels and tortuous reflectors, and in areas with a relatively poor illumination (deep and close to boundaries). The areas with large uncertainty align well with difficult-to-image parts of the model. The normalized pointwise standard deviation (Figure #avi_std_normalized\) aims to visualize an amplitude-independent assessment of uncertainty, which indicates high uncertainty on the onset and offset of reflectors (both shallow and deeper sections), while showing low uncertainty in the areas of the image with no reflectors. ### Figure: {#seismic_avi_results} ![](figures/Figure_8a.png){width=65% #avi_cm} \ ![](figures/Figure_8b.png){width=65% #avi_std} \ ![](figures/Figure_8c.png){width=65% #avi_std_normalized} :Amortized variational inference results. (a) The conditional (posterior) mean estimate with SNR ``9.44\,\mathrm{dB}``\. (b) The pointwise standard deviation among samples drawn from the posterior. (c) Pointwise standard deviation normalized by the envelope of the conditional mean estimate (Figure #avi_cm\). After incurring an upfront cost of training the conditional normalizing flow, the computational cost of sampling the posterior distribution is low as it does not involve any forward operator evaluations. However, the accuracy of the presented results is directly linked to the availability of high-quality training data that fully represent the joint distribution of model and data. Due to our lack of access to the subsurface of the Earth, obtaining high-quality training data is challenging when dealing with geophysical inverse problems. To address this issue, we propose to supplement amortized variational inference with a physics-based latent distribution correction technique that increases the reliability of this approach when dealing with moderate shifts in the data distribution during inference. [Parihaka-3D]:https://wiki.seg.org/wiki/Parihaka-3D ## A physics-based treatment of data distribution shifts during inference For accurate Bayesian inference in the context of amortized variational inference, the surrogate conditional distribution ``p_{\phi}( \B{x} \mid \B{y})`` must attain a zero average KL divergence in the amortized variational inference objective (equation #vi_forward_kl\). Achieving this objective is challenging due to the lack of access to model and data pairs that sufficiently capture the underlying joint distribution in equation #vi_forward_kl\. Additionally, due to potential shifts in the joint distribution during inference, i.e., shifts in the prior distribution or the forward (likelihood) model (equation #forward_model\), the conditional normalizing flow can no longer reliably provide samples from the posterior distribution, as it does not generalize to the shifted data. Under such conditions, feeding latent samples drawn from a standard Gaussian distribution to the conditional normalizing flow may lead to posterior sampling errors. To quantify the posterior distribution approximation error and to propose our correction method, we will use the invariance of the KL divergence to differentiable and invertible mappings [@liu2014divergence].
This property relates the conditional normalizing flow's error in approximating the posterior distribution to its error in Gaussianizing the input model and data pairs.

### KL divergence invariance relation

The errors that the pretrained conditional normalizing flow makes in approximating the posterior distribution can be formally quantified using the invariance of the KL divergence under diffeomorphic mappings [@liu2014divergence]. Using this relation, we relate the posterior distribution approximation errors (KL divergence between true and predicted posterior) to the errors that the conditional normalizing flow makes in Gaussianizing its inputs (KL divergence between the distribution of "Gaussianized" inputs and the standard Gaussian distribution). Specifically, for observed data ``\B{y}_{\text{obs}}`` drawn from a shifted data distribution ``\widehat{p}_{\text{data}} (\B{y}) \neq p_{\text{data}} (\B{y})``, the invariance relation states

```math #kl_divergence_invariance
\KL\,\left(p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) \mid\mid \mathrm{N}(\B{z} \mid \B{0}, \B{I})\right) = \KL\, \left(p_{\text{post}}(\B{x} \mid \B{y}_{\text{obs}}) \mid\mid p_{\phi}(\B{x} \mid \B{y}_{\text{obs}})\right) > 0.
```

In this expression, ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` represents the distribution of the conditional normalizing flow output ``\B{z} = f_{\phi}(\B{x}; \B{y}_{\text{obs}})``\. That is, passing inputs ``\B{x} \sim p_{\text{post}}(\B{x} \mid \B{y}_{\text{obs}})`` for one instance of observed data ``\B{y}_{\text{obs}} \sim \widehat{p}_{\text{data}} (\B{y})`` to the conditional normalizing flow implicitly defines a (conditional) distribution ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` in the conditional normalizing flow output space. We refer to this distribution as the shifted latent distribution as it is the result of a data distribution shift translated through the conditional normalizing flow to the latent space. The data distribution shifts can be caused by changes in the number of sources, the noise distribution, the source wavelet frequency, and the geological features to be imaged. A nonzero KL divergence in equation #kl_divergence_invariance indicates that the conditional normalizing flow fails to accurately Gaussianize the input models ``\B{x} \sim p_{\text{post}}(\B{x} \mid \B{y}_{\text{obs}})`` for the given data ``\B{y}_{\text{obs}}``\. Failure to take into account the mismatch between the shifted latent distribution ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` and ``\mathrm{N}(\B{z} \mid \B{0}, \B{I})`` leads to posterior sampling errors as the KL divergence between the predicted and true posterior distributions is nonzero (equation #kl_divergence_invariance\). In other words, feeding latent samples drawn from a standard Gaussian distribution to ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})`` produces samples from ``p_{\phi}(\B{x} \mid \B{y}_{\text{obs}})``\, which does not accurately approximate the true posterior distribution in the presence of a data distribution shift. On the other hand, by the same reasoning via the KL divergence invariance relation, feeding samples from the shifted latent distribution ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})`` to ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})`` yields accurate posterior samples. However, obtaining samples from ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})`` is not trivial as we do not have a closed-form expression for its density. In the next section, we introduce a physics-based approximation to the shifted latent distribution.
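
As a small numerical sanity check of the invariance property itself, the sketch below evaluates the closed-form KL divergence between two one-dimensional Gaussians before and after applying the same invertible affine map to both; the affine map is a drastically simplified stand-in for the conditional normalizing flow, and all numbers are illustrative.

```python
import numpy as np

def kl_gaussian(m1, s1, m2, s2):
    # KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form.
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Two arbitrary 1D Gaussians standing in for the shifted latent
# distribution and the standard Gaussian latent distribution.
m1, s1 = 0.3, 1.2
m2, s2 = 0.0, 1.0

# An invertible affine map z -> a * z + b maps N(m, s^2) to
# N(a * m + b, (a * s)^2).
a, b = -2.5, 4.0

kl_before = kl_gaussian(m1, s1, m2, s2)
kl_after = kl_gaussian(a * m1 + b, abs(a) * s1, a * m2 + b, abs(a) * s2)
print(kl_before, kl_after)  # identical up to floating-point error
```
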

### Physics-based latent distribution correction

Ideally, performing accurate posterior sampling via the pretrained conditional normalizing flow---in the presence of data distribution shifts---requires passing samples from the shifted latent distribution ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` to ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})``\. Unfortunately, accurately sampling ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` requires access to the true posterior distribution, which we are ultimately after and do not have access to. Alternatively, we propose to express ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` via Bayes' rule,

```math #bayes_rule_physic_inf
p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) = \frac{p_{\text{like}} (\B{y}_{\text{obs}} \mid \B{z})\, p_{\text{prior}} (\B{z})}{\widehat{p}_{\text{data}} (\B{y}_{\text{obs}})},
```

where the physics-informed likelihood function ``p_{\text{like}} (\B{y}_{\text{obs}} \mid \B{z})`` and the prior distribution ``p_{\text{prior}} (\B{z})`` over the latent variable are defined as

```math #physics_informed_density
- \log p_{\phi}(\B{z} \mid \B{y}_{\text{obs}}) & = - \sum_{i=1}^{N} \log p_{\text{like}} (\B{y}_{\text{obs}, i} \mid \B{z} ) - \log p_{\text{prior}} (\B{z}) + \log \widehat{p}_{\text{data}} (\B{y}_{\text{obs}}) \\
& := \frac{1}{2 \sigma^2} \sum_{i=1}^{N} \big \|\B{y}_{\text{obs}, i} - \mathcal{F}_i \circ f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}}) \big\|_2^2 + \frac{1}{2} \big \| \B{z} \big \|_2^2 + \text{const}.
```

In the above expression, the physics-informed likelihood function ``p_{\text{like}} (\B{y}_{\text{obs}} \mid \B{z})`` follows from the forward model in equation #forward_model with a Gaussian assumption on the noise with mean zero and covariance matrix ``\sigma^2 \B{I}``, and the prior distribution ``p_{\text{prior}} (\B{z})`` is chosen as a standard Gaussian distribution with mean zero and covariance matrix ``\B{I}``. The choice of the likelihood function ensures physics and data fidelity by giving more importance to latent variables that, once passed through the pretrained conditional normalizing flow and the forward operator, yield smaller data misfits. The prior distribution ``p_{\text{prior}} (\B{z})`` injects our prior beliefs about the latent variable, which by design is distributed according to a standard Gaussian distribution. Due to our choice of the likelihood function and prior distribution above, the effective prior distribution over the unknown ``\B{x}`` is in fact a conditional prior characterized by the pretrained conditional normalizing flow [@pmlr-v119-asim20a]. As observed by @yang2018conditional and @orozco2021photoacoustic\, using a conditional prior may be more informative than its unconditional counterpart because it is conditioned on the observed data ``\B{y}_{\text{obs}}``\. Our approach can also be viewed as an instance of online variational Bayes [@zeno2018task], where data arrives sequentially and previous posterior approximations are used as priors for subsequent approximations. In the next section, we improve the available amortized approximation to the posterior distribution by relaxing the standard Gaussian distribution assumption of the conditional normalizing flow's latent distribution.
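
Before moving on, and to make the role of each term in equation #physics_informed_density explicit, the following minimal Python sketch evaluates this negative log-density (up to its additive constant) for a toy configuration; the dimensions, the random matrices standing in for the forward operators ``\mathcal{F}_i``, and the placeholder map standing in for ``f_{\phi}^{-1}`` are illustrative assumptions and do not correspond to our actual operators or pretrained network.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_y, n_sources, sigma = 64, 32, 4, 0.1

# Toy stand-ins: linear forward operators F_i and a placeholder for the
# inverse of the pretrained conditional normalizing flow, f_phi^{-1}(z; y_obs).
forward_ops = [rng.standard_normal((dim_y, dim_z)) for _ in range(n_sources)]
def flow_inverse(z):
    return np.tanh(z)  # hypothetical invertible map into the image space

# Toy observed data, one record per source experiment.
y_obs = [rng.standard_normal(dim_y) for _ in range(n_sources)]

def negative_log_density(z):
    """-log p_phi(z | y_obs) up to an additive constant."""
    x = flow_inverse(z)
    misfit = sum(np.sum((y - F @ x) ** 2) for y, F in zip(y_obs, forward_ops))
    return misfit / (2 * sigma**2) + 0.5 * np.sum(z**2)

print(negative_log_density(rng.standard_normal(dim_z)))
```
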

#### Gaussian relaxation of the latent distribution

By definition, feeding samples from ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})`` to the pretrained amortized conditional normalizing flow provides samples from the posterior distribution (see discussion beneath equation #kl_divergence_invariance\). To maintain the low computational cost of sampling with amortized variational inference, it is imperative that ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})`` is sampled as cheaply as possible. To this end, we exploit the fact that conditional normalizing flows in the context of amortized variational inference are trained to Gaussianize the input model random variable (equation #vi_forward_kl_nf\). This suggests that the shifted latent distribution ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})`` will be close to a standard Gaussian distribution for a certain class of data distribution shifts. We exploit this property and approximate the shifted latent distribution ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` via a Gaussian distribution with an unknown mean and diagonal covariance matrix,

```math #gaussian_approximation
p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) \approx \mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big), \quad \B{z} \in \mathcal{Z}.
```

In the above expression, the vector ``\Bs{\mu}`` corresponds to the mean and ``\operatorname{diag}(\B{s})^2`` represents a diagonal covariance matrix with diagonal entries ``\B{s} \odot \B{s}`` (with the symbol ``\odot`` denoting elementwise multiplication) that need to be determined. We estimate these quantities by minimizing the reverse KL divergence between the relaxed Gaussian latent distribution ``\mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big)`` and the shifted latent distribution ``p_{\phi}(\B{z} \mid \B{y}_{\text{obs}})``\. According to the variational inference objective function associated with the reverse KL divergence in equation #vi_reverse_kl\, this correction can be achieved by solving the following optimization problem (see derivation in Appendix A),

```math #reverse_kl_covariance_diagonal
\Bs{\mu}^{\ast}, \B{s}^{\ast} & = \argmin_{\Bs{\mu}, \B{s}} \KL\,\big(\mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big) \mid\mid p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) \big) \\
& = \argmin_{\Bs{\mu}, \B{s}} \mathbb{E}_{\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I})} \bigg [\frac{1}{2 \sigma^2} \sum_{i=1}^{N} \big \| \B{y}_{\text{obs}, i}-\mathcal{F}_i \circ f_{\phi}^{-1} \big(\B{s} \odot\B{z} + \Bs{\mu}; \B{y}_{\text{obs}} \big) \big\|_2^2 \\
& \qquad \qquad \qquad \qquad \quad + \frac{1}{2} \big\| \B{s} \odot \B{z} + \Bs{\mu} \big\|_2^2 - \log \Big | \det \operatorname{diag}(\B{s}) \Big | \bigg ].
```

We solve optimization problem #reverse_kl_covariance_diagonal with the Adam optimizer, where at each iteration we select random batches of latent variables ``\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I})`` and data indices\. We initialize optimization problem #reverse_kl_covariance_diagonal with ``\Bs{\mu} = \B{0}`` and ``\operatorname{diag}(\B{s})^2 = \B{I}``\. This initialization acts as a warm start and an implicit regularization [@pmlr-v119-asim20a] since ``f_{\phi}^{-1}(\B{z}; \B{y}_{\text{obs}})`` for standard Gaussian distributed latent samples ``\B{z}`` provides approximate samples from the posterior distribution---thanks to amortization over different observed data ``\B{y}``\.
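
A minimal sketch of this correction step, written in Python/PyTorch rather than our actual implementation, is given below. The toy forward operators, the placeholder standing in for ``f_{\phi}^{-1}``, and the single-sample, single-source objective evaluation are illustrative assumptions; the sketch only demonstrates the structure of optimization problem #reverse_kl_covariance_diagonal, namely the reparameterization ``\B{s} \odot \B{z} + \Bs{\mu}``, the data-misfit, prior, and log-determinant terms, and the warm start ``\Bs{\mu} = \B{0}`` and ``\operatorname{diag}(\B{s})^2 = \B{I}``.

```python
import torch

torch.manual_seed(0)
dim_z, dim_y, n_sources, sigma = 64, 32, 4, 0.1

# Toy stand-ins for the forward operators F_i, the observed data, and the
# inverse of the pretrained conditional normalizing flow f_phi^{-1}(z; y_obs).
forward_ops = [torch.randn(dim_y, dim_z) for _ in range(n_sources)]
y_obs = [torch.randn(dim_y) for _ in range(n_sources)]
def flow_inverse(z):
    return torch.tanh(z)  # hypothetical placeholder, not the actual network

# Warm start: mu = 0 and s = 1 recover the uncorrected latent distribution.
mu = torch.zeros(dim_z, requires_grad=True)
s = torch.ones(dim_z, requires_grad=True)
optimizer = torch.optim.Adam([mu, s], lr=1e-1)

for epoch in range(5):                            # passes over the shot records
    for i in torch.randperm(n_sources).tolist():  # data index, no replacement
        z = torch.randn(dim_z)                    # single standard Gaussian sample
        z_corr = s * z + mu                       # reparameterized latent sample
        x = flow_inverse(z_corr)
        # Single-source misfit as a stochastic stand-in for the sum over i.
        misfit = torch.sum((y_obs[i] - forward_ops[i] @ x) ** 2) / (2 * sigma**2)
        prior = 0.5 * torch.sum(z_corr ** 2)
        logdet = torch.sum(torch.log(torch.abs(s)))
        loss = misfit + prior - logdet
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Corrected posterior samples: x = f_phi^{-1}(s * z + mu; y_obs).
z = torch.randn(dim_z)
x_corrected = flow_inverse(s.detach() * z + mu.detach())
```
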
Thanks to this warm start, we expect optimization problem #reverse_kl_covariance_diagonal to be solved relatively cheaply. Additionally, the imposed standard Gaussian distribution prior on ``\B{s} \odot \B{z} + \Bs{\mu}`` regularizes the inversion for the corrections since ``\KL\,\left(p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) \mid\mid \mathrm{N}(\B{z} \mid \B{0}, \B{I})\right)`` is minimized during amortized variational inference (equation #kl_divergence_invariance\). To relax the (conditional) prior imposed by the pretrained conditional normalizing flow, instead of a standard Gaussian prior, a Gaussian prior with a larger variance can be imposed on the corrected latent variable. The inherent invertibility of conditional normalizing flows allows them to represent any solution ``\B{x} \in \mathcal{X}`` in the solution space. This has the additional benefit of limiting the adverse effects of imperfect pretraining of ``f_{\phi}`` in domains where access to high-fidelity training data is limited, which is often the case in practice. The output of the conditional normalizing flow can be further regularized by including additional regularization terms in equation #reverse_kl_covariance_diagonal to prevent it from producing out-of-range, non-physical results. Figure #schematic_correction summarizes our proposed latent distribution correction method.

### Figure: {#schematic_correction .wide}
![](figures/Figure_9.png){width=100%}
:A schematic representation of our proposed method. When dealing with a nonzero amortized variational inference objective value (equation #average_forward_kl\) or in the presence of data distribution shifts during inference, we correct the latent distribution of the pretrained conditional normalizing flow via a diagonal physics-based correction. After the correction, the new latent samples result in corrected posterior samples when fed to the pretrained normalizing flow.

#### Inference with corrected latent distribution

Once optimization problem #reverse_kl_covariance_diagonal is solved with respect to ``\Bs{\mu}`` and ``\B{s}``\, we obtain corrected posterior samples by passing samples from the corrected latent distribution ``\mathrm{N} \big(\B{z} \mid \Bs{\mu}^{\ast}, \operatorname{diag}(\B{s}^{\ast})^2\big) \approx p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})`` to the conditional normalizing flow,

```math #inference_after_correction
\B{x} = f_{\phi}^{-1}(\B{s}^{\ast} \odot\B{z} + \Bs{\mu}^{\ast}; \B{y}_{\text{obs}}), \quad \B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I}).
```

These corrected posterior samples are implicitly regularized by the reparameterization with the pretrained conditional normalizing flow and the standard Gaussian distribution prior on ``\B{z}`` [@pmlr-v119-asim20a; @orozco2021photoacoustic; @siahkoohi2022EAGEweb]. The next section applies this physics-based correction to a seismic imaging example, in which we use the pretrained conditional normalizing flow from the earlier example.

## Latent distribution correction applied to seismic imaging

The purpose of our proposed latent distribution correction approach is to accelerate Bayesian inference while maintaining fidelity to a specific observed dataset and physics.
While this method is generic and can be applied to a variety of inverse problems, it is particularly relevant when solving geophysical inverse problems, where the unknown quantity is high dimensional, the forward operator is computationally costly to evaluate, and there is a lack of access to high-quality training data that represents the true heterogeneity of the Earth's subsurface. Therefore, we apply this approach to seismic imaging to utilize the advantages of generative models for solving inverse problems, including fast conditional sampling and learned prior distributions, while limiting the negative bias induced by shifts in data distributions. The results are presented for two cases. The first case involves introducing a series of changes in the distribution of observed data, for example changing the number of sources and the noise levels. This is followed by correcting for the error in predictions made by the pretrained conditional normalizing flow using our proposed method. In the second case, in addition to the shifts in the distribution of the observed data (forward model), we also introduce a shift in the prior distribution. We accomplish this by selecting a ground-truth image from a deeper section of the Parihaka dataset that has different image characteristics than the training images, such as tortuous reflectors and more complex geological features. In both cases, we expect the outcome of the Bayesian algorithm to improve following the latent distribution correction described above. We provide qualitative and quantitative evaluations of the Bayesian inference results.

### Shift in the forward model

In the following example, we introduce shifts in the distribution of observed data---compared to the pretraining phase---by changing the forward model. The shift involves reducing the number of sources (``N`` in equation #forward_model\) by a factor of two to four, while adding band-limited noise with a 1.5 to 3.0 times larger standard deviation (``\sigma`` in equation #bayes_rule_log\). We will demonstrate the potential pitfalls of relying solely on the pretrained conditional normalizing flow in circumstances where the distribution of observed data has shifted. With the use of our latent distribution correction, we will demonstrate that we are able to correct for errors that are made by the pretrained conditional normalizing flow as a result of changes to the data distribution. Following the description of the problem setup, we will also provide comparisons between the conditional mean estimation quality before and after the latent distribution correction step. Before moving on to our results relating to uncertainty quantification on the image, we demonstrate the importance of the correction step by visualizing the improvements in fitting the observed data. Lastly, we perform a series of experiments to verify our Bayesian inference results.

#### Problem setup

To induce shifts in the data distribution, we reduce the number of sources and increase the standard deviation of the added band-limited noise. Consequently, we have reduced the amount of data (due to having fewer source experiments) and decreased the SNR of each shot record (due to being contaminated with stronger noise). As a consequence, seismic imaging becomes more challenging, i.e., it becomes more difficult to estimate the ground-truth image, and the uncertainty associated with the problem is expected to increase.
We use the same ground-truth image as in the previous example (Figure #true_model\), while experimenting with 25, 51, and 102 sources and adding band-limited noise that has ``1.5``, ``2.0``, ``2.5``, and ``3.0`` times larger standard deviation than the pretraining setup. For each combination of source number and noise level (``12`` combinations in total), we compute the reverse-time migrated image corresponding to that combination. Next, we perform latent distribution corrections for each of the ``12`` seismic imaging instances. All latent distribution correction optimization problems (equation #reverse_kl_covariance_diagonal\) are solved using the Adam optimization algorithm [@kingma2014adam] for five passes over the shot records (epochs). We did not observe a significant decrease in the objective function after five epochs. The objective function is evaluated at each iteration by drawing a single latent sample from the standard Gaussian distribution and randomly selecting (without replacement) a data index ``i \in \{1, \ldots, N \}``\. We use a stepsize of ``10^{-1}`` and decrease it by a factor of ``0.9`` at the end of every two epochs. After solving optimization problem #reverse_kl_covariance_diagonal for the different seismic imaging instances, we obtain corrected posterior samples for each instance. The next section provides a detailed discussion of the latent distribution correction that was applied to one such instance that had a significant shift in data distribution.

#### Improved Bayesian inference via latent distribution correction

The aim of this section is to demonstrate how latent distribution correction can be used to mitigate errors induced by data distribution shifts. Specifically, we present the results for the case where the number of sources is reduced by a factor of four (``N = 25``\) as compared to the pretrained data generation setup. The ``25`` sources are spread periodically over the survey area with a source sampling of approximately ``200`` meters. Moreover, we contaminate the resulting shot records with band-limited noise with a standard deviation ``2.5`` times larger than during the pretraining phase. The overall SNR for the data thus becomes ``-2.78\,\mathrm{dB}``\, which is ``7.95\,\mathrm{dB}`` lower than the SNR of the observed data during pretraining (Figure #d-obs\). Figure #data_ldc_0 shows one of the ``25`` shot records.

### Figure: {#data_ldc_0 .wide}
![](figures/Figure_10a.png){width=35% #d-noise-free_ldc_0}
![](figures/Figure_10b.png){width=35% #d-obs_ldc_0}
:A shot record from the shifted data distribution. (a) Noise-free linearized data (same as Figure #d-noise-free\). (b) Noisy linearized data with ``2.5`` times larger band-limited noise standard deviation (SNR ``-2.78\,\mathrm{dB}``)\.

Utilizing the above observed dataset, we compute the reverse-time migrated image as an input to our pretrained conditional normalizing flow (Figure #rtm_ldc_0\). In contrast to the reverse-time migrated image shown in Figure #rtm\, this migrated image is, as expected, noisier, and it displays visible near-source imaging artifacts as a result of coarse source sampling. Additionally, we compute the least-squares migrated seismic image that is obtained by minimizing the negative log-likelihood (see the likelihood term in equation #bayes_rule_log\). This image, shown in Figure #lsrtm_ldc_0\, was constructed by fitting the data without incorporating any prior information.
It is evident from this image that there are strong artifacts caused by noise in the data, underscoring the importance of incorporating prior knowledge when solving the seismic imaging problem.

### Figure: {#seismic_ldc_0_setup}
![](figures/Figure_11a.png){width=65% #rtm_ldc_0} \
![](figures/Figure_11b.png){width=65% #lsrtm_ldc_0}
:Latent distribution correction experiment setup. (a) Reverse-time migrated image corresponding to the shifted forward model with SNR ``-8.22\,\mathrm{dB}``\. (b) Least-squares imaging, which is equivalent to minimizing ``\sum_{i=1}^{N} \big \| \B{d}_i-\B{J}(\B{m}_0, \B{q}_i)\delta \B{m} \big\|_2^2`` with respect to ``\delta \B{m}`` with no regularization. The SNR for this estimate is ``6.90\,\mathrm{dB}``\.

##### Improvements in posterior samples and conditional mean estimate

To obtain amortized (uncorrected) posterior samples, we feed the reverse-time migrated image (Figure #rtm_ldc_0\) and latent samples drawn from the standard Gaussian distribution to the pretrained normalizing flow. These samples, which are shown in the left column of Figure #seismic_ldc_0_samples\, contain artifacts near the top of the image. These artifacts are related to the near-source artifacts in the reverse-time migrated image (Figure #rtm_ldc_0\). Since the reverse-time migrated images used during pretraining do not contain near-source imaging artifacts---due to fine source sampling---the pretrained normalizing flow fails to eliminate them. Further, the uncorrected posterior samples do not accurately predict reflectors as they approach the boundaries and deeper sections of the image. To illustrate the improved posterior sample quality following latent distribution correction, we feed latent samples drawn from the corrected latent distribution to the pretrained normalizing flow (right column of Figure #seismic_ldc_0_samples\). Comparing the left and right columns in Figure #seismic_ldc_0_samples indicates an improvement in the quality of samples from the posterior distribution, which can be attributed to the attenuation of near-top artifacts and an improvement in the image quality close to the boundaries and for the deeper reflectors in the image. Moreover, the SNR values of the posterior samples after correction are approximately ``3\,\mathrm{dB}`` higher, which represents a significant improvement.

### Figure: {#seismic_ldc_0_samples}
![](figures/Figure_12a.png){width=50%} ![](figures/Figure_12b.png){width=50%} \
![](figures/Figure_12c.png){width=50%} ![](figures/Figure_12d.png){width=50%} \
![](figures/Figure_12e.png){width=50%} ![](figures/Figure_12f.png){width=50%}
:Samples from the posterior distribution (left) without latent distribution correction with SNRs ranging from ``4.57\,\mathrm{dB}`` to ``5.21\,\mathrm{dB}``\; and (right) after latent distribution correction with SNRs ranging from ``7.80\,\mathrm{dB}`` to ``8.53\,\mathrm{dB}``\.

To compute the conditional mean estimate, we simulate one thousand posterior samples before and after latent distribution correction. As with the posterior samples before correction, drawing samples after correction is very cheap once the correction is done, as it only requires evaluating the conditional normalizing flow over the corrected latent samples. Figures #ldc_0_avi_cm and #ldc_0_cm\ show conditional mean estimates before and after latent distribution correction, respectively.
The conditional mean estimate before correction reveals similar artifacts as the posterior samples before correction, in particular, near-top imaging artifacts due to coarse source sampling and weaker illumination of reflectors located closer to the boundaries and in deeper portions of the image. The importance of our proposed latent distribution correction can be observed by juxtaposing the conditional mean estimate before (Figure #ldc_0_avi_cm\) and after correction (Figure #ldc_0_cm\). The conditional mean estimate obtained after latent distribution correction eliminates the aforementioned inaccuracies and enhances the quality of the image by approximately ``4\,\mathrm{dB}``\. We gain a similar improvement in SNR compared to the least-squares migrated image (Figure #lsrtm_ldc_0\) at virtually the same cost, i.e., five passes over the shot records. This significant improvement in SNR is also complemented by access to information regarding the uncertainty of the image.

### Figure: {#seismic_ldc_0_results_cm}
![](figures/Figure_13a.png){width=65% #ldc_0_avi_cm} \
![](figures/Figure_13b.png){width=65% #ldc_0_cm}
:Improvements in conditional mean estimate due to latent distribution correction. (a) The conditional (posterior) mean estimate using the pretrained conditional normalizing flow without correction (SNR ``6.29\,\mathrm{dB}``\). (b) The conditional mean estimate after latent distribution correction (SNR ``10.36\,\mathrm{dB}``\).

##### Data-space quality control

As the latent distribution correction step involves finding latent samples that are better suited to fit the data (equation #reverse_kl_covariance_diagonal\), we can expect an improvement in fitting the observed data after correction. Predicted data is obtained by applying the forward operator to the conditional mean estimates, before and after latent distribution correction. Figures #d-pred_avi_ldc_0 and #d-pred_ldc_0 show the predicted shot records before and after correction, respectively. Although both predicted datasets appear similar to the ideal noise-free data (Figure #d-noise-free_ldc_0\), the data residual associated with the conditional mean without correction reveals several coherent events that contain valuable information about the unknown seismic image. The latent distribution correction allows us to fit these coherent events, as indicated by the data residual associated with the corrected conditional mean estimate.

### Figure: {#data_QC_ldc_0 .wide}
![](figures/Figure_14a.png){width=35% #d-pred_avi_ldc_0} ![](figures/Figure_14b.png){width=35% #d-pred_ldc_0} \
![](figures/Figure_14c.png){width=35% #d-error_avi_ldc_0} ![](figures/Figure_14d.png){width=35% #d-error_ldc_0} \
:Quality control in data space. Data is simulated by applying the forward operator to the conditional mean estimate (a) before (SNR ``11.62\,\mathrm{dB}``\); and (b) after latent distribution correction (SNR ``16.57\,\mathrm{dB}``\). (c) Prediction errors associated with Figure #d-pred_avi_ldc_0\. (d) Prediction errors associated with Figure #d-pred_ldc_0 (after latent distribution correction).

##### Uncertainty quantification---pointwise standard deviation and histograms

We exploit cheap access to corrected samples from the posterior in order to extract information regarding uncertainty in the image estimates. Figure #ldc_0_std displays the pointwise standard deviation among the one thousand corrected posterior samples.
The overprint by the strong reflectors can be reduced by normalizing the standard deviation using a stabilized division by the conditional mean (Figure #ldc_0_std_normalized\). The pointwise standard deviation plots indicate high uncertainty in areas near the boundaries of the image and in the deep parts of the image where illumination is relatively poor. This observation is more evident in Figure #seismic_ldc_0_results_profile\, which displays three vertical profiles as ``99\%`` confidence intervals (orange shading) illustrating the expected increasing trend of uncertainty with depth. We additionally observe that the ground truth (dashed black) falls within the confidence intervals for most areas.

### Figure: {#seismic_ldc_0_results_std}
![](figures/Figure_15a.png){width=65% #ldc_0_std} \
![](figures/Figure_15b.png){width=65% #ldc_0_std_normalized}
:Uncertainty quantification with latent distribution correction. (a) The pointwise standard deviation among samples drawn from the posterior after latent distribution correction. (b) Pointwise standard deviation normalized by the conditional mean estimate (Figure #ldc_0_cm\).

### Figure: {#seismic_ldc_0_results_profile}
![](figures/Figure_16a.png){width=65% #avi_ldc_0_std_profile_64} \
![](figures/Figure_16b.png){width=65% #avi_ldc_0_std_profile_128} \
![](figures/Figure_16c.png){width=65% #avi_ldc_0_std_profile_192} \
:Confidence intervals for three vertical profiles. Traces of the ``99 \%`` confidence interval (orange shading), corrected conditional mean (solid blue), and ground truth (dashed black) at horizontal locations (a) ``1.28\,\mathrm{km}``, (b) ``2.56\, \mathrm{km}``\, and (c) ``3.84\,\mathrm{km}``\.

To demonstrate how the corrected posterior is informed by the observed data, we calculate histograms at three locations in Figure #ldc_0_std\. Prior histograms are calculated by feeding latent samples drawn from the standard Gaussian distribution to the pretrained conditional normalizing flow without using data conditioning (see @kruse2021hint and @siahkoohi2021Seglbe for more information). These samples in the image space are indicative of samples from the prior distribution implicitly learned by the conditional normalizing flow during pretraining. The resulting prior histograms are shown in Figure #seismic_ldc_0_results_hist\. Corresponding histograms are also obtained for the uncorrected amortized posterior distribution (equation #sampling_amortized\). As mentioned before, the uncorrected posterior distribution serves as an implicit conditional prior for the subsequent step of correction of the latent distribution. The green histograms in Figure #seismic_ldc_0_results_hist represent the uncorrected amortized posterior distribution. A similar procedure is followed to obtain histograms after latent distribution correction (blue histograms). As expected, the histograms of the posterior distribution are considerably narrower than those of the learned prior, which indicates that the posterior is further informed by the specific observed dataset and physics. As a means of evaluating the effect of latent distribution correction, we provide a vertical solid line showing the location of the ground-truth value. The corrected posterior histograms at all three locations are shifted towards the ground truth, and their (conditional) means, plotted with dashed vertical lines, indicate improved recovery of the ground truth.
Compared to the amortized uncorrected histograms, the corrected histograms in Figure #seismic_ldc_0_results_hist are further contracted, suggesting that the latent distribution correction step has further informed the inference with the observed data.

### Figure: {#seismic_ldc_0_results_hist}
![](figures/Figure_17a.png){width=65% #avi_ldc_0_std_hist_60-70} \
![](figures/Figure_17b.png){width=65% #avi_ldc_0_std_hist_70-200} \
![](figures/Figure_17c.png){width=65% #avi_ldc_0_std_hist_200-150} \
:Pointwise prior (red), uncorrected amortized posterior (green), and latent distribution corrected posterior (blue) histograms along with the true perturbation values (solid black line) and the corrected conditional mean (dashed black line) for points located at (a) ``(1.2\, \mathrm{km}, \ 0.875\, \mathrm{km})``, (b) ``(1.4\, \mathrm{km}, \ 2.5\, \mathrm{km})``, and (c) ``(4.0\, \mathrm{km},\ 1.875\, \mathrm{km})``.

#### Bayesian inference verification

While we investigated the accuracy of the conditional mean estimate after correction, we do not have access to the underlying true posterior distribution to verify our proposed posterior sampling method. This is partly due to our learned prior and the implicit conditional prior used in the latent distribution correction, which make traditional MCMC-based comparisons challenging. To further validate our Bayesian inference procedure, we conduct a series of experiments in which we investigate the effect of gradually increasing the number of sources (``N``) and reducing the noise level. As the number of sources increases and the noise level decreases, we expect to see an increase in seismic image quality and a decrease in uncertainty.

##### Estimation accuracy

The accuracy of the Bayesian parameter estimation method is directly affected by the amount of data that has been collected [@tarantola2005inverse]. That is to say, with more observed data (larger ``N``\), we should be able to obtain a more accurate seismic image estimate. The same principle allows us to stack out noise when increasing the fold in seismic data acquisition. In order to assess whether our Bayesian inference approach has this property, we repeat the latent distribution correction process while varying the number of sources (using ``N = 25, 51, 102`` sources) and the amount of band-limited noise (standard deviations ``1.5, 2.0, 2.5``, and ``3.0`` times greater than the noise standard deviation during pretraining). Each of the ``12`` latent distribution correction problems is treated in the same way with respect to the number of passes made over the shot records and other optimization parameters. For each of the ``12`` combinations of source numbers and noise levels, we calculate the corrected conditional mean estimate and plot the SNRs as a function of the noise standard deviation in Figure #avi_ldc_post_accuracy_snr\. For a fixed number of sources (``25``, ``51``, and ``102`` sources shown with red, green, and blue colors, respectively), we plot the SNR of the corrected conditional mean as a function of the noise standard deviation. There is a clear increasing trend in SNR as we decrease the noise level. In the same way, for each fixed noise level, the SNR increases with the number of sources. This verifies that our Bayesian inference method yields a more accurate conditional mean estimate for larger numbers of sources and smaller noise levels.

### Figure: {#avi_ldc_post_accuracy_snr}
![](figures/Figure_18.png){width=50%}
:Estimation accuracy as a function of number of sources and noise levels. Colors correspond to different source numbers.

##### Bayesian posterior contraction

An alternative Bayesian inference verification method involves analyzing the Bayesian posterior contraction, that is, the decrease of uncertainty with more data. To examine whether or not our Bayesian inference method possesses this property, we visually inspect the resulting pointwise standard deviation plots in Figure #all_stds for the ``12`` possible combinations of source numbers and noise levels. Each row corresponds to the pointwise standard deviation plot for a fixed noise standard deviation (``\sigma``), where the number of sources (``N``) decreases from left to right. In each column, we maintain the number of sources and plot the pointwise standard deviation as we increase the noise standard deviation from top to bottom. There is a consistent increase in standard deviation values as we move from the top-left to the bottom-right corner. In other words, the posterior contracts (shrinks) when we have more data (more sources; Figure #all_stds from right to left) and when we have less noise (Figure #all_stds from bottom to top), which effectively also means more informative data.

### Figure: {#all_stds}
![](figures/Figure_19.png){width=100.0%}
:Bayesian posterior contraction: visual inspection. Pointwise standard deviations for varying numbers of sources (decreasing from left to right) and noise variances (increasing from top to bottom).

Figure #all_stds_box offers an alternative way of visualizing posterior contraction, displaying box plots of the standard deviation values for each of the ``12`` images in Figure #all_stds\. In each of the three box plots, the vertical axis corresponds to the noise standard deviation, and the horizontal axis represents the possible values in the posterior pointwise standard deviation plots. The box indicates the values between the first and third quartiles (where half of the values fall), and the line in the middle indicates the median. Figures #avi_ldc_0_nsrc25 to #avi_ldc_0_std_nsrc102 show box plots for experiments with ``25``, ``51``, and ``102`` sources, respectively, with each box plot color reflecting a particular noise level. In each of the Figures #avi_ldc_0_nsrc25 to #avi_ldc_0_std_nsrc102\, we observe a decrease in the range of posterior standard deviation values, including the median and quartiles, as we lower the noise level. Similarly, for the same noise levels, that is, box plots of the same color, the standard deviations decrease from Figure #avi_ldc_0_nsrc25 to Figure #avi_ldc_0_std_nsrc102 (increasing number of sources). The observed trends in Figures #all_stds and #all_stds_box verify that our Bayesian inference method exhibits the Bayesian posterior contraction property.

### Figure: {#all_stds_box}
![](figures/Figure_20a.png){width=33.33% #avi_ldc_0_nsrc25} ![](figures/Figure_20b.png){width=33.33% #avi_ldc_0_nsrc51} ![](figures/Figure_20c.png){width=33.33% #avi_ldc_0_std_nsrc102}
:Box plots of pointwise standard deviation values as a function of noise level for number of sources (a) ``N=25``\, (b) ``N=51``\, and (c) ``N=102``\.
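
Both verification checks, estimation accuracy and posterior contraction, can be reproduced in closed form for a toy linear-Gaussian problem, which may help build intuition for the trends reported above. In the sketch below (an illustrative stand-in for, not a reproduction of, our seismic experiments), a scalar unknown with a standard Gaussian prior is observed ``N`` times with noise standard deviation ``\sigma``; its posterior variance, ``(N/\sigma^2 + 1)^{-1}``, shrinks as ``N`` grows and as ``\sigma`` decreases, and the posterior-mean error shrinks with it on average.

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = 1.0  # scalar unknown; prior is the standard Gaussian N(0, 1)

for n_obs in (25, 51, 102):            # mirrors the source counts used above
    for noise_std in (1.5, 2.0, 2.5, 3.0):
        y = x_true + noise_std * rng.standard_normal(n_obs)
        # Conjugate linear-Gaussian posterior: precision = N / sigma^2 + 1.
        post_var = 1.0 / (n_obs / noise_std**2 + 1.0)
        post_mean = post_var * y.sum() / noise_std**2
        # Posterior std contracts deterministically; the error is noisy but
        # decreases on average with more data and less noise.
        print(f"N={n_obs:3d} sigma={noise_std:.1f} "
              f"posterior std={np.sqrt(post_var):.3f} "
              f"abs error={abs(post_mean - x_true):.3f}")
```
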

### Shift in the forward model and prior distribution

This example is intended to demonstrate how our latent distribution correction method can perform Bayesian inference when the unknown ground-truth image has properties that differ from the seismic images in the pretraining dataset. Our expectation that the method can handle this scenario is based on the invertible nature of conditional normalizing flows, which allows them to represent any image in the image space [@pmlr-v119-asim20a]. To mimic this scenario, we extract two 2D seismic images from deeper sections of the Parihaka dataset, which differ from the images used during pretraining. In these sections, there are fewer continuous reflectors and more complex geological features. Figures #true_model_ldc_1 and #true_model_ldc_2 show two images with drastically different geological features when compared to Figure #true_model\. Following the pretraining data acquisition setup, we add a water column on top of these images to reduce the near-source artifacts. Similar to the previous experiment involving forward model shifts, we use only 25 sources with a 200-meter source sampling and add noise with a ``2.5`` times larger standard deviation than during the pretraining phase. With this setup, the noisy observed data associated with experiments involving the seismic images in Figures #true_model_ldc_1 and #true_model_ldc_2 have SNRs of ``-1.56\,\mathrm{dB}`` and ``-2.41\,\mathrm{dB}``\, respectively. As part of our analysis, we compute the reverse-time (second row in Figure #seismic_ldc_12_setup\) and least-squares (third row in Figure #seismic_ldc_12_setup\) migrated images for these two ground-truth images, where the former images serve as inputs to the pretrained conditional normalizing flow. Similar to the previous example, the reverse-time migrated images contain the near-source artifacts and are contaminated by the input noise. Moreover, the least-squares migrated images highlight the importance of including prior information in this imaging problem, since these images contain strong noise-related artifacts, which might impact downstream tasks, such as horizon tracking.

### Figure: {#seismic_ldc_12_setup}
![](figures/Figure_21a.png){width=50% #true_model_ldc_1} ![](figures/Figure_21b.png){width=50% #true_model_ldc_2} \
![](figures/Figure_21c.png){width=50% #rtm_ldc_1} ![](figures/Figure_21d.png){width=50% #rtm_ldc_2} \
![](figures/Figure_21e.png){width=50% #lsrtm_ldc_1} ![](figures/Figure_21f.png){width=50% #lsrtm_ldc_2}
:Setup for experiments involving shifts in the prior distribution. (a) and (b) Two high-fidelity ground-truth images from deeper sections of the Parihaka dataset. (c) and (d) Reverse-time migrated images corresponding to the ground-truth images in Figures #true_model_ldc_1 (SNR ``-8.02\,\mathrm{dB}``) and #true_model_ldc_2 (SNR ``-8.49\,\mathrm{dB}``), respectively. (e) and (f) Least-squares imaging results (no regularization) corresponding to the ground-truth images in Figures #true_model_ldc_1 (SNR ``4.94\,\mathrm{dB}``) and #true_model_ldc_2 (SNR ``5.59\,\mathrm{dB}``), respectively.

#### Conditional mean and pointwise standard deviations

To obtain samples from the posterior distribution, we feed the reverse-time migrated images (Figures #rtm_ldc_1 and #rtm_ldc_2\) into the pretrained conditional normalizing flow. The posterior is sampled using latent samples drawn either from the standard Gaussian distribution or from the corrected latent distribution.
It is apparent, once more, that the uncorrected conditional mean estimates are significantly contaminated by artifacts in the near-source region (Figures #ldc_1_avi_cm and #ldc_2_avi_cm\). Another noticeable error in these estimates is the lower amplitude of reflectors that are deeper or closer to the boundaries. A comparison of the conditional mean estimates before (Figures #ldc_1_avi_cm and #ldc_2_avi_cm\) and after (Figures #ldc_1_cm and #ldc_2_cm\) latent distribution correction indicates attenuation of near-source artifacts as well as an improvement in reflector illumination near the boundaries and in deeper sections, where imaging is more difficult. Following latent distribution correction, the conditional mean estimate SNR is improved by three to four decibels for both images. To provide more quantitative results, we ran the experiments as set up in this section for eight additional seismic images sampled from deeper sections of the Parihaka dataset. The SNRs of the estimated seismic images before and after latent distribution correction are ``5.91 \pm 0.49\,\mathrm{dB}`` and ``9.15 \pm 0.73\,\mathrm{dB}``\, respectively.

### Figure: {#seismic_ldc_12_results_cm}
![](figures/Figure_22a.png){width=50% #ldc_1_avi_cm} ![](figures/Figure_22b.png){width=50% #ldc_2_avi_cm} \
![](figures/Figure_22c.png){width=50% #ldc_1_cm} ![](figures/Figure_22d.png){width=50% #ldc_2_cm}
:The improvement in the conditional mean estimate due to latent distribution correction. (a) and (b) Amortized uncorrected conditional (posterior) mean estimates with SNRs ``5.47\,\mathrm{dB}`` and ``6.17\,\mathrm{dB}``\, respectively. (c) and (d) Conditional (posterior) mean estimates after latent distribution correction with SNRs ``9.40\,\mathrm{dB}`` and ``9.11\,\mathrm{dB}``\, respectively.

Careful inspection of the corrected conditional mean estimates reveals some nonrealistic events near the boundaries. The pointwise standard deviation plots (Figures #ldc_1_std and #ldc_2_std\) associated with the corrected conditional means, however, clearly indicate that there is considerable uncertainty associated with these events. This illustrates the importance of uncertainty quantification and of not relying on a single estimate when addressing ill-posed inverse problems. To diminish the imprint of strong reflectors in the pointwise standard deviation plots, we also display these images normalized with respect to the envelopes of the conditional mean estimates in Figures #ldc_1_std_normalized and #ldc_2_std_normalized\.

### Figure: {#seismic_ldc_12_results_std}
![](figures/Figure_23a.png){width=50% #ldc_1_std} ![](figures/Figure_23b.png){width=50% #ldc_2_std} \
![](figures/Figure_23c.png){width=50% #ldc_1_std_normalized} ![](figures/Figure_23d.png){width=50% #ldc_2_std_normalized} \
:Uncertainty quantification with latent distribution correction. (a) and (b) The pointwise standard deviation estimates among samples drawn from the posterior after latent distribution correction. (c) and (d) Pointwise standard deviations after latent distribution correction, normalized by the envelope of the conditional mean estimates.

#### Data residuals

As before, we confirm that the latent distribution correction improves the fit of the data. Figure #data_QC_ldc_12 shows the predicted data as well as the data residuals for all conditional mean estimates.
The predicted data are obtained by applying the forward operator to the conditional mean estimates, both before and after latent distribution correction. The predicted data before and after correction are presented in the first and second columns, respectively. The corresponding data residuals before and after correction, which are computed by subtracting the predicted data from the ideal noise-free data, can be seen in the third and last columns of Figure #data_QC_ldc_12\. Evidently, the latent distribution correction stage has resulted in a better fit with the observed data, as coherent data events show up in Figures #d-error_avi_ldc_1 and #d-error_avi_ldc_2\, but are attenuated in the corrected residual plots (Figures #d-error_ldc_1 and #d-error_ldc_2\).

### Figure: {#data_QC_ldc_12 .wide}
![](figures/Figure_24a.png){width=25% #d-pred_avi_ldc_1} ![](figures/Figure_24b.png){width=25% #d-pred_ldc_1} ![](figures/Figure_24c.png){width=25% #d-error_avi_ldc_1} ![](figures/Figure_24d.png){width=25% #d-error_ldc_1} \
![](figures/Figure_24e.png){width=25% #d-pred_avi_ldc_2} ![](figures/Figure_24f.png){width=25% #d-pred_ldc_2} ![](figures/Figure_24g.png){width=25% #d-error_avi_ldc_2} ![](figures/Figure_24h.png){width=25% #d-error_ldc_2}
:Quality control in data space. The first and second rows correspond to experiments involving Figures #true_model_ldc_1 and #true_model_ldc_2\, respectively. (a) and (e) Predicted data before correction with SNRs ``9.13\,\mathrm{dB}`` and ``9.02\,\mathrm{dB}``\, respectively. (b) and (f) Predicted data after latent distribution correction with SNRs ``15.27\,\mathrm{dB}`` and ``14.52\,\mathrm{dB}``\, respectively. (c) and (g) Data residuals (before correction) associated with Figures #d-pred_avi_ldc_1 and #d-pred_avi_ldc_2\, respectively. (d) and (h) Data residuals (after correction) associated with Figures #d-pred_ldc_1 and #d-pred_ldc_2\, respectively.

## Discussion

The examples presented demonstrate that deep neural networks trained in the context of amortized variational inference can facilitate solving inverse problems in two ways: (1) incorporating prior knowledge gained through pretraining; and (2) accelerating Bayesian inference and uncertainty quantification. Despite the fact that amortized variational inference is capable of sampling the posterior distribution without requiring forward operator evaluations during inference, the extent to which it is considered reliable is dependent upon the availability of high-quality training data. We demonstrate this limitation of amortized variational inference via a seismic imaging example where we alter the number of sources, the noise variance, and the geological features to be imaged in comparison to the pretraining phase. These variations shift the data distribution, and in certain cases, e.g., imaging slightly different geological features, this can also be thought of as evaluating the pretrained conditional normalizing flow on data samples from low-probability regions of the training distribution. Due to shifts in the data distribution, the posterior samples obtained via amortized variational inference in our numerical experiments had unusual near-source artifacts and less illuminated reflectors, which can be partly attributed to the absence of such artifacts in the training reverse-time migrated images.
As part of our efforts to extend the application of supervised amortized variational inference methods to domains with limited access to high-quality training data, we developed an unsupervised, physics-based variational inference formulation over the latent space of a pretrained conditional normalizing flow that mitigates some of the posterior sampling errors induced by data distribution shifts. In this approach, a diagonal correction to the latent distribution is learned that ensures that the conditional normalizing flow output distribution better matches the desired posterior distribution. Based on our observations and related research [@pmlr-v119-asim20a; @orozco2021photoacoustic; @siahkoohi2022EAGEweb], we found that normalizing flows, because of their invertibility, are capable of partially mitigating errors related to changes in data distributions when they are used to solve inverse problems. We leave the delineation of which data distribution shifts can be handled by our diagonal correction approach to future work. Needless to say, more complex transformations in the latent space, e.g., parameterized by a neural network, could handle a wider range of data distribution shifts, but they would potentially require a more computationally costly correction step. Latent distribution correction requires several passes over the data (five in our example), which include evaluations of the forward operator and the adjoint of its Jacobian. Due to the amortized nature of our approach, these costs are significantly lower than those incurred by other non-amortized variational inference methods [@rizzuti2020SEGpub; @andrle2021invertible; @zhang2021bayesian2; @zhao2022interrogating], e.g., the tens of thousands of forward operator evaluations reported by @zhao2021bayesian in the context of travel-time tomography. By amortizing over the data, the pretrained conditional normalizing flow provides posterior samples for previously unseen data (drawn from the same distribution as the training data) without the need for latent distribution correction. In the presence of moderate data distribution shifts, the learned posterior is adjusted to new observed data via the latent distribution correction step, which could be considered an instance of transfer learning [@yosinski2014transferable]. In contrast, existing methods that perform variational inference in the latent space [@whang2021composing; @kothari2021trumpets] require a more substantial modification to the latent distribution, as the pretrained model in these methods originally provides samples from the prior distribution. For large-scale inverse problems, such as seismic imaging, where solving a partial differential equation is required to evaluate the forward operator, reduced computational costs of Bayesian inference are particularly important. Our approach reduces the computational costs associated with Bayesian inference in such problems while also complementing the inversion with a learned prior. More research is required to quantify the extent to which a diagonal correction to the latent distribution mitigates errors resulting from data distribution shifts. While the examples presented in this paper pertain to a linear inverse problem, variational inference can also be applied to nonlinear problems [@zhang2020seismic; @zhang2021bayesian2; @zhang2021introduction; @li2021traversing; @Curtis3d].
However, in the context of amortized variational inference, there are two main considerations that need to be taken into account when dealing with large-scale nonlinear inverse problems. First, the parametric surrogate (conditional) distribution should be capable of approximating multi-modal densities. Invertible neural networks [@dinh2016density] are a suitable choice for this parameterization as they are known to be universal approximators [@NEURIPS2020_2290a738; @ishikawa2022universal], meaning that invertible networks are capable of approximating a general class of diffeomorphisms. In other words, as long as there is a smooth invertible map between the latent space and the desired posterior distribution, invertible neural networks can be used to approximate the posterior distribution. The second consideration regards finding low-dimensional summary statistics [@radev2020bayesflow] of the observed data to avoid training the conditional normalizing flow on high-dimensional data. In our linear seismic imaging example, we "summarized" seismic data (shot records) via the reverse-time migrated image. The conditions under which this summarization does not negatively bias the outcome of Bayesian inference (in the context of linear inverse problems) are described by @orozco2022SPIEadjoint\. Further work is needed to design low-dimensional summary statistics for nonlinear inverse problems in order to successfully apply our framework to large-scale nonlinear inverse problems. As far as seismic imaging is concerned, uncertainty can be attributed to two main sources [@thore2002structural; @osypov2013model; @Ely2018]: (1) data errors, including measurement noise; and (2) modeling errors, including linearization errors, which diminish with the accuracy of the background velocity model. Our research focuses on the first source of uncertainty, and we will explore how variational inference models can be used to capture errors in background models in future work. In contrast to the problem highlighted in this paper, capturing the uncertainty caused by errors in the background model would require generating training data involving imaging experiments for a variety of plausible background velocity models, which would be computationally intensive. Recent developments in Fourier neural operators [@li2021fourier; @siahkoohi2022SEGvcw] may prove to be useful in addressing this problem.

## Conclusions

In high-dimensional inverse problems with computationally expensive forward modeling operators, Bayesian inference is challenging due to the cost of sampling the high-dimensional posterior distribution. The computational costs associated with the forward operator often limit the applicability of sampling the posterior distribution with traditional Markov chain Monte Carlo and variational inference methods. Added to the computational challenges are the difficulties associated with selecting a prior distribution that encodes our prior knowledge while not negatively biasing the outcome of Bayesian inference. Amortized variational inference addresses these challenges by incurring an offline initial training cost for a deep neural network that can approximate the posterior distribution for previously unseen observed data, distributed according to the training data distribution.
When high-quality training data is readily available, and as long as there are no shifts in the data distribution, the pretrained network is capable of providing samples from the posterior distribution for previously unseen data at virtually no additional cost. Unfortunately, in certain domains, such as geophysical inverse problems, where the structure of the Earth's subsurface is unknown, it can be challenging to obtain a training dataset, e.g., a collection of images of the subsurface, which statistically captures the strong heterogeneity exhibited by the Earth's subsurface. Furthermore, changes to the data generation process could negatively influence the quality of Bayesian inference with amortized variational inference due to generalization errors associated with neural networks. To address these challenges while exploiting the computational benefits of amortized variational inference, we proposed a data-specific, physics-based, and computationally cheap correction to the latent distribution of a conditional normalizing flow, pretrained via an amortized variational inference objective. This correction involves solving a variational inference problem in the latent space of the pretrained conditional normalizing flow, where we obtain a diagonal correction to the latent distribution such that the predicted posterior distribution more closely matches the desired posterior distribution. Using a seismic imaging example, we demonstrate that the proposed latent distribution correction, at a cost of five reverse-time migrations, can be used to mitigate the effects of data distribution shifts, which include changes in the forward model as well as the prior distribution. Our evaluation indicated improvements in seismic image quality, comparable to least-squares imaging, after the latent distribution correction step, as well as an estimate of the uncertainty of the image. We presented the pointwise standard deviation as a measure of uncertainty in the image, which indicated an increase in variability in complex geological areas and poorly illuminated areas. This approach will enable uncertainty quantification in large-scale inverse problems, which otherwise would be computationally expensive to achieve.

## Related material

The latent distribution correction optimization problem (equation #reverse_kl_covariance_diagonal\) involves computing gradients of the composition of the forward operator and the pretrained conditional normalizing flow with respect to the latent variable. Computing this gradient requires actions of the forward operator and the adjoint of its Jacobian. In our numerical experiments, these operations involved solving wave equations. For maximal numerical performance, we use [JUDI] [@witteJUDI2019] to construct wave-equation solvers, which utilizes the just-in-time [Devito] [@devito-compiler; @devito-api] compiler for the wave-equation-based simulations. The invertible network architectures are implemented using [InvertibleNetworks.jl] [@invnet], a memory-efficient framework for training invertible networks in the Julia programming language. For more details on our implementation, please refer to our code at [https://github.com/slimgroup/ReliableAVI.jl](https://github.com/slimgroup/ReliableAVI.jl).
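
As a small illustration of this point, the sketch below (in Python/PyTorch, with a random matrix standing in for a wave-equation-based operator) checks that reverse-mode automatic differentiation of the data-misfit term applies exactly the adjoint of the operator's Jacobian, which for a linear operator ``A`` is ``A^{\top}``; all names and sizes are illustrative and unrelated to JUDI or InvertibleNetworks.jl.

```python
import torch

torch.manual_seed(0)
A = torch.randn(32, 64)          # stand-in for a linearized forward operator
x = torch.randn(64, requires_grad=True)
y = torch.randn(32)              # stand-in for observed data

# Data-misfit term; reverse-mode autodiff backpropagates through A.
misfit = 0.5 * torch.sum((y - A @ x) ** 2)
misfit.backward()

# Analytic gradient: -A^T (y - A x), i.e., an action of the adjoint.
grad_analytic = -A.T @ (y - A @ x.detach())
print(torch.allclose(x.grad, grad_analytic, atol=1e-5))  # True
```
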
[JUDI]:https://github.com/slimgroup/JUDI.jl
[Devito]:https://www.devitoproject.org/
[InvertibleNetworks.jl]:https://github.com/slimgroup/InvertibleNetworks.jl

## Acknowledgments

This research was carried out with the support of the Georgia Research Alliance and partners of the ML4Seismic Center.

## Appendix A

### Derivation of latent distribution correction objective function

The optimization problem for correcting the latent distribution involves minimizing the KL divergence between the Gaussian relaxation of the latent distribution ``\mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big)`` and the shifted latent distribution ``p_{\phi} (\B{z} \mid \B{y}_{\text{obs}})``, which is conditioned on an instance of out-of-distribution data ``\B{y}_{\text{obs}} \sim \widehat{p}_{\text{data}} (\B{y})``\. This is an instance of non-amortized variational inference (see equations #reverse_kl_divergence and #vi_reverse_kl\) defined over the latent variable:

```math #reverse_kl_covariance_diagonal_derivation
& \argmin_{\Bs{\mu}, \B{s}}\, \KL\,\big(\mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big) \mid\mid p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) \big) \\
& = \argmin_{\Bs{\mu}, \B{s}}\, \mathbb{E}_{\B{z} \sim \mathrm{N} (\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2)} \bigg [ -\log p_{\phi} (\B{z} \mid \B{y}_{\text{obs}}) + \log \mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big) \bigg ].
```

The above expression can be further simplified by rewriting the expectation with respect to ``\mathrm{N} \big(\B{z} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big)`` as the expectation with respect to a standard Gaussian distribution, followed by an elementwise scaling by ``\B{s}`` and a shift by ``\Bs{\mu}`` [see reparameterization trick in @Kingma2014], i.e.,

```math #change-expectation
\argmin_{\Bs{\mu}, \B{s}}\, \mathbb{E}_{\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I})} \bigg [ -\log p_{\phi} (\B{s} \odot\B{z} + \Bs{\mu} \mid \B{y}_{\text{obs}}) + \log \mathrm{N} \big(\B{s} \odot\B{z} + \Bs{\mu} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big) \bigg ].
```

The last term in the expectation in equation #change-expectation is the log-density of a Gaussian distribution, which is equal to:

```math #gaussian_density
& \log \mathrm{N} \big(\B{s} \odot\B{z} + \Bs{\mu} \mid \Bs{\mu}, \operatorname{diag}(\B{s})^2\big) \\
& = -\frac{D}{2} \log (2 \pi) -\frac{1}{2} \log \Big | \det \operatorname{diag}(\B{s})^2 \Big | -\frac{1}{2} \big(\B{s} \odot\B{z} + \Bs{\mu} -\Bs{\mu} \big)^{\top} \operatorname{diag}(\B{s})^{-2} \big(\B{s} \odot\B{z} + \Bs{\mu} -\Bs{\mu} \big) \\
& = - \log \Big | \det \operatorname{diag}(\B{s}) \Big | -\frac{1}{2} \big\| \B{z} \big\|_2^2 + \text{const} \\
& = - \log \Big | \det \operatorname{diag}(\B{s}) \Big | + \text{const}.
```

In the above equation, ``D`` is the dimension of ``\B{z}``\, and the constants represent terms that are not a function of ``\Bs{\mu}`` or ``\B{s}``\. In particular, the ``-\frac{1}{2} \| \B{z} \|_2^2`` term can be absorbed into the constant because ``\B{z}`` is drawn from the standard Gaussian distribution, so its expectation does not depend on the optimization variables.
By inserting equation #gaussian_density into equation #change-expectation, we arrive at the following objective function for the latent distribution correction:

```math #reverse_kl_covariance_diagonal_final
\argmin_{\Bs{\mu}, \B{s}}\, \mathbb{E}_{\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I} )} \bigg [ -\log p_{\phi} (\B{s} \odot\B{z} + \Bs{\mu} \mid \B{y}_{\text{obs}}) - \log \Big | \det \operatorname{diag}(\B{s}) \Big | \bigg ].
```

Finally, we use Bayes' rule and insert the shifted latent density function from equation #physics_informed_density to arrive at the objective function for the latent distribution correction (equation #reverse_kl_covariance_diagonal\):

```math #reverse_kl_covariance_diagonal-derived
\argmin_{\Bs{\mu}, \B{s}}\, \mathbb{E}_{\B{z} \sim \mathrm{N} (\B{z} \mid \B{0}, \B{I})} \bigg [ & \frac{1}{2 \sigma^2} \sum_{i=1}^{N} \big \| \B{y}_{\text{obs}, i} - \mathcal{F}_i \circ f_{\phi} \big(\B{s} \odot\B{z} + \Bs{\mu}; \B{y}_{\text{obs}} \big) \big\|_2^2 \\
& + \frac{1}{2} \big\| \B{s} \odot \B{z} + \Bs{\mu} \big\|_2^2 - \log \Big | \det \operatorname{diag}(\B{s}) \Big | \bigg ].
```

```math_def
\newcommand{\R}{\mbox{$\mathbb{R}$}}
\newcommand{\B}{\mathbf}
\newcommand{\Bs}{\boldsymbol}
\newcommand{\KL}{\mathbb{KL}}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
```
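As a quick numerical sanity check of the simplification in equation #gaussian_density (a standalone sketch that is not part of our implementation; the ``MvNormal`` density from the Distributions.jl package is used here only for verification), the following Julia snippet confirms that the diagonal Gaussian log-density, evaluated at the reparameterized sample, reduces to the closed form in the second-to-last line of equation #gaussian_density, namely ``-\log|\det\operatorname{diag}(\B{s})| - \frac{1}{2}\|\B{z}\|_2^2 - \frac{D}{2}\log(2\pi)``\.

```julia
# Numerical check of equation #gaussian_density: for x = s ⊙ z + μ, the
# log-density of N(μ, diag(s)^2) evaluated at x equals
# -Σᵢ log|sᵢ| - ||z||²/2 - (D/2) log(2π).
using Distributions, LinearAlgebra, Random

Random.seed!(0)
D = 8
μ = randn(D)
s = 0.5 .+ rand(D)                   # positive diagonal scales
z = randn(D)                         # standard Gaussian sample
x = s .* z .+ μ                      # reparameterized sample

logpdf_full   = logpdf(MvNormal(μ, Diagonal(s .^ 2)), x)
logpdf_closed = -sum(log.(abs.(s))) - 0.5 * norm(z)^2 - 0.5 * D * log(2π)

println(logpdf_full ≈ logpdf_closed)  # prints "true"
```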