--- title: Uncertainty quantification for inverse problems with weak PDE-constraints runninghead: UQ with weak PDE-constraints author: | Zhilong Fang^1^, Curt Da Silva^2^, Rachel Kuske^3^, and Felix J. Herrmann^1,^^4^\ ^1^Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia\ ^2^Department of Mathematics, University of British Columbia\ ^3^School of Mathematics, Georgia Institute of Technology\ ^4^School of Earth and Atmospheric Sciences, Georgia Institute of Technology bibliography: - zfang.bib --- ## Abstract: In a statistical inverse problem, the objective is to obtain a complete statistical description of the unknown parameters from noisy observations in order to quantify their uncertainties. We consider inverse problems with partial-differential-equation-constraints, which are applicable to a variety of seismic problems. Bayesian inference is one of the most widely used approaches to quantify these statistics through a posterior distribution, which incorporates uncertainties in the observed data, the modeling kernel, and one's prior knowledge of the parameters. Typically, when formulating the posterior distribution, the partial-differential-equation-constraints are required to be exactly satisfied, resulting in a highly nonlinear forward map and a posterior distribution with many local maxima. These drawbacks make it difficult to find an appropriate approximation for the posterior distribution. Another complicating factor is that traditional Markov chain Monte Carlo methods are known to converge slowly for realistically sized problems. In this work, we relax the partial-differential-equation-constraints by introducing an auxiliary variable, which allows for Gaussian deviations in the partial-differential-equations. Thus, we obtain a new bilinear posterior distribution consisting of both data and partial-differential-equation misfit terms. We illustrate that for a particular range of variance choices for the partial-differential-equation misfit term, the new posterior distribution has fewer modes and can be well-approximated by a Gaussian distribution, which can then be sampled in a straightforward manner. Since it is prohibitively expensive to explicitly construct the dense covariance matrix of the Gaussian approximation for intermediate to large-scale problems, we present a method to implicitly construct it, which enables efficient sampling. We apply this framework to two-dimensional seismic inverse problems with ``1,800`` and ``92,455`` unknown parameters. The results illustrate that our framework can produce statistical quantities comparable to those produced by conventional Markov chain Monte Carlo type methods while requiring far fewer partial-differential-equation solves, which are the main computational bottleneck in these problems. ## Introduction Inverse problems with partial-differential-equation (PDE) constraints arise in many applications of geophysics [@Tarantola1982FWI; @pratt1999seismic; @haber2000optimization; @epanomeritakis2008newton; @VirieuxOverview2009]. The goal of these problems is to infer the values of unknown spatial distributions of physical parameters (e.g., sound speed, density, or electrical conductivity) from indirectly measured data, where the underlying physical model is described by a PDE (e.g., the Helmholtz equation or Maxwell's equations).
The most challenging aspects of these problems arise from the fact that they are typically multimodal, with many spurious local minima [@biegler2012large], which can inhibit gradient-based optimization algorithms from estimating the true parameters successfully. This multimodality stems in part from the fact that the observed data are measured on a small subset of the entire boundary of the domain [@bui2013computational] and in part from the nonlinearity of the parameter-to-data forward map [@vanLeeuwen2013GJImlm; @vanleeuwen2015IPpmp]. One approach to dealing with the multimodality is to formulate the inverse problem as a deterministic optimization problem that aims at minimizing the misfit between the predicted and observed data in an appropriate norm, while also adding a regularization term that may eliminate the nonconvexity in certain situations [@VirieuxOverview2009; @MartinMcMC2012]. The result of this deterministic approach is an estimate of the model parameters that is consistent with the observed data and contains few unwanted features. Since observed data typically contain measurement noise and modeling errors, we are not only interested in an estimate that best fits the data, but also in a complete statistical description of the unknown model parameters [@Tarantola1982Bayesian; @osypov2013model]. To that end, statistical approaches, and in particular the Bayesian inference method, are desirable and necessary. Unlike in the deterministic case, the solution produced by Bayesian inference is a posterior probability density function (PDF), which incorporates uncertainties in the observed data, the forward modeling map, and one's prior knowledge of the parameters. Once we can tractably compute the posterior distribution, we can extract various statistical properties of the unknown parameters. Bayesian inference methods have been applied to a number of PDE-constrained geophysical statistical inverse problems [@MartinMcMC2012; @bui2013computational; @zhu2016bayesian; @ely2017assessing]. In these reported works, the PDE is typically treated as a strict constraint when formulating the posterior PDF, i.e., the field variables should always exactly satisfy the PDE. This leads to the so-called reduced or adjoint-state method [@plessix2006review; @hinze2008optimization] that eliminates the field variables by solving the PDE, resulting in a posterior PDF with multiple modes. To study the posterior PDF, Markov chain Monte Carlo (McMC) type methods, including Metropolis-Hastings based methods [@haario2006dram; @stuart2016two; @ely2017assessing], the stochastic Newton-type method [@MartinMcMC2012], and the randomize-then-optimize (RTO) method [@bardsley2015randomize], sample the posterior PDF by drawing samples from a proposal distribution followed by an accept/reject step. To compute the accept/reject ratio, these methods have to evaluate the posterior PDF for each sample, which leads to solving a large number of computationally expensive PDEs. Moreover, according to the scaling analysis by @roberts2001optimal, McMC type methods require a significantly larger number of samples to reach convergence for large-scale problems than for small-scale problems, a phenomenon well known as the curse of dimensionality. These difficulties preclude the straightforward application of these methods to large-scale problems with more than ``10^6`` unknown parameters.
### Contributions In this work, we present a new formulation of the posterior distribution that subsumes the conventional reduced formulation as a special case. Instead of treating the PDE as a strict constraint and eliminating the field variables by solving the PDE, we relax the PDE-constraint by introducing the field variables as auxiliary variables. This idea is similar to the method of @vanleeuwen2015IPpmp applied to deterministic PDE-constrained optimization problems, in which the PDE misfit is treated as a penalty term in the misfit function and weighted by a penalty parameter. Moreover, the idea of relaxing the PDE-constraint is also widely used in weather forecasting applications [@fisher2005equivalence] to improve the stability of the results. In the field of seismic exploration, @fang2015EAGEuqw and @zhilong2016uncertainty first introduced a method to construct a posterior PDF with weak acoustic wave-equation constraints. In a follow-up study by @Lim2017thesis, the authors showed that, for small-scale problems, this posterior PDF can be sampled by the randomized maximum likelihood McMC (RML-McMC) method. We demonstrate that the conventional reduced posterior PDF is a special case of our new formulation. By exploiting the structure of the new posterior PDF, we show that, with an appropriate penalty parameter, the new posterior PDF can be approximated by a Gaussian distribution, which is centered at the maximum a posteriori (MAP) estimate that maximizes the posterior PDF. To construct this Gaussian approximation, we exploit the local derivative information of the posterior PDF and formulate the covariance matrix as a PDE-free operator, which allows us to compute matrix-vector products without having to solve a large number of additional PDEs. By avoiding an explicit formulation of the covariance matrix, which would be impractical to compute and store, we can apply a bootstrap type method [@efron1981nonparametric; @efron1992bootstrap] --- the recently proposed randomize-then-optimize method [@bardsley2014randomize] --- to affordably draw samples from this surrogate distribution. We apply our new computational framework to several seismic wave-equation based inverse problems ranging in size and complexity of the underlying parameters. Our first example compares our sampling method with a benchmark method --- the randomized maximum likelihood (RML) method [@chen2012ensemble] --- to validate our Gaussian approximation on a simple model with 1800 unknown parameters. Next, we apply our computational framework to a more complex model with 92,455 unknown parameters to test the feasibility of the approach for more realistically sized problems. This paper is organized into three major sections. The first introduces the derivation of the posterior PDF and the corresponding sampling method in a general setting. The second section details each component of the general framework when applied to full-waveform inversion type problems. The final section presents the results of the application of our framework to several numerical inverse problems for velocity models of different sizes and complexities. ## Bayesian framework for inverse problems with a weak PDE-constraint {#weakpde} In a PDE-constrained inverse problem, the goal is to infer the unknown discretized ``\ngrid``-dimensional physical model parameters ``\mvec \in \RR^{\ngrid}`` from ``\ndata``-dimensional noisy observed data ``\dvec \in \CC^{\ndata}``.
As the noisy data are stochastic in nature, so are the inversion results obtained from them. Bayesian inference is a widely-used approach that seeks to estimate the posterior PDF of the unknown parameters ``\mvec`` by incorporating the statistics of the measurement and modeling error and one's prior knowledge of the underlying model. Mathematically, Bayesian inference applies Bayes' law to formulate the posterior PDF ``\rhopost(\mvec|\dvec)`` of the model parameters ``\mvec`` given the observed data ``\dvec`` by combining a likelihood PDF and a prior PDF as ```math #Bayes \rhopost(\mvec|\dvec) \propto \rholike(\dvec|\mvec)\rhoprior(\mvec), ``` where the likelihood PDF ``\rholike(\dvec|\mvec)`` describes the probability of observing the data ``\dvec`` given the model parameters ``\mvec``, and the prior PDF ``\rhoprior(\mvec)`` describes one's prior beliefs in the unknown model parameters. The proportionality constant depends on the observed data ``\dvec``, which are fixed. Once we have a computationally tractable estimate of the posterior PDF, we can apply certain sampling methods to draw samples from the posterior PDF, which can then be used to compute statistical properties of interest such as the MAP estimate, the mean value, the model covariance matrix, the model standard deviation (STD), and the marginal distributions of ``\mvec`` [@kaipio:2004; @matheron2012estimating]. The primary issue for statisticians is to construct the posterior PDF and design methods that can draw samples from it with affordable cost. ### The posterior PDF To motivate the derivation of the new posterior PDF, it is helpful to start with the conventional formulation of the posterior PDF for PDE-constrained inverse problems. The so-called reduced approach eliminates the PDE-constraint by solving the PDE, which leads to the following nonlinear forward modeling map ``F(\mvec)``: ```math #ConF F(\mvec) = \Pmat\A(\mvec)^{-1}\q. ``` Here the vector ``\q \in \CC^{\ngrid}`` represents the discretized (known) source term. The matrix ``\A(\mvec) \in \CC^{\ngrid \times \ngrid}`` denotes the discretized PDE operator and the operator ``\Pmat \in \RR^{\ndata \times \ngrid}`` samples the data ``\dvec`` from the vector of field variables ``\uvec``, which is the solution of the PDE ``\uvec = \A(\mvec)^{-1}\q``. In real-world seismic applications, the observed data always contain both correlated measurement noise arising from environmental disturbances and modeling errors, which are difficult to precisely quantify and model. One popular approach to simplify the problem is to assume that the combined measurement and modeling noise ``\epsilon \in \CC^{\ndata}`` is drawn from a Gaussian distribution with zero mean and covariance matrix ``\SIGMAN`` [@bui2013computational; @osypov2013model; @bardsley2015randomize], i.e., ``\epsilon \sim \NN(0,\SIGMAN)``, independent of ``\mvec``. The assumption of Gaussianity results in distributions that are relatively easy to model and sample, thereby providing a rich source of tractable examples [@kaipio:2004]. 
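Under these assumptions, synthetic observations are simply the reduced forward map of equation #ConF evaluated at the true parameters plus a noise realization, ``\dvec = F(\mvec) + \epsilon``. The following minimal sketch illustrates this setup; the matrix ``\A(\mvec)``, the operators, and the sizes are hypothetical stand-ins chosen only for illustration, not the discretizations used later in this paper.

```python
# Minimal sketch of the reduced forward map F(m) = P A(m)^{-1} q and of data
# generated under the additive Gaussian noise assumption d = F(m) + eps with
# eps ~ N(0, Sigma_n).  A(m), P, q, and all sizes are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_data = 100, 10

def A(m):
    # hypothetical diagonally dominant "PDE" matrix that depends on the parameters m
    return np.diag(1.0 + m) + 0.1 * np.eye(n_grid, k=1) + 0.1 * np.eye(n_grid, k=-1)

P = np.eye(n_data, n_grid)           # samples the field at the first n_data grid points
q = rng.standard_normal(n_grid)      # discretized (known) source term

def F(m):
    """Reduced forward map: one PDE solve u = A(m)^{-1} q, then restriction P u."""
    u = np.linalg.solve(A(m), q)
    return P @ u

m_true = rng.random(n_grid)
Sigma_n = 0.01 * np.eye(n_data)      # noise covariance
eps = rng.multivariate_normal(np.zeros(n_data), Sigma_n)
d_obs = F(m_true) + eps              # noisy observed data
```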
With this assumption in mind and the additional assumption that the prior distribution of the model ``\mvec`` is also Gaussian with the mean model parameters ``\mp`` and the covariance matrix ``\SIGMAP``, we arrive at the following posterior distribution: ```math #CPost \rhopost(\mvec|\dvec) \propto \exp\left ({-\frac{1}{2}\|F(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}}-\frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}}\right ), ``` where the symbol ``\|\centerdot\|_{\SIGMAN^{-1}}`` denotes the weighted ``\ell_{2}``-norm with the weighting matrix ``{\SIGMAN^{-1}}``. There are several challenges in computing various quantities associated with the posterior distribution in equation #CPost\. In order to obtain the MAP estimate ``\mMAP``, we need to solve the following deterministic optimization problem: ```math #AdjPrb \mMAP &= \argmax_{\mvec} \rhopost(\mvec|\dvec) = \argmin_{\mvec}-\log{\rhopost(\mvec|\dvec)} \\ & = \argmin_{\mvec}{\frac{1}{2}\|F(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}}+\frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}}, ``` which can be solved by the so-called adjoint-state method. As noted in @vanLeeuwen2013GJImlm\, the nonlinear forward modeling map ``F(\mvec)`` results in the objective function ``-\log{\rhopost(\mvec|\dvec)}`` being highly oscillatory with respect to the model parameters ``\mvec``, which yields many local minima. To find the globally optimal solution, a sufficiently close initial model is necessary, which may be difficult to obtain in real-world scenarios. As mentioned previously, the nonlinear parameter-to-data map also results in computational difficulties when sampling the posterior distribution in equation #CPost\. Specifically, the tradeoff between designing a proposal distribution that is well tuned to the true posterior distribution and one that is computationally cheap to sample is not a straightforward choice to make. As one models a proposal that is easier to sample, typically the price to pay is having to draw more samples until convergence is reached. These challenges result from the nonlinear forward modeling map ``F(\mvec)`` induced by the strict PDE-constraint in the optimization problem in equation #AdjPrb\. To overcome these difficulties, @vanLeeuwen2013GJImlm and @vanleeuwen2015IPpmp proposed a penalty formulation to solve deterministic PDE-constrained optimization problems, wherein they relax the strict PDE-constraint by penalizing the data misfit function by a weighted PDE misfit with a penalty parameter ``\lambda``. This results in the following joint optimization problem with respect to both the model parameters ``\mvec`` and the field variables collected in the vector ``\uvec``: ```math #PenObj \argmin_{\mvec,\uvec} \fpen(\mvec,\uvec)=\frac{1}{2}\|\Pmat\uvec - \dvec\|^{2}_{\SIGMAN^{-1}}+\frac{\lambda^2}{2}\|\A(\mvec)\uvec-\q\|^2+\frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}. ``` The authors note that the problem in equation #PenObj is a separable nonlinear least-squares problem, in which the optimization with respect to ``\uvec`` is a linear data-fitting problem when ``\mvec`` is fixed. In @vanleeuwen2015IPpmp\, the authors eliminate the field variables ``\uvec`` by the variable projection method [@golub2003separable] in order to avoid the high memory costs involved in storing a unique field variable for each individual source. The variable projection method also eliminates the dependence of the objective function in equation #PenObj on ``\uvec``. 
This is possible because, for each input parameter ``\mvec``, there is a unique ``\overline{\uvec}(\mvec)`` satisfying ``\nabla_{\uvec} \fpen(\mvec,\uvec)|_{\uvec=\overline{\uvec}(\mvec)} = 0,`` which has the closed-form solution ```math #OptU1 \overline{\uvec}({\mvec}) = \big (\lambda^{2}\A({\mvec})^{\top}\A({\mvec}) + \Pmat^{\top}\SIGMAN^{-1}\Pmat\big )^{-1}\big (\lambda^{2}\A({\mvec})^{\top}\q + \Pmat^{\top}\SIGMAN^{-1}\dvec\big ), ``` where the symbol ``\top`` denotes the (complex-conjugate) transpose. As noted by @vanLeeuwen2013GJImlm and @vanleeuwen2015IPpmp, minimizing the objective function (equation #PenObj) with a carefully selected ``\lambda`` is less prone to being trapped in suboptimal local minima, because the inversion is carried out over a larger search space (implicitly through ``\overline{\uvec}({\mvec})``) and it is therefore easier to fit the observed data for poor starting models compared to the conventional reduced formulation in equation #AdjPrb\. Motivated by the penalty approach to solving deterministic inverse problems, we propose a more general posterior PDF for statistical PDE-constrained inverse problems. As before, we relax the PDE-constraint by introducing the field variables ``\uvec`` as auxiliary variables, i.e., we have ```math #PostDP \rhopost(\uvec,\mvec|\dvec) \propto \rho(\dvec|\uvec,\mvec)\rho(\uvec,\mvec), ``` where the conditional PDF ``\rho(\dvec|\uvec,\mvec)`` now describes the probability of observing the data ``\dvec`` given the field variables ``\uvec`` and model parameters ``\mvec``. To formulate the joint PDF ``\rho(\uvec,\mvec)``, we apply the standard conditional decomposition [@sambridge2006trans] ```math #PrptPDF \rho(\uvec,\mvec) = \rho(\uvec|\mvec)\rhoprior(\mvec). ``` Hence, ```math #PostDP2 \rhopost(\uvec,\mvec|\dvec) \propto \rho(\dvec|\uvec,\mvec)\rho(\uvec|\mvec)\rhoprior(\mvec). ``` The implication of the reduced formulation is that the field variables ``\uvec`` satisfy the PDE strictly---i.e., ``\uvec=\A(\mvec)^{-1}\q``. This adherence to the PDE corresponds to solutions ``\uvec`` that satisfy the PDE ``\A(\mvec)\uvec=\q`` with the probability density ``\rho(\uvec|\mvec) = \delta\big (\A(\mvec)\uvec-\q\big )``, where ``\delta()`` denotes the Dirac delta function [@dummit2004abstract]\. Conversely, replacing the constraint by a quadratic penalty term allows for a Gaussian error with zero mean and covariance matrix ``\lambda^{-2}\Imat`` in the PDE misfit ``\A(\mvec)\uvec - \q``. This yields ```math #CndPDF \rho(\uvec|\mvec) = (2\pi)^{-\frac{\ngrid}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}} \exp\left(-\frac{\lambda^2}{2}\|\A(\mvec)\uvec-\q\|^{2}\right). ``` Indeed, for a given ``\mvec``, this distribution ``\rho(\uvec|\mvec)`` with respect to ``\uvec`` is a Gaussian distribution with a mean of ``\A(\mvec)^{-1}\q`` and a covariance matrix of ``\lambda^{-2}\A(\mvec)^{-1}\A(\mvec)^{-\top}``. The posterior PDF ``\rhopost(\uvec,\mvec|\dvec)`` is therefore a joint PDF with respect to both the model parameters ``\mvec`` and the field variables ``\uvec``.
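For a fixed ``\mvec``, evaluating ``\overline{\uvec}(\mvec)`` in equation #OptU1 amounts to solving a single linear system with the matrix ``\lambda^{2}\A(\mvec)^{\top}\A(\mvec) + \Pmat^{\top}\SIGMAN^{-1}\Pmat``. The minimal sketch below illustrates only this step; it uses dense hypothetical stand-ins and real arithmetic for simplicity.

```python
# Minimal sketch of equation #OptU1: for fixed m, u_bar(m) solves
#   (lambda^2 A^T A + P^T Sigma_n^{-1} P) u = lambda^2 A^T q + P^T Sigma_n^{-1} d.
# A_m, P, q, d, and Sigma_n_inv are hypothetical dense stand-ins.
import numpy as np

def u_bar(A_m, P, q, d, Sigma_n_inv, lam):
    """Closed-form minimizer of the penalty objective over the field variables u."""
    H = lam**2 * A_m.T @ A_m + P.T @ Sigma_n_inv @ P
    rhs = lam**2 * A_m.T @ q + P.T @ Sigma_n_inv @ d
    return np.linalg.solve(H, rhs)
```

As ``\lambda`` grows, the data term becomes negligible relative to the PDE term and ``\overline{\uvec}(\mvec)`` approaches the exact PDE solution ``\A(\mvec)^{-1}\q``, consistent with recovering the reduced formulation in the limit ``\lambda \rightarrow \infty``.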
Because the control or model variables ``\mvec`` are of primary interest, we eliminate the dependence of the joint PDF on the auxiliary variables ``\uvec`` by marginalizing over ``\uvec``: ```math #Marginal \rhopost(\mvec|\dvec) &= \int \rhopost(\uvec,\mvec|\dvec)\mathrm{d}\uvec \\ &\propto\text{det}\big (\Hmat(\mvec)\big )^{-\frac{1}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}} \\ &\quad \times \exp\Big(-\frac{1}{2}\big (\lambda^2\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} + \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big )\Big), ``` where ```math #DefineH \Hmat({\mvec}) &= -\nabla^{2}_{\uvec}\log\rhopost(\uvec,{\mvec}|\dvec)\big\rvert_{\uvec=\overline{\uvec}(\mvec)}\\ &=\lambda^{2}\A({\mvec})^{\top}\A({\mvec}) + \Pmat^{\top}\SIGMAN^{-1}\Pmat, ``` and ``\overline{\uvec}(\mvec)`` is given by equation #OptU1\. A complete derivation of the marginal PDF ``\rhopost(\mvec|\dvec)`` is given in Appendix A. As in the deterministic case, the posterior PDF corresponding to the conventional reduced formulation (cf. equation #CPost) can also be derived from the marginal PDF ``\rhopost(\mvec|\dvec)`` in equation #Marginal\. Indeed, as ``\lambda \rightarrow \infty``, we have ```math #limitation2 \lim_{\lambda \rightarrow \infty} \rhopost(\mvec|\dvec) \propto \exp\big ({-\frac{1}{2}\|\Pmat\A(\mvec)^{-1}\q - \dvec\|^{2}_{\SIGMAN^{-1}}-\frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}}\big ), ``` which, as expected, is the posterior PDF corresponding to the conventional reduced formulation. A more detailed derivation of equation #limitation2 can be found in Appendix A. To illustrate the advantages of our penalty formulation for the posterior PDF (cf. equation #Marginal) over the conventional reduced formulation (cf. equation #CPost), we conduct a simple experiment adapted from @esser2016tvr. We invert for the sound speed given partial measurements of the pressure field and generate data with a known band-limited source. The data are recorded at four receivers (see Figure #fig:SmpExpa) and are contaminated with Gaussian noise (see Figure #fig:SmpExpb). As in @esser2016tvr, we define the forward operator ``F(m)`` as ```math #Fsmp F(m) = \Pmat\A^{-1}(m)\q = \begin{pmatrix} \mathcal{F}^{-1}e^{i\omega m\|\xvec_1-\xvec_s\|_{2}}\mathcal{F}\q \\ \mathcal{F}^{-1}e^{i\omega m\|\xvec_2-\xvec_s\|_{2}}\mathcal{F}\q \\ \mathcal{F}^{-1}e^{i\omega m\|\xvec_3-\xvec_s\|_{2}}\mathcal{F}\q \\ \mathcal{F}^{-1}e^{i\omega m\|\xvec_4-\xvec_s\|_{2}}\mathcal{F}\q \end{pmatrix}, ``` where the operator ``\mathcal{F}`` denotes the temporal Fourier transform and ``\omega`` denotes the angular frequency. The vectors ``\xvec_i``, ``i=1, \, ..., \,4`` and ``\xvec_s`` denote the receiver and source locations, respectively, and the scalar ``m`` represents the slowness of the medium, i.e., ``m = \dfrac{1}{v}`` where ``v`` is the velocity. The source and receiver locations are denoted in Figure [#fig:SmpExpa] by the symbols ``(*)`` and ``(\nabla)``, respectively. With the forward model defined in equation #Fsmp\, we formulate the posterior distribution for the reduced formulation and the penalty formulation by choosing a Gaussian distribution with a mean of ``5 \cdot 10^{-4} \, \mathrm{s/m}`` and a standard deviation of ``1.2 \cdot 10^{-4} \, \mathrm{s/m}`` for the prior distribution ``\rhoprior(m)``.
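For concreteness, the forward map in equation #Fsmp simply delays the source wavelet by the traveltime ``m\|\xvec_i-\xvec_s\|_{2}`` to each receiver through a phase shift in the temporal Fourier domain. The minimal sketch below illustrates this map; the wavelet, geometry, and time axis are hypothetical stand-ins, and the sign of the phase follows NumPy's FFT convention rather than the convention used in equation #Fsmp.

```python
# Minimal sketch of the toy forward map in equation #Fsmp: each receiver trace is
# the source wavelet delayed by the traveltime m * ||x_i - x_s||.  The wavelet,
# geometry, and time axis below are hypothetical stand-ins for illustration.
import numpy as np

nt, dt = 512, 0.004
t = np.arange(nt) * dt
omega = 2.0 * np.pi * np.fft.fftfreq(nt, d=dt)      # angular frequencies

f0 = 10.0                                            # hypothetical Ricker peak frequency
q = (1 - 2 * (np.pi * f0 * (t - 0.1))**2) * np.exp(-(np.pi * f0 * (t - 0.1))**2)

x_src = np.array([0.0, 0.0])
x_rcv = np.array([[500.0, 0.0], [1000.0, 0.0], [1500.0, 0.0], [2000.0, 0.0]])

def F(m):
    """Delay the wavelet by slowness m times the source-receiver distance."""
    Q = np.fft.fft(q)
    traces = []
    for x in x_rcv:
        delay = m * np.linalg.norm(x - x_src)        # traveltime in seconds
        traces.append(np.real(np.fft.ifft(np.exp(-1j * omega * delay) * Q)))
    return np.stack(traces)

d_pred = F(5e-4)                                     # slowness of 5e-4 s/m
```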
Finally, we add Gaussian noise with a variance of ``4 \times 10^{-4}`` to the observed data, resulting in a noise-to-signal ratio of ``\|\text{noise}\|_{2}/\|\text{signal}\|_{2}\,=\, 0.24``. ### Figure: SimpleExample{#fig:SmpExp} ![Snapshot](./fig/Motivation2/FourRcv/example1a){#fig:SmpExpa width=50%} ![Data](./fig/Motivation2/FourRcv/example1b){#fig:SmpExpb width=same} (a) Snapshot of the time-domain wavefield generated by a single source (``*``); (b) recorded time-domain data at four receiver locations (``\nabla``). Figure [#fig:SmpPDF] depicts the posterior PDFs of the reduced and penalty formulations for ``\lambda = \, 10, \, 50,`` and ``250``. As expected, local maxima are present in the PDF of the reduced formulation and in the PDFs of the penalty formulation when the ``\lambda`` values become too large. As ``\lambda`` increases from ``\lambda=10``, where the posterior is unimodal, local maxima appear for the larger value ``\lambda=250``, only one of which has strong statistical significance. For ``\lambda=250``, the resulting PDF is close to the one yielded by the reduced formulation, which corresponds to ``\lambda\rightarrow\infty``. In this stylized example, the strongly relaxed formulation appears to be unimodal and to have low bias. As we will demonstrate in later sections, being less prone to local maxima reduces the computational cost associated with sampling these distributions. ### Figure: SimplePDF{#fig:SmpPDF} ![``\lambda = 10``](./fig/Motivation2/FourRcv/rhopen_02){#fig:SmpExp3 width=50%} ![``\lambda = 50``](./fig/Motivation2/FourRcv/rhopen_1){#fig:SmpExp5 width=50%} \ ![``\lambda = 250``](./fig/Motivation2/FourRcv/rhopen_5){#fig:SmpExp6 width=50%} ![Reduced](./fig/Motivation2/FourRcv/rhored){#fig:SmpExp1 width=50%} The four posterior PDFs corresponding to the penalty formulations with ``\lambda = 10`` (a), ``50`` (b), and ``250`` (c), and the reduced formulation (d). ### Selection of ``\lambda`` Before discussing computationally efficient sampling schemes, we first propose a method to select values for ``\lambda``, which will balance the tradeoff between the unimodality of the distribution and its deviation from the reduced-formulation PDF. To arrive at a scheme to select ``\lambda``, we focus on two terms of the negative logarithm function of the posterior PDF in equation [#Marginal]: ```math #margPDFLog \phi(\mvec)=&-\log \rhopost(\mvec|\dvec) \\ = &\frac{1}{2}\log \det\big (\Imat + \frac{1}{\lambda^2}\SIGMAN^{-\frac{1}{2}}\Pmat\A(\mvec)^{-1}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-\frac{\top}{2}}\big) \\ &+ \frac{1}{2} \big (\lambda^2\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} + \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big ), ``` namely the log-determinant term ```math #DetPhi1 \phi_{1}(\mvec) = \frac{1}{2}\log\det\big (\Imat + \frac{1}{\lambda^2}\SIGMAN^{-\frac{1}{2}}\Pmat\A(\mvec)^{-1}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-\frac{\top}{2}}\big ), ``` and the misfit term ```math #DetPhi2 \phi_{2}(\mvec) = \frac{1}{2} \big (\lambda^2\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} + \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big ). ``` These equations imply that we should avoid situations in which ``\lambda \rightarrow 0`` or ``\lambda \rightarrow \infty``. When ``\lambda \rightarrow 0``, the optimal field variable ``\overline{\uvec}(\mvec)`` tends to fit the data.
As a result, ``\phi_{2}(\mvec) \rightarrow \frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}``, which means that the observed data are not informative about the unknown parameters and contribute little to the posterior distribution. Furthermore, as ``\lambda \rightarrow 0``, the nonlinear log-determinant term ``\phi_{1}(\mvec) \rightarrow \infty`` and dominates the overall function ``\phi(\mvec)``, which results in a highly nonlinear mapping ``\mvec \rightarrow \phi(\mvec)``. On the other hand, when ``\lambda \rightarrow \infty``, we find that ``\phi_{1}(\mvec) \rightarrow 0`` and ``\phi_{2}(\mvec) \rightarrow \frac{1}{2}\|\Pmat\A(\mvec)^{-1}\q - \dvec\|^{2}_{\SIGMAN^{-1}} + \frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}``, in which case the misfit ``\phi(\mvec)`` converges to the nonlinear reduced formulation. Considering both limits, we want to find an appropriate ``\lambda`` so that ``\phi_1(\mvec)`` is small compared to ``\phi_2(\mvec)``, thus ensuring that the observed data remain informative, while ``\phi_2(\mvec)`` remains less likely to contain local minima than the reduced-formulation misfit. Based on spectral arguments, @vanleeuwen2015IPpmp proposed a scaling for the penalty parameter according to the largest eigenvalue ``\mu_{1}`` of the matrix ``\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-1}\Pmat\A(\mvec)^{-1}``. Relative to ``\mu_1``, a penalty parameter ``\lambda^2 \gg \mu_1`` can be considered large while ``\lambda^2 \ll \mu_1`` is considered small. When ``\lambda`` is chosen much greater than this reference scale---i.e., when ``\lambda^2 \gg \mu_{1}``---the minimizing field variables ``\overline{\uvec}({\mvec})`` converge to the solution of the wave equation ``\A(\mvec)^{-1}\q`` at a rate of ``\mathcal{O}(\lambda^{-2})``, and our penalty misfit therefore approaches the reduced misfit. A similar consideration applies when ``\lambda^2 \ll \mu_1``. After extensive parameter testing, we found that choosing ``\lambda^{2} = 0.01 \mu_{1}`` strikes the right balance so that the posterior PDF is less affected by local maxima compared to the reduced formulation while the log-determinant term ``\phi_{1}(\mvec)`` remains negligible compared to ``\phi_{2}(\mvec)``. With this choice of ``\lambda``, we can therefore neglect the ``\phi_1(\mvec)`` term, as it is small relative to ``\phi_2(\mvec)``, and consider an approximate posterior PDF ``\rhopostover(\mvec|\dvec)`` consisting only of the ``\phi_2(\mvec)`` term, i.e., we now have the approximate equality ```math #PDFFinal \rhopost(\mvec|\dvec) &\approx \rhopostover(\mvec|\dvec) \\ &\propto \exp \left(-\frac{\lambda^2}{2}\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} -\frac{1}{2} \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} - \frac{1}{2} \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\right). ``` This approximation results in a posterior PDF that is much easier to evaluate. From here on, we consider ``\rhopostover(\mvec|\dvec)`` as our PDF of interest in the subsequent sampling considerations. ### Sampling method Given this choice of ``\lambda``, yielding the PDF in equation #PDFFinal\, the computational considerations of drawing samples from this approximate distribution are paramount in designing a tractable method. McMC-type methods are unfortunately computationally infeasible for high-dimensional problems, owing to the large number of expensive posterior evaluations needed to converge adequately.
For this reason, we follow an alternative approach --- widely used in Bayesian inverse problems with PDE-constraints [@bui2013computational; @zhu2016bayesian] --- where we approximate the target posterior PDF by a Gaussian PDF. To construct this Gaussian PDF, we first find the MAP estimate ``\mMAP`` by solving ```math #FindmMAP \mMAP = \argmin_{\mvec}-\log\big (\rhopostover\left(\mvec|\dvec\right)\big ). ``` Next, we use the local second-order derivative information of the posterior PDF at the MAP point --- i.e., the Hessian ``\Hpost = -\nabla^{2}_{\mvec} \log \rhopostover(\mvec|\dvec)`` --- to construct the covariance matrix of the Gaussian PDF, which yields the Gaussian distribution ``\mathcal{N}(\mMAP,\Hpost^{-1})``. Afterwards, we draw samples from this Gaussian distribution, from which we compute statistical quantities. We combine this sampling strategy with the proposed posterior PDF with weak PDE-constraints and obtain the Bayesian framework shown in Algorithm [#alg:UQ1]: ### Algorithm: Bayesian framework for inverse problems with weak PDE-constraints {#alg:UQ1} | 1. Set ``\SIGMAN``, ``\SIGMAP``, prior mean model ``\mp``, and a value for the penalty parameter ``\lambda``; | 2. Formulate the posterior PDF ``\rhopostover(\mvec|\dvec)`` with equation #PDFFinal; | 3. Find the MAP estimate ``\mMAP`` by minimizing ``-\log{\rhopostover(\mvec|\dvec)}``; | 4. Compute the Hessian matrix ``\Hpost =- \nabla^{2}_{\mvec}{\log\big (\rhopostover(\mvec|\dvec)\big )}`` at the MAP estimate ``\mMAP`` | and define the Gaussian PDF ``\mathcal{N}(\mMAP,\Hpost^{-1})``; | 5. Draw ``\nsmp`` samples from the Gaussian PDF ``\mathcal{N}(\mMAP,\Hpost^{-1})``; | 6. Compute statistical properties of interest from the ``\nsmp`` samples. : Bayesian framework for inverse problems with weak PDE-constraints In contrast to McMC type methods, no additional evaluations of the posterior PDF are needed once we have calculated the MAP estimate ``\mMAP``, which significantly reduces the computational cost. However, the accuracy of samples drawn from this surrogate PDF strongly depends on the accuracy of the Gaussian approximation in a neighborhood of ``\mMAP``, which is related to our choice of ``\lambda``. To illustrate this dependence, we continue the example shown in Figure #fig:SmpPDF and compare the Gaussian approximation of the reduced formulation (i.e., ``\lambda = \infty``) to those of the penalty formulation for different values of ``\lambda``, plotted in Figure [#fig:SmpGaussPDF]. In this case, the largest eigenvalue is ``\mu_{1} = 10^4`` and the corresponding penalty parameter is ``\lambda = 10``. Clearly, when selecting ``\lambda = 10``, the Gaussian approximation is relatively close to the true PDF, whereas increasing ``\lambda`` decreases the accuracy of the Gaussian approximation. ### Figure: SimpleGaussPDF{#fig:SmpGaussPDF} ![``\lambda = 10``](./fig/Motivation2/FourRcv/rhopen_cmpApp_02){#fig:SmpGaussExp3 width=50%} ![``\lambda = 50``](./fig/Motivation2/FourRcv/rhopen_cmpApp_1){#fig:SmpGaussExp5 width=50%} \ ![``\lambda = 250``](./fig/Motivation2/FourRcv/rhopen_cmpApp_5){#fig:SmpGaussExp6 width=50%} ![Reduced](./fig/Motivation2/FourRcv/rhored_cmpApp){#fig:SmpGaussExp1 width=50%} The Gaussian approximations of the four posterior PDFs associated with the model in equation #Fsmp\. Armed with an accurate Gaussian approximation to the unimodal PDF (for an appropriate choice of ``\lambda``), we are now in a position to draw samples from the Gaussian distribution ``\mathcal{N}(\mMAP,\Hpost^{-1})``.
For small-scale problems, an explicit expression of the Hessian ``\Hpost`` is available. Hence, we can draw samples ``\mvec_{\text{s}}`` from the distribution ``\mathcal{N}(\mMAP,\Hpost^{-1})`` by utilizing the Cholesky factorization of the Hessian ``\Hpost``---i.e., ``\Hpost = \Rmat^{\top}_{\text{post}}\Rmat_{\text{post}}`` --- as follows [@rue2001fast]: ```math #CholFactSmp \mvec_{\text{s}} = \mMAP + \Rpost^{-1}\rvec, ``` where the matrix ``\Rmat_{\text{post}}`` is an upper triangular matrix and the vector ``\rvec`` is a random vector drawn from the ``n_{\text{grid}}``-dimensional standard Gaussian distribution ``\mathcal{N}(0,\mathcal{I}_{n_{\text{grid}} \times n_{\text{grid}}})``. However, for large-scale problems, constructing and storing an explicit expression of the Hessian ``\Hpost`` is infeasible. Typically, to avoid constructing the explicit Hessian matrix, the Hessian ``\Hpost`` is implemented as a matrix-free implicit operator, and we only have access to matrix-vector products with arbitrary vectors. As a result, we need a matrix-free sampling method to draw samples. In the following section, we will develop the implementation details for a suitable sampling method for seismic wave-equation-constrained inverse problems. ## Uncertainty quantification for seismic wave-equation-constrained inverse problems {#seismic} Wave-equation based inversions, where the coefficients of the wave-equation are the unknowns, are amongst the most computationally challenging inverse problems as they typically require wave-equation solves for multiple source experiments on a large domain where the wave travels many wavelengths. Additionally, these inverse problems, also known as full-waveform inversion (FWI) in the seismic community [@pratt1999seismic], involve fitting oscillatory predicted and observed data, which can result in parasitic local minima. Motivated by the penalty formulation with its weak PDE-constraints and the results presented above, we will derive a time-harmonic Bayesian inversion framework that is capable of handling relatively large-scale problems where the wave propagates about 30 wavelengths, a moderate number for realistic problems. Given acoustic seismic data, collected at the surface from sources that fire along the same surface, our aim is to construct the statistical properties of the spatial distribution of the acoustic wave velocity. With these properties, we are able to conduct uncertainty quantification for the recovered velocity model. The subsections are organized roughly according to the main steps outlined in Algorithm #alg:UQ1\. ### Steps 1 and 2: designing the posterior and prior PDF To arrive at a Bayesian formulation for wave-equation based inverse problems with weak constraints, we consider monochromatic seismic data collected at ``\nrcv`` receiver locations from ``\nsrc`` seismic source locations and sampled at ``\nfreq`` frequencies. Hence, the observed data are ``\dvec\in\CC^{\ndata}`` with ``\ndata = \nrcv \times \nsrc\times\nfreq``. As before, the synthetic data are obtained by applying, for each source, the projection operator ``\Pmat \in \CC^{\nrcv \times \ngrid}``. For the ``i^{\text{th}}`` source and ``j^\text{th}`` frequency, the time-harmonic wave equation corresponds to the following discretized Helmholtz system: ```math #HelmEq \A_{j}(\mvec) \uvec_{i,j} = \q_{i,j}\quad \text{with}\quad \A_{j}(\mvec) = \Delta + \omega_{j}^{2}\operatorname{diag}\left(\mvec^{-2}\right).
``` In this expression, the ``\q_{i,j}``'s are the monochromatic sources, the symbol ``\Delta`` refers to the discretized Laplacian, ``\omega_{j}`` represents the ``j^{\text{th}}`` angular frequency, and ``\mvec\in\RR^{\ngrid}`` denotes the vector with the discretized velocities. With a slight abuse of notation, ``\mvec^{-2}`` denotes the elementwise reciprocal square of this vector, which appears on the diagonal. To discretize this problem, we use the Helmholtz discretization from @chen2013optimal\. If we consider the data from all sources and frequencies simultaneously, the posterior PDF for the weak-constrained penalty formulation becomes ```math #PostMH \rhopostover(\mvec|\dvec) \propto &\exp\left(-\frac{1}{2}\sum_{i=1}^{\nsrc}\sum_{j=1}^{\nfreq}\|\Pmat\overline{\uvec}_{i,j}(\mvec) - \dvec_{i,j}\|^{2}_{\SIGMAN^{-1}}-\frac{\lambda^{2}}{2}\|\A_{j}(\mvec)\overline{\uvec}_{i,j}(\mvec)-\q_{i,j}\|^{2}\right)\\ &\times \exp\left(- \frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\right). ``` Aside from choosing a proper value for the penalty parameter ``\lambda``, another crucial component of the posterior PDF in equation [#PostMH] is the choice of the prior PDF. From a computational perspective, a suitable prior should have a bounded variance and one should be able to draw samples from it at moderate cost [@bui2013computational]. More specifically, this requires computationally feasible access to (matrix-free) actions of the square root of the prior covariance operator, or of its inverse, on random vectors. To meet this requirement, we utilize Gaussian smoothness priors [@matheron2012estimating], which provide a flexible way to describe random fields and are commonly employed in Bayesian inference [@lieberman2010parameter; @MartinMcMC2012; @bardsley2015randomize]. Following @lieberman2010parameter, we construct a Gaussian smoothness prior ``\rhoprior(\mvec) \propto \mathcal{N} \left( \mp, \SIGMAP \right)`` with a reference mean model ``\mp`` and a covariance matrix ``\SIGMAP`` given by ```math #SIGMAP \SIGMAP(k,l) = a\exp\left(\frac{-\|\svec_{k}-\svec_{l}\|^2}{2b^2}\right)+c\delta_{k,l}. ``` In this expression for the covariance, the vectors ``\svec_{k} = (z_{k},x_{k})`` and ``\svec_{l} = (z_{l},x_{l})`` denote the ``k^{\text{th}}`` and ``l^{\text{th}}`` spatial coordinates corresponding to the ``k^{\text{th}}`` and ``l^{\text{th}}`` elements in the vector ``\mvec``, respectively. The parameter ``a`` controls the strength of the spatially correlated variance, ``b`` controls the spatial correlation distance, and ``c`` adds a small uncorrelated variance to each parameter. The variance of the ``k^{\text{th}}`` element ``m_{k}`` is ``\text{var}(m_{k})=\SIGMAP(k,k) = a + c``. The parameter ``c`` also ensures that the prior covariance matrix remains numerically well-conditioned [@MartinMcMC2012]. Clearly, when the distance between ``\svec_{k}`` and ``\svec_{l}`` is large---i.e., when ``\frac{\|\svec_{k}-\svec_{l}\|^2}{2b^2} \gg 1``---the cross-covariance ``\SIGMAP(k,l)`` vanishes quickly. ### Steps 3 and 4: Gaussian approximation After setting up the posterior PDF, we need to construct its Gaussian approximation ``\mathcal{N}(\mMAP,\Hpost^{-1})``, corresponding to steps 3 and 4 in Algorithm [#alg:UQ1].
In order to achieve this objective, first we compute the MAP estimate of the posterior PDF ``\mMAP``, which is equivalent to minimizing the negative logarithm ``-\log \rhopostover(\mvec|\dvec)``: ```math #OptPrb \mMAP &= \argmin_{\mvec} -\log\rhopostover(\mvec|\dvec) \\ &=\argmin_{\mvec}\sum_{i=1}^{\nsrc}\sum_{j=1}^{\nfreq}\frac{1}{2}\|\Pmat\overline{\uvec}_{i,j}(\mvec) - \dvec_{i,j}\|^{2}_{\SIGMAN^{-1}} + \frac{\lambda^{2}}{2}\|\A_{j}(\mvec)\overline{\uvec}_{i,j}(\mvec)-\q_{i,j}\|^{2}\\ &\quad \quad \quad \quad \quad+ \frac{1}{2}\|\mvec-\mp\|_{\SIGMAP^{-1}}^{2}. ``` Note that the objective function ``-\log\rhopostover(\mvec|\dvec)`` is analogous to the cost function of the deterministic optimization problem [@vanleeuwen2015IPpmp]\. Using similar techniques as in the aforementioned work, we can express the gradient ``\g`` for this objective as ```math #GofMis \g = \sum_{i=1}^{\nsrc}\sum_{j=1}^{\nfreq} \lambda^{2}\Gmat_{i,j}^{\top}\big (\A_{j}(\mvec)\overline{\uvec}_{i,j}(\mvec)-\q_{i,j}\big )+\SIGMAP^{-1}(\mvec-\mp), ``` where the sparse Jacobian matrix ``\Gmat_{i,j} = \big (\nabla_{\mvec}{\A_{j}(\mvec)\big )\overline{\uvec}_{i,j}}(\mvec)``. Following @van2013fast, we use the limited-memory-Broyden–Fletcher–Goldfarb–Shanno method [l-BFGS, @nocedal2006numerical] to solve the optimization problem in equation #OptPrb to find the MAP estimate ``\mMAP``. Once we have computed ``\mMAP``, we focus on approximating the posterior PDF ``\rhopostover(\mvec|\dvec)`` by a Gaussian distribution centered at ``\mMAP``. For simplicity, we omit the dependence of ``\A_j(\mvec)`` and ``\overline{\uvec}_{i,j}(\mvec)`` on ``\mvec``. A necessary component in this process is computing the Hessian ``\Hpost = -\nabla_{\mvec}^2\log \rhopostover(\mvec|\dvec)``, which is given by ```math #FullHessian \Hpost &= \Hlike + \SIGMAP^{-1} \\ &= \sum_{i}^{\nsrc}\sum_{j}^{\nfreq}\lambda^{2}\Gmat_{i,j}^{\top}\Gmat_{i,j} - \Smat_{i,j}^{\top}\left(\Pmat^{\top}\SIGMAN^{-1}\Pmat+\lambda^2\A_{j}^{\top}\A_{j}\right)^{-1}\Smat_{i,j} + \SIGMAP^{-1}, ``` where ```math #FullHessiansub \Smat_{i,j} = \lambda^{2}( \nabla_{\mvec}{\A_{j}^{\top} )\left(\A_{j}\overline{\uvec}_{i,j}-\q_{i,j}\right)} + \lambda^{2}\A_{j}^{\top}\Gmat_{i,j}, ``` and the optimal wavefield ``\overline{\uvec}_{i,j}`` is computed by equation #OptU1\. The full Hessian ``\Hpost`` is a dense ``\ngrid \times \ngrid`` matrix, which is prohibitive to construct explicitly when ``\ngrid`` is large, even in the two-dimensional setting. On the other hand, it is also prohibitive if we formulate the Hessian ``\Hpost`` as a traditional implicit operator, which requires ``2 \times \nsrc \times \nfreq`` PDE solves to compute each matrix-vector product ``\Hpost \overline{\mvec}`` with any vector ``\overline{\mvec}``, according to the expression in equation #FullHessian\. Since the posterior covariance matrix is the inverse of the Hessian, we need to invert the square root of the Hessian operator in order to generate random samples. With the implicit Hessian operator, we would need to employ an iterative solver such as LSQR or CG [@golub2012matrix], and the total number of PDE solves required is therefore proportional to ``2 \times \nsmp \times \niter \times \nsrc \times \nfreq``. As a result, this type of approach requires an extremely large computational cost when drawing sufficiently many samples. As a remedy, we exploit the structure of the Hessian matrix ``\Hpost`` to find an approximation that can be constructed and applied in a computationally efficient manner. 
To exploit the structure of the Hessian matrix ``\Hpost``, we will focus on the Hessian matrix ``\Hlike`` of the likelihood term, as we have already discussed the matrix ``\SIGMAP`` in the previous section. Based on equations #FullHessian and #FullHessiansub\, ``\Hlike`` consists of three components --- the matrices ``\Pmat^{\top}\SIGMAN^{-1}\Pmat+\lambda^2\A_{j}^{\top}\A_{j}``, ``(\nabla_{\mvec}{\A_{j}^{\top})(\A_{j}\overline{\uvec}_{i,j}-\q_{i,j})}``, and ``\Gmat_{i,j}``. We first consider the term ``(\nabla_{\mvec}{\A_{j}^{\top})(\A_{j}\overline{\uvec}_{i,j}-\q_{i,j})}``, and specifically the PDE misfit ``\A_{j}\overline{\uvec}_{i,j}-\q_{i,j}``. When the PDE misfit is approximately zero, this overall term is also expected to be small. As the MAP estimate ``\mMAP`` simultaneously minimizes the data misfit, PDE misfit, and model penalty term, the PDE misfit is expected to be small when model and observational errors are not too large. Thus, we obtain a good approximation ``\Hliket`` of the full Hessian ``\Hlike``, which corresponds to the Gauss-Newton Hessian derived by @vanleeuwen2015IPpmp\: ```math #GNWRI \Hliket = \sum_{i}^{\nsrc}\sum_{j}^{\nfreq}\lambda^{2}\Gmat_{i,j}^{\top}\Gmat_{i,j} - \lambda^4\Gmat_{i,j}^{\top}\A_{j}(\Pmat^{\top}\SIGMAN^{-1}\Pmat+\lambda^2\A_{j}^{\top}\A_{j})^{-1}\A_{j}^{\top}\Gmat_{i,j}. ``` Consequently, we can use the Gauss-Newton Hessian ``\Hliket`` to construct the Gaussian distribution that approximates the posterior PDF ``\rhopostover(\mvec|\dvec)``: ```math #QuadrApprx \rhopostover(\mvec|\dvec) \approx \rho_{\text{Gauss}}(\mvec) = \mathcal{N}(\mMAP,\Hpostt^{-1}) = \mathcal{N}(\mMAP,(\Hliket + \SIGMAP^{-1})^{-1}). ``` The Gauss-Newton Hessian ``\Hliket`` has a compact expression derived from the Sherman–Morrison–Woodbury formula [@golub2012matrix]: ```math #GNWRI2 \Hliket =\sum_{i}^{\nsrc}\sum_{j}^{\nfreq}\Gmat_{i,j}^{\top}\A_{j}^{-\top}\Pmat^{\top}(\SIGMAN + \frac{1}{\lambda^2}\Pmat\A^{-1}_{j}\A^{-\top}_{j}\Pmat^{\top})^{-1}\Pmat\A^{-1}_{j}\Gmat_{i,j}. ``` We shall see that this expression provides a factored formulation to implicitly construct the Gauss-Newton Hessian, which does not require any additional PDE solves to compute matrix-vector products. In order to construct the implicit Gauss-Newton Hessian ``\Hliket``, three matrices are necessary --- ``\A_{j}^{-\top}\Pmat^{\top} \in \CC^{n_{\text{grid}} \times n_{\text{rcv}}}``, ``\Gmat_{i,j} \in \CC^{n_{\text{grid}} \times n_{\text{grid}}}``, and ``\SIGMAN + \frac{1}{\lambda^2}\Pmat\A^{-1}_{j}\A^{-\top}_{j}\Pmat^{\top}\in \CC^{\nrcv \times \nrcv}``. For each frequency, constructing the matrix ``\A_{j}^{-\top}\Pmat^{\top}`` requires ``\nrcv`` PDE solves. As described in the previous section, the matrix ``\Gmat_{i,j}`` is sparse and determined by the corresponding wavefield ``\overline{\uvec}_{i,j}``, whose computational cost is approximately equal to one PDE solve for each source and each frequency. The computational complexity of inverting the matrix ``\SIGMAN + \frac{1}{\lambda^2}\Pmat\A^{-1}_{j}\A^{-\top}_{j}\Pmat^{\top}`` is ``\mathcal{O}(\nrcv^3)``. Since ``\nrcv \ll \ngrid``, inverting this matrix is much cheaper than solving a PDE. Thus, to construct the Gauss-Newton Hessian ``\Hliket``, we only need to solve ``\nfreq \times (\nsrc+\nrcv)`` PDEs.
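To make the factored construction concrete, the sketch below assembles the matrix-free Gauss-Newton Hessian of equation #GNWRI2 for a single source and a single frequency. It uses dense NumPy stand-ins for ``\A_{j}``, ``\Pmat``, ``\Gmat_{i,j}``, and ``\SIGMAN`` and real arithmetic for simplicity; it is an illustration of the idea rather than the implementation used for the experiments.

```python
# Minimal sketch of the factored, matrix-free Gauss-Newton Hessian of equation
# #GNWRI2 for one source and one frequency (real arithmetic, dense stand-ins).
# After the one-time cost of forming B = A^{-T} P^T (n_rcv PDE solves) and the
# wavefield-driven matrix G, a matrix-vector product needs no further PDE solves.
import numpy as np
from scipy.sparse.linalg import LinearOperator

def gn_hessian_operator(A_m, P, G, Sigma_n, lam):
    n_grid = A_m.shape[0]
    B = np.linalg.solve(A_m.T, P.T)              # A^{-T} P^T: n_rcv PDE solves
    M = Sigma_n + (1.0 / lam**2) * (B.T @ B)     # Sigma_n + (1/lam^2) P A^{-1} A^{-T} P^T
    M_inv = np.linalg.inv(M)                     # small n_rcv x n_rcv inverse

    def matvec(v):
        w = B.T @ (G @ v)                        # P A^{-1} G v, no extra PDE solve
        return G.T @ (B @ (M_inv @ w))           # G^T A^{-T} P^T M^{-1} ( . )

    # the Gauss-Newton Hessian is symmetric, so matvec also serves as rmatvec
    return LinearOperator((n_grid, n_grid), matvec=matvec, rmatvec=matvec)
```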
With the computed matrix ``\A_{j}^{-\top}\Pmat^{\top}`` and wavefield ``\overline{\uvec}_{i,j}``, the action of the Gauss-Newton Hessian ``\Hliket`` on any vector ``\overline{\mvec}`` can be performed with several efficient matrix-vector multiplications involving the matrices ``\A_{j}^{-\top}\Pmat^{\top}``, ``\Gmat_{i,j}``, and ``\SIGMAN + \frac{1}{\lambda^2}\Pmat\A^{-1}_{j}\A^{-\top}_{j}\Pmat^{\top}``. Compared to the conventional approach, this new factored formulation of the implicit operator does not require additional PDE solves to compute a matrix-vector multiplication once it is constructed. In general, we have that ``\nrcv \ll \nsmp \times \niter``, and using this implicit Gauss-Newton Hessian operator to draw ``\nsmp`` samples is therefore significantly cheaper than the conventional approach. Another advantage of using this operator arises from the fact that the computations of the necessary matrices corresponding to different frequencies are independent of each other. As a result, we can compute and store these matrices in parallel for different frequencies, allowing us to speed up our computations in a distributed computing environment. Furthermore, the expression in equation #GNWRI2 provides a natural decomposition of the Gauss-Newton Hessian ``\Hliket`` as follows: ```math #HmisFactor \Hliket &= \Rmat_{\text{like}}^{\top}\Rmat_{\text{like}}, \\ \Rmat_{\text{like}} &= \begin{bmatrix} (\SIGMAN + \frac{1}{\lambda^2}\Pmat\A_{1}^{-1}\A_{1}^{-\top}\Pmat^{\top})^{-\frac{1}{2}}\Pmat\A_{1}^{-1}\Gmat_{1,1} \\ \vdots \\ (\SIGMAN + \frac{1}{\lambda^2}\Pmat\A_{j}^{-1}\A_{j}^{-\top}\Pmat^{\top})^{-\frac{1}{2}}\Pmat\A_{j}^{-1}\Gmat_{i,j} \\ \vdots \\ (\SIGMAN + \frac{1}{\lambda^2}\Pmat\A_{\nfreq}^{-1}\A_{\nfreq}^{-\top}\Pmat^{\top})^{-\frac{1}{2}}\Pmat\A_{\nfreq}^{-1}\Gmat_{\nsrc,\nfreq} \end{bmatrix}. ``` As with the factored formulation of the implicit Gauss-Newton Hessian, we can construct the factor ``\Rmat_{\text{like}}`` as an implicit operator once we have computed the matrices ``\A_{j}^{-\top}\Pmat^{\top}`` and wavefields ``\overline{\uvec}_{i,j}``. We will use this implicit operator ``\Rmat_{\text{like}}`` for the sampling method introduced in the next subsection. ### Steps 5 and 6: sampling the Gaussian PDF The covariance matrix ``\Hpostt^{-1}`` is a dense ``\ngrid \times \ngrid`` matrix, and the construction of its Cholesky factorization involves ``\mathcal{O}(\ngrid^3)`` operations. Both of these facts prohibit us from applying the Cholesky factorization method to sample the Gaussian distribution ``\rho_{\text{Gauss}}(\mvec)`` for large-scale problems. Here we propose to use the so-called optimization-driven Gaussian simulators [@papandreou2010gaussian; @orieux2012sampling] or the randomize-then-optimize (RTO) method [@solonen2014optimization] to sample the Gaussian distribution ``\rho_{\text{Gauss}}(\mvec)``. This method belongs to the class of bootstrap methods [@efron1981nonparametric; @efron1992bootstrap; @kitanidis1995recent], and it requires neither an explicit formulation of the Hessian matrix nor an expensive Cholesky factorization. To outline this method, we first use equation #QuadrApprx to split ``\Hpostt`` into ``\Hliket`` and ``\SIGMAP^{-1}``. The Gauss-Newton Hessian ``\Hliket`` has the factorization in equation #HmisFactor\, and we can also compute the Cholesky factorization of the prior covariance matrix ``\SIGMAP = \Rprior^{\top}\Rprior`` with an upper-triangular matrix ``\Rprior``.
Substituting these two factorizations into equation #QuadrApprx\, we can rewrite the Gaussian distribution ``\rho_{\text{Gauss}}(\mvec)`` as follows: ```math #QuadrApprx3 \rho_{\text{Gauss}}(\mvec) \propto \exp \left(-\frac{1}{2}\|\Rmat_{\text{like}}\mvec-\Rmat_{\text{like}}\mMAP\|^{2} -\frac{1}{2}\|\Rprior^{-\top}\mvec-\Rprior^{-\top}\mMAP\|^{2}\right). ``` @papandreou2010gaussian and @solonen2014optimization noted that independent realizations from the distribution in equation #QuadrApprx3 can be computed by repeatedly solving the following linear data-fitting optimization problem: ```math #RTOgeneral \mvec_{\text{s}} &= \argmin_{\mvec} \left\|\Rmat_{\text{like}}\mvec - \Rlike\mMAP - \rvec_{\text{like}} \right\|^{2} + \left\|\Rprior^{-\top}\mvec - \Rprior^{-\top}\mMAP - \rvec_{\text{prior}} \right\|^{2}, \\ &\rvec_{\text{like}} \sim \mathcal{N}(\mathbf{0},\mathcal{I}_{ \nd \times \nd}), \\ &\rvec_{\prior} \sim \mathcal{N}(\mathbf{0}, \mathcal{I}_{ \ngrid \times \ngrid}). ``` This optimization problem can be solved by iterative solvers such as LSQR and PCG, which do not require an explicit expression for the matrix ``\Rmat_{\text{like}}`` but merely an operator that can compute matrix-vector products. As a result, we can use our implicit formulation of ``\Rlike`` in equation #HmisFactor to solve the optimization problem in equation #RTOgeneral and draw samples from the Gaussian distribution ``\rho_{\text{Gauss}}(\mvec)``. The pseudocode of the RTO method for drawing samples from the Gaussian distribution ``\rho_{\text{Gauss}}(\mvec)`` is shown in Algorithm [#alg:UQ2], which realizes step 5 in Algorithm [#alg:UQ1]. Because the overall sampling strategy consists of the Gaussian approximation and the RTO method, we will refer to the proposed method as GARTO in the rest of the paper. ### Algorithm: RTO {#alg:UQ2} | 1. Start with the MAP estimate ``\mMAP``, covariance matrices ``\SIGMAN`` and ``\SIGMAP``; | 2. Formulate the operator ``\Rlike(\mMAP)`` by equation #HmisFactor, and compute the | Cholesky factorization of ``\SIGMAP=\Rprior^{\top}\Rprior``; | 3. for s = 1:``\nsmp`` | 4. Generate ``\rvec_{\prior} \sim \mathcal{N}(\mathbf{0}, \mathcal{I}_{ \ngrid \times \ngrid})`` and ``\rvec_{\like} \sim \mathcal{N}(\mathbf{0},\mathcal{I}_{ \nd \times \nd})``; | 5. Solve ``\mvec_{\text{s}} = \argmin_{\mvec} \left\|\Rmat_{\text{like}}\mvec - \Rlike\mMAP - \rvec_{\text{like}} \right\|^{2} + \left\|\Rprior^{-\top}\mvec - \Rprior^{-\top}\mMAP - \rvec_{\text{prior}} \right\|^{2}``; | 6. end : Sample ``\rho_{\text{Gauss}}(\mvec)`` by the RTO method ### A benchmark method: the randomized maximum likelihood method We have proposed a computationally efficient algorithm --- GARTO --- that can approximately sample the target distribution in equation #PostMH without additional PDE solves once the MAP estimate and the Gauss-Newton Hessian operator are computed. However, due to the loss of accuracy caused by the Gaussian approximation, it is important to investigate the accuracy of the GARTO method by comparing it to a benchmark method that can sample the target distribution in equation #PostMH regardless of the computational cost. The randomized maximum likelihood (RML) method [@chen2012ensemble] is a viable candidate as a benchmark because it can draw samples that are good approximations of those from the target distribution for weakly nonlinear problems [@bardsley2014randomize; @bardsley2015randomize].
Indeed, as previously discussed, the target distribution with weak PDE-constraints that the GARTO method aims to sample is less affected by nonlinearity when ``\lambda`` is carefully selected. To draw a sample, the RML method performs a bootstrapping-like procedure [@efron1981nonparametric; @efron1992bootstrap] that first samples the data and prior model and then computes the resulting MAP estimate. More precisely, in order to draw a sample from the target distribution in equation #PostMH, the RML method solves the following nonlinear optimization problem: ```math #RML \mvec_{s}=\argmin_{\mvec}\sum_{i=1}^{\nsrc}\sum_{j=1}^{\nfreq}&\Big (\frac{1}{2}\|\SIGMAN^{-\frac{1}{2}}\Pmat\overline{\uvec}_{i,j}(\mvec) - \SIGMAN^{-\frac{1}{2}}\dvec_{i,j} - \rvec^{(1)}_{i,j}\|^{2} \\ + &\frac{1}{2}\|\lambda\A_{j}(\mvec)\overline{\uvec}_{i,j}(\mvec)-\lambda\q_{i,j}- \rvec^{(2)}_{i,j}\|^{2} \Big )\\+ &\frac{1}{2}\|\SIGMAP^{-\frac{1}{2}}\mvec-\SIGMAP^{-\frac{1}{2}}\mp - \rvec^{(3)}\|^{2}, ``` where the vectors ``\rvec^{(1)}_{i,j}``, ``\rvec^{(2)}_{i,j}``, and ``\rvec^{(3)}`` are random realizations from the standard normal distribution ``\NN(0,\Imat)``. We refer interested readers to @chen2012ensemble for more details about the RML method. As a result of this approach, the computational cost of drawing one sample by the RML method is approximately equivalent to solving one FWI problem, which is significantly more expensive than the GARTO method. Therefore, we are only able to conduct a comparison with the RML method on a small-scale problem in the following section. ## Numerical experiments {#numerical} ### Influence of the penalty parameter ``\lambda`` The feasibility of the proposed Bayesian framework relies on the accuracy of the approximations in equations #PDFFinal and #QuadrApprx\, both of which depend on the selection of ``\lambda``. To get a better understanding of the influence of this parameter on these approximations, we will work with a relatively simple 2D velocity model parameterized by a single varying parameter ``v_0``---i.e., the velocity is given by ``v(z) = v_0 + 0.75z\, \mathrm{m/s}``, where ``z`` increases with vertical depth. We simulate data with a single source and single frequency for ``v_0 \, = \, 2000 \, \mathrm{m/s}`` using a grid spacing of ``25 \, \mathrm{m}``. The frequency of the data is ``5\, \mathrm{Hz}``, which is low enough to avoid numerical dispersion at this grid spacing. We place our source at ``(z, x)`` coordinates ``(50 \, \mathrm{m}, 50 \, \mathrm{m})`` and record the data at ``200`` receivers located at a depth of ``50 \, \mathrm{m}`` with a sampling interval of ``25 \, \mathrm{m}``. We do not simulate the free surface in this example. After the simulation, we add ``10\%`` Gaussian noise to the observed data. Because the prior distribution is independent of ``\lambda``, we only investigate the influence of ``\lambda`` on the negative log-likelihood of the associated PDFs in this experiment. We abbreviate the negative log-likelihood as "NLL". Figure [#fig:1DCmp1] shows the NLL for the reduced approach (cf. equation #CPost\) as well as for the penalty approach (cf. equation #Marginal) for various values of ``\lambda`` as a function of ``v_{0}``. As discussed previously, we select values of ``\lambda^2`` proportional to the largest eigenvalue ``\mu_{1}`` of the matrix ``\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-1}\Pmat\A(\mvec)^{-1}``, i.e., ``\lambda^2 \, = \, 10^{-10}\mu_1, \, 10^{-6}\mu_1, \, 10^{-4}\mu_1, \, 10^{-2}\mu_1, \, 10^{0}\mu_1, \, \text{and}\, 10^{2}\mu_1``.
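Note that ``\mu_{1}`` does not need to be computed by forming this matrix explicitly; for example, a few power iterations, each costing one forward and one adjoint PDE solve, already give a sufficiently accurate estimate. The sketch below illustrates one such estimate with dense hypothetical stand-ins; it is one possible way to obtain ``\mu_1`` and not necessarily the procedure used for the experiments.

```python
# Minimal sketch: estimate the largest eigenvalue mu_1 of
#   A^{-T} P^T Sigma_n^{-1} P A^{-1}
# by power iteration.  Each iteration costs one solve with A and one with A^T.
# A_m, P, and Sigma_n_inv are hypothetical dense stand-ins.
import numpy as np

def estimate_mu1(A_m, P, Sigma_n_inv, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A_m.shape[0])
    v /= np.linalg.norm(v)
    mu1 = 0.0
    for _ in range(n_iter):
        w = np.linalg.solve(A_m, v)                                  # A^{-1} v
        w = np.linalg.solve(A_m.T, P.T @ (Sigma_n_inv @ (P @ w)))    # apply the full matrix
        mu1 = np.linalg.norm(w)        # eigenvalue estimate (the matrix is symmetric PSD)
        v = w / mu1
    return mu1

# proposed selection criterion: lambda^2 = 0.01 * mu_1
# lam = np.sqrt(0.01 * estimate_mu1(A_m, P, Sigma_n_inv))
```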
From Figure [#fig:1DCmp1], we observe that when ``\lambda`` is large, i.e., ``\lambda^2 \, = \, 10^{2}\mu_{1}`` and ``10^{0}\mu_{1}``, the NLL exhibits several local minima and, as ``\lambda`` increases, converges to that of the reduced formulation, as expected. We also note that for small ``\lambda``, i.e., ``\lambda^2 \, = \, 10^{-10}\mu_{1}``, the resulting NLL is monotonically decreasing, because the determinant term in equation #margPDFLog dominates the NLL. In between these two extremes (i.e., when ``\lambda^2 \, = \, 10^{-6}\mu_{1},`` ``10^{-4}\mu_{1},`` and ``10^{-2}\mu_{1}``), the resulting NLLs are unimodal. This observation implies that, with a carefully selected ``\lambda``, the posterior distribution with weak PDE-constraints potentially contains fewer local maxima.

### Figure: 1DComparison1{#fig:1DCmp1}
![](./fig/Experiment/Fig/Exp1_Cmp2){width=100%}
The comparison of the negative log-likelihood functions.

To investigate the influence of the parameter ``\lambda`` on the accuracy of the approximations in equations #PDFFinal and #QuadrApprx\, we plot in Figure [#fig:1DCmp2] the NLL corresponding to the true distribution ``\rhopost(\mvec|\dvec)`` (cf. equation #Marginal\), its approximation neglecting the determinant term ``\rhopostover(\mvec|\dvec)`` (cf. equation #PDFFinal\), and the Gaussian approximation ``\rho_{\text{Gauss}}(\mvec)`` (cf. equation #QuadrApprx\) for various values of ``\lambda``. For simplicity, we refer to these three functions as ``\psi_{1}``, ``\psi_{2}``, and ``\psi_{3}``, respectively. From Figure [#fig:Example1_dm11], we observe that when ``\lambda^2\, =\, 10^{-10}\mu_{1}``, ``\psi_{2}`` fails to adequately approximate ``\psi_{1}``---i.e., the approximation in equation #PDFFinal fails --- because the determinant term in equation #Marginal dominates the negative logarithm function. As ``\lambda`` increases, the determinant term becomes negligible, and ``\psi_{2}`` becomes a reasonable approximation of ``\psi_{1}``, as shown in Figures [#fig:Example1_dm12] to [#fig:Example1_dm16]. However, among these five remaining choices of ``\lambda``, only when ``\lambda^{2} = 10^{-2}\mu_{1}`` does ``\psi_{3}`` adequately approximate ``\psi_{2}``. This occurs because when ``\lambda`` is relatively large, ``\psi_2`` contains a number of nonoptimal local minima, so ``\psi_2`` is poorly modeled by its Gaussian approximation. Additionally, when ``\lambda^2 < 10^{-2}\mu_{1}``, the term ``\A_{j}\overline{\uvec}_{i,j}-\q_{i,j}`` in equation #FullHessiansub is not negligible, resulting in the Gauss-Newton Hessian being a poor approximation of the full Hessian. These results imply that the proposed criterion --- i.e., ``\lambda^2 = 10^{-2}\mu_{1}`` --- provides a reasonable choice for the parameter ``\lambda``, which can simultaneously satisfy both approximations in equations #PDFFinal and #QuadrApprx\. As a result, the corresponding posterior distribution is less prone to local maxima and can be appropriately approximated by the Gaussian distribution in equation #QuadrApprx\, which ensures the feasibility of the proposed framework.
### Figure: 1DComparison2{#fig:1DCmp2}
![``\lambda^2 = 10^{-10}\mu_{1}``](./fig/Experiment/Fig/Exp1_-10){#fig:Example1_dm11 width=50%}
![``\lambda^2 = 10^{-6}\mu_{1}``](./fig/Experiment/Fig/Exp1_-6){#fig:Example1_dm12 width=same}\
![``\lambda^2 = 10^{-4}\mu_{1}``](./fig/Experiment/Fig/Exp1_-4){#fig:Example1_dm13 width=same}
![``\lambda^2 = 10^{-2}\mu_{1}``](./fig/Experiment/Fig/Exp1_-2){#fig:Example1_dm14 width=same}\
![``\lambda^2 = 10^{0}\mu_{1}``](./fig/Experiment/Fig/Exp1_0){#fig:Example1_dm15 width=same}
![``\lambda^2 = 10^{2}\mu_{1}``](./fig/Experiment/Fig/Exp1_2){#fig:Example1_dm16 width=same}
The comparison of the negative log-likelihood functions of the true distribution (``\psi_{1}``), the approximate distribution (``\psi_{2}``), and the Gaussian approximation distribution (``\psi_{3}``) with the values of ``\lambda^2 = 10^{-10}\mu_{1}, \, 10^{-6}\mu_{1}, \, 10^{-4}\mu_{1}, \, 10^{-2}\mu_{1}, \, 10^{0}\mu_{1},`` and ``10^{2}\mu_{1}``.

### A 2D layered model

In this section, we develop an example to compare the statistical quantities produced by the GARTO method with those produced by the RML method. Considering the large computational cost required by the RML method, we use a small three-layer model as our baseline model, shown in Figure [#fig:TrueLayer], whose size is ``1500 \, \mathrm{m} \, \times\, 3000 \, \mathrm{m}``. We discretize the velocity model with a grid spacing of ``50 \, \mathrm{m}``, yielding 1800 unknown parameters. At the surface, we place sixty sources and receivers with a horizontal sampling interval of ``50 \, \mathrm{m}``. To control the computational cost and ensure the feasibility of the RML method, we use a Ricker wavelet centered at ``6 \, \mathrm{Hz}`` to simulate the observed data with frequencies of ``5, \, 6``, and ``7 \, \mathrm{Hz}``. We use an absorbing boundary condition at the surface, so that no surface-related multiples are included. After the simulation, we add ``15 \%`` Gaussian noise to the observed data, resulting in a noise covariance matrix of ``\SIGMAN = 175^2\mathbf{I}``. To set up the prior distribution, we first construct the monotonically increasing model shown in Figure [#fig:PriLayer] as the prior mean model, which corresponds to the well-known observation that the velocity of the Earth, in general, increases with depth. Following the strategy used by @bardsley2015randomize that ensures the prior covariance matrix is well-conditioned, we construct the prior covariance matrix by selecting ``a \, = \, 0.1\, \mathrm{km}^2\mathrm{/s}^2``, ``b \, = \, 0.65``, and ``c \, = \, 0.01 \, \mathrm{km}^2\mathrm{/s}^2``, resulting in a prior standard deviation of ``0.33 \, \mathrm{km/s}`` that matches the maximal difference between the true model and the prior mean model. We select the penalty parameter ``\lambda`` for each frequency according to the proposed selection criterion, resulting in ``\lambda \, = 13,\, 12,\,`` and ``11``, respectively. To compute reliable statistical quantities while also ensuring that the RML method terminates in a reasonable amount of time, we use both the GARTO and the RML methods to generate ``10,000`` samples from the posterior distribution. Based on our experience, generating ``10,000`` samples is sufficient for both methods to produce stable results in this case.

### Figure: LayerModel{#fig:LayerModel}
![True model](./fig/Exp2/Layer_True){#fig:TrueLayer width=50%}
![Prior mean model](./fig/Exp2/Layer_Ini){#fig:PriLayer width=same}
The true model (a) and the prior mean model (b).
Given that the computational overhead introduced by these methods is negligible compared to the cost of the PDE solves, we use the number of PDE solves as our performance metric. To generate ``10,000`` samples, the RML method needs to solve ``10,000`` nonlinear optimization problems. To solve each nonlinear optimization problem, following @vanLeeuwen2014GEOPcav, we stop the optimization when the relative misfit difference between two iterations drops below ``10^{-3}``, resulting in ``100`` l-BFGS iterations. During each l-BFGS iteration, we have to solve ``2 \, \times \, 3 \, \times \, 60`` PDEs to compute the objective and gradient. As a result, the RML method requires ``360 \, \times \, 100\, \times \, 10000 \, = \, 3.6\times 10^{8}``, i.e., ``360`` million, PDE solves to draw the ``10,000`` samples. In contrast to the RML method, the GARTO method requires significantly fewer PDE solves. The total number of PDE solves required by the GARTO method is ``36,360``, which includes ``36,000`` PDE solves to find the MAP estimate and another ``360`` PDE solves to construct the Gauss-Newton operator. With the MAP estimate and the Gauss-Newton operator in hand, the GARTO method uses the RTO approach to draw the ``10,000`` samples without any additional PDE solves, as explained above. Therefore, neglecting the costs associated with solving the least-squares systems, the GARTO method requires only about ``\frac{1}{10000}\text{th}`` of the computational budget of the RML method to generate the same number of samples.

In addition to these significant computational speedups, the GARTO method generates samples that yield statistical quantities similar to those produced by the RML method. For instance, the posterior mean models obtained by the two methods are shown in Figure [#fig:LayerMEANCompare]. From Figures [#fig:LayerRTOMean] and [#fig:LayerInvMean], we observe that aside from some slight differences in the second and third layers, the two results are roughly identical, with an average pointwise relative difference of ``1.5\%``. Both results provide acceptable reconstructions of the original velocity model, despite the fact that the data are noisy. We also use the two posterior mean models to compute the predicted data and compare them with the observed data in Figures [#fig:LayerRTOData] and [#fig:LayerRMLData]\, which show that the predicted data faithfully match the observed data, aside from the noise. The pointwise standard deviations computed from both methods, shown in Figure [#fig:LayerSTDCompare], are visually quite similar throughout the entire model. The average relative difference between the standard deviations produced by both methods is ``6\%``, an acceptable error level that results from the Gaussian approximation in GARTO. Figure [#fig:LayerSTDCompare2] depicts the ``95 \%`` confidence intervals obtained with the two methods at three vertical cross-sections through the model. The confidence intervals obtained by the two methods are virtually identical, as are the pointwise marginal distributions shown in Figure [#fig:LayerDistCompare]. All these results illustrate that, compared to the RML method, the proposed GARTO method can produce accurate estimates of statistical quantities, while requiring a fraction of the computational cost.
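The statistical quantities quoted here are simple pointwise operations on the matrices of samples. The sketch below, with assumed array shapes and a percentile-based definition of the ``95\%`` interval (an assumption on our part), indicates how such summaries could be computed for either method.

```python
import numpy as np

def posterior_summaries(samples):
    """Pointwise summaries from an (n_samples, n_grid) array of draws
    (GARTO or RML): mean, standard deviation, and a 95% interval."""
    mean = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
    return mean, std, lo, hi

def avg_rel_diff(a, b):
    """Average pointwise relative difference between two estimates, e.g. the
    1.5% (means) and 6% (standard deviations) figures quoted in the text."""
    return np.mean(np.abs(a - b) / np.abs(b))
```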
### Figure: MEANCmp_Layer{#fig:LayerMEANCompare}
![GARTO](./fig/Exp2/Layer_RTOMEAN){#fig:LayerRTOMean width=50%}
![RML](./fig/Exp2/Layer_INVMEAN){#fig:LayerInvMean width=same}
The posterior mean models obtained by the GARTO method (a) and the RML method (b).

### Figure: STDCmp_Layer{#fig:LayerSTDCompare}
![GARTO](./fig/Exp2/Layer_RTOSTD){#fig:LayerRTOSTD width=50%}
![RML](./fig/Exp2/Layer_INVSTD){#fig:LayerInvSTD width=same}
The posterior standard deviations obtained by the GARTO method (a) and the RML method (b).

### Figure: DataDCmp_Layer{#fig:LayerDataCompare}
![GARTO](./fig/Exp2/DataRTO){#fig:LayerRTOData width=50%}
![RML](./fig/Exp2/DataRML){#fig:LayerRMLData width=same}
The comparison between the observed data and the predicted data with the posterior mean models obtained by the GARTO method (a) and the RML method (b).

### Figure: STDCmp_Layer2{#fig:LayerSTDCompare2}
![``x \, = \, 500 \, \mathrm{m}``](./fig/Exp2/Layer_STDcmpX_500){#fig:LayerCIcmp500 width=30%}
![``x \, = \, 1500 \, \mathrm{m}``](./fig/Exp2/Layer_STDcmpX_1500){#fig:LayerCIcmp1500 width=same}
![``x \, = \, 2500 \, \mathrm{m}``](./fig/Exp2/Layer_STDcmpX_2500){#fig:LayerCIcmp2500 width=same}
The mean (line) and ``95 \%`` confidence interval (background patch) comparisons of the GARTO method (blue) and the RML method (red) at ``x \, = \, 500 \, \mathrm{m}``, ``1500 \, \mathrm{m}``, and ``2500 \, \mathrm{m}``. The similarity between these two results implies that the confidence intervals obtained with the GARTO method are a good approximation of the ones obtained with the RML method.

### Figure: DistCmp_Layer2{#fig:LayerDistCompare}
![``x \, = \, 1500 \, \mathrm{m},\, z \, = \, 200 \, \mathrm{m}``](./fig/Exp2/rho_z200x_1500){#fig:Layerdist1 width=50%}\
![``x \, = \, 1500 \, \mathrm{m},\, z \, = \, 700 \, \mathrm{m}``](./fig/Exp2/rho_z700x_1500){#fig:Layerdist2 width=same}
![``x \, = \, 1500 \, \mathrm{m},\, z \, = \, 1200 \, \mathrm{m}``](./fig/Exp2/rho_z1200x_1500){#fig:Layerdist3 width=same}
The comparison of the prior marginal distribution (solid line) and the posterior marginal distributions obtained by the GARTO method (dotted line) and the RML method (dashed line) at the locations of ``(x,\, z) \, = \, (1500\mathrm{m}, \, 200\mathrm{m})``, ``(1500\mathrm{m}, \, 700\mathrm{m})``, and ``(1500\mathrm{m}, \, 1200\mathrm{m})``.

### BG Model

To demonstrate the effectiveness of our method when applied to a more realistic problem, we conduct an experiment on a 2D slice from the 3D BG Compass model shown in Figure [#fig:TrueBG]. This is a synthetic velocity model created by BG Group, which contains a large degree of variability and has been widely used to evaluate the performance of different FWI methods [@Li11TRfrfwi; @van2013fast]. The model is 2000 ``\mathrm{m}`` deep and 4500 ``\mathrm{m}`` wide, with a grid spacing of ``10 \, \mathrm{m}``, resulting in 92,455 unknown parameters. Following @van2013fast, we use a Ricker wavelet with a central frequency of ``15 \, \mathrm{Hz}`` to simulate 91 sources at a depth of ``50 \, \mathrm{m}`` with a horizontal sampling interval of ``50 \, \mathrm{m}``. As before, we do not model the free surface, so the data do not contain free-surface-related multiples. We place 451 equally spaced receivers at the same depth as the sources to record the data, which contain ``30`` equally spaced frequency components ranging from ``2 \, \mathrm{Hz}`` to ``31 \, \mathrm{Hz}``. This results in 1,231,230 observed data values.
To mimic a real-world noise level, we corrupt the synthetic observations with ``15 \%`` random Gaussian noise, leading to ``\SIGMAN = 46^2\mathbf{I}``. To construct the prior distribution, we first set its mean model (Figure [#fig:IniBG]) by smoothing the true model followed by horizontal averaging. Second, we construct the covariance matrix of the prior distribution by exploiting the fact that we have the true 3D model, which contains 1800 2D slices. We regard these 2D slices as 1800 2D samples, from which we compute the pointwise standard deviation. After horizontal averaging, we obtain the prior standard deviation shown in Figure [#fig:PriSTDBG]. With the prior standard deviation, we select ``a \, = \, 0.02 \, \mathrm{km}^2/\mathrm{s}^2`` and ``b \, = \, 19.5`` to construct a well-conditioned covariance matrix with a correlation length of ``60 \, \mathrm{m}`` [@bardsley2015randomize]. The parameter ``c`` in equation [#SIGMAP] is calculated from the standard deviation and the parameters ``a`` and ``b``. Finally, we compute the penalty parameter ``\lambda`` for each frequency (listed in Table #UQLambda) according to the criterion introduced earlier in order to obtain a posterior distribution that is less prone to local maxima. Balancing the available computational resources against the accuracy of the estimated statistical quantities, we use the GARTO method to draw 2000 samples, following @bardsley2015randomize. Compared to the previous example, which had a much simpler model, this model contains a significantly larger number of unknown parameters and data points and is a better proxy for real-world problems.

### Figure: BGModel{#fig:BGModel}
![True model](./fig/Exp4/BGTrue){#fig:TrueBG width=90%} \
![Prior mean model](./fig/Exp4/BGIni){#fig:IniBG width=same} \
![Prior standard deviation](./fig/Exp4/BGInistd){#fig:PriSTDBG width=same}
The true model (a), the prior mean model (b), and the prior standard deviation (c).

#### Table: UQLambda {#UQLambda .wide}
+-------------+------+------+------+------+-----+----+----+----+----+----+
| Frequency   | 2    | 3    | 4    | 5    | 6   | 7  | 8  | 9  | 10 | 11 |
+-------------+------+------+------+------+-----+----+----+----+----+----+
| ``\lambda`` | 37.8 | 29.3 | 24.4 | 20.5 | 17.6|15.2|13.5|12.1|10.9|10.1|
+-------------+------+------+------+------+-----+----+----+----+----+----+
| Frequency   | 12   | 13   | 14   | 15   | 16  | 17 | 18 | 19 | 20 | 21 |
+-------------+------+------+------+------+-----+----+----+----+----+----+
| ``\lambda`` | 9.3  | 8.8  | 8.3  | 7.9  | 7.5 | 7.1| 6.9| 6.5| 6.4| 6.1|
+-------------+------+------+------+------+-----+----+----+----+----+----+
| Frequency   | 22   | 23   | 24   | 25   | 26  | 27 | 28 | 29 | 30 | 31 |
+-------------+------+------+------+------+-----+----+----+----+----+----+
| ``\lambda`` | 5.9  | 5.7  | 5.5  | 5.4  | 5.2 | 5.1| 4.9| 4.8| 4.7| 4.6|
+-------------+------+------+------+------+-----+----+----+----+----+----+

: The selection of ``\lambda`` for each frequency

During the inversion, we use 200 l-BFGS iterations, with the same stopping criterion as in the previous example, to compute the MAP estimate plotted in Figure [#fig:MAPBG]. Compared to the true model, we observe that most of the observable velocity structures are reconstructed in the MAP estimate, aside from some small measurement-noise-related imprints near the boundary of the model.
We also observe that the shallow parts (``z \, \leq \, 1400 \, \mathrm{m}``) of the BG model, where the turning waves exist (for the maximal offset of 4500 ``\mathrm{m}`` [@VirieuxOverview2009]), are better recovered than the deep parts (``z \, \geq \, 1400 \, \mathrm{m}``), where the only recorded energy arises from reflected waves. This implies that the portion of the data corresponding to the turning waves is more informative for the velocity reconstruction than that of the reflected waves, which is a well-known observation in seismic inversions. After obtaining the MAP estimate, we construct the Gauss-Newton Hessian operator and apply the RTO method to generate 2000 samples. This allows us to compute the posterior standard deviation (Figure #fig:STDBG) and compare it with the prior standard deviation (Figure #fig:PriSTDBG). To better understand the information that the data introduce, we also compute the relative difference ``\frac{\text{STD}_{\text{post}}(m_k)-\text{STD}_{\text{prior}}(m_k)}{|\text{STD}_{\text{prior}}(m_k)|}`` between the posterior and prior standard deviations (Figure [#fig:STDRELDIFFBG]). In the shallow parts of the model (``z \, \leq \, 1400 \, \mathrm{m}``), the posterior standard deviation decreases by ``10\%`` to ``50\%`` relative to the prior standard deviation, while in the deep parts (``z \, \geq \, 1400 \, \mathrm{m}``), the reduction in standard deviation is smaller than ``3 \%``. This observation is consistent with the notion that, owing to the amplitude decay of propagating waves, the data place more constraints on the velocity variations in the shallow parts of the model than in the deep parts. Additionally, considering the areas where the turning waves and the reflected waves exist, this observation also implies that the portion of the data corresponding to the turning waves reduces more of the uncertainty in the recovered velocity than the reflected waves do. To further evaluate this inversion result, we compare the prior mean model, the MAP estimate of the posterior, and the true velocity at three different cross sections in Figure [#fig:BGCross] (i.e., ``x \, = 1000 \, {\mathrm{m}}, \, 2500 \, {\mathrm{m}},`` and ``4000 \, {\mathrm{m}}``), in which we also plot the prior standard deviation (red patch) and the posterior standard deviation (blue patch). In the shallow region of the model, the MAP estimate closely matches the true model, while it deviates more noticeably in the deeper region. This is again consistent with the notion that the data are more informative in the shallow area of the model than in the deeper areas.

### Figure: BGResult{#fig:BGResult}
![Posterior MAP](./fig/Exp4/BGMAP){#fig:MAPBG width=90%} \
![Posterior standard deviation](./fig/Exp4/BGRTOSTD){#fig:STDBG width=same} \
![Relative difference between the posterior and prior standard deviations](./fig/Exp4/BGSTDRELDIFF){#fig:STDRELDIFFBG width=same}
The posterior MAP estimate (a), the posterior standard deviation (b), and the relative difference ``\frac{\text{STD}_{\text{post}}(m_k)-\text{STD}_{\text{prior}}(m_k)}{|\text{STD}_{\text{prior}}(m_k)|}`` between the posterior and the prior standard deviations (c).
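The relative-difference panel in Figure [#fig:STDRELDIFFBG] and the shallow/deep summaries quoted above are pointwise operations on the two standard-deviation fields. A minimal sketch, assuming ``(n_z, n_x)`` array layouts and a hypothetical depth axis `z`, is:

```python
import numpy as np

def std_reduction(std_post, std_prior, z, z_split=1400.0):
    """Relative difference (STD_post - STD_prior) / |STD_prior| on an
    (n_z, n_x) grid, together with its average over the shallow
    (z <= z_split) and deep (z > z_split) parts of the model."""
    rel = (std_post - std_prior) / np.abs(std_prior)
    shallow = rel[z <= z_split].mean()  # negative values indicate a reduction
    deep = rel[z > z_split].mean()
    return rel, shallow, deep
```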
### Figure: BGCross{#fig:BGCross}
![``x \, = \, 1000 \, \mathrm{m}``](./fig/Exp4/BG_STDcmpX_1000){#fig:BGSTDcmp1000 width=30%}
![``x \, = \, 2500 \, \mathrm{m}``](./fig/Exp4/BG_STDcmpX_2500){#fig:BGSTDcmp2500 width=same}
![``x \, = \, 4000 \, \mathrm{m}``](./fig/Exp4/BG_STDcmpX_4000){#fig:BGSTDcmp4000 width=same}
The mean and standard deviation comparisons of the posterior (blue) and the prior (red) distributions at ``x \, = \, 1000 \, \mathrm{m}``, ``2500 \, \mathrm{m}``, and ``4000 \, \mathrm{m}``.

We also examine the pointwise marginal distributions of the prior and posterior distributions to further understand the results of the GARTO method. Figure [#fig:BGDistCmp] compares these distributions at four different locations, ``(x,\, z) \, = \, (2240\mathrm{m}, \, 40\mathrm{m})``, ``(2240\mathrm{m}, \, 640\mathrm{m})``, ``(2240\mathrm{m}, \, 1240\mathrm{m})``, and ``(2240\mathrm{m}, \, 1840\mathrm{m})``. The posterior distribution is more concentrated than the prior distribution in the shallow regions of the model, while in the deep parts, the two distributions are almost identical. Therefore, the recovered velocity in the shallow parts is more reliable than in the deep parts.

### Figure: BGDistCmp{#fig:BGDistCmp}
![``x \, = \, 2240 \, \mathrm{m},\, z \, = \, 40 \, \mathrm{m}``](./fig/Exp4/rho_z40x_2240){#fig:BGdist1 width=50%}
![``x \, = \, 2240 \, \mathrm{m},\, z \, = \, 640 \, \mathrm{m}``](./fig/Exp4/rho_z640x_2240){#fig:BGdist2 width=same}\
![``x \, = \, 2240 \, \mathrm{m},\, z \, = \, 1240 \, \mathrm{m}``](./fig/Exp4/rho_z1240x_2240){#fig:BGdist3 width=same}
![``x \, = \, 2240 \, \mathrm{m},\, z \, = \, 1840 \, \mathrm{m}``](./fig/Exp4/rho_z1840x_2240){#fig:BGdist4 width=same}
The comparison of the prior (solid line) and the posterior (dotted line) marginal distributions at the locations of ``(x,\, z) \, = \, (2240\mathrm{m}, \, 40\mathrm{m})``, ``(2240\mathrm{m}, \, 640\mathrm{m})``, ``(2240\mathrm{m}, \, 1240\mathrm{m})``, and ``(2240\mathrm{m}, \, 1840\mathrm{m})``.

To verify our statistical results, we also utilize the RML method as a baseline approach. For this example, drawing a single sample with the RML method using 200 l-BFGS iterations requires at least 1.09 million PDE solves, which is computationally daunting. As a result, we only generate 10 samples via the RML method and compare them to the ``95\%`` confidence intervals (i.e., the blue patches) obtained by the GARTO method in Figure [#fig:BGCIcmp]. From these figures, it is clear that the majority of the 10 samples lie within the blue patches. Moreover, the variation of the ten samples also matches the ``95\%`` confidence interval. In this case, we conclude that the estimated confidence intervals are likely accurate approximations of the true confidence intervals.

### Figure: BGCIcmp{#fig:BGCIcmp}
![``x \, = \, 1000 \, \mathrm{m}``](./fig/Exp4/BG_CIX_1000){#fig:BGCIcmp1000 width=30%}
![``x \, = \, 2500 \, \mathrm{m}``](./fig/Exp4/BG_CIX_2500){#fig:BGCIcmp2500 width=same}
![``x \, = \, 4000 \, \mathrm{m}``](./fig/Exp4/BG_CIX_4000){#fig:BGCIcmp4000 width=same}
The ``95 \%`` confidence intervals and the 10 realizations via the RML method at ``x \, = \, 1000 \, \mathrm{m}``, ``2500 \, \mathrm{m}``, and ``4000 \, \mathrm{m}``.

## Discussion

When the underlying model is given by PDE-constraints involving multiple source experiments, one is forced to make various approximations and tradeoffs in order to develop a computationally tractable procedure.
There are a large number of discretized parameters, corresponding to a discretized 2D or 3D function, and one must necessarily avoid having to explicitly construct dense covariance matrices of the size of the model grid squared, whose construction requires a large number of PDE solves. Moreover, each evaluation of the posterior distribution involves the solution of multiple PDEs, a computationally expensive affair. Methods that require tens or hundreds of thousands of such posterior evaluations, such as McMC-type methods, do not scale to realistically sized problems. The original PDE-constrained formulation of Bayesian inference, while theoretically convenient, results in a posterior that cannot be appropriately approximated by a Gaussian distribution, whereas the relaxation of the exact PDE-constraints results in a posterior that is much more amenable to Gaussian approximation. Ideally, one would like to parameterize the distribution over the joint model/field variables ``(\mvec,\uvec)`` and estimate the variances accordingly. Hidden in this notation, however, is the fact that in real-world problems, we have hundreds or potentially thousands of source experiments, each of which corresponds to a different ``\uvec``. Storing all of these fields and updating them explicitly becomes prohibitive memory-wise, even on a large cluster. As a result, our approach aims to keep the benefits of relaxing the PDE-constraints while still resulting in a computationally feasible algorithm. The initial results illustrate that with the specific selection of ``\lambda``, i.e., ``\lambda^2 = 0.01\mu_{1}``, the relaxed formulation of the posterior PDF is less prone to local maxima, which enables us to analyze it via an arguably accurate Gaussian approximation. Because we can manipulate the covariance matrix of the Gaussian approximation through the implicit PDE-free formulation, we have tractable access to a variety of statistical quantities, including the pointwise standard deviation, the confidence intervals, and the pointwise marginal distributions. We can use these quantities to assess the uncertainties of our inversion results and to identify areas of high or low reliability in the model. This information is useful for subsequent processing and interpretation; for example, it allows us to assess the reliability of visible image features obtained by subsequent imaging, as in @ely2017assessing. While the initial results are promising, some aspects of the proposed framework warrant further investigation. Although the numerical examples illustrate the feasibility of the proposed method for the selection ``\lambda^2 = 0.01\mu_{1}``, this does not guarantee the feasibility of the proposed approach for PDFs arising from other choices of ``\lambda``. For other selections of ``\lambda``, the posterior PDFs can differ significantly from a Gaussian PDF, which makes approximately sampling them challenging; other sampling schemes could potentially be explored for these distributions. While we have shown the feasibility of the proposed sampling method for 2D problems, its application to 3D problems is still challenging from the perspective of memory usage.
To satisfy the ``\mathcal{O}\big (\ngrid \times (\nrcv + \nsrc) \times \nfreq \big )`` storage requirement for forming the implicit Gauss-Newton Hessian operator, a large high-performance cluster with sufficient computational workers and memory is needed to store all of the necessary matrices in parallel.

## Conclusions

We have described a new Bayesian framework for partial-differential-equation-constrained inverse problems. In our new framework, we introduce the field variables as auxiliary variables and relax the partial-differential-equation-constraint. By integrating out the field parameters, we avoid having to store and update the large number of field parameters, and we exploit the fact that the new distribution is Gaussian in the field variables for every fixed model. We propose a method for selecting an appropriate penalty parameter such that the new posterior distribution can be approximated by a Gaussian distribution more accurately than the posterior of the conventional formulation can. We apply the new formulation to seismic inverse problems and derive each component of the general framework. For this specific application, we use a partial-differential-equation-free Gauss-Newton Hessian to formulate the Gaussian approximation of the posterior distribution. We also illustrate that, with this Gauss-Newton Hessian, the Gaussian approximation can be sampled without explicitly forming the Gauss-Newton Hessian or its Cholesky factorization. Our proposed method compares favorably to the existing randomized maximum likelihood method for generating different statistics on a simple layered model and the more challenging BG Compass model. Compared to the randomized maximum likelihood method, our method produces results that are visually quite similar, while requiring significantly fewer partial-differential-equation solves, which are the computational bottleneck in these problems. We expect that these methods will scale reasonably well to 3D models, where traditional methods have only begun to scratch the surface of the problem. While in this paper we utilize a Gaussian smoothness prior distribution, the proposed sampling method can also handle other choices of prior distributions, as long as they have sparse covariance matrices. In future work, we plan to investigate combining the proposed method with other kinds of prior distributions.

## Acknowledgments

This research was carried out as part of the SINBAD II project with the support of the member organizations of the SINBAD Consortium. The authors wish to acknowledge Dr. James Gunning from CSIRO for his valuable discussion and suggestions. The authors also wish to acknowledge the SENAI CIMATEC Supercomputing Center for Industrial Innovation, with support from BG Brasil and the Brazilian Authority for Oil, Gas and Biofuels (ANP), for the provision and operation of computational facilities and the commitment to invest in Research & Development.

\append{Marginal distribution}
\label{append}

To derive the marginal PDF ``\rhopost(\mvec|\dvec)`` in equation #Marginal, we start with the joint PDF
```math #PostDPApp
\rhopost(\uvec,\mvec|\dvec) \propto \rho(\dvec|\uvec,\mvec)\rho(\uvec|\mvec)\rhoprior(\mvec).
```
As the noise in the data is assumed to be Gaussian, we have
```math #PDFUMDApp
\rho(\dvec|\uvec,\mvec) = (2\pi)^{-\frac{\ndata}{2}}\det(\SIGMAN)^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\|\Pmat\uvec - \dvec\|^{2}_{\SIGMAN^{-1}}\right).
``` Substituting equations #CndPDF and #PDFUMDApp into equation #PostDPApp, we arrive at ```math #JointPDFApp \rhopost(\uvec,\mvec|\dvec) \propto &(2\pi)^{-\frac{\ngrid}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}} \\ &\times \exp\Big (-\frac{1}{2}\big (\lambda^2\|\A(\mvec){\uvec}-\q\|^{2} + \|\Pmat{\uvec} - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big )\Big ). ``` Clearly, for fixed ``\mvec``, the PDF ``\rhopost(\uvec,{\mvec}|\dvec)`` with respect to ``\uvec`` is rendered into a Gaussian PDF with a mean value ``\overline{\uvec}({\mvec})`` given by equation #OptU1 and a covariance matrix ``\SIGMAU({\mvec})`` given by ```math #HuApp \SIGMAU({\mvec})^{-1}=\Hmat({\mvec}) &= -\nabla^{2}_{\uvec}\log\rhopost(\uvec,{\mvec}|\dvec)\big\rvert_{\uvec=\overline{\uvec}(\mvec)}\\ &=\lambda^{2}\A({\mvec})^{\top}\A({\mvec}) + \Pmat^{\top}\SIGMAN^{-1}\Pmat. ``` Given this identity, we can integrate ``\uvec`` out and derive the following closed form expression for the marginal PDF of ``\mvec:`` ```math #MarginalApp \rhopost(\mvec|\dvec) &= \int \rhopost(\uvec,\mvec|\dvec)d\uvec \\ & =(2\pi)^{\frac{\ngrid}{2}}\operatorname{det}\big(\Hmat(\mvec)\big)^{-\frac{1}{2}} \rhopost\big (\mvec,\overline{\uvec}(\mvec)|\dvec\big ) \\ & \propto\text{det}\big (\Hmat(\mvec)\big )^{-\frac{1}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}} \\ &\quad \, \, \times\exp\Big(-\frac{1}{2}\big (\lambda^2\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} + \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big )\Big). ``` To derive the limit of the marginal posterior PDF, equation #MarginalApp, as ``\lambda \rightarrow \infty``, we insert equation #HuApp into the term ``\text{det}\big (\Hmat(\mvec)\big )^{-\frac{1}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}}`` and obtain ```math #det &\operatorname{det}\big(\Hmat(\mvec)\big)^{-\frac{1}{2}}\det\big (\lambda^{2}\A(\mvec)^{\top}\A(\mvec)\big )^{\frac{1}{2}} \\ = &\det\left(\Imat + \frac{1}{\lambda^2}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-1}\Pmat\A(\mvec )^{-1}\right)^{-\frac{1}{2}} \\ = &\det\left(\Imat + \frac{1}{\lambda^2}\SIGMAN^{-\frac{1}{2}}\Pmat\A(\mvec)^{-1}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-\frac{\top}{2}}\right)^{-\frac{1}{2}}. ``` As a result, we arrive at the following expression for the marginalized posterior distribution ```math #Marginal2 \rhopost(\mvec|\dvec) &\propto \det\left(\Imat + \frac{1}{\lambda^2}\SIGMAN^{-\frac{1}{2}}\Pmat\A(\mvec)^{-1}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{-\frac{\top}{2}}\right)^{-\frac{1}{2}} \\ & \quad \, \, \, \times \exp\left(-\frac{1}{2}\big (\lambda^2\|\A(\mvec)\overline{\uvec}(\mvec)-\q\|^{2} + \|\Pmat\overline{\uvec}(\mvec) - \dvec\|^{2}_{\SIGMAN^{-1}} + \|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}\big ) \right). ``` When ``\lambda \rightarrow \infty``, we have ```math #limitatin1 &\lim_{\lambda \rightarrow \infty}\det\big (\Imat + \frac{1}{\lambda^2}\SIGMAN^{-\frac{1}{2}}\Pmat\A(\mvec)^{-1}\A(\mvec)^{-\top}\Pmat^{\top}\SIGMAN^{- \frac{\top}{2}}\big )^{-\frac{1}{2}} = 1, \text{ and} \\ &\lim_{\lambda \rightarrow \infty}\overline{\uvec}(\mvec) = \A(\mvec)^{-1}\q, ``` which leads to ```math #limitation2App \lim_{\lambda \rightarrow \infty} \rhopost(\mvec|\dvec) \propto \exp\big ({-\frac{1}{2}\|\Pmat\A(\mvec)^{-1}\q - \dvec\|^{2}_{\SIGMAN^{-1}}-\frac{1}{2}\|\mvec-\mp\|^{2}_{\SIGMAP^{-1}}}\big ). ``` As expected, the marginal PDF of the penalty formulation (cf. 
equation #MarginalApp), converges to the conventional reduced formulation (cf. equation #CPost) as ``\lambda \rightarrow \infty``. ```math_def \def\argmin{\mathop{\rm arg\min}} \def\argmax{\mathop{\rm arg\max}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\A}{\mathbf{A}} \newcommand{\q}{\mathbf{q}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\post}{\text{post}} \newcommand{\rhopost}{\rho_{\text{post}}} \newcommand{\prior}{\text{prior}} \newcommand{\noise}{\text{noise}} \newcommand{\mis}{\text{mis}} \newcommand{\like}{\text{like}} \newcommand{\rholike}{\rho_{\text{like}}} \newcommand{\rholikepen}{{\rho}_{\text{like}}^{\text{pen}}} \newcommand{\overrholikepen}{\overline{\rho}_{\text{like}}^{\text{pen}}} \newcommand{\rhoprior}{\rho_{\text{prior}}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\obs}{\text{obs}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\Hmat}{\mathbf{H}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\g}{\mathbf{g}} \newcommand{\y}{\mathbf{y}} \newcommand{\dobs}{\mathbf{d}_{\text{obs}}} \newcommand{\F}{\mathbf{F}} \newcommand{\SIGMAN}{{\Gamma}_{\text{noise}}} \newcommand{\SIGMA}{\Gamma} \newcommand{\SIGMANij}{_{i,j}{\Gamma}_{\text{noise}}} \newcommand{\SIGMAPDE}{{\Gamma}_{\text{pde}}} \newcommand{\SIGMAP}{{\Gamma}_{\text{prior}}} \newcommand{\SIGMAU}{{\Gamma}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\Gmat}{\mathbf{G}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Rmat}{\mathbf{R}} \newcommand{\Rlike}{\mathbf{R}_{\like}} \newcommand{\Rprior}{\mathbf{R}_{\text{prior}}} \newcommand{\Rpost}{\mathbf{R}_{\text{post}}} \newcommand{\Hpostt}{\tilde{\mathbf{H}}_{\text{post}}} \newcommand{\Hliket}{\tilde{\mathbf{H}}_{\text{like}}} \newcommand{\Smat}{\mathbf{S}} \newcommand{\Mmat}{\mathbf{M}} \newcommand{\Jmat}{\mathbf{J}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\nsrc}{n_{\text{src}}} \newcommand{\nrcv}{n_{\text{rcv}}} \newcommand{\nfreq}{n_{\text{freq}}} \newcommand{\ngrid}{n_{\text{grid}}} \newcommand{\niter}{n_{\text{iter}}} \newcommand{\nsmp}{n_{\text{smp}}} \newcommand{\mMAP}{\mathbf{m}_{\ast}} \newcommand{\Lmis}{\Rmat_{\text{mis}}} \newcommand{\nd}{n_{\text{data}}} \newcommand{\Hp}{\Hmat_{\text{prior}}} \newcommand{\fred}{f_\text{red}} \newcommand{\gred}{g_\text{red}} \newcommand{\fpen}{f_\text{pen}} \newcommand{\gpen}{g_\text{pen}} \newcommand{\Hlike}{\Hmat_{\text{like}}} \newcommand{\Hprior}{\Hmat_{\text{prior}}} \newcommand{\Hpost}{\Hmat_{\text{post}}} \renewcommand{\mp}{\tilde{\mvec}} \renewcommand{\SIGMAPDE}{\Gamma_{\text{PDE}}} \renewcommand{\SIGMA}{\Gamma} \newcommand{\RR}{\mathbb{R}} \newcommand{\CC}{\mathbb{C}} \newcommand{\ndata}{n_{\text{data}}} \newcommand{\NN}{\mathcal{N}} \newcommand{\II}{\mathcal{I}} \newcommand{\rholikeover}{\overline{\rho}_{\text{like}}} \newcommand{\rhopostover}{\overline{\rho}_{\text{post}}} ```