--- title: A stochastic quasi-Newton McMC method for uncertainty quantification of full-waveform inversion author: | Zhilong Fang^\*^, Felix J. Herrmann^\*^, and Chia Ying Lee^†^\ ^\*^Seismic Laboratory for Imaging and Modeling (SLIM), University of British Columbia, ^†^Department of Mathematics, University of British Columbia, bibliography: - zfang.bib runninghead: A stochastic quasi-Newton McMC method for uncertainty quantification of FWI --- \vspace{-1em} ## Abstract: \vspace{-1em} In this work we propose a stochastic quasi-Newton Markov chain Monte Carlo (McMC) method to quantify the uncertainty of full-waveform inversion (FWI). We formulate the uncertainty quantification problem in the framework of the Bayesian inference, which formulates the posterior probability as the conditional probability of the model given the observed data. The Metropolis-Hasting algorithm is used to generate samples satisfying the posterior probability density function (pdf) to quantify the uncertainty. However it suffers from the challenge to construct a proposal distribution that simultaneously provides a good representation of the true posterior pdf and is easy to manipulate. To address this challenge, we propose a stochastic quasi-Newton McMC method, which relies on the fact that the Hessian of the deterministic problem is equivalent to the inverse of the covariance matrix of the posterior pdf. The l-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) Hessian is used to approximate the inverse of the covariance matrix efficiently, and the randomized source sub-sampling strategy is used to reduce the computational cost of evaluating the posterior pdf and constructing the l-BFGS Hessian. Numerical experiments show the capability of this stochastic quasi-Newton McMC method to quantify the uncertainty of FWI with a considerable low cost. \vspace{-1em} ## Introduction \vspace{-1em} Uncertainty is present in each procedure of seismic exploration, from acquisition to modeling and processing. Because of this, quantifying the uncertainty of inversion results has become increasingly important in order to inform decisions in industry. Traditionally, the uncertainty can be quantified in the framework of Bayesian inference, which formulates the posterior probability distribution ``\rho_{\post}(\v)`` of the velocity model ``\mathbf{v}`` as the conditional probability ``\rho(\v|\d_{\obs})`` of the velocity given the observed data ``\d_{\obs}``. The conditional probability ``\rho(\v|\d_{\obs})`` is proportional to the product of the prior on the model ``\rho_{\prior}(\v)`` and the probability of the observed data given the model ``\rho(\d_{\obs}|\v)`` -- i.e., we have: ```math #Bayesian \rho_{\text{post}}(\mathbf{v}) = \rho({\mathbf{v}|\mathbf{d}_{\text{obs}}}) \propto \rho_{\text{like}}({\mathbf{d}_{\text{obs}}|\mathbf{v}})\rho_{\text{prior}}(\mathbf{v} ). ``` However, quantifying the uncertainty of FWI based on the formula ([#Bayesian]) is challenging, because of the high dimension of the velocity model and the huge computational cost of calculating the likelihood probability density function ``\rho_{\like}({\d_{\obs}|\v})``. @MartinMcMC2012 proposed a stochastic Newton type McMC method using a low-rank approximated Gauss-Newton Hessian of the likelihood probability density function to improve the convergence speed. An application of this McMC method to solving a 3D seismology problem can be found in @GhattasMcMC2013. Unfortunately, the low-rank assumption of the Gauss-Newton Hessian may not be valid in the seismic exploration, because of the fully sampled sources and receivers. Additionally, challenges like the high computational cost of the calculation of the low-rank representative and the posterior pdf still remain. The limited-memory Broyden–Fletcher–Goldfarb–Shanno (l-BFGS) method is a popular quasi-Newton method for solving non-linear optimization problems. Instead of using the full Hessian or Gauss-Newton Hessian, the l-BFGS method constructs an approximated Hessian and its inverse based on the information of the model updates and gradients of the previous iterations, and thus it does not require additional computational costs. The application of the l-BFGS method on FWI can be found in @Overview2009. Recently, with the development of the randomized source sub-sampling method, @TristanFWI2013 proposed a stochastic l-BFGS method, in which only few sources were used per iteration to reduce the computational cost of evaluating the misfit function, gradient and quasi-Newton Hessian. The authors showed that instead of using all sources and few iterations, using few sources but many iterations could significantly improve the quality of the result. In our previous work [@fang2014EAGEfuq], we applied the randomized source sub-sampling strategy to the McMC method that @MartinMcMC2012 proposed and significantly reduced the computational cost of evaluating the posterior pdf. In this work, we apply the l-BFGS method and the randomized source sub-sampling strategy to the McMC method and propose a stochastic quasi-Newton McMC algorithm. This algorithm does not require additional computational cost for estimating the Hessian and reduces the computational cost of estimating the posterior pdf, gradient and Hessian by sub-sampling sources. In the numerical experiment, we use this method to quantify the uncertainty of FWI and obtain statistical parameters like standard deviation and confidence interval. \vspace{-1em} ## Metropolis-Hasting Method \vspace{-1em} The posterior distribution ([#Bayesian]) is complex and difficult to sample, as the forward operator requires a PDE solve for each source. Many sampling methods, such as McMC method, have been developed to sample complex posterior distributions using many fewer samples than the grid-based sampling method. In particular, the Metropolis-Hasting (M-H) method (Algorithm [#alg:1]) [@kaipio:2004] generates a chain of samples from the posterior pdf ``\rhopost(\v)`` by employing a proposal distribution ``q(\v_{k},\y)`` at ``k``th sample in the Markov chain. The proposal sample ``\y`` is firstly generated from the proposal distribution ``q(\v_{k},\y)`` and accepted or rejected by the M-H criterion. A simple choice of the proposal distribution is ``q(\v_{k},\y) = \frac{1}{(2\pi)^{n/2}}\exp\left[\frac{1}{2}\left(\|\|\v_{k}-\y\|\|^{2}\right)\right]``, which results in the well-known random walk Metropolis method. This proposal distribution is easy to sample, but since it does not contain any information of the true posterior distribution, the proposal sample ``\y`` may be accepted with a low probability, which leads to a slow convergence of the Markov chain. If a better proposal distribution is chosen, the accepted probability can be larger, but constructing such a good proposal distribution and sampling from it may require significantly larger computational cost. Therefore the main challenge of M-H method is to devise the proposal distribution ``q(\v_{k},\y)``, so that it can be easily sampled while still constituting a good approximation of the posterior pdf ``\rhopost(\v)``. #### Algorithm: {#alg:1} | Choose initial parameter ``\v_{0}`` | Compute ``\rhopost(\v_{0})`` based on formula ([#Bayesian]) | **for** ``k = 0,...,N-1`` **do** | 1. Draw sample ``\y`` from the proposal density ``q(\v_{k},\centerdot)`` | 2. Compute ``\rhopost(\y)`` | 3. Compute ``\alpha(\v_{k},\y) = \min\left(1,\frac{\rhopost(\y)q(\y,\v_{k})}{\rhopost(\v_{k})q(\v_{k},\y)}\right)`` | 4. Draw ``u`` ~ ``\mathcal{U}\left(\left[0,1\right]\right)`` | 5. **if** ``u < \alpha\left(\v_{k},\y\right)`` **then** | 6. Accept: Set ``\v_{k+1} = \y`` | 7. **else** | 8. Reject: Set ``\v_{k+1} = \v_{k}`` | 9. **end if** | **end for** : Metropolis-Hasting method to sample ``\rho_{\text{post}}``. \vspace{-1em} ## Stochastic Quasi-Newton McMC Algorithm with randomized source sub-sampling \vspace{-1em} In order to construct a good proposal distribution, an analysis of the posterior pdf ([#Bayesian]) is necessary. Assume the observed data ``\d_{\obs}`` has a Gaussian noise with zero mean and covariance matrix ``\Sigma_{\noise}``. Meanwhile, assume the prior model distribution can be also modeled as a Gaussian distribution with ``\mathbf{\overline{v}_{\prior}}`` mean and covariance matrix ``\Sigma_{\prior}``. The posterior pdf of velocity can be expressed as follows: ```math #PosteriorPDF \rho_{\text{post}}(\mathbf{v}) \propto \exp[-\frac{1}{2}\|f(\mathbf{v})-\mathbf{d}_{\text{obs}}\|^{2}_{{\Sigma}^{-1}_{\text{noise}}}-\frac{1}{2}\|\mathbf{v}-\overline{\mathbf{v}}_{\text{prior}}\|^{2}_{{\Sigma}^{-1}_{\text{prior}}}], ``` where, ``f(\v)`` is the forward operator. To devise the proposal distribution, we analyze the negative log function of the posterior pdf, ```math #logpost -\log(\rho_{\post}(\v)) = \frac{1}{2}\|f(\mathbf{v})-\mathbf{d}_{\text{obs}}\|^{2}_{{\Sigma}^{-1}_{\text{noise}}}+\frac{1}{2}\|\mathbf{v}-\overline{\mathbf{v}}_{\text{prior}}\|^{2}_{{\Sigma}^{-1}_{\text{prior}}}, ``` which is equivalent to the objective function of the deterministic problem with weighting matrix ``{\Sigma^{-1}_{\noise}}`` and ``{\Sigma^{-1}_{\prior}}``. In fact, the Hessian ``\H(\v_{k})`` of ``-\log(\rho_{\text{post}}(\v_{k}))`` at a certain point ``\v_{k}`` is the inverse of the covariance matrix of the approximated Gaussian distribution ``q(\v_{k},\centerdot)`` at the same point. Additionally, the mean of the approximated Gaussian distribution is equivalent to the result of one Newton type update: ``\v_{\text{mean}}=\v_{k} - \H_{k}^{-1}\g_{k}``. @MartinMcMC2012 suggested the use of the approximated Gaussian distribution ``\mathcal{N}(\v_{k} - \H_{k}^{-1}\g_{k}, \H_{k}^{-1})`` to speed up the convergence. In this work, instead of using the low-rank approximated Gauss-Newton Hessian to construct the proposal distribution as in @MartinMcMC2012, we use the l-BFGS Hessian to build the proposal distribution. As a result, we do not require the low-rank assumption and the additional computational cost associated with the low-rank representative of the Gauss-Newton Hessian in each iteration. In order to construct the l-BFGS hessian close enough to the true Hessian, we start the construction from a pseudo Gauss-Newton Hessian [@Choi01102008], which is a diagonal matrix with properly scaling. The pseudo Hessian is then continuously updated by the addition of two 1-rank matrices in each iteration. Additionally, during each iteration, we use the randomized source sub-sampling method to reduce the computational cost of evaluating the posterior pdf, gradient and l-BFGS Hessian, and these limit the number of PDE solves. @MichaelSO2012 provided an extensive comparison and analysis between the sub-sampling gradient and the full gradient. In our previous work [@fang2014EAGEfuq], we also showed that the randomized source sub-sampling method only introduced a small relative error (Figure [#fig:subvsall]). After constructing the local l-BFGS Hessian ``\H_{k}`` and gradient ``\g_{k}``, we define the local Gaussian ```math #LocalGauss q(\v_{k},\y)\propto \exp\left(-\frac{1}{2}\left(\y-\v_{k}+\H_{k}^{-1}\g_{k}\right)^{T}\H_{k}\left(\y-\v_{k}+\H_{k}^{-1}\g_{k}\right)\right) ``` as the proposal distribution. The proposal sample ``\y`` from this distribution can be generated as follows: ```math #GenerateY \y = \v_{k}-\H_{k}^{-1}\g_{k} + \H_{k}^{-1/2}\gamma, ``` where ``\gamma`` is a random vector from the normal distribution ``\mathcal{N}(0,\mathbf{I})``. Finally, we use the M-H criterion to accept or reject the proposal sample ``\y``. The pseudocode for the stochastic quasi-Newton McMC algorithm is given in Algorithm [#alg:2]. #### Algorithm: {#alg:2} | Choose initial ``\v_{0}`` | Compute ``\rhopost(\v_{0})``, ``\g(\v_{0})``, ``\H(\v_{0})`` (``\H`` is the l-BFGS Hessian) | **for** ``k = 0,...,N-1`` **do** | 1. Define ``q(\v_{k},\y)`` based on formula ([#LocalGauss]) | 2. Draw sample ``\y`` from the proposal density ``q(\v_{k},\centerdot)`` with | formula ([#GenerateY]) | 3. Compute ``\rhopost(\y)``, ``\g(\y)``, ``\H(\y)`` with a random subset ``\mathcal{I}_{sub}`` | of all sources | 4. Compute ``\alpha(\v_{k},\y) = \min\left(1,\frac{\rhopost(\y)q(\y,\v_{k})}{\rhopost(\v_{k})q(\v_{k},\y)}\right)`` | 5. Draw ``u`` ~ ``\mathcal{U}\left(\left[0,1\right]\right)`` | 6. **if** ``u < \alpha\left(\v_{k},\y\right)`` **then** | 7. Accept: Set ``\v_{k+1} = \y`` | 8. **else** | 9. Reject: Set ``\v_{k+1} = \v_{k}`` | 10. **end if** | **end for** : Stochastic Quasi Newton McMC Algorithm to sample ``\rho_{\text{post}}``. ### Figure: Ratio of pdf using sub-set sources over pdf using all sources {#fig:subvsall } ![](figs/randomvsall.png){width=90%} Ratio of pdf using sub-set sources over pdf using all sources. The relative error is below 3%. \vspace{-1em} ## Numerical experiment \vspace{-1em} ### Figure: True model and initial model {#fig:TrueAndIni } ![](figs/bgtrue.png){#fig:truemodel width=90%}\ ![](figs/bginitial.png){#fig:inimodel width=same} (a) True model and (b) Initial model. The model is 2 km deep and 4.5 km wide. There is no special structure in the initial model. The prior model is the same as the initial model. ### Figure: MAP and standard deviation {#fig:MAPandSTD } ![](figs/MAP.png){#fig:MAPmodel width=90%}\ ![](figs/STD.png){#fig:std width=same} (a) MAP model and (b) Standard Deviation map. The MAP model is the model that maximize the posterior pdf. The standard deviation map shows the standard deviation of the velocity at each grid. Larger uncertainties are found at the deep areas, strong velocity contrast areas and boundaries. ### Figure: Confidence interval {#fig:Confidence .wide} ![](figs/CI1.png){#fig:CI1 width=33%} ![](figs/CI2.png){#fig:CI2 width=same} ![](figs/CI3.png){#fig:CI3 width=same} Confidence interval with confidence level ``\alpha = 90\%`` at different horizontal positions (a) x = 1490 m, (b) x = 2240m, and (c) x = 2990 m. Red line - true velocity, yellow line - inverted velocity, blue line - confidence interval. Wider confidence intervals are found at the deep areas and high velocity contrast areas. To illustrate the capability of the stochastic quasi-Newton McMC method to quantify the uncertainty of FWI with a considerable small cost, we consider the following experiment on a domain of 2 km by 4.5 km shown in Figure [#fig:truemodel]. The synthetic observed data ``\d_{\obs}`` is generated by solving the Helmholtz equation using an optimal 9-points finite difference method [@Jo1996FD] in 15 equally-spaced frequencies ranging from 3 Hz to 17 Hz. Both sources and receivers are located at depth ``z = 20`` m with 50 m and 10 m horizontal sampling interval, respectively. The source wavelet is a Ricker wavelet with central frequency of 12 Hz. For this numerical experiment, we set the noise covariance matrix to a diagonal matrix whose standard deviation is 0.1. We assume we do not have any special prior information, thus we set the prior model ``\overline{\v}_{\prior}`` to the initial model (c.f. Figure [#fig:inimodel]), which is obtained by smoothing the true model followed by a horizontal stack. The prior covariance matrix is set as a diagonal matrix with relative standard deviation of 0.1. In the uncertainty quantification procedure, we first invert the maximum a posterior (MAP) estimate of the posterior distribution. As a custom strategy of FWI, We invert the MAP estimate in consecutive (five) frequency bands and use the previous result as a warm start for the inversion of the next frequency band. We use the stochastic l-BFGS method [@TristanFWI2013] to obtain the MAP estimate ``\v_{\text{MAP}}`` plotted in Figure [#fig:MAPmodel]. Then we start from the ``\v_{\text{MAP}}`` and use the stochastic quasi-Newton McMC algorithm (Algorithm [#alg:2]) to sample the posterior distribution ``\rho_{\post}``. During each iteration, only 5 of all 91 shots are randomly selected to calculate the posterior pdf, gradient and l-BFGS Hessian. We quantify the uncertainties based on these samples and obtain the standard deviation of velocity at each grid (Figure [#fig:std]) and the confidence interval (Figure [#fig:Confidence]) at three different horizontal position (x = 1490 m, 2240 m, and 2990 m). ##Discussion The l-BFGS Hessian is used to construct the proposal distribution, so the computational cost in each iteration only involves the calculation of the posterior pdf and the gradient. Comparing with the random walk Metropolis method, the stochastic quasi-Newton McMC method only requires an additional computational cost for generating the gradient during each iteration, while it is able to construct a much better proposal distribution containing more information about the true posterior distribution to speed up the convergence of the Markov chain. Comparing with the original Newton type McMC method [@MartinMcMC2012], although the l-BFGS Hessian may be less accurate, the additional computational cost for estimating the low-rank approximation is eliminated, which results in a significant computational cost reduction. Moreover, without the requirement of the low-rank assumption of the misfit Gauss-Newton Hessian, this method will not suffer from the instability introduced by violating the low-rank assumption. As a result this method may be more suitable for seismic exploration. Moreover, since the randomized source sub-sampling method is used, in each iteration, only 5 randomly selected sources are used to calculate the approximated posterior pdf and gradient, and thus the computational cost is only 1/18 of that using all sources, effectively speeding up the computation by a factor of 18. In addition, the total computational cost for each iteration is considerably less than that of only evaluating the posterior pdf with all sources, which means the stochastic quasi-Newton McMC is even cheaper than the random walk Metropolis method with all sources. Due to the computational capability, we only generate 1000 samples and accept 289 samples, which results in the noise in the standard deviation map (Figure [#fig:std]). Comparing results of the MAP estimate (Figure [#fig:MAPmodel]) and standard deviation (Figure [#fig:std]), we observe that the velocity of deep areas, strong contrast areas and boundaries has larger uncertainties than that of shallow areas, weak contrast areas and central areas, as expected. A similar observation can be drawn for confidence interval (Figure [#fig:Confidence]). Here the probability is more concentrated at shallow areas and low velocity contrast areas than deep areas and strong velocity contrast areas. Finally, the artifacts associated with the poor prior model still can be observed in the MAP estimate, which implies us the importance of selecting a good prior model. For the situation that no special prior information exists in advance, one alternative solution may be to update the prior model after the inversion of each frequency band. \vspace{-1em} ## Conclusion In this work, we formulate the uncertainty quantification problem of FWI in the Bayesian inference framework, and propose a stochastic quasi-Newton McMC method to solve this problem. Using the l-BFGS Hessian as the inverse of the covariance matrix, we release the low-rank assumption and reduce the computational cost for constructing the proposal distribution of the Metropolis-Hasting method. Additionally, we use the randomized source sub-sampling strategy to speed up the evaluation of the posterior pdf, gradient and Hessian. As a result, we obtain a fast McMC method to quantify the uncertainty of FWI. While initial results are encouraging, several challenges remain such as the credence of the l-BFGS as a good approximation of the true Hessian. \vspace{-1em} ## Acknowledgements \vspace{-1em} This work was financially supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN 261641-06) and the Collaborative Research and Development Grant DNOISE II (CDRP J 375142-08). This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BGP, BP, Chevron, ConocoPhillips, CGG, ION GXT, Petrobras, PGS, Statoil, Total SA, WesternGeco, Woodside. We also thank BG Group for providing the synthetic model we used in our example. ## References ```math_def \newcommand{\post}{\text{post}} \newcommand{\prior}{\text{prior}} \newcommand{\noise}{\text{noise}} \newcommand{\like}{\text{like}} \renewcommand{\v}{\mathbf{v}} \newcommand{\obs}{\text{obs}} \renewcommand{\d}{\mathbf{d}} \renewcommand{\H}{\mathbf{H}} \newcommand{\g}{\mathbf{g}} \newcommand{\y}{\mathbf{y}} \newcommand{\rhopost}{\rho_{\text{post}}} ```