---
title: Deep Bayesian inference for seismic imaging with tasks
runninghead: Inference for imaging with tasks
author: |
    Ali Siahkoohi^1^, Gabrio Rizzuti^2^, and Felix J. Herrmann^1^ \
    ^1^ School of Computational Science and Engineering, Georgia Institute of Technology \
    ^2^ Georgia Institute of Technology; Now at Utrecht University \
    Email: `alisk@gatech.edu, g.rizzuti@uu.nl, felix.herrmann@gatech.edu`
bibliography:
    - paper.bib

---

## Abstract:

We propose to use techniques from Bayesian inference and deep neural
networks to translate uncertainty in seismic imaging to uncertainty in
tasks performed on the image, such as horizon tracking. Seismic imaging
is an ill-posed inverse problem because of bandwidth and aperture
limitations, which is hampered by the presence of noise and
linearization errors. Many regularization methods, such as
transform-domain sparsity promotion, have been designed to deal with the
adverse effects of these errors, however, these methods run the risk of
biasing the solution and do not provide information on uncertainty in
the image space and how this uncertainty impacts certain tasks on the
image. A systematic approach is proposed to translate uncertainty due to
noise in the data to confidence intervals of automatically tracked
horizons in the image. The uncertainty is characterized by a
convolutional neural network (CNN) and to assess these uncertainties,
samples are drawn from the posterior distribution of the CNN weights,
used to parameterize the image. Compared to traditional priors, it is
argued  in the literature that these CNNs introduce a flexible inductive
bias that is a surprisingly good fit for a diverse set of problems,
including medical imaging, compressive sensing, and diffraction
tomography. The method of stochastic gradient Langevin dynamics is
employed to sample from the posterior distribution. This method is
designed to handle large scale Bayesian inference problems with
computationally expensive forward operators as in seismic imaging. Aside
from offering a robust alternative to maximum a posteriori estimate that
is prone to overfitting, access to these samples allow us to translate
uncertainty in the image, due to noise in the data, to uncertainty on
the tracked horizons. For instance, it admits estimates for the
pointwise standard deviation on the image and for confidence intervals
on its automatically tracked horizons.

## Introduction

Due to the presence of shadow zones, coherent linearization errors, and
noisy finite-aperture measured data, seismic imaging involves an
ill-conditioned linear inverse problem [@lambare1992iterative;
@schuster1993least; @nemeth1999least]. Relying on a single estimate for
the model may be subject to data overfit [@malinverno2004expanded] and
negatively impacts the quality of the obtained seismic image and tasks
performed on it. Casting the seismic imaging problem into a
probabilistic framework allows for a more comprehensive description of
its solution space [@tarantola2005inverse]. The "solution" of the
inverse problem is then a probability distribution over the model space
and is commonly referred to as the posterior distribution.

Aside from the computational challenges associated with uncertainty
quantification (UQ) in geophysical inverse problems
[@malinverno2004expanded; @tarantola2005inverse; @malinverno2006two;
@MartinMcMC2012; @chevron2017; @fang2018uqfip; @stuart2019two;
@zhao2019gradient; @kotsi2020], the choice of prior distributions in
Bayesian frameworks is crucial. Recent attempts mostly rely on
handcrafted priors, i.e., priors chosen solely based on their simplicity
and applicability. For example, restricting feasible solutions to
layered media with specific orientations
[@doi:10.1046/j.1365-246X.2002.01847.x; @zhu2018seismic;
@visser2019bayesian], or satisfying regularity conditions related to
model parameters or derivatives thereof [@malinverno2004expanded;
@malinverno2006two; @vanLeeuwen2010IJGswi; @herrmann11GPelsqIm;
@MartinMcMC2012;  @lu2015l1; @tu2014fis; @zhu2016bayesian; @Ely2018;
@fang2018uqfip; @stuart2019two; @zhao2019gradient; @izzatullah2020;
@kotsi2020]. While effective in controlled settings, handcrafted priors
might introduce unwanted bias to the solution. Recent deep-learning
based approaches [@mosser2018stochastic; @zhang2019regularized;
@fang2020deep; @Zhilongsegam2020-3428013.1; @liu2020deep;
@Mossereage2020; @sun2020extrapolated; @Youzuo2020; @kazei2021mapping;
@kothari2021trumpets; @kumar2021SEGeuq; @siahkoohi2021Seglbe], on the
other hand, learn a prior distribution from available data[^1]. While
certainly providing a better description of the available prior
information when compared to generic handcrafted priors, they may affect
the outcome of Bayesian inference more seriously when
out-of-distribution data is considered, e.g., when the training data is
not fully representative of a given scenario. Unfortunately, unlike
deep-learning based inversion approaches in other imaging modalities,
e.g., medical imaging [@adler2018deep; @welling2019; @pmlr-v119-asim20a;
@hauptmann2020deep; @sriram2020end; @mukherjee2021end], we generally do
not have access to high-fidelity information about the Earth's
subsurface. This, together with the Earth's strong heterogeneity across
geological scenarios, might limit the scope of data-driven approaches
that heavily rely on pretraining [@siahkoohi2019transfer;
@kaur2020improving; @Willett2020; @Youzuo2020_physics_consistent;
@sun2020elastic; @zhang2020SEGtli; @barbano2021unsupervised;
@siahkoohi2020ABIpto; @qu2021training; @vrolijk2021source].

[^1]: We use the word "data" interchangeably. In the context of inverse
    problems "data" refers to observed data. In the context of machine
    learning data-driven priors refer to priors derived from available
    samples from an unknown distribution. These are also commonly
    referred to as "data". The meaning of the word "data" will be clear
    from the context.

In this work, we take advantage of a novel prior recently deployed in
computer vision and geophysics [@Lempitsky; @arridge_oktem_2019;
@Cheng_2019_CVPR; @gadelha2019shape; @liu2019deep; @wu2019parametric;
@dittmer2020regularization; @Fantong2020; @shi2020deep;
@siahkoohi2020EAGEdlb; @siahkoohi2020SEGuqi; @siahkoohi2020SEGwdp;
@tolle2021mean], known as the deep prior, which utilizes the inductive
bias [@mitchell1980need] of untrained convolutional neural networks
(CNNs) as a prior. This approach is tantamount to restricting feasible
models to the range of an untrained CNN with a fixed input and randomly
initialized weights. Via this reparameterization the weights of the CNN
become the new unknowns in seismic imaging and this change of variable
leads to a "prior" on the image space that excludes noisy artifacts, as
long as overfitting is prevented [@Lempitsky]. This has the potential
benefit of being less restrictive than handcrafted priors while not
needing training data as approaches based on using pretrained networks
[@Lempitsky]. To formally cast the deep prior into a Bayesian framework,
we impose a Gaussian distribution on the CNN weights, which is a common
regularization strategy in training deep CNNs [@krogh1992simple;
@Goodfellow-et-al-2016]. To perform uncertainty quantification for
seismic imaging, we sample from the posterior distribution of the CNN
weights by running preconditioned stochastic gradient Langevin dynamics
[SGLD, @welling2011bayesian; @li2016preconditioned], a gradient-based
Markov chain Monte Carlo (MCMC) sampling method developed for Bayesian
inference of deep CNNs with large training datasets.

A crucial objective of our study is translating the uncertainty in
seismic imaging to uncertainty in downstream tasks such as horizon
tracking, semantic segmentation, and tracking CO``_2`` plumes in carbon
capture and sequestration projects. Horizon tracking, which this papers
focuses on, is a task performed after imaging that leads to a
stratigraphic model. Horizon trackers use well data and seismic images
to delineate stratigraphy and are typically sensitive to structural and
stratigraphic unconformities. In these challenging areas, the horizons
do not continuously extend spatially, e.g., may be discontinuous due to
vertical displacement via faults, hence tracking horizons across
unconformities may not be trivial. Failure to include uncertainty on
tracked horizons have major implications on the identification of risk.
Since the accuracy of horizon tracking is directly linked to the quality
of the seismic image, we systematically incorporate uncertainties of
seismic imaging into horizon tracking. We achieve this by feeding
samples from the imaging posterior to an automatic horizon tracker
[@wu2018least] and obtain an ensemble of likely horizons in the image.
Compared to conventional imaging and manual tracking of horizons, our
approach allows us to rigorously quantify uncertainty in the location of
the horizons due to noise in shot records and modeling errors, e.g.,
linearization errors. Our probabilistic framework also admits
nondeterministic horizon trackers, e.g., uncertain control points or
multiple human interpreters. There are parallels between the
probabilistic framework we propose for quantifying uncertainty in
downstream tasks and the interrogation theory [@interrogation]\. The
purpose of this theory is to answer questions about an unknown quantity
by designing experiments (inverse problems) that facilitate answering
the question. The probabilistic framework we developed can be described
as an application of interrogation theory in that the seismic survey and
shot records are provided with no need to design further experiments,
and the question involves quantifying uncertainty in horizon tracking.
Our probabilistic framework differs fundamentally from other recently
developed automatic seismic horizon trackers based on machine learning
[see e.g., @peters2019; @geng2020deep; @peters2020eage;
@shi2020waveform] because horizon uncertainty is ultimately driven by
data (through the intermediate imaging distribution), and not from label
(control point) uncertainty alone.

In the following sections, we first mathematically formulate deep-prior
based seismic imaging, by introducing the likelihood function and the
deep prior approach. Next, we describe our proposed SGLD-based sampling
approach and its challenges. Subsequently, we introduce a framework to
tie uncertainties in imaging to uncertainties in horizon tracking, which
allows for deterministic and nondeterministic horizon tracking. We
present two realistic examples derived from real seismic image volumes
obtained in different geological settings. These numerical experiments
are designed to showcase the ability of the proposed deep-prior based
approach to produce seismic images with limited artifacts. We conclude
by demonstrating our probabilistic horizon tracking approach, which
includes estimates for confidence intervals associated with the imaged
horizons in the two aforementioned examples.

## Theory

The goal of this paper is to understand how errors in the data due to
noise and linearization assumptions affect the uncertainty of seismic
images and typical tasks carried out on these images. We begin with
an introduction of the linearized forward model, which forms the basis
of seismic imaging via reverse-time migration and discuss Bayesian
imaging with regularization via so-called deep priors.

### Seismic imaging

In its simplest acoustic form, reverse-time migration follows directly
from linearizing the acoustic wave equation around a known, slowly
varying background model---i.e., the spatial distribution of the squared
slowness. Traditionally, the process of seismic imaging is concerned
with estimating the short-wavelength components of the squared-slowness,
denoted by ``\delta \B{m}``, from ``{n_s}`` processed shot records
collected in the vector ``\B{d} = \left \{\B{d}_{i}\right
\}_{i=1}^{n_s}``. In most cases, these indirect measurements are
recorded along the surface or ocean bottom with sources ``\left
\{\B{q}_{i}\right \}_{i=1}^{n_s}`` located at or near the surface. The
placement of the sources and receiver near the surface leads to more
uncertainty in the deeper areas of the image.

The unknown ground truth perturbation model ``\delta \B{m}^{\ast}`` is
linearly related to the data via

```math #linear-fwd-op
\B{d}_i = \B{J}(\B{m}_0, \B{q}_i)
    \delta \B{m}^{\ast} + \boldsymbol{\epsilon}_i, \quad
    \boldsymbol{\epsilon}_i \sim \mathrm{N} (\B{0}, \sigma^2
    \B{I}),
```

where ``\B{J}(\B{m}_0, \B{q}_i)`` corresponds to the linearized Born
scattering operator for the ``i\text{th}`` source and the background
squared slowness model ``\B{m}_0``. Because of possible errors in the
processed data, the presence of noise, and linearization errors, the
above expression contains the noise term ``\boldsymbol{\epsilon}_i``.
While other choices can be made, we assume this noise to be distributed
according to a zero-centered Gaussian distribution with known covariance
matrix ``\sigma^2 \mathbf{I}``\. For small ``\sigma`` and a
kinematically correct background model ``\B{m}_0``, the above linear
relationship can be inverted by minimizing

```math #imaging-opt
\min_{\delta \B{m}} \sum_{i=1}^{n_s} \big \|
    \B{d}_i- \B{J}(\B{m}_0, \B{q}_i)
    \delta \B{m} \big \|_2^2.
```

While this approach is in principle capable of producing high-fidelity
true-amplitude images [@valenciano2008imaging; @dong2012least;
@zeng2014least], the noise term is in practice never negligible and may
adversely affect the image quality [@nemeth1999least] especially in
situations where the source spectrum is narrow band and the aperture
limited. Therefore not only the problem in equation #imaging-opt
requires regularization but also calls for a statistical inference
framework that allows us to draw conclusions in the presence of
uncertainty.

### Probabilistic imaging with Bayesian inference

To account for uncertainties in the image induced by the random noise
term ``\boldsymbol{\epsilon}_i`` in equation #linear-fwd-op\, we follow
the seminal work of @tarantola2005inverse and cast our noisy imaging as
a Bayesian inverse problem. Instead of calculating a single image by
solving equation #imaging-opt\, we assign probabilities to a family of
images that fit the observed data to various degrees. This distribution
is known as the posterior distribution. In this Bayesian framework, the
solution to the inverse problem, i.e., the image, and the noise in the
observed data are considered random variables. According to Bayes' rule,
the conditional posterior distribution, denoted by ``p_{\text{post}}``,
states that

```math #bayes
p_{\text{post}} \left (\delta \B{m} \mid \B{d} \right)
    \propto  p_{\text{like}} (\B{d} \mid \delta
    \B{m})\, p_{\text{prior}} (\delta \B{m}).
```

In this expression, ``p_{\text{like}} `` is the likelihood function,
which is related to the probability density function (PDF) of the noise,
and ``p_{\text{prior}}`` is the prior PDF of the image, which encodes
prior beliefs on the unknown perturbations ``\delta\B{m}``. This prior
distribution assigns probabilities to all potential seismic images
before incorporating the data via the likelihood. The constant of
proportionality in equation #bayes corresponds to the PDF of the
observed data, which is independent of ``\delta \B{m}``. Based on the
distribution of the noise, the likelihood term measures how well the
forward modeled data (equation #linear-fwd-op) and observed data
agree.

As stated by Bayes' rule, the posterior PDF of ``\delta \B{m}``, denoted
by ``p_{\text{post}} \left (\delta \B{m} \mid \B{d} \right)``, is
proportional to the product of the likelihood and the prior PDF, given
observed data.  The log-likelihood function takes the following form:

```math #imaging-likelihood
 - \log p_{\text{like}} \left ( \B{d}
    \mid \delta \B{m} \right ) & = -\sum_{i=1}^{n_s} \log p_{\text{like}}
    \left ( \B{d}_{i}\mid \delta \B{m} \right ) \\
& = \frac{1}{2 \sigma^2} \sum_{i=1}^{n_s}\big  \| \B{d}_i-
    \B{J}(\B{m}_0, \B{q}_i) \delta \B{m} \big \|_2^2 +
    \text{const},
```
where the constant term is independent of ``\delta \B{m}``. For the
uncorrelated Gaussian noise assumption,  this negative log-likelihood
function equals the squared ``\ell_2``-norm of the residual scaled by
the noise variance ``\sigma^2``.

Aside from depending on the residual, i.e., the difference between
observed and modeled data, for each shot record, the choice of the prior
influences the posterior distribution. Before the advent of data-driven
methods involving generative neural networks, the definitions of priors
were mostly handcrafted and often based on somewhat ad hoc Gaussian or
Laplacian distributions in the physical or in some transformed domain
[@vanLeeuwen2010IJGswi; @herrmann11GPelsqIm; @lu2015l1; @tu2014fis].
While these approaches have proven to be useful and are theoretically
well understood [@donoho2006compressed]\, there is always a risk of a
biased outcome something we would like to avoid. On the other hand,
using pretrained generative networks as priors has proven to be
effective [@pmlr-v70-bora17a; @mosser2018stochastic; @pmlr-v119-asim20a;
@fang2020deep; @siahkoohi2020ABIpto]\. However, their success hinges on
the quality of pretraining and having access to a fully representative
training data that accurately captures the prior distribution. Since we
are dealing with highly complex heterogeneity of the Earth subsurface to
which we have limited access, we will stay away from data-driven methods
to train a neural network to act as a prior.

### Deep priors

Following recent work by @Lempitsky\, we avoid using the need to have
access to realizations of true perturbations by using untrained
generative CNNs as priors. We fix a random input latent variable and use
a randomly initialized [@pmlrv9glorot10a] CNN with a special
architecture [@Lempitsky] to reparameterize the unknown perturbations
``\delta\B{m}`` in terms of CNN weights. Given the shot data, we
minimize the data misfit with respect to the CNN weights on which we
impose a Gaussian prior. The CNN's architecture (see details in Appendix
A) and the Gaussian prior imposed on its weights act as a regularization
in the image space that avoids representing incoherent noisy artifacts,
as long as overfitting is prevented [@Lempitsky].

To be more specific, let ``{g} (\B{z}, \B{w})\in\R^{N}``, with the ``N``
the number of gridpoints in the image, denote a untrained, specially
designed, CNN [@Lempitsky] with fixed input ``\B{z} \sim \mathrm{N}(
\B{0}, \B{I})`` with the same size as the image and unknown weights
``\B{w}\in\R^M`` with ``M\gg N``. Restricting the unknown perturbation
model to the output of the CNN, i.e., ``\delta \B{m} = {g} (\B{z},
\B{w})``, corresponds to a nonlinear representation for the image and
the following expression for the likelihood function:

```math #deep-prior-likelihood
 - \log p_{\text{like}} \left ( \B{d} \mid \B{w} \right )
&= -\sum_{i=1}^{n_s} \log p_{\text{like}}
    \left ( \B{d}_{i}\mid \B{w} \right ) \\
& =  \frac{1}{2 \sigma^2} \sum_{i=1}^{n_s} \big  \| \B{d}_i-
    \B{J}(\B{m}_0, \B{q}_i) g(\B{z},
    \B{w}) \big \|_2^2 + \text{const},
```

with the constant term independent of ``\B{w}``. In essence, deep priors
correspond to a nonlinear "change of variables" where the unknowns are
the CNN weights and the image is constrained to the range of the CNN
output for a fixed random input. Compared to data-driven methods, no
training samples are needed. While the nonlinearity makes it more
difficult to minimize the likelihood term (the likelihood in equation
#imaging-likelihood\), a zero-centered Gaussian prior for the weights
with covariance ``\lambda^{-2}\B{I}`` suffices thanks to the
overparameterization of the CNN (``M\gg N``). With this Gaussian prior
on the weights, the posterior distribution for the weights given the
data reads

```math #deep-prior
p_{\text{post}} \left ( \B{w} \mid \B{d} \right )
\propto \left [ \prod_{i=1}^{{n_s}} p_{\text{like}} \left (
    \B{d}_{i} \mid \B{w} \right ) \right ]
    \mathrm{N} \big (\B{w} \mid \B{0}, \lambda^{-2}\B{I} \big )
```
where ``\mathrm{N} \big (\B{w} \mid \B{0}, \lambda^{-2}\B{I} \big )``
stands for the PDF of the zero-centered Gaussian prior. Given equation
#deep-prior-likelihood\, the negative log-posterior distribution becomes

```math #imaging-obj
 - \log p_{\text{post}} \left ( \B{w}
    \mid \B{d} \right )
& = - \left [ \sum_{i=1}^{{n_s}} \log p_{\text{like}} \left (
    \B{d}_{i}\mid\B{w} \right ) \right ] -
    \log \mathrm{N} \big (\B{w} \mid \B{0}, \lambda^{-2}\B{I} \big ) + \text{const} \\
& = \frac{1}{2 \sigma^2} \sum_{i=1}^{n_s}\big  \| \B{d}_i-
    \B{J}(\B{m}_0, \B{q}_i) {g} (\B{z},
    \B{w}) \big \|_2^2 + \frac{\lambda^2}{2}\big  \| \B{w} \big \|_2^2
    + \text{const}.
```

Compared to conventional formulations of Bayesian inference, knowledge
of the deep prior resides both in the likelihood term, through the
reparameterization of the image as the output of a CNN, and in the
traditional ``\lambda`` weighted ``\ell_2``-norm squared term.  This is
different from the traditional Bayesian settings where prior information
resides exclusively in the prior term. @Cheng_2019_CVPR provided a
theoretical Bayesian perspective on deep priors, describing them as
Gaussian process priors in classical Bayesian terms. Specifically,
@Cheng_2019_CVPR showed that for infinitely wide CNNs, i.e. CNNs with a
large number of channels, the inductive bias of the CNN architecture and
the Gaussian prior on its weights are equivalent to a stationary
Gaussian process prior in the image space. @Cheng_2019_CVPR also
explicitly made a connection between the kernel of this Gaussian process
and the architecture of a CNN, by characterizing the effects of
convolutions, non-linearities, up-sampling, down-sampling, and skip
connections, which provides insights on selecting an appropriate CNN
architecture. Independently, @dittmer2020regularization argue that the
weak form of our constrained formulation with deep priors yield the same
solutions for the correct Lagrange multiplier. This means there is a
direct connection between our formulation and unconstrained variational
approaches. The latter permit a straightforward Bayesian interpretation.

Aside from choosing the right CNN architecture [@Lempitsky;
@dittmer2020regularization], random initialization of its weight
[@pmlrv9glorot10a] and fixed input for the latent variable, the above
posterior depends on selecting a value for the tradeoff parameter
``\lambda>0``, which weighs the importance of the Gaussian prior against
the noise-variance weighted data misfit term in the likelihood function.
In the sections below, we will comment how to choose the value for
``\lambda``.

The above expression for the posterior in equation #imaging-obj forms
the basis of our proposed probabilistic imaging scheme based on Bayesian
inference. Before discussing how to sample from this distribution, we
first briefly describe how to extract various statistical properties
from this posterior distribution on the image. Specifically, we will
review how to obtain point estimates [@casella2021statistical],
including maximum likelihood estimate (MLE) , maximum a posteriori
estimate (MAP) and estimates for the mean and pointwise standard
deviation, and ``99\%`` confidence intervals.


### Estimation with Bayesian inference

Based on the expressions for the negative log-likelihood (equation
#deep-prior-likelihood) and posterior (equation #imaging-obj), we derive
expressions for different point and interval estimates
[@arridge_oktem_2019].

#### Maximum likelihood estimation

To establish a baseline for image estimates obtained without
regularization, we first consider point estimates for the image that
correspond to finding an image that best fits the observed data. Since
this estimate is obtained by maximizing the likelihood function with
respect to the unknown image, ``\delta \B{m}``, this estimate is known
as the MLE. The corresponding optimization problem can be written as

```math #MLE
\delta \B{m}_{\text{MLE}} & = \mathop{\rm arg\,min}_{\delta \B{m}}
    - \log p_{\text{like}}
    \left ( \B{d} \mid \delta \B{m} \right ) \\
& = \mathop{\rm arg\,min}_{\delta \B{m}}  \frac{1}{2 \sigma^2}
    \sum_{i=1}^{n_s} \big \| \B{d}_i- \B{J}(\B{m}_0, \B{q}_i) \delta \B{m} \big \|_2^2,
```

where the last equality follows from equation #imaging-likelihood\.
Observe that the MLE corresponds to the deterministic least-squares
solution, yielded by equation #imaging-opt\. Unfortunately, MLE images
are prone to overfitting [@casella2021statistical; @aster2018parameter]
that results in imaging artifacts [@nemeth1999least].

#### Maximum a posteriori estimation

Adding regularization to inverse problems, including seismic imaging, is
known to limit overfitting and is capable of filling in at least part of
the null space of the modeling operator. In case of regularization with
deep priors, this corresponds to finding the image that maximizes the
posterior distribution, i.e., we have

```math #MAP-w
\B{w}_{\text{MAP}} & = \mathop{\rm arg\,max}_{\B{w}}
    p_{\text{post}} \left ( \B{w} \mid \B{d} \right ) \\
& = \mathop{\rm arg\,min}_{\B{w}} \frac{1}{2 \sigma^2} \sum_{i=1}^{n_s}\big
    \| \B{d}_i- \B{J}(\B{m}_0, \B{q}_i) {g} (\B{z},
    \B{w}) \big \|_2^2 + \frac{\lambda^2}{2}\big  \| \B{w} \big \|_2^2.
```
This estimation for the weights ``\B{w}`` is known as the MAP estimate.
Given this estimate  ``\B{w}_{\text{MAP}}``, the corresponding estimate
for the image is obtained via

```math #MAP
\delta \B{m}_{\text{MAP}} = g (\B{z}, \B{w}_{\text{MAP}} ).
```

When compared with MAP estimates computed from traditional Bayesian
formulations of linear inverse problems, the estimate in equation #MAP-w
has several important differences. The above estimate depends on the
random initializations of the weights, ``\B{w}`` and latent variable
``\B{z}``, which is due to the nonlinearity introduced by the
reparameterization. This renders the above minimization non-convex,
i.e., its local minimum is no longer guaranteed to coincide with the
global minimum. While the objective is non-convex
[@NEURIPS2018_a41b3bb3], as a result of deep prior reparameterization,
because ``M\gg N``, first-order stochastic optimization methods
[@rmsprop; @kingma2014adam; @fromage2020] are able to minimize the
objective function in equation #MAP-w to small values of the residual
[@du2018gradient; @pmlr-v97-kunin19a]. Several other challenges include
increased number of iterations, establishment of a stopping criterion
when maximizing equation #MAP-w to prevent overfitting, and the
quantification of uncertainty. Despite these challenge, we argue that
invoking the deep prior outweighs the challenges since it offers a
better bias-variance trade-off and requires knowledge of only a single
hyperparameter. In addition, we refer to @siahkoohi2020SEGwdp for an
alternative formulation, which reduces the number of iterations and
therefore the number of evaluations of the computationally expensive
forward modeling operator.

#### Conditional mean estimation

So far, the MLE and MAP estimates involved a deterministic (at least for
fixed initialization of the network and latent variable) procedure
maximizing the likelihood or posterior. Since we have access to the
unnormalized posterior PDF, ``p_{\text{post}} \left (\delta \B{w} \mid
\B{d} \right)`` (equation #imaging-obj\), we have in principle ways
to retrieve information on the statistical moments of the posterior
distribution of the unknown perturbation including its mean and
pointwise standard deviation. However, contrary to the two estimates
discussed so far these point estimates can typically only be
approximated with samples drawn from the posterior.

We obtain access to samples from the posterior, ``p_{\text{post}} \left
(\delta \B{m} \mid \B{d} \right)``  via a "push-forward" of samples from
``p_{\text{post}}  \left ( \B{w} \mid \B{d} \right )`` based on the
deterministic map ``\delta \B{m} = {g} (\B{z}, \B{w})`` for fixed
``\B{z}`` [@bogachev2006measure]. As a result, for any sample of the
weights, ``\B{w}``, drawn from ``p_{\text{post}} \left ( \B{w} \mid
\B{d} \right )``, we have

```math #push-forward
g(\B{z}, \B{w}) \sim p_{\text{post}}
    \left (\delta \B{m} \mid \B{d} \right).
```
Assuming access to ``n_w`` samples from the posterior, ``p_{\text{post}}
\left ( \B{w} \mid \B{d} \right )``, the first moment, also known as the
conditional mean, can be approximated from these samples, ``\left \{
\B{w}_j \right \}_{j=1}^{n_{\mathrm{w}}} \sim p_{\text{post}} ( \B{w}
\mid \B{d} )``, via

```math #conditionalmean
\delta \B{m}_{\text{CM}}
& = \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m}
    \mid \B{d} \right)} \big [ \delta \B{m} \big ]\\
&= \mathbb{E}_{\B{w} \sim p_{\text{post}}
    \left (\delta \B{w} \mid \B{d} \right)} \big [ g( \B{z}, \B{w}) \big ] \\
& = \int p_{\text{post}} ( \B{w} \mid
    \B{d} ) g( \B{z}, \B{w}) \mathrm{d} \B{w}\\
&\approx \frac{1}{n_{\mathrm{w}}} \sum_{j=1}^{n_{\mathrm{w}}} g(
    \B{z}, \B{w}_j).
```

We describe the important step of obtaining these samples from the
posterior below.

Compared to the MAP estimate, the conditional mean, which corresponds to
the minimum-variance estimate [@anderson1979], is less prone to
overfitting [@mackay2003information]. This was confirmed empirically for
seismic imaging [@siahkoohi2020EAGEdlb; @siahkoohi2020SEGuqi]. In the
experimental sections below, we will provide further evidence of
advantages the conditional mean offers compared to MAP estimation.

#### Point-wise standard deviation estimation

In its most rudimentary form, uncertainties in the imaging step can be
assessed by computing the pointwise standard deviation, which expresses
the spread among the different unknown models explaining the observed
data. Given samples from the posterior, this quantity can be computed
via

```math #pointwise-std
\boldsymbol{\sigma}^2_{\text{post}} & = \mathbb{E}_{\delta \B{m}
    \sim p_{\text{post}} \left (\delta \B{m} \mid \B{d} \right) } \big [ (
    \delta \B{m} - \delta \B{m}_{\text{CM}})
    \odot (\delta \B{m} - \delta \B{m}_{
    \text{CM}}) \big ] \\
& \approx \frac{1}{n_{\mathrm{w}}}\sum_{j=1}^{n_{\mathrm{w}}} \big (g(
    \B{z}, \B{w}_j) - \delta
    \B{m}_{\text{CM}} \big ) \odot \big ( g( \B{z},
    \B{w}_j) - \delta \B{m}_{\text{CM}} \big ).
```

In this expression, ``\boldsymbol{\sigma}_{\text{post}}`` is the
estimated pointwise standard deviation and ``\odot`` represents
elementwise multiplication. Again, the expectations approximated in
equations #conditionalmean and #pointwise-std require samples from the
posterior distribution, ``p_{\text{post}} \left (\delta \B{m} \mid \B{d}
\right)``.

#### Confidence intervals

As described above, the pointwise standard deviation is a quantity that
summarizes the spread among the likely estimates of the unknown. Using
this quantity, we can put error bars on the unknown in which case we
assign probabilities (confidence) to the unknowns being in a certain
interval. The interval is obtained by treating the pointwise posterior
distribution as a Gaussian distribution, where the mean and standard
deviation at each points are equal to the value of the conditional mean
estimate and pointwise standard deviation at that point, respectively.
Given a desired confidence value, e.g., ``99\%``\, sample mean
``\boldsymbol{\mu}``, and sample variance ``\boldsymbol{\sigma}^2``, the
confidence interval is ``\boldsymbol{\mu} \pm 2.576\,
\boldsymbol{\sigma}`` where ``99\%`` of samples fall between the left
(``\boldsymbol{\mu} - 2.576\, \boldsymbol{\sigma}``\) and right
(``\boldsymbol{\mu} + 2.576\, \boldsymbol{\sigma}``) tails of the
Gaussian distribution [@friedman2001elements].

## Sampling from the posterior distribution

Extracting statistical information from the posterior distribution, such
as the point and interval estimates introduced in the previous section,
typically requires access to samples from the posterior distribution. In
the following section, we first show that approximations to the point
and interval estimates are instances of Monte Carlo integration, given
samples from the posterior distribution. Next, we shift our attention to
constructive techniques to draw these samples efficiently by introducing
preconditioning and crucial strategies to select the stepsize. Finally,
we describe an empirical verification of convergence of the Markov
chains that we will use to verify our sampling approach.

### Monte Carlo sampling

For most applications the posterior PDF is not directly of interest, but
we need to evaluate expectations involving the posterior distribution
instead. Given samples from the posterior,``\left \{ \B{w}_j \right
\}_{j=1}^{n_{\mathrm{w}}} \sim p_{\text{post}} ( \B{w} \mid \B{d} )``,
these expectations with respect to arbitrary functions can be
approximated by

```math #monte-carlo
\mathbb{E}_{\B{w} \sim p_{\text{post}} \left (\delta \B{w} \mid \B{d} \right)}
    \big [ f (\B{w} ) \big ] \approx \frac{1}{n_w}
    \sum_{j=1}^{n_w} f (\B{w}_{j} ).
```

Below we describe our proposed MCMC approach for obtaining samples from
the posterior.

### Sampling via stochastic gradient Langevin dynamics

Drawing samples from posterior distributions associated with imaging
problems of high dimensionality (``M,\, N`` large) and expensive forward
operators (e.g., demigration operators) is challenging
[@welling2011bayesian; @MartinMcMC2012]. Among the different approaches,
MCMC is a well-studied technique capable of drawing samples via a
sequential random-walk procedure. This process requires evaluation of
the posterior PDF at each step. The need for repeated evaluations of the
forward operator, the correlation between consecutive samples
[@gelman2013bayesian]\, and the high dimensionality of the problem are
the chief computational challenges for these methods. Despite these
difficulties, MCMC methods have been applied successfully in imaging
problems [@curtis2001prior; @fang2018uqfip; @herrmann2019NIPSliwcuc;
@zhao2019gradient; @kotsi2020; @siahkoohi2020EAGEdlb;
@siahkoohi2020SEGuqi].

Aside from problems related to the required length of the Markov Chains,
computing the misfit over all ``n_s`` sources in the likelihood term of
the posterior PDF (equation #imaging-obj\) is problematic since this
calls for many evaluations of the linearized Born scattering operator.
To address this issue, we use techniques from stochastic optimization
[@robbins1951stochastic; @nemirovski2009robust; @li2018plane] where the
gradients are evaluated for a single randomly selected source (without
replacement) at each iteration. For first-order methods, this technique
is known as stochastic gradient descent [SGD; @robbins1951stochastic]
and widely used in the machine learning and wave-based inversion
communities [@vanLeeuwen2010IJGswi; @haber2012effective; @herrmann2012;
@rmsprop; @kingma2014adam; @lu2015; @tu2015; @li2018].

While SGD bring down the computational costs, it is a stochastic
optimization algorithm for finding the mode of the posterior
distribution and it does not provide samples from the posterior
distribution. In order to do that, we have to add a carefully calibrated
noise term to the gradients. This additional noise term induces a random
walk from which samples from the posterior distribution can be drawn
under certain conditions [@welling2011bayesian]. Adding this noise term
also avoids converge of the iterations to the MAP estimate
[@welling2011bayesian]. In this paper, we adapt an approach known as
stochastic gradient Langevin dynamics [SGLD, @welling2011bayesian],
which is designed to reduce the number of necessary individual
likelihood evaluations at each iteration. SGLD was originally developed
for Bayesian inference on deep neural networks trained on large-scale
datasets. Compared to the original formulation of Langevin dynamics
[@neal2011mcmc], SGLD works on randomly selected subsets of shot data,
which makes it computationally more efficient and achievable at least in
2D imaging problems. Asymptotically, SGLD provides accurate samples from
the target distribution [@welling2011bayesian; @pmlr-v32-satoa14;
@raginsky2017non; @NIPS2018_8048;@NEURIPS2020_b5b8c484]---in our case,
the imaging posterior distribution. It differs from variational
inference [@jordan1999introduction] in that no surrogate distribution is
formulated and matched to the distribution of interest.

Following the work of @welling2011bayesian\, SGLD iterations for the
negative log-posterior involve at iteration ``k`` the following update
for the network weights of the deep prior:

```math #sgld
& \B{w}_{k+1} = \B{w}_{k} - \frac{\alpha_k}{2} \B{M}_k
    \nabla_{\B{w}} \left ( \frac{n_s }{2 \sigma^2} \big \| \B{d}_i-
    \B{J}(\B{m}_0, \B{q}_i) {g} (\B{z},
    \B{w}_k) \big \|_2^2  + \frac{\lambda^2}{2} \big \| \B{w}_k \big \|_2^2
    \right ) + \boldsymbol{\eta}_k, \\
& \boldsymbol{\eta}_k \sim \mathrm{N}( \B{0}, \alpha_k \B{M}_k),
```

where the index, ``i\subset \{1, \ldots, n_s\}``, is chosen randomly
without replacement at each iteration. Once all the shots are drawn, we
start all over by redrawing indices, without replacement, from
``i\subset \{1, \ldots, n_s\}``. We repeat this process for ``K`` steps
(see Algorithm 1), where ``K`` can be arbitrarily large. To ensure and
speedup convergence, the stepsizes ``\alpha_k`` and the adaptive
preconditioning matrix ``\B{M}_k`` need to be chosen carefully. The
additional zero-mean Gaussian noise term ``\boldsymbol{\eta}_k `` with
covariance matrix ``\alpha_k \B{M}_k`` distinguishes between the update
rule in equation #sgld and SGD optimization algorithm. It was shown by
[@welling2011bayesian] that the above iterations sample from the
posterior after a warmup phase, i.e., a certain number of iterations of
equation #sgld\. During the warmup stage, these iterations behave
similarly to those of the SGD algorithm but at some point transition to
the proper sampling phase [@welling2011bayesian]. Below we will comment
when that transition is likely to occur.

#### Stepsize selection

Convergence of stochastic optimization methods such as SGD and SGLD
relies on carefully designed stepsize strategies. Compared to SGD, SGLD
has the additional complication of having to balance random errors due
to randomly selecting shot records and the deliberate random "errors"
induced by the additional Gaussian noise term, ``\boldsymbol{\eta}_k``.
On the one hand, the iterations in equation #sgld need to make
sufficient progress during the warmup phase so that the samples (``=``
iterations ``\B{w}_k``) become independent of the chain's
initialization, i.e., the weights ``\B{w}_0`` at the start. On the other
hand, after warmup the Gaussian noise term, ``\boldsymbol{\eta}_k`` will
start to dominate the energy of the error in the gradient caused by the
stochastic approximation to the likelihood function (equation
#deep-prior-likelihood\). This can be explained by the fact that the
variance of error due to the stochastic gradient approximation is
proportional to the square of the stepsize [@robbins1951stochastic],
whereas the additive noise term is drawn from a Gaussian distribution
whose variance is proportional to the stepsize. Consequently, for small
stepsizes, it is expected that the error in gradients will be dominated
by additive noise [@welling2011bayesian], which effectively turns
equation #sgld to Langevin dynamics [@neal2011mcmc]. As a result,
similar to SGD, convergence can only be guaranteed when the stepsize in
equation #sgld decreases to zero. However, this would increase the
number of iterations to fully explore the posterior probability space.
We avoid this situation and follow @welling2011bayesian who propose the
following sequence of stepsizes:

```math #stepsize
\alpha_{k}= a (b+k)^{-\gamma},
```

where ``\gamma = \frac{1}{3}`` is the decay rate chosen according to
@teh2016consistency. The constants ``a,\ b`` in this expression control
the initial and final value of the stepsize. Below, we will comment how
to chose these constants and how to ensure that potential posterior
sampling errors [@NIPS2018_8048] are avoided.

#### Preconditioning

In addition to selecting proper stepsizes, the converge of the
iterations in equation #sgld depends on how strongly the different
weights of the deep prior are coupled to the data. Without
preconditioning, i.e., ``\B{M}_k = \B{I}``, SGLD updates all parameters
with one and the same stepsize. This leads to slow convergence of the
insensitive weights that are weakly coupled to the data. To avoid this
situation, @li2016preconditioned proposed an adaptive diagonal
preconditioning matrix extending the RMSprop optimization algorithm
[@rmsprop]. This preconditioner is deigned to speed up the initial
warmup and subsequent sampling stage of the iterations in equation
#sgld\. To define this preconditioning matrix, let ``\delta \B{w}``
denote the gradient of the negative log-posterior density (equation
#imaging-obj\) at the current estimate of weights ``\B{w}_k``, i.e.,

```math #gradient
\delta\B{w} = \nabla_{\B{w}} \left ( \frac{n_s }{2 \sigma^2}
    \big \| \B{d}_i- \B{J}(\B{m}_0, \B{q}_i) {g} (\B{z}, \B{w}_k) \big \|_2^2
    + \frac{\lambda^2}{2} \big \| \B{w}_k \big \|_2^2 \right ).
```

Given these gradients, define the following running pointwise sum on the
pointwise square of the gradients

```math #ruunning-grad
\B{v}_{k+1} =\beta \B{v}_{k} + \left (1 - \beta \right ) \delta\B{w} \odot \delta\B{w}
```
where the parameter ``\beta`` controls the relative importance of the
elementwise square of the gradient compared to the current iterate
``\B{v}_{k}``. The ``\B{v}_0`` is initialized as a vector with ``M``
zeros. By choosing,

```math #precond-mat
\B{M}_k = \text{diag} \left ( 1 \oslash \sqrt{\B{v}_{k+1}} \right )
```

with  ``\oslash`` elementwise division, the effective stepsize for
network weights with large (on average) gradients, i.e., large
sensitivities, is lowered whereas weights with small (on average)
gradients get updated with a larger effective stepsize. To avoid
division by zero, we add a small value to the denominator of equation
#precond-mat\. By introducing the preconditioning matrix ``\B{M}_k`` all
weights are updated similarly, which allows us to increase the stepsize.
Following @li2016preconditioned\, we set ``\beta = 0.99``. In addition
to leveling the playing field, for the gradients themselves the
preconditioning matrix also scales the essential additive noise term so
the random walk proceeds isotropically.

### Practical verification

While there exists a well established literature on how to verify
whether Markov chains produce accurate samples from the posterior
distribution [see @gelman_rubin_92], these methods are typically
impractical for our problem. We will adopt a more pragmatic approach to
assess the accuracy of the samples drawn from our Markov chains computed
with SGLD, as it will be explained in the following.

We first validate the accuracy of sampling from a single Markov chain by
computing confidence intervals. These intervals are computed from
posterior samples obtained via one MCMC chain using SGLD (equation
#sgld\). By definition, these confidence intervals provide the range
within which the weights and therefore the image are expected to fall.
This means that MAP estimates for the seismic image should ideally fall
within these confidence intervals computed from the posterior samples.
While the variability among MAP estimates is less than the variability
among true posterior samples, we still find this test of practical
importance. To verify this, we compute multiple MAP estimates (see
equations #MAP-w and #MAP\) for different independent random
initializations of the deep prior weights, ``\B{w}``. MAP estimates are
obtained via stochastic optimization using the RMSprop optimization
algorithm [@rmsprop], which uses the same preconditioning scheme
(equations #ruunning-grad and #precond-mat\) as SGLD. By checking whether
the different MAP estimates indeed fall within the computed confidence
interval, the accuracy of the samples from the posterior can at least be
verified qualitatively.

Ideally, different Markov chains initialized with different weights
should lead to similar statistics for samples of the posterior
distribution. We verify this empirically by running chains with
different independently randomly initialized weights, followed by visual
inspection of the conditional mean and pointwise standard deviation
derived from samples generated by the different chains. Deviations among
the estimates provides us with at least an indication of areas in the
image where we should be less confident on the inferred statistics.

### The SGLD Algorithm

The different steps of generating ``n_w=K/2`` samples from the posterior
from ``K`` iterations of SGLD (equation #sgld) are summarized in
Algorithm #alg\ for a given set of ``n_s`` processed shot records and
their respective source signatures, ``\left \{ \B{d}_{i}, \B{q}_{i}
\right \}_{i=1}^{n_s}``. Aside from shot data, Algorithm #alg\ requires
a smooth background model, ``\B{m}_0``, for the squared slowness and a
fixed realization for the latent variable ``\B{z} \sim \mathrm{N}(
\B{0}, \B{I})``. In addition to these input vectors, SGLD requires
hyperparameters to be set for the

- **Stepsize strategy.** Following @teh2016consistency, the decay rate
parameter in equation #stepsize is set to ``\gamma=\frac{1}{3}``. The
stepsize constants ``a,\ b`` in equation #stepsize are chosen separately
for each presented numerical experiment to ensure fast convergence in
the warmup phase. Specifically, we select ``a`` large enough to ensure
fast convergence while making sure the initial iterations do not diverge
due to large a stepsize. We selected ``b`` to be the same as the
stepsize that would yeld a good convergence for the MAP estimation
problem, i.e., SGLD iterations without the additive noise. This is to
ensure SGLD iterations get close enough to mode(s) of the distribution
toward the end. While selecting these parameters differently changes the
speed of converge, the accuracy of the resulting samples is empirically
verified for the chosen parameters.

- **Preconditioning.** As documented in the literature, we chose
  ``\beta=0.99`` in equation #ruunning-grad\.

- **Noise variance.** The variance ``\sigma^2`` of the noise assumed to
  be known.

- **Regularization parameter.** As with many inverse problems, the
  selection of the regularization parameter ``\lambda^{-2}`` is
  challenging. While sophisticated techniques [@aster2018parameter]
  exist to estimate this parameter, we tune the regularization parameter
  ``\lambda^{-2}`` by hand to limit the imaging artifacts visually.

- **Number of iterations and warmup.** We run ``10\,``k SGLD iterations
  (equation #sgld\) in total, and we adopt the general practice of
  discarding the first half of the obtained samples [@gelman_rubin_92].

Given the above inputs, Algorithm #alg\ proceeds by running ``K``
iterations during which simultaneous shot records, each made of a
Gaussian weighted source aggregate, are selected, followed by
calculations of the gradient (line ``3``), calculation of the
preconditioner (lines ``4-5``), stepsize (line ``6``), and update of the
weights (line ``8``). After ``K/2`` iterations, the updated weight also
serve as samples from the posterior [@gelman_rubin_92;
@welling2011bayesian].

### Algorithm: {#alg}
| **Input:** \vspace{.01em}
|    ``\left \{ \B{d}_{i}, \B{q}_{i}  \right \}_{i=1}^{n_s}``
    \hspace*{\fill} `// observed data and source signatures`\vspace{.01em}
|    ``\B{m}_0``
    \hspace*{\fill} `// smooth background squared-slowness model`\vspace{.01em}
|    ``\B{z} \sim \mathrm{N}( \B{0}, \B{I})``
    \hspace*{\fill} `// fixed input to the CNN`\vspace{.01em}
|    ``\lambda^{-2}``
    \hspace*{\fill} `// variance of Gaussian prior on CNN weights` \vspace{.01em}
|    ``\sigma^2``
    \hspace*{\fill} `// estimated noise variance` \vspace{.01em}
|    ``\beta``
    \hspace*{\fill} `// weighting parameter for constructing the preconditioning matrix` \vspace{.01em}
|    ``a, b``
    \hspace*{\fill} `// stepsize parameters in equation` #stepsize \vspace{.01em}
|    ``K``
     \hspace*{\fill} `// maximum MCMC steps` \vspace{.6em}
| **Initialization:** \vspace{.01em}
|    randomly initialize CNN parameters, ``\B{w}_{0}\in\R^M``
     \hspace*{\fill} @pmlrv9glorot10a \vspace{.01em}
|    initialize vector, ``\B{v}_0\in\R^M`` with zero  \vspace{.6em}
| 1. **for** ``k=0`` **to** ``K-1`` **do** \vspace{.4em}
| 2.    randomly draw ``i\subset \{1, \ldots, n_s\}``
    \hspace*{\fill} `// sample without replacement` \vspace{.1em}
| 3.    ``\delta\B{w} = \nabla_{\B{w}} \left (
    \frac{n_s }{2 \sigma^2} \big \| \B{d}_i- \B{J}(\B{m}_0,
    \B{q}_i) {g} (\B{z}, \B{w}_k) \big \|_2^2
    + \frac{\lambda^2}{2} \big \| \B{w}_k \big \|_2^2
    \right )``
    \hspace*{\fill} `// equation` #gradient \vspace{.1em}
| 4.    ``\B{v}_{k+1} =\beta \B{v}_{k} + \left (1 - \beta \right )
    \delta\B{w} \odot \delta\B{w}``
    \hspace*{\fill} `// equation` #ruunning-grad \vspace{.3em}
| 5.    ``\B{M}_k = \text{diag} \left ( 1 \oslash
    \sqrt{\B{v}_{k+1}} \right )``
    \hspace*{\fill} `// equation` #precond-mat \vspace{.3em}
| 6.    ``\alpha_{k}= a (b+k)^{-\gamma}``
    \hspace*{\fill} `// equation` #stepsize \vspace{.3em}
| 7.    ``\boldsymbol{\eta}_k \sim \mathrm{N}( \B{0}, \alpha_k \B{M}_k)``
    \hspace*{\fill} `// draw noise to add to gradient` \vspace{.3em}
| 8.    ``\B{w}_{k+1} = \B{w}_{k} - \frac{\alpha_k}{2} \B{M}_k
    \delta\B{w} + \boldsymbol{\eta}_k``
    \hspace*{\fill} `// update rule according to equation` #sgld \vspace{.3em}
| 9. **end for** \vspace{.6em}
| **Output:** ``\left \{ \B{w}_{k} \right \}_{k=K/2+1}^K``
    \hspace*{\fill} `// samples from the posterior` ``p_{\text{post}} ( \B{w} \mid \B{d} )`` \vspace{.3em}
:Seismic imaging posterior sampling with SGLD.

## Validating Bayesian inference

In this section, we validate our approach with synthetic examples. To
mimic a realistic imaging scenario where the ground truth is known, we
do this on a "quasi"-field data set, made out of noisy synthetic shot
data generated from a real migrated image. After demonstrating the
benefits of regularization with the deep prior, we compare the MAP
estimate with the conditional mean. The latter minimizes the Bayesian
risk, i.e., it minimizes in expectation the ``\ell_2``-norm squared
difference between the true image and inverted image, given shot data
[@anderson1979]. We conclude by reviewing the pointwise standard
deviation as a measure of uncertainty, which can be reaped from samples
drawn from the posterior.

### Problem setup

With few exceptions, synthetic models often miss realistic statistics of
the spatial distribution of the seismic reflectivity. To avoid working
with over simplified seismic images, we generate "quasi"-field shot data
derived from a 2D subset of the real prestack Kirchhoff time migrated
[Parihaka-3D] dataset [@Veritas2005; @WesternGeco2012] released by the
New Zealand government. We call our experiment "quasi" real because
synthetic data is generated from migrated field data that serves as a
proxy for the unknown true medium perturbations (Figure
#true-model\). Due to the nature of the migration algorithm used to
obtain the Parihaka dataset, the amplitudes in the extracted 2D subset
are not necessarily consistent with the seismic imaging forward model
presented in this paper (equation #linear-fwd-op\). For this reason, we
normalized the amplitudes of the extracted seismic image. Given these
perturbations, shot data is generated with the linearized Born
scattering operator for a made up, but realistic, smoothly varying
background model ``\B{m}_0`` for the squared slowness (Figure
#smooth-model). To ensure good coverage,  ``205`` shot records are
simulated and sampled with a source spacing of ``25\, \mathrm{m}``\.
Each shot is recorded over ``1.5`` seconds with ``410`` fixed receivers
sampled at ``12.5\, \mathrm{m}`` spread across full survey area. The
source is a Ricker wavelet with a central frequency of ``30\,
\mathrm{Hz}``\.

[Parihaka-3D]:https://wiki.seg.org/wiki/Parihaka-3D

### Figure: {#problem-steup .wide}
![](figures/Figure_1a.png){width=95% #true-model} \
![](figures/Figure_1b.png){width=95% #smooth-model}
:Problem setup. (a) A 2D subset of the Parihaka dataset, considered as
true model. (b) Made up smooth squared-slowness background model.

We also add a significant amount of band-limited noise to the shot data
by filtering Gaussian white noise with the source wavelet. The resulting
signal-to-noise ratio for all data is ``-8.74\, \mathrm{dB}``, which is
low. Figure #data shows an example of a single noise-free (Figure
#d-noise-free\) and noisy (Figure #d-colored-noise\) shot record.

### Figure: {#data .wide}
![](figures/Figure_2a.png){width=35% #d-noise-free}
![](figures/Figure_2b.png){width=35% #d-colored-noise}
:A shot record generated from an image extracted from the Parihaka
dataset. (a) Noise-free linearized data. (b) Linearized data with
band-limited noise.

Even though our example is in 2D, the number of parameters (the weights
of the deep prior network) is large (approximately ``40`` times larger
than image dimension), which results in many SGLD iterations. In a
setting where we are content with approximate Bayesian inference, i.e.,
where the validity of the Markov chains can be established qualitatively
in the way described earlier, we found that ten thousand iterations are
adequate. We adopt the general practice of discarding the first half the
MCMC iterations (about ``25`` passes over the data)
[@gelman2013bayesian], which leaves five thousand iterations dedicated
to posterior sampling phase [@zhu2018seismic]. The stepsize sequence is
chosen according to equation #stepsize with ``a,\ b`` chosen such the
stepsize decreases from ``10^{-2}`` to ``5 \times 10^{-3}``\.

### Imaging with versus without the deep prior

For reference, we first compare imaging results with and without
regularization. The latter is based on maximizing the likelihood
(equation #MLE) whereas the former involves maximizing the posterior
distribution (equation #MAP-w). To prevent overfitting of the MLE
estimate, the number of iterations is limited to the equivalent of only
four data passes (four loops over all shots). To ensure convergence, the
number of data passes (or epochs) for the MAP estimate was set to ``15``
(about three thousand iterations). Since the ground truth is known, the
optimal value for the ``\lambda^{-2}=5 \times 10^{-3}`` was found by
grid search and picking the value that visually limits imaging
artifacts. Results of minimizing the negative log-likelihood and
negative log-posterior are included in Figures #mle-colored and
#map-colored, respectively. We obtained these results with the RMSprop
optimization algorithm [@rmsprop], which uses the same preconditioning
scheme (equations #ruunning-grad and #precond-mat\) as SGLD that we will
use later to conduct Bayesian inference. Compared to vanilla SGD with a
fixed stepsize, RMSprop is an adaptive stepsize method conducive to the
preconditioner introduced in equations #ruunning-grad and #precond-mat\.
As with the SGLD updates, the gradient calculations involve a single
randomly selected shot record. As expected, compared to the MAP estimate
with signal-to-noise ratio (SNR) ``8.79\,``dB, the MLE estimate (SNR
``8.25\,``dB) lacks important details, e.g., weak reflectors in deeper
sections, and exhibits strong artifacts, including imaged reflectors
that are noisy and lack continuity. The latter is important since the
estimated seismic image will be used to automatically track horizons.

### Bayesian inference with deep priors

As the comparison between MLE and MAP estimates clearly showed,
regularization improves the image but important issues remain. First,
the use of deep priors can lead to overfitting even when a Gaussian
prior on the weights is included. As reported in the literature
[@Lempitsky; @Cheng_2019_CVPR], stopping early can be a remedy but a
stopping criterion remains elusive rendering this type of regularization
less effective. Second, the uncertainty is not captured by the MAP
estimation. As we will demonstrate, the ability to draw samples from the
posterior remedies these issues.

#### Conditional mean

As described earlier, samples from the posterior provide access to
useful statistical information including approximations to moments of
the distribution such as the mean. With the minor modifications proposed
by @li2016preconditioned to the RMSprop optimization algorithm, the
posterior distribution can be sampled with Algorithm #alg after a warmup
phase of about ``25`` data passes. The resulting samples for the weights
are then used, after push forward (see equation #push-forward), to
approximate the conditional mean, ``\delta \B{m}_{\text{CM}}``, by
computing the sum in equation #conditionalmean\. Compared to the MAP
estimate (Figures #map-colored and #cm-colored\), the ``\delta
\B{m}_{\text{CM}}`` (SNR ``9.66\,``dB) is tantamount to another
significant improvement especially for weaker reflectors in the deeper
part of the image and for reflectors denoted by the arrows.

While there has been a debate in the literature on the accuracy of the
MAP versus conditional mean estimates in the context of regularization
with handcrafted priors, such as total variation [@burger2014maximum],
we find that the conditional mean estimate negates the need to stop
early and is also more robust with respect to noise.

### Figure: {#mle-vs-map-vs-cm .wide}
![](figures/Figure_3a.png){width=90% #mle-colored} \
![](figures/Figure_3b.png){width=90% #map-colored} \
![](figures/Figure_3c.png){width=90% #cm-colored}
:Imaging with deep priors of a 2D subset of the Parihaka dataset (a)
MLE, i.e., minimizer of equation #imaging-likelihood with respect to
``\delta \B{m}``, with SNR ``8.25\,``dB. (b) The MAP estimate, i.e.,
minimizer of equation #imaging-obj\, followed by a mapping onto the
image space via ``g`` (equation #MAP\), with SNR ``8.79\,``dB. (c)
The conditional (posterior) mean estimate, ``\delta \B{m}_{\text{CM}}``,
with SNR ``9.66\,``dB. All figures are displayed with the same color clipping values.

#### Pointwise standard deviation and histograms

To assess variability among the different samples from the posterior, we
include a plot of the pointwise standard deviation
``\boldsymbol{\sigma}_{\text{post}}`` (equation #pointwise-std\) in
Figure #std-colored\. This quantity is a measure for uncertainty. To
avoid bias by strong amplitudes in the estimated image, we also plot the
stabilized division of the standard deviation by the envelope of the
conditional mean in Figure #std-normalized\. From these plots in Figure
#std-plots\, we observe that as expected uncertainty is large in areas
with a complex geology, e.g., along the faults and along the tortuous
reflectors, and in areas with relative poor illumination deep in the
image and near the edges. On the other hand, the shallow areas of the
image exhibit low uncertainty, which is to be expected due to proximity
to the sources and receivers.

### Figure: {#std-plots .wide}
![](figures/Figure_4a.png){width=90% #std-colored} \
![](figures/Figure_4b.png){width=90% #std-normalized}
:Imaging uncertainty quantification for the Parihaka example. (a) The
pointwise standard deviation among samples drawn from the posterior,
``\boldsymbol{\sigma}_{\text{post}}``. (b) Normalized pointwise standard
deviation by the conditional mean estimate (Figure #cm-colored\).


To illustrate how the posterior regularized by the deep prior is
informed by the likelihood, we also calculated histograms at three
locations denoted by the white circles in Figure #std-colored\.
Histograms from the prior are calculated by randomly sampling network
weights from the prior distribution, i.e., ``\mathrm{N}(\B{w} \mid
\B{0}, \lambda^{-2}\B{I})``\, followed by computing the deep prior
network's output for a random but fixed ``\B{z}``. The resulting
histograms are plotted in light gray in Figure #marginals\. Similarly,
histograms for the posterior are computed (equation #push-forward\)
from samples of the posterior for the weights. These are plotted in dark
gray. As expected, the histograms for the posterior are considerably
narrower than those of the prior, which means that the posterior is
informed by the shot data. We also see that the width of the histograms
increases in areas with larger variability. For comparison, we added the
conditional mean estimates with dashed vertical line. When compared with
the ground truth values denoted by the solid vertical lines, we observe
that the ground truth falls inside of the nonzero pointwise posterior
interval, which confirms the benefits of the prior.

### Figure: {#marginals .wide}
![](figures/Figure_5a.png){width=85% #hist-29x25-colored} \
![](figures/Figure_5b.png){width=85% #hist-62x94-colored} \
![](figures/Figure_5c.png){width=85% #hist-182x139-colored}
:Pointwise prior (light gray) and posterior (dark gray) histograms
along with the true perturbation values (solid black line) and
conditional mean (dashed black line) for points located at (a)
``(0.725\, \mathrm{km}, \ 0.312\, \mathrm{km})``, (b) ``(1.550\,
\mathrm{km}, \ 1.175\, \mathrm{km})``, and (c) ``(4.550\, \mathrm{km},\
1.738\, \mathrm{km})``.

### Accuracy and convergence verification

Drawing samples from the posterior distribution via Markov chains can be
subject to errors when the chain is not long enough
[@gelman2013bayesian]. Unfortunately, the required length of the chain
is often infeasible in practice, certainly when the forward operators is
expensive to calculate as is the case with our imaging examples. As
explained earlier, we qualitative verify the accuracy of the Bayesian
inversion by comparing MAP estimates with confidence intervals
[@friedman2001elements] and by running different Markov chains
[@fang2018uqfip].

#### Consistency with empirical confidence intervals

As a first assessment of the accuracy of the MCMC sampling, we computed
the relative errors of ``15`` MAP estimates with respect to the ground
truth image obtained for a single fixed ``\B{z}`` but different
initializations of the network weights. The decay of the relative
``\ell_2``-norm error for each run over ``3\mathrm{k}`` iterations are
plotted in Figure #error-log-colored and show relatively small
variations from random realization to random realization. Vertical
profiles of the MAP estimates at two lateral positions confirm this
behavior. With few exceptions, these different MAP estimates fall well
within the shaded ``99\%`` confidence intervals plotted in Figures
#trace-80-colored and #trace-160-colored\. The confidence intervals
themselves were derived from samples of the posterior. Except for
perhaps the deeper part of the model, we can be confident that the
Bayesian inference is reasonable certainly in the light of the
nonlinearity of the deep prior itself.

### Figure: {#confidence-intervals .wide}
![](figures/Figure_6a.png){width=75% #error-log-colored} \
![](figures/Figure_6b.png){width=35% #trace-80-colored}
![](figures/Figure_6c.png){width=35% #trace-160-colored}
:Confidence intervals empirical verification. (a) Relative error in the
estimated perturbation model for ``15`` different initialization of the
deep prior, with respect to the ground truth image. Traces of ``99 \%``
confidence interval and ``15`` realizations of the MAP estimate,
``\delta \B{m}_{\text{MAP}}``, at (b) ``2.0\, \mathrm{km}`` and (c)
``4.0\, \mathrm{km}`` horizontal location.

#### Chain to chain variations

To further assure our Bayesian inference is accurate, we conducted a
second experiment comparing estimates for the conditional mean and
confidence intervals for three different Markov chains computed from
independent random initialization for the weights and ``\B{z}`` fixed.
Since we can not afford to run the Markov chains to convergence, we
expect slightly different results for the conditional mean and
confidence intervals. As observed from Figure #mcmc-convergence\, this
is indeed the case but the variations are relatively minor and confined
to the deeper part of the image. This qualitative observation, in
conjunction with the behavior of the MAP estimates, suggests that we can
be confident that the presented Bayesian inference is reasonably
accurate certainly given the task of horizon tracking at hand.

### Figure: {#mcmc-convergence .wide}
![](figures/Figure_7a.png){width=35% #mean-80-colored}
![](figures/Figure_7b.png){width=35% #mean-160-colored} \
![](figures/Figure_7c.png){width=35% #std-80-colored}
![](figures/Figure_7d.png){width=35% #std-160-colored}
:MCMC convergence diagnosis. Conditional posterior mean with three
independent Markov chains at (a) ``2.0\, \mathrm{km}`` and (b) ``4.0\,
\mathrm{km}``. Confidence intervals at (c) ``2.0\, \mathrm{km}`` and (d)
``4.0\, \mathrm{km}``\.

## Probabilistic horizon tracking

Typically, seismic images serve as input to a decision process involving
identification of certain attributes within the image preferably
including an assessment of their uncertainty. With very few exceptions
[@Ely2018], these assessments of risk are not based on a systematic
approach where errors in shot data are propagated to uncertainty in the
image and subsequent tasks. To illustrate how the proposed Bayesian
inference can serve to assess uncertainty on downstream tasks, we
consider horizon tracking, where reflector horizons are extracted
automatically from seismic images given a limited number of user
specified control points. Typically, these control points are either
derived from well data available in the area or from human
interpretation. The horizon tracking can be deterministic, i.e.,
horizons are determined uniquely given a seismic image, or more
realistically, it may be nondeterministic, i.e., multiple possible
horizons explaining a single image. In either case, the task of
delineating the stratigraphy automatically with no to little
intervention by interpreters is challenging certainly in areas where the
geology is complex, e.g., near faults. To resolve these complex areas
high quality images including information on uncertainty are essential.

To set the stage to put tasks conducted on seismic images on a firm
statistical footing, we will make the assumption that these tasks are
only informed by the estimated image and not by the shot data
explicitly. This means when given an estimated image the task, e.g., of
horizon tracking, is assumed to be statistically independent of the shot
data. Formally, this conditional independence can be expressed as

```math #conditional-independence
( \B{h} \mathrel{\text{$\perp\mkern-10mu \perp$}}
    \B{d})  \mid \delta \B{m},
```

where the random variable ``\B{h}`` encodes tracked horizons. The symbol
``\mathrel{\text{$\perp\mkern-10mu \perp$}}`` represents conditional
independence [@Dawid1979] in this case between the estimated horizons
and shot records, given the seismic image, i.e., the seismic images,
``\delta \B{m}``, obtained from the shot records, ``\B{d}``, contain all
the needed information to predict horizons, ``\B{h}``. The assumed
statistical independence implies that the tracked horizons, ``\B{h}``,
can be predicted unequivocally from estimated images. Because of the
independence, shot data does not bring forth additional information on
the horizons. For the remainder of this paper, we denote the task on the
image by ``\mathcal{H}``, which for horizon tracking implies
``\B{h}=\mathcal{H}(\delta\B{m})``.

### Bayesian formulation

Given the mapping from image to horizons, let ``p_{\mathcal{H}} \left
(\B{h} \mid \delta \B{m} \right )`` represent the conditional PDF of
horizons given an estimate for the seismic image. This distribution is
characterized by the nondeterministic behavior of ``\mathcal{H}`` and
assigns probabilities to horizons in the image, ``\delta \B{m}``. In the
case where ``\mathcal{H}`` represents automatic horizon tracking this
mapping requires control points as an additional input. In this context,
sampling from the conditional distribution is equivalent to performing
automatic horizon tracking with different realizations of the control
points. Alternatively, when ``\mathcal{H}`` represents actions by human
interpreters, samples from ``p_{\mathcal{H}} \left (\B{h} \mid \delta
\B{m} \right )`` can be thought of as horizons tracked by different
individuals.

Provided samples from ``p_{\mathcal{H}} \left (\B{h} \mid \delta \B{m}
\right )`` and assuming conditional independence between ``\B{h}`` and
``\B{d}`` given the seismic image described in equation
#conditional-independence, we can perform Bayesian inference with the
posterior distribution of horizons, denoted by ``p_{\text{post}} \left
(\B{h} \mid \B{d} \right )``\. Generally speaking, for any arbitrary
function of horizons, ``f``, expectations with respect to
``p_{\text{post}} \left (\B{h} \mid \B{d} \right )`` can be computed as
follows:

```math #horizon-inference
\mathbb{E}_{\B{h} \sim p_{\text{post}} \left (\B{h} \mid \B{d} \right )}
    \ \left [ f \left (\B{h} \right) \right ] & = \int f \left (
    \B{h} \right) p_{\text{post}} \left (\B{h} \mid \B{d} \right )
    \mathrm{d} \B{h}
= \iint f \left (\B{h} \right) p \left (\B{h}, \delta \B{m} \mid \B{d} \right )
    \mathrm{d} \B{h}\, \mathrm{d} \delta \B{m} \\
& = \iint f \left (\B{h} \right) p_{\mathcal{H}}
    \left (\B{h} \mid \delta \B{m}, \B{d} \right )
    p_{\text{post}} \left ( \delta
    \B{m} \mid \B{d} \right )
    \mathrm{d} \B{h}\, \mathrm{d} \delta \B{m} \\
& = \iint f \left (\B{h} \right) p_{\mathcal{H}}
    \left (\B{h} \mid \delta \B{m} \right )
    p_{\text{post}} \left ( \delta
    \B{m} \mid \B{d} \right )
    \mathrm{d} \B{h}\, \mathrm{d} \delta \B{m} \\
& = \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \left [ \int f \left (\B{h} \right)
    p_{\mathcal{H}} \left (\B{h} \mid \delta \B{m} \right )
     \mathrm{d} \B{h} \right ] \\
& = \underbrace {\mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \underbrace {\mathbb{E}_{
    \B{h} \sim p_{\mathcal{H}} \left (\B{h} \mid \delta \B{m} \right ) } \left [
    f \left (\B{h} \right) \right ]. }_{\substack{
    \text{ uncertainty in} \\ \text{ horizon tracking}}}}_{
    \substack{\text{ uncertainty in} \text{ seismic imaging}}} \\
```

The second equality in the first line of equation #horizon-inference
follows from the law of total probability[^2], the second line is
obtained by applying the chain rule of PDFs[^3] to the joint density ``p
\left (\B{h}, \delta \B{m} \mid \B{d} \right )``, and the third line
exploits the conditional independence assumption in equation
#conditional-independence\. Conceptually, equation #horizon-inference
states that we can decompose the uncertainty in horizon tracking into
two parts, namely uncertainty in imaging and uncertainty in the horizon
tracking task itself. Based on equation #horizon-inference\,
expectations over ``p_{\text{post}} \left (\B{h} \mid \B{d} \right )``
can be calculated via Monte Carlo integration using samples from
``p_{\text{post}} \left (\B{h} \mid \B{d} \right )``. Thanks to the
conditional independence assumption in equation
#conditional-independence\, we can sample from ``p_{\text{post}} \left
(\B{h} \mid \B{d} \right )`` by sampling the imaging posterior,
``p_{\text{post}} \left (\delta \B{m} \mid \B{d} \right )``, followed by
tracking the horizons in each seismic image. This step yields an
ensemble of possible horizons for each sampled image. Using samples
drawn from ``p_{\text{post}} \left (\B{h} \mid \B{d} \right )``, we
approximate the expectation in equation #horizon-inference by the sample
mean. In the following sections, we break equation #horizon-inference
down into two cases where the horizon tracker yields an unique set of
horizons or multiple sets of likely horizons, given one seismic image.

[^2]: ``p(x) = \int_{\mathcal{Y}} p(x, y)\,\mathrm{d}y``, where ``x \in
    \mathcal{X}`` and ``y \in \mathcal{Y}`` are two arbitrary random
    variables.
[^3]: ``p(x, y) = p(x \mid y)\, p(y),  \ \forall\, x \in \mathcal{X},\, y \in
    \mathcal{Y}``.

#### Case ``1``: horizons are unique given an image

In the simplest case, where horizon tracking uniquely determines the
horizons given a seismic image, the conditional PDF ``p_{\mathcal{H}}
\left (\B{h} \mid \delta \B{m} \right )`` corresponds to a delta
function, i.e., we have

```math #probH
p_{\mathcal{H}}(\B{h} \mid \delta \B{m}) = \delta_{\left [
    \B{h} = \mathcal{H} \left ( \delta \B{m} \right )
    \right ]} \left (\B{h} \right ), \\
```

where ``\mathcal{H}`` represent the deterministic horizon tracking map
and ``\delta(\cdot)`` stands for the delta Dirac distribution.
Substituting equation #probH into equation #horizon-inference yields

```math #independence-det
\mathbb{E}_{\B{h} \sim p_{\text{post}} \left (\B{h} \mid \B{d} \right )}
    \ \left [ f \left (\B{h} \right) \right ]
& = \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \left [ \int f \left (\B{h} \right)
    \delta_{\left [\B{h} = \mathcal{H} \left ( \delta \B{m}
    \right ) \right ]} \left (\B{h} \right ) \mathrm{d}
    \B{h} \right ], \\
& = \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \left [ f \left ( \mathcal{H} \left ( \delta
    \B{m} \right ) \right ) \right ], \\
& \approx \frac{1}{n_{\mathrm{w}}}\sum_{j=1}^{n_{\mathrm{w}}} f
    \left ( \mathcal{H} ( {\delta \B{m}}_j ) \right ), \\
```

where ``\left \{ {\delta \B{m}}_j \right \}_{j=1}^{n_{\mathrm{w}}} \sim
p_{\text{post}} ( \delta \B{m} \mid \B{d} )`` are ``n_{\mathrm{w}}``
samples from the posterior distribution. Equation #independence-det
essentially means that in case of a deterministic horizon tracker
uncertainty in imaging can be translated to uncertainty in horizon
tracking by simply drawing samples from the seismic imaging posterior
and tracking horizons in each image. This procedure results in samples
from the posterior distribution of horizons and inference of this
posterior distribution is done via the equation above.

#### Case ``2``: multiple likely horizons given an image

The probabilistic horizon tracking approach, described in equation
#horizon-inference\, also admits nondeterministic horizon trackers,
e.g., automatic horizon tracking with uncertain control points, e.g.,
points provided by multiple human interpreters. In this case, instead of
having one set of horizons for each seismic image, we have multiple
realizations of horizons that each agree with a seismic image, i.e.,
they are samples from ``p_{\mathcal{H}}(\B{h} \mid \delta \B{m})``. With
these samples, the inner expectation in equation #horizon-inference can
be estimated. Assuming that for each image we have ``n_h`` different
realizations of tracked horizons, namely, ``h_k^{(\delta \B{m})} \sim
p_{\mathcal{H}}(\B{h} \mid \delta \B{m}), \
k=1,\cdots, n_h``, equation #horizon-inference becomes

```math #sum-horizons-points
\mathbb{E}_{\B{h} \sim p_{\text{post}} \left (\B{h} \mid \B{d} \right )}
    \ \left [ f \left (\B{h} \right) \right ]
& = \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \mathbb{E}_{
    \B{h} \sim p_{\mathcal{H}} \left (\B{h} \mid \delta \B{m} \right ) } \left [
    f \left (\B{h} \right) \right ] \\
& \approx \mathbb{E}_{\delta \B{m} \sim p_{\text{post}} \left (\delta \B{m} \mid
    \B{d} \right ) } \left [ \frac{1}{n_h} \sum_{k=1}^{n_h}
    f \left ( h_k^{(\delta \B{m})}\right )
    \right ], \\
& \approx \frac{1}{n_h n_{\mathrm{w}}} \sum_{j=1}^{n_{\mathrm{w}}}
    \sum_{k=1}^{n_h} f \left ( h_k^{({\delta \B{m}}_j)}\right )
```

where ``h_k^{({\delta \B{m}}_j)}`` is the ``k``th sample from
``p_{\mathcal{H}}(\B{h} \mid {\delta \B{m}}_j)`` and ``{\delta
\B{m}}_j`` is the ``j``th sample from ``p_{\text{post}} \left (\delta
\B{m} \mid \B{d} \right)``. Because it has an extra sum over the
different realizations of tracked horizons for a fixed image, equation
#sum-horizons-points differs from equation #independence-det\. In the
ensuing sections, we show how equations #independence-det and
#sum-horizons-points can be used to calculate pointwise estimates for
the first two moments of the posterior distribution over horizons.

#### Uncertainty quantification in horizon tracking

It is often beneficial to express uncertainty in the form of confidence
intervals. For this purpose, equation #independence-det or
#sum-horizons-points is evaluated first for ``f(\B{h}) =\B{h}``. This
yields the conditional mean estimate for the horizons denoted by

```math #cm-horizons
\boldsymbol{\mu}_{\B{h}} = \mathbb{E}_{\B{h} \sim p_{\text{post}} \left (\B{h} \mid \B{d} \right )} [ \B{h} ].
```

Similarly, the pointwise standard deviation of the horizons can be
computed by choosing ``f(\B{h}) = (\B{h} - \boldsymbol{\mu}_{\B{h}})
\odot (\B{h} - \boldsymbol{\mu}_{\B{h}})``. This latter point estimate
can be calculated via

```math #std-horizons
\boldsymbol{\sigma}^2_{\B{h}} = \mathbb{E}_{\B{h} \sim p_{\text{post}} \left (\B{h} \mid \B{d} \right )} [ (\B{h} - \boldsymbol{\mu}_{\B{h}}) \odot (\B{h} - \boldsymbol{\mu}_{\B{h}}) ],
```

where ``\boldsymbol{\sigma}_{\B{h}}`` denotes the pointwise standard
deviation. The ``99\%`` confidence interval for horizons is the interval
``\boldsymbol{\mu}_{\B{h}} \pm 2.576\, \boldsymbol{\sigma}_{\B{h}}``.
Contrary to most existing automatic horizon trackers, the uncertainty
estimates we provide here are determined by uncertainties in the image
due to noise, and possibly linearization errors, in the shot records.

## Probabilistic horizon tracking---"ideal low-noise" Parihaka example

The main goal of this paper is to derive a systematic approach to
propagate uncertainty in imaging to the task at hand. For this purpose,
we first consider the relatively ideal case of uncertainty due to
additive random noise in the shot data. To illustrate how the presence
of this noise affects the task of horizon tracking, we apply the
proposed probabilistic framework to the Parihaka imaging example
discussed earlier. Because the seismic shot data for this example is
relatively low-frequency (``30\,``Hz source peak frequency) and the
geology relatively simple, horizons are not that challenging to track.
However, there is a substantial amount of noise in the shot data that we
need to contend with when tracking the imaged horizons. For the latter
task, we deploy the tracking approach introduced by @wu2018least\, which
requires the user to provide control points on the seismic horizons of
interest.

To setup this tasked imaging experiment, we select ``25`` horizons from
the conditional mean estimate (Figure #cm-colored\) calculated for
the Parihaka seismic imaging example. Next, control points are picked
for the selected horizons at various horizontal positions, separated by
``1\, \mathrm{km}``. We group the control points with the same
horizontal location, yielding five sets of control points. Figure
#control-points shows these five sets located at lateral positions
``0.5``, ``1.5``, ``2.5``, ``3.5``, and ``4.5\, \mathrm{km}``.

### Figure: {#control-points .wide}
![](figures/Figure_8.png){width=70%}
:Five sets of control points identifying ``25`` horizons of interest.

To separate the effect of errors in the shot data and variations amongst
provided control points, we consider noisy shot data first, followed by
the situation where there is uncertainty due to noise in the shot data
and due to variations in the control points.

### Uncertainty due to noise in shot data (case 1)

To calculate noise-induced uncertainties in horizon tracking (case 1),
we pass samples from the imaging posterior distribution,
``p_{\text{post}} \left (\delta \B{m} \mid \B{d} \right )``, to the
automatic horizon tracking software [@wu2018least]. Given the five sets
of selected control points (Figure #control-points\), the tracker
generates for each sample of the imaging posterior ``25`` horizons
according to equation #independence-det\. For each set of control
points, the conditional mean and  ``99\%`` confidence intervals are
calculated included in Figure #horizon-uq-det\. Each plot (Figures
#cp-40-colored -- #cp-200-colored) corresponds to tracked horizons with
confidence intervals derived from different sets of control points. As
expected, the results exhibit more uncertainty for horizons tracked in
the deeper parts of the image and close to boundaries, which is
consistent with the relative poor illumination in these areas. Moreover,
uncertainty in the tracked locations increases away from the control
points. This increase in uncertainty agrees with the inherent challenge
of automatic horizon tracking across areas of poor illumination, faults
and tortuous reflectors. This behavior is observed for each set of
control points.

### Figure: {#horizon-uq-det .wide}
![](figures/Figure_9a.png){width=50% #cp-40-colored}
![](figures/Figure_9b.png){width=50% #cp-80-colored} \
![](figures/Figure_9c.png){width=50% #cp-120-colored}
![](figures/Figure_9d.png){width=50% #cp-160-colored} \
![](figures/Figure_9e.png){width=50% #cp-200-colored}
:Uncertainty in horizon tracking only due noise in the shot data
(equation #independence-det). Control points are located at (a) ``500\,
\mathrm{m}``, (b) ``1500\, \mathrm{m}``, (c) ``2500\, \mathrm{m}``, (d)
``3500\, \mathrm{m}``, and (e) ``4500\, \mathrm{m}`` horizontal
location. Conditional mean estimates and ``99\%`` confidence intervals
are shown in solid and shaded colors, respectively.

Aside from shot noise induced uncertainty, variations in the control
point may also contribute to uncertainty in the horizon tracking. This
corresponds to case 2, which we consider in the next section.

### Uncertainty due to noise and uncertain control points (case 2)

The presence of noise in shot data is often not the only cause of
uncertainty within the task of (automatic) horizon tracking. Human
errors, or better variations in the selection of control points by
interpreters, may also contribute to uncertainty. To mimic differences
in selected control points, we impose a distribution over the control
points. For simplicity, in this example we assume that the five sets of
control points are equally likely to be accurate. This is to say, we are
equally certain of the accuracy of the picked control points. This can
be related to the case where we have access to wells in the seismic
survey area and we are certain of the well tying procedure. Other
sources of error such as uncertainty in location of the control points,
multiple human interpreters, etc, can also be incorporated in the
equation #sum-horizons-points, but will not be considered here. Given
the above assumption on the probability distribution, each realization
of the seismic image gives rise to multiple equally likely tracked
horizons for each of the five sets of control points.

Given these multiple tracked horizons for each image, pointwise first
and second moments of ``p_{\mathcal{H}} \left (\B{h} \mid \delta \B{m}
\right )`` are calculated via equations #cm-horizons and #std-horizons
by tracking horizons in each sample from the imaging posterior. The
horizon tracker yields five sets of horizons for each seismic image,
each obtained using one of the five sets of control points. Results of
this procedure are summarized in Figure #horizon-uq-nondet\. As in the
earlier examples, we observe an increase in uncertainty with depth and
at the boundaries. Contrary to small uncertainties near the control
points, we now observe uncertainty everywhere along the tracked
horizons, which suggests increased variability amongst horizons. The
variability comes from not trusting only one set of control points, but
incorporating information from all the fives sets of control points.

### Figure: {#horizon-uq-nondet .wide}
![](figures/Figure_10.png){width=75% #cm-std-colored}
:Uncertainty in horizon tracking due to a combination of uncertainties
in imaging and control points (equation #sum-horizons-points\).

## Probabilistic horizon tracking---"noisy" high-resolution Compass model example

While the seismic imaging examples considered so far were directly
obtained from migrated shot data of the Parihaka dataset, the
linearization errors, i.e., errors due to linearizing the wave equation
with respect to the background squared-slowness model, were ignored
because the quasi-real data was generated through the demigration
process for a made-up background model. Aside from this simplification,
the geology of the examples discussed so far was relatively simple and
imaged at low resolution. To account for a more realistic setting,
involving complex geology and high-resolution imaging, we will quantify
imaging and horizon tracking uncertainty given high-frequency but noisy
synthetic shot data. To mimic the complexities of field data, noisy shot
data---generated with nonlinear forward modeling on a 2D subset of the
Compass model [@Jones2012]---is used as input to the proposed imaging
scheme. We select the synthetic Compass model because it contains
realistic heterogeneity derived from both seismic and well data
collected in the North Sea [@Jones2012]\. Aside from a low
signal-to-noise ratio of ``-9.17\, \mathrm{dB}``, this example is
affected by linearization errors.

As before, we first describe the problem setup, followed by a comparison
between the unregularized MLE and deep-prior regularized MAP and
conditional mean estimates. After presenting results on uncertainty
quantification on the image, results of our probabilistic horizon
tracking framework will be discussed below.

### Problem setup

Split spread raw data, consisting of ``101`` shot records sampled with a
source spacing of ``25\, \mathrm{m}``\ and a receiver sampling ``12.5\,
\mathrm{m}``, is generated by solving the acoustic wave equation for the
2D subset of the Compass model. To mimic broadband data, the source is a
Ricker wavelet with a central frequency of ``40\, \mathrm{Hz}``\. Each
shot is ``1.6`` seconds long.

To image the shot data, we first derive a kinematically correct
background model via smoothing. Next, linearized data is created by
subtracting shot data simulated in this smooth background model from the
shot data simulated in the actual model. This data serves as input to
our imaging scheme.


### Imaging with uncertainty quantification

To arrive at the MLE and MAP estimates for the image, we follow the
procedure outlined before. The MLE estimate is obtained by minimizing
equation #deep-prior-likelihood with stochastic optimization limiting
the number of iterations to six passes over the ``101`` shots. Equation
#deep-prior is minimized with respect to ``\B{w}`` with the RMSprop
optimization algorithm [@rmsprop] for stepsize of ``10^{-3}`` and five
thousand iterations (about ``50`` passes over all shots). We stopped the
iterations after no further visual improvement to the image was
observed. The value for the tradeoff off parameter, ``\lambda^{-2}=3
\times 10^{-5}``, was set after extensive parameter testing guided by a
value that leads to the least amount of visual imaging artifacts.

Compared to the previous example, the MLE image estimate is of poor
quality (SNR ``2.80\,``dB) and contains many imaging artifacts stemming
from the noise and linearization errors. The MAP estimate, on the other
hand, is improved (SNR ``3.91\,\mathrm{dB}``\) thanks to regularization by the deep
prior but it does contain unrealistic artifacts and misses details
especially in the deeper parts of the image (juxtapose Figures
#BG_mle-colored and #BG_map-colored). By running Algorithm #alg for
ten thousand iterations, we compute five thousand samples from the
posterior distribution on the image following the stepsize strategy of
equation #stepsize with ``\gamma = \frac{1}{3}`` and ``a,\ b`` chosen
such the stepsize decreases from ``5 \times 10^{-3}`` to ``10^{-3}``\.
This took approximately ``18`` hours on a quad-core machine. The
resulting estimate for the conditional mean, included in Figure
#BG_cm-colored\, represents a considerable visual improvement with a SNR
of ``4.12\,``dB. Compared to the MLE and MAP estimates, the conditional
mean estimate exhibits more continuous reflectors and significantly
fewer artifacts. This example confirms that images yielded by the
conditional mean of inverse problems regularized by the deep prior are
relatively robust to noise.


### Figure: {#BG_imaging-results .wide}
![](figures/Figure_11a.png){width=70% #BG_mle-colored} \
![](figures/Figure_11b.png){width=70% #BG_map-colored} \
![](figures/Figure_11c.png){width=70% #BG_cm-colored}
:Imaging a 2D subset of the Compass model with deep priors. (a) MLE,
i.e., minimizer of equation #imaging-likelihood with respect to ``\delta
\B{m}``, with SNR ``2.80\,``dB. (b) The MAP estimate, i.e., minimizer of
equation #imaging-obj\, following by mapping onto the image space via
``g`` (equation #MAP\), with SNR ``3.91\,``dB. (c) The conditional
(posterior) mean estimate, ``\delta \B{m}_{\text{CM}}``, with SNR
``4.12\,``dB.

The available samples from the posterior also allows us to calculate an
estimate for the pointwise standard deviation of the image. This
quantity is plotted in Figure #BG_std-colored\. To avoid overprint by
the strong reflectors, we also included a plot of the normalized
standard deviation obtained by stabilized division by the conditional
mean. Both plots for the pointwise standard deviations show a distinct
correlation between difficult to image areas of complex geology, such as
channels, and areas affected by relatively poor illumination near the
edges and for the deep parts of the image. In the next section, we will
show how the samples from the posterior inform the task of horizon
tracking.

### Figure: {#BG_imaging-std .wide}
![](figures/Figure_12a.png){width=70% #BG_std-colored} \
![](figures/Figure_12b.png){width=70% #BG_std_normalized}
:Imaging uncertainty quantification corresponding to the Compass model.
(a) The pointwise standard deviation among samples drawn from the
posterior, ``\boldsymbol{\sigma}_{\text{post}}``. (b) Normalized
pointwise standard deviation by the conditional mean estimate (Figure
#BG_cm-colored\).


### Horizon tracking with uncertainty quantification

Similar to the previous horizon tracking example ``14`` horizons are
selected from the conditional mean estimate for the image (Figure
#BG_cm-colored\). As before, control points are picked for each horizon
at the horizontal locations ``62.5``\, ``562.5``\, ``1062.5``\,
``1562.5``\, and ``2062.5\, \mathrm{m}``. Control points with the same
horizontal position are grouped together, yielding five sets of control
points. As before, we distinguish between uncertainties related to noise
and now also linearization errors in the data and uncertainty related to
errors in the control parameters and in the shot data due to noise and
linearization errors.

#### Uncertainty due to noise and linearization errors in the shot data (case 1)

Tracked horizons plus ``99\%`` confidence intervals for the five sets of
control points are included in Figures #BG_cp-10-colored --
#BG_cp-350-colored\. These results were computed by sampling the
posterior distribution for horizon tracking, ``p_{\text{post}} \left
(\B{h} \mid \B{d} \right)``\, following the procedure described above.
Not unexpected, we consistently observe increases in uncertainty as we
move further away from the control points and deeper into the image.
This increase in the size of the confidence interval is due to the
increased variability amongst the samples of the imaging posterior
especially in regions that are more difficult to image, e.g., near the
boundaries of the image, at the deeper parts and near regions of complex
geology.

### Figure: {#BG_horizon-uq-det .wide}
![](figures/Figure_13a.png){width=50% #BG_cp-10-colored}
![](figures/Figure_13b.png){width=50% #BG_cp-90-colored} \
![](figures/Figure_13c.png){width=50% #BG_cp-170-colored}
![](figures/Figure_13d.png){width=50% #BG_cp-250-colored} \
![](figures/Figure_13e.png){width=50% #BG_cp-350-colored}
:Uncertainty in horizon tracking due uncertainties in imaging
(equation #independence-det\). Control points are located at (a) ``62.5\,
\mathrm{m}``, (b) ``562.5\, \mathrm{m}``, (c) ``1062.5\, \mathrm{m}``,
(d) ``1562.5\, \mathrm{m}``, and (e) ``2062.5\, \mathrm{m}`` horizontal
location. Conditional mean estimates and ``99\%`` confidence intervals
are shown in solid and shaded colors, respectively.


#### Uncertainty due to noise, linearization errors, and uncertain control points

Following the procedure described above, we also consider the effect of
randomness in the horizon tracking task itself. We mimic this by
imposing a distribution over the location of control points. As before,
we consider the case where we are equally confident in the location of
control points. The results for the conditional mean and the ``99\%``
confidence interval are presented in Figure #BG_horizon-uq-nondet\. As
expected, uncertainties increase consistently with depth, close to the
boundaries, and in areas of complex geology. Compared to the previous
example, these effects are more pronounced, which we argue is due to an
increase in linearization errors at later times in the shot data. This
increase leads to more uncertainty in deeper sections.

### Figure: {#BG_horizon-uq-nondet .wide}
![](figures/Figure_14.png){width=70% #BG_cm-std-colored}
:Uncertainty in horizon tracking due to a combination of uncertainties
in imaging and control points (equation #sum-horizons-points\).

## Discussion

The examples presented in this paper demonstrate the beneficial
regularization properties of deep priors as long as overfitting of noisy
data is avoided. Unfortunately, preventing overfit is challenging in
practice. To mitigate this issue, we proposed the use of conditional
mean estimates rather than maximum a posteriori (MAP) estimation even
though the former relies on sampling the posterior distribution, which
is computationally expensive. Based on our experience and other studies
[@Cheng_2019_CVPR; @wilson2020bayesian], estimates based on the
conditional mean are comparatively robust to overfitting and yield
superior results.

Even though having access to samples from the posterior has many
advantages, e.g., it gives us access to the conditional mean and
pointwise standard deviation estimates, its computational cost becomes
typically prohibitive for large dimensional problems with expensive
forward modeling operators. By using techniques from stochastic
optimization, we managed to partly offset these costs by avoiding exact
calculation of the multi-source data likelihood function. Similar to
stochastic optimization, sometimes employed to solve wave-equation based
optimization problems [@vanLeeuwen2010IJGswi; @haber2012effective;
@herrmann2012; @lu2015; @tu2015; @li2018], the gradient of the data
misfit is calculated for artificially constructed simultaneous source
experiments. This reduces the number of wave-equation solves for each
gradient calculation significantly. In the context of Langevin
dynamics---the theory undergirding our Markov chain Monte Carlo (MCMC)
method to sample from the posterior---this stochastic approximation
corresponds to the stochastic gradient Langevin dynamics (SGLD) method
proposed by @welling2011bayesian\. This approach, in combination with a
preconditioning scheme and stepsize schedule, eliminates computationally
prohibitive Metropolis-Hasting acceptance steps. By means of several
numerical experiments and empirical accuracy measures, we established
that the proposed SGLD algorithm is capable of drawing samples from the
posterior with a reasonable accuracy. Aside from providing a reasonable
assessment of the uncertainty, with pointwise standard deviation
increasing in complex areas or in areas of relatively poor illumination,
the samples from the posterior also allowed us to propagate
uncertainties due to errors in the shot data all the way to the task of
automatic horizon tracking or other tasks. With very few exceptions, we
are not aware of this type of work on relatively large scale problems
[@interrogation; @CHO2021108333].

While uncertainty quantification based on Bayesian inference certainly
has its merits, it comes at a significant computational price even for
2D problems. These computational costs are compounded by the fact that
the parameterization of seismic images in terms of a deep neural network
is highly overparameterized, making it more difficult to solve the
uncertainty quantification problem. Notwithstanding these challenges,
the use of deep priors has several distinct advantages. First, the
regularization comes from the inductive bias of the network architecture
itself, which is designed to favor natural images. Second, this approach
eliminates the need of having access to training data when compared to
method that really prior information encoded in pretrained networks.
Third, imposing a Gaussian prior on the networks weights is a common
regularization strategy [@krogh1992simple; @Goodfellow-et-al-2016].
Despite these advantages, the number of iterations needed by the SGLD
algorithm remains high and prohibitive for imaging problems in 3D. For
this reason, reducing the computational complexity of Bayesian inference
over the weights of deep prior networks remains an activate area of
research. For instance, @izmailov2020subspace proposes to project the
network weights onto a low-dimensional subspace and perform posterior
inference within this reduced subspace. Unfortunately, construction of
this reduced space can also be costly. For the purpose of applying this
work to 3D seismic imaging, one possibility is to combine this
dimensionality reduction approach with a different form of CNNs that
exhibit similar inductive biases to those that we used in this work
[@Lempitsky] but has fewer weights than the image dimensionality
[@heckel_deep_2019]. As a result, we may be able to exploit the
inductive bias of CNNs at a lower computational cost

Despite the fact that Bayesian inference based on MCMC methods, such as
SGLD, is well-studied and widely employed, it is challenging for
high-dimensional inverse problems that involve expensive forward
operator. Variational (Bayesian) inference [@jordan1999introduction;
@rezende2015variational], also known as distribution learning, can
potentially overcome these challenges. In this approach, a neural
network is trained to synthesize new samples from a target distribution
based on a collection of training samples. Typically these samples are
obtained by applying a series of learned nonlinear functions to random
realizations from a canonical distribution. Early work on variational
inference [@rizzuti2020SEGpub; @siahkoohi2020TRfuqf; @zhang2020seismic;
@kothari2021trumpets; @kovachki2021conditional; @kruse2021hint;
@kumar2021SEGeuq; @siahkoohi2021Seglbe; @siahkoohi2020ABIpto;
@zhao2020bayesian] shows encouraging results, which opens enticing new
perspectives on uncertainty quantification in the field of wave-equation
based inversion.

Uncertainty in solving (linear) inverse problems including seismic
imaging comes in two flavors. On the one hand, there is uncertainty
related to noise in the input (shot) data. On the other hand, there may
be modeling errors, such as linearization errors, which decrease with
the accuracy of the background velocity model. We plan to study how
these two types of uncertainty specifically affect the results in future
work, with a certain regard for the impact of the background velocity
model. Compared to the problem addressed in this paper, this would
entail multiple imaging experiments for different background velocity
models and is therefore more challenging. Recent developments in full
subsurface offset image volumes [@yang2020lrpo] and Fourier neural
operators [@li2021fourier] may prove essential in addressing this
problem.


## Conclusions

Bayesian inference on high-dimensional inverse problems with
computationally expensive forward modeling operators, has been, and
continues to be a major challenge in the field of seismic imaging. Aside
from obvious computational challenges, the selection of effective priors
is problematic given the heterogeneity across geological scenarios and
scales exhibited by elastic properties of the Earth's subsurface. To
limit the possibly heavy-handed bias induced by a handcrafted prior, we
propose regularization via deep priors. During this type of
regularization, seismic images are restricted to the range of an
untrained convolutional neural network with a fixed input, randomly
initialized. Compared to conventional regularization, which tends to
bias solutions towards sometimes restrictive choices made in defining
these prior distributions, nonlinear deep priors derive regularizing
properties from their overparameterized network architecture. The
reparameterization of the seismic image by means of a deep prior leads
to a Bayesian formulation where the prior is a Gaussian distribution of
the weights of the network.

As long as overfitting can be avoided, regularization with deep priors
is known to produce perceptually accurate results, an observation we
confirmed in the context of controlled seismic imaging experiments.
Unfortunately, preventing fitting the noise is difficult in practice. In
addition, there is always the question how errors, e.g.,
bandwidth-limited noise or linearization errors, propagate to
uncertainty in the image and to certain tasks to be carried out on the
image, which for instance includes the task of automatic horizon
tracking. To answer this question and to avoid the issue of overfitting,
we propose to sample from the imaging posterior distribution and use the
samples to compute the conditional mean estimate, which in our
experiments exhibited more robustness to noise, and to obtain confidence
intervals for the tracked horizons via our probabilistic horizon
tracking framework.

Even though drawing samples from the posterior is computationally
burdensome, it allows us to mitigate the imprint of overfitting while it
is also conducive to a systematic framework mapping errors in shot data
to uncertainty on the image and task at hand. By means of two imaging
experiments derived from imaged seismic data volumes, we corroborated
findings in the literature that the conditional mean estimate, i.e., the
average over samples from the posterior distribution on the image, is
more robust to overfitting than the maximum a posteriori estimate. The
latter is the product of deterministic inversion. Aside from improving
the image quality itself with the conditional mean estimate, access to
samples from the posterior also allows us to compute pointwise standard
deviation on the image and confidence intervals on automatically tracked
horizons.

With few exceptions as of yet, no systematic attempts have been made to
account for uncertainties in the task of horizon tracking due to errors
in the seismic imaging itself. These errors are caused by noise,
linearization approximations, and uncertainty in the horizon tracking
process itself, the latter being possibly related to differences in the
selection of control points by different interpreters as part of the
task of automatic horizon tracking.

To validate the proposed probabilistic tasked imaging framework, we
considered realistic scenarios that are representative of two different
geological settings. Our findings include: empirical verification of the
accuracy of the samples from the posterior distribution; establishment
of the conditional mean as a robust estimate for the image; reasonable
estimates for the pointwise standard deviation on the image, showing an
expected increase in variability in complex geological areas and in
areas with poor illumination; and finally confidence intervals for the
automatic horizon tracking given the uncertainty on the image and errors
in the selection of control points guiding automatic horizon tracking.

## Related material

The SGLD iterations (equation #sgld\) require computing gradient of the
negative-log posterior with respect to the CNN weights. This requires
actions of the linearized Born scattering operator and its adjoint. For
maximal numerical performance, the just-in-time [Devito]
[@devito-compiler; @devito-api] compiler was used for the wave-equation
based simulations. To have access to the automatic differentiation
utilities of [PyTorch], we expose Devito's matrix-free implementations
for the migration operator and its adjoint to PyTorch. In this way, we
are able to compute the gradients required by equation #sgld with
automatic differentiation while exploiting Devito's highly optimized
migration and demigration operators. For the CNN architecture, we
followed @Lempitsky\. For the automated horizon tracking we made use of
[software] written by @wu2018least\. For more details on our
implementation, please refer to our code on [GitHub](https://github.com/slimgroup/deep_inference_with_tasks).


[Devito]:https://www.devitoproject.org/
[software]:https://github.com/xinwucwp/mhe
[PyTorch]:https://pytorch.org/

## Acknowledgments

This research was carried out with the support of Georgia Research
Alliance and partners of the ML4Seismic Center.

## Appendix A

### Deep-prior network architecture

The CNN architecture proposed by @Lempitsky is a variation of the widely
used U-net architecture, as described by @ronneberger2015u. The U-net
architecture is composed of an encoder and a decoder module, where
information, i.e., intermediate values, from the encoder module is also
passed to the decoder through skip connections.

The following are the major differences between the deep prior
architecture [@Lempitsky] and the U-net architecture. Contrary to U-net,
the convolutional layers in the decoding module of the deep prior
architecture do not increase the dimensionality of the intermediate
values. Rather, dimensionality-preserving convolutional layers, that is,
stride-one convolutions, are augmented with user-defined interpolation
schemes to achieve upsampling. This enables the degree of smoothness in
the image space to be controlled by the interpolation kernel. Another
difference worth noting is the way intermediate values from the encoding
phase are incorporated into the decoding module through skip
connections. In the deep prior architecture, the intermediate values in
the encoding phase are passed through an additional convolutional layer
before being fed into the decoder module.

Figure #architecture_deep_prior illustrates the exact deep prior
architecture used in the Parihaka example in this paper, which closely
follows the architecture advocated by @Lempitsky\. The blocks in the
figure represent the intermediate values in the network, where the color
indicates the operation that produced these values. For instance, the
lightest gray blocks are intermediate values obtained from applying
two-dimensional convolutions, whereas the darkest gray block with dashed
white edges represents the result of a user-selected interpolation
method, which is in this instance nearest neighbor interpolation. For
further information regarding the rest of the colors, please refer to
the legend in Figure #architecture_deep_prior\. The left most shaded
block in Figure #architecture_deep_prior is the ``m \times n \times c``
input, which in the case of deep priors is a fixed random array, with
``m,\,n`` indicating its horizontal and vertical dimensions, and ``c``
which represents the number of channels. If a layer alters the dimension
of an intermediate value, the new dimensions are inserted adjacent to
the representative block. For instance, the leftmost convolutional layer
that takes in the input noise changes its dimensionality to
``\frac{m}{2} \times \frac{n}{2} \times 16``. All two-dimensional
convolutional layers use ``5 \times 5`` kernels. The convolutional
layers that reduce the horizontal and vertical dimensions have a stride
of two while all the other convolutional layers have a stride of one.
Lastly, we apply a fixed scaler to the output of the network such that
the range of the output values for the CNN with fixed input and randomly
drawn weights ``\B{w} \sim \mathrm{N} \big (\B{w} \mid \B{0},
\lambda^{-2}\B{I} \big )`` covers the a priori known range of amplitudes
in the unknown perturbation model.

### Figure: {#architecture_deep_prior .wide}
![](figures/Figure_15.png){width=100%}
:The architecture used for the Parihaka example.

The Compass example uses the same architecture as Figure
#architecture_deep_prior\, except that it includes an additional
downsampling layer in the encoder and an additional upsampling layer in
the decoder. We found the addition of these layers to be helpful since
the frequency content of the model in the Compass example was higher
and the dimensions larger.


```math_def
\newcommand{\R}{\mbox{$\mathbb{R}$}}
\newcommand{\B}{\mathbf}
```