---
title: Adjoint operators enable fast and amortized machine learning based Bayesian uncertainty quantification
author: |
    Rafael Orozco, Ali Siahkoohi, Gabrio Rizzuti, Tristan van Leeuwen and Felix J. Herrmann\
bibliography:
    - report.bib
---

## ABSTRACT:

Machine learning algorithms such as normalizing flows, VAEs, and GANs are powerful tools for Bayesian uncertainty quantification (UQ) of inverse problems. Unfortunately, when using these algorithms, medical imaging practitioners face the challenging task of manually defining neural networks that can handle complicated inputs such as acoustic data. This task needs to be repeated for different receiver types or configurations since these change the dimensionality of the input. We propose to first transform the data with the adjoint operator (for example, time reversal in photoacoustic imaging (PAI) or back-projection in computed tomography (CT)) and then carry out posterior inference with the adjoint data as input, now standardized to the size of the unknown model. This adjoint preprocessing technique has been used in previous works, but with minimal discussion of whether it introduces bias. In this work, we prove that conditioning on adjoint data is unbiased for a certain class of inverse problems. We then demonstrate with two medical imaging examples (PAI and CT) that the adjoint enables two things. First, it partially undoes the physics of the forward operator, resulting in faster convergence of an ML-based Bayesian UQ technique. Second, the algorithm becomes robust to changes in the observed data caused by differing transducer subsampling in PAI and differing numbers of view angles in CT. Our adjoint-based Bayesian inference method produces point estimates that are faster to compute and achieve higher SSIM than traditional baselines, while also providing validated UQ.

## STATEMENT OF PURPOSE:

The efficiency and power of machine learning methods have accelerated the development of inverse-problem solutions in a variety of fields [@ongie2020deep]. Unfortunately, many machine learning methods are black boxes with failure cases that are difficult to interpret, which is one of the reasons their adoption in safety-critical settings is hampered. Our purpose is to improve machine learning (ML) for medical imaging by enabling amortized uncertainty quantification. This is important since applying ML under distribution shifts or with poor training can cause instabilities and even hallucinations leading to incorrect diagnoses [@antun2020instabilities, @bhadra2021hallucinations]. Uncertainty quantification (UQ) aims to address this problem by communicating to practitioners when a model is confident in its result and when it should not be trusted. We describe a practical UQ framework based on amortized variational inference (AVI) and adjoint operators. These two concepts marry powerful data-driven methods with physics knowledge, allowing us to amortize over unseen data and also over different imaging configurations. By amortizing, we mean that the framework generalizes to new data. The particular class of algorithms we study can be trained given only examples of the parameters of interest ``\mathbf{x}`` and their corresponding simulated data ``\mathbf{y}``. In medical imaging, data ``\mathbf{y}`` can contain complex physical phenomena such as acoustic waves in photoacoustic imaging. To a certain extent, an ML-based algorithm learns to ``\mathit{undo}`` these complex physical phenomena when providing an estimate of ``\mathbf{x}``.
This can lead to long training times, which we would like to ameliorate. We are also interested in creating an algorithm that is robust to changes in the dimensionality of the observations ``\mathbf{y}`` caused by changes in the imaging configuration, such as a change in the number of transducers. To solve both problems, we propose preprocessing the data ``\mathbf{y}`` with the adjoint operator. We first show that the posterior given the data is equivalent to the posterior given the adjoint-preprocessed data in the case of linear forward operators and Gaussian noise. Many medical imaging modalities fall into this category: photoacoustic imaging, CT, and MRI. Equipped with this theoretical result, we demonstrate two practical advantages of adjoint preprocessing. First, it accelerates the training convergence of AVI with conditional normalizing flows. This is a welcome result since normalizing flows are notoriously costly to train (40 GPU weeks for the seminal Glow normalizing flow [@kingma2018glow]). Second, the adjoint brings the data to the model space and therefore standardizes the data size, enabling us to learn a single amortized neural network that can sample the posterior for a variety of imaging configurations. We demonstrate these results on two medical imaging applications. These contributions fulfill our stated purpose by describing a practical UQ framework with a theoretical basis.

## METHODS:

### Bayesian Uncertainty Quantification of Inverse Problems:

\noindent At the forefront of uncertainty quantification (UQ) for inverse problems are Bayesian methods [@tarantola2005inverse]. First, we set up the forward problem. Given our quantity of interest ``\mathbf{x} \in \mathcal{X}``, called the model, the forward problem is described by a linear operator ``\mathbf{A}: \mathcal{X} \rightarrow \mathcal{Y}`` whose action on ``\mathbf{x}`` gives observations ``\mathbf{y} \in \mathcal{Y}``. Here, we consider linear problems with an additive Gaussian noise term ``\pmb\varepsilon``, i.e.,
```math #eq:forward
	\mathbf{y} = \mathbf{A}\mathbf{x} + \pmb\varepsilon.
```
Upon observing data ``\mathbf{y}``, traditional methods for inverse problems produce a single point estimate of the ``\mathbf{x}`` that produced ``\mathbf{y}``. In the absence of noise and for invertible operators, this point estimate is enough to describe the solution of the inverse problem. This approach fails to fully characterize the solution space in ill-posed problems [@hadamard1902problemes], where there is no guarantee of a unique solution. In a Bayesian framework, we instead seek the set, or *distribution*, of solutions that all explain the data. This set of solutions is encoded by the conditional distribution ``\mathbf{x}\sim p(\mathbf{x}\mid \mathbf{y})``, called the posterior distribution, and is the end goal of Bayesian inference. It contains all the information needed to estimate ``\mathbf{x}`` given ``\mathbf{y}`` while providing UQ. The posterior distribution can be computed with two main types of algorithms: sampling-based algorithms such as Markov chain Monte Carlo (MCMC) [@robert1999monte], and optimization-based algorithms such as variational inference (VI) [@blei2017variational]. MCMC can be costly due to the amount of sampling required [@gelman1995bayesian], especially in high-dimensional problems. We therefore explore VI to efficiently sample the posterior.
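To make the non-uniqueness concrete, the following minimal NumPy sketch uses a random wide matrix as a stand-in for the forward operator ``\mathbf{A}`` (all names and sizes are illustrative, not our imaging operators) and constructs two different models that explain exactly the same noiseless data of Equation (#eq:forward).

```python
import numpy as np

# Toy illustration of ill-posedness: a wide operator A has a nontrivial null
# space, so distinct models produce identical data.
rng = np.random.default_rng(0)
n, m = 16, 8                              # model dimension larger than data dimension
A = rng.standard_normal((m, n))           # stand-in for a linear forward operator

x1 = rng.standard_normal(n)               # one model that explains the data
null_basis = np.linalg.svd(A)[2][m:]      # rows spanning the null space of A
x2 = x1 + null_basis[0]                   # a different model ...
print(np.allclose(A @ x1, A @ x2))        # ... with the same noiseless data: True

sigma = 0.01
y = A @ x1 + sigma * rng.standard_normal(m)   # noisy data as in Equation (#eq:forward)
```

A single point estimate cannot distinguish between ``x1`` and ``x2``; the posterior assigns probability to both, which is what the variational methods below approximate.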
### Variational Inference for Posterior Distribution Learning:

Our goal is to compute a distribution, so our problem is essentially density estimation of the posterior. VI methods [@blei2017variational] recast density estimation as optimization: they seek a parameterized distribution that best fits the desired distribution. The goodness of fit is typically measured by the Kullback-Leibler (KL) divergence. Among the various VI methods for density estimation, normalizing flows [@dinh2016density] have been shown to be flexible, efficient, and powerful on a variety of distributions, including conditional distributions [@kingma2018glow, @ardizzone2018analyzing]. In our case, the goal is to find the normalizing flow ``f_{\theta}``, parameterized by ``\theta``, that makes the learned conditional distribution ``p_{\theta}(\mathbf{x}\mid \mathbf{y})`` approximate the desired posterior distribution ``p(\mathbf{x}\mid \mathbf{y})``. We measure the goodness of fit with the KL divergence, making our optimization objective
```math #eq:KL
	\hat\theta = \underset{\theta}{\operatorname{arg \, min}} \, \mathbb{KL}\, ( \, p_{\theta}(\mathbf{x}\mid \mathbf{y}) \,\,||\,\, p(\mathbf{x}\mid \mathbf{y}) ).
```
Previous work on VI with normalizing flows involves a costly optimization for each incoming observation ``\mathbf{y}`` [@whang2021composing, @sun2020deep, @rizzuti2020parameterizing, @zhao2022bayesian]. The optimization is costly because the parameters ``\theta`` typically belong to deep neural networks that are expensive to optimize. On top of that, these VI objectives require online use of the forward operator ``\mathbf{A}`` and its adjoint ``\mathbf{A}^{\ast}`` during optimization. With this formulation, VI is not efficient enough to enable real-time inference. Real-time results are particularly important in medical imaging settings since they extend the capabilities of a given modality, for example by enabling the use of hand-held probes [@dima2012non, @buehler2013real]. In general, minimizing turnaround time can make the difference in providing a timely diagnosis [@bauer2013real].

### Amortized Variational Inference with Conditional Normalizing Flows:

A formulation of VI called amortized variational inference (AVI) [@kovachki2020conditional, @kruse2019hint, @orozco2021photoacoustic, @siahkoohi2022wave, @gershman2014amortized] aims to provide fast inference on test data without re-optimizing an objective for each observation. This is accomplished by optimizing a KL-divergence-based objective that is ``\mathit{amortized}`` over different data ``\mathbf{y}`` sampled from the distribution ``p(\mathbf{y})``:
```math #eq:KLamort
	\hat\theta = \operatorname{arg \,min}_{\theta} \, \mathbb{E}_{\mathbf{y} \sim {p(\mathbf{y})}} \left[ \mathbb{KL} \, ( p(\mathbf{x}\mid \mathbf{y}) \,||\, p_{\theta}(\mathbf{x}\mid \mathbf{y}) ) \right] .
```
In this form, the objective cannot be implemented directly since it would require access to the ground-truth posterior distribution ``p(\mathbf{x}\mid\mathbf{y})``. Instead, one can optimize a simplified objective that only requires samples from the joint distribution ``\mathbf{x},\mathbf{y} \sim p(\mathbf{x},\mathbf{y})``.
For a conditional normalizing flow (CNF) ``f_{\theta}``, the AVI objective becomes
```math #eq:KLamortML
	\hat\theta = \underset{\theta}{\operatorname{arg \, min}} \, \frac{1}{N} \sum_{n=1}^{N} \left( \frac{1}{2}\lVert f_{\theta}(\mathbf{x}^{(n)};\mathbf{y}^{(n)}) \rVert_{2}^2 - \log \left| \det{ \mathbf{J}_{f_{\theta}}} \right| \right),
```
where ``N`` is the size of the training dataset and ``\mathbf{J}_{f_{\theta}}`` is the Jacobian of ``f_{\theta}`` with respect to ``\mathbf{x}``, evaluated at ``(\mathbf{x}^{(n)};\mathbf{y}^{(n)})``. This objective is derived in [@radev2020bayesflow] and in [@siahkoohi2022reliable], with a discussion of its relationship to the forward and reverse KL divergences. The objective is particularly simple to implement with CNFs since, by design, they allow tractable computation of the Jacobian determinant ``\det{ \mathbf{J}_{f_{\theta}}}``.

Our goal is to amortize our CNF over a variety of imaging configurations. We must therefore account for the fact that, for a model ``\mathbf{x}`` of fixed size, a different imaging configuration ``i`` (here encoded in the forward operator ``\mathbf{A_i}``) changes the size of the data ``\mathbf{y_i}``. These different data sizes can arise from changing receiver settings such as the number of receivers or view angles. To handle a set of imaging configurations ``i = 1,\dots,M`` (where ``M`` is the number of configurations), one would need to define a separate network ``f_{\theta_i}`` for each of the ``M`` configurations, e.g., by manually defining downsampling layers. Standardizing observations to a single size allows practitioners to define a single network ``f_{\theta}``, since all inputs to the network then have the same size regardless of the imaging configuration. The adjoint ``\mathbf{A^{\ast}_i}`` offers a physically consistent way of standardizing data inputs to a single size, namely the size of the model.

#### Adjoint:
Given a linear operator ``\mathbf{A}: \mathcal{X} \rightarrow \mathcal{Y}`` between inner product spaces, the adjoint ``\mathbf{A}^{\ast}: \mathcal{Y} \rightarrow \mathcal{X}`` is the operator that satisfies ``\langle \mathbf{y} , \mathbf{A}\mathbf{x} \rangle = \langle \mathbf{A}^{\ast}\mathbf{y}, \mathbf{x} \rangle`` for all ``\mathbf{x}`` and ``\mathbf{y}``. Standardizing the size of incoming data is related to the concept of a summary statistic [@deans2002maximally, @radev2020bayesflow], so we can interpret the adjoint as a physics-based summary statistic. We will show that, on top of being robust to different imaging configurations, the adjoint also accelerates convergence of the training objective in Equation (#eq:KLamortML). Before demonstrating these two practical advantages, we present our main contribution: a theoretical discussion of when preprocessing data with the adjoint does not affect the result of posterior inference.
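Before turning to the theory, note that a numerical implementation of an adjoint pair, such as wave simulation and time reversal or projection and back-projection, is commonly verified with a dot-product test. The minimal NumPy sketch below uses a random matrix as a stand-in for ``\mathbf{A}``; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 64, 48                        # model-space and data-space dimensions (toy sizes)
A = rng.standard_normal((ny, nx))      # stand-in for a linear forward operator A: X -> Y

x = rng.standard_normal(nx)            # random model-space vector
y = rng.standard_normal(ny)            # random data-space vector

# Dot-product (adjoint) test: <y, A x> should equal <A* y, x> up to round-off.
lhs = np.dot(y, A @ x)
rhs = np.dot(A.T @ y, x)
print(abs(lhs - rhs) / abs(lhs))       # relative mismatch, near machine precision
```

In the imaging examples below, `A.T` is replaced by the physics-based adjoint (time reversal in PAI, back-projection in CT), and the same test applies.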
### Adjoint Preprocessing is Unbiased for Linear Problems with Gaussian Noise:

As noted in the previous section, there are good reasons to use the adjoint operator as a preprocessor. This leads to an important question, which we phrase using the language of [@baptista2022bayesian]: can we condition on adjoint data without introducing bias into the inference procedure? We answer this question in the affirmative by using Proposition 1 from [@adler2022task] and specifying a class of inverse problems that satisfies it under adjoint preprocessing.

#### Proposition 1 [@adler2022task]:
If ``\mathcal{B}`` is injective on the range of ``\Pi``, then ``p(\mathbf{x} \mid \mathcal{B}\,\Pi \, \mathbf{y} )`` equals ``p(\mathbf{x} \mid \mathbf{y})`` if and only if the information lost by observing ``\Pi \mathbf{y}`` instead of ``\mathbf{y}`` is conditionally independent of ``\mathbf{x}`` given ``\Pi \mathbf{y}``:
```math #eq:prop1cond
	p(\mathbf{x} \mid \mathcal{B}\,\Pi \, \mathbf{y} ) = p(\mathbf{x} \mid \mathbf{y}) \iff \mathbf{x} \bigCI \mathbf{y} - \Pi \mathbf{y} \mid \Pi \mathbf{y}.
```

#### Proposition 1a:
Given data ``\mathbf{y}`` generated as in Equation (#eq:forward), the posterior conditioned on adjoint-preprocessed data, ``p(\mathbf{x}\mid \mathbf{A}^{\ast}\mathbf{y})``, equals the original posterior ``p(\mathbf{x}\mid \mathbf{y})`` if the additive noise is Gaussian.

#### Proof:
We apply Proposition 1 with ``\mathcal{B}=\mathbf{A}^{\ast}`` and ``\Pi`` the orthogonal projector onto ``\mathrm{ran}(\mathbf{A})``, i.e., ``\Pi=\mathbf{A}\mathbf{A}^{+}``, where ``\mathbf{A}^{+}`` is the Moore-Penrose pseudoinverse [@penrose1955generalized]. Note that ``\mathbf{A}^{\ast}`` is injective on ``\mathrm{ran}(\mathbf{A})`` since ``\ker(\mathbf{A}^{\ast})=\mathrm{ran}(\mathbf{A})^{\perp}``, so the proposition applies. Since ``p(\mathbf{x} \mid \mathcal{B} \, \Pi \mathbf{y} ) = p(\mathbf{x} \mid \mathbf{A}^{\ast} \mathbf{A} \mathbf{A}^{+} \mathbf{y} ) = p(\mathbf{x} \mid \mathbf{A}^{\ast} \mathbf{y} )`` (because ``\mathbf{I}-\mathbf{A}\mathbf{A}^{+}`` projects onto ``\mathrm{ran}(\mathbf{A})^{\perp}=\ker(\mathbf{A}^{\ast})``, so ``\mathbf{A}^{\ast}\mathbf{A}\mathbf{A}^{+}=\mathbf{A}^{\ast}``), the proof is complete if we establish the conditional independence in Equation (#eq:prop1cond): ``\mathbf{x} \bigCI \mathbf{y} - \mathbf{A}\mathbf{A}^{+}\mathbf{y} \mid \mathbf{A}\mathbf{A}^{+}\mathbf{y}``. To show this, we note that Gaussian noise can be decomposed as the sum of two independent components, one that lives in ``\mathrm{ran}(\mathbf{A})`` (the range of ``\mathbf{A}``) and one that lives in its orthogonal complement ``\mathrm{ran}(\mathbf{A})^{\perp}``: ``\pmb\varepsilon = \pmb\varepsilon_{\mathrm{ran}} + \pmb\varepsilon_{\perp}``. Assuming this structure:
```math #eq:derivation
\begin{aligned}
& \mathbf{x} \bigCI \mathbf{y} - \mathbf{A}\mathbf{A}^{+}\mathbf{y} \mid \mathbf{A}\mathbf{A}^{+}\mathbf{y} \\
\iff \, & \mathbf{x} \bigCI (\mathbf{A}\mathbf{x} + \pmb\varepsilon_{\perp} + \pmb\varepsilon_{\mathrm{ran}}) - \mathbf{A}\mathbf{A}^{+}(\mathbf{A}\mathbf{x} + \pmb\varepsilon_{\perp} + \pmb\varepsilon_{\mathrm{ran}}) \mid \mathbf{A}\mathbf{A}^{+}(\mathbf{A}\mathbf{x} + \pmb\varepsilon_{\perp} + \pmb\varepsilon_{\mathrm{ran}}) \\
\iff \, & \mathbf{x} \bigCI \mathbf{A}\mathbf{x} + \pmb\varepsilon_{\perp} + \pmb\varepsilon_{\mathrm{ran}} - \mathbf{A}\mathbf{A}^{+}\mathbf{A}\mathbf{x} - \pmb\varepsilon_{\mathrm{ran}} \mid \mathbf{A}\mathbf{A}^{+}\mathbf{A}\mathbf{x} + \pmb\varepsilon_{\mathrm{ran}} \\
\iff \, & \mathbf{x} \bigCI \pmb\varepsilon_{\perp} \mid \mathbf{A}\mathbf{x} + \pmb\varepsilon_{\mathrm{ran}}.
\end{aligned}
```
By the d-separation criterion [@pearl1988probabilistic], the final statement is true because ``\pmb\varepsilon_{\perp}`` is independent of all other variables, including ``\pmb\varepsilon_{\mathrm{ran}}``, as per the assumption. Thus, for linear problems with additive Gaussian noise, ``p(\mathbf{x} \mid \mathbf{A}^{\ast} \mathbf{y} ) = p(\mathbf{x} \mid \mathbf{y})``, meaning adjoint preprocessing does not change the result of posterior inference. The assumption of additive Gaussian noise covers several medical imaging modalities, including our applications in photoacoustic imaging and CT [@gravel2004method]. Beyond non-Gaussian noise, our proof does not cover multiplicative noise or noise correlated with ``\mathbf{A}`` or ``\mathbf{x}``; we leave these cases for future work.
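Proposition 1a can also be checked numerically in a setting where both posteriors are available in closed form. The sketch below is a minimal, self-contained example assuming a Gaussian prior and a random full-column-rank matrix in place of ``\mathbf{A}`` (illustrative stand-ins, not our imaging operators); it verifies that conditioning on ``\mathbf{y}`` and on ``\mathbf{A}^{\ast}\mathbf{y}`` yields the same posterior mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 20                          # model and data dimensions (toy sizes)
A = rng.standard_normal((m, n))       # full-column-rank stand-in for the forward operator
sigma2 = 0.1                          # noise variance
Sig0 = np.eye(n)                      # Gaussian prior x ~ N(0, Sig0)

def gaussian_posterior(B, Sig_noise, z, Sig_prior):
    """Mean and covariance of p(x | z) for z = B x + eta, eta ~ N(0, Sig_noise)."""
    P = np.linalg.inv(np.linalg.inv(Sig_prior) + B.T @ np.linalg.inv(Sig_noise) @ B)
    mu = P @ (B.T @ np.linalg.inv(Sig_noise) @ z)
    return mu, P

x_true = rng.standard_normal(n)
y = A @ x_true + np.sqrt(sigma2) * rng.standard_normal(m)

# Posterior given the raw data:      y     = A x       + eps,      eps     ~ N(0, sigma2 I)
mu_y, P_y = gaussian_posterior(A, sigma2 * np.eye(m), y, Sig0)

# Posterior given the adjoint data:  A^T y = (A^T A) x + A^T eps,  A^T eps ~ N(0, sigma2 A^T A)
mu_a, P_a = gaussian_posterior(A.T @ A, sigma2 * (A.T @ A), A.T @ y, Sig0)

print(np.allclose(mu_y, mu_a), np.allclose(P_y, P_a))   # True True
```

The same equality motivates training on ``(\mathbf{x}, \mathbf{A}^{\ast}\mathbf{y})`` pairs in the experiments that follow.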
## RESULTS:

We show three main results. First, adjoint preprocessing accelerates the convergence of conditional normalizing flow training for posterior sampling. Second, the adjoint operator enables amortization over varying imaging configurations while using the same underlying neural network. Finally, we validate the learned UQ by demonstrating posterior consistency through three tests.

### Adjoint Accelerates Convergence:

We motivate adjoint preprocessing by hand-crafting a photoacoustic simulation and CNF architecture to compare two scenarios: learning ``p_{\theta}(\mathbf{x} \mid \mathbf{y})`` by training on pairs ``\mathbf{D}= \{ (\mathbf{x}^{(n)}, \mathbf{y}^{(n)}) \}_{n=1}^{N}``, or adjoint-preprocessed learning of ``p_{\theta}(\mathbf{x} \mid \mathbf{A}^{\ast}\mathbf{y})`` with ``\mathbf{D^{\ast}}= \{ (\mathbf{x}^{(n)}, \mathbf{A}^\ast\mathbf{y}^{(n)}) \}_{n=1}^{N}``. As noted in our motivation, creating a CNF ``f_{\theta}`` that can accept raw data ``\mathbf{y}`` as an input is a laborious task, but we do so for one imaging configuration to provide a fair comparison. This involves defining a physical simulation and an architecture that bring the data ``\mathbf{y} \in \mathbb{R}^{N_t \times N_{\mathrm{rec}}}`` (where ``N_t`` is the number of time steps and ``N_{\mathrm{rec}}`` the number of receivers) to the same size as the model ``\mathbf{x} \in \mathbb{R}^{N_x\times N_y}`` (where ``(N_x, N_y)`` is the 2D grid size). To accomplish this, we set the number of receivers equal to the number of grid points along one side, ``N_{\mathrm{rec}}=N_y``, so that the raw data matrix ``\mathbf{y}`` has the same number of columns as the model ``\mathbf{x}``. We then include a learned downsampling layer ``d_{\theta}`` that reduces the number of rows to the number of grid points along the other side, ``N_x``: ``d_{\theta}(\mathbf{y}) = \hat{\mathbf{y}} \in \mathbb{R}^{N_x\times N_y}``. We then train two CNFs whose underlying networks have the same architecture and are trained with the same hyperparameters.

### Figure: {#fig:convergence .wide}
![](figs/photo_obj_log.png){width=45%}
![](figs/photo_cm_log.png){width=45%}
Convergence plots. (a) Posterior learning objective for data without (dashed) and with adjoint (``\mathbf{A}^{\ast}``) preprocessing. The adjoint accelerates convergence. (b) MSE of the conditional mean, showing improved Bayesian inference with less training time (compare the solid and dashed lines).

Figure #fig:convergence shows that, for equivalent architectures and training, learning ``p_{\theta}(\mathbf{x}\mid\mathbf{A}^{\ast}\mathbf{y})`` converges faster than learning ``p_{\theta}(\mathbf{x}\mid\mathbf{y})``. We also plot the mean squared error (MSE) between the estimated conditional mean ``\mathbf{x_{cm}}`` and the ground truth ``\mathbf{x_{gt}}``. While MSE is not the training objective, it is an important proxy since the conditional mean of the true posterior is the estimator that minimizes the expected squared error [@banerjee2005optimality, @adler2018deep]. Here and throughout, the conditional mean is estimated from 64 generated samples of the posterior. The training logs in Figure #fig:convergence are averaged over an unseen validation set of ``N_{\mathrm{val}}=192`` pairs created by a 10\% split from the training dataset of ``N=2048`` pairs.
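For reference, the sketch below outlines how adjoint-preprocessed training pairs ``\mathbf{D^{\ast}}`` and the objective of Equation (#eq:KLamortML) fit together. It is a toy illustration only: `forward`, `adjoint`, and `cnf` are simple stand-ins (a random matrix, its transpose, and a single conditional affine layer), not the acoustic simulation, time reversal, or multi-layer CNF used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a random matrix plays the role of the photoacoustic forward
# operator A; its transpose plays the adjoint A*.
nx = 16 * 16                                  # flattened model size (16 x 16 grid)
nrec, nt = 16, 32                             # receivers and time samples
A = rng.standard_normal((nrec * nt, nx)) / np.sqrt(nx)

def forward(x):  return A @ x                 # y = A x   (stand-in simulation)
def adjoint(y):  return A.T @ y               # A* y      (stand-in time reversal)

# Build adjoint-preprocessed training pairs D* = {(x, A* y)}.
N, sigma = 128, 0.01
X = rng.standard_normal((N, nx))              # prior samples (would be vessel images)
Yadj = np.stack([adjoint(forward(x) + sigma * rng.standard_normal(nrec * nt)) for x in X])

# A deliberately simple "conditional flow": an elementwise affine map whose shift
# depends on the conditioning input. Real CNFs stack many such invertible layers;
# one layer keeps the log-determinant explicit.
log_s = np.zeros(nx)                          # learnable log-scales (initialized at 0)
W = 0.01 * rng.standard_normal((nx, nx))      # learnable conditioning weights

def cnf(x, y_adj):
    z = np.exp(log_s) * x + W @ y_adj         # invertible in x for fixed y_adj
    logdet = np.sum(log_s)                    # log |det J| of the map x -> z
    return z, logdet

# Objective of Equation (#eq:KLamortML), averaged over the training pairs.
def avi_objective(X, Yadj):
    losses = [0.5 * np.sum(z ** 2) - logdet
              for z, logdet in (cnf(x, y_adj) for x, y_adj in zip(X, Yadj))]
    return np.mean(losses)

print(avi_objective(X, Yadj))                 # value to be minimized over (log_s, W)
```

In the actual experiments the stand-ins are replaced by the acoustic simulation, its time-reversal adjoint, and a multi-layer conditional normalizing flow, but the data flow and the loss take the same form.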
### Adjoint Amortizes Different Imaging Configurations:

Preprocessing observed data with the adjoint allows us to use the same network to learn the posterior for different imaging configurations. As a proof of concept, we amortize over four photoacoustic imaging configurations consisting of data collected with 8, 16, 32, and 64 receivers. These configurations are encoded by the forward and adjoint operators ``\mathbf{A_{i}},\mathbf{A^{\ast}_{i}}`` with ``i\in\{8,16,32,64\}``. We train our CNF with ``N=2048`` examples for each imaging configuration, ``\mathbf{D}_{i}^{\ast}= \{ (\mathbf{x}^{(n)}, \mathbf{A_i^\ast}\mathbf{y_i}^{(n)}) \}_{n=1}^{N}``. Training took 14 hours on a P1000 GPU with 4 GB of memory.

### Figure: {#fig:amort .wide}
![](figs/amort_numbered.jpeg){#sfigpost:a width=80%}
![](figs/tv.jpeg){#fig:tv width=9.4%}
(a) Amortized training of a single neural network capable of sampling from posterior distributions for differently sized observations ``\mathbf{y_{i}}``. As the number of receivers increases, the samples show posterior contraction, a Bayesian phenomenon [@ghosal2017fundamentals] whereby more data decreases the width of the posterior. (b) The baseline method (TV-projected gradient descent) fails to image vertical vessels.

After training, we sample from the learned posterior for unseen test data ``\mathbf{y_i}`` with varying numbers of receivers. As expected, we observe posterior contraction [@ghosal2017fundamentals]. Visually, this is confirmed in Figure #sfigpost:a since the uncertainty (quantified via the pointwise variance) decreases when we increase the number of receivers from 16 to 64. To quantitatively capture the global variation of these UQ images, we use the sum of pointwise variances, Var Sum = ``\lVert\mathrm{Var}\rVert_1``, or equivalently the trace of the covariance matrix [@oja1999affine]. Importantly, uncertainty is high where we expect it, namely for vessels that are close to vertical. These vertical features lie in the null space of the forward operator because the receivers are located only at the top of the model.

### Figure: {#tab:timing .wide}
![](figs/table.png){width=100%}
Point-estimate timing and quality-metric comparison.

We emphasize that posterior sampling after training is fast. The computational cost is dominated by the single adjoint (0.8 seconds on an Intel Skylake CPU). The CNF is conditioned on ``\mathbf{A^\ast_i}\mathbf{y_i}``, after which the user generates the desired number of posterior samples (10 milliseconds per sample on our GPU). The time to create a point estimate with the conditional mean ``\mathbf{x_{cm}}`` (2 seconds with 64 posterior samples) compares favorably with traditional least-squares approaches that require several forward and adjoint evaluations. We compare our point estimate ``\mathbf{x_{cm}}`` with the TV-projected gradient descent (TV-PGD) estimate ``\mathbf{x_{tv}}`` [@zhang2012total, @schwab2018real]. Timing and quality metrics averaged over a test set of 96 samples are shown in Figure #tab:timing\. Our point estimate achieves higher SSIM while requiring less time to compute.
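To make the inference-time workflow explicit, the sketch below shows how the conditional mean ``\mathbf{x_{cm}}`` and the pointwise-variance image (with its sum, Var Sum) are formed from posterior samples. Here `sample_posterior` is a hypothetical handle to the trained CNF's conditional sampler, and the dummy sampler at the bottom exists only so the sketch runs; the surrounding statistics are plain NumPy.

```python
import numpy as np

def posterior_statistics(sample_posterior, y_adj, num_samples=64):
    """Summarize the learned posterior p_theta(x | A* y) from generated samples.

    sample_posterior : callable mapping adjoint data to one posterior sample
                       (hypothetical interface to the trained CNF).
    y_adj            : adjoint-preprocessed observation A* y (2D array).
    """
    samples = np.stack([sample_posterior(y_adj) for _ in range(num_samples)])
    x_cm = samples.mean(axis=0)        # conditional-mean point estimate
    var = samples.var(axis=0)          # pointwise-variance UQ image
    var_sum = var.sum()                # Var Sum, i.e., trace of the sample covariance
    return x_cm, var, var_sum

# Example usage with a dummy sampler standing in for the trained CNF:
rng = np.random.default_rng(0)
dummy_sampler = lambda y_adj: y_adj + 0.1 * rng.standard_normal(y_adj.shape)
x_cm, var, var_sum = posterior_statistics(dummy_sampler, rng.standard_normal((64, 64)))
print(x_cm.shape, var_sum)
```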
### Validation of Uncertainty Quantification:

Since our problem does not have Gaussian priors, we cannot analytically verify that our converged posteriors have reached the ground-truth posterior. For validation, we therefore conduct three tests: *(i) posterior contraction*: we test the CNF for different data sizes to check whether the posterior contracts towards the ground truth as the amount of observed data increases; this contraction points to posterior consistency [@ghosal2017fundamentals]; *(ii) posterior calibration*: we check whether our UQ correlates with regions of large error in the image. This important check is called calibration [@guo2017calibration] and is established qualitatively by visually juxtaposing errors with UQ in Figure #sfigpost:a\. For a more quantitative test, we plot a calibration line using ``\sigma``-scaling [@laves2020well]; *(iii) simulation-based calibration (SBC)*: we test for uniformity of the rank statistic obtained by comparing samples drawn from the learned posterior with the known prior [@talts2018validating, @radev2020bayesflow].

### Figure: {#fig:tests .wide}
![](figs/photo_contraction.png){width=24%}
![](figs/photo_contraction_error.png){width=24%}
![](figs/photo_calibration.png){width=21.5%}
![](figs/photo_sbc.png){width=24%}
Validation of uncertainty quantification. (a) Posterior contraction when increasing the amount of data. (b) Posterior contraction towards the ground truth as measured by MSE. (c) Calibration plot confirming that uncertainty correlates with error. (d) The SBC test indicates a calibrated posterior.

The results of these three tests are shown in Figure #fig:tests and indicate that our learned posterior is consistent and approximates the true posterior. For this reason, we argue that a practitioner is justified in using it for uncertainty quantification.

## NEW WORK TO BE PRESENTED:

We have two main contributions: *(i)* a theoretical discussion of when adjoint preprocessing gives an equivalent posterior, and *(ii)* a demonstration of the advantages of adjoint preprocessing on two realistic medical imaging applications. (CT examples are omitted here due to space constraints; they will be included in the full manuscript.)

## CONCLUSIONS:

For linear operators and Gaussian noise, we prove that the posterior conditioned on adjoint-preprocessed data is equivalent to the original posterior, ``p(\mathbf{x}\mid\mathbf{A}^{\ast}\mathbf{y}) = p(\mathbf{x}\mid\mathbf{y})``. Although the distributions are ultimately the same, learning them poses different computational burdens for ML training. We showed that the adjoint accelerates the convergence of CNFs for AVI. We expect the same gains to carry over to other ML-based methods such as GANs, VAEs, or recently developed diffusion models [@saharia2022photorealistic]. We also demonstrate that the adjoint allows us to train a single network that handles many different imaging configurations, saving the engineering cost of designing network architectures for individual configurations. Our amortized posterior gives physically meaningful uncertainties that we also validate.

## References

```math_def
\def\argmin{\mathop{\rm arg\,min}}
\newcommand{\Jmat}{\nabla\mathbf{F}}
\newcommand{\dm}{\delta\mvec}
\newcommand{\mvec}{\mathbf{m}}
\newcommand{\bigCI}{\mathrel{\text{$\perp\mkern-10mu\perp$}}}
```