## Paper Decision

**Decision:** Accept \
**Comment:** We are pleased to inform you that your submission has been accepted. This year each accepted paper will be presented as a 5-min YouTube video. In addition, a number of selected papers will have an additional contributed talk, which will be presented as a 10-12 min pre-recorded talk plus live Q&A in the AABI sessions. Contributed talk acceptance will be notified by Dec 31st 2020. Please submit the camera-ready version of the paper and fill in the following form to submit your video presentation (YouTube link) by Jan 10th 2021: https://forms.gle/AP3DdpnjfYM2jmPu8 Thank you and congratulations again. AABI 2021 Organizers

## Review 1

Interesting work

The paper proposes to initialize the normalizing flow with the weights of a pre-trained low-fidelity NF, so as to reduce the number of costly evaluations of the forward operator. The numerical experiments show the effectiveness of this heuristic. The idea is interesting and natural. A probable next step is to extend it to a multi-fidelity NF framework.

**Rating:** 7: Good paper, accept \
**Confidence:** 3: The reviewer is fairly confident that the evaluation is correct

## Review 2

I'm uncertain if the paper meets the bar when it comes to novelty, clarity and quality of writing.

Contribution:

* Methodological: a "multifidelity preconditioning scheme" for VI using an NF as the variational family:
  1. A normalizing flow is trained by maximum likelihood on "low-fidelity" data samples ``(x,y)`` drawn from a joint distribution ``p_{lf}(x,y)`` that is easy and cheap to sample from and sufficiently similar to the true joint distribution of the model ``p(y|x)p(x)`` to allow for the benefits of transfer learning.
  2. An NF variational approximation ``q(x|y)`` is initialized with the NF pre-trained in step 1 and refined in the VI setting using the high-fidelity but expensive model defining ``p(y|x)``, with the prior distribution set to the approximate posterior from step 1: ``p(x)=q(x|y)``. The addition of the pre-training step (which can also be thought of as amortization) yields significant computational savings relative to performing step 2 from scratch for each datapoint.
* Application of the method to seismic data.

Correctness: I have serious doubts about using the approximate posterior from step 1 as a prior for step 2. I believe that in order to perform "principled" Bayesian inference you should either use a known prior ``p(x)``, or learn ``q(x)`` in step 1 (a non-conditional distribution) and use that as the prior in eq 6.

Novelty: the exact solution presented (using NFs in exactly this way) is novel to me; however, it carries strong similarities to amortized variational inference followed by gradient updates. The novelty remains in the fact that the target for the amortization in step 1 is not exact, and that the use of this method is motivated somewhat differently. Regarding the above, see arXiv:1807.09356 and references therein: "For inference optimization, previous works have combined standard inference models with gradient updates (Hjelm et al., 2016; Krishnan et al., 2018; Kim et al., 2018)."

A slightly simpler but (I think) equivalent set-up that comes to mind, inspired by the above (sketched in code below):

1. Learn an approximation to the posterior with a conditional normalizing flow ``q_\phi(x|y)`` by minimizing ``E_{\hat{\pi}(x,y)}[-\log q_\phi(x|y)]``, and an approximate prior ``q(x)`` by minimizing ``E_{\hat{\pi}(x,y)}[-\log q(x)]``.
2. Initialize the variational approximation to the pre-trained ``q_\phi(x|y)`` and optimize the VI objective wrt ``\phi`` for a fixed data point ``y``.
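For concreteness, a minimal sketch of this two-step set-up (as I read it; not the authors' implementation) could look roughly as follows in PyTorch. The toy single-affine-transform flow and the names ``ConditionalAffineFlow``, ``pretrain``, ``refine``, ``log_likelihood``, ``log_prior`` are placeholders invented purely for illustration:

```python
# Hypothetical sketch of the two-step scheme; not the authors' code.
import math
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """Toy conditional flow: x = mu(y) + exp(log_sigma(y)) * z, with z ~ N(0, I)."""
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_y, hidden), nn.ReLU(), nn.Linear(hidden, 2 * dim_x))

    def _params(self, y):
        mu, log_sigma = self.net(y).chunk(2, dim=-1)
        return mu, log_sigma

    def log_prob(self, x, y):
        # log q_phi(x | y): standard-normal base density plus change-of-variables term.
        mu, log_sigma = self._params(y)
        z = (x - mu) * torch.exp(-log_sigma)
        log_base = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(-1)
        return log_base - log_sigma.sum(-1)

    def sample(self, y):
        # Reparameterized sample x ~ q_phi(x | y) together with its log-density.
        mu, log_sigma = self._params(y)
        x = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        return x, self.log_prob(x, y)


# Step 1: maximum-likelihood pre-training on cheap low-fidelity pairs (x, y).
def pretrain(flow, lf_pairs, lr=1e-3, epochs=100):
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in lf_pairs:
            loss = -flow.log_prob(x, y).mean()   # E_{p_lf(x,y)}[-log q_phi(x|y)]
            opt.zero_grad(); loss.backward(); opt.step()


# Step 2: for a fixed observation, refine the pre-trained flow by maximizing an
# ELBO that calls the expensive high-fidelity log-likelihood log p(y_obs | x).
def refine(flow, y_obs, log_likelihood, log_prior, lr=1e-4, steps=200):
    # y_obs: tensor of shape (n_samples, dim_y), the observation repeated for MC averaging.
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(steps):
        x, log_q = flow.sample(y_obs)
        elbo = log_likelihood(x, y_obs) + log_prior(x) - log_q
        loss = -elbo.mean()
        opt.zero_grad(); loss.backward(); opt.step()
```

A practical implementation would of course use an expressive conditional flow (e.g. coupling layers) rather than a single affine transform; the sketch is only meant to show the structure of the two stages.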
Feedback for authors:

* Overall, the quality of the writing needs to be improved. Many elements are unclear, there are numerous odd words that shouldn't be there, and the paper seems not to have been carefully proofread.
* Eq 2 is hard to read and understand; it is not immediately clear how it minimizes the KL divergence. Please make the exposition a little more verbose.
* "Unlike the objective in Equation (2), training ``G_{\phi}`` does not involve inversion of the forward operator, ``F``." - where is inversion of ``F`` necessary in eq 2 (there seems to be no ``F^{-1}``)? Doesn't ``F`` just need to be differentiable? Is that what you mean by "inversion"?
* "It is important to note that Equation (6) trains a NF specific to the observed data ``y``.": I would add this remark in the introduction.
* I'd put the low-fidelity posterior from eq 5 in Fig 2c (as in the figure in the appendix).
* "The comparison clearly shows the computational superiority of the proposed preconditioning approach.": you should at least say how long the pretraining takes.
* The evaluation is lacking; I would expect a more quantitative analysis of the obtained results (posteriors) than just plotting and visual comparison.
* The discussion in section 4.2 is at times unclear to me.

I feel this work would be a bit more suitable for a workshop like "ML for physical sciences" or a similar venue. The paper should be revised and proofread before publication.

**Rating:** 5: Marginally below acceptance threshold \
**Confidence:** 3: The reviewer is fairly confident that the evaluation is correct