---
title: Devito for large-scale elastic modelling and anisotropic inversion.
author: |
    Mathias Louboutin^1^, Fabio Luporini^2^, Philipp Witte^1^, Rhodri Nelson^2^, George Bisbas^2^, Jan Thorbecke^3^, Felix J. Herrmann^1^ and Gerard Gorman^1^ \
    ^1^School of Computational Science and Engineering, Georgia Institute of Technology \
    ^2^Imperial College London \
    ^3^TU-Delft
bibliography:
    - sc20_paper.bib
---

## Abstract

[Devito] is an open-source Python project based on domain-specific language and compiler technology. Driven by the requirements of rapid HPC application development in exploration seismology, the language and compiler have evolved significantly since inception. Sophisticated boundary conditions, tensor contractions, sparse operations and features such as staggered grids and sub-domains are all supported; operators of essentially arbitrary complexity can be generated. To accommodate this flexibility whilst ensuring performance, data dependency analysis is utilized to schedule loops and detect computational properties such as parallelism. In this article, the generation and simulation of MPI-parallel propagators (along with their adjoints) for the pseudo-acoustic wave-equation in tilted transverse isotropic media and the elastic wave-equation are presented. Simulations are carried out on industry-scale synthetic models on an HPC Cloud system and reach a performance of 28 TFLOP/s, hence demonstrating Devito's suitability for production-grade seismic inversion problems.

## Introduction

Seismic imaging methods such as Reverse Time Migration (RTM) and Full-waveform inversion (FWI) rely on the numerical solution of an underlying system of partial differential equations (PDEs), most commonly some manifestation of the wave-equation. In the context of FWI, the finite-difference (FDM) and the spectral-element (SEM) methods are most frequently used to solve the wave-equation, with FDM methods dominating within the seismic exploration community [@lyu2020]. Various forms of the wave-equation and modelling strategies for FWI are detailed in [@fichtner2011]. Although the theory of FWI dates back to the 1980s [@tarantola], one of the first successful applications to real 3D data was presented in [@sirgue]. Other studies utilizing FDM within the FWI workflow include [@ratcliffe2011; @petersson2013]. The aforementioned studies approximate the underlying physics via the acoustic wave-equation; higher fidelity models solving the non-isotropic elastic wave-equation have been developed in, e.g., [@osti_1468379; @osti_1561580; @osti_1561581; @sava1; @sava2]. Owing to the flexibility of the mathematical discretizations that can be utilized, along with the ability to describe problems on complex meshes, there has also been a great deal of interest in utilizing SEM to solve inversion problems [@peter2011; @krebsdg]. The recent study [@trinh2019] presents an efficient SEM-based inversion scheme using a viscoelastic formulation of the wave-equation. It is generally accepted that the PDE solver utilized within an inversion workflow must satisfy the following criteria [@virieuxmodelling]:

- efficiency for multiple-source modelling;
- moderate memory requirements;
- the ability of the parallel algorithm to scale to an increasing number of processors;
- the ability of the method to handle models with arbitrary levels of heterogeneity;
- the potential to reduce the nonlinearity of FWI;
- the feasibility of extending the modelling approach to more realistic physical descriptions of the media.
It is with these specifications in mind that Devito, a symbolic domain-specific language (DSL) and compiler for the automatic generation of finite-difference stencils, has been developed. Originally designed to accelerate research and development in exploration geophysics, the high-level interface, previously described in detail in [@devito-api], is built on top of `SymPy` [@sympy] and is inspired by the underlying mathematical formulations and by other DSLs such as FEniCS [@fenics] and Firedrake [@firedrake]. This interface allows the user to formulate wave-equations and, more generally, time-dependent PDEs in a simple and mathematically coherent way. The [Devito] compiler then automatically generates finite-difference stencils from these mathematical expressions. One of the main advantages of [Devito] over other finite-difference DSLs is that generic expressions such as sparse operations (i.e. point source injection or localized measurements) are fully supported and expressible in a high-level fashion. The second component of [Devito] is its compiler (cf. [@devito-compiler]), which generates highly optimized C code. The generated code is then compiled at runtime for the hardware at hand.

Previous work focused on the DSL and compiler to highlight the potential applications and use cases of Devito. Here, we present a series of extensions and applications to large-scale three-dimensional problem sizes as encountered in exploration geophysics. These experiments are carried out on Cloud-based HPC systems and include elastic forward modelling using distributed-memory parallelism and imaging based on the tilted transverse isotropic (TTI) wave-equation ([@virieux; @thomsen1986; @zhang-tti; @duveneck; @louboutin2018segeow]). These proofs of concept highlight two critical features: first, the ability of the symbolic interface and the compiler to carry over to large-scale adjoint-based inversion problems that require massive compute (since thousands of PDEs are solved) as well as large amounts of memory (since the adjoint-state method requires the forward wavefield to be saved in memory). Secondly, through the elastic modelling example, we demonstrate that [Devito] now fully supports and automates vectorial and second-order tensorial staggered-grid finite-differences with the same high-level API previously presented for scalar fields defined on Cartesian grids.

This paper is organized as follows: first, we provide a brief overview of Devito and its symbolic API and present the distributed-memory implementation that allows large-scale modelling and inversion by means of domain decomposition. We then provide a brief comparison with a state-of-the-art hand-coded wave propagator to validate the performance previously benchmarked with the roofline model ([@patterson; @devito-compiler; @devito-api; @louboutin2016ppf]). Before concluding, results from the Cloud-based experiments discussed above are presented, highlighting the vectorial and tensorial capabilities of Devito.

## Overview of Devito

Devito [@devito-api] provides a functional language built on top of `SymPy` [@sympy] to symbolically express PDEs at a mathematical level and implements automatic discretization with the finite-difference method. The language is by design flexible enough to express other types of non finite-difference operators, such as interpolation and tensor contractions, that are inherent to measurement-based inverse problems.
Several additional features are supported, among which are staggered grids, sub-domains, and stencils with custom coefficients. The last major building block of a solid PDE solver is the boundary conditions, which for finite-difference methods are notoriously diverse and often complicated. The system is, however, sufficiently general to express them through a composition of core mechanisms. For example, free surface and perfectly-matched layer (PML) boundary conditions can be expressed as equations -- just like any other PDE equations -- over a suitable sub-domain.

It is the job of the [Devito](https://github.com/devitocodes/devito) compiler to translate the symbolic specification into C code. The lowering of the input language to C consists of several compilation passes, some of which introduce performance optimizations that are key to fast code. Next to classic stencil optimizations (e.g., cache blocking, alignment, SIMD and OpenMP parallelism), [Devito] applies a series of FLOP-reducing transformations as well as aggressive loop fusion. For a complete treatment, the interested reader should refer to [@devito-compiler].

### Symbolic language and compilation

In this section we illustrate the [Devito] language by demonstrating an implementation of the acoustic wave-equation in isotropic media

```math {#acou}
\begin{cases}
m \frac{d^2 u(t, x)}{dt^2} - \Delta u(t, x) = \delta(x - x_s) q(t) \\
u(0, .) = \frac{d u(t, x)}{dt}(0, .) = 0 \\
d(t, .) = u(t, x_r).
\end{cases}
```

The core of the Devito symbolic API consists of three classes:

- `Grid`, a representation of the discretized model.
- `(Time)Function`, a representation of spatially (and time-) varying variables defined on a `Grid` object.
- `Sparse(Time)Function`, a representation of (time-varying) point objects on a `Grid` object, generally unaligned with respect to the grid points, hence called "sparse".

```python
from devito import Grid

grid = Grid(shape=(nx, ny, nz), extent=(ext_x, ext_y, ext_z),
            origin=(o_x, o_y, o_z))
```

where `(nx, ny, nz)` are the numbers of grid points in each direction, `(ext_x, ext_y, ext_z)` is the physical extent of the domain in physical units (e.g., meters) and `(o_x, o_y, o_z)` is the origin of the domain in the same physical units. The object `grid` contains all the information related to the discretization, such as the grid spacing. We use `grid` to create the symbolic objects that will be used to express the wave-equation. First, we define a spatially varying model parameter `m` and a time-space varying field `u`

```python
from devito import Function, TimeFunction

m = Function(name="m", grid=grid, space_order=so)
u = TimeFunction(name="u", grid=grid, space_order=so, time_order=to)
```

where `so` is the order of the spatial discretization and `to` the time discretization order used when generating the finite-difference stencil. Next, we define the point source object (`src`) located at the physical coordinates ``x_s``, and the receiver (measurement) object (`d`) located at the physical coordinates ``x_r``

```python
from devito import SparseFunction, SparseTimeFunction

src = SparseFunction(name="src", grid=grid, npoint=1, coordinates=x_s)
d = SparseTimeFunction(name="d", grid=grid, npoint=1, nt=nt, coordinates=x_r)
```

The source term is handled separately from the PDE as a point-wise operation called injection, while measurement is handled via interpolation. By default, [Devito] initializes all `Function` data to 0, and thus automatically satisfies the zero initial conditions at `t=0`.
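The data underlying these symbolic objects is accessed through a NumPy-like `.data` property. As a minimal illustration -- assuming, hypothetically, a P-wave velocity `vp` given in km/s -- the squared slowness `m` could be populated as follows before the wave-equation is assembled:

```python
import numpy as np

# Placeholder velocity model: constant 1.5 km/s; in practice `vp` would be
# read from a velocity model file with shape `grid.shape`.
vp = np.full(grid.shape, 1.5, dtype=np.float32)

# The squared slowness is written directly into the Function's data.
m.data[:] = 1.0 / vp**2
```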
The isotropic acoustic wave-equation can then be implemented in [Devito](https://github.com/devitocodes/devito) as follows

```python
from devito import solve, Eq

eq = m * u.dt2 - u.laplace
stencil = [Eq(u.forward, solve(eq, u.forward))]
src_eqns = src.inject(u.forward, expr=src * dt**2 / m)
d_eqns = d.interpolate(u)
```

To trigger compilation one needs to pass the constructed equations to an `Operator`.

```python
from devito import Operator

op = Operator(stencil + src_eqns + d_eqns)
```

The first compilation pass processes equations individually. The equations are lowered to an enriched representation, while the finite-difference constructs (e.g., derivatives) are translated into actual arithmetic operations. Subsequently, data dependency analysis is used to compute a performance-optimized topological ordering of the input equations (e.g., to maximize the likelihood of loop fusion) and to group them into so-called "clusters". A cluster eventually becomes a loop nest in the generated code, and consecutive clusters may share some outer loops. The ordered sequence of clusters undergoes several optimization passes, including cache blocking and FLOP-reducing transformations. It is then further lowered into an abstract syntax tree, and it is on this representation that parallelism is introduced (SIMD, shared-memory, MPI). Finally, all remaining low-level aspects of code generation are handled, among which the most relevant is data management (e.g., definition of variables, transfers between host and device). The output of the [Devito](https://github.com/devitocodes/devito) compiler for the running example used in this section is available at [CodeSample](https://github.com/mloubout/SC20Paper/tree/master/codesamples) in `acou-so8.c`.

### Distributed-memory parallelism

We here provide a succinct description of distributed-memory parallelism in Devito; the interested reader should refer to the MPI tutorial [mpi-notebook](https://github.com/devitocodes/devito/blob/v4.2/examples/mpi/overview.ipynb) for thorough explanations and practical examples.

Devito implements distributed-memory parallelism on top of MPI. The design is such that users can almost entirely abstract away from it and reuse non-distributed code as is. Given any [Devito] code, just running it as

```bash
DEVITO_MPI=1 mpirun -n X python ...
```

triggers the generation of code with routines for halo exchanges. The routines are scheduled at a suitable depth in the various loop nests thanks to data dependency analysis. The following optimizations are automatically applied:

- redundant halo exchanges are detected and dropped;
- computation/communication overlap, with prodding of the asynchronous progress engine by a designated thread through repeated calls to `MPI_Test`;
- a halo exchange is placed as far as possible from where it is needed to maximize computation/communication overlap;
- data packing and unpacking are threaded.

Domain decomposition occurs in Python upon creation of a `Grid` object. Exploiting the MPI Cartesian topology abstraction, Devito logically splits a grid based on the number of available MPI processes (noting that users are given an "escape hatch" to override Devito's default decomposition strategy). `Function` and `TimeFunction` objects inherit the `Grid` decomposition. For `SparseFunction` objects the approach is different.
Since a `SparseFunction` represents a sparse set of points, [Devito] looks at the physical coordinates of each point and, based on the `Grid` decomposition, assigns the logical ownership of the point to an MPI rank. If a sparse point lies along the boundary of two or more MPI ranks, then it is duplicated in each of these ranks so as to be accessible by all neighboring processes. A duplicated point may then be redundantly computed by multiple processes, but any redundant increments will be discarded.

When accessing or manipulating data in a [Devito] code, users have the illusion of working with classic NumPy arrays, while underneath the data are actually distributed. All manner of NumPy indexing schemes (basic, slicing, etc.) are supported. In the implementation, proper global-to-local and local-to-global index conversion routines are used to propagate a read/write access to the impacted subset of ranks. For example, consider the array

```python
A = [[ 1,  2,  3,  4],
     [ 5,  6,  7,  8],
     [ 9, 10, 11, 12],
     [13, 14, 15, 16]]
```

which is distributed across 4 ranks such that `rank0` contains the elements `1,2,5,6`, `rank1` the elements `3,4,7,8`, `rank2` the elements `9,10,13,14` and `rank3` the elements `11,12,15,16`. The slicing operation `A[::-1,::-1]` will then return

```python
[[16, 15, 14, 13],
 [12, 11, 10,  9],
 [ 8,  7,  6,  5],
 [ 4,  3,  2,  1]]
```

such that now `rank0` contains the elements `16,15,12,11` and so forth. Finally, we remark that while providing abstractions for distributed data manipulation, Devito does not natively support any mechanism for parallel I/O. However, the distributed NumPy arrays, along with the ability to seamlessly transfer any desired slice of data between ranks, provide a generic and flexible infrastructure for the implementation of any form of parallel I/O (e.g., see [@witte2018alf]).

## Industry-scale 3D seismic imaging in anisotropic media

One of the main applications of seismic finite-difference modelling in exploration geophysics is reverse-time migration (RTM), a wave-equation based seismic imaging technique. Real-world seismic imaging presents a number of challenges that make applying this method to industry-scale problem sizes difficult. Firstly, RTM requires an accurate representation of the underlying physics via sophisticated wave-equations such as the tilted transverse isotropic (TTI) wave-equation, for which both forward and adjoint implementations must be provided. Secondly, wave-equations must be solved for a large number of independent experiments, where each individual PDE solve is itself expensive in terms of FLOPs and memory usage. For certain workloads, a limited degree of domain decomposition, balancing the domain size against the number of independent experiments, as well as checkpointing techniques, must be adopted. In the following sections, we describe an industry-scale seismic imaging problem that poses all the aforementioned challenges, its implementation with Devito, and the results of an experiment carried out on the Azure Cloud using a synthetic data set.

### Anisotropic wave-equation

In our seismic imaging case study, we use an anisotropic representation of the physics called Tilted Transverse Isotropic (TTI) modelling [@thomsen1986].
This representation of wave motion is one of the most widely used in exploration geophysics since it captures the leading-order kinematics and dynamics of acoustic wave motion in highly heterogeneous elastic media where the medium properties vary more rapidly in the direction perpendicular to sedimentary strata [@alkhalifah2000; @baysal1983; @bubetti2012; @bubetti2014; @bubesatti2016; @chu2011; @duveneck; @fletcher; @fowlertti2010; @louboutin2018segeow; @whitmore1983; @witte2016segpve; @xu2014; @zhang2005; @zhang2011; @zhan2013]. The TTI wave-equation is an acoustic, low-dimensional (4 parameters, 2 wavefields) simplification of the 21-parameter, 12-wavefield tensorial equations of motion [@hooke]. This simplified representation is parametrized by the Thomsen parameters ``\epsilon(x), \delta(x)``, which relate to the global (propagation over many wavelengths) difference in propagation speed in the vertical and horizontal directions, and the tilt and azimuth angles ``\theta(x), \phi(x)``, which define the rotation of the vertical and horizontal axes around the Cartesian directions. However, unlike the scalar isotropic acoustic wave-equation, the TTI wave-equation is extremely computationally costly to solve and it is also not self-adjoint, as shown in [@louboutin2018segeow].

The main complexity of the TTI wave-equation is that the rotation of the symmetry axis of the physics leads to rotated second-order finite-difference stencils. In order to ensure numerical stability, these rotated finite-difference operators are designed to be self-adjoint (c.f. [@zhang2011; @duveneck]). For example, we define the rotated second-order derivative with respect to ``x`` as:

```math {#rot}
G_{\bar{x}\bar{x}} &= D_{\bar{x}}^T D_{\bar{x}} \\
D_{\bar{x}} &= \cos(\mathbf{\theta})\cos(\mathbf{\phi})\frac{\mathrm{d}}{\mathrm{d}x} + \cos(\mathbf{\theta})\sin(\mathbf{\phi})\frac{\mathrm{d}}{\mathrm{d}y} - \sin(\mathbf{\theta})\frac{\mathrm{d}}{\mathrm{d}z}.
```

Devito enables the simple expression of these complicated stencils via finite-difference shortcuts such as `u.dx`, where `u` is a `Function`. Such shortcuts are enabled not only for the basic types but for generic composite expressions, for example `(u+v.dx).dy`. As a consequence, the rotated derivative defined in #rot is implemented with [Devito] in two lines as:

```python{#rotxpy}
dx_u = cos(theta) * cos(phi) * u.dx + cos(theta) * sin(phi) * u.dy - sin(theta) * u.dz
dxx_u = (cos(theta) * cos(phi) * dx_u).dx.T + (cos(theta) * sin(phi) * dx_u).dy.T - (sin(theta) * dx_u).dz.T
```

Note that while the adjoint of the finite-difference stencil is obtained via the standard Python `.T` shortcut, the expression needs to be reordered by hand, since the tilt and azimuth angles are spatially dependent and must therefore appear inside the second (outer) application of the first-order derivatives. We can see from these two lines that the rotated stencil involves all second-order derivatives (`.dx.dx`, `.dy.dy` and `.dz.dz`) and all second-order cross-derivatives (`.dx.dy`, `.dx.dz` and `.dy.dz`), which leads to denser stencil support and higher computational complexity (c.f. [@louboutin2016ppf]). For illustrative purposes, the complete generated code for TTI modelling with and without MPI is made available at [CodeSample](https://github.com/mloubout/SC20Paper/tree/master/codesamples) in `tti-so8-unoptimized.c`, `tti-so8.c` and `tti-so8-mpi.c`.
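For completeness, the rotated second derivative along the symmetry (tilt) axis follows the same pattern; the sketch below is illustrative (with `u`, `theta` and `phi` as above and `sin`/`cos` taken from `SymPy`) and is not the full coupled TTI system used in the experiments:

```python
# Rotated first-order derivative along the symmetry axis, whose direction is
# (sin(theta)cos(phi), sin(theta)sin(phi), cos(theta)), followed by its
# self-adjoint second-order counterpart, mirroring dxx_u above.
dz_u = (sin(theta) * cos(phi) * u.dx
        + sin(theta) * sin(phi) * u.dy
        + cos(theta) * u.dz)
dzz_u = ((sin(theta) * cos(phi) * dz_u).dx.T
         + (sin(theta) * sin(phi) * dz_u).dy.T
         + (cos(theta) * dz_u).dz.T)
```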
Owing to the very high number of floating-point operations (FLOPs) needed per grid point for the weighted rotated Laplacian, this anisotropic wave-equation is extremely challenging to implement. As we show in Table #ttiFLOPs\, and as previously analysed in [@louboutin2016ppf], the computational cost with high-order finite-differences is in the order of thousands of FLOPs per grid point without optimizations. The version without FLOP-reducing optimizations is a direct translation of the discretized operators into stencil expressions (see `tti-so8-unoptimized.c`). The version with optimizations employs transformations such as common sub-expression elimination, factorization, and cross-iteration redundancy elimination -- the latter being key in removing redundancies introduced by mixed derivatives. Implementing all of these techniques manually is inherently difficult and laborious. Further, to obtain the desired performance improvements it is necessary to orchestrate them with aggressive loop fusion (for data locality), tiling (for data locality and tensor temporaries), and potentially ad-hoc vectorization strategies (if rotating registers are used). While an explanation of the optimization strategies employed by [Devito](https://github.com/devitocodes/devito) is beyond the scope of this paper (see [@devito-compiler] for details), what is emphasized here is that users can easily take full advantage of these optimizations without needing to concern themselves with the details.

#### Table: {#ttiFLOPs}

| spatial order | w/o optimizations | w/ optimizations |
|:--------------|:------------------|:-----------------|
| 4             | 501               | 95               |
| 8             | 539               | 102              |
| 12            | 1613              | 160              |
| 16            | 5489              | 276              |

: Per-grid-point FLOPs of the finite-difference TTI wave-equation stencil with different spatial discretization orders.

It is evident that developing an appropriate solver for the TTI wave-equation, an endeavor involving complicated physics, mathematics, and engineering, is exceptionally time-consuming and can lead to thousands of lines of code even for a single choice of discretization. Verification of the results is no less complicated: any minor error is effectively untraceable, and any change to the finite-difference scheme or to the time-stepping algorithm is difficult to achieve without substantial re-coding. Another complication stems from the fact that practitioners of seismic inversion are often geoscientists, not computer scientists/programmers. Low-level implementations from non-specialists can often lead to poorly performing code. However, if research codes are passed to specialists in the domain of low-level code optimization, they often lack the necessary geophysical domain knowledge, resulting in code that may lack a key feature required by the geoscientist. Neither situation is conducive to addressing the complexities that come with implementing codes based on the latest geophysical insights in tandem with those from high-performance computing. With [Devito], on the other hand, both the forward and adjoint equations can be implemented in a few lines of Python code, as illustrated with the rotated operator in Listing #rotxpy\. The low-level optimization element of the development is then taken care of under the hood by the Devito compiler.

The simulation of wave motion is only one aspect of solving problems in seismology. During wave-equation based imaging, it is also required to compute sensitivities (the gradient) with respect to the quantities of interest.
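In the acoustic case, for instance, such a sensitivity can be expressed in the Devito language via the usual zero-lag cross-correlation imaging condition. The sketch below is illustrative only (it is not the production TTI gradient): `u` is a forward wavefield saved over all time steps, `v` an adjoint wavefield, and `grid`, `so` and `nt` are assumed to be defined as in the earlier acoustic example.

```python
from devito import Function, TimeFunction, Eq

# Forward wavefield saved at every time step (required by the adjoint-state
# method) and the adjoint wavefield, both defined on the same grid.
u = TimeFunction(name="u", grid=grid, space_order=so, time_order=2, save=nt)
v = TimeFunction(name="v", grid=grid, space_order=so, time_order=2)

# Gradient with respect to the squared slowness, accumulated during the
# adjoint time loop through the cross-correlation imaging condition.
grad = Function(name="grad", grid=grid)
grad_update = Eq(grad, grad - u.dt2 * v)
```

This equation is simply appended to the list of equations from which the adjoint `Operator` is built.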
This requirement imposes additional constraints on the design and implementation of modelling codes, as outlined in [@virieux]. Along with several other factors, such as fast setup time, we focused on correct and testable implementations of the adjoint wave-equation and of the gradient (the action of the adjoint Jacobian) [@louboutin2018segeow; @louboutin2020THmfi], exactness being a mandatory requirement of gradient-based iterative optimization algorithms.

### 3D Imaging example on Azure {#d-imaging-example-on-azure}

We now demonstrate the scalability of [Devito] to real-world applications by imaging an industry-scale three-dimensional TTI subsurface model. This imaging was carried out in the Cloud on Azure and takes advantage of recent work to port conventional cluster code to the Cloud using a serverless approach. The serverless implementation is detailed in [@witte2019TPDedas; @witte2019SEGedw], where the steps to run computationally and financially efficient HPC workloads in the Cloud are described. This imaging project, in collaboration with Azure, demonstrates the scalability and robustness of [Devito] for large-scale wave-equation based inverse problems in combination with a cost-effective serverless implementation of seismic imaging in the Cloud.

In this example, we imaged a synthetic three-dimensional anisotropic subsurface model that mimics a realistic industry-size problem with a realistic representation of the physics (TTI). The physical size of the problem is `10km x 10km x 2.8km`, discretized on a `12.5m` grid with absorbing layers of width 40 grid points on each side, leading to `881x881x371` computational grid points (``\approx300`` million grid points). The final image is the sum of 1500 single-source images: 100 single-source images were computed in parallel on the 200 nodes available, using two nodes per source experiment.

***Computational performance*** We briefly describe the computational setup and the performance achieved for this anisotropic imaging problem. Due to time constraints, and because the resources we were given access to for this proof of concept with Azure were somewhat limited, we did not have access to InfiniBand-enabled virtual machines (VMs). This experiment was carried out on `Standard_E64_v3` and `Standard_E64s_v3` nodes which, while not HPC VM nodes, are memory-optimized, thus allowing the wavefield to be stored in memory for imaging (TTI adjoint-state gradient [@virieux; @louboutin2018segeow]). These VMs are dual-socket Intel® Broadwell E5-2673 v4 (2.3GHz) machines with 32 physical cores (hyperthreading enabled) and 432GB of DRAM. The overall inversion involved computing the image for 1500 source positions, i.e. solving 1500 forward and 1500 adjoint TTI wave-equations. A single image required, in single precision, 600GB of memory. Two VMs were used per source, with one MPI rank per socket (4 MPI ranks per source), and 100 sources were imaged in parallel. The performance achieved was as follows:

- 140 GFLOP/s per VM;
- 280 GFLOP/s per source;
- 28 TFLOP/s for all 100 running sources;
- 110min runtime per source (forward + adjoint + image computation).

We note that, because the imaging problem is embarrassingly parallel over sources and can scale arbitrarily, imaging all of the 1500 sources in parallel could have been attempted had more resources been available, which theoretically leads to a performance of 0.4 PFLOP/s.

***How performance was measured*** The execution time is computed through Python-level timers prefixed by an MPI barrier.
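A minimal sketch of this timing approach is given below, assuming an already-built `Operator` `op` and using `mpi4py`; the `apply` arguments are illustrative.

```python
import time

from mpi4py import MPI

comm = MPI.COMM_WORLD

comm.Barrier()                     # synchronize all ranks before timing
t0 = time.time()
op.apply(time_M=nt - 2, dt=dt)     # run the (MPI-parallel) propagator
comm.Barrier()                     # ensure every rank has finished
elapsed = time.time() - t0         # wall-clock time of the distributed run
```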
The floating-point operations are counted once all of the symbolic FLOP-reducing transformations have been performed during compilation. [Devito] uses an in-house estimate of cost, rather than `SymPy`'s estimate, to take care of some low-level intricacies. For example, Devito's estimate ignores the cost of integer arithmetic used for offset indexing into multi-dimensional arrays. To calculate the total number of FLOPs performed, [Devito] multiplies the floating-point operations counted at compile time by the size of the iteration space, and it does that at the granularity of individual expressions. Thanks to aggressive code motion, the number of innermost-loop-invariant sub-expressions in an `Operator` is typically negligible, and therefore the [Devito] estimate does not suffer from over-counting such expressions, or at least not, to the best of our knowledge, in a tangible way. The Devito-reported GFLOP/s were also checked against those produced by Intel Advisor on several single-node experiments: the differences -- with [Devito](https://github.com/devitocodes/devito) typically underestimating the achieved performance -- were always at most in the order of a few units, and therefore negligible.

***Imaging result*** The subsurface velocity model used in this study is an artificial anisotropic model that was designed and built by combining two widely known and used open-source SEG/EAGE acoustic velocity models, each of which comes with realistic geophysical imaging challenges such as sub-salt imaging. The anisotropy parameters are derived from a smoothed version of the velocity, while the tilt angles were derived from a combination of the smooth velocity models and vertical and horizontal edge detection. The final seismic image of the subsurface model is displayed in Figure #OverTTI and highlights the fact that 3D seismic imaging based on a serverless approach and automatic code generation is feasible and provides good results.

#### Figure: {#OverTTI}
![](./Figures/OverTTI1.png){width=50%} ![](./Figures/OverTTI2.png){width=50%}\
![](./Figures/OverTTI3.png){width=50%} ![](./Figures/OverTTI4.png){width=50%}\
: 3D TTI imaging on a custom-made model.

[@witte2019TPDedas] describes the serverless implementation of seismic inverse problems in detail, including iterative algorithms for least-squares minimization problems (LSRTM). The 3D anisotropic imaging results were presented as part of a keynote presentation at the EAGE HPC workshop in October 2019 [@herrmann2019EAGEHPCaii] and at the Rice O&G HPC workshop [@witte2019RHPCssi], in which the focus was on the serverless implementation of seismic inversion algorithms in the Cloud. This work illustrates the flexibility and portability of Devito: we were able to easily port a code developed and tested on local hardware to the Cloud, with only minor adjustments. Further, note that this experiment included the porting of MPI-based code for domain decomposition developed on desktop computers to the Cloud. Our experiments are reproducible using the instructions in a public repository [AzureTTI](https://github.com/slimgroup/Azure2019/tree/v1.0), which contains, among other things, the Dockerfiles and Azure [batch-shipyard](https://batch-shipyard.readthedocs.io) setup. This example can also easily be run in a traditional HPC cluster environment using, for example, [JUDI](https://github.com/slimgroup/JUDI.jl) [@witte2018alf] or Dask [@dask] for parallelization over sources.
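As an illustration of the latter option, a minimal sketch of source parallelism with Dask is shown below; `image_shot` and `nshots` are hypothetical names, with `image_shot` standing for a user function that builds and runs the per-shot Devito propagators and returns a single-source image.

```python
from dask.distributed import Client

def image_shot(shot_id):
    # Hypothetical per-shot imaging function: build the Devito operators,
    # run the forward and adjoint propagators and return the image.
    ...

client = Client()  # connect to (or start) a Dask cluster
futures = client.map(image_shot, range(nshots))
images = client.gather(futures)
final_image = sum(images)  # stack the single-source images
```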
## Elastic modelling

While the subsurface image obtained in Section #d-imaging-example-on-azure utilized anisotropic propagators capable of mimicking intricate physics, elastic propagators are required in order to model both the wave kinematics and the amplitudes correctly. These propagators are, for example, extremely important in global seismology since shear surface waves (which are ignored in acoustic models) are the most hazardous. In this section, we exploit the tensor algebra language introduced in [Devito] v4.0 to express an elastic model with compact and elegant notation. The isotropic elastic wave-equation, parametrized by the so-called Lamé parameters ``\lambda, \mu`` and the density ``\rho``, reads:

```math{#elas1}
&\frac{dv}{dt} = \frac{1}{\rho} \nabla \cdot \tau \\
&\frac{d \tau}{dt} = \lambda \mathrm{tr}(\nabla v) \mathbf{I} + \mu (\nabla v + (\nabla v)^T)
```

where ``v`` is a vector-valued function with one component per Cartesian direction:

```math{#partvel}
v = \begin{bmatrix} v_x(t, x, y) \\ v_y(t, x, y) \end{bmatrix}
```

and the stress ``\tau`` is a symmetric second-order tensor-valued function:

```math{#stress}
\tau = \begin{bmatrix}\tau_{xx}(t, x, y) & \tau_{xy}(t, x, y)\\ \tau_{xy}(t, x, y) & \tau_{yy}(t, x, y)\end{bmatrix}.
```

The discretization of Equations #elas1 and #stress requires five equations in two dimensions (two equations for the particle velocity and three for the stress) and nine equations in three dimensions (three for the particle velocity and six for the stress). However, the mathematical definition requires only two coupled vector/tensor-valued equations for any number of dimensions.

### Tensor algebra language

We have augmented the [Devito](https://github.com/devitocodes/devito) language with tensorial objects to enable straightforward and mathematically rigorous definitions of high-dimensional PDEs, such as the elastic wave-equation defined in #elas1\. This implementation was inspired by [@ufl], a functional language for finite element methods. The extended [Devito] language introduces two new types, `VectorFunction`/`VectorTimeFunction` for vectorial objects such as the particle velocity, and `TensorFunction`/`TensorTimeFunction` for second-order tensor objects (matrices) such as the stress. These new objects are constructed in the same manner as scalar `Function` objects. They also automatically implement staggered grids and staggered finite-differences, with the possibility of half-node averaging. Each component of a tensorial object -- a (scalar) [Devito] `Function` -- is accessible via conventional vector notation (e.g. `v[0]`, `tau[0, 1]`). With this extended language, the elastic wave-equation defined in #elas1 can be expressed in only four lines of code:

```python
from devito import VectorTimeFunction, TensorTimeFunction, Eq, div, grad, diag

v = VectorTimeFunction(name='v', grid=model.grid, space_order=so, time_order=1)
tau = TensorTimeFunction(name='t', grid=model.grid, space_order=so, time_order=1)
u_v = Eq(v.forward, model.damp * (v + s/rho*div(tau)))
u_t = Eq(tau.forward, model.damp * (tau + s * (l * diag(div(v.forward)) + mu * (grad(v.forward) + grad(v.forward).T))))
```

The `SymPy` expressions created by these commands can be displayed with `sympy.pprint`, as shown in Figure #PrettyElas\. This representation perfectly reflects the mathematics while maintaining computational portability and efficiency through the [Devito] compiler.
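As in the acoustic case, these update equations (together with source injection and receiver interpolation equations, omitted here) are passed to an `Operator` and executed. A minimal sketch, assuming `model` exposes a stable time step via `critical_dt` and `nt` denotes the number of time steps:

```python
from devito import Operator

# Compile and run the coupled velocity/stress updates; in a complete example
# the source and receiver equations would be appended to the list below.
op = Operator([u_v, u_t])
op.apply(dt=model.critical_dt, time_M=nt - 1)
```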
The complete generated code for the elastic wave-equation with and without MPI is made available at [CodeSample](https://github.com/mloubout/SC20Paper/tree/master/codesamples) in `elastic-so12.c` and `elastic-so12-mpi.c`.

#### Figure: {#PrettyElas}
![](./Figures/vel_symb.png){width=100%}
: Update stencil for the particle velocity. The stencil for updating the stress components is left out for readability, as the equation does not fit onto a single page. However, it can be found in the [Devito] tutorial on elastic modelling on GitHub.

### 2D example

To demonstrate the efficacy of the elastic implementation outlined above, we utilized a broadly recognized 2D synthetic model, the elastic Marmousi-II model [@versteeg927; @marmouelas]. The wavefields are shown in Figure #ElasWf and the corresponding elastic shot records are displayed in Figure #ElasShot\. These two figures show that the wavefield is, as expected, purely acoustic in the water layer (``\tau_{xy}=0``) and transitions correctly at the ocean bottom to an elastic wavefield. We can also clearly see the shear wave-front in the subsurface (at a depth of approximately 1km). Figures #ElasWf and #ElasShot demonstrate that this high-level [Devito] implementation of the elastic wave-equation is effective and accurate. Importantly, in constructing this model within the [Devito] DSL framework, computational technicalities such as the staggered-grid analysis are abstracted away. We note that the shot records displayed in Figure #ElasShot match the original data generated by the creator of this elastic model [@marmouelas]\.

#### Figure: {#ElasWf}
![](./Figures/marmou_snap.png){width=100%}
: Particle velocities and stress at time ``t=3\text{s}`` for a source at 10m depth and ``x=5\text{km}`` in the Marmousi-II model.

#### Figure: {#ElasShot}
![](./Figures/pressure_marmou.png){width=30%} ![](./Figures/vz_marmou.png){width=30%} ![](./Figures/vx_marmou.png){width=30%}
: Seismic shot record for 5s of modelling. `a` is the pressure (trace of the stress tensor) at the surface (5m depth), `b` is the vertical particle velocity and `c` is the horizontal particle velocity at the ocean bottom (450m depth).

### 3D proof of concept

Finally, three-dimensional elastic data was modelled in the Cloud to demonstrate the scalability of [Devito] to cluster-size problems. The model used in this experiment mimics a reference model in geophysics known as the SEAM model [@fehler2011seam], a three-dimensional extreme-scale synthetic representation of a subsurface. The physical dimensions of the model are `45km x 35km x 15km`, discretized with a grid spacing of `20m x 20m x 10m`, leading to a computational grid of `2250x1750x1500` grid points (5.9 billion grid points). One of the main challenges of elastic modelling is the extreme memory cost owing to the number of fields (a minimum of 21 fields in a three-dimensional propagator) that need to be stored:

- Three particle velocities, each with two time steps (`v.forward` and `v`)
- Six stress components, each with two time steps (`tau.forward` and `tau`)
- Three model parameters ``\lambda``, ``\mu`` and ``\rho``

These 21 fields, for the 5.9 billion point grid defined above, lead to a minimum memory requirement of 461GB for modelling alone. For this experiment, access was obtained to small HPC VMs on Azure called `Standard_H16r`. These VMs contain 16-core Intel Xeon E5-2667 v3 chips, with no hyperthreading, and 32 nodes were used for a single source experiment (i.e. a single wave-equation was solved).
We used a 12th-order discretization in space, which led to 2.8 TFLOPs being computed per time step, and the elastic wavefield was propagated for 16 seconds (23000 time steps). Completion of this modelling run took 16 hours, corresponding to 1.1 TFLOP/s. While these numbers may appear low, it should be noted that the elastic kernel is extremely memory-bound, whereas the TTI kernel is nearly compute-bound (see the rooflines in [@louboutin2016ppf; @devito-api; @devito-compiler]), making the latter more computationally efficient, particularly in combination with MPI. Future work will involve working on InfiniBand-enabled and true HPC VMs on Azure to achieve Cloud performance on par with that of state-of-the-art HPC clusters. Extrapolating from the performance obtained in this experiment, and assuming a fairly standard setup of 5000 independent source experiments, computing an elastic synthetic dataset would require 322 EFLOPs (23k time steps x 2.8 TFLOP/time-step x 5000 sources).

## Performance comparison with other codes

Earlier performance benchmarks mainly focused on roofline model analysis. In this study, for completeness, the runtime of [Devito] is therefore compared to that of the open-source hand-coded propagator [fdelmodc](https://github.com/JanThorbecke/OpenSource.git). This propagator, described in [@thorbecke], is a state-of-the-art elastic kernel (Equation #elas1) and the comparisons presented here were carried out in collaboration with its author. To ensure a fair comparison, we ensured that the physical and computational settings were identical. The settings were as follows:

- 2001 by 1001 physical grid points.
- 200 grid points of damping layer (absorbing layer [@cerjan]) on all four sides (total of 2401x1401 computational grid points).
- 10001 time steps.
- Single point source, 2001 receivers.
- The same compiler (gcc/icc) used to compile [fdelmodc] and to run Devito.
- Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.8GHz.
- Single socket, four physical cores, four physical threads, thread pinning to cores and hyperthreading off.

The runtimes observed for this problem were essentially identical, showing less than one percent difference. Such similar runtimes were obtained with both the Intel and GNU compilers, and the experiment was performed with both fourth- and sixth-order discretizations. Kernels were executed five times each and the runtimes observed were consistently very similar. This comparison illustrates that the performance achieved with [Devito] is at least on par with hand-coded propagators. Considering that we do not take advantage of the Devito compiler's full capabilities in two-dimensional cases, we are confident that the generated code will be at least on par with the hand-coded version for three-dimensional problems as well; this comparison will be part of our future work.

## Conclusions

Transitioning from academic toy problems, such as the two-dimensional acoustic wave-equation, to real-world applications can be challenging, particularly if this transition is carried out as an afterthought. Owing to the fundamental design principles of [Devito], however, such scaling becomes straightforward. In this work we demonstrated the high-level interface provided by [Devito] not only for simple scalar equations but also for coupled PDEs. This interface allows, in a simple, concise and consistent manner, the expression of all kinds of non-trivial differential operators.
Next, and most importantly, we demonstrated that the compiler enables large-scale modelling with state-of-the-art computational performance within a high-level programming paradigm. The single-node performance is on par with state-of-the-art hand-coded propagators, but packaged with this performance comes the flexibility of the symbolic interface and multi-node parallelism, which is integrated in the compiler and interface in an accessible way. Finally, we demonstrated that our abstractions provide the necessary portability to enable both on-premise and Cloud-based HPC.

## Code availability

The code to reproduce the different examples presented in this work is available online in the following repositories:

- The complete code for TTI imaging is available at [Azure2019](https://github.com/slimgroup/Azure2019/tree/v1.0) and includes the TTI propagators, the Azure setup for shot parallelism and documentation.
- The elastic modelling can be run with the elastic example available in Devito at [elastic_example.py](https://github.com/devitocodes/devito/blob/v4.2/examples/seismic/elastic/elastic_example.py) for any size and spatial order. A standalone script to run the large 3D elastic modelling is also available at [codesamples](https://github.com/mloubout/SC20Paper/tree/master/codesamples).

## References

[fdelmodc]:https://github.com/JanThorbecke/OpenSource.git
[AzureTTI]:https://github.com/slimgroup/Azure2019
[batch-shipyard]:https://batch-shipyard.readthedocs.io
[JUDI]:https://github.com/slimgroup/JUDI.jl
[Devito]:https://github.com/devitocodes/devito