**FP7 IP projects**

- DEEP
*Dynamical Exascale Entry Platform* - MONT-BLANC
*European scalable and power efficient HPC platform based on low-power embedded technology* - CRESTA
)*Dynamical Exascale Entry Platform*

**Other FP7 rellevant projects**

- EESI & EESI2 (
), where roadmaps to exascale for large industries are drafted, putting stress on the engineering tasks, such as advanced physical modelling and increased resolution, multi-disciplinary analysis and design, multi-scale and multi-physic simulations, etc.*European Exascale Software Initiative* - TEXT (
), where a novel hybrid parallel programming model (MPI/StarSs) is proposed.*Towards EXaflop applicaTions*

Any computer simulation addressed in this project proceeds through the following stages: pre-processing of data, grid generation, solution (analysis/design) and post-processing of numerical results. In the next section we describe the state-of-the-art and the scientific challenges ahead for each of these stages in turn.

The widespread availability of parallel machines with large memory solvers and the desire to model in ever increasing detail geometrical and physical features has led to a steady increase in the number of points and elements used in field solvers. During the 1990s, grids in excess of 10 ^{7} elements became common for production runs in Computational Fluid Dynamics (CFD) [10, 11, 12] and Computational Electromagnetics (CEM) [13]. This tendency has continued last few years, roughly following Moore’s law, i.e. grid sizes have increased by an order of magnitude every 5 years. Presently, grids in of the order of 10 ^{9} elements are commonly used for cutting-edge applications in the aerospace, defence, automotive, naval, energy and electromagnetics sectors. However, for many exascale applications grids in excess of 10 ^{12} elements will be required (e.g. LES simulation of flow past a wing or a real fluid-soil-structure interaction problem in a tsunami simulation).

While many solvers have been ported to parallel machines, grid generators have, in general, lagged behind. For applications where remeshing is an integral part of simulations, e.g. problems with moving bodies [14, 15] or changing topologies [16], **the time required for mesh regeneration can easily consume a significant percentage of the total time required to solve the problem**. This percentage increases drastically if the grid generation portion is not well parallelized, considering that a non-trivial **dynamic load balance** has to be implemented. Faced with this situation, a number of efforts have been reported on parallel grid generation [17, 18, 19, 20].

The two most common ways of generating unstructured grids are the Advancing Front Technique (AFT) [21, 17] and the Generalized Delaunay Triangulation (GDT) [22, 19]. While coding and data structures may influence the scalar speed of the ‘core’ AFT or GDT, one often finds that for large-scale applications, the evaluation of the desired element size and shape in space, given by background grids [23] consumes the largest fraction of the total grid generation time. Furthermore, the time required for mesh improvements (and any unstructured grid generator needs them) is in many cases higher than the core AFT or GDT modules. Typical speeds for the complete generation of a mesh (surface, mesh, improvement) on current Intel Xeon chips with 3.2GHz and sufficient memory are of the order of 0.5-2.0 Melements/min. Therefore, it would take approximately 2,000 minutes (i.e. 1.5 days) to generate a mesh of (only) 10 ^{9} elements. Assuming perfect parallelization, this task could be performed in the order of a minute on 2,000 processors, clearly showing the need for parallel mesh generation. **To date, none of the parallel grid generators quoted above have proven scalable beyond 128 processors, and have been demonstrated on relatively simple, academic geometries**.

These difficulties extend to the generation of millions spherical or arbitrary geometry particles, typical in discrete element methods, widely used for simulations in geomechanics, structural, oil and gas engineering and many processes involving granular materials and/or particulate flows.

Scaling of mesh generators necessarily requires mesh splitting before multiple processors assignment and domain decomposition techniques [24]. Therefore, graph partitioned algorithms and software such as METIS and ParMETIS [25], a MPI-based parallel library, have to be used.

The state-of-the-art for individual field solvers of explicit and implicit type (either for solids, fluids, electromagnetics, etc.) is the most advanced as far as parallel processing is concerned. This should come as no surprise, as this is the portion in the simulation pipeline that typically consumes the most time, and copious amounts of research funds have been devoted to the development of parallel solvers. In the context of implicit codes, hybrid solvers (direct-iterative) combined with domain decomposition methods (DDM) [26] have been developed. In the early 90s, two important DDM, the Finite Element Tearing and Interconnecting (FETI) method and the Balancing Domain Decomposition (BDD) method were introduced and more recently a family of primal-dual DDM, namely the P-FETI methods, were proposed [27, 26]. Since their introduction, FETI, BDD, P-FETI and their variants have gained importance and today are considered as highly efficient DDM [28, 29]. Recently, an enhanced implementation of balancing Neumann-Neumann (BNN) preconditioning has been proposed contrasted with the balancing domain decomposition by constraints (BDDC) preconditioner [30].

For coupled multidisciplinary solvers, the optimum solution method is still to be found. Several methods have been investigated [31, 32, 33, 34] but a fully satisfactory answer is still pending. The monolithic approach, where all field equations are solved simultaneously, is regarded as the most suitable one for strongly coupled problems [35], while staggered solution techniques are more efficient for loosely coupled problems [34]. Solutions on parallel computers have also been investigated [36] by implementing methods primarily based on frontal and multi-frontal solution techniques with monolithic approaches, and domain decomposition parallel solvers with staggered solution approaches [34].

**What is surprising is how few of these implicit field solvers, even after despite many years of improvement, scale up to a modest number of cores (O(10 ^{2})**). The situation is particularly bad for Lagrangian solvers for fluid and solid mechanics applications, where the number of elements influencing each other vary in time. Examples are structural mechanics codes with contact and/or rupture [37, 38] and fluid mechanics code using Lagrangian particle techniques [7, 39]. For purely Eulerian finite difference-, volume- and/or element codes where the mesh (and work per element) is fixed during the run, reasonable scalings up to 30,000 cores have been reported [40, 41].

The project will develop a new generation of **scalable, parallel field solvers of explicit and implicit type** for solving large-scale problems in **fluid dynamics** applications, **structural mechanics** problems and **coupled** situations (including fluid-structure interaction, thermal-mechanical analysis and magneto-hydrodynamics problems). The goal is to develop multi-physic solver for 10 ^{9} elements, with parallel computing through a number of cores 10 times those of current achievements. The algorithmic design guideline that we will follow in the project development is to employ fixed meshes (even in dealing with moving bodies) to minimize rebalancing complexities. We also recognize that local calculations shall be favored over global ones so to ease the use of naturally parallel algorithms.

Given such requirements, our current orientation in addressing the project goal (with reference to the incompressible fluid solvers) is to employ a modified semi-explicit velocity-pressure-strain formulation (10 dofs per node). The use of such approaches has several advantages with respect to standard velocity-pressure techniques (4 dofs per node) in that it allows improving the accuracy for a given mesh size, and guarantees larger stable time steps. The use of larger time steps provides on its own a crucial advantage over alternatives based on mesh refinement since it automatically reduces the time to solution (as less steps are required) for a given level of accuracy. Furthermore the use of the proposed modified formulation leads to a slightly higher solution cost at the local level, but shall not require a sensibly higher communication cost thus resulting in a favorable tradeoff between local computations and global communication.

Such technique will be employed on the top of octree meshes and eventually exploit a template base implementation of the required topologies to maximize the computational efficiency.

Moving bodies, in the context of FSI interaction, will be considered by the use of embedded techniques by either modifying the local finite element space by suitably modified shape functions or by enriching it. In either case this are purely local modifications which shall once again affect the local cost but do not affect in any way the global communications.

Although a definitive conclusion is not yet reached on the most appropriate preconditioner for the pressure equation, our current orientation is to employ a DD approach accelerated by employing deflation techniques. In our idea such technique shall use multi-grid approaches at the node level. This approach shall thus allow exploiting shared memory parallelization as well as hybrid CPU-GPU techniques at the node level. Furthermore the inclusion of an artificial compressibility will be considered to speed up the calculations.

The design of engineering systems can be further enhanced with optimized designs satisfying the behavioural constraints while taking into account uncertainties. Over the past three decades a remarkable progress was made in the development of new and more efficient computational methods for automatic optimization, mainly on a deterministic and to a lesser extent on a probabilistic framework [42]. In the latter case, in addition to deterministic functional requirements, performance criteria are also taken into consideration in connection to various target reliability constraints. There are mainly two design formulations that account for probabilistic systems response: Reliability-based design optimization (RBO) and robust design optimization (RDO), while the combined formulation is denoted as reliability-based robust design optimization (RRDO) [43].

A number of metaheuristic algorithms for engineering optimization have been proposed, among them genetic algorithms (GA) [44]; evolutionary programming (EP) [45]; evolution strategies (ES) [46]; genetic programming (GP) [47]; simulated annealing (SA) [48]; particle swarm optimization (PSO) [49]; ant colony algorithm (ACO) [50]; artificial bee colony algorithm (ABC) [51]; harmony search (HS) [52]; cuckoo search algorithm [53]; firefly algorithm (FA) [54]; bat algorithm [55]; krill herd [56]. Several metaheuristic algorithms have been also proposed for treating structural multi-objective optimization problems [57]. The three most well-known algorithms are the nondomination sort genetic algorithm (NSGA) [58]; the strength Pareto evolutionary algorithm (SPEA) [59] and the multi-objective elitist covariance matrix adaptation (MOECMA) method [60]. A combination of ES with NSGA and SPEA algorithms has been recently proposed by the group at NTUA and has been proved robust and efficient [61].

**The search for optimized designs of large-scale systems considering the effect of uncertainty in the key system parameters requires a prohibitive computational effort**. For example, in the case of stochastic optimization problems simulated with a detailed finite element modelling, the number of algebraic equations required to be solved may be of the order of several millions. It is therefore accepted that **performing design optimization under dynamic loading, with nonlinear system response considering uncertainties, is a “grand challenge” problem** requiring an exascale computing environment in order to span both time and space scales, as well as to reduce the uncertainty in the prediction [62]. The only way to attempt the solution of these problems is by reducing at least ten orders of magnitude the required computational effort for stochastic optimization solvers.

A general purpose numerical stochastic optimization methodology will be formulated, capable of achieving optimum designs, given the corresponding deterministic problems addressed in WP4 and WP5. This objective aims at implementing an integrated computational strategy in order to master the challenges associated with high fidelity simulation models with the treatment of large-scale simulation problems with uncertainties which, in the context of stochastic optimization with possibly nonlinear and/or dynamic system response, constitute an extremely computationally intensive task. The goal is to handle probabilistic design problems based on deterministic problems of 100.000 dofs with tens of uncertain design variables and stochastic parameters.

Parallel, scalable post-processors (some of them open-source) have recently appeared [63, 64]. These **post-processors have been conceived with the assumption that one has a pipeline with very large capacity** to connect to the parallel machine.

This is seldom the case in production environments. Therefore, it would be preferable to reduce the data on the parallel machine, and only transmit to the (parallel) high-end desktop the reduced data for visualization.

In NUMEXAS we will develop a new class of **reduced order methods** that will **speed the visualization of huge set of numerical data** resulting from exascale computations. Expected features that the new post-processing tools should cover are reduction and simplification of results of a model on the order of 10 ^{10}, allow interactive visualization of results, and investigate large amount of data storage capabilities through dedicated formats (like HDF5 or VTK).

In the Tera/Petascale era, focus is oriented towards developing new tools based on MPI, OpenMP, Fortran, C, C++ for different 'traditional' CPU architectures. During the last years, new challenges have appeared because of the high energy consumption of infrastructure with massive use of CPU. For energy consumption, to stay under reasonable values, new architectures must make use of different processors, and heterogeneous architectures (CPU/GPU) seems to be the only solution. Related to this new hardware definition, new programming models are necessary (such as PGAS, OpenCL or CUDA, without excluding the old ones such as MPI, OpenMP).

At the same time, **new problems have appeared in the scene, such as load balance between CPU/GPU, inefficient data transfer between host and accelerator or extreme scalability**. It is necessary to measure and determine the reason behind an ill function of the system. There is a large list of tools that can be used for these tasks, such as libraries (SIONlib), interactive event trace analysis tools (Paraver, Vampire), MPI runtime error detection tools (MUST, Marmot), GPU support (TUD), binary analyze tools (MAQAO) and more generic performance tools (TAU, UNITE, PAPI, ScoreP, Scalasca). They will be all tested in NUMEXAS to determine which are the best suited for Exascale simulations.

Accurate and extensive benchmarking will be necessary during the whole duration of the project, **as every generated code must be tested to determine its maximum scalability and the performance when run in different hardware conformations**. The different results obtained in the machines of the two participating HPCs will be classified and compared using the most up-to-date tools for performing these tasks.

In NUMEXAS we will develop a collection of **performance tools aiming to characterize the efficiency and scalability of the software developed on novel heterogeneous supercomputing** architectures with a number of processors in the order of 10 ^{5} processors. Performance tools will be used to monitor improvement in weak/strong scalability of the new algorithms and overall performance of the new solvers. Performance bottlenecks will be identified on the lower hardware level as well on the level of programming models. Both types of information will be communicated back to the algorithm developers together with suggestions for improvement. Such detailed, systematic and large-scale optimization is novel to the field. In addition, energy consumption will be monitored and recommendations for economical operation of individual computations be worked out.

**
611636**