GPU-accelerated smoothed particle finite element method for large deformation analysis in geomechanics

The particle finite element method (PFEM) is an effective numerical tool for solving large-deformation problems in geomechanics. By incorporating the node integration technique with strain smoothing into the PFEM, we previously proposed the smoothed particle finite element method (SPFEM). This paper extends the SPFEM to three-dimensional cases and presents an SPFEM executed on graphics processing units (GPUs) to boost computational efficiency. The detailed parallel computing strategy on the GPU is introduced. New computation formulations related to the strain smoothing technique are proposed to save memory space in GPU parallel computing. Several benchmark problems are solved to validate the proposed approach and to evaluate the GPU acceleration performance. Numerical examples show that with the new formulations, not only is memory space saved but the computational efficiency is also improved. The computational cost is reduced by ∼70% for double-precision GPU parallel computing with the new formulations.

This is a preprint of the paper published in Computers and Geotechnics. Accepted citation: Zhang, W., Zhong, Z.H., Peng, C., Yuan, W.H. and Wu, W., 2021. GPU-accelerated smoothed particle finite element method for large deformation analysis in geomechanics. Computers and Geotechnics, 129, p.103856. DOI: https://doi.org/10.1016/j.compgeo.2020.103856


Introduction
Many problems in geotechnical engineering involve large deformation, e.g., sampling, in-situ penetration testing, pile installation, landslides and debris flows. Large-deformation numerical simulations can provide a crucial perspective for these problems. However, they often pose difficulties for conventional numerical methods, e.g., the finite element method (FEM).
The mesh distortion associated with large deformation may lead to reduction of accuracy or even termination of the numerical calculations.
Many numerical frameworks have been presented to solve large-deformation problems in geomechanics so far, such as the Arbitrary Lagrangian-Eulerian (ALE) method [1], the remeshing and interpolation technique with small strain (RITSS) method [2], smoothed particle hydrodynamics (SPH) [3], the material point method (MPM) [4], the particle finite element method (PFEM) [5], etc. Among them, the PFEM has attracted increasing attention in recent years. It inherits the solid theoretical foundation of FEM, which guarantees accuracy and convergence, whilst it possesses the flexibility of mesh-free methods for large deformation analysis. In the PFEM, the nodes in traditional FEM are treated as Lagrangian particles. The Delaunay triangulation technique is used to re-generate the mesh when it is too distorted due to large deformation. Therefore, the PFEM is actually an updated Lagrangian FEM approach with a remeshing technique to overcome the mesh distortion problem. In the field of geomechanics, the PFEM has been developed and applied to solve many different types of problems, including the ground excavation process [6,7], granular flow [8][9][10], landslides and landslide-generated waves [11][12][13], soil penetration problems [14,15], hydro-mechanical coupled problems [15][16][17], and progressive slope failure [18][19][20].
In the original PFEM, the numerical integration is carried out at Gauss points while the field information is stored at nodes/particles. As a result, during the numerical computation, it is required to transfer information between Gauss points and nodes/particles frequently, which inevitably introduces errors and increases complexity. In view of this, the authors proposed the smoothed particle finite element method (SPFEM) recently [21,22]. In the SPFEM, a strain smoothing technique for node integration is incorporated. Thanks to this node integration technique, all the field variables are computed and stored at nodes/particles, making the method more like a Lagrangian particle-based method. There are also some extra advantages, such as utilization of low order elements without volumetric locking and insensitivity to mesh distortion. The former can assure the computational efficiency while the latter is especially beneficial for large deformation analysis. Very recently, the SPFEM has been extended to solve the fluid dynamics problem [23] and the fluid-structure interaction problem [24], and has been termed 'PFEM with Nodal Integration' in those studies [25].
In recent years, using graphics processing units (GPUs) to achieve high-performance computing has gained popularity rapidly. GPUs were initially designed for three-dimensional image rendering, which is not complex but highly computationally intensive. In 2007, NVIDIA released the compute unified device architecture (CUDA) programming platform, which is designed for general-purpose computing. Since then, GPUs have been successfully used in many scientific fields for high-performance computing (e.g. [26][27][28][29]).
GPUs, which feature many more cores, lower thread-scheduling cost and higher memory bandwidth than ordinary CPUs, are capable of executing computations with thousands of independent threads, operating in Single-Instruction-Multiple-Data (SIMD) mode. Here, SIMD is a terminology used to describe a class of parallel computers with multiple processing elements that perform the same operation on multiple data points simultaneously. In the computational process of PFEM/SPFEM, there are many procedures in which the same calculations need to be performed for all elements or nodes/particles. Meanwhile, data dependence is local and usually only extends to a few surrounding elements or nodes. Therefore, most calculations can be distributed into independent GPU threads and executed in parallel. Compared with the PFEM, the computation process of SPFEM is much simplified due to the adoption of the node integration scheme. The computation process of the explicit SPFEM [22] is especially simple because the accumulation of the global stiffness matrix in the implicit SPFEM is avoided. These simplifications can undoubtedly greatly facilitate GPU parallel computing.
However, to the best of the authors' knowledge, GPU parallel computing for PFEM has not been reported yet.
This study aims to develop a GPU-accelerated SPFEM for large deformation analysis in geomechanics based on CUDA, with focus on the explicit version of SPFEM [22]. The explicit SPFEM is first extended to three-dimensional cases. Then the GPU parallel computing strategy is presented after a brief review of the basic theory of SPFEM. To reduce the memory consumption during the GPU parallel computing, which is crucial for increasing the scale of problems analysed, new computation formulations related to the strain smoothing technique are used. Several benchmark examples are solved with sequential CPU simulations and GPU-accelerated simulations respectively. The speedup of the presented GPU-accelerated SPFEM is carefully evaluated.

Strain smoothing technique for node integration
In PFEM, the problem domain is discretized into a set of Lagrangian particles, which carry mass, velocity and other state variables (e.g. stress, hardening parameters related to constitutive models). The FEM computation mesh is generated based on this set of Lagrangian particles using the Delaunay triangulation technique combined with the alpha shape algorithm.
Thus, nodes in traditional FEM correspond to particles in PFEM. The terms 'node' and 'particle' can be used interchangeably in PFEM.
In SPFEM, the strain smoothing technique for node integration is incorporated. The first step of this technique is to construct smoothing cells associated with particles. In two-dimensional cases, the smoothing cell associated with particle k is created by connecting sequentially the mid-edge points to the centroid points of the surrounding triangular elements of the particle. This approach can be easily extended to three-dimensional cases, as shown in Fig. 1. Note that it is not required to explicitly calculate the specific geometries during the computation process, which will be shown in the following.
With this approach, the whole problem domain is discretized into finite strain smoothing cells associated with particles. Thus, all the integration over the whole problem domain can be performed by accumulating over the smoothing cells associated with particles. Furthermore, all the field variables can be calculated in these smoothing cells. In SPFEM, each particle represents a smoothing cell associated with it, which makes the SPFEM more like a particle-based method.
In the smoothed strain technique, the volume of the smoothing cell associated with node k is calculated by

$$V_k = \frac{1}{4}\sum_{i=1}^{E_k} V_i \tag{1}$$

where $E_k$ is the number of tetrahedral elements related to node k and $V_i$ is the volume of the ith element. The smoothed strain matrix is calculated by

$$\tilde{\mathbf{B}}_k = \frac{1}{4V_k}\sum_{i=1}^{E_k} V_i \mathbf{B}_i \tag{2}$$

where $\mathbf{B}_i$ is the strain gradient matrix used in the standard FEM for the ith element. It can be seen that the smoothed strain is actually the average of the strains of the elements related to node k weighted by volume. For more details about the strain smoothing technique, readers are suggested to refer to [31,32].
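As an illustration, the nodal smoothing-cell volume of Eq. (1) can be accumulated on the GPU by letting each tetrahedral element scatter a quarter of its volume to its four nodes. The following CUDA sketch assumes a flat array of four node indices per element; all names are illustrative, not the authors' implementation. Note that atomicAdd on double values requires a GPU of compute capability 6.0 or higher.

```cuda
#include <cuda_runtime.h>

// Eq. (1): every tetrahedral element scatters a quarter of its volume V_i
// to each of its four nodes, accumulating the smoothing-cell volume V_k.
// 'elemNodes' holds four node indices per element (illustrative layout).
__global__ void nodalVolumeKernel(int numElems,
                                  const int*    __restrict__ elemNodes, // [4*numElems]
                                  const double* __restrict__ elemVol,   // [numElems]
                                  double*       __restrict__ nodeVol)   // [numNodes], zero-initialised
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElems) return;
    double quarter = 0.25 * elemVol[e];
    for (int a = 0; a < 4; ++a)
        atomicAdd(&nodeVol[elemNodes[4 * e + a]], quarter); // concurrent scatter
}
```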

Computational formulations of SPFEM
Similar to the FEM, there are two numerical solution strategies of SPFEM with different time integration schemes: the implicit SPFEM [21] and the explicit SPFEM [22]. In this study, the explicit SPFEM is considered, which has simpler formulations and thus is more suitable for GPU parallel computing.
In the computation domain, the motion of a continuum can be described as

$$\rho \mathbf{a} = \nabla \cdot \boldsymbol{\sigma} + \rho \mathbf{b} \tag{3}$$

where $\rho$ is the material density, $\mathbf{a}$ is the acceleration, $\boldsymbol{\sigma}$ is the Cauchy stress tensor and $\mathbf{b}$ is the specific body force density. Considering the principle of virtual displacement and the divergence theorem, the weak form is expressed as

$$\int_\Omega \rho \mathbf{a} \cdot \delta\mathbf{u} \, d\Omega + \int_\Omega \boldsymbol{\sigma} : \delta\boldsymbol{\varepsilon} \, d\Omega = \int_\Omega \rho \mathbf{b} \cdot \delta\mathbf{u} \, d\Omega + \int_S \boldsymbol{\tau}_S \cdot \delta\mathbf{u} \, dS \tag{4}$$

where $\mathbf{u}$ is the displacement vector, $\Omega$ represents the configuration domain, $S$ represents the boundary, and $\boldsymbol{\tau}_S$ is the prescribed traction. After the node-based discretization, the above equation reads

$$\sum_{k=1}^{T}\left(m_k \mathbf{a}_k - \mathbf{f}_k^{ext} + \mathbf{f}_k^{int}\right)\cdot\delta\mathbf{u}_k = \mathbf{0} \tag{5}$$

where T is the total number of nodes in the computation domain. Since the virtual displacements are arbitrary, Eq. (5) leads to

$$\mathbf{M}\mathbf{a} = \mathbf{F}^{ext} - \mathbf{F}^{int} \tag{6}$$

in which

$$\mathbf{F}^{ext} = \int_\Omega \mathbf{N}^T \rho \mathbf{b} \, d\Omega + \int_S \mathbf{N}^T \boldsymbol{\tau}_S \, dS \tag{7}$$

$$\mathbf{F}^{int} = \sum_{k=1}^{T} V_k \tilde{\mathbf{B}}_k^T \tilde{\boldsymbol{\sigma}}_k \tag{8}$$

where $\mathbf{F}^{ext}$ and $\mathbf{F}^{int}$ are termed as the external and internal forces, respectively, and $\mathbf{M}$ is the diagonal mass matrix.
A typical computational cycle of SPFEM is shown in Algorithm 1. More details are available in Refs. [21,22].
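To make the later workload discussion easier to follow, the cycle can be summarised as a host-side outline. This is only a schematic reading of Algorithm 1: the step-to-task mapping is inferred from the workload discussion later in this paper, and the body is a commented outline rather than the authors' code.

```cuda
#include <cuda_runtime.h>

// Schematic of one explicit SPFEM computational cycle (cf. Algorithm 1).
// The step-to-task mapping below is inferred from the workload discussion
// in this paper; the body is a commented outline, not the authors' code.
void spfemCycle()
{
    // Step 1 (CPU): if the mesh is too distorted, rebuild it by Delaunay
    //               triangulation (CGAL) plus alpha-shape boundary recognition.
    // Step 2 (GPU): rebuild the node-element and node-node adjacency lists.
    // Step 3 (GPU): compute the smoothed strains at the nodes, Eq. (18).
    // Step 4 (GPU): update the nodal stresses via the constitutive model.
    // Step 5 (GPU): accumulate internal/external forces, Eq. (21), and
    //               integrate accelerations, velocities and positions.
    // Step 6 (GPU): apply boundary conditions and contact forces.
    // Finally, evaluate the adaptive time step dt = alpha * l_min / c, Eq. (10).
}
```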

Adaptive critical time step size
In utilizing the explicit time integration scheme, the incremental time step size should satisfy the Courant stability restriction to ensure numerical stability. The incremental time step size is generally calculated by

$$\Delta t = \alpha \, \Delta t_{cr}, \qquad \Delta t_{cr} = \frac{l_{min}}{c} \tag{10}$$

where $\Delta t_{cr}$ is the critical time step size, $l_{min}$ is the characteristic length of the mesh, $c$ is the sound speed in the material and $\alpha$ is a scale factor. As the characteristic length of the mesh varies during the computation, an adaptive time step size is realized.
The determination of time step sizes greatly influences the computation cost of a numerical method using explicit time integration. In the original PFEM, the determination of time step sizes is the same as the standard FEM.
For example, the characteristic length of the mesh in Eq. (10) can be taken to be the minimum radius of spheres inscribed in the tetrahedral elements [33].
As stated in [34], the Delaunay triangulation and the triangulation satisfying the max-min angle criterion are identical in two-dimensional cases, while in three-dimensional cases they are no longer identical in general and the Delaunay triangulation in three-dimensional cases does not seem to satisfy any optimal angle condition. Actually, many so-called sliver elements may occur after Delaunay triangulation [35]. As a result, with the criterion of the minimum radius of spheres inscribed in the tetrahedrons, the time step size may reduce by several orders of magnitude during the computation [33].
In this study, as the node integration technique of strain smoothing is adopted, the integration domain for each node is the smoothing cell associated with this node. The smoothing cells generally have a more regular shape (see Fig. 1) and their volumes are generally much larger than those of the sliver elements, because their volumes are the summation of the volumes of elements related to the nodes divided by four (see Eq. (1)). The characteristic length of a smoothing cell can be obtained in analogy to an element in the standard FEM. In this study, one half of the minimum distance between the node and its related nodes is used. This value is generally much larger than the value determined by the criterion of the minimum radius of spheres inscribed in the tetrahedrons, and thus the computational efficiency can be greatly increased. This can be considered as a newly discovered advantage of incorporating the strain smoothing technique for node integration into an explicit PFEM.
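A minimal CUDA sketch of this adaptive time step criterion is given below, assuming the node-to-node adjacency is stored in a CSR-like layout (nbrStart/nbrList are hypothetical names). Each thread computes the per-node time step from half the minimum neighbour distance; a global min-reduction over dtNode (e.g. with thrust::reduce) then yields the time step of the next increment.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <math.h>

// Per-node critical time step (Eq. (10)): the characteristic length is half
// the minimum distance between node k and its related nodes. nbrStart/nbrList
// form a CSR-like adjacency (hypothetical layout).
__global__ void nodalTimeStepKernel(int numNodes,
                                    const double3* __restrict__ x,        // node positions
                                    const int*     __restrict__ nbrStart, // [numNodes+1]
                                    const int*     __restrict__ nbrList,  // related node indices
                                    double c, double alpha,               // sound speed, scale factor
                                    double*        __restrict__ dtNode)   // per-node time step
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numNodes) return;
    double dmin = DBL_MAX;
    for (int p = nbrStart[k]; p < nbrStart[k + 1]; ++p) {
        double3 d = x[nbrList[p]];
        double dx = d.x - x[k].x, dy = d.y - x[k].y, dz = d.z - x[k].z;
        dmin = fmin(dmin, sqrt(dx * dx + dy * dy + dz * dz));
    }
    dtNode[k] = alpha * (0.5 * dmin) / c;  // dt = alpha * l_min / c
}
```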

Contact strategies in the explicit SPFEM
In the explicit SPFEM, as the solution strategy is similar to that of the explicit FEM, the contact algorithm in the explicit FEM can be easily incorporated. Specifically, the slave node to master segment contact scheme is utilized. A simple illustration is shown in Fig. 2. The frictional contact condition can be written as

$$g_n \ge 0, \quad \lambda_n \ge 0, \quad g_n \lambda_n = 0, \quad |\lambda_t| \le \mu_f \lambda_n \tag{11}$$

where $g_n$ is the gap between node and segment, $\lambda_n$ is the normal contact force, which is positive in compression, $\lambda_t$ is the tangential contact force, and $\mu_f$ is the friction coefficient.
The penalty method is utilized to fulfil the frictional contact condition.
For the explicit time integration scheme, when a contact pair is detected, i.e. $g_n < 0$, the contact forces can be calculated as

$$\mathbf{R}_{cs} = \boldsymbol{\lambda} \quad \text{for the slave node}, \qquad \mathbf{R}_{cm,i} = -N_i \boldsymbol{\lambda} \quad \text{for the master nodes} \tag{12}$$

where $\boldsymbol{\lambda}$ is the vector of contact force, i is the node number of the contact segment and $N_i$ is the corresponding shape function of the contact segment. For the vector of contact force $\boldsymbol{\lambda}$, the normal part is obtained from the penalty method as $\lambda_n = -\kappa g_n$, with $\kappa$ the penalty parameter, while the tangential part is obtained as $\lambda_t = \min(-\kappa g_t, \mu_f \lambda_n)$ when the Coulomb friction model is used. The above contact forces are added into $\mathbf{F}^{ext}$ in Eq. (6) and then the accelerations, velocities and positions of nodes can be updated correspondingly.
In case that the node to rigid wall contact is considered, the numerical implementation can be further simplified. When the penetration between node k and a rigid wall occurs, i.e. $g_n < 0$, the acceleration and velocity of node k can be modified as follows to make the normal penetration $g_n$ zero:

$$\mathbf{a}_k^{new} = \mathbf{a}_k - (\mathbf{a}_k \cdot \mathbf{n})\,\mathbf{n}, \qquad \mathbf{v}_k^{new} = \mathbf{v}_k - (\mathbf{v}_k \cdot \mathbf{n})\,\mathbf{n} \tag{13}$$

where $\mathbf{n}$ is the unit normal vector of the rigid wall. When the Coulomb friction model is used, the tangential contact force can be obtained as

$$\lambda_t = \min\left(\frac{m_k \|\mathbf{v}_t\|}{\Delta t},\ \mu_f \lambda_n\right) \tag{14}$$

in which $m_k$ is the mass of node k, $\Delta t$ is the time increment, $\mathbf{v}_t$ is the tangential velocity of node k, and the normal contact force $\lambda_n$ can be calculated by

$$\lambda_n = -\frac{m_k v_n^{old}}{\Delta t} \tag{15}$$

where $v_n^{old}$ is the normal velocity of node k before modification.
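The rigid-wall treatment of Eqs. (13)-(15) maps naturally to one thread per node. The sketch below is an illustration under stated assumptions (flat arrays, a single planar wall with a precomputed signed gap per node; all names are hypothetical): it removes the normal velocity and acceleration components and applies a Coulomb-limited tangential velocity correction.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Node-to-rigid-wall contact (Eqs. (13)-(15)): for a penetrating node
// (g_n < 0) the normal velocity/acceleration components are removed and a
// Coulomb-limited tangential velocity correction is applied. gap[] and the
// wall normal n are assumed precomputed for a single planar wall.
__global__ void rigidWallContactKernel(int numNodes,
                                       const double* __restrict__ gap,   // signed gap g_n per node
                                       double3 n,                        // unit wall normal
                                       double mu, double dt,
                                       const double* __restrict__ mass,
                                       double3* v, double3* a)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numNodes || gap[k] >= 0.0) return;

    double vn = v[k].x * n.x + v[k].y * n.y + v[k].z * n.z;  // v_n^old
    double an = a[k].x * n.x + a[k].y * n.y + a[k].z * n.z;
    // Eq. (13): remove the normal components.
    v[k].x -= vn * n.x; v[k].y -= vn * n.y; v[k].z -= vn * n.z;
    a[k].x -= an * n.x; a[k].y -= an * n.y; a[k].z -= an * n.z;

    // Eq. (15): normal contact force from the change of momentum.
    double lambda_n = -mass[k] * vn / dt;            // positive in compression
    // Eq. (14): Coulomb-limited tangential force, opposing the sliding.
    double vt = sqrt(v[k].x * v[k].x + v[k].y * v[k].y + v[k].z * v[k].z);
    if (vt > 0.0 && lambda_n > 0.0) {
        double lambda_t = fmin(mass[k] * vt / dt, mu * lambda_n);
        double s = lambda_t * dt / (mass[k] * vt);   // fraction of v_t removed
        v[k].x -= s * v[k].x; v[k].y -= s * v[k].y; v[k].z -= s * v[k].z;
    }
}
```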

Mesh rebuilding technique in the explicit SPFEM
An important part of SPFEM is the mesh rebuilding technique to avoid mesh distortion when large deformations occur in the continuum domain.
The mesh rebuilding technique in the explicit SPFEM is basically the same as that in the original PFEM [5], which is generally based on the Delaunay triangulation technique combined with the alpha shape algorithm.
When the mesh rebuilding is required, the Delaunay triangulation technique, which is widely used in FEM mesh generation [36], is first performed to rebuild the element connectivity on the basis of a cloud of nodes. Another task of mesh rebuilding is the identification of the computational domain. There is generally no unique solution to this problem. The solution originally proposed by Idelsohn [37] and subsequently adopted as a standard feature of the PFEM is to use the alpha-shape method previously developed by Edelsbrunner [38] for computer graphics applications. The basic principle of the alpha-shape method is that, for a cloud of points with a characteristic spacing h and a predefined value of the parameter α, all nodes on an empty sphere with a radius greater than αh are considered as boundary nodes.
When the alpha-shape method is combined with the Delaunay triangulation technique, a simple two-step algorithm for mesh rebuilding is available [39]: First, Delaunay triangulation is performed to generate the convex domain on the basis of a cloud of nodes, and thus a mesh consisting of tetrahedrons is obtained in three-dimensional cases. Then, the radius of the circumsphere of each tetrahedron is examined and the tetrahedrons with a radius greater than αh are deleted; the remaining tetrahedrons correspond to the computational domain.
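The circumsphere test can be coded compactly by solving the 3x3 linear system that defines the circumcentre. The following host-side sketch (run on the CPU alongside the CGAL triangulation) uses Cramer's rule; Vec3 and the function names are illustrative, not part of any library.

```cuda
#include <math.h>

// Alpha-shape test after Delaunay triangulation (CPU side): a tetrahedron
// is kept only if its circumsphere radius does not exceed alpha * h.
// The circumcentre cc solves 2*(p - a) . cc = |p|^2 - |a|^2 for p in {b,c,d};
// Cramer's rule is applied to the resulting 3x3 system.
struct Vec3 { double x, y, z; };

static double sq(const Vec3& p) { return p.x * p.x + p.y * p.y + p.z * p.z; }

double circumradius(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& d)
{
    double M[3][3] = {{b.x - a.x, b.y - a.y, b.z - a.z},
                      {c.x - a.x, c.y - a.y, c.z - a.z},
                      {d.x - a.x, d.y - a.y, d.z - a.z}};
    double r[3] = {0.5 * (sq(b) - sq(a)), 0.5 * (sq(c) - sq(a)), 0.5 * (sq(d) - sq(a))};
    double det = M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
               - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
               + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]);
    Vec3 cc;
    cc.x = (r[0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (r[1] * M[2][2] - M[1][2] * r[2])
          + M[0][2] * (r[1] * M[2][1] - M[1][1] * r[2])) / det;
    cc.y = (M[0][0] * (r[1] * M[2][2] - M[1][2] * r[2])
          - r[0]    * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * r[2] - r[1] * M[2][0])) / det;
    cc.z = (M[0][0] * (M[1][1] * r[2] - r[1] * M[2][1])
          - M[0][1] * (M[1][0] * r[2] - r[1] * M[2][0])
          + r[0]    * (M[1][0] * M[2][1] - M[1][1] * M[2][0])) / det;
    double dx = cc.x - a.x, dy = cc.y - a.y, dz = cc.z - a.z;
    return sqrt(dx * dx + dy * dy + dz * dz);  // keep element iff <= alpha * h
}
```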
Note that in the SPFEM, thanks to the node integration technique of strain smoothing, a more distorted mesh can be used without serious loss of accuracy, i.e. the requirement on mesh quality is not as strict as in the original PFEM. Nevertheless, a mesh smoothing technique is utilized here to improve the mesh quality. To be specific, the Laplacian smoothing technique [40,41] is applied to a few nodes when they are too close to their neighbours. Some other mesh smoothing techniques could also be used (e.g. [33,42]); however, we choose the Laplacian smoothing technique due to its simplicity and high efficiency. Moreover, with the Laplacian smoothing technique, the critical time step size can be increased remarkably, which is useful for improving the computational efficiency.
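A sketch of the Laplacian smoothing step is shown below: a flagged node is relaxed towards the average position of its related nodes. The flag array, the relaxation factor omega and the double-buffered positions are assumptions of this illustration; boundary nodes should be excluded from the flagged set so that the domain does not shrink.

```cuda
#include <cuda_runtime.h>

// Laplacian smoothing: a flagged node is relaxed towards the average
// position of its related nodes (double-buffered positions, relaxation
// factor omega < 1). Boundary nodes must not be flagged. Array names are
// illustrative, reusing the CSR-like adjacency of the earlier sketches.
__global__ void laplacianSmoothKernel(int numNodes,
                                      const unsigned char* __restrict__ flagged,
                                      const int* __restrict__ nbrStart,
                                      const int* __restrict__ nbrList,
                                      const double3* __restrict__ xOld,
                                      double3* __restrict__ xNew,
                                      double omega)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numNodes) return;
    xNew[k] = xOld[k];                       // default: keep the old position
    int n = nbrStart[k + 1] - nbrStart[k];
    if (!flagged[k] || n == 0) return;
    double cx = 0.0, cy = 0.0, cz = 0.0;     // centroid of related nodes
    for (int p = nbrStart[k]; p < nbrStart[k + 1]; ++p) {
        cx += xOld[nbrList[p]].x; cy += xOld[nbrList[p]].y; cz += xOld[nbrList[p]].z;
    }
    xNew[k].x = (1.0 - omega) * xOld[k].x + omega * cx / n;
    xNew[k].y = (1.0 - omega) * xOld[k].y + omega * cy / n;
    xNew[k].z = (1.0 - omega) * xOld[k].z + omega * cz / n;
}
```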
It should be pointed out that in some PFEM simulations, new particles were introduced during the simulation to further improve the mesh quality (e.g. [14,43,44]). However, for the present approach, it seems that introducing new particles is not necessary and thus no new particle is introduced during the simulation for simplicity.

GPU acceleration scheme
The GPU parallel computing is implemented based on the Compute Unified Device Architecture (CUDA), a toolkit developed by NVIDIA to perform parallel computing using their graphics cards. The parallelisation scheme is shown in Fig. 3. The whole computation consists of a CPU part and a GPU part. The CPU part works as a process controller, which is used for loading and saving data, initializing GPU data, and controlling the GPU computing kernels to finish designated computation tasks, while the GPU part executes the actual computing.
Ideally, all the steps are expected to be parallelised on GPU to maximise the speedup of GPU parallel computing. However, in the SPFEM, the Delaunay triangulation technique is required to re-generate the computation mesh. In this study, a state-of-the-art library for Delaunay triangulation, CGAL [45], is adopted to obtain a robust Delaunay triangulation. Hence, the Delaunay triangulation is executed on the CPU, while all the other steps are executed on the GPU.

New computation formulations related to the strain smoothing technique

The memory space of a GPU is limited and not as easy to extend as that of a CPU. Meanwhile, reducing memory consumption is also meaningful for improving efficiency. As the global memory used to store the smoothed strain matrices is a high-latency memory, accesses to the global memory are time-consuming and should be reduced as much as possible to improve the efficiency.
Let us start from the original formulation. The smoothed strain of a node k is calculated by

$$\tilde{\boldsymbol{\varepsilon}}_k = \tilde{\mathbf{B}}_k \mathbf{u}_k \tag{16}$$

where $\mathbf{u}_k$ is the vector assembled by the displacements of all the nodes in the smoothing cell associated with node k. By substituting Eq. (2) into Eq. (16), the smoothed strain of node k is calculated as

$$\tilde{\boldsymbol{\varepsilon}}_k = \frac{1}{4V_k}\sum_{i=1}^{E_k} V_i \mathbf{B}_i \mathbf{u}_k \tag{17}$$

with $\mathbf{B}_i$ expanded to the degrees of freedom of the smoothing cell. When calculating the smoothed strains of nodes with the original formulation (Eq. (17)), parallelization over nodes is performed. In the thread of a specific node k, looping over the elements related to this node is performed.
Noting the equivalence between the vector assembled by the displacements of the nodes related to node k and that assembled by the displacements of the elements related to node k, the above equation can be re-written as

$$\tilde{\boldsymbol{\varepsilon}}_k = \frac{1}{4V_k}\sum_{j=1}^{E_k} V_j \mathbf{B}_j \mathbf{u}_j \tag{18}$$

where $E_k$ is the number of elements related to node k, $V_j$ is the volume of element j, $\mathbf{B}_j$ is the strain matrix of element j in the standard FEM, and $\mathbf{u}_j$ is the vector assembled by the displacements of element j. Please note the difference between the original formulation (Eq. (17)) and the new formulation (Eq. (18)): in Eq. (18), each term involves only the displacements of a single element. With the new formulation (Eq. (18)), parallelization over elements can be easily performed to obtain the smoothed strains of all the nodes.
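A possible CUDA realisation of Eq. (18) splits the work into an element-parallel scatter followed by a node-parallel normalisation. The per-element strains B_j u_j (6 Voigt components) are assumed precomputed; the array layouts and kernel names are illustrative. As with the earlier sketch, atomicAdd on doubles requires compute capability 6.0 or higher.

```cuda
#include <cuda_runtime.h>

// Pass 1 (Eq. (18), element-parallel): scatter the volume-weighted element
// strain (V_j / 4) * (B_j u_j) to the element's four nodes.
__global__ void scatterStrainKernel(int numElems,
                                    const int*    __restrict__ elemNodes,  // [4*numElems]
                                    const double* __restrict__ elemVol,    // [numElems]
                                    const double* __restrict__ elemStrain, // [6*numElems], B_j u_j
                                    double*       __restrict__ nodeStrain) // [6*numNodes], zeroed
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElems) return;
    double w = 0.25 * elemVol[e];
    for (int a = 0; a < 4; ++a) {
        int k = elemNodes[4 * e + a];
        for (int c = 0; c < 6; ++c)
            atomicAdd(&nodeStrain[6 * k + c], w * elemStrain[6 * e + c]);
    }
}

// Pass 2 (node-parallel): apply the 1/V_k factor of Eq. (18).
__global__ void normaliseStrainKernel(int numNodes,
                                      const double* __restrict__ nodeVol,
                                      double*       __restrict__ nodeStrain)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= numNodes) return;
    for (int c = 0; c < 6; ++c)
        nodeStrain[6 * k + c] /= nodeVol[k];
}
```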
The new formulation for the calculation of internal forces related to the strain smoothing technique is a bit more complicated. In the original formulation of SPFEM, the internal forces applied by the smoothing cell of node k on its neighbour nodes can be calculated as

$$\mathbf{F}^{int}_{km} = V_k \left(\tilde{\mathbf{B}}_k^m\right)^T \tilde{\boldsymbol{\sigma}}_k, \qquad m = 1, \ldots, N_k \tag{19}$$

where $N_k$ is the number of the neighbour nodes of node k and $\tilde{\mathbf{B}}_k^m$ is the sub-matrix of $\tilde{\mathbf{B}}_k$ related to node m. By substituting Eq. (2) into Eq. (19), it becomes

$$\mathbf{F}^{int}_{km} = \frac{1}{4}\sum_{i=1}^{E_k} V_i \left(\mathbf{B}_i^m\right)^T \tilde{\boldsymbol{\sigma}}_k \tag{20}$$

where $E_k$ is the number of elements related to node k. Since the two summations in Eq. (20) are both carried out on the elements related to node k, Eq. (20) can be re-written element-wise: the contribution of element j, associated with the smoothing cell of node k, to its own node m is

$$\mathbf{F}^{int,jk}_{m} = \frac{V_j}{4}\left(\mathbf{B}_j^m\right)^T \tilde{\boldsymbol{\sigma}}_k \tag{21}$$

in which $\mathbf{B}_j^m$ is the strain matrix related to node m of element j. Let us make a comparison between the original formulations (Eqs. (17) and (19)) and the new ones (Eqs. (18) and (21)) from the perspective of GPU memory consumption. In three-dimensional cases, the strain matrix of node k related to a specific node consists of 18 floating-point numbers.
Hence, using the original formulation, a total of (N_k + 1) × 18 floating-point numbers should be stored to obtain the smoothed strain matrix of a particular node k. As the number of neighbour nodes N_k of a specific node is undetermined, it is sensible to allocate a memory of equal size which is slightly larger than needed. As stated above, the possible maximum value of N_k in three-dimensional cases is much larger than that in two-dimensional cases. In this study, N_k is taken to be 64 to guarantee robustness. Bear in mind that the smoothed strains and internal forces of all the nodes are calculated in a parallelised way. As a result, M × (N_k + 1) × 18 floating-point numbers should be saved during the computation, where M is the concurrent thread number in the GPU, which generally ranges from several thousands to tens of thousands for a common GPU. One may save memory by only saving the derivatives of shape functions instead of the strain matrix. This approach will surely reduce the memory requirement, but it is not an essential improvement. However, with the new formulations, if the strain matrices of all the elements are calculated and stored beforehand, which is sensible to avoid unnecessarily repeated computation, only the strain matrix of a single element is needed in each thread, so the memory consumption is greatly reduced. The computation procedure is also simplified. With the original formulation, parallelization over nodes is performed: in the thread of a specific node k, looping over the elements related to node k is required, and the identification of the sequential number of node k in its surrounding elements is also required. On the contrary, with the new formulation, parallelization over elements is performed with a rather simple procedure for each thread.
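Under the same assumptions as the strain sketch, the element-parallel internal-force accumulation of Eq. (21) can be sketched as follows. Each element sums the smoothed stresses of its four nodes once, then scatters (V_j/4)(B_j^m)^T times that sum to each of its nodes m with atomic additions; the constant shape-function gradients of the linear tetrahedron (hypothetical array dNdx) stand in for B_j^m.

```cuda
#include <cuda_runtime.h>

// Element-parallel internal forces (Eq. (21)). Voigt stress order assumed:
// xx, yy, zz, xy, yz, zx. dNdx holds dN_m/dx, dN_m/dy, dN_m/dz per node m
// of each element; all array layouts are illustrative.
__global__ void internalForceKernel(int numElems,
                                    const int*    __restrict__ elemNodes,  // [4*numElems]
                                    const double* __restrict__ elemVol,    // [numElems]
                                    const double* __restrict__ dNdx,       // [numElems*4*3]
                                    const double* __restrict__ nodeStress, // [6*numNodes], smoothed
                                    double*       __restrict__ nodeForce)  // [3*numNodes], zeroed
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElems) return;

    // Sum of the smoothed stresses of the element's four nodes.
    double s[6] = {0, 0, 0, 0, 0, 0};
    for (int a = 0; a < 4; ++a) {
        int k = elemNodes[4 * e + a];
        for (int c = 0; c < 6; ++c) s[c] += nodeStress[6 * k + c];
    }

    double w = 0.25 * elemVol[e];                       // V_j / 4
    for (int m = 0; m < 4; ++m) {                       // force recipients
        int node = elemNodes[4 * e + m];
        const double* g = &dNdx[(e * 4 + m) * 3];
        // f = w * (B_j^m)^T s for the linear tetrahedron.
        double fx = w * (g[0] * s[0] + g[1] * s[3] + g[2] * s[5]);
        double fy = w * (g[1] * s[1] + g[0] * s[3] + g[2] * s[4]);
        double fz = w * (g[2] * s[2] + g[1] * s[4] + g[0] * s[5]);
        atomicAdd(&nodeForce[3 * node + 0], fx);
        atomicAdd(&nodeForce[3 * node + 1], fy);
        atomicAdd(&nodeForce[3 * node + 2], fz);
    }
}
```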

Parallelisation for getting the indices of elements and nodes related to nodes
Another crucial step in the SPFEM is to get the indices of elements and nodes related to all the nodes, although this step is not so time-consuming in the entire simulation. This step should be performed in Step 2 in Algorithm 1, before the calculation of smoothed strains. A naive method is to loop over all the elements for each node to determine if the element is related to this node. However, this method is both time-consuming and memory-consuming. A more efficient algorithm is used instead in this study.
As shown in Fig. 4, a kernel function in GPU is executed to perform the same calculation for all the elements with parallelised GPU threads. For element i, its index i is recorded in the list of each of the four nodes of the element.
A key step is to accumulate the count of related elements for each node while multiple threads write concurrently. The so-called atomic operations are used to do this. As the sequence of the related elements for each node does not matter in the SPFEM, the desired results can be obtained by the parallelisation over elements with GPU threads.
It is more straightforward to get the indices of nodes related to nodes once the indices of elements related to nodes are obtained. Parallelisation over nodes with GPU threads can be performed without the so-called race condition. For each node, all the nodes of the elements related to this node are picked out and the unique indices are recorded as the indices of nodes related to nodes.
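A two-pass, element-parallel construction of the node-to-element lists is sketched below: the first kernel counts the elements per node with atomic increments, an exclusive prefix sum (e.g. thrust::exclusive_scan) converts the counts into list offsets, and the second kernel writes the element indices through per-node cursors. All array names are illustrative.

```cuda
#include <cuda_runtime.h>

// Pass 1: count how many elements touch each node.
__global__ void countElemsPerNode(int numElems,
                                  const int* __restrict__ elemNodes, // [4*numElems]
                                  int*       __restrict__ count)     // [numNodes], zeroed
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElems) return;
    for (int a = 0; a < 4; ++a)
        atomicAdd(&count[elemNodes[4 * e + a]], 1);
}

// (Host) exclusive prefix sum of 'count' gives 'start'; 'cursor' is a copy.

// Pass 2: write each element index into the slot ranges of its four nodes.
// The order within a node's list is irrelevant for the SPFEM, so an atomic
// cursor increment suffices and no sorting is needed.
__global__ void fillElemLists(int numElems,
                              const int* __restrict__ elemNodes,
                              int*       __restrict__ cursor,    // init. to 'start'
                              int*       __restrict__ nodeElems) // concatenated lists
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElems) return;
    for (int a = 0; a < 4; ++a) {
        int k = elemNodes[4 * e + a];
        int slot = atomicAdd(&cursor[k], 1);
        nodeElems[slot] = e;
    }
}
```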

Numerical examples
To validate the proposed approach and to evaluate the speedup of the GPU parallel computing, three benchmark examples are numerically solved.
The first example is the vibration of an elastic continuum bar. This problem has an analytical solution and is suitable for validating the proposed approach, especially for investigating the new formulations related to the strain smoothing technique in terms of correctness and efficiency. In the second example, the laboratory experiment of soil collapse is simulated. The experiment was carried out by Bui et al. [3] to validate their SPH approach for geomaterials. The third example is the progressive failure of a long strain-softening soil slope. Since the geometry of the problem is complicated during the whole simulation, and a strain-softening constitutive model is used, it is suitable to validate the application capability of the proposed approach on an actual engineering problem. Although simple constitutive models are adopted in these examples, more advanced models describing other specific behaviour of soil could also be incorporated (e.g. [47][48][49][50][51][52][53]). Furthermore, the results of the first example show that the efficiency of single-precision simulation is only slightly higher than that of double-precision simulation.

Axial vibration of an elastic continuum bar
This problem may be solved using separation of variables [54]. The bar is found to oscillate in modes that depend on the initial conditions, and a specific mode can be excited by using initial conditions which are multiples of this natural mode of the system. Here, the first mode is considered, and thus the analytical solution of this problem is given by [54,55]:

$$v(x,t) = v_0 \cos(\omega_1 t)\sin\!\left(\frac{\pi x}{2L}\right), \qquad u(x,t) = \frac{v_0}{\omega_1}\sin(\omega_1 t)\sin\!\left(\frac{\pi x}{2L}\right), \qquad \omega_1 = \frac{\pi}{2L}\sqrt{\frac{E}{\rho}}$$

where $v(x,t)$ is the velocity and $u(x,t)$ is the displacement at position x and time t, L is the bar length, E is the Young's modulus and $\rho$ is the density. The simulations with both the original and the new formulations reproduce this analytical solution, and the comparison shows that the new formulations can not only save memory occupation but also greatly improve computational efficiency. Therefore, the new formulations are adopted throughout all the following simulations due to their superiority.
The computational costs for different meshes with different simulation measures are listed in Table 1. The double-precision CPU simulation with the largest number of nodes is taken as a reference to evaluate the distribution of the workload across the steps of the present approach. As shown in Fig. 8, for this example, most of the computational effort is allocated to Step 3 and Step 5 (please see Algorithm 1), which respectively account for 29.89% and 64.87% of the total workload. The computational effort of Step 1 and Step 2 is minimal because no re-meshing is performed. The computational effort of Step 4 is low because a simple elastic constitutive model is used.
The speedup ratios between the GPU simulation and the CPU simulation are plotted in Fig. 9. The overall speedup ranges from 7.82 to 10.89 for single-precision computation and from 8.40 to 12.30 for double-precision computation. It is clear that with the GPU-accelerated approach, a speedup of around 10 can be achieved. Moreover, the speedup with a finer mesh is generally slightly higher than that with a coarser mesh. As shown in Fig. 9, the performance of the GPU simulation varies for different steps in Algorithm 1. The speedup of Step 4 is the largest because all the threads can be executed without mutual effect. For Step 3 and Step 5, the speedup is smaller, but still considerable. The speedup for Step 6 is the lowest, which may be due to the algorithm complexity related to the boundary condition application and the relatively low computation intensity. Note that since Step 1 is executed on the CPU, no speedup is recorded. Moreover, Step 2 only takes a minimal time even for the finest mesh, which increases the randomness of the measured speedup, so the speedup of Step 2 is also not plotted in Fig. 9.

Laboratory experiment of soil collapse
Bui et al. [3] conducted a simple laboratory experiment of soil collapse to validate their SPH approach for large-deformation analysis of geomaterials. Besides the SPH simulation [3], the experiment was also simulated by MPM [56] and the two-dimensional SPFEM [22], so it can serve as a good benchmark. In the laboratory experiment, the soil continuum was modelled by a large number of small aluminium bars, which is a common approach to mimic plane-strain soil behaviour in laboratory tests.

Before the simulation of soil collapse, the initial stress due to gravity is generated by a dynamic relaxation simulation, in which the bottom boundary is fully fixed and the four side boundaries are constrained in their normal directions. Then the soil collapse is triggered by removing the restraint of the right boundary. A Coulomb contact model between soil and ground is used, in which no tangential relative motion is assumed. A total time of 1.0 s is simulated, after which a stable slope is formed. An adaptive time step size is utilized during the simulation, and the scale factor α in Eq. (10) is taken to be 0.9.
The final surface configurations after collapse obtained with different mesh densities are compared with the experimental result, as shown in Fig. 11(a).
All numerical results show a similar response, and they all match well with the experimental results. With the increase of the number of nodes, the run-out distance increases slightly. The failure line, which is defined as the line between the motionless soil and the moved soil, is also compared against the experimental result, and good agreement can also be found. As shown in Fig. 11(b), the numerical results are compared with the results obtained by other numerical methods (i.e. SPH [3] and MPM [56]), as well as the two-dimensional SPFEM [22]. All the numerical methods capture the experimental result well. The run-out distance obtained in this study is slightly shorter than that obtained by the two-dimensional SPFEM [22]. The reason may be that the nodes in the three-dimensional case interact with more nodes in comparison to two-dimensional cases. The final mesh configuration with the coarsest mesh is chosen to make a visual comparison with the experimental result, as shown in Fig. 12. It is clear that the three-dimensional SPFEM is capable of reproducing the experimental result.
The computational costs of the double-precision GPU simulation and the sequential CPU simulation are listed in Table 2. The latter with the largest number of nodes is taken as a reference to evaluate the distribution of the workload across the steps of the present approach. As shown in Fig. 13, the time-consuming steps are Step 2, Step 3 and Step 5, which respectively account for 19.33%, 20.84% and 36.66% of the total workload, while the computational effort of Step 6 is minimal. The above workload distribution may be considered as a typical case for geotechnical applications.
The speedup ratios of the GPU simulation over the sequential CPU simulation for this example are plotted in Fig. 14.

Progressive failure of a long strain-softening soil slope

The third example is the progressive failure of a long soil slope, modelled with a strain-softening constitutive model. As shown in Fig. 16, the cohesion equals its peak value ($c_p$) at the beginning and decreases proportionally to the equivalent plastic strain invariant defined by

$$\bar{\varepsilon} = \sqrt{\tfrac{2}{3}\,\mathbf{e}^p : \mathbf{e}^p}$$

where $\mathbf{e}^p$ is the deviatoric part of the plastic strain tensor.
When the invariant reaches a certain value ($\bar{\varepsilon}_{pr}$), the cohesion reduces to its residual value ($c_r$). This example, featured by the complicated geometry during the whole simulation process and the strain-softening constitutive model, has been well simulated by the two-dimensional MPM [57], PFEM [18] and SPFEM [13].
For the long strain-softening soil slope, the failure mode, run-out distance and retrogression distance are greatly influenced by the degree of the strength reduction (i.e. $c_r/c_p$). As stated in Ref. [59], lower residual strength corresponds to more retrogressive failures, longer run-out distance and longer retrogression distance. In this study, a relatively severe strength reduction (i.e. $c_r/c_p = 0.2$) is considered to generate multiple retrogressive failures.
It should be pointed out that the utilization of the strain-softening model may result in the solution becoming mesh dependent. This stems from the fact that the related boundary value problem is no longer elliptic in statics or hyperbolic in dynamics [58]. Special regularisation techniques, such as nonlocal theory or gradient plasticity, have to be incorporated to overcome this issue. As this study focuses on the GPU acceleration of SPFEM, the regularisation is not involved.

As shown in Fig. 15, the bottom boundary is fully fixed and the left side boundary is constrained in the x-direction. All the motions along the y-direction are constrained to mimic a plane strain response. The soil is first assumed to be elastic and the initial stress due to gravity is generated by a dynamic relaxation simulation. After the initial stress is generated, the soil plasticity is considered. Since the slope cannot maintain stability with these material parameters, the landslide is triggered. A total of 10 seconds is simulated to reproduce the whole progressive failure process. An adaptive time step size is utilized and the scale factor α in Eq. (10) is taken to be 0.9.
The simulation is completed in around 5.66 hours. The initial vertical stress is shown in Fig. 17(a). The development process of the progressive failure is shown in Fig. 17(b)-(g), with coloured contours representing the equivalent plastic strain invariant $\bar{\varepsilon}$. Many shear bands occur during the whole simulation process due to the adoption of the strain-softening soil constitutive model. As shown in Fig. 17(b), the first major shear band (MS1) initiates from the bottom and propagates towards the top surface. During the sliding, two crossed shear bands (S1, S2) occur in the middle of the landslide body. One (S1) propagates towards the top surface and the other (S2) propagates towards the front inclined surface, resulting in a graben (Fig. 17(c)).
With the advancement of the landslide body, a new major shear band (MS2) initiates from the bottom and propagates towards the top surface due to lack of support, leading to the second retrogressive failure (Fig. 17(d)). Then a new shear band (S3) forms in the middle of the second landslide body, which initiates from the bottom and propagates towards the front inclined surface (Fig. 17(e)). Next, a new shear band (S4) initiates from the bottom of the landslide body and propagates towards the top surface (Fig. 17(f)).
The retrogressive failure finally stops with the help of the friction between the landslide bodies and the ground. Fig. 17(g) shows the final configuration of the slope. Eventually, the retrogressive failure results in a deposit with a run-out distance of 12.0 m and a retrogression distance of 13.6 m ( Fig. 17(g)).
A crucial factor affecting the run-out distance and the retrogression distance is the friction coefficient between soil and ground. To investigate this, a new simulation is conducted with a friction coefficient of 0.1. The comparison of the final configurations of the slope with different friction coefficients is shown in Fig. 18. A smaller friction coefficient corresponds to a larger run-out distance and retrogression distance. The final run-out distance is 21.0 m while the retrogression distance is 18.4 m. Taken together, the GPU-accelerated SPFEM is capable of solving large-scale complicated large-deformation problems in geomechanics.

Conclusions
PFEM is an attractive numerical framework for large deformation analysis in geomechanics. By incorporating the strain smoothing technique for node integration, the authors proposed an improved PFEM, i.e. the SPFEM.
The primary advantage of the SPFEM is that all the field variables are computed and stored at nodes/particles. Some additional advantages include no volumetric locking when low order elements are used, and insensitivity to element distortion. The objectives of the present work are twofold. First, the explicit SPFEM is extended to three-dimensional cases. Then, a GPU-accelerated SPFEM is developed.
A GPU parallelisation strategy for the three-dimensional explicit SPFEM is presented in detail. All the steps are executed on the GPU except that the Delaunay triangulation is executed on the CPU with a state-of-the-art library. With the new computation formulations related to the strain smoothing technique, not only is memory space saved but the computational efficiency is also improved: the computational cost is reduced by about 70% for double-precision GPU parallel computing. The numerical examples demonstrate that the GPU-accelerated SPFEM achieves an overall speedup of around 10 over sequential CPU simulations.