Lmst

#GPUSPH

I've been chasing for a few weeks now a bug where particles in a simulation “germinate” from nowhere, where “nowhere” actually means: two decies in a multi-GPU context suddenly start disagreeing about the velocity/position of a particle, so they each let it evolve its own way.

(Why are particles managed by multiple devices at the same time? Because of a thing called the “halo” in the domain parts that are at the boundary between two different devices.)

I've finally bitten the bullet, and started printing the values each devices see for the specific particle, and indeed something very weird happens: at one point, one of the devices starts to see zeroes instead of the last value they themselves wrote.

This is a relatively well-tested area of the #GPUSPH codebase. While I can't rule out a subtle bug, I'm starting to suspect that the issue may lie elsewhere. And if I'm hitting a hardware issue, I'm not going to be happy …

The number of people who don't know about #GPUSPH within #INGV is too damn high (.jpg).

Memes aside, I've had several opportunities these days to talk with people both within the Osservatorio Etneo and other branches of the Institute, and most of them had no idea something like that was being developed within INGV.

On the one hand, this is understandable, especially for teams that have never had a direct need to even look for #CFD code because of the focus of their research.

On the other hand, this also shows that I should have been much more aggressive with marketing the project internally. (And don't even get me started on who had the actual managerial power to do so before me, but that would put me on a rant that I'd rather avoid for now.)

I'm glad I've finally started working on this aspect, but also I can't say I'm too happy about having to do so.

Hopefully this is something that will help bring mass to it.

Our most recent paper on #SPH / #FEM coupling for offshore structures modeling with #GPUSPH has been published:

https://authors.elsevier.com/c/1m3VB_hNWk2tT

These kinds of works, with validation against experimental results, is always a challenging task, even for the simpler problems. Lab experiments and numerical simulations have each their own set of problems that need to be addressed, and the people working on the two sides of the fence often have a very different perspective on what should be considered trivial and not worth measuring, and what is instead crucial to the success of the experiment.

Getting these two sides to talk to each other successfully is no walk in the park, and I wish to extend my deepest gratitude to Vito Zago, who has gone to incredible lengths both during the “science making” to make things work out, and during the manuscript submission and review process, a nearly Sisyphean task in itself.

#SmoothedParticleHydrodynamics #FiniteElements #FiniteElementMethods

Today I introduced a much-needed feature to #GPUSPH.

Our code supports multi-GPU and even multi-node, so in general if you have a large simulation you'll want to distribute it over all your GPUs using our internal support for it.

However, in some cases, you need to run a battery of simulations and your problem size isn't large enough to justify the use of more than a couple of GPUs for each simulation.

In this case, rather than running the simulations in your set serially (one after the other) using all GPUs for each, you'll want to run them in parallel, potentially even each on a single GPUs.

The idea is to find the next avaialble (set of) GPU(s) and launch a simulation on them while there are still available sets, then wait until a “slot” frees up and start the new one(s) as slots get freed.

Until now, we've been doing this manually by partitioning the set of simulations to do and start them in different shells.

There is actually a very powerful tool to achieve this on the command, line, GNU Parallel. As with all powerful tools, however, this is somewhat cumbersome to configure to get the intended result. And after Doing It Right™ one must remember the invocation magic …

So today I found some time to write a wrapper around GNU Parallel that basically (1) enumerates the available GPUs and (2) appends the appropriate --device command-line option to the invocation of GPUSPH, based on the slot number.

#GPGPU #ParallelComputing #DistributedComputing #GNUParallel

I just realized that in my quest to port #GPUSPH to other POSIX-like OSes, I've never actually tried something like Alpine or other non-glibc Linux system.

Talking about dependencies: one thing we did *not* reimplement in #GPUSPH is rigid body motion. GPUSPH is intended to be code for #CFD, and while I do dream about making it a general-purpose code for #ContinuumMechanics, at the moment anything pertaining solids is “delegated”.

When a (solid) object is added to a test case in GPUSPH, it can be classified as either a “moving” or a “floating” object. The main difference is that a “moving” object is assumed to have a prescribed motion, which effectively means the user has to also define how the object moves, while a “floating” object is assumed to move according to the standard equations of motion, with the forces and torques exerted on the body by the fluid provided by GPUSPH.

For floating objects, we delegate the rigid body motion computation to the well-established simulation engine #ProjectChrono
https://projectchrono.org/

Chrono is a “soft dependency” of GPUSPH: you do not need it to build a generic test case, but you do need it if you want floating objects without having to write the entire rigid body solver yourself.

1/n

#SmoothedParticleHydrodynamics #SPH #ComputationalFluidDynamics

That first implementation didn't even support the multi-GPU and multi-node features of #GPUSPH (could only run on a single GPU), but it paved the way for the full version, that took advantage of the whole infrastructure of GPUSPH in multiple ways.

First of all, we didn't have to worry about how to encode the matrix and its sparseness, because we could compute the coefficients on the fly, and operate with the same neighbors list transversal logic that was used in the rest of the code; this allowed us to minimize memory use and increase code reuse.

Secondly, we gained control on the accuracy of intermediate operations, allowing us to use compensating sums wherever needed.

Thirdly, we could leverage the multi-GPU and multi-node capabilities already present in GPUSPH to distribute computations across all available devices.

And last but not least, we actually found ways to improve the classic #CG and #BiCGSTAB linear solving algorithms to achieve excellent accuracy and convergence even without preconditioners, while making the algorithms themselves more parallel-friendly:

https://doi.org/10.1016/j.jcp.2022.111413

4/n

#LinearAlgebra #NumericalAnalysis

I've just reviewed a manuscript about the recent progresses made to introduce #GPU support in a classic, large #CFD code with existing good support for massive simulations on traditional #HPC settings (CPU clusters).

I'm always fascinated by the stark difference between the kind of work that goes into this process, and what went into the *reverse* process that we followed for #GPUSPH, which was developed for GPU from the start, and was only ported to CPU recently, through the approach described in this paper I'm sure I've already mentioned here:

https://doi.org/10.1002/cpe.8313

When I get to review this kind of articles, I always feel the urge to start a dialogue with the authors about these differences, but that's not really my role as the reviewer, so I have to hold back and limit my comments to what's required for my role.

So I guess you get to read about the stuff I couldn't write in my reviewer comments.

1/n

By Tesler's law of conservation of complexity
https://en.wikipedia.org/wiki/Law_of_conservation_of_complexity
there's a lower bound to which you can reduce complexity. Beyond that, you're only moving complexity from one aspect to another.

In the case of #GPUSPH, this has materialized in the fact that the exponential complexity of variant support has been converted in what is largely a *linear* complexity of interaction functions. You can find an example in my #SPHERIC2019 presentation:
https://www.gpusph.org/presentations/spheric/2019/bilotta-spheric2019/#9.0

Those slides (if you want you can start at the beginning here <https://www.gpusph.org/presentations/spheric/2019/bilotta-spheric2019/>) also give you an idea of what happens to the code. And probably also give you a hint about what the issue is.

10/

I'm not going to claim that we found the perfect balance in #GPUSPH, but one thing I can say is that I often find myself thanking my past self for insisting on pushing for this or that abstraction over more ad hoc solutions, because it has made a lot of later development easier *and* more robust.

AFAIK, our software is the SPH implementation that supports the widest range of SPH formulations. Last time I tried counting how many variants were theoretically possible with all options combinations, it was somewhere in the whereabouts of 9 *billion* variants, taking into account the combinatorial explosion of numerical and physical modeling options —and even if not all of them are actually supported in the code, the actual number is still huge, and the main reason why we switched from trying to compile all of them and then let the user choose whatever they wanted at runtime to forcing the user to make some compile-time choices when defining a test case.

The second point, if you remember what I wrote on the first post of this thread <https://fediscience.org/@giuseppebilotta/114140164838336955> is about the handling of multiple formulations, more complex physics and so on.

This is actually a place where I'm more cautious to embrace the preference expressed in the original thread and the comments, particularly the concerns about abstraction.

#GPUSPH has been in development for over a decade now, and it has grown organically from something *very* hard coded to something with a lot of abstraction. Finding the right balance between the two is really not that simple. There are three issues at hand:

1. how much abstraction is actually needed?
2. what is the cost of introducing an abstraction when it *becomes* needed?
3. what is the cost of introducing the *wrong* abstraction?

Writing generic code without a specific scope is truly dispersive, and can lead to extreme(ly unnecessary) complexity.
OTOH, *not* writing generic code leads to a lot of repetition, which can make composability and bugfixing harder (did you fix that same bug in all occurrences of the code?)

Even now, Thrust as a dependency is one of the main reason why we have a #CUDA backend, a #HIP / #ROCm backend and a pure #CPU backend in #GPUSPH, but not a #SYCL or #OneAPI backend (which would allow us to extend hardware support to #Intel GPUs). <https://doi.org/10.1002/cpe.8313>

This is also one of the reason why we implemented our own #BLAS routines when we introduced the semi-implicit integrator. A side-effect of this choice is that it allowed us to develop the improved #BiCGSTAB that I've had the opportunity to mention before <https://doi.org/10.1016/j.jcp.2022.111413>. Sometimes I do wonder if it would be appropriate to “excorporate” it into its own library for general use, since it's something that would benefit others. OTOH, this one was developed specifically for GPUSPH and it's tightly integrated with the rest of it (including its support for multi-GPU), and refactoring to turn it into a library like cuBLAS is

a. too much effort
b. probably not worth it.

Again, following @eniko's original thread, it's really not that hard to roll your own, and probably less time consuming than trying to wrangle your way through an API that may or may not fit your needs.

I believe our approach to “develop what we need when we need it”, which has been a staple in the development of #GPUSPH, has been a strong point. We *do* depend on a few external dependencies, but most of the code has been developed “in-house”.

Fun fact: the only “hard dependency” for GPUSPH is NVIDIA's Thrust library, which we depend on for particle sorting (and in a few other places, but for optional features, like the segmented reductions used to collect body forces when doing #FSI i.e. #FluidStructureInteraction aka #FluidSolidInteraction).

And this hard dependency has been a *pain*.

We first had issues in the Maxwell architecture era, which stalled our work for months because all simulations would consistently lead to a hardware lock-up
https://github.com/NVIDIA/thrust/issues/742
and a few years later we had another issue —luckily one we could work around within GPUSPH this time:
https://github.com/NVIDIA/thrust/issues/1341

And of course if possible you want to let the user choose arbitrary inter-particle spacings, possibly at runtime. This is e.g. the reason why #GPUSPH has a built-in simple #CSG (#ConstructiveSolidGeometry) system: it's not needed for #SPH, but it allows user to set up test cases even with relatively complex geometries without resorting to external preprocessing stages that wouldn't give the same flexibility in term of resolution choices.

And here's the thing: the CSG in GPUSPH has a *lot* of room for improvement, but it's not really the “core” of GPUSPH: how much developer time should be dedicated to it, especially considering that the number of developers is small, and most have more competences in the mathematical and physical aspects of SPH rather than in CSG?

The end result is that things get implemented “as needed”, which brings me back to the gamedev thread I was mentioning earlier:

One of the objectives with our #GPUSPH model is actually to build a sufficiently detailed 3D model that would allow us to explore these effects and hopefully derive simplified laws that can be applied back to the faster models used for short- and long-term hazard assessment. We're still far from this objective, and it's always a struggle, but we're inching our way forward. And if we can't get there, we'll at least have opened up the path for the next generation willing to tackle the problem.

One thing's for sure: very few things were more misplaced than my childhood fear that there would be nothing to discover when I'd grow up.

14/14

Finally, #GPUSPH has a #Fediverse account: @gpusph

Hello all! This will be the official #GPUSPH account on the Fediverse going forward. What is GPUSPH, you ask? It's a software for #ComputationalFluidDynamics using the #SmoothedParticleHydrodynamics method, accelerated by running entirely* on GPU. In fact, it was the first to do so, leveraging the new GPGPU capabilities offered by NVIDIA CUDA.

(These days we have wider hardware support, but for a long time CUDA was all we supported.)

#introduction #newHere #CFD #HPC

*conditions apply

One of the nice things of the refactoring that I had to do to introduce CPU support is that it also allowed me to trivially had support for #AMD #HIP / #ROCm.
That, and the fact that AMD engineers have written a drop-in replacement for the Thrust library that we depend on in a couple of places. (This is also one of the things that is holding back a full #SYCL port for #GPUSPH, BTW.)

#GPUSPH also supports multi-node execution via #MPI (you can run across multiple GPUs on separate nodes).
This can be used with the new CPU backend too, but it's untested and I expect it to be not particularly efficient, because kernel execution and data transfers are not yet asynchronous in the CPU backend. (It's honestly a low priority task for us, and given that we're short on hands we'll prioritize other things for the time being.)

The result is something that can be trivially parallelized with #OpenMP.
As an alternative, it's possible to use the #multiGPU support in #GPUSPH to run code in parallel. This is obviously less efficient, although it may be a good idea to use it in #NUMA setups (one thread per NUMA node, OpenMP for the cores in the same node). This is not implemented yet.

#GPUSPH

Client Info