For a few weeks now I've been chasing a bug where particles in a simulation “germinate” from nowhere, where “nowhere” actually means: two devices in a multi-GPU context suddenly start disagreeing about the velocity/position of a particle, so they each let it evolve its own way.
(Why are particles managed by multiple devices at the same time? Because of a thing called the “halo” in the domain parts that are at the boundary between two different devices.)
I've finally bitten the bullet and started printing the values each device sees for the specific particle, and indeed something very weird happens: at one point, one of the devices starts to see zeroes instead of the last value it wrote itself.
This is a relatively well-tested area of the #GPUSPH codebase. While I can't rule out a subtle bug, I'm starting to suspect that the issue may lie elsewhere. And if I'm hitting a hardware issue, I'm not going to be happy …