#jiva

2023-11-20

man, my #openebs jiva experience has been a nightmare. it requires manual intervention like every day on my cluster, and the community is just a sea of ignored issues and slack messages. i don't know why the talos team recommends it.

am i the only one who feels this way?

#kubernetes #jiva #talos #sidero

Tero Keski-Valkama (@tero@rukii.net)
2023-07-08

#RukiiNet had a 13-hour outage today because of some #OpenEBS #Jiva #NFS race condition thing again. Took me a long time to understand what exactly the problem was. Deleted some NFS pods which were stuck, and after a few reboots they recreated themselves correctly.
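
For anyone hitting the same thing, the recovery was roughly the pod deletion below. The openebs namespace and the nfs-pvc-* naming are assumptions based on a default Dynamic NFS Provisioner install, so adjust to your setup.

```bash
# List the NFS server pods created by the Dynamic NFS Provisioner
# (default namespace and pod naming assumed).
kubectl -n openebs get pods | grep nfs

# Delete a stuck pod; its owning Deployment recreates it.
kubectl -n openebs delete pod <stuck-nfs-pvc-pod>
```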

Tero Keski-Valkama (@tero@rukii.net)
2023-01-24

#RukiiNet #SelfHosting update:
Just after writing this, #Curie went down again, and it didn't help that the #NFS pods were all on a different node. It all went down regardless.

Even got some data corruption again; it's always a huge manual hassle to bring everything back up. I read somewhere that #MicroK8S tends to be bad with hard reboots if some specific singleton cluster pods, like coredns, calico, the NFS controller or the hostpath provisioner, are on the node which goes down. I wonder if it's possible to just add replicas for those...
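
The idea I'm wondering about is just scaling the stateless singletons up, something like the untested sketch below. CoreDNS looks like the safe one; I have no idea whether the calico controller or the hostpath provisioner tolerate more than one replica.

```bash
# Untested idea: run CoreDNS with two replicas so one node going down
# doesn't take cluster DNS with it (kube-system is where the MicroK8s
# dns addon puts it).
kubectl -n kube-system scale deployment coredns --replicas=2
```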

I found a new (old and known) bug with #OpenEBS, and a mitigation. In some cases, #Jiva has replicas in a readonly state for a moment as it syncs the replicas, and if the moon phase is correct, there's an apparent race condition where the iSCSI mounts become read-only, even though the underlying volume has already become read-write.

The fix is to go to the node which mounted these, run "mount | grep ro,", and ABSOLUTELY UNDER NO CIRCUMSTANCES UNMOUNT (learned the hard way). Instead, I think it's possible to just remount these rw.
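
Concretely, on the affected node it looks something like this; the path is just the kubelet CSI mount path with placeholder IDs:

```bash
# Find mounts that have gone read-only (the "ro," flag in the options).
mount | grep 'ro,'

# Remount the affected volume read-write IN PLACE -- do not unmount it.
mount -o remount,rw \
  /var/snap/microk8s/common/var/lib/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/<pvc-id>/mount
```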

There's also an irritating thing where different pods run their apps with different UIDs, and the Dynamic NFS Provisioner StorageClass needs to be configured to mount the stuff with the same UID. I originally worked around this by just setting chmod 0777, but the apps insist on creating files with a different permission set, so when their files get remounted, the permissions stay but the UID changes, and after a remount they no longer have write access to their own files.

This compounds with the fact that each container runs as its own UID, so each needs its own special StorageClass for that UID... Gods.
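
Roughly what one of those per-UID StorageClasses ends up looking like. Treat it as a sketch: the FilePermissions block and the backend class name are my reading of the Dynamic NFS Provisioner options, so verify them against the version you run.

```bash
# One StorageClass per app UID. The FilePermissions keys (UID/GID/mode) and
# the backend class name are assumptions -- check your provisioner docs.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-rwx-uid-1000
  annotations:
    openebs.io/cas-type: nfsrwx
    cas.openebs.io/config: |
      - name: NFSServerType
        value: kernel
      - name: BackendStorageClass
        value: openebs-jiva-csi-default
      - name: FilePermissions
        data:
          UID: "1000"
          GID: "1000"
          mode: "0770"
provisioner: openebs.io/nfsrwx
reclaimPolicy: Delete
EOF
```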

I got the new #IntelNUC for the fourth node in the cluster, to replace the unstable Curie node, but the memory modules for it arrive on Thursday.

Tero Keski-Valkama (@tero@rukii.net)
2023-01-23

#RukiiNet #SelfHosting update:
After fighting with an unstable host (faulty memory, I believe) and the whole cluster always going down when one node went down, I took a deep dive into what actually happens.

Turned out that, as I had installed #OpenEBS #Jiva for replicating volumes on my #MicroK8S using their official Helm charts, it didn't work at all. It created all the replicas correctly, went through all the motions, and then stored all the data in a single pod's ephemeral store! I had to take the cluster down to investigate, which took more or less a weekend.

I found out that if OpenEBS Jiva is installed as a MicroK8S plug-in, pointing it specifically at their Git main and not at a tagged release (which doesn't work), then it works. I tried to find the difference between the Helm chart this installs and the one I had installed, with no luck. I think I installed the OpenEBS Jiva Helm chart before, which didn't work, while the MicroK8S plug-in installs the OpenEBS chart with Jiva enabled as a setting.
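
For reference, the enable path that finally worked was roughly the following. The addons-repo flags are how I remember the MicroK8s CLI, so check `microk8s addons repo --help` on your version before copying.

```bash
# Point the community addons repo at its main branch instead of the bundled
# tag, then enable the OpenEBS addon (it installs the OpenEBS chart with
# Jiva enabled as a setting).
microk8s addons repo remove community || true
microk8s addons repo add community \
  https://github.com/canonical/microk8s-community-addons --reference main
microk8s enable openebs
```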

Anyhow, I ordered a new #IntelNUC again, also to reduce the maintenance caused by the one flaky node. But as I've now recreated basically the whole cluster with a functioning OpenEBS and restored all the (daily) backups once again, everything seems to work, and a single node going down probably shouldn't take the whole Mastodon instance down anymore.

During all this I have also filed a lot of issues against the relevant projects on GitHub and documented my findings there, so that people hitting the same errors can find solutions.

Tero Keski-Valkama (@tero@rukii.net)
2023-01-19

#RukiiNet #SelfHosting update:
I think there is an issue with #OpenEBS #Jiva replication on the #Kubernetes cluster.

It seems all the volume data goes to the Jiva controller pod for that PVC, and it stores all the data in /var/snap/microk8s/common/var/lib/kubelet/pods/PODID/volumes/kubernetes.io~csi/PVCID/mount.

That directory should presumably be a mount to somewhere, but it isn't. The plain files just sit there on a single node.

The Jiva replicas, three per volume claim, are set up correctly, but the file data doesn't seem to reach them...
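
This is more or less how I checked it: findmnt tells you whether the kubelet path is actually an iSCSI-backed mount or just a plain directory, and the replica data directory here is an assumption based on the default Jiva hostPath location.

```bash
# On the node running the workload pod: is the CSI path a real mount?
# If findmnt prints nothing, it's just a plain directory on that node.
findmnt /var/snap/microk8s/common/var/lib/kubelet/pods/<pod-id>/volumes/kubernetes.io~csi/<pvc-id>/mount

# On each node: do the Jiva replicas actually hold the data? (Default
# hostPath base dir assumed; it depends on the Jiva volume policy.)
sudo ls -lh /var/openebs/<pvc-id>/
```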

Tero Keski-Valkama (@tero@rukii.net)
2023-01-03

@mlink, I actually described the scheme here in rough detail regarding #OpenEBS #Jiva and #DynamicNFSProvisioner:
costacoders.es/news/2022-12-24

The persistent volume replica stores are ultimately hostPaths.

Tero Keski-Valkama (@tero@rukii.net)
2022-12-24

Oh wow, I tested the backup restoration process as a side effect of battling the whole day with #OpenEBS, #Jiva, #iscsi and #NFS. I know too much about how these things work now. Need to write a blog article.
Now all data is replicated three-fold across the cluster and the backup restoration process has been tested.
#SelfHosting #Kubernetes
