#HDF5

mgorny-nyan (he) :autism:🙀🚂🐧 (mgorny@pol.social)
2025-04-17

#HDF5 is doing great. In short:

1. Originally, the project used the autotools build system. It installed an h5cc binary which, besides being a compiler wrapper, had extra options for querying information about the HDF5 installation.
2. Later, an alternative #CMake build system was added. That build system installs a simplified h5cc binary without those extra features.
3. Everyone who tried building with CMake quickly discovered that the new binary breaks most packages using HDF5, so they went back to autotools and reported the problem to HDF5.
4. Upstream closed the report, stating: "CMake h5cc changes have been noted in the Release.txt at the time of change - archived copy should exist in the history files."
5. Upstream announced their intent to remove autotools support.

Which leaves us in the following situation:

1. Practically everyone (at least #Arch, #Conda-forge, #Debian, #Fedora, #Gentoo) uses autotools, because building with CMake breaks too much.
2. Originally this was judged to be a problem in HDF5, so no bugs were reported against other packages. I suspect many distributions aren't even aware that HDF5 rejected the report.
3. The packages are still "broken", and I'm guessing their authors don't even know about the problem, because, well, as I mentioned, practically all distributions still use autotools, and nobody reported issues to other packages while testing CMake builds.
4. I'm not even sure this problem can be fixed "properly". I don't know the package, but it looks like the functionality was removed with no replacement, so at best people can start using CMake themselves (sigh), thereby of course breaking their packages on every distribution that builds HDF5 with autotools, unless they also add code to support the other variant.
5. Everything suggests that HDF5 is a library whose authors don't care about their own users.

github.com/HDFGroup/hdf5/issue

mgorny-nyan (he) :autism:🙀🚂🐧 (mgorny@treehouse.systems)
2025-04-17

#HDF5 is doing great. So basically:

1. Originally, upstream used autotools. The build system installed a h5cc wrapper which — besides being a compiler wrapper — had a few config-tool style options.
2. Then, upstream added #CMake build system as an alternative. It installed a different h5cc wrapper that did not have the config-tool style options anymore.
3. Downstreams that tried CMake quickly discovered that the new wrapper broke a lot of packages, so they reverted to autotools and reported a bug.
4. Upstream closed the bug, handwaving it as "CMake h5cc changes have been noted in the Release.txt at the time of change - archived copy should exist in the history files."
5. Upstream announced the plans to remove autotools support.

So, to summarize the current situation:

1. Pretty much everyone (at least #Arch, #Conda-forge, #Debian, #Fedora, #Gentoo) is building using autotools, because CMake builds cause too much breakage.
2. Downstreams originally judged this to be a HDF5 issue, so they didn't report bugs to affected packages. Not sure if they're even aware that HDF5 upstream rejected the report.
3. All packages remain "broken", and I'm guessing their authors may not even be aware of the problem, because, well, as I pointed out, everyone is still using autotools, and nobody reported the issues during initial CMake testing.
4. I'm not even sure if there is a good "fix" here. I honestly don't know the package, but it really sounds like the config-tool was removed with no replacement, so the only way forward might be for people to switch over to CMake (sigh) — which would of course break the packages almost everywhere, unless people also add fallbacks for compatibility with autotools builds.
5. The upstream's attitude suggests that HDF5 is pretty much a project unto itself, and doesn't care about its actual users.

github.com/HDFGroup/hdf5/issue

Howard Chu @ Symas (hyc)
2025-03-10

Three Ways of Storing and Accessing Lots of Images in Python
realpython.com/storing-images-

Using plain files, #LMDB, and #HDF5. It's too bad there's an explicit serialization step for the LMDB case. In C we'd just splat the memory in and out of the DB as-is, with no ser/deser overhead.

Also they use two separate tables for image and metadata in HDF5, but only one table in LMDB (with metadata concat'd to image). I don't see why they didn't just use two tables there as well.

Roadskater, Ph.D. (roadskater)
2025-02-08

I could have sworn HDF was only up to version 5.something, but maybe I lost track.

Photo of NYS license plate with tag "HDF 7".

#HDF5 and #DASK are both supported in #skflow #TensorFlow! See examples in https://goo.gl/MSH3dr

2024-08-27

New software descriptor published on ing.grid! "h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5" by Matthias Probst and Balazs Pritz inggrid.org/article/id/4028/ #RDM #HDF5 #python #FAIR #kit

DAPHNE 4 NFDI (daphne@nfdi.social)
2024-07-12

NeXus is a #dataformat widely used in our communities. It provides a standard for which parameters should be stored and how they should be structured in the #HDF5 file, in order to integrate #metadata together with the data in a highly structured way.

Building on this, there is a machine-readable #ontology; it defines unique identifiers and establishes a controlled vocabulary for the names of all experimental parameters and measured variables in the experiment.

2/3

DAPHNE 4 NFDI (daphne@nfdi.social)
2024-07-12

Yesterday, Rolf and Heike presented our DAPHNE consortium at the #NFDI network meeting Berlin-Brandenburg. In particular, they went into the terminologies and data formats used for #synchrotron and #neutron experiments.

1/3

The slides are available on Zenodo:
zenodo.org/records/12728050

#RDM #ontologies #NeXus #HDF5

Blosc Development Team (Blosc2@fosstodon.org)
2024-03-26

📢 New 0.2 release of Caterva2, our on-demand framework for sharing Blosc2 (and now #HDF5 too!) data! 🎉
Accessing Blosc2 remotely is becoming easier than ever.
More info:
github.com/Blosc/Caterva2/rele

2024-02-23

STOP DOING HDF5

- Years of HDF5 yet no real-world use-case found
- Files were never meant to be hierarchical or "self explaining"
- "Hello I would like a file system within a file please" ...statements dreamt up by the utterly deranged
- Look at what the HDF5 group has been demanding our respect for all this time
- wanted to have an explanation for the stuff in a folder? We had a tool for that: it was called a Readme

They have played us for absolute fools

#HDF5

Howard Chu @ Symas (hyc)
2024-02-01

#HDF5 is still a popular file format in scientific computing, but #LMDB is always superior, especially in machine learning. reddit.com/r/MachineLearning/c

2023-10-19

New preprint available: h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5.

"This paper presents an open-source package, called h5RDMtoolbox, written in Python helping to quickly implement and maintain FAIR research data management along the entire data lifecycle using HDF5 as the core file format"

preprints.inggrid.org/reposito

#datamanagement #metadata #HDF5 #datalifecycle #python

2023-10-16

Those of you who have been around this blog a while know that we have also been dabbling in laboratory automation (see, for example, this post).

Over the last year or so, led by my former colleague Dr. Glen Smales, we used RoWaN to synthesise over 1200 zeolitic imidazolate framework-8 (ZIF-8) metal-organic framework samples, spread over 20 series, while systematically varying many synthesis parameters. Changing seemingly innocuous aspects appears to have a strong impact on the final particle size (distribution), so correct logging of the metadata is essential for correlating synthesis parameters with morphology.

We have presented this work several times already in various presentations (one of which should go online soon) and large manuscripts are in the works with the full story. Given the scope of this investigation, preparing these manuscripts eats up a lot of time. Today, I would like to focus on how we structured the synthesis metadata that was collected for every sample.

Since small synthesis details might correlate with the resultant particle morphology, we went overboard with the metadata collection. Everything we could think of and get access to was recorded and integrated into the final structure. This includes such things as bottle opening dates and the humidity in the laboratory. This “everything and the kitchen sink” approach has saved our bacon a few times in the past when debugging MOUSE measurement issues, and will probably not be out of place here either.

Data Sources

The data sources here were varied; we take what we can get, how we can get it. Some data comes from the log of RoWaN itself, which can store such things as EPICS messages as well as step messages (e.g. “Start injection of solution 1” and “Stop injection of solution 1”) in a long CSV file. Some manual steps were recorded using predefined messages in predefined cells in the Jupyter notebook. Some instrument readings (the density meter reading for the stock solutions, for example) are read off manually and added to the log by hand.
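
Purely as an illustration (the actual RoWaN log columns are not shown in this post, so the file and column names below are assumptions), merging such a timestamped step log with manually recorded entries could look roughly like this in Python:

```
import pandas as pd

# Read the long CSV log written by RoWaN; we assume it has at least a
# timestamp column and a message column (hypothetical column names).
steps = pd.read_csv("rowan_log.csv", parse_dates=["timestamp"])

# Manually recorded entries from the Jupyter notebook, e.g. predefined messages.
manual = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-05-04 10:12:00"]),
    "message": ["Density meter reading, stock solution 1: 1.0421 g/mL"],
})

# Combine both sources into a single, time-ordered event log.
events = (
    pd.concat([steps[["timestamp", "message"]], manual])
    .sort_values("timestamp")
    .reset_index(drop=True)
)
print(events.head())
```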

Figure 0: The DACHS project logo, because every project needs a logo! Image courtesy of Glen Smales.

Data on the starting compounds / chemicals and equipment comes from a master Excel file that contains this information in a structured manner across several sheets. We use Excel because it is a pragmatic user-entry system that does not require any training and whose output can be read with Python. It has severe restrictions, though, that would preclude its use in larger projects and an easily-configurable web form solution is still sought (hook me up when you know of one that can take external yaml or json files specifying configuration of pages, parameters, data types and limits).
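
As a generic illustration (the workbook and sheet names below are made up, not our actual file), reading such a multi-sheet Excel file into Python is a one-liner with pandas:

```
import pandas as pd

# Read every sheet of the master workbook into a dict of DataFrames;
# sheet_name=None returns {sheet name: DataFrame}.
sheets = pd.read_excel("master_chemicals_and_equipment.xlsx", sheet_name=None)

# e.g. a hypothetical "Chemicals" sheet with one row per reagent jar
chemicals = sheets["Chemicals"]
print(chemicals.columns.tolist())
```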

Translation, translation, translation

During my last presentation at FAIRmat, there was a good set of questions from the audience about whether we can (or should) dictate what metadata should be recorded, and in what format. At the moment, I would shy away from such efforts. On the one hand, it is impossible to predict what you will need in the end, so I would advise recording everything. On the other hand, any additional workload on an already overloaded workforce is bound to be hated, ignored or forgotten, so chances are that you will not get it as you intended anyway.

Better to encourage recording whatever low-hanging fruit is reachable, as automatically as possible. You can hook up your heating/stirring plates, syringe injectors, thermometers etcetera to a laboratory network and, with a little effort, continuously read out what they do in whatever format they do it in. Make it as simple as possible to record this, lower the threshold for such documentation, and only require manual documentation if it absolutely cannot be avoided. You will get a collection of files and will need to translate those into useful information, probably with the help of some custom knowledge and lots of time stamps. This, in my eyes, is unavoidable, although the eventual introduction of a message broker would help greatly. As it is now, we spend a significant portion of our code translating between what we get and what we need, adding some domain knowledge into the code.

We made a piece of software called DACHS that does this. Most of this code defines generic data classes for the various objects and information entries that we want. Values and units, where possible, get translated into unit-aware quantities with Pint, which makes it extremely easy to do unit-aware calculations with, say, moles, grams and millilitres and combinations thereof. The value of this ability cannot be overstated.
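
To illustrate what Pint buys you, here is a minimal generic sketch (not DACHS code; the reagent and the numbers are made up):

```
import pint

ureg = pint.UnitRegistry()

# A hypothetical reagent amount: 2.5 g of zinc nitrate hexahydrate (M = 297.49 g/mol)
mass = 2.5 * ureg.gram
molar_mass = 297.49 * ureg.gram / ureg.mole
volume = 40 * ureg.milliliter

# Unit-aware arithmetic: amount of substance and concentration
amount = (mass / molar_mass).to(ureg.millimole)
concentration = (amount / volume).to(ureg.mole / ureg.liter)

print(amount)         # ~8.40 millimole
print(concentration)  # ~0.21 mole / liter
```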

One non-generic segment of the code, however, is focused on applying those models to our specific data sources, and on subsequently structuring these into a specific tree for this experiment. So what does this tree look like?

The base tree

We decided to split our base tree into three sections: “Equipment”, “Chemicals”, and “Synthesis”, which contain what they say on the tin (Figure 1).

Figure 1: the base tree of the synthesis log HDF5 structure (starting on the right, going deeper to the left)

The Equipment section contains a list of the equipment used for that particular experiment (this can vary between experiment series). Equipment includes such things as stirrer bars, Falcon tubes and tubing, as this might correlate with the result. Each piece of equipment has detailed information, including which PVs, or “Process Variables”, it controls. These PVs can carry linear calibration information, i.e. split into a calibration offset and a calibration factor, so that: calibrated value = calibration offset + calibration factor × raw value.

The Chemicals section is split into several parts, beginning with the starting compounds. These contain the base reagents as found in the jars of the suppliers. Each reagent has information such as supplier, CAS number and lot number, as well as jar opening date (if known). It also contains information on the chemical in the reagent, with chemical formula, molar mass, density and space group (if known). See Figure 2 for a bit more information on how deep we go with this.
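
Purely as a sketch of the idea, writing such a tree with h5py might look like this; the group and attribute names are my own guesses based on the figures, not the actual DACHS layout:

```
import h5py

# Write a small tree with the three top-level sections from Figure 1.
with h5py.File("synthesis_log.h5", "w") as h5:
    equipment = h5.create_group("Equipment")
    chemicals = h5.create_group("Chemicals")
    h5.create_group("Synthesis")

    # One piece of equipment, with a process variable (PV) carrying
    # the linear calibration described above (values are made up).
    pv = equipment.create_group("syringe_pump_1/PV_flow_rate")
    pv.attrs["calibration_offset"] = 0.02
    pv.attrs["calibration_factor"] = 1.013

    # One starting compound with supplier and chemical information
    # (the supplier name is invented; the CAS number is the real one
    # for zinc nitrate hexahydrate).
    reagent = chemicals.create_group("starting_compounds/zinc_nitrate_hexahydrate")
    reagent.attrs["supplier"] = "ExampleChem"
    reagent.attrs["CAS"] = "10196-18-6"
    reagent.attrs["molar_mass_g_per_mol"] = 297.49
```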

Figure 2: Going into a little more detail in the actual tree using the DAWN DataVis explorer.

Limitations

Here, it perhaps behooves me to highlight that this is a pragmatic proof-of-principle solution, mostly resulting from the effort of less than a handful of people. There are limitations aplenty; for example, there is probably a spate of parameters we are not, but should be, logging.

One thing I could not figure out how to implement easily is equipment interconnectivity. Equipment is connected on several levels: an electrical level (Figure 3), a virtual level (Figure 4), and a synthesis-side physical level (tubing etc., Figure 5). So this information is missing from the structure and only exists as vector graphics hand-drawn using the draw.io software. I'm sure there is a better way.

Figure 3: Equipment connections on the physical/electrical level

Then there is the matter of uncertainties. For someone who keeps hammering on about the value of uncertainties, and even splitting them up into different classes of uncertainties ("absolute" and "relative", see the Everything SAXS paper), it is quite hypocritical of me not to have added them here. I decided against it in the end, as none of the uncertainty estimates I could come up with were grounded in any sort of realism, i.e. all of them would be guesstimated and therefore would not carry any weight. Maybe at a later stage this can be improved.

Figure 4: Equipment connectivity on the virtual level

Also, I had initially provisioned for a more defined synthesis step (called SynthesisStep), which theoretically should allow classifying each synthesis step. This, in turn, should form a good step (ha) towards automatically recreating a synthesis using another robot.

Figure 5: Physical connectivity between equipment from the synthesis perspective (tubing and sample motions as indicated by the solid and dotted lines, respectively). Image by way of Glen Smales.

In the end I did not include this, as the more I worked with it, the more I realised that most of its purpose is already served by RawLogMessage. And any universality with other robot systems is moot if I don't tailor it to the confines of those systems' concepts.

Here, too, translation will be essential to turn the information contained in these synthesis logs into a synthesis protocol for other robots to use.

So where to go from here?

With the information encapsulated in the synthesis files, we can now start exploring. I've built a simple dashboard using Dash and Plotly that allows one to plot one parameter against another. Additionally, for samples that have been measured already, the scattering pattern and size distribution can be shown too (indeed, the resultant size parameters are plottable as well). This allows a bit more information to be gleaned, but the two-dimensional nature of the plotting isn't really enough to extract the necessary interparameter correlations. That will require a bit more effort.
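
A minimal sketch of such a parameter-versus-parameter explorer, using generic Dash and Plotly code with a made-up table rather than the actual dashboard:

```
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# Made-up synthesis table: one row per synthesis, a few numeric parameters.
df = pd.DataFrame({
    "stirring_rate_rpm": [300, 500, 700, 900],
    "reagent_ratio": [1.0, 2.0, 4.0, 8.0],
    "mean_particle_radius_nm": [45, 38, 30, 24],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(list(df.columns), "stirring_rate_rpm", id="x"),
    dcc.Dropdown(list(df.columns), "mean_particle_radius_nm", id="y"),
    dcc.Graph(id="scatter"),
])

@app.callback(Output("scatter", "figure"), Input("x", "value"), Input("y", "value"))
def update(x, y):
    # Each dot represents one synthesis.
    return px.scatter(df, x=x, y=y)

if __name__ == "__main__":
    app.run(debug=True)
```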

Nevertheless, it’s pretty, and that will suffice for now! (see Figure 6)

Figure 6: The synthesis explorer dashboard. Each dot in the left figure represents a synthesis. For the syntheses whose samples have been measured, the corresponding data, fit and resultant size distribution are shown on the right-hand side.

https://lookingatnothing.com/index.php/archives/3720

#archivalData #data #hdf5 #labAutomation #nexus #robotics #storageFormat #synthesis

2023-09-20

HELPMI:
Develop a user-driven #NeXus extension proposal for laser-plasma (#laser #plasma) experiments, by #Helmholtz #Metadata Collaboration

Basically, having nice and standardised metadata (e.g. in #HDF5 files) for #pewpew experiments, making data #FAIR for others – and also more easy to access for yourself ;)

2023-09-19

Now we're learning about h5wasm, especially viewing compressed #hdf5 data files with web assembly #wasm.

2023-09-19

Maintenance of #HDF5 filters, plugins:

Who fixes problems?
What happens when filter maintainers disappear?
When underlying libraries are no longer maintained?
API changes?

Who cares, who pays?

2023-09-19

This week, I'm at @DESY for the #HDF User Group #HUG summit on plugins and data compression.

We will start with latest updates on plugins:
“HDF5 and plugins – overview and roadmap“, Dana Robinson.
Then, Elena Pourmal will hopefully have good news: “Expanding HDF5 capabilities to support multi-threading access and new types of storage“.

Currently, reading big datasets can be quite a burden due to some … challenges of #HDF5 with multi-threading.

2023-09-09

New release of hidefix (the fast #rust #HDF5 reader) with simpler methods for reading an open NetCDF file: github.com/gauteh/hidefix

```
use hidefix::prelude::*;

// Index the chunk layout of an already-open NetCDF/HDF5 file,
// then read the "SST" variable as f32 through hidefix.
let i = netcdf::open("tests/data/coads_climatology.nc4").unwrap().index().unwrap();
let iv = i.reader("SST").unwrap().values::<f32>(None, None).unwrap();
```

2023-08-23

Thinking about publishing some #experimental data in fluid dynamics in an open format that could be a standard, and trying not to reproduce this piece of art
xkcd.com/927/
#hdf5 looks interesting, or #netCDF but my data are not so standardized.

2023-08-19

Fixed a long-standing bug in hidefix (github.com/gauteh/hidefix) where #hdf5 files with unaligned chunking failed. 0.7.0 is out. Fast as ever compared to the native hdf5 library. #rust
