#linkml

jonny (good kind) jonny@neuromatch.social
2024-10-03

It took me ~a year to translate Neurodata Without Borders to linkml+pydantic with full abstraction over array/storage backend. Now that I did that, it is taking me ~hours to make interfaces to put NWB in SQL dbs, web APIs for editing and serving NWB datasets (where you can download arbitrary slices of the individual datasets instead of a bigass 100GB HDF5 file), and interconversion between hdf5, dask, and zarr.

Anyway open data in neuroscience is about to get real good.

#neuroscience #linkml #OpenData #OpenScience

2024-09-12

LinkML is the most XKCD 927 technology I have ever seen.

#linkml #xkcd #rdm

jonny (good kind) jonny@neuromatch.social
2024-05-25

Here's an ~ official ~ release announcement for #numpydantic

repo: github.com/p2p-ld/numpydantic
docs: numpydantic.readthedocs.io

Problems: @pydantic is great for modeling data!! but at the moment it doesn't support array data out of the box. Often array shape and dtype are as important as whether something is an array at all, but there isn't a good way to specify and validate that with the Python type system. Many data formats and standards couple their implementation very tightly with their schema, making them less flexible, less interoperable, and more difficult to maintain than they could be. The existing tools for parameterized array types like nptyping and jaxtyping tie their annotations to a specific array library, rather than allowing array specifications that can be abstract across implementations.

numpydantic is a super small, few-dep, and well-tested package that provides generic array annotations for pydantic models. Specify an array along with its shape and dtype and then use that model with any array library you'd like! Extending support for new array libraries is just subclassing - no PRs or monkeypatching needed. The type has some magic under the hood that uses pydantic validators to give a uniform array interface to things that don't usually behave like arrays - pass a path to a video file, that's an array. pass a path to an HDF5 file and a nested array within it, that's an array. We take advantage of the rest of pydantic's features too, including generating rich JSON schema and smart array dumping.
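
(A rough sketch of the "just subclassing" extension point - the hook names here are illustrative assumptions, not necessarily numpydantic's actual Interface API:)

from numpydantic.interface import Interface

class MyArrayInterface(Interface):
    """Hypothetical interface for some array-like object (illustrative only)."""

    @classmethod
    def check(cls, array) -> bool:
        # decide whether this interface should handle the input value
        return hasattr(array, "__my_array_protocol__")

    @classmethod
    def enabled(cls) -> bool:
        # only activate when the backing library is importable
        return True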

This is a standalone part of my work with @linkml arrays and rearchitecting neurobio data formats like NWB to be dead simple to use and extend, integrating with the tools you already use and across the experimental process - specify your data in a simple yaml format, and get back high quality data modeling code that is standards-compliant out of the box and can be used with arbitrary backends. One step towards the wild exuberance of FAIR data that is just as comfortable in the scattered scripts of real experimental work as it is in carefully curated archives and high performance computing clusters. Longer term I'm trying to abstract away data store implementations to bring content-addressed p2p data stores right into the python interpreter as simply as if something was born in local memory.
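
(An illustrative sketch of that yaml-to-code step, using LinkML's PydanticGenerator - the generator behind gen-pydantic; exact options may differ across linkml versions:)

from linkml.generators.pydanticgen import PydanticGenerator

# "my_schema.yaml" is a placeholder for your LinkML schema file
generator = PydanticGenerator("my_schema.yaml")
source = generator.serialize()  # generated pydantic models as Python source

with open("models.py", "w") as f:
    f.write(source)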

plenty of todos, but hope ya like it.

#linkml #python #NewWork #pydantic #ScientificSoftware

[This and the following images aren't very screen reader friendly with a lot of code in them. I'll describe what's going on in brackets and then put the text below.

In this image: a demonstration of the basic usage of numpydantic, declaring an "array" field on a pydantic model with an NDArray class with a shape and dtype specification. The model can then be used with a number of different array libraries and data formats, including validation.]

Numpydantic allows you to do this:

from pydantic import BaseModel
from numpydantic import NDArray, Shape

class MyModel(BaseModel):
    array: NDArray[Shape["3 x, 4 y, * z"], int]

And use it with your favorite array library:

import numpy as np
import dask.array as da
import zarr

# numpy
model = MyModel(array=np.zeros((3, 4, 5), dtype=int))
# dask
model = MyModel(array=da.zeros((3, 4, 5), dtype=int))
# hdf5 datasets
model = MyModel(array=('data.h5', '/nested/dataset'))
# zarr arrays
model = MyModel(array=zarr.zeros((3,4,5), dtype=int))
model = MyModel(array='data.zarr')
model = MyModel(array=('data.zarr', '/nested/dataset'))
# video files
model = MyModel(array="data.mp4")

[Further demonstration of validation and array expression, where a Union of NDArray specifications can specify a more complex data type - eg. an image that can be any shape in x and y, an RGB image, or a specific resolution of a video, each with independently checked dtypes]

For example, to specify a very special type of image that can be either

- a 2D float array where the axes can be any size, or
- a 3D uint8 array where the third axis must be size 3, or
- a 1080p video

from typing import Union
from pydantic import BaseModel
import numpy as np

from numpydantic import NDArray, Shape

class Image(BaseModel):
    array: Union[
        NDArray[Shape["* x, * y"], float],
        NDArray[Shape["* x, * y, 3 rgb"], np.uint8],
        NDArray[Shape["* t, 1080 y, 1920 x, 3 rgb"], np.uint8]
    ]

And then use that as a transparent interface to your favorite array library!

Interfaces

Numpy

The Coca-Cola of array libraries

import numpy as np
# works
frame_gray = Image(array=np.ones((1280, 720), dtype=float))
frame_rgb  = Image(array=np.ones((1280, 720, 3), dtype=np.uint8))

# fails
wrong_n_dimensions = Image(array=np.ones((1280,), dtype=float))
wrong_shape = Image(array=np.ones((1280,720,10), dtype=np.uint8))

# shapes and types are checked together, so this also fails
wrong_shape_dtype_combo = Image(array=np.ones((1280, 720, 3), dtype=float))

[Demonstration of usage outside of pydantic as just a normal python type - you can validate an array against a specification by checking whether the array is an instance of the array specification type]

And use the NDArray type annotation like a regular type outside of pydantic – eg. to validate an array anywhere, use isinstance:

array_type = NDArray[Shape["1, 2, 3"], int]
isinstance(np.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(zarr.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(np.zeros((4,5,6), dtype=int), array_type)
# False
isinstance(np.zeros((1,2,3), dtype=float), array_type)
# False

[Demonstration of JSON schema generation using the sort of odd case of an array with a specific dtype but an arbitrary shape. It has to use a recursive JSON schema definition, where the items of a given JSON array can either be the innermost dtype or another instance of that same array. Since JSON Schema doesn't support extended dtypes like 8-bit integers, we encode that information as maximum and minimum constraints on the `integer` class and add it in the schema metadata. Since pydantic renders all recursive schemas like this in the same $defs block, we use a blake2b hash against the dtype specification to keep them deduplicated.]

numpydantic can even handle shapes with unbounded numbers of dimensions by using recursive JSON schema!!!

So the any-shaped array (using nptyping’s ellipsis notation):

class AnyShape(BaseModel):
    array: NDArray[Shape["*, ..."], np.uint8]

is rendered to JSON-Schema like this:

{
  "$defs": {
    "any-shape-array-9b5d89838a990d79": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/any-shape-array-9b5d89838a990d79"
          },
          "type": "array"
        },
        {"maximum": 255, "minimum": 0, "type": "integer"}
      ]
    }
  },
  "properties": {
    "array": {
      "dtype": "numpy.uint8",
      "items": {"$ref": "#/$defs/any-shape-array-9b5d89838a990d79"},
      "title": "Array",
      "type": "array"
    }
  },
  "required": ["array"],
  "title": "AnyShape",
  "type": "object"
}
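
(The schema above comes straight from pydantic's normal machinery - a minimal way to render it yourself:)

import json

print(json.dumps(AnyShape.model_json_schema(), indent=2))
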
Mike Amundsen mamund
2024-05-23

@smallcircles

thanks for the pointer to

Nik | Klampfradler 🎸🚲 nik@toot.teckids.org
2024-04-29

I need someone to help me think about a #LinkML puzzle…

I have an abstract class A, which has an attribute "factors" with range Factor. I want to make a class B that inherits from A and sets "factors" to a concrete list of instances of the "Factor" class. Then I will have instances of class B that all share this list of "Factor"s in their "factors" attribute.

I can't seem to come up with how to do this mix of class and instance. Can LinkML even represent it?

#FediHelp #LinkedData

418 I'm a Teapot chainik@merveilles.town
2024-03-13

Taking a good look at #LinkML for a work thing to document data that needs to be shared between aquaculture labs, to articulate clear, agreed schemata for CSV and NetCDF files. LinkML comes with a tool to take compact tables and explode them into RDF triples 10-50x the size for no obvious benefit. Are people really still materialising triples for ideological reasons? @jonny surely you don't do that with your stuff, right?

jonny (good kind) jonny@neuromatch.social
2024-01-13

i'll say more about what this is in the morning, but anyway here's a #LinkML transcription of #ActivityStreams that will also get the implicit definition of an Actor in #ActivityPub later, along with all the other fun stuff that brings like generic dataclasses and pydantic models for programming with, sql, graphql, json schema... yno all the formats.

github.com/p2p-ld/linkml-activ

jonny (good kind) jonny@neuromatch.social
2023-09-27

So im almost finished with my first independent implementation of a standard and I want to write up the process bc it was surprisingly challenging and I learned a lot about how to write them.

I was purposefully experimenting with different methods of translation (eg. Adapter classes vs. pure functions in a build pipeline, recursive functions vs. flattening everything) so the code isnt as sleek as it could be. I had planned on this beforehand, but two major things I learned were a) not just isolating special cases, but making specific means to organize them and make them visible, and b) isolating different layers of the standard (eg. schema language is separate from models is separate from I/O) and not backpropagating special cases between layers.

This is also my first project thats fully in the "new style" of python thats basically a typed language with validating classes, and it makes you write differently but uniformly for the better - it's almost self-testing bc if all the classes validate in an end-to-end test then you know that shit is working as intended. Forcing yourself to deal with errors immediately is the way.
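
(A minimal illustration of that "self-testing" idea - not from the actual codebase, just the shape of an end-to-end test where pydantic validation does the checking:)

import pytest
from pydantic import BaseModel, ValidationError

class Dataset(BaseModel):  # stand-in for a generated model
    name: str
    n_trials: int

def test_end_to_end():
    # if every model validates against real data, the translation works
    model = Dataset.model_validate({"name": "session-01", "n_trials": 120})
    assert model.n_trials == 120

    with pytest.raises(ValidationError):
        Dataset.model_validate({"name": "session-02", "n_trials": "not a number"})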

Lots more 2 say but anyway we're like 2 days of work away from a fully independent translation of #NWB to #LinkML that uses @pydantic models + #Dask for arrays. Schema extensions are now no-code: just write the schema (in nwb schema lang or linkml) and poof you can use it. Hoping this makes it way easier for tools to integrate with NWB, and my next step will be to put them in a SQL database and triple store so we can yno more easily share and grab smaller pieces of them and index across lots of datasets.

Then, uh, we'll bridge our data archives + notebooks with the fedi for a new kind of scholarly communication....

BOSC (OpenBio's Conference) BOSC@genomic.social
2023-09-18

Our next ISCBacademy webinar, jointly hosted with the Bio-Ontologies COSI, features genomic.social/@sierramoxon speaking about “LinkML: an open data modeling framework, grounded with ontologies”. Free to ISCB non-members as well as members.

More info: open-bio.org/2023/09/18/iscbac
#LinkML #datamodeling #opendata #ontologies

Sierra Moxon
jonny (good kind) jonny@neuromatch.social
2023-09-07

rendering ~200 #pydantic models with ~900 fields from #linkML, 60s -> 4s by avoiding deep copies and only loading yaml once. we're back down in usability land, that's enough for today.

This image and next, an icicle plot showing the amount of time each call took underneath a "serialize" method in pydantic.py

The top bar (total time) is 59.6s, and the major subsections are a deepcopy operation (30s) within an "induced slot" method (43s), and a yaml load operation (14s). The rest of the icicle plot is unlabeled.

Icicle plot for the same "serialize" method, this time only 3.98 total seconds.

The major calls are "induced slot" (2.84s), now without deepcopy, and an "environment" method that takes 0.8s.
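
(A toy sketch of the "only load yaml once" part - a memoized loader, purely illustrative, not the actual linkml code:)

from functools import lru_cache
from pathlib import Path

import yaml

@lru_cache(maxsize=None)
def load_schema(path: str) -> dict:
    # parsed once per unique path, then served from the cache on repeat calls
    return yaml.safe_load(Path(path).read_text())
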
LinkML (OLD ACCOUNT) linkml@qoto.org
2023-08-29

Join our new #LinkML group on LinkedIn! linkedin.com/groups/14303246/

LinkML logo
jonny (good kind) jonny@neuromatch.social
2023-08-17

#NWB schema language translated to #LinkML ... check

so now translating the rest of it should just be writing a few mappings

Nat'l Microbiome Data Collab MicrobiomeData@sciencemastodon.com
2023-08-09

NMDC team member Mark Miller will be speaking about implementing the MIxS standard into LinkML on Aug. 10 from 9:45 AM – 10:00 AM THA during the @genomestandards Annual Meeting at SiMR Room 101 (Siriraj Hospital Mahidol University). bit.ly/3YioM0U #datascience #LinkML
