#numpydantic

jonny (good kind) @jonny@neuromatch.social
2024-09-24

whoop whoop full provenance-preserving roundtrip serialization to JSON in #numpydantic 1.6.0.

sure, you could serialize a complex, nested data model with a ton of arrays in a precious high-performance data backend as a bunch of random JSON numbers and test the parsers of fate, or you could serialize them as relative paths with an indication of an interface loading class to a json/yaml file and distribute them without losing all the nice chunking and whatnot you've gone and done. Once they review my PR, if you use numpydantic with @linkml, then you get all the rich metadata and modeling power of linked data with arrays in a way that makes sense, with arbitrary array framework backends rather than some godforsaken rest/first tree or treating arrays as if they're the same as scalar values -- and now complete with a 1:1 JSON/YAML serializable/deserializable storage format.

Another day closer to linked data for real data with tools that feel good to use. Another day closer to p2p linked data.

next up is including hashes, multiple sources, and support for more types of constraints (now that i'm actually getting feature requests for this thing, which is weird to me).

numpydantic.readthedocs.io/en/

JSON

JSON is the ~ ♥ fun one ♥ ~

There isn’t necessarily a single optimal way to represent all possible arrays in JSON. The standard way that n-dimensional arrays are rendered in JSON is as a list-of-lists (or array of arrays, in JSON parlance), but that’s almost never what is desirable, especially for large arrays.
Normal Style

Lists-of-lists are the standard, however, so it is the default behavior for all interfaces, and all interfaces must support it.

For our humble model:

[Python code follows, demonstrating how, for a simple model with a single `array` field, each of several array formats (numpy, dask arrays, and zarr) can be dumped to plain, identical JSON]

from pydantic import BaseModel
from numpydantic import NDArray

class MyModel(BaseModel):
    array: NDArray

We should get the same thing for each interface:

model = MyModel(array=[[1,2],[3,4]])
print(model.model_dump_json())

{"array":[[1,2],[3,4]]}

import dask.array as da

model = MyModel(array=da.array([[1,2],[3,4]], dtype=int))
print(model.model_dump_json())

{"array":[[1,2],[3,4]]}

import zarr

model = MyModel(array=zarr.array([[1,2],[3,4]], dtype=int))
print(model.model_dump_json())

{"array":[[1,2],[3,4]]}Roundtripping

To make arrays round-trippable, use the round_trip argument to model_dump_json().

All the following should return an equivalent array from the same file/etc. as the source array when using model_validate_json().
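
For instance, a minimal sketch of the full loop, assuming the MyModel class and the dask array from above:

import dask.array as da
import numpy as np

# dump with round_trip so the interface metadata is included
model = MyModel(array=da.array([[1, 2], [3, 4]], dtype=int))
json_str = model.model_dump_json(round_trip=True)

# validating the dumped JSON reconstructs an equivalent array
restored = MyModel.model_validate_json(json_str)
assert np.array_equal(np.asarray(restored.array), np.asarray(model.array))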

[More python code follows, showing how, when dumping with `round_trip`, additional metadata is included - in this case the `dask` array metadata about `dtype`, `chunks` shape, and the array itself (since this was an in-memory array)]

# print_json: a small display helper that parses the JSON string and pretty-prints it
print_json(model.model_dump_json(round_trip=True))

{
    'array': {
        'type': 'dask',
        'name': 'array-2a39187fc9fcee3f4cdbc1f2911b4b92',
        'chunks': [[2], [2]],
        'dtype': 'int64',
        'array': [[1, 2], [3, 4]]
    }
}

Controlling paths

When possible, the full content of the array is omitted in favor of the path to the file that provided it.

[More python code, showing how, when array formats that are stored on disk are dumped with round_trip, the relative file path is dumped rather than the full array itself]

model = MyModel(array="data/test.avi")
print_json(model.model_dump_json(round_trip=True))

{'array': {'type': 'video', 'file': 'data/test.avi'}}

model = MyModel(array=("data/test.h5", "/data"))
print_json(model.model_dump_json(round_trip=True))

{
    'array': {
        'type': 'hdf5',
        'file': 'data/test.h5',
        'path': '/data',
        'field': None
    }
}

Durable Interface Metadata
Numpydantic tries to be stable, but we’re not perfect. To preserve the full information about the interface that’s needed to load the data referred to by the value, use the mark_interface context parameter:

[Python code, showing how the `mark_interface` parameter causes the output to gain an additional layer of metadata, with an "interface" dict defining exactly what module, class, and version were used to create it]

print_json(
  model.model_dump_json(
    round_trip=True, 
    context={"mark_interface": True}
    ))

{
    'array': {
        'interface': {
            'module': 'numpydantic.interface.hdf5',
            'cls': 'H5Interface',
            'name': 'hdf5',
            'version': '1.6.0'
        },
        'value': {
            'type': 'hdf5',
            'file': 'data/test.h5',
            'path': '/data',
            'field': None
        }
    }
}
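
A consumer can then use that marker to find the loader class again. A rough sketch - json_str is assumed to hold the dump from above, and error handling is omitted:

import importlib
import json

dumped = json.loads(json_str)["array"]
info = dumped["interface"]

# import the module and class named by the marker
module = importlib.import_module(info["module"])
loader = getattr(module, info["cls"])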
jonny (good kind) @jonny@neuromatch.social
2024-08-14

pushing da boundaries of the python type system to make parameterized callable classes lol trust me it makes sense
github.com/p2p-ld/numpydantic/

#numpydantic

[Example of using a parameterized type as a callable in python... that's about as good of a summary as i can give. We are declaring that something is an array with a single dimension of shape 3, and then giving an array to that declaration - if it is good, the array is returned unharmed; if it is wrong, a ShapeError is raised]

So this makes it so you can use an annotation as a functional validator. It looks a little bit whacky but idk, it makes sense as a PARAMETERIZED TYPE

[python code begins]

>>> from numpydantic import NDArray, Shape
>>> import numpy as np

>>> array = np.array([1,2,3], dtype=int)
>>> validated = NDArray[Shape["3"], int](array)
>>> assert validated is array
True

>>> bad_array = np.array([1,2,3,4], dtype=int)
>>> _ = NDArray[Shape["3"], int](bad_array)
    175 """
    176 Raise a ShapeError if the shape is invalid.
    177 
    178 Raises:
    179     :class:`~numpydantic.exceptions.ShapeError`
    180 """
    181 if not valid:
--> 182     raise ShapeError(
    183         f"Invalid shape! expected shape {self.shape.prepared_args}, "
    184         f"got shape {shape}"
    185     )

ShapeError: Invalid shape! expected shape ['3'], got shape (4,)
[python code ends]
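
For the curious, the general trick - and this is only a sketch of the mechanism, not numpydantic's actual implementation - is that __class_getitem__ can return a new class that remembers its parameters and validates when called:

import numpy as np

class ValidatedArray:
    shape: tuple = ()

    def __class_getitem__(cls, item):
        # ValidatedArray[3] builds a parameterized subclass on the fly
        shape = item if isinstance(item, tuple) else (item,)
        return type(f"{cls.__name__}{list(shape)}", (cls,), {"shape": shape})

    def __new__(cls, array):
        # calling the parameterized class validates, then returns the
        # array itself rather than an instance of this class
        if tuple(array.shape) != cls.shape:
            raise ValueError(f"expected shape {cls.shape}, got {array.shape}")
        return array

array = np.array([1, 2, 3])
assert ValidatedArray[3](array) is array  # returns the array unharmed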
jonny (good kind) @jonny@neuromatch.social
2024-05-25

Here's an ~ official ~ release announcement for #numpydantic

repo: github.com/p2p-ld/numpydantic
docs: numpydantic.readthedocs.io

Problems: @pydantic is great for modeling data!! but at the moment it doesn't support array data out of the box. Often array shape and dtype are as important as whether something is an array at all, but there isn't a good way to specify and validate that with the Python type system. Many data formats and standards couple their implementation very tightly with their schema, making them less flexible, less interoperable, and more difficult to maintain than they could be. The existing tools for parameterized array types like nptyping and jaxtyping tie their annotations to a specific array library, rather than allowing array specifications that can be abstract across implementations.

numpydantic is a super small, few-dep, and well-tested package that provides generic array annotations for pydantic models. Specify an array along with its shape and dtype, then use that model with any array library you'd like! Extending support for new array libraries is just subclassing - no PRs or monkeypatching needed (see the sketch below). The type has some magic under the hood that uses pydantic validators to give a uniform array interface to things that don't usually behave like arrays: pass a path to a video file, that's an array; pass a path to an HDF5 file and a nested array within it, that's an array. We take advantage of the rest of pydantic's features too, including generating rich JSON schema and smart array dumping.
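
As a rough illustration of the "extending is just subclassing" idea - the general pattern, not numpydantic's actual Interface API - __init_subclass__ lets a base class auto-register every subclass, so nothing upstream needs to change:

class Interface:
    registry: list = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # every subclass becomes discoverable without editing upstream code
        Interface.registry.append(cls)

    @classmethod
    def check(cls, obj) -> bool:
        """Return True if this interface can handle obj."""
        raise NotImplementedError

class ListInterface(Interface):
    @classmethod
    def check(cls, obj) -> bool:
        return isinstance(obj, list)

# the new interface is picked up automatically
assert any(i.check([1, 2, 3]) for i in Interface.registry)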

This is a standalone part of my work with @linkml arrays and rearchitecting neurobio data formats like NWB to be dead simple to use and extend, integrating with the tools you already use and across the experimental process - specify your data in a simple yaml format, and get back high quality data modeling code that is standards-compliant out of the box and can be used with arbitrary backends. One step towards the wild exuberance of FAIR data that is just as comfortable in the scattered scripts of real experimental work as it is in carefully curated archives and high performance computing clusters. Longer term I'm trying to abstract away data store implementations to bring content-addressed p2p data stores right into the python interpreter as simply as if something was born in local memory.

plenty of todos, but hope ya like it.

#linkml #python #NewWork #pydantic #ScientificSoftware

[This and the following images aren't very screen reader friendly with a lot of code in them. I'll describe what's going on in brackets and then put the text below.

In this image: a demonstration of the basic usage of numpydantic, declaring an "array" field on a pydantic model with an NDArray class with a shape and dtype specification. The model can then be used with a number of different array libraries and data formats, including validation.]

Numpydantic allows you to do this:

from pydantic import BaseModel
from numpydantic import NDArray, Shape

class MyModel(BaseModel):
    array: NDArray[Shape["3 x, 4 y, * z"], int]

And use it with your favorite array library:

import numpy as np
import dask.array as da
import zarr

# numpy
model = MyModel(array=np.zeros((3, 4, 5), dtype=int))
# dask
model = MyModel(array=da.zeros((3, 4, 5), dtype=int))
# hdf5 datasets
model = MyModel(array=('data.h5', '/nested/dataset'))
# zarr arrays
model = MyModel(array=zarr.zeros((3,4,5), dtype=int))
model = MyModel(array='data.zarr')
model = MyModel(array=('data.zarr', '/nested/dataset'))
# video files
model = MyModel(array="data.mp4")[Further demonstration of validation and array expression, where a Union of NDArray specifications can specify a more complex data type - eg. an image that can be any shape in x and y, an RGB image, or a specific resolution of a video, each with independently checked dtypes]

For example, to specify a very special type of image that can either be

    a 2D float array where the axes can be any size,

    a 3D uint8 array where the third axis must be size 3, or

    a 1080p video

from typing import Union
from pydantic import BaseModel
import numpy as np

from numpydantic import NDArray, Shape

class Image(BaseModel):
    array: Union[
        NDArray[Shape["* x, * y"], float],
        NDArray[Shape["* x, * y, 3 rgb"], np.uint8],
        NDArray[Shape["* t, 1080 y, 1920 x, 3 rgb"], np.uint8]
    ]

And then use that as a transparent interface to your favorite array library!
Interfaces
Numpy

The Coca-Cola of array libraries

import numpy as np
# works
frame_gray = Image(array=np.ones((1280, 720), dtype=float))
frame_rgb  = Image(array=np.ones((1280, 720, 3), dtype=np.uint8))

# fails
wrong_n_dimensions = Image(array=np.ones((1280,), dtype=float))
wrong_shape = Image(array=np.ones((1280,720,10), dtype=np.uint8))

# shapes and types are checked together, so this also fails
wrong_shape_dtype_combo = Image(array=np.ones((1280, 720, 3), dtype=float))

[Demonstration of usage outside of pydantic as just a normal python type - you can validate an array against a specification by checking if the array is an instance of the array specification type]

And use the NDArray type annotation like a regular type outside of pydantic – eg. to validate an array anywhere, use isinstance:

array_type = NDArray[Shape["1, 2, 3"], int]
isinstance(np.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(zarr.zeros((1,2,3), dtype=int), array_type)
# True
isinstance(np.zeros((4,5,6), dtype=int), array_type)
# False
isinstance(np.zeros((1,2,3), dtype=float), array_type)
# False

[Demonstration of JSON schema generation using the sort of odd case of an array with a specific dtype but an arbitrary shape. It has to use a recursive JSON schema definition, where the items of a given JSON array can either be the innermost dtype or another instance of that same array. Since JSON Schema doesn't support extended dtypes like 8-bit integers, we encode that information as maximum and minimum constraints on the `integer` type and add it in the schema metadata. Since pydantic renders all recursive schemas like this in the same $defs block, we use a blake2b hash against the dtype specification to keep them deduplicated.]

numpydantic can even handle shapes with unbounded numbers of dimensions by using recursive JSON schema!!!

So the any-shaped array (using nptyping’s ellipsis notation):

class AnyShape(BaseModel):
    array: NDArray[Shape["*, ..."], np.uint8]

is rendered to JSON-Schema like this:

{
  "$defs": {
    "any-shape-array-9b5d89838a990d79": {
      "anyOf": [
        {
          "items": {
            "$ref": "#/$defs/any-shape-array-9b5d89838a990d79"
          },
          "type": "array"
        },
        {"maximum": 255, "minimum": 0, "type": "integer"}
      ]
    }
  },
  "properties": {
    "array": {
      "dtype": "numpy.uint8",
      "items": {"$ref": "#/$defs/any-shape-array-9b5d89838a990d79"},
      "title": "Array",
      "type": "array"
    }
  },
  "required": ["array"],
  "title": "AnyShape",
  "type": "object"
}
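
The -9b5d89838a990d79 suffix on the $defs key is that dedup hash. Illustratively - the exact string numpydantic hashes is an internal detail, so the input here is just a stand-in:

import hashlib

# an 8-byte blake2b digest gives a short, stable key for the $defs entry
dtype_spec = "numpy.uint8"  # stand-in for the real hashed specification
key = hashlib.blake2b(dtype_spec.encode(), digest_size=8).hexdigest()
print(f"any-shape-array-{key}")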
