<pre class='metadata'>
Title: Next-generation file formats (NGFF)
Shortname: ome-ngff
Level: 1
Status: LS-COMMIT
Status: w3c/ED
Group: ome
URL: https://joshmoore.github.io/ngff
Repository: https://github.com/joshmoore/ngff
Issue Tracking: Forums https://forum.image.sc/tag/ome-ngff
Logo: http://www.openmicroscopy.org/img/logos/ome-logomark.svg
Local Boilerplate: header yes
Local Boilerplate: copyright yes
Boilerplate: style-darkmode off
Markup Shorthands: markdown yes
Editor: Josh Moore, Open Microscopy Environment (OME) https://www.openmicroscopy.org
Abstract: This document contains next-generation file format (NGFF)
Abstract: specifications for storing bioimaging data in the cloud.
Abstract: All specifications are submitted to the https://image.sc community for review.
Status Text: The current released version of this specification is 0.1. Migration scripts
Status Text: will be provided between numbered versions. Data written with the latest changes
Status Text: (an "editor's draft") will not necessarily be supported.
</pre>

Introduction {#intro}
=====================

Bioimaging science is at a crossroads. Currently, the drive to acquire more,
larger, preciser spatial measurements is unfortunately at odds with our ability
to structure and share those measurements with others. During a global pandemic
more than ever, we believe fervently that global, collaborative discovery as
opposed to the post-publication, "data-on-request" mode of operation is the
path forward. Bioimaging data should be shareable via open and commercial cloud
resources without the need to download entire datasets.

At the moment, that is not the norm. The plethora of data formats produced by
imaging systems are ill-suited to remote sharing. Individual scientists
typically lack the infrastructure they need to host these data themselves. When
they acquire images from elsewhere, time-consuming translations and data
cleaning are needed to interpret findings. Those same costs are multiplied when
gathering data into online repositories where curator time can be the limiting
factor before publication is possible. Without a common effort, each lab or
resource is left building the tools they need and maintaining that
infrastructure often without dedicated funding.

This document defines a specification for bioimaging data to make it possible
to enable the conversion of proprietary formats into a common, cloud-ready one.
Such next-generation file formats layout data so that individual portions, or
"chunks", of large data are reference-able eliminating the need to download
entire datasets.


Why "<dfn export="true"><abbr title="Next-generation file-format">NGFF</abbr></dfn>"? {#why-ngff}
-------------------------------------------------------------------------------------------------

A short description of what is needed for an imaging format is "a hierarchy
of n-dimensional (dense) arrays with metadata". This combination of features
is certainly provided by <dfn export="true"><abbr title="Hierarchical Data Format 5">HDF5</abbr></dfn>
from the <a href="https://www.hdfgroup.org">HDF Group</a>, which a number of
bioimaging formats do use. HDF5 and other larger binary structures, however,
are ill-suited for storage in the cloud where accessing individual chunks
of data by name rather than seeking through a large file is at the heart of
parallelization.

As a result, a number of formats have been developed more recently which provide
the basic data structure of an HDF5 file, but do so in a more cloud-friendly way.
In the [PyData](https://pydata.org/) community, the Zarr [[zarr]] format was developed
for easily storing collections of [NumPy](https://numpy.org/) arrays. In the
[ImageJ](https://imagej.net/) community, N5 [[n5]] was developed to work around
the limitations of HDF5 ("N5" was originally short for "Not-HDF5").
Both of these formats permit storing individual chunks of data either locally in
separate files or in cloud-based object stores as separate keys.

A [current effort](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html)
is underway to unify the two similar specifications to provide a single binary
specification. The editor's draft will soon be entering a [request for comments (RFC)](https://github.com/zarr-developers/zarr-specs/issues/101) phase with the goal of having a first version early in 2021. As that
process comes to an end, this document will be updated.

OME-NGFF {#ome-ngff}
--------------------

The conventions and specifications defined in this document are designed to
enable next-generation file formats to represent the same bioimaging data
that can be represented in \[OME-TIFF](http://www.openmicroscopy.org/ome-files/)
and beyond. However, the conventions will also be usable by HDF5 and other sufficiently advanced
binary containers. Eventually, we hope, the moniker "next-generation" will no longer be
applicable, and this will simply be the most efficient, common, and useful representation
of bioimaging data, whether during acquisition or sharing in the cloud.

Note: The following text makes use of OME-Zarr [[ome-zarr-py]], the current prototype implementation,
for all examples.

On-disk (or in-cloud) layout {#on-disk}
=======================================

An overview of the layout of an OME-Zarr fileset should make
understanding the following metadata sections easier. The hierarchy
is represented here as it would appear locally but could equally
be stored on a web server to be accessed via HTTP or in object storage
like S3 or GCS.

```
.                             # Root folder, potentially in S3,
│                             # with a flat list of images by image ID.
│
├── 123.zarr                  # One image (id=123) converted to Zarr.
│
└── 456.zarr                  # Another image (id=456) converted to Zarr.
    │
    ├── .zgroup               # Each image is a Zarr group, or a folder, of other groups and arrays.
    ├── .zattrs               # Group level attributes are stored in the .zattrs file and include
    │                         #  "multiscales" and "omero" below)
    │
    ├── 0                     # Each multiscale level is stored as a separate Zarr array,
    │   ...                   # which is a folder containing chunk files which compose the array.
    ├── n                     # The name of the array is arbitrary with the ordering defined by
    │   │                     # by the "multiscales" metadata, but is often a sequence starting at 0.
    │   │
    │   ├── .zarray           # All image arrays are 5-dimensional
    │   │                     # with dimension order (t, c, z, y, x).
    │   │
    │   ├── 0.0.0.0.0         # Chunks are stored with the flat directory layout.
    │   │   ...               # Each dotted component of the chunk file represents
    │   └── t.c.z.y.x         # a "chunk coordinate", where the maximum coordinate
    │                         # will be `dimension_size / chunk_size`.
    │
    └── labels
        │
        ├── .zgroup           # The labels group is a container which holds a list of labels to make the objects easily discoverable
        │
        ├── .zattrs           # All labels will be listed in `.zattrs` e.g. `{ "labels": [ "original/0" ] }`
        │                     # Each dimension of the label `(t, c, z, y, x)` should be either the same as the
        │                     # corresponding dimension of the image, or `1` if that dimension of the label
        │                     # is irrelevant.
        │
        └── original          # Intermediate folders are permitted but not necessary and currently contain no extra metadata.
            │
            └── 0             # Multiscale, labeled image. The name is unimportant but is registered in the "labels" group above.
                ├── .zgroup   # Zarr Group which is both a multiscaled image as well as a labeled image.
                ├── .zattrs   # Metadata of the related image and as well as display information under the "image-label" key.
                │
                ├── 0         # Each multiscale level is stored as a separate Zarr array, as above, but only integer values
                │   ...       # are supported.
                └── n
```

Metadata {#metadata}
====================

The various `.zattrs` files throughout the above array hierarchy may contain metadata
keys as specified below for discovering certain types of data, especially images.

"multiscales" metadata {#multiscale-md}
---------------------------------------

Metadata about the multiple resolution representations of the image can be
found under the "multiscales" key in the group-level metadata.
The specification for the multiscale (i.e. "resolution") metadata is provided
in [zarr-specs#50](https://github.com/zarr-developers/zarr-specs/issues/50).
If only one multiscale is provided, use it. Otherwise, the user can choose by
name, using the first multiscale as a fallback:

```python
datasets = []
for named in multiscales:
    if named["name"] == "3D":
        datasets = [x["path"] for x in named["datasets"]]
        break
if not datasets:
    # Use the first by default. Or perhaps choose based on chunk size.
    datasets = [x["path"] for x in multiscales[0]["datasets"]]
```

The subresolutions in each multiscale are ordered from highest-resolution
to lowest.

"omero" metadata {#omero-md}
----------------------------

Information specific to the channels of an image and how to render it
can be found under the "omero" key in the group-level metadata:

```json
"id": 1,                              # ID in OMERO
"name": "example.tif",                # Name as shown in the UI
"version": "0.1",                     # Current version
"channels": [                         # Array matching the c dimension size
    {
        "active": true,
        "coefficient": 1,
        "color": "0000FF",
        "family": "linear",
        "inverted": false,
        "label": "LaminB1",
        "window": {
            "end": 1500,
            "max": 65535,
            "min": 0,
            "start": 0
        }
    }
],
"rdefs": {
    "defaultT": 0,                    # First timepoint to show the user
    "defaultZ": 118,                  # First Z section to show the user
    "model": "color"                  # "color" or "greyscale"
}
```

See https://docs.openmicroscopy.org/omero/5.6.1/developers/Web/WebGateway.html#imgdata
for more information.

"labels" metadata {#labels-md}
------------------------------

The special group "labels" found under an image Zarr contains the key `labels` containing
the paths to label objects which can be found underneath the group:

```json
{
  "labels": [
    "orphaned/0"
  ]
}
```

Unlisted groups MAY be labels.

"image-label" metadata {#label-md}
----------------------------------

Groups containing the `image-label` dictionary represent an image segmentation
in which each unique pixel value represents a separate segmented object.
`image-label` groups MUST also contain `multiscales` metadata and the two
"datasets" series MUST have the same number of entries.

The `colors` key defines a list of JSON objects describing the unique label
values. Each entry in the list MUST contain the key "label-value" with the
pixel value for that label. Additionally, the "rgba" key MAY be present, the
value for which is an RGBA unsigned-int 4-tuple: `[uint8, uint8, uint8, uint8]`
All `label-value`s must be unique. Clients who choose to not throw an error
should ignore all except the _last_ entry.

Some implementations may represent overlapping labels by using a specially assigned
value, for example the highest integer available in the pixel range.

The `source` key is an optional dictionary which contains information on the
image the label is associated with. If included it MAY include a key `image`
whose value is the relative path to a Zarr image group. The default value is
"../../" since most labels are stored under a subgroup named "labels/" (see
above).


```json
"image-label":
  {
    "version": "0.1",
    "colors": [
      {
        "label-value": 1,
        "rgba": [255, 255, 255, 0]
      },
      {
        "label-value": 4,
        "rgba": [0, 255, 255, 128]
      },
      ...
      ]
    },
    "source": {
      "image": "../../"
    }
]
```


<table>
  <thead>
    <tr>
      <td>Revision</td>
      <td>Date</td>
      <td>Description</td>
    </tr>
  </thead>
  <tr>
    <td>0.1.3-dev4</td>
    <td>2020-09-14</td>
    <td>Add the image-label object                 </td>
  </tr>
  <tr>
    <td>0.1.3-dev3</td>
    <td>2020-09-01</td>
    <td>Convert labels to multiscales              </td>
  </tr>
  <tr>
    <td>0.1.3-dev2</td>
    <td>2020-08-18</td>
    <td>Rename masks as labels                     </td>
  </tr>
  <tr>
    <td>0.1.3-dev1</td>
    <td>2020-07-07</td>
    <td>Add mask metadata                          </td>
  </tr>
  <tr>
    <td>0.1.2     </td>
    <td>2020-05-07</td>
    <td>Add description of "omero" metadata        </td>
  </tr>
  <tr>
    <td>0.1.1     </td>
    <td>2020-05-06</td>
    <td>Add info on the ordering of resolutions    </td>
  </tr>
  <tr>
    <td>0.1.0     </td>
    <td>2020-04-20</td>
    <td>First version for internal demo            </td>
  </tr>
</table>


<pre class="biblio">
{
  "blogNov2020": {
    "href": "https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/",
    "title": "Public OME-Zarr data (Nov. 2020)",
    "authors": [
      "OME Team"
    ],
    "status": "Informational",
    "publisher": "OME",
    "id": "blogNov2020",
    "date": "04 November 2020"
  },
  "imagesc26952": {
    "href": "https://forum.image.sc/t/ome-s-position-regarding-file-formats/26952",
    "title": "OME’s position regarding file formats",
    "authors": [
      "OME Team"
    ],
    "status": "Informational",
    "publisher": "OME",
    "id": "imagesc26952",
    "date": "19 June 2020"
  },
  "n5": {
    "id": "n5",
    "href": "https://github.com/saalfeldlab/n5/issues/62",
    "title": "N5---a scalable Java API for hierarchies of chunked n-dimensional tensors and structured meta-data",
    "status": "Informational",
    "authors": [
      "John A. Bogovic",
      "Igor Pisarev",
      "Philipp Hanslovsky",
      "Neil Thistlethwaite",
      "Stephan Saalfeld"
    ],
    "date": "2020"
  },
  "ome-zarr-py": {
    "id": "ome-zarr-py",
    "href": "https://doi.org/10.5281/zenodo.4113931",
    "title": "ome-zarr-py: Experimental implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.",
    "status": "Informational",
    "publisher": "Zenodo",
    "authors": [
      "OME",
      "et al"
    ],
    "date": "06 October 2020"
  },
  "zarr": {
    "id": "zarr",
    "href": "https://doi.org/10.5281/zenodo.4069231",
    "title": "Zarr: An implementation of chunked, compressed, N-dimensional arrays for Python.",
    "status": "Informational",
    "publisher": "Zenodo",
    "authors": [
      "Alistair Miles",
      "et al"
    ],
    "date": "06 October 2020"
  }
}
</pre>
