# GOSH: **G**eneralised **O**scillator **S**trengths in **H**DF5
### File format specification 0.7.2

The ***G**eneralised **O**scillator **S**trengths in **H**DF5*, or **GOSH**, file format is built on top of the HDF5 scientific data format. It adopts a specific set of rules for organising the data within the file, so that generalised oscillator strength data can be exchanged among scientists while remaining interoperable with the widest possible range of software packages.

This makes it easier for scientists working on GOS data to publish their results, saves software developers the effort of implementing new formats, and gives users the widest choice of both data and software.

### Extension

The recommended extensions are `.gosh` or `.gos`.

#### Layout


```
file
  └───file_format - attribute
  └───file_format_version - attribute
  └───Ti       - group
  |   └───K1        - group
  |   |    data         - dataset
  |   |    q            - dataset
  |   |    free_energy  - dataset
  |   |    variants     - dataset
  |   |    metadata     - group
  |   └───L1        - group
  |   |    data         - dataset
  |   |    q            - dataset
  |   |    free_energy  - dataset
  |   |    variants     - dataset
  |   |    metadata     - group
  |   └───L23       - group
  |        ...
  |
  └───Al       - group
  |   └───K1        - group
  |   |    ...
  |   |    ...
  |
  └───metadata - group
      └───authors - attribute
      └───bib_ref  - group
      |      bib_cite      - attribute
      |      bib_doi       - attribute
      |      bib_url       - attribute
      |      ...
      └───capabilities - group
      |      oxidation       - attribute
      |      quantum_numbers - attribute
      |      ...
      └───computation_info - group
      |      ...
      └───edges_info - group
           └───K            - group
                 ref_edge         - attribute
                 n                - attribute
                 l                - attribute
                 occupancy_ratio  - attribute
```


#### Root attributes

The root has two attributes:
`file_format`, which should always be the string `GOSH`, and `file_format_version`, which should be a string containing the exact version number at the head of this specification, e.g. `"0.7.2"`.
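
A reader could check these two attributes before touching any data. The sketch below is one way to do so, not part of the specification: the helper names `_as_str` and `read_gosh_version` are hypothetical, and the function accepts any mapping-like object holding the root attributes, so a plain dict stands in for an opened file here. It decodes bytes as well as text, since HDF5 bindings may return either for string attributes.

```python
from typing import Mapping, Tuple


def _as_str(value) -> str:
    """Return a text string; HDF5 bindings may hand back bytes or str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def read_gosh_version(root_attrs: Mapping) -> Tuple[int, ...]:
    """Check the `file_format` marker and return the version as a tuple of ints."""
    file_format = _as_str(root_attrs["file_format"])
    if file_format != "GOSH":
        raise ValueError(f"not a GOSH file: file_format is {file_format!r}")
    version = _as_str(root_attrs["file_format_version"])
    return tuple(int(part) for part in version.split("."))


# A plain dict stands in for the root attributes of an opened file.
version = read_gosh_version({"file_format": "GOSH", "file_format_version": "0.7.2"})
```

Returning the version as a tuple of integers makes comparisons such as `version >= (0, 7)` straightforward, assuming the version string stays purely numeric.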

#### Groups

The file contains one group for each element described, plus one for the metadata.

#### Edges

Each element group contains one subgroup for each edge for which data is provided.
The name of the subgroup is the same as the edge, e.g. K, L1 etc., as also identified by edges_info (see below). Each subgroup has a dataset called `data`, two datasets for the axes called `q` and `free_energy`, a dataset called `variants`, and a group called `metadata`.

##### metadata (edge)

This group contains metadata that is specific to the edge rather than to the entire dataset.

For now, only one field is specified by this documentation, but other fields can be added as required.

| name | description
| :--: | :---
| computed_onset | The value of the onset energy of the edge as computed.

##### data

The dataset is an array with size *m×n×o*, where *m* and *n* are the number of points along the momentum and energy axes respectively, and *o* indexes variants of the same edge (e.g. different valence). When there is only one variant, the dataset should still be 3D, with the third axis having a length of 1.
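
The "always 3D" rule can be applied when writing a file. The sketch below is an illustration only, and the helper name `gosh_data_shape` is hypothetical: it takes the shape of a computed array and returns the shape the `data` dataset should have, adding a trailing axis of length 1 when there is a single variant.

```python
from typing import Sequence, Tuple


def gosh_data_shape(shape: Sequence[int]) -> Tuple[int, int, int]:
    """Return the 3D shape a GOSH `data` dataset should have.

    A 2D shape (a single variant) gains a trailing axis of length 1;
    a 3D shape is returned unchanged.
    """
    if len(shape) == 2:
        return (shape[0], shape[1], 1)
    if len(shape) == 3:
        return (shape[0], shape[1], shape[2])
    raise ValueError(f"expected a 2D or 3D shape, got {len(shape)}D")
```

With a NumPy array `a`, for example, `a.reshape(gosh_data_shape(a.shape))` would apply the same rule before the array is written.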

##### axes

Each subgroup also has two datasets dedicated to axes coordinates: `q` and `free_energy`, with a length of *m* and *n* respectively.
`q` is a 1D array of size *m* containing the coordinates of the mesh points along the momentum axis, in inverse angstroms.
Likewise `free_energy` is a 1D array of size *n* containing the coordinates along the energy axis, in eV. The energy values are expressed in free energy, i.e. post-onset energy.
This makes it more natural for the edge onset to be loaded from a database of experimentally determined values.
The points do not need to be equally spaced in either direction.
In particular, an exponential mesh in both directions has proven effective in combining good resolution at lower energy/momentum (where sharper features are located) with wide coverage up to high energy/momentum.
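
The specification does not fix how such a mesh is built. As a minimal sketch, assuming "exponential" means a constant ratio between neighbouring points, the hypothetical helper `exponential_mesh` below generates one using only the standard library. The ranges and the number of points are made up for illustration.

```python
import math
from typing import List


def exponential_mesh(start: float, stop: float, num: int) -> List[float]:
    """Return `num` points from `start` to `stop` with a constant ratio between neighbours."""
    if num < 2:
        raise ValueError("num must be at least 2")
    if not 0 < start < stop:
        raise ValueError("need 0 < start < stop")
    log_step = (math.log(stop) - math.log(start)) / (num - 1)
    return [start * math.exp(i * log_step) for i in range(num)]


# Hypothetical ranges, chosen only for illustration.
q = exponential_mesh(0.05, 50.0, 128)              # inverse angstroms
free_energy = exponential_mesh(0.01, 4000.0, 128)  # eV above the onset
```

A mesh of this kind cannot start exactly at zero, so the first point has to be a small positive value.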

##### variants

A fourth dataset called `variants` is a table describing the features of the *o* different variants of the same edge.
The table has two columns: the first contains the coordinate along the third axis of `data`, and the second contains a dictionary, serialised as a JSON-formatted string, listing the "characteristics" of each variant.

| variant | properties
| :-----: | :-------------------- 
|    0    |   default
|    1    | {"relativistic": "low",  "oxidation": "3+"}
|    2    | {"relativistic": "high", "oxidation": "3+"}
|    3    | {"relativistic": "low",  "oxidation": "4+"}
|    4    | {"relativistic": "high", "oxidation": "4+"}

These have to be listed with each edge because different edges might have different variants.
For instance the oxidation states would differ for each element, both in how many are available, and which ones are physically possible.
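
To pick a variant, a reader can decode the JSON strings and match them against the properties it needs. The sketch below assumes the property strings have already been read into a list ordered by variant index. The names `parse_variant` and `find_variant` are hypothetical, and treating a non-JSON entry such as `default` as "no properties" is an assumption of this example, not a rule of the format.

```python
import json
from typing import Dict, Sequence


def parse_variant(properties: str) -> Dict[str, str]:
    """Decode one `variants` entry into a dict of properties.

    Assumption: an entry that is not a JSON object (such as the bare
    `default` in the table above) is treated as having no properties.
    """
    try:
        decoded = json.loads(properties)
    except json.JSONDecodeError:
        return {}
    return decoded if isinstance(decoded, dict) else {}


def find_variant(variants: Sequence[str], **wanted: str) -> int:
    """Return the index along the third axis of `data` whose properties match `wanted`."""
    for index, properties in enumerate(variants):
        parsed = parse_variant(properties)
        if all(parsed.get(key) == value for key, value in wanted.items()):
            return index
    raise LookupError(f"no variant matches {wanted}")


# The property column of the example table, in index order.
variants = [
    "default",
    '{"relativistic": "low",  "oxidation": "3+"}',
    '{"relativistic": "high", "oxidation": "3+"}',
    '{"relativistic": "low",  "oxidation": "4+"}',
    '{"relativistic": "high", "oxidation": "4+"}',
]
index = find_variant(variants, relativistic="high", oxidation="4+")
```

The returned index selects the matching slice along the third axis of `data`. Called with no properties, `find_variant` returns the first entry.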

---

#### Metadata (general)

Contains the following data:

|     name     | description 
|     ---:     |   :---   
|   authors    | author name(s)
|   **bib_ref**    | group for bibliographic information
|   .../bib_cite   | citation for the related publication
|   .../bib_doi    | DOI of the publication
|   .../bib_url    | URL of the publication
|  **capabilities** | group describing which effects are modelled in the data
|.../quantum_numbers| list of quantum numbers
|.../relativistic effects| are they available?
|**computation_info**| useful or relevant information about how the data has been computed
|.../code_ref | URL for the code used to compute the data
|.../\* | the computation_info group can contain arbitrary fields with any useful information or parameters
| **data_ref** | group, contains data_doi, data_record, data_url and data_version
|.../data_doi | DOI that should resolve to the data download page
|.../data_record | Zenodo upload title
|.../data_url | URL to download the data
|.../data_version | version of the data, semver recommended
|  **edges_info** | metadata group that sets a few conventions. See below.

##### edges_info

This should be a metadata group that converts the notations commonly used to indicate edges into the names of the tables where the relevant information is located.
One of its important roles, currently, is to account for the different choices that may have been made in the normalisation of the GOS. For instance, if the data does not take J-splitting into account, it cannot contain separate L2 and L3 edges. These sub-edges can, however, be reproduced using scaled and shifted copies of the L2,3 edge, in which case the normalisation of the different edges becomes important. One could, for instance, have normalised the data to the whole L2,3 edge or, as is the case in the well-known dataset computed by P. Rez, to the L3 edge alone.

The form of edges_info is a group with a subgroup for each edge "name".
The data type of each field is fixed by its physical requirements:

|     name           | dtype | description 
|     ------:        | :---: |:---------------------
|   **edge**         | group | group named after the edge, e.g. K or L2,3. Contains the following attributes:
| .../ref_edge       |  str  | the name of the edge dataset to be used to simulate the edge
|.../occupancy_ratio | float | the scaling to be applied to reproduce the correct edge
|   .../n            |  int  | principal quantum number of the bound electrons
|   .../l            |  int  | orbital quantum number of the bound electrons


An example of how the respective fields could look:

| edge | ref_edge | occupancy_ratio |  n  |  l   
| :--: | :------: | :-----------:   | :-: | :-: 
|  K   |   K1     |      1.0        |  1  |  0
|  K1  |   K1     |      1.0        |  1  |  0
|  L1  |   L1     |      1.0        |  2  |  0
|  L2  |   L23    |   0.333333      |  2  |  1
|  L3  |   L23    |   0.666666      |  2  |  1
|  L23 |   L23    |      1.0        |  2  |  1
| L2,3 |   L23    |      1.0        |  2  |  1
| ...  |   ...    |      ...        | ... | ...
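
The lookup described by this table can be sketched in a few lines. The example below is an illustration under stated assumptions: the names `EDGES_INFO`, `resolve_edge` and `dataset_path` are hypothetical, and a plain dict reproduces the example rows, whereas in a real file the same values would be attributes of the subgroups of `edges_info`.

```python
from typing import Dict, Mapping, Tuple

# The example table above, written as a plain mapping from edge name to attributes.
EDGES_INFO: Dict[str, dict] = {
    "K":    {"ref_edge": "K1",  "occupancy_ratio": 1.0,      "n": 1, "l": 0},
    "K1":   {"ref_edge": "K1",  "occupancy_ratio": 1.0,      "n": 1, "l": 0},
    "L1":   {"ref_edge": "L1",  "occupancy_ratio": 1.0,      "n": 2, "l": 0},
    "L2":   {"ref_edge": "L23", "occupancy_ratio": 0.333333, "n": 2, "l": 1},
    "L3":   {"ref_edge": "L23", "occupancy_ratio": 0.666666, "n": 2, "l": 1},
    "L23":  {"ref_edge": "L23", "occupancy_ratio": 1.0,      "n": 2, "l": 1},
    "L2,3": {"ref_edge": "L23", "occupancy_ratio": 1.0,      "n": 2, "l": 1},
}


def resolve_edge(edges_info: Mapping[str, Mapping], edge: str) -> Tuple[str, float]:
    """Return the subgroup name holding the data and the scaling to apply to it."""
    try:
        info = edges_info[edge]
    except KeyError:
        raise KeyError(f"edge {edge!r} is not listed in edges_info") from None
    return info["ref_edge"], float(info["occupancy_ratio"])


def dataset_path(element: str, edges_info: Mapping[str, Mapping], edge: str) -> str:
    """Build the path of the `data` dataset for an element and a requested edge."""
    ref_edge, _ = resolve_edge(edges_info, edge)
    return f"{element}/{ref_edge}/data"


ref_edge, ratio = resolve_edge(EDGES_INFO, "L3")
path = dataset_path("Ti", EDGES_INFO, "L3")
```

Here a request for the L3 edge of Ti resolves to the `L23` subgroup, and the caller is expected to scale the loaded data by the returned `occupancy_ratio`.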

##### capabilities

This should be a group that describes which physical effects are or are not included in the data. It also acts as a table listing the "tags" and features used in the `variants` tables, with a useful description of how they are to be used. Features that are not implemented may also be listed, followed by the string "unavailable".

