---
title: "Metadatasets: a dataset of datasets"
keywords: fastai
sidebar: home_sidebar
summary: "This functionality will allow you to create a dataset from data stored in multiple, smaller datasets."
description: "This functionality will allow you to create a dataset from data stored in multiple, smaller datasets."
nb_path: "nbs/015_data.metadatasets.ipynb"
---
{% raw %}
{% endraw %}
  • I'd like to thank both Thomas Capelle (https://github.com/tcapelle) and Xander Dunn (https://github.com/xanderdunn) for their contributions, which made this code possible.
  • This functionality allows you to use multiple numpy arrays instead of a single one, which may be very useful in many practical settings. I've tested it with 10k+ datasets and it works well.
{% raw %}
{% endraw %} {% raw %}

class TSMetaDataset[source]

TSMetaDataset(dataset_list, **kwargs)

A dataset capable of indexing multiple datasets at the same time

{% endraw %} {% raw %}

class TSMetaDatasets[source]

TSMetaDatasets(metadataset, splits) :: FilteredBase

Base class for lists with subsets

{% endraw %} {% raw %}
{% endraw %}
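The core idea behind a meta-dataset can be sketched in plain Python: cumulative lengths map a global index to a `(dataset, local index)` pair. This is a toy sketch of the concept, not the `TSMetaDataset` implementation:

```python
import numpy as np

class ToyMetaDataset:
    """Toy sketch (not the tsai implementation): index several
    datasets of different lengths as if they were a single one."""
    def __init__(self, dataset_list):
        self.datasets = dataset_list
        # cumulative lengths map a global index to (dataset, local index)
        self.cum_lens = np.cumsum([0] + [len(d) for d in dataset_list])

    def __len__(self):
        return int(self.cum_lens[-1])

    def __getitem__(self, i):
        # find which dataset the global index i falls into
        ds = int(np.searchsorted(self.cum_lens, i, side='right')) - 1
        return self.datasets[ds][i - self.cum_lens[ds]]

# three lists of different sizes behave like one dataset of size 9
meta = ToyMetaDataset([[0, 1], [2, 3, 4], [5, 6, 7, 8]])
len(meta), meta[4], meta[8]  # → (9, 4, 8)
```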

Let's create 3 datasets. In this case they will have different sizes.

{% raw %}
vocab = L(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
dsets = []
for i in range(3):
    size = np.random.randint(50, 150)
    X = torch.rand(size, 5, 50)
    y = vocab[torch.randint(0, 10, (size,))]
    tfms = [None, TSClassification(add_na=True)]
    dset = TSDatasets(X, y, tfms=tfms)
    dsets.append(dset)
dsets
[(#105) [(TSTensor(vars:5, len:50, device=cpu), TensorCategory(9)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(7)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(5)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(9)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(10)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2))] ...],
 (#134) [(TSTensor(vars:5, len:50, device=cpu), TensorCategory(1)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(2)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(7)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(6)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(10)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(4)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(9)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(8)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(3))] ...],
 (#143) [(TSTensor(vars:5, len:50, device=cpu), TensorCategory(5)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(10)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(6)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(5)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(10)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(8)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(7)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(1)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(6)), (TSTensor(vars:5, len:50, device=cpu), TensorCategory(7))] ...]]
{% endraw %} {% raw %}
metadataset = TSMetaDataset(dsets)
metadataset, metadataset.vars, metadataset.len
(<__main__.TSMetaDataset at 0x7fd7ffe75f28>, 5, 50)
{% endraw %}
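The time-ordered split applied next keeps samples in order and reserves the last fraction for validation. Assuming a default validation fraction of 0.2, it can be mimicked in plain numpy:

```python
import numpy as np

def time_split(n, valid_size=0.2):
    """Sketch of a time-ordered split: the first samples train,
    the last `valid_size` fraction validates (assumed behavior)."""
    cut = n - int(n * valid_size)
    idxs = np.arange(n)
    return idxs[:cut], idxs[cut:]

# 382 = 105 + 134 + 143 samples in the three datasets above
train, valid = time_split(382)
len(train), len(valid)  # → (306, 76)
```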

We'll apply splits now to create train and valid metadatasets:

{% raw %}
splits = TimeSplitter()(metadataset)
splits
((#306) [0,1,2,3,4,5,6,7,8,9...],
 (#76) [306,307,308,309,310,311,312,313,314,315...])
{% endraw %} {% raw %}
metadatasets = TSMetaDatasets(metadataset, splits=splits)
metadatasets.train, metadatasets.valid
(<__main__.TSMetaDataset at 0x7fd7ffe75d30>,
 <__main__.TSMetaDataset at 0x7fd7ffe75c18>)
{% endraw %} {% raw %}
dls = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid)
xb, yb = first(dls.train)
xb, yb
(tensor([[[8.9708e-01, 2.8598e-01, 9.0524e-01,  ..., 4.7881e-01,
           6.9086e-01, 9.7953e-01],
          [3.9702e-01, 2.8280e-01, 7.1657e-01,  ..., 1.7420e-01,
           1.9575e-03, 2.7200e-01],
          [4.9516e-01, 9.2424e-01, 6.4480e-01,  ..., 8.6884e-01,
           1.9167e-01, 3.8663e-01],
          [3.0259e-01, 2.1004e-01, 6.3733e-01,  ..., 7.8205e-02,
           1.5396e-01, 3.9986e-01],
          [5.1964e-01, 3.4127e-01, 6.4531e-01,  ..., 7.1806e-02,
           7.4778e-01, 4.2946e-01]],
 
         [[7.5882e-01, 8.0031e-01, 7.3100e-01,  ..., 1.3822e-02,
           8.3882e-02, 1.7649e-01],
          [7.8212e-01, 8.4554e-01, 5.3522e-01,  ..., 8.4573e-01,
           2.9283e-01, 4.1084e-01],
          [7.1707e-01, 1.0961e-01, 9.9014e-01,  ..., 2.9253e-01,
           3.3794e-01, 2.3092e-01],
          [9.7081e-01, 9.3648e-01, 3.8191e-01,  ..., 2.8765e-01,
           9.0285e-01, 4.7684e-01],
          [3.2324e-01, 3.4674e-01, 8.8366e-01,  ..., 8.3131e-01,
           1.9483e-01, 6.3751e-02]],
 
         [[3.6577e-01, 5.3525e-01, 4.1795e-01,  ..., 3.5981e-01,
           9.3276e-01, 7.0333e-01],
          [6.7278e-01, 7.0413e-02, 5.7374e-01,  ..., 9.0295e-01,
           3.6350e-01, 9.6660e-01],
          [7.4306e-01, 8.0161e-01, 4.6418e-01,  ..., 6.9928e-01,
           3.8255e-01, 2.8446e-01],
          [1.1848e-01, 2.9266e-01, 4.7914e-01,  ..., 9.2846e-01,
           9.1835e-01, 1.3424e-01],
          [5.2314e-01, 7.8462e-01, 4.0047e-01,  ..., 2.8954e-01,
           7.8985e-02, 5.9372e-01]],
 
         ...,
 
         [[3.9528e-01, 6.2661e-01, 5.0106e-01,  ..., 5.9371e-01,
           9.4917e-01, 4.4450e-01],
          [8.5632e-01, 5.2220e-01, 5.2169e-01,  ..., 3.6134e-01,
           8.3527e-01, 6.9476e-01],
          [2.6391e-01, 6.8925e-01, 8.1441e-01,  ..., 5.8711e-01,
           2.4186e-01, 1.3854e-01],
          [6.9608e-01, 5.8143e-01, 6.7683e-01,  ..., 3.6198e-01,
           7.9069e-01, 2.3458e-01],
          [9.1666e-01, 8.4379e-01, 9.7085e-01,  ..., 1.3755e-02,
           3.3765e-02, 1.0020e-01]],
 
         [[8.4602e-01, 3.5836e-01, 5.5184e-01,  ..., 7.9122e-01,
           3.3502e-01, 3.9309e-01],
          [6.2136e-01, 7.2072e-01, 7.8639e-01,  ..., 1.8939e-01,
           3.6156e-04, 6.2199e-02],
          [5.7941e-01, 6.6271e-01, 3.4343e-01,  ..., 7.1136e-01,
           7.4348e-01, 6.5310e-01],
          [1.0420e-01, 7.0913e-01, 8.8308e-01,  ..., 8.2808e-01,
           3.4749e-01, 1.6145e-01],
          [8.0476e-01, 1.0886e-01, 6.2308e-02,  ..., 2.7693e-02,
           1.3562e-01, 1.8487e-01]],
 
         [[4.3464e-01, 9.6710e-01, 3.7880e-01,  ..., 1.1528e-01,
           5.5569e-01, 8.5616e-01],
          [7.8498e-01, 3.6707e-01, 9.2552e-01,  ..., 8.8065e-01,
           7.7275e-01, 9.5932e-02],
          [5.8527e-01, 7.6148e-01, 6.1508e-01,  ..., 3.9530e-01,
           1.9376e-01, 7.3949e-01],
          [4.8249e-01, 8.3423e-01, 4.7482e-01,  ..., 2.5656e-01,
           7.5617e-01, 6.9391e-01],
          [9.5702e-01, 7.9796e-01, 5.0623e-01,  ..., 7.0712e-01,
           4.8639e-01, 3.4118e-01]]]),
 TensorCategory([ 9,  8,  4,  3,  4, 10, 10,  4,  7,  6,  4, 10,  2, 10,  6,  2,  8,  7,
          2,  3,  2,  3,  6,  2,  4,  9,  4,  5,  8,  2,  8,  2,  9,  1,  1, 10,
          8,  8, 10,  6,  8,  4,  2,  2,  7,  4,  6,  1, 10,  1,  7,  4,  8,  1,
          9,  1,  9,  1,  6,  3, 10,  8,  8,  8]))
{% endraw %}

There is also an easy way to map any particular sample in a batch back to its original dataset and index:

{% raw %}
dls = TSDataLoaders.from_dsets(metadatasets.train, metadatasets.valid)
xb, yb = first(dls.train)
mappings = dls.train.dataset.mapping_idxs
for i, (xbi, ybi) in enumerate(zip(xb, yb)):
    ds, idx = mappings[i]
    test_close(dsets[ds][idx][0].data.cpu(), xbi.cpu())
    test_close(dsets[ds][idx][1].data.cpu(), ybi.cpu())
{% endraw %}

For example, the 3rd sample in this batch comes from:

{% raw %}
dls.train.dataset.mapping_idxs[2]
array([  0, 102], dtype=int32)
{% endraw %}
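Each entry of `mapping_idxs` is a `(dataset, local index)` pair, so the output above says the 3rd sample came from `dsets[0]`, item 102. With toy datasets and a hypothetical mapping, the round trip looks like this:

```python
import numpy as np

# toy datasets and a hypothetical (dataset, local index) mapping
dsets = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
mapping_idxs = np.array([[0, 2], [2, 1], [1, 0]], dtype=np.int32)

# rebuild each batch sample from its source dataset
batch = [dsets[ds][idx] for ds, idx in mapping_idxs]
batch  # → [12, 31, 20]
```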