Published April 19, 2024 | Version: Initial
Computational notebook | Open

SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems

Description

# Split Variational Autoencoders
Generate scenarios for stochastic planning and decision-making from distributed, siloed data while keeping each silo's data fully isolated.

## Components
The SplitVAEs framework has the following components:

- **datahandlers**: classes that read and preprocess data, for example simple time-series datasets or images used for training.
- **models**: classes that implement the edge-level as well as the server-level computational steps needed for SplitVAEs. All classes are subclassed from the **BaseModel** class.
  On the server side these include VAE, VAEEncoder and VAEDecoder; on the edge side they include an Encoder and a Decoder.
- **trainer**: classes that implement the training logic and marshal and provision MPI resources. **BaseTrainHandler** is the superclass that does the heavy lifting of provisioning MPI resources and determining the rank and size of the network.
  It also reads the config files to decide which model parameters to pass to the server and edge models respectively. **BaseAgent** is the superclass that handles the actual training;
  it is subclassed by the Edge agent and the Server agent, which respectively implement the *train()* and *test()* functions (see the sketch after this list).
- **utils**: contains code needed to set up the framework.
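
To make these relationships concrete, here is a minimal, hypothetical sketch of the hierarchy as described above. The constructor signatures, class bodies and the agent class names (`ServerAgent`, `EdgeAgent`) are illustrative assumptions, not the actual source.

```python
# Hypothetical sketch of the component hierarchy described above; class
# bodies, signatures and agent names are assumptions, not the real code.
from mpi4py import MPI
import torch.nn as nn

class BaseModel(nn.Module):
    """Common parent of every server- and edge-level model."""

class Encoder(BaseModel): ...      # edge-level encoder
class Decoder(BaseModel): ...      # edge-level decoder
class VAEEncoder(BaseModel): ...   # server-level encoder
class VAEDecoder(BaseModel): ...   # server-level decoder

class VAE(BaseModel):
    """Server-level VAE built from VAEEncoder and VAEDecoder."""
    def __init__(self):
        super().__init__()
        self.encoder, self.decoder = VAEEncoder(), VAEDecoder()

class BaseAgent:
    """Holds a model and an MPI communicator; subclasses do the training."""
    def __init__(self, comm: MPI.Comm, model: BaseModel):
        self.comm, self.model = comm, model

class ServerAgent(BaseAgent):
    def train(self): ...   # server-side VAE step on aggregated edge outputs
    def test(self): ...

class EdgeAgent(BaseAgent):
    def train(self): ...   # local encoder/decoder passes over siloed data
    def test(self): ...
```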

## Config files
The following configuration files are needed:

- **trainer.yaml** is the experiment config file. It sets the parameters for batch size, number of epochs, minimum training size and the KL backpass strategy.
  The minimum training size is the minimum number of rows or data points each agent must hold locally in order to trigger one full training session.
  For instance, if MIN_TRAINING_SIZE=1024 and some agents have only 900 local rows while others have 2000, a global training round is not possible.
  The KL backpass strategy controls how and from where the KL loss is backpropagated; it is best left as local.
  The trainer config file also assigns model config files to the various MPI ranks. Under the **models_config** category there are two subcategories, *decoder* and *encoder*.
  Each subcategory maps a file name to a list of lists; the list of lists indicates which ranks should follow that particular file definition for the decoder or encoder respectively.
  For instance, suppose that in an MPI network of 10 ranks you want ranks 1, 2 and 4 to follow the definitions in **decoder1.yaml** and **encoder1.yaml** for the decoder and encoder, and ranks 3 and 5 to follow **decoder2.yaml** and **encoder2.yaml**.
  Then the **models_config** definition would be:
  ```
  decoder:
      vae_decoder.yaml: [[0]]
      decoder1.yaml: [[1,2],[4]]
      decoder2.yaml: [[3],[5]]
  encoder:
      vae_encoder.yaml: [[0]]
      encoder1.yaml: [[1,2],[4]]
      encoder2.yaml: [[3],[5]]
  ```
  The convention also supports contiguous ranges: for instance, [[1,64]] assigns the same configuration to all ranks from 1 to 64 (a small resolver sketch is given after this list).
- **(vae_)encoder.yaml / (vae_)decoder.yaml**: the YAML model-definition files for the encoder and decoder respectively. They support PyTorch layer arguments as well as activation functions; see [this link](https://gist.github.com/ferrine/89d739e80712f5549e44b2c2435979ef) for more details.
  Note that a typical file configuration could look something like this:
  ```
  optimizer:
    - args: {"name":"adam","lr":1e-2}
  architecture:
    - Linear:
        args: [8,32]
    - ReLU:
        inplace: false
    - Linear:
        args: [32,128]
  ```
  In this file, the optimizer tag specifies the optimizer to use: its name and its other parameters, for instance the learning rate.
  The architecture tag lays out the layers and the activation function for each layer. For each layer we specify the number of inputs and outputs, followed by its activation (a small builder sketch follows this list).
  This config style also supports CNNs. Here is an example:
  ```
    architecture:
      - Conv2d:
          args: [3, 16, 25]
          stride: 1
          padding: 2
      - ReLU:
          inplace: true
      - Conv2d:
          args: [16, 25, 5]
          stride: 1
          padding: 2
  ```
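
As a companion to the rank-list convention above, here is a minimal sketch of how a rank could be matched against such a **models_config** mapping. The helper names and the interpretation of two-element entries (treated here as inclusive rank ranges, which is consistent with both examples above) are illustrative assumptions, not the framework's actual API.

```python
# Sketch: resolving which model-definition file an MPI rank should load,
# following the models_config convention described above. Helper names and
# the range interpretation are assumptions for illustration only.
import yaml

def ranks_from_entry(entry):
    """Expand one inner list: [4] -> {4}, [1, 64] -> {1, ..., 64}."""
    if len(entry) == 1:
        return {entry[0]}
    lo, hi = entry
    return set(range(lo, hi + 1))

def resolve_config_for_rank(models_config, part, rank):
    """Return the YAML file under `part` ('encoder' or 'decoder') covering `rank`."""
    for file_name, rank_lists in models_config[part].items():
        for entry in rank_lists:
            if rank in ranks_from_entry(entry):
                return file_name
    raise KeyError(f"no {part} config assigned to rank {rank}")

trainer_cfg = yaml.safe_load("""
models_config:
  decoder:
    vae_decoder.yaml: [[0]]
    decoder1.yaml: [[1, 2], [4]]
    decoder2.yaml: [[3], [5]]
""")

print(resolve_config_for_rank(trainer_cfg["models_config"], "decoder", 4))
# -> decoder1.yaml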
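
The model YAML format above maps layer names onto torch.nn classes, in the spirit of the linked gist. Below is a minimal sketch, not the framework's own builder, of how such an architecture list could be turned into an nn.Sequential; the `build_architecture` helper and its handling of `args` versus keyword entries are assumptions.

```python
# Sketch: building an nn.Sequential from an architecture list like the YAML
# above. `build_architecture` is an illustrative helper, not part of the
# SplitVAEs codebase; it assumes each entry is {LayerName: options}, where
# `args` holds positional arguments and every other key is a kwarg.
import yaml
import torch.nn as nn

def build_architecture(arch_spec):
    layers = []
    for entry in arch_spec:
        (name, options), = entry.items()     # e.g. "Linear", {"args": [8, 32]}
        options = dict(options or {})
        args = options.pop("args", [])
        layers.append(getattr(nn, name)(*args, **options))
    return nn.Sequential(*layers)

spec = yaml.safe_load("""
architecture:
  - Linear:
      args: [8, 32]
  - ReLU:
      inplace: false
  - Linear:
      args: [32, 128]
""")

model = build_architecture(spec["architecture"])
print(model)
```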
## How to run
The entire codebase is driven by *main.py*, which in turn calls *TrainHandler.py*, subclassed from **BaseTrainHandler**. To run the code:
```commandline
export EXP_CONFIG_FILE=<path/to/config/directory>/config_files/trainer.yaml DATA_PATH=<path/to/your/data>
mpiexec -np no_of_processes python main.py
```
Here *no_of_processes* **must** be set to **no_of_edge_devices + 1**, where no_of_edge_devices is the total number of edge devices to simulate.
*<path/to/config/directory>* indicates the path to the directory that contains the *config_files* folder.
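
For example, to simulate five edge devices (five edge ranks plus one server rank), the launch would look like the following; the bracketed paths are placeholders to replace with your own:
```commandline
export EXP_CONFIG_FILE=<path/to/config/directory>/config_files/trainer.yaml DATA_PATH=<path/to/your/data>
mpiexec -np 6 python main.py
```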

After you run *main.py*, the testing logic is called automatically. It plots the test results and the training error (reconstruction + KL loss).

Files

splitVAEs-dev.zip (268.0 kB)
md5:a0fde8ee33a20761bd013089be4f6e08

Additional details

Software

Programming language: Python
Development Status: Active