Published November 6, 2019 | Version 0.5.3

williamFalcon/pytorch-lightning: Generalization!

  • 1. Facebook AI Research
  • 2. CTU in Prague
  • 3. Target
  • 4. Peking University, @24OI
  • 5. Indian Institute of Technology Mandi
  • 6. Pontificia Universidad Católica
  • 7. Commonwealth School
  • 8. Biometrica
  • 9. Angle Technologies
  • 10. @BorgwardtLab
  • 11. University of Trento
  • 12. University of Texas at Austin
  • 13. Preferred Networks
  • 14. NYU
  • 15. No

Description

Generalization release

The main focus of this release was on adding flexibility and generalization to support broad research cases.

The next release will be on Dec 7th (we release every 30 days).

Internal Facebook support

@lorenzoFabbri @tullie @myleott @ashwinb @shootingsoul @vreis These features were added to support FAIR, FAIAR and broader ML across other FB teams.

In general, we can expose any part of Lightning that isn't exposed yet wherever someone might want to override the default implementation.

  1. Added truncated backpropagation through time (TBPTT) support (thanks @tullie).
Trainer(truncated_bptt_steps=2)
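For context, truncated BPTT chops each long sequence into windows along the time dimension and detaches the recurrent state between windows, so gradients only flow through the most recent chunk; Lightning does this splitting inside the Trainer when truncated_bptt_steps is set. A framework-agnostic sketch of the idea (the shapes, window size, and toy LSTM are illustrative assumptions, not Lightning internals):

```python
import torch
import torch.nn as nn

# toy batch: (batch, time, features) with a long time dimension
batch = torch.randn(4, 20, 8)
rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

hiddens = None
for chunk in torch.split(batch, 2, dim=1):  # windows of truncated_bptt_steps=2
    out, hiddens = rnn(chunk, hiddens)
    loss = out.pow(2).mean()                # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # detach so the next window does not backprop through earlier ones
    hiddens = tuple(h.detach() for h in hiddens)
```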
  2. Added support for iterable datasets.
# return an IterableDataset
def train_dataloader(...):
    ds = IterableDataset(...)
    return DataLoader(ds)

# set validation to a fixed number of batches
# (checks val every 100 train batches)
Trainer(val_check_interval=100)
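For illustration, here is a minimal streaming dataset that could back the train_dataloader above; the RandomStream class and the generated data are made up, while torch.utils.data.IterableDataset and DataLoader are standard PyTorch:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class RandomStream(IterableDataset):
    """Hypothetical stream with no known length (e.g. reading from a queue)."""
    def __iter__(self):
        while True:
            yield torch.randn(32), torch.randint(0, 2, (1,))

# inside your LightningModule
def train_dataloader(self):
    # iterable datasets define their own iteration order, so no sampler is needed
    return DataLoader(RandomStream(), batch_size=16)
```

Because an iterable dataset may have no known length, validation checks are scheduled by batch count (the val_check_interval=100 above) rather than by a fraction of an epoch.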
  3. Added the ability to customize backward and other parts of the training loop:

     def backward(self, use_amp, loss, optimizer):
         """
         Override backward with your own implementation if you need to
         :param use_amp: Whether amp was requested or not
         :param loss: Loss is already scaled by accumulated grads
         :param optimizer: Current optimizer being used
         :return:
         """
         if use_amp:
             with amp.scale_loss(loss, optimizer) as scaled_loss:
                 scaled_loss.backward()
         else:
             loss.backward()
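As a hedged example of a reason to override it: a LightningModule could clip gradients right after the backward pass. The module name and clip value below are made up, and the top-level import path is an assumption for this release:

```python
import torch
from pytorch_lightning import LightningModule  # import path is an assumption for this release

class MyModel(LightningModule):
    def backward(self, use_amp, loss, optimizer):
        if use_amp:
            from apex import amp  # assumes NVIDIA apex is installed
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            # with amp you would clip amp.master_params(optimizer) instead
        else:
            loss.backward()
            # custom step: clip gradients immediately after backward
            torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)
```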
    
  4. DDP custom implementation support (override these hooks):

     def configure_ddp(self, model, device_ids):
         """
         Override to init DDP in a different way or use your own wrapper.
         Must return model.
         :param model:
         :param device_ids:
         :return: DDP wrapped model
         """
         model = LightningDistributedDataParallel(
             model,
             device_ids=device_ids,
             find_unused_parameters=True
         )
         return model
    
     def init_ddp_connection(self, proc_rank, world_size):
         """
         Connect all procs in the world using the env:// init
         Use the first node as the root address
         """
    
         # use slurm job id for the port number
         # guarantees unique ports across jobs from same grid search
         try:
             # use the last 4 numbers in the job id as the id
             default_port = os.environ['SLURM_JOB_ID']
             default_port = default_port[-4:]
    
             # all ports should be in the 10k+ range
             default_port = int(default_port) + 15000
    
         except Exception as e:
             default_port = 12910
    
         # if user gave a port number, use that one instead
         try:
             default_port = os.environ['MASTER_PORT']
         except Exception:
             os.environ['MASTER_PORT'] = str(default_port)
    
         # figure out the root node addr
         try:
             root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
         except Exception:
             root_node = '127.0.0.2'
    
         root_node = self.trainer.resolve_root_node_address(root_node)
         os.environ['MASTER_ADDR'] = root_node
         dist.init_process_group('nccl', rank=proc_rank, world_size=world_size)
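One concrete reason to override configure_ddp is to drop find_unused_parameters=True when every parameter participates in each forward pass, since that flag adds per-step graph traversal overhead. A minimal sketch, assuming LightningModule and LightningDistributedDataParallel are imported the same way the default hook's module imports them (the exact paths are version-dependent and not shown above):

```python
class MyModel(LightningModule):
    def configure_ddp(self, model, device_ids):
        # identical to the default hook except find_unused_parameters is turned off
        model = LightningDistributedDataParallel(
            model,
            device_ids=device_ids,
            find_unused_parameters=False,
        )
        return model
```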
    
  5. Support for your own apex init or implementation.

     def configure_apex(self, amp, model, optimizers, amp_level):
         """
         Override to init AMP your own way
         Must return a model and list of optimizers
         :param amp:
         :param model:
         :param optimizers:
         :param amp_level:
         :return: Apex wrapped model and optimizers
         """
         model, optimizers = amp.initialize(
             model, optimizers, opt_level=amp_level,
         )
    
         return model, optimizers
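Similarly, overriding configure_apex lets you pass extra options to amp.initialize; loss_scale is a real amp.initialize argument, but the value here is only illustrative:

```python
class MyModel(LightningModule):  # assumes the same imports as your existing module
    def configure_apex(self, amp, model, optimizers, amp_level):
        # same contract as the default hook: return the wrapped model and optimizers
        model, optimizers = amp.initialize(
            model,
            optimizers,
            opt_level=amp_level,
            loss_scale=128.0,  # illustrative: fixed loss scale instead of the default
        )
        return model, optimizers
```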
    
  6. DDP2 implementation (inspired by parlai and @stephenroller).
     DDP2 acts as DP within a node and DDP across nodes. As a result, an optional method training_end is
     introduced, where you can use the outputs of training_step (performed on each GPU with a portion of
     the batch) to do something with the outputs of all batches on the node (e.g. negative sampling).

```python
Trainer(distributed_backend='ddp2')

def training_step(...):
    # x is 1/nb_gpus of the full batch
    out = model(x)
    return {'out': out}

def training_end(self, outputs):
    # all_outs has outs from ALL gpus
    all_outs = outputs['out']
    loss = softmax(all_outs)
    return {'loss': loss}
```

Logging
  • More logger diversity, including Comet.ml (see the sketch after this list).
  • Versioned logs for all loggers.
  • Switched from print statements to Python's logging module.
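For instance, a logger is constructed once and handed to the Trainer. The CometLogger constructor arguments and the import path below are assumptions for this release, so check the installed version:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.logging import CometLogger  # import path is an assumption for 0.5.x

comet_logger = CometLogger(
    api_key="...",              # placeholder credentials
    project_name="my-project",  # hypothetical project name
)
trainer = Trainer(logger=comet_logger)
```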
Progress bar
  • The progress bar now shows one bar for the full train + val epoch and a second bar that is only visible during validation.
Loading
  • Checkpoints now store hparams.
  • No need to pass tags.csv to restore state; it lives in the checkpoint.
Slurm resubmit with apex + DDP
  • Fixes an issue where restoring weights under DDP blew out GPU memory (weights are now loaded on CPU first, then moved to GPU).
  • Apex state is saved automatically and restored from the checkpoint.
Refactoring
  • Internal code was made modular through mixins for readability and to minimize merge conflicts.
Docs
  • Tons of doc improvements.
Thanks!

Thank you to the amazing contributor community! Especially @neggert and @Borda for reviewing PRs and taking care of a good number of GitHub issues. The community is thriving and has really embraced making Lightning better.

Great job everyone!

Files

williamFalcon/pytorch-lightning-0.5.3.zip (730.2 kB, md5:80171dc38e70edacc8535a96095ae5ab)
