Planned intervention: On Wednesday April 3rd 05:30 UTC Zenodo will be unavailable for up to 2-10 minutes to perform a storage cluster upgrade.
Published October 24, 2018 | Version v1
Report Open

MPI Learn: distributed training

  • 1. CERN openlab summer student

Description

MPI Learn is a framework for distributed training of Neural Networks. Machine Learning models can take a very long time to train. This can be improved using parallelism, by distributing the training over several processes and several hardware resources. Implementing parallelism requires expertise and is time consuming. MPI Learn is aimed at machine learning users, who need to speedup the training of their models. A user should input a model, training and validation data, and tune other training parameters.

MPILearnwillinternallydistributethetrainingoverthespecifiednumberofprocesses, and output results, abstracting all the parallelism from the user. MPI Learn is intended to be part of a bigger project, MPI Opt which aims to perform hyperparameter optimization, in a distributed fashion. This framework will search for the best hyperparameters in a user defined search space. The search will be parallelized, with several executions of MPI Learn being run in parallel. MPI Learn is currently implemented and being used in some practical projects. The work developed over the course of this summer focused on optimizing the framework, and analyzing its execution with the objective of increasing performance.

Files

Report_Filipe_Magalhaes.pdf

Files (250.2 kB)

Name Size Download all
md5:05433f4b60fc45dcba055e0a49befd40
250.2 kB Preview Download