Report Open Access

MPI Learn: distributed training

Magalhaes, Filipe

MPI Learn is a framework for distributed training of neural networks. Machine learning models can take a very long time to train. Training time can be reduced through parallelism, by distributing the training over several processes and hardware resources. Implementing parallelism, however, requires expertise and is time-consuming. MPI Learn is aimed at machine learning users who need to speed up the training of their models. The user supplies a model and the training and validation data, and tunes the other training parameters.

MPI Learn will internally distribute the training over the specified number of processes and output the results, abstracting all the parallelism away from the user. MPI Learn is intended to be part of a bigger project, MPI Opt, which aims to perform hyperparameter optimization in a distributed fashion. This framework will search for the best hyperparameters in a user-defined search space. The search will be parallelized, with several executions of MPI Learn running in parallel. MPI Learn is currently implemented and in use in some practical projects. The work carried out over the course of this summer focused on optimizing the framework and analyzing its execution, with the objective of increasing performance.
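As an illustration of what distributing training over several processes can mean, the following is a minimal sketch of synchronous data-parallel gradient descent. It is not MPI Learn's code or API, and the abstract does not state which algorithm MPI Learn uses. The workers are simulated inside a single Python process, the model is a toy linear fit, and all names (`make_data`, `shard`, `local_gradient`, `allreduce_mean`, `train`) are hypothetical. In a real MPI program the averaging step would typically be a collective operation such as an allreduce across ranks.

```python
import random


def make_data(n, seed=0):
    """Toy dataset (assumption for illustration): y = 3x + 1, no noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]


def shard(data, n_workers):
    """Give each simulated worker an interleaved slice of the data."""
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w, b, batch):
    """Gradient of the mean squared error of y = w*x + b on one shard."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(batch)
    return gw / n, gb / n


def allreduce_mean(grads):
    """Stand-in for an MPI allreduce: average the workers' gradients."""
    n = len(grads)
    return tuple(sum(g[i] for g in grads) / n for i in range(len(grads[0])))


def train(data, n_workers=4, lr=0.1, epochs=300):
    """Synchronous data-parallel gradient descent, simulated sequentially."""
    shards = shard(data, n_workers)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Each worker computes a gradient on its own shard ...
        grads = [local_gradient(w, b, s) for s in shards]
        # ... then all workers agree on the averaged gradient and apply it.
        gw, gb = allreduce_mean(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b


if __name__ == "__main__":
    w, b = train(make_data(64), n_workers=4)
    print(f"learned w={w:.4f}, b={b:.4f}")
```

When the shards have equal size, the averaged gradient equals the gradient over the full dataset, so the number of workers changes how the work is divided but not the update that is applied.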

Files (250.2 kB)
Name: Report_Filipe_Magalhaes.pdf
Size: 250.2 kB
md5: 05433f4b60fc45dcba055e0a49befd40