MPI Learn: distributed training
Description
MPI Learn is a framework for distributed training of neural networks. Machine learning models can take a very long time to train. Training time can be reduced through parallelism, by distributing the training over several processes and hardware resources. Implementing parallelism requires expertise and is time-consuming. MPI Learn is aimed at machine learning users who need to speed up the training of their models. The user provides a model and training and validation data, and tunes other training parameters.
MPI Learn will internally distribute the training over the specified number of processes and output the results, abstracting all the parallelism from the user. MPI Learn is intended to be part of a bigger project, MPI Opt, which aims to perform hyperparameter optimization in a distributed fashion. This framework will search for the best hyperparameters in a user-defined search space. The search will be parallelized, with several executions of MPI Learn running in parallel. MPI Learn is currently implemented and being used in some practical projects. The work carried out over the course of this summer focused on optimizing the framework and analyzing its execution, with the objective of increasing performance.
Files
Report_Filipe_Magalhaes.pdf (250.2 kB, md5:05433f4b60fc45dcba055e0a49befd40)