MPI Learn: distributed training

Magalhaes, Filipe

doi:10.5281/zenodo.1470488

Published October 24, 2018 | Version v1

Report Open

Magalhaes, Filipe¹

1. CERN openlab summer student

MPI Learn is a framework for distributed training of neural networks. Machine learning models can take a very long time to train. Training time can be reduced through parallelism, by distributing the work over several processes and hardware resources, but implementing parallelism requires expertise and is time-consuming. MPI Learn is aimed at machine learning users who need to speed up the training of their models. The user supplies a model and the training and validation data, and tunes the remaining training parameters.

MPI Learn then distributes the training internally over the specified number of processes and outputs the results, abstracting all the parallelism from the user. MPI Learn is intended to be part of a bigger project, MPI Opt, which aims to perform hyperparameter optimization in a distributed fashion. MPI Opt will search for the best hyperparameters in a user-defined search space. The search will be parallelized, with several executions of MPI Learn running in parallel. MPI Learn is currently implemented and in use in some practical projects. The work carried out over the course of this summer focused on optimizing the framework and analyzing its execution, with the objective of increasing performance.
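The abstract does not state which distribution scheme MPI Learn uses, so the following is only a minimal sketch of one common approach, synchronous data parallelism with gradient averaging. It is simulated in a single Python process rather than with MPI, and every name in it (`shard`, `local_gradient`, `train`) is hypothetical and not part of MPI Learn. Each simulated worker computes a gradient on its own shard of the data, and the averaged gradient updates a shared weight.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) training pair


def shard(data: List[Sample], n_workers: int) -> List[List[Sample]]:
    """Split the data round-robin so each simulated worker gets one shard.

    Assumes n_workers <= len(data), so that no shard is empty.
    """
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w: float, samples: List[Sample]) -> float:
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def train(data: List[Sample], n_workers: int,
          lr: float = 0.01, steps: int = 200) -> float:
    """Synchronous data-parallel gradient descent, simulated sequentially."""
    shards = shard(data, n_workers)
    w = 0.0
    for _ in range(steps):
        # Each worker would compute this on its own process in a real run.
        grads = [local_gradient(w, s) for s in shards]
        # Averaging stands in for a collective reduction across processes.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data drawn from y = 3x; the trained weight should approach 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = train(data, n_workers=4)
w_serial = train(data, n_workers=1)
```

With equally sized shards, the averaged gradient matches the full-batch gradient, so the four-worker and single-worker runs reach the same weight. In an actual MPI program each shard would live in a separate process and the averaging step would be a collective operation.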

Files

Report_Filipe_Magalhaes.pdf (250.2 kB, md5:05433f4b60fc45dcba055e0a49befd40)

