Usage and examples

The command line interface for neuropredict is strongly recommended, given its focus on batch processing of multiple comparisons. If the installation was successful, the available options can be obtained by typing one of the following commands:

neuropredict
neuropredict -h

Those options are also shown below (they may occasionally be missing due to problems with the automatic documentation generator). Check the bottom of this page for examples.

Easy, standardized and comprehensive predictive analysis.

usage: neuropredict [-h] [-m META_FILE] [-o OUT_DIR] [-f FS_SUBJECT_DIR]
                    [-y PYRADIGM_PATHS [PYRADIGM_PATHS ...]]
                    [-u USER_FEATURE_PATHS [USER_FEATURE_PATHS ...]]
                    [-d DATA_MATRIX_PATHS [DATA_MATRIX_PATHS ...]]
                    [-a ARFF_PATHS [ARFF_PATHS ...]] [-p POSITIVE_CLASS]
                    [-t TRAIN_PERC] [-n NUM_REP_CV]
                    [-k NUM_FEATURES_TO_SELECT]
                    [-s [SUB_GROUPS [SUB_GROUPS ...]]]
                    [-g {none,light,exhaustive}]
                    [-e {randomforestclassifier,extratreesclassifier}]
                    [-z MAKE_VIS] [-c NUM_PROCS] [-v]

Named Arguments

-m, --meta_file
 

Abs path to file containing metadata for subjects to be included for analysis.

At the minimum, each subject should have an id per row followed by the class it belongs to.

E.g.:

sub001,control
sub002,control
sub003,disease
sub004,disease
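
A minimal sketch of parsing such a metadata file; the helper name `read_meta` is our own, for illustration, and is not part of neuropredict's API:

```python
# Sketch: parse a two-column metadata file (id,class) like the example above.
# read_meta is an illustrative helper, not part of neuropredict itself.
import csv

def read_meta(path):
    """Return parallel lists of subject IDs and class labels."""
    ids, classes = [], []
    with open(path) as fh:
        for row in csv.reader(fh):
            if not row:
                continue  # skip blank lines
            ids.append(row[0].strip())
            classes.append(row[1].strip())
    return ids, classes
```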
-o, --out_dir Output folder to store gathered features & results.
-f, --fs_subject_dir
 

Absolute path to SUBJECTS_DIR containing the finished runs of Freesurfer parcellation. Each subject will be queried using its ID in the metadata file.

E.g. --fs_subject_dir /project/freesurfer_v5.3

Input data and formats

Only one of the following types can be specified.

-y, --pyradigm_paths
 

Path(s) to pyradigm datasets.

Each path points to a self-contained dataset identifying each sample, its class and its features.

-u, --user_feature_paths
 

List of absolute paths to user’s own features.

Format: Each of these folders contains a separate folder for each subject (named after its ID in the metadata file) containing a file called features.txt with one number per line. All the subjects (in a given folder) must have the same number of features (#lines in file). Different parent folders (each describing one feature set) can have different numbers of features per subject, but they must all contain the same number of subjects (folders) within them.

The name of each folder is used to annotate the results in visualizations. Hence name them uniquely and meaningfully, keeping in mind these figures will be included in your papers. For example,

--user_feature_paths /project/fmri/ /project/dti/ /project/t1_volumes/
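
The layout above can be sketched as a loader. This is an illustrative helper (not neuropredict's own code) that also enforces the equal-feature-count rule described above:

```python
# Sketch: read features.txt for each subject folder under one feature
# directory, mirroring the --user_feature_paths layout described above.
# load_user_features is our own illustrative name.
import os

def load_user_features(feature_dir, subject_ids):
    """Return {subject_id: [feature values]}; all subjects must match in length."""
    features = {}
    for sid in subject_ids:
        fpath = os.path.join(feature_dir, sid, 'features.txt')
        with open(fpath) as fh:
            features[sid] = [float(line) for line in fh if line.strip()]
    lengths = {len(vals) for vals in features.values()}
    if len(lengths) > 1:
        raise ValueError('subjects have differing feature counts: {}'.format(lengths))
    return features
```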

Only one of the --pyradigm_paths, --user_feature_paths, --data_matrix_paths or --arff_paths options can be specified.

-d, --data_matrix_paths
 

List of absolute paths to text files containing one matrix of size N x p (num_samples x num_features).

Each row in the data matrix file must represent data corresponding to sample in the same row of the meta data file (meta data file and data matrix must be in row-wise correspondence).

Name of this file will be used to annotate the results and visualizations.

E.g. --data_matrix_paths /project/fmri.csv /project/dti.csv /project/t1_volumes.csv

Only one of the --pyradigm_paths, --user_feature_paths, --data_matrix_paths or --arff_paths options can be specified.

File format could be
  • a simple comma-separated text file (with extension .csv or .txt), which can easily be read back with
    numpy.loadtxt(filepath, delimiter=',') or
  • a numpy array saved to disk (with extension .npy or .numpy) that can be read in with numpy.load(filepath).

One could use numpy.savetxt(filepath, data_array, delimiter=',') or numpy.save(filepath, data_array) to save features.

File format is inferred from its extension.
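
A short sketch of both accepted formats, using the numpy call signatures (the file path comes first in savetxt/save); file names here are illustrative:

```python
# Sketch: write and read a data matrix in the two accepted formats.
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'features.csv')
npy_path = os.path.join(tmpdir, 'features.npy')

data_array = np.arange(12, dtype=float).reshape(3, 4)  # N=3 samples, p=4 features

# Text format (.csv/.txt): note the file path is the first argument.
np.savetxt(csv_path, data_array, delimiter=',')
from_text = np.loadtxt(csv_path, delimiter=',')

# Binary format (.npy): more compact for high-dimensional data.
np.save(npy_path, data_array)
from_binary = np.load(npy_path)
```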

-a, --arff_paths
 

List of paths to files saved in Weka’s ARFF dataset format.

Note:
  • this format does NOT allow IDs for each subject.
  • because feature values are stored as text, this can lead to large files with high-dimensional data,
    compared to numpy arrays saved to disk in binary format.

More info: https://www.cs.waikato.ac.nz/ml/weka/arff.html

Cross-validation

Parameters related to training and optimization during cross-validation

-p, --positive_class
 

Name of the positive class (e.g. Alzheimers, MCI etc) to be used in calculation of area under the ROC curve. Applicable only for binary classification experiments.

Default: the class appearing last in the order specified in the metadata file.

-t, --train_perc
 

Percentage of the smallest class to be reserved for training.

Must be in the interval [0.01, 0.99].

If the sample size is sufficiently large, we recommend 0.5. If the sample size is small, or class imbalance is high, choose 0.8.
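
An illustrative calculation (the class sizes are made up) of how train_perc translates into per-class training counts, assuming classes are stratified to match the smallest class; the exact rounding is our assumption:

```python
# Sketch: train_perc applies to the smallest class, and every class then
# contributes that many training samples, keeping the training set balanced.
class_sizes = {'control': 20, 'disease': 12}   # illustrative numbers
train_perc = 0.5

smallest_class_size = min(class_sizes.values())
n_train_per_class = int(round(train_perc * smallest_class_size))
total_train = n_train_per_class * len(class_sizes)
```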

-n, --num_rep_cv
 

Number of repetitions of the repeated-holdout cross-validation.

The larger the number, the more stable the estimates will be.

-k, --num_features_to_select
 
Number of features to select as part of feature selection.

Options:

  • ‘tenth’
  • ‘sqrt’
  • ‘log2’
  • ‘all’

Default: ‘tenth’ of the number of samples in the training set.

For example, if your dataset has 90 samples and you chose 50 percent for training (the default), then you will have 90*0.5=45 samples in the training set, leading to 5 features being selected for training. If you choose a fixed integer, ensure all the feature sets under evaluation have at least that many features.
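
The options above can be sketched as a mapping from training-set size to feature count; the helper name and the exact rounding are our assumptions:

```python
# Sketch: what each num_features_to_select option evaluates to, given the
# training-set size. Mirrors the worked example above (45 training samples,
# 'tenth' -> 5 features). Rounding via ceil is an assumption.
import math

def num_features_to_select(train_size, choice='tenth'):
    if choice == 'tenth':
        return int(math.ceil(train_size / 10.0))
    if choice == 'sqrt':
        return int(math.ceil(math.sqrt(train_size)))
    if choice == 'log2':
        return int(math.ceil(math.log2(train_size)))
    if choice == 'all':
        return train_size
    return int(choice)  # a fixed integer
```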

-s, --sub_groups
 

This option allows the user to study different combinations of classes in a multi-class (N>2) dataset.

For example, in a dataset with 3 classes CN, FTD and AD, two pair-wise combinations can be studied separately with the following flag: --sub_groups CN,FTD CN,AD. This allows the user to focus on a few interesting subgroups depending on their dataset/goal.

Format: Different subgroups must be separated by space, and each sub-group must be a comma-separated list of class names defined in the meta data file. Hence it is strongly recommended to use class names without any spaces, commas, hyphens and special characters, and ideally just alphanumeric characters separated by underscores.

Any number of subgroups can be specified, but each subgroup must have at least two distinct classes.

Default: 'all', leading to inclusion of all available classes in an all-vs-all multi-class setting.
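
Parsing such a specification can be sketched as follows; this is an illustrative helper, not neuropredict's own parser:

```python
# Sketch: parse a --sub_groups specification such as "CN,FTD CN,AD".
# Subgroups are space-separated; classes within a subgroup are
# comma-separated. Validation mirrors the rules stated above.
def parse_sub_groups(spec):
    subgroups = []
    for group in spec.split():
        classes = [name.strip() for name in group.split(',')]
        if len(set(classes)) < 2:
            raise ValueError('each subgroup needs at least two '
                             'distinct classes: {!r}'.format(group))
        subgroups.append(classes)
    return subgroups
```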

-g, --gs_level

Possible choices: none, light, exhaustive

Flag to specify the level of grid search during hyper-parameter optimization on the training set. Allowed options are ‘none’, ‘light’ and ‘exhaustive’, in increasing order of how many parameters and values will be optimized.

More parameters and more values demand more resources and much longer time for optimization.

The ‘light’ option uses “folk wisdom” to try the fewest values (no more than one or two) for the parameters of the given classifier (e.g. a large number, say 500 trees, for a random forest). ‘light’ will be the fastest and should give a rough idea of predictive performance. The ‘exhaustive’ option will try the most values for the most parameters that can be optimized.

Predictive Model

Parameters related to pipeline comprising the predictive model

-e, --classifier
 

Possible choices: randomforestclassifier, extratreesclassifier

String specifying one of the implemented classifiers. (Classifiers are carefully chosen to allow for the comprehensive report provided by neuropredict).

Default: ‘RandomForestClassifier’. More options will be implemented in due course.

Visualization

Parameters related to generating visualizations

-z, --make_vis Option to make visualizations from existing results in the given path. This is helpful when neuropredict fails to generate result figures automatically, e.g. on an HPC cluster or another environment where DISPLAY is not available.

Computing

Parameters related to computations/debugging

-c, --num_procs
 

Number of CPUs to use to parallelize CV repetitions.

Default : 4.

The number of CPUs will be capped at the number available on the machine if a higher number is requested.
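
The capping behaviour can be sketched as follows; this is an illustrative helper, not neuropredict's internal logic:

```python
# Sketch: clamp the requested CPU count to [1, number available],
# falling back to the documented default of 4 when nothing is requested.
import multiprocessing

def effective_num_procs(requested, default=4):
    available = multiprocessing.cpu_count()
    return min(max(1, requested if requested else default), available)
```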

-v, --version show program’s version number and exit


A rough example of usage can be:

neuropredict -m meta_data.csv -f /work/project/features_dir

Example for meta-data

For example, if you have a dataset with the following three classes: 5 controls, 6 disease_one and 9 other_disease, all you would need to do is produce a meta data file as shown below (specifying a class label for each subject):
3071,controls
3069,controls
3064,controls
3063,controls
3057,controls
5004,disease_one
5074,disease_one
5077,disease_one
5001,disease_one
5002,disease_one
5003,disease_one
5000,other_disease
5006,other_disease
5013,other_disease
5014,other_disease
5016,other_disease
5018,other_disease
5019,other_disease
5021,other_disease
5022,other_disease

and neuropredict will produce the figures (and numbers in CSV files) as shown here:

_images/composite_flyer.001.png

The higher resolution PDFs are included in the docs folder.

The typical output on the command line would look something like:

neuropredict -y *.MLDataset.pkl -m meta_FourClasses.csv -o ./predictions -t 0.75 -n 250

Requested features for analysis:
get_pyradigm from chebyshev.MLDataset.pkl
get_pyradigm from chebyshev_neg.MLDataset.pkl
get_pyradigm from chi_square.MLDataset.pkl
get_pyradigm from correlate_1.MLDataset.pkl
get_pyradigm from correlate.MLDataset.pkl
get_pyradigm from cosine_1.MLDataset.pkl
get_pyradigm from cosine_2.MLDataset.pkl
get_pyradigm from cosine_alt.MLDataset.pkl
get_pyradigm from cosine.MLDataset.pkl
get_pyradigm from euclidean.MLDataset.pkl
get_pyradigm from fidelity_based.MLDataset.pkl
Different classes in the training set are stratified to match the smallest class!

 CV repetition   0
     feature   0      weight_chebyshev : balanced accuracy: 0.3018
     feature   1  weight_chebyshev_neg : balanced accuracy: 0.2917
     feature   2     weight_chi_square : balanced accuracy: 0.2603
     feature   3    weight_correlate_1 : balanced accuracy: 0.3271
     feature   4      weight_correlate : balanced accuracy: 0.3647
     feature   5       weight_cosine_1 : balanced accuracy: 0.3202
     feature   6       weight_cosine_2 : balanced accuracy: 0.2869
     feature   7     weight_cosine_alt : balanced accuracy: 0.3656
     feature   8         weight_cosine : balanced accuracy: 0.3197
     feature   9      weight_euclidean : balanced accuracy: 0.2579
     feature  10 weight_fidelity_based : balanced accuracy: 0.1190

 CV repetition   1
     feature   0      weight_chebyshev : balanced accuracy: 0.3416
     feature   1  weight_chebyshev_neg : balanced accuracy: 0.3761
     feature   2     weight_chi_square : balanced accuracy: 0.3748
     feature   3    weight_correlate_1 : balanced accuracy: 0.3397
     feature   4      weight_correlate : balanced accuracy: 0.4087
     feature   5       weight_cosine_1 : balanced accuracy: 0.3074
     feature   6       weight_cosine_2 : balanced accuracy: 0.4059
     feature   7     weight_cosine_alt : balanced accuracy: 0.3658
     feature   8         weight_cosine : balanced accuracy: 0.3290
     feature   9      weight_euclidean : balanced accuracy: 0.2662
     feature  10 weight_fidelity_based : balanced accuracy: 0.2090

 CV repetition   2
 . . . .
 . . . .
 . . . .
 CV repetition   n

pyradigm is a Python class to ease your ML workflow; check it out at pyradigm.readthedocs.io

I hope this user-friendly tool will help you get started on the predictive analysis you’ve been wanting to do for a while.