API Reference

neuropredict is designed primarily to be used from the command line, with the goal of not requiring the user to write any code. Hence, it is recommended to follow the Usage and examples, although you are welcome to peek into its internal API and provide us with any feedback you may have.

run_workflow.fit(input_specification, meta_data, output_dir, pipeline=None, train_perc=0.5, num_repetitions=200, positive_class=None, feat_sel_size='tenth')[source]

Generate a comprehensive report on the predictive performance of different feature sets and statistically compare them.

Main entry point for API access.

Parameters:

input_specification : multiple

Either
  • path to a file containing a list of paths (each line containing the path to a valid MLDataset)
  • list of paths to MLDatasets saved on disk
  • list of MLDatasets (not recommended when feature sets and datasets are big in size and number)
  • list of tuples (to specify multiple feature sets), each element containing (X, y) i.e. data and target labels
  • a single tuple containing (X, y) i.e. data and target labels
  • list of paths to CSV files, each containing one type of features.

When specifying multiple sets of input features, ensure:
  • all of them contain the same number of samples
  • each sample belongs to the same class across all feature sets.

meta_data : multiple

  • a path to a meta data file (see Input formats page)
  • a tuple containing (sample_id_list, classes_dict), where the first element is a list of sample IDs and the second element is a dict (keyed by sample ID) with values representing their classes.

pipeline : object

A scikit-learn pipeline describing the sequence of steps (typically a set of feature selection and dimensionality reduction steps followed by a classifier). Default: None, which leads to the selection of a Random Forest classifier with no feature selection. See the sketch after this entry for an example.

method_names : list

A list of names to denote the different feature extraction methods

output_dir : str

Path to output directory to save the cross validation results to.

train_perc : float, optional

Percentage of subjects to train the classifier on. The percentage is applied to the size of the smallest class to estimate the number of subjects from each class to be reserved for training. The smallest class is chosen to avoid class imbalance in the training set. Default: 0.5 (50%).

num_repetitions : int, optional

Number of repetitions of cross-validation estimation. Default: 200.

positive_class : str

Name of the class to be treated as positive in the calculation of AUC.

feat_sel_size : str or int

Number of features to retain after feature selection. Must be a method (one tenth or the square root of the size of the smallest class in the training set), or a finite integer smaller than the data dimensionality.

Returns:

results_path : str

Path to pickle file containing full set of CV results.
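
A minimal sketch of API usage with in-memory features, assuming run_workflow is importable from the neuropredict package; the toy data, sample IDs and output path below are hypothetical, not part of neuropredict:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble import RandomForestClassifier
    from neuropredict import run_workflow

    rng = np.random.RandomState(42)
    num_subjects, num_features = 50, 10

    # one feature set as a single (X, y) tuple: data matrix and target labels
    X = rng.rand(num_subjects, num_features)
    y = np.array(['ctrl'] * 25 + ['dis'] * 25)

    # meta data as (sample_id_list, classes_dict)
    sample_ids = ['sub{:02d}'.format(ix) for ix in range(num_subjects)]
    classes = {sid: lbl for sid, lbl in zip(sample_ids, y)}

    # optional custom pipeline: feature selection followed by a classifier
    pipe = Pipeline([('select', SelectKBest(f_classif, k=5)),
                     ('rf', RandomForestClassifier(n_estimators=100))])

    results_path = run_workflow.fit((X, y), (sample_ids, classes),
                                    '/tmp/np_results',  # hypothetical output dir
                                    pipeline=pipe,
                                    train_perc=0.5,
                                    num_repetitions=20)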

run_workflow.run_cli()[source]

Main entry point for command-line access.

run_workflow.get_parser()[source]

Parser to specify arguments and their defaults.

rhst.run(dataset_path_file, method_names, out_results_dir, train_perc=0.8, num_repetitions=200, positive_class=None, feat_sel_size='tenth')[source]
Parameters:

dataset_path_file : str

Path to a file containing a list of paths (each line containing the path to a valid MLDataset).

method_names : list

A list of names to denote the different feature extraction methods

out_results_dir : str

Path to output directory to save the cross validation results to.

train_perc : float, optional

Percentage of subjects to train the classifier on. The percentage is applied to the size of the smallest class to estimate the number of subjects from each class to be reserved for training. The smallest class is chosen to avoid class imbalance in the training set. Default: 0.8 (80%).

num_repetitions : int, optional

Number of repetitions of cross-validation estimation. Default: 200.

positive_class : str

Name of the class to be treated as positive in the calculation of AUC.

feat_sel_size : str or int

Number of features to retain after feature selection. Must be a method (one tenth or the square root of the size of the smallest class in the training set), or a finite integer smaller than the data dimensionality.

Returns:

results_path : str

Path to pickle file containing full set of CV results.
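
A hedged sketch of preparing the dataset_path_file input: one MLDataset path per line. The file paths and method names below are hypothetical:

    from neuropredict import rhst

    # write the list of MLDataset paths, one per line (paths are made up)
    dataset_paths = ['/data/features/thickness.MLDataset.pkl',
                     '/data/features/volumes.MLDataset.pkl']
    list_file = '/tmp/np_dataset_paths.txt'
    with open(list_file, 'w') as lf:
        lf.write('\n'.join(dataset_paths))

    results_path = rhst.run(list_file,
                            method_names=['thickness', 'volumes'],
                            out_results_dir='/tmp/np_results',
                            train_perc=0.8,
                            num_repetitions=200)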

rhst.load_results(results_file_path)[source]

Loads the results serialized by RHsT.
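
For instance, the results pickle written by rhst.run can be restored for custom analysis; the file path below is hypothetical, and the exact contents of the returned object are defined by save_results:

    from neuropredict import rhst

    # path to the pickle returned by rhst.run (hypothetical)
    results_path = '/tmp/np_results/rhst_results.pkl'
    cv_results = rhst.load_results(results_path)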

rhst.save_results(out_dir, var_list_to_save)[source]

Serializes the results to disk.

visualize.feature_importance_map(feat_imp, method_labels, base_output_path, feature_names=None, show_distr=False, plot_title='feature importance', show_all=False)[source]
Generates a map/barplot of feature importance.
feat_imp must be a list of length num_datasets, each element an ndarray of size [num_repetitions, num_features[idx]], where num_features[idx] refers to the dimensionality of the idx-th dataset.

method_labels must be a list of strings of the same length as feat_imp. feature_names must be a list (of ndarrays of strings) of the same size as feat_imp, each element being another list of labels corresponding to num_features[idx].
Parameters:

feat_imp : list

List of numpy arrays, each of size [num_repetitions, num_features]

method_labels : list

List of names for each method (or feature set).

base_output_path : str

feature_names : list

List of names for each feature.

show_distr : bool

Whether to plot the distribution (over different trials of cross-validation) of feature importance for each feature.

plot_title : str

Title of the importance map figure.

show_all : bool

If True, this will attempt to show the importance values for all the features. Be advised that if you have more than 50 features, the figure will be illegible. The default is to show only a few important features (ranked by their median importance) when there are more than 25 features.
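
A synthetic sketch matching the documented shapes, where feat_imp holds one [num_repetitions, num_features[idx]] array per feature set; all labels and the output path are made up:

    import numpy as np
    from neuropredict import visualize

    rng = np.random.RandomState(0)
    num_repetitions = 30
    # two feature sets with 8 and 5 features respectively
    feat_imp = [rng.rand(num_repetitions, 8),
                rng.rand(num_repetitions, 5)]
    method_labels = ['cortical thickness', 'subcortical volumes']
    feature_names = [np.array(['thk{}'.format(ix) for ix in range(8)]),
                     np.array(['vol{}'.format(ix) for ix in range(5)])]

    visualize.feature_importance_map(feat_imp, method_labels,
                                     '/tmp/feat_imp_map', feature_names)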

visualize.confusion_matrices(cfmat_array, class_labels, method_names, base_output_path, cmap=<matplotlib.colors.LinearSegmentedColormap object>)[source]

Display routine for the confusion matrix. Entries in the confusion matrix can be turned into percentages with display_perc=True.

Use a separate method to iterate over multiple datasets. Confusion matrix dimensions: [num_classes, num_classes, num_repetitions, num_datasets]

Parameters:

cfmat_array

class_labels

method_names

base_output_path

cmap
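
A hedged sketch constructing an input array with the documented dimensions [num_classes, num_classes, num_repetitions, num_datasets]; the counts, class labels and output path are synthetic:

    import numpy as np
    from neuropredict import visualize

    num_classes, num_repetitions, num_datasets = 2, 30, 2
    rng = np.random.RandomState(1)
    # random counts standing in for per-repetition confusion matrices
    cfmat_array = rng.randint(0, 20, size=(num_classes, num_classes,
                                           num_repetitions, num_datasets))

    visualize.confusion_matrices(cfmat_array,
                                 class_labels=['ctrl', 'dis'],
                                 method_names=['thickness', 'volumes'],
                                 base_output_path='/tmp/cfmat')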

visualize.freq_hist_misclassifications(num_times_misclfd, num_times_tested, method_labels, outpath, separate_plots=False)[source]

Summary of most/least frequently misclassified subjects for further analysis

visualize.metric_distribution(metric, labels, output_path, num_classes=2, metric_label='balanced accuracy')[source]

Distribution plots of various metrics, such as balanced accuracy.

metric is expected to be an ndarray of size [num_repetitions, num_datasets]
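
For example, with the documented [num_repetitions, num_datasets] shape (values, labels and the output path below are synthetic):

    import numpy as np
    from neuropredict import visualize

    # 200 CV repetitions for each of 2 feature sets
    metric = np.random.RandomState(2).uniform(0.5, 0.9, size=(200, 2))
    visualize.metric_distribution(metric, ['thickness', 'volumes'],
                                  '/tmp/metric_dist', num_classes=2,
                                  metric_label='balanced accuracy')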

visualize.compare_misclf_pairwise_parallel_coord_plot(cfmat_array, class_labels, method_labels, out_path)[source]

Produces a parallel coordinate plot (unravelling the cobweb plot) comparing the misclassification rates of all feature sets for different pairwise classifications.

Parameters:

cfmat_array

class_labels

method_labels

out_path

visualize.compare_misclf_pairwise(cfmat_array, class_labels, method_labels, out_path)[source]

Produces a cobweb plot comparing the misclassification rates of all feature sets for different pairwise classifications.

Parameters:

cfmat_array

class_labels

method_labels

out_path
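
Both pairwise-comparison plots take the same inputs; a hedged sketch using a synthetic 4-D array as in the confusion-matrix example above (labels and paths are made up):

    import numpy as np
    from neuropredict import visualize

    # [num_classes, num_classes, num_repetitions, num_datasets]
    cfmat_array = np.random.RandomState(1).randint(0, 20, size=(2, 2, 30, 2))

    visualize.compare_misclf_pairwise(cfmat_array, ['ctrl', 'dis'],
                                      ['thickness', 'volumes'], '/tmp/cobweb')
    visualize.compare_misclf_pairwise_parallel_coord_plot(
        cfmat_array, ['ctrl', 'dis'], ['thickness', 'volumes'],
        '/tmp/parallel_coord')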

model_comparison.nemenyi_test()[source]

Nemenyi post-hoc analysis.

model_comparison.vertical_nemenyi_plot(data, num_reps, alpha=0.05, cmap=<matplotlib.colors.LinearSegmentedColormap object>)[source]

Vertical Nemenyi plot to compare model ranks and show differences.

freesurfer.aseg_stats_subcortical(fspath, subjid)[source]

Returns all the subcortical volumes found in stats/aseg.stats.

Equivalent of load_fs_segstats.m

freesurfer.aseg_stats_whole_brain(fspath, subjid)[source]

Returns a feature set of whole-brain volumes found in the FreeSurfer output: subjid/stats/aseg.stats
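
A hedged sketch of both helpers, assuming fspath is the FreeSurfer processing directory containing one folder per subject; the directory and subject ID below are hypothetical:

    from neuropredict import freesurfer

    fs_dir = '/data/freesurfer_processed'   # hypothetical SUBJECTS_DIR
    subcortical = freesurfer.aseg_stats_subcortical(fs_dir, 'sub01')
    whole_brain = freesurfer.aseg_stats_whole_brain(fs_dir, 'sub01')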