API Reference
neuropredict is designed to be used primarily from the command line, with the goal of not requiring the user to write any code. Hence, we recommend following Usage and examples, although you are welcome to peek into its internal API and provide us with any feedback you may have.
run_workflow.fit(input_specification, meta_data, output_dir, pipeline=None, train_perc=0.5, num_repetitions=200, positive_class=None, feat_sel_size='tenth')

Generate a comprehensive report on the predictive performance of different feature sets, and statistically compare them. This is the main entry point for API access.
Parameters: input_specification : multiple
Either one of:
- a path to a file containing a list of paths (each line containing the path to a valid MLDataset)
- a list of paths to MLDatasets saved on disk
- a list of MLDatasets (not recommended when the feature sets and datasets are big in size and number)
- a list of tuples (to specify multiple feature sets), each element containing (X, y) i.e. data and target labels
- a single tuple containing (X, y) i.e. data and target labels
- a list of paths to CSV files, each containing one type of features

When specifying multiple sets of input features, ensure that:
- all of them contain the same number of samples
- each sample belongs to the same class across all feature sets
meta_data : multiple
- a path to a meta-data file (see the Input formats page)
- a tuple containing (sample_id_list, classes_dict), where the first element is a list of sample IDs, and the second element is a dict (keyed by sample ID) whose values denote the class of each sample
pipeline : object
A scikit-learn pipeline describing the sequence of steps (typically a set of feature selection and dimensionality reduction steps, followed by a classifier). Default: None, which leads to the selection of a Random Forest classifier with no feature selection.
method_names : list
A list of names to denote the different feature extraction methods
output_dir : str
Path to the output directory where the cross-validation results will be saved.
train_perc : float, optional
Percentage of subjects to train the classifier on. The percentage is applied to the size of the smallest class to estimate the number of subjects from each class to be reserved for training. The smallest class is chosen to avoid class imbalance in the training set. Default: 0.5 (50%).
num_repetitions : int, optional
Number of repetitions of cross-validation estimation. Default: 200.
positive_class : str
Name of the class to be treated as positive in the calculation of AUC.
feat_sel_size : str or int
Number of features to retain after feature selection. Must be a method (such as 'tenth' or the square root of the size of the smallest class in the training set), or a finite integer smaller than the data dimensionality.
Returns: results_path : str
Path to pickle file containing full set of CV results.
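A minimal sketch of how fit could be invoked through the API, assuming neuropredict is installed and run_workflow is importable from it; the synthetic arrays, sample IDs and output path below are purely illustrative:

    import numpy as np
    from neuropredict import run_workflow

    # two synthetic feature sets for the same 40 samples (20 per class)
    rng = np.random.RandomState(42)
    num_samples = 40
    y = np.array(['healthy'] * 20 + ['disease'] * 20)
    feature_set_one = (rng.rand(num_samples, 100), y)  # (X, y): data and target labels
    feature_set_two = (rng.rand(num_samples, 250), y)  # a second feature set

    # meta data as a (sample_id_list, classes_dict) tuple
    sample_ids = ['sub{:02d}'.format(idx) for idx in range(num_samples)]
    classes = {sid: label for sid, label in zip(sample_ids, y)}

    results_path = run_workflow.fit(
            input_specification=[feature_set_one, feature_set_two],
            meta_data=(sample_ids, classes),
            output_dir='/tmp/neuropredict_results',
            train_perc=0.5,
            num_repetitions=50,
            positive_class='disease')

With pipeline=None, this trains the default Random Forest classifier (with no feature selection) and writes the CV results under the given output directory.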
rhst.run(dataset_path_file, method_names, out_results_dir, train_perc=0.8, num_repetitions=200, positive_class=None, feat_sel_size='tenth')

Parameters: dataset_path_file : str
Path to a file containing a list of paths (each line containing the path to a valid MLDataset).
method_names : list
A list of names to denote the different feature extraction methods
out_results_dir : str
Path to output directory to save the cross validation results to.
train_perc : float, optional
Percentage of subjects to train the classifier on. The percentage is applied to the size of the smallest class to estimate the number of subjects from each class to be reserved for training. The smallest class is chosen to avoid class imbalance in the training set. Default: 0.8 (80%).
num_repetitions : int, optional
Number of repetitions of cross-validation estimation. Default: 200.
positive_class : str
Name of the class to be treated as positive in the calculation of AUC.
feat_sel_size : str or int
Number of features to retain after feature selection. Must be a method (such as 'tenth' or the square root of the size of the smallest class in the training set), or a finite integer smaller than the data dimensionality.
Returns: results_path : str
Path to pickle file containing full set of CV results.
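A minimal sketch of calling rhst.run directly, assuming rhst is importable from neuropredict; the dataset paths and the .MLDataset.pkl extension are hypothetical:

    from neuropredict import rhst

    # dataset_path_file: a plain-text file with one path per line,
    # each pointing to a valid MLDataset saved on disk
    with open('dataset_paths.txt', 'w') as path_file:
        path_file.write('/data/features/thickness.MLDataset.pkl\n')
        path_file.write('/data/features/volumes.MLDataset.pkl\n')

    results_path = rhst.run('dataset_paths.txt',
                            method_names=['thickness', 'volumes'],
                            out_results_dir='/tmp/neuropredict_results',
                            train_perc=0.8,
                            num_repetitions=200)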
visualize.feature_importance_map(feat_imp, method_labels, base_output_path, feature_names=None, show_distr=False, plot_title='feature importance', show_all=False)

Generates a map/barplot of feature importance.

- feat_imp must be a list of length num_datasets, each element an ndarray of size [num_repetitions, num_features[idx]], where num_features[idx] refers to the dimensionality of the idx-th dataset.
- method_labels must be a list of strings of the same length as feat_imp.
- feature_names must be a list (of ndarrays of strings) of the same size as feat_imp, each element being another list of labels corresponding to num_features[idx].

Parameters: feat_imp : list
List of numpy arrays, each of length num_features
method_labels : list
List of names for each method (or feature set).
base_output_path : str
Path to use as the basis for saving the output figure(s).
feature_names : list
List of names for each feature.
show_distr : bool
Whether to plot the distribution (over different trials of cross-validation) of feature importance for each feature.
plot_title : str
Title of the importance map figure.
show_all : bool
If True, this will attempt to show the importance values for all of the features. Be advised that if you have more than 50 features, the figure will be illegible. The default is to show only the few most important features (ranked by their median importance) when there are more than 25 features.
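A minimal usage sketch, assuming visualize is importable from neuropredict; the random importance values and labels are purely illustrative:

    import numpy as np
    from neuropredict import visualize

    rng = np.random.RandomState(0)
    num_repetitions = 200
    num_features = [10, 25]  # dimensionality of each of the two feature sets

    # one ndarray of size [num_repetitions, num_features[idx]] per feature set
    feat_imp = [rng.rand(num_repetitions, nf) for nf in num_features]
    feature_names = [np.array(['feat{}'.format(ix) for ix in range(nf)])
                     for nf in num_features]

    visualize.feature_importance_map(feat_imp,
                                     method_labels=['thickness', 'volumes'],
                                     base_output_path='/tmp/feat_imp',
                                     feature_names=feature_names)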
visualize.confusion_matrices(cfmat_array, class_labels, method_names, base_output_path, cmap=<matplotlib.colors.LinearSegmentedColormap object>)

Display routine for the confusion matrix. Entries in the confusion matrix can be turned into percentages with display_perc=True. Use a separate method to iterate over multiple datasets. cfmat_array dimensions: [num_classes, num_classes, num_repetitions, num_datasets]
Parameters: cfmat_array : ndarray
Array of confusion matrices, of size [num_classes, num_classes, num_repetitions, num_datasets].
class_labels : list
Names of the classes (in the order they appear in the confusion matrices).
method_names : list
Names for each method (or feature set).
base_output_path : str
Path to use as the basis for saving the output figure(s).
cmap : matplotlib colormap
Colormap to use for display.
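A minimal usage sketch, assuming visualize is importable from neuropredict; the random counts stand in for real confusion matrices:

    import numpy as np
    from neuropredict import visualize

    num_classes, num_repetitions, num_datasets = 2, 200, 2
    rng = np.random.RandomState(1)

    # cfmat_array: [num_classes, num_classes, num_repetitions, num_datasets]
    cfmat_array = rng.randint(0, 20, size=(num_classes, num_classes,
                                           num_repetitions, num_datasets))

    visualize.confusion_matrices(cfmat_array,
                                 class_labels=['healthy', 'disease'],
                                 method_names=['thickness', 'volumes'],
                                 base_output_path='/tmp/confusion')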
visualize.freq_hist_misclassifications(num_times_misclfd, num_times_tested, method_labels, outpath, separate_plots=False)

Summary of the most/least frequently misclassified subjects, for further analysis.
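A usage sketch under the assumption (not stated above) that the misclassification and test counts are supplied as one array of per-subject counts per feature set:

    import numpy as np
    from neuropredict import visualize

    rng = np.random.RandomState(2)
    num_subjects, num_datasets = 40, 2

    # assumed layout: one array of per-subject counts per feature set
    num_times_tested = [np.full(num_subjects, 100) for _ in range(num_datasets)]
    num_times_misclfd = [rng.randint(0, 100, num_subjects) for _ in range(num_datasets)]

    visualize.freq_hist_misclassifications(num_times_misclfd, num_times_tested,
                                           method_labels=['thickness', 'volumes'],
                                           outpath='/tmp/misclf_freq')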
visualize.metric_distribution(metric, labels, output_path, num_classes=2, metric_label='balanced accuracy')

Distribution plots of various metrics, such as balanced accuracy. metric is expected to be an ndarray of size [num_repetitions, num_datasets].
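A minimal usage sketch, assuming visualize is importable from neuropredict; the simulated accuracies merely illustrate the expected [num_repetitions, num_datasets] shape:

    import numpy as np
    from neuropredict import visualize

    rng = np.random.RandomState(3)
    num_repetitions, num_datasets = 200, 2

    # one column of balanced-accuracy values per feature set
    metric = 0.5 + 0.4 * rng.rand(num_repetitions, num_datasets)

    visualize.metric_distribution(metric,
                                  labels=['thickness', 'volumes'],
                                  output_path='/tmp/metric_distr',
                                  num_classes=2,
                                  metric_label='balanced accuracy')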
visualize.compare_misclf_pairwise_parallel_coord_plot(cfmat_array, class_labels, method_labels, out_path)

Produces a parallel coordinate plot (unravelling the cobweb plot) comparing the misclassification rates of all feature sets for different pairwise classifications.
Parameters: cfmat_array : ndarray
Array of confusion matrices, of size [num_classes, num_classes, num_repetitions, num_datasets].
class_labels : list
Names of the classes.
method_labels : list
Names for each method (or feature set).
out_path : str
Path to save the figure to.
visualize.compare_misclf_pairwise(cfmat_array, class_labels, method_labels, out_path)

Produces a cobweb plot comparing the misclassification rates of all feature sets for different pairwise classifications.
Parameters: cfmat_array : ndarray
Array of confusion matrices, of size [num_classes, num_classes, num_repetitions, num_datasets].
class_labels : list
Names of the classes.
method_labels : list
Names for each method (or feature set).
out_path : str
Path to save the figure to.
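A minimal usage sketch, assuming visualize is importable from neuropredict; with three classes there are three pairwise classifications to compare. compare_misclf_pairwise_parallel_coord_plot takes the same arguments:

    import numpy as np
    from neuropredict import visualize

    num_classes, num_repetitions, num_datasets = 3, 200, 2
    rng = np.random.RandomState(4)

    # random counts standing in for real confusion matrices
    cfmat_array = rng.randint(0, 20, size=(num_classes, num_classes,
                                           num_repetitions, num_datasets))

    visualize.compare_misclf_pairwise(cfmat_array,
                                      class_labels=['A', 'B', 'C'],
                                      method_labels=['thickness', 'volumes'],
                                      out_path='/tmp/misclf_cobweb')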
model_comparison.vertical_nemenyi_plot(data, num_reps, alpha=0.05, cmap=<matplotlib.colors.LinearSegmentedColormap object>)

Vertical Nemenyi plot to compare model ranks and show differences.
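A usage sketch under the assumption (not stated above) that data holds one column of metric values per method, with one row per CV repetition:

    import numpy as np
    from neuropredict import model_comparison

    rng = np.random.RandomState(5)
    num_reps, num_methods = 200, 4

    # assumed layout: rows are CV repetitions, columns are the methods to be ranked
    data = 0.5 + 0.4 * rng.rand(num_reps, num_methods)

    model_comparison.vertical_nemenyi_plot(data, num_reps, alpha=0.05)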