slideflow.stats¶
In addition to containing functions used during model training and evaluation, this module provides
the slideflow.SlideMap class designed to assist with visualizing tiles and slides
in two-dimensional space.
Once a model has been trained, tile-level predictions and intermediate layer activations can be calculated
across an entire dataset with slideflow.model.DatasetFeatures.
The slideflow.SlideMap class can then perform dimensionality reduction on these dataset-wide
activations, plotting tiles and slides in two-dimensional space. Visualizing the distribution and clustering
of tile-level and slide-level layer activations can help reveal underlying structures in the dataset and shared
visual features among classes.
The primary method of use is to first generate a slideflow.model.DatasetFeatures from a trained
model, then create an instance of slideflow.SlideMap using the from_features class
method:
import slideflow as sf

df = sf.model.DatasetFeatures(model='/path/', ...)
slide_map = sf.SlideMap.from_features(df)
Alternatively, if you would like to map slides from a dataset in two-dimensional space using pre-calculated x and y
coordinates, you can use the from_precalculated class method. In addition to X and Y, this method requires supplying
tile-level metadata in the form of a list of dicts. Each dict must contain the name of the origin slide and the tile
index in the slide TFRecord.
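import numpy as np
import slideflow as sf

# `project` is assumed to be an existing slideflow Project
dataset = project.dataset(tile_px=299, tile_um=302)
slides = dataset.slides()
x = np.array(...)
y = np.array(...)
# One metadata dict per point: origin slide name and tile index in the slide TFRecord
meta = [{'slide': ..., 'index': ...} for i in range(len(x))]
slide_map = sf.SlideMap.from_precalculated(slides, x, y, meta)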
SlideMap¶
- class slideflow.SlideMap(slides, cache=None)¶
Two-dimensional slide map for visualization & backend for mosaic maps.
Slides are mapped in 2D either explicitly with pre-specified coordinates, or with dimensionality reduction from post-convolutional layer activations, provided from
slideflow.model.DatasetFeatures.
- __init__(slides, cache=None)¶
Backend for mapping slides into two dimensional space. Can use a DatasetFeatures object to map slides according to UMAP of features, or map according to pre-specified coordinates.
- cluster(n_clusters)¶
Performs clustering on data and adds to metadata labels. Requires a DatasetFeatures backend.
Clusters are saved to self.point_meta[i]['cluster'].
- Parameters
n_clusters (int) – Number of clusters for K means clustering.
- Returns
Array with cluster labels corresponding to tiles in self.point_meta.
- Return type
ndarray
- export_to_csv(filename)¶
Exports calculated UMAP coordinates in csv format.
- Parameters
filename (str) – Path to CSV file in which to save coordinates.
- filter(slides)¶
Filters map to only show tiles from the given slides.
- classmethod from_features(df, exclude_slides=None, prediction_filter=None, recalculate=False, map_slide=None, cache=None, low_memory=False, umap_dim=2)¶
Initializes map from dataset features.
- Parameters
df (slideflow.model.DatasetFeatures) – DatasetFeatures.
exclude_slides (list, optional) – List of slides to exclude.
prediction_filter (list, optional) – Restrict predictions to only these provided categories.
recalculate (bool, optional) – Force recalculation of umap despite presence of cache.
use_centroid (bool, optional) – Calculate/map centroid activations.
map_slide (str, optional) – Either None, ‘centroid’, or ‘average’. If None, will map all tiles from each slide. Defaults to None.
cache (str, optional) – Path to PKL file to cache coordinates. Defaults to None (caching disabled).
- classmethod from_precalculated(slides, x, y, meta, labels=None, cache=None)¶
Initializes map from precalculated coordinates.
- Parameters
meta (list(dict)) – List of dicts containing metadata for each point on the map (representing a single tfrecord).
labels (list(str)) – Labels assigned to each tfrecord, used for coloring TFRecords according to labels.
cache (str, optional) – Path to PKL file to cache coordinates. Defaults to None (caching disabled).
- get_tiles_in_area(x_lower, x_upper, y_lower, y_upper)¶
Returns a dictionary mapping slide names to tile indices for tiles that fall within the specified area of the UMAP.
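As a rough illustration of the bounding-box selection described here, the following dependency-free sketch groups tile indices by slide for points that fall inside a rectangle. The helper name tiles_in_area and the layout of points (dicts with 'x', 'y', 'slide', and 'index' keys) are assumptions made for this example and do not reflect how SlideMap stores its data internally.

```python
from collections import defaultdict


def tiles_in_area(points, x_lower, x_upper, y_lower, y_upper):
    """Group tile indices by slide for points inside a rectangle.

    `points` is a hypothetical list of dicts with keys
    'x', 'y', 'slide', and 'index'.
    """
    selected = defaultdict(list)
    for p in points:
        inside_x = x_lower <= p['x'] <= x_upper
        inside_y = y_lower <= p['y'] <= y_upper
        if inside_x and inside_y:
            selected[p['slide']].append(p['index'])
    return dict(selected)


points = [
    {'x': 0.10, 'y': 0.20, 'slide': 'slide_a', 'index': 0},
    {'x': 0.15, 'y': 0.25, 'slide': 'slide_a', 'index': 3},
    {'x': 0.80, 'y': 0.90, 'slide': 'slide_b', 'index': 1},
    {'x': 0.12, 'y': 0.22, 'slide': 'slide_b', 'index': 7},
]
result = tiles_in_area(points, 0.0, 0.5, 0.0, 0.5)
print(result)
```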
- label_by_logits(index)¶
Displays each point with label equal to the logits (linear from 0-1)
- Parameters
index (int) – Logit index.
- label_by_meta(meta, translation_dict=None)¶
Displays each point labeled by tile metadata (e.g. ‘prediction’)
- label_by_slide(slide_labels=None)¶
Displays each point as the name of the corresponding slide. If slide_labels is provided, will use this dict to label slides.
- Parameters
slide_labels (dict, optional) – Dict mapping slide names to labels.
- label_by_uncertainty(index=0)¶
Labels each point with the tile-level uncertainty, if available.
- Parameters
index (int, optional) – Uncertainty index. Defaults to 0.
- load_cache()¶
Load coordinates from PKL cache.
- neighbors(slide_categories=None, algorithm='kd_tree')¶
Calculates neighbors among tiles in this map, assigning neighbor statistics to the tile metadata fields ‘num_unique_neighbors’ and ‘percent_matching_categories’.
- save(filename, subsample=None, title=None, cmap=None, xlim=(-0.05, 1.05), ylim=(-0.05, 1.05), xlabel=None, ylabel=None, legend=None, dpi=300, **scatter_kwargs)¶
Saves plot of data to a provided filename.
- Parameters
filename (str) – File path to save the image.
subsample (int, optional) – Subsample to only include this many tiles on plot. Defaults to None.
title (str, optional) – Title for plot.
cmap (dict, optional) – Dict mapping labels to colors.
xlim (list, optional) – List of floats indicating limits for the x-axis. Defaults to (-0.05, 1.05).
ylim (list, optional) – List of floats indicating limits for the y-axis. Defaults to (-0.05, 1.05).
xlabel (str, optional) – Label for x axis. Defaults to None.
ylabel (str, optional) – Label for y axis. Defaults to None.
legend (str, optional) – Title for legend. Defaults to None.
dpi (int, optional) – DPI for final image. Defaults to 300.
- save_2d_plot(*args, **kwargs)¶
Deprecated function; please use save.
- save_3d_plot(filename, z=None, feature=None, subsample=None)¶
Saves a plot of a 3D umap, with the 3rd dimension representing values provided by argument “z”.
- Parameters
filename (str) – Filename to save image of plot.
z (list, optional) – Values for z axis. Must supply z or feature. Defaults to None.
feature (int, optional) – Int, feature to plot on 3rd axis. Must supply z or feature. Defaults to None.
subsample (int, optional) – Subsample to only include this many tiles on plot. Defaults to None.
- save_cache()¶
Save cache of coordinates to PKL file.
basic_metrics¶
- slideflow.stats.basic_metrics(y_true, y_pred)¶
Generates metrics, including sensitivity, specificity, and accuracy.
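For readers unfamiliar with these metrics, here is a minimal sketch of how sensitivity, specificity, and accuracy are conventionally derived from binary labels. It assumes y_true and y_pred are sequences of 0/1 values; the helper name binary_metrics and its return format are hypothetical, and the actual return value of basic_metrics is not specified here.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        'accuracy': (tp + tn) / len(pairs),
        # Sensitivity: fraction of actual positives that were detected
        'sensitivity': tp / (tp + fn) if (tp + fn) else float('nan'),
        # Specificity: fraction of actual negatives that were rejected
        'specificity': tn / (tn + fp) if (tn + fp) else float('nan'),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```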
calculate_centroid¶
- slideflow.stats.calculate_centroid(act)¶
Calculates slide-level centroid indices for a provided activations dict.
concordance_index¶
- slideflow.stats.concordance_index(y_true, y_pred)¶
Calculates concordance index from a given y_true and y_pred.
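The idea behind a concordance index can be sketched as follows: among all pairs of observations with different outcomes, count the fraction whose predictions are ordered the same way. This simplified version ignores censoring and assumes that a higher prediction corresponds to a higher outcome; the name simple_concordance is hypothetical, and the conventions used by concordance_index itself (event indicators, direction of risk) are not stated here and may differ.

```python
from itertools import combinations


def simple_concordance(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are correctly ordered.

    Pairs with tied outcomes are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # tied outcomes cannot be ranked
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (t1 < t2) == (p1 < p2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError('No comparable pairs.')
    return concordant / comparable


y_true = [1, 2, 3, 4]
y_pred = [0.1, 0.4, 0.3, 0.9]
c_index = simple_concordance(y_true, y_pred)
print(c_index)
```

A value of 1.0 indicates perfectly ordered predictions, and 0.5 is what random ordering would give on average.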
filtered_prediction¶
- slideflow.stats.filtered_prediction(logits, filter=None)¶
Generates a prediction from a logits vector masked by a given filter.
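One plausible reading of a masked prediction is an argmax restricted to a subset of categories. The sketch below assumes the filter is a list of category indices to keep, which is an assumption about the filter argument rather than documented behavior; masked_argmax is a hypothetical name.

```python
def masked_argmax(logits, allowed=None):
    """Return the index of the largest logit among the allowed indices.

    If `allowed` is None, all indices are considered.
    """
    candidates = range(len(logits)) if allowed is None else list(allowed)
    if not candidates:
        raise ValueError('No candidate categories to choose from.')
    return max(candidates, key=lambda i: logits[i])


logits = [0.1, 0.7, 0.2]
unfiltered = masked_argmax(logits)
filtered = masked_argmax(logits, allowed=[0, 2])
print(unfiltered, filtered)
```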
generate_combined_roc¶
- slideflow.stats.generate_combined_roc(y_true, y_pred, save_dir, labels, name='ROC', neptune_run=None)¶
Generates and saves overlapping ROCs.
generate_roc¶
- slideflow.stats.generate_roc(y_true, y_pred, save_dir=None, name='ROC', neptune_run=None)¶
Generates and saves an ROC with a given set of y_true, y_pred values.
generate_scatter¶
- slideflow.stats.generate_scatter(y_true, y_pred, data_dir, name='_plot', plot=True, neptune_run=None)¶
Generate and save scatter plots and calculate R2 for each outcome.
- Parameters
y_true (np.ndarray) – 2D array of labels. Observations are in first dimension, second dim is the outcome.
y_pred (np.ndarray) – 2D array of predictions.
data_dir (str) – Path to directory in which to save plots.
name (str, optional) – Label for filename. Defaults to ‘_plot’.
plot (bool, optional) – Save scatter plots.
neptune_run (optional) – Neptune Run. If provided, will upload plot.
- Returns
R squared.
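To make the expected array layout concrete, the sketch below computes one R squared value per outcome column, using the coefficient of determination (1 minus the ratio of residual to total sum of squares). Which definition of R squared generate_scatter uses is not stated here, so treat this as one common choice. Nested lists stand in for 2D arrays, and the helper names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination for one outcome."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def r_squared_per_outcome(y_true, y_pred):
    """Apply r_squared column-wise; rows are observations, columns outcomes."""
    true_cols = list(zip(*y_true))
    pred_cols = list(zip(*y_pred))
    return [r_squared(t, p) for t, p in zip(true_cols, pred_cols)]


# Three observations, two outcomes
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_pred = [[1.0, 12.0], [2.0, 20.0], [3.0, 28.0]]
scores = r_squared_per_outcome(y_true, y_pred)
print(scores)
```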
gen_umap¶
- slideflow.stats.gen_umap(array, dim=2, n_neighbors=50, min_dist=0.1, metric='cosine', **kwargs)¶
Generates and returns a umap from a given array, using umap.UMAP
get_centroid_index¶
- slideflow.stats.get_centroid_index(arr)¶
Calculate index nearest to centroid from a given 2D input array.
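A minimal sketch of the idea, assuming Euclidean distance and using nested lists in place of a 2D array: compute the column-wise mean, then return the index of the row closest to it. The name centroid_index is hypothetical, and the distance metric used by get_centroid_index is not stated here.

```python
import math


def centroid_index(rows):
    """Return the index of the row nearest to the column-wise mean."""
    n = len(rows)
    centroid = [sum(col) / n for col in zip(*rows)]
    return min(range(n), key=lambda i: math.dist(rows[i], centroid))


rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
nearest = centroid_index(rows)
print(nearest)
```

Here the centroid is (3.25, 3.25), so the row (2.0, 2.0) at index 2 is the nearest.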
metrics_from_dataset¶
- slideflow.stats.metrics_from_dataset(model, model_type, labels, patients, dataset, outcome_names=None, label=None, data_dir=None, num_tiles=0, histogram=False, save_predictions=True, neptune_run=None, pred_args=None)¶
Evaluate performance of a given model on a given TFRecord dataset, generating a variety of statistical outcomes and graphs.
- Parameters
model (tf.keras.Model) – Keras model to evaluate.
model_type (str) – ‘categorical’, ‘linear’, or ‘cph’.
labels (dict) – Dictionary mapping slidenames to outcomes.
patients (dict) – Dictionary mapping slidenames to patients.
dataset (tf.data.Dataset) – Tensorflow dataset.
outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.
label (str, optional) – Label prefix/suffix for saving. Defaults to None.
data_dir (str, optional) – Path to data directory for saving. Defaults to None.
num_tiles (int, optional) – Number of total tiles expected in dataset. Used for progress bar. Defaults to 0.
histogram (bool, optional) – Write histograms to data_dir. Defaults to False.
save_predictions (bool, optional) – Save tile, slide, and patient-level predictions to CSV. Defaults to True.
neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.
pred_args (namespace, optional) – Additional arguments to tensorflow and torch backends.
- Returns
metrics [dict], accuracy [float], loss [float]
metrics_from_pred¶
- slideflow.stats.metrics_from_pred(y_true, y_pred, tile_to_slides, labels, patients, model_type, y_std=None, outcome_names=None, label=None, data_dir=None, save_predictions=True, histogram=False, plot=True, neptune_run=None)¶
Generates metrics from a set of predictions.
For multiple outcomes, y_true and y_pred are expected to be a list of numpy arrays (each array corresponding to whole-dataset predictions for a single outcome).
- Parameters
y_true (ndarray) – True labels for the dataset.
y_pred (ndarray) – Predicted labels for the dataset.
tile_to_slides (list(str)) – List of length y_true of slide names.
labels (dict) – Dictionary mapping slidenames to outcomes.
patients (dict) – Dictionary mapping slidenames to patients.
model_type (str) – Either ‘linear’, ‘categorical’, or ‘cph’.
outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.
label (str, optional) – Label prefix/suffix for saving. Defaults to None.
min_tiles (int, optional) – Min tiles per slide to include in metrics. Defaults to 0.
data_dir (str, optional) – Path to data directory for saving. Defaults to None.
save_predictions (bool, optional) – Save tile, slide, and patient-level predictions to CSV. Defaults to True.
histogram (bool, optional) – Write histograms to data_dir. Defaults to False.
plot (bool, optional) – Save scatterplot for linear outcomes. Defaults to True.
neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.
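The role of tile_to_slides can be illustrated with a small sketch that rolls tile-level predictions up to slide level by averaging. Averaging is only one common aggregation and is an assumption here, since the aggregation used by metrics_from_pred is not described above; average_by_slide is a hypothetical helper.

```python
from collections import defaultdict


def average_by_slide(y_pred, tile_to_slides):
    """Average tile-level prediction vectors for each slide.

    y_pred[i] is the prediction vector for tile i, and
    tile_to_slides[i] is the name of the slide that tile came from.
    """
    grouped = defaultdict(list)
    for pred, slide in zip(y_pred, tile_to_slides):
        grouped[slide].append(pred)
    return {
        slide: [sum(col) / len(preds) for col in zip(*preds)]
        for slide, preds in grouped.items()
    }


y_pred = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
tile_to_slides = ['s1', 's1', 's2']
slide_pred = average_by_slide(y_pred, tile_to_slides)
print(slide_pred)
```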
normalize_layout¶
- slideflow.stats.normalize_layout(layout, min_percentile=1, max_percentile=99, relative_margin=0.1)¶
Removes outliers and scales layout to between [0,1].
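The following dependency-free sketch shows the general technique for a single axis: clip values to a percentile range widened by a relative margin, then rescale to [0, 1]. It is an illustration under stated assumptions, not the function's implementation; the helper names, the linear-interpolation percentile, and the per-axis treatment are all choices made for this example.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def normalize_axis(values, min_percentile=1, max_percentile=99,
                   relative_margin=0.1):
    """Clip outliers to a widened percentile range and scale to [0, 1]."""
    lo = percentile(values, min_percentile)
    hi = percentile(values, max_percentile)
    margin = relative_margin * (hi - lo)
    lo, hi = lo - margin, hi + margin
    clipped = [min(max(v, lo), hi) for v in values]
    cmin, cmax = min(clipped), max(clipped)
    span = (cmax - cmin) or 1.0  # avoid dividing by zero for constant input
    return [(v - cmin) / span for v in clipped]


# 100 evenly spaced values plus one extreme outlier
values = [float(v) for v in range(100)] + [1000.0]
normalized = normalize_axis(values)
print(min(normalized), max(normalized))
```

Without clipping, the single outlier at 1000 would squeeze the other points into the lowest tenth of the range; with clipping, they span most of [0, 1].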
read_predictions¶
- slideflow.stats.read_predictions(predictions_file, level)¶
Reads predictions from a previously saved CSV file.
permute_importance¶
- slideflow.stats.permute_importance(model, dataset, labels, patients, model_type, data_dir, outcome_names=None, label=None, num_tiles=0, feature_names=None, feature_sizes=None, drop_images=False, neptune_run=None)¶
Calculate metrics (tile, slide, and patient AUC) from a given model that accepts clinical, slide-level feature inputs, and permute to find relative feature performance.
- Parameters
model (str) – Path to Tensorflow model.
dataset (tf.data.Dataset) – TFRecord dataset which includes three items: raw image data, labels, and slide names.
labels (dict) – Dictionary mapping slidenames to outcomes.
patients (dict) – Dictionary mapping slidenames to patients.
model_type (str) – ‘categorical’, ‘linear’, or ‘cph’.
data_dir (str) – Path to output data directory.
outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.
label (str, optional) – Label prefix/suffix. Defaults to None.
num_tiles (int, optional) – Number of total tiles expected in the dataset. Used for progress bar. Defaults to 0.
feature_names (list, optional) – List of str, names for each of the clinical input features.
feature_sizes (list, optional) – List of int, sizes for each of the clinical input features.
drop_images (bool, optional) – Exclude images (predict from clinical features alone). Defaults to False.
neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.
- Returns
Dictionary of AUCs with keys ‘tile’, ‘slide’, and ‘patient’.
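The general permutation-importance technique can be sketched as follows: measure baseline performance, shuffle one input feature at a time, and record how much performance drops. For brevity this toy example uses accuracy rather than AUC and a hand-written rule in place of a trained model; every name in it is hypothetical and it does not call Slideflow.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only feature j replaced by shuffled values
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return drops


def toy_model(row):
    """Stand-in model that uses only the first feature."""
    return int(row[0] > 0.5)


rows = [[0.9, 5.0], [0.1, 3.0], [0.8, 1.0],
        [0.2, 4.0], [0.7, 2.0], [0.3, 6.0]]
labels = [1, 0, 1, 0, 1, 0]
drops = permutation_importance(toy_model, rows, labels)
print(drops)
```

Because toy_model ignores the second feature, shuffling it leaves accuracy unchanged, so its importance is exactly zero.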
predict_from_layer¶
- slideflow.stats.predict_from_layer(model, layer_input, input_layer_name='hidden_0', output_layer_index=None)¶
Generate predictions from a model, providing intermediate layer input.
- Parameters
model (str) – Path to Tensorflow model
layer_input (ndarray) – Dataset to use as input for the given layer, to generate predictions.
input_layer_name (str, optional) – Name of intermediate layer, to which input is provided. Defaults to ‘hidden_0’.
output_layer_index (int, optional) – Excludes layers beyond this index. CPH models include a final concatenation layer (softmax + event tensor) that should be excluded. Defaults to None.
- Returns
Model predictions.
- Return type
ndarray
predict_from_tensorflow¶
- slideflow.stats.predict_from_tensorflow(model, dataset, model_type, pred_args, num_tiles=0, uq_n=30)¶
Generates predictions (y_true, y_pred, tile_to_slide) from a given Tensorflow model and dataset.
- Parameters
model (str) – Path to Tensorflow model.
dataset (tf.data.Dataset) – Tensorflow dataset.
model_type (str, optional) – ‘categorical’, ‘linear’, or ‘cph’. Will not attempt to calculate accuracy for non-categorical models. Defaults to ‘categorical’.
pred_args (namespace) – Namespace containing the property loss, loss function used to calculate loss.
num_tiles (int, optional) – Used for progress bar. Defaults to 0.
uq_n (int, optional) – Number of per-tile inferences to perform if calculating uncertainty via dropout.
evaluate (bool, optional) – Calculate and return accuracy and loss. Dataset must also return y_true.
- Returns
y_true, y_pred, tile_to_slides, accuracy, loss
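To illustrate what uq_n controls, the sketch below runs a stochastic stand-in model several times on the same input and summarizes the samples by their mean and standard deviation, which is the usual pattern for dropout-based uncertainty. The toy model, its weights, and the function names are hypothetical; how predict_from_tensorflow combines its repeated inferences is not described above.

```python
import random
import statistics


def noisy_model(x, rng, drop_rate=0.5):
    """Hypothetical stand-in for a network with dropout active at inference.

    Each weighted input is randomly dropped, and the sum is rescaled
    so that its expected value matches the no-dropout output.
    """
    weights = [0.2, 0.5, 0.3]
    kept = [w * xi for w, xi in zip(weights, x) if rng.random() >= drop_rate]
    return sum(kept) / (1 - drop_rate)


def predict_with_uncertainty(x, uq_n=30, seed=0):
    """Run uq_n stochastic inferences; return (mean, standard deviation)."""
    rng = random.Random(seed)
    samples = [noisy_model(x, rng) for _ in range(uq_n)]
    return statistics.mean(samples), statistics.stdev(samples)


x = [1.0, 2.0, 3.0]
mean, std = predict_with_uncertainty(x)
print(mean, std)
```

A larger uq_n gives a more stable estimate of both values at the cost of proportionally more inference time.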
predict_from_torch¶
- slideflow.stats.predict_from_torch(model, dataset, model_type, pred_args, uq_n=30, **kwargs)¶
Generates predictions (y_true, y_pred, tile_to_slide) from a given PyTorch model and dataset.
- Parameters
model (str) – Path to PyTorch model.
dataset (torch.utils.data.DataLoader) – PyTorch dataloader.
pred_args (namespace) – Namespace containing slide_input, update_corrects, and update_loss functions.
model_type (str, optional) – ‘categorical’, ‘linear’, or ‘cph’. If multiple linear outcomes are present, y_true is stacked into a single vector for each image. Defaults to ‘categorical’.
- Returns
y_pred, y_std, tile_to_slides
save_histogram¶
- slideflow.stats.save_histogram(y_true, y_pred, outdir, name='histogram', neptune_run=None, subsample=500)¶
Generates histogram of y_pred, labeled by y_true, saving to outdir.
pred_to_df¶
- slideflow.stats.pred_to_df(y_true, y_pred, tile_to_slides, outcome_names, uncertainty=None)¶
Save tile-level predictions.
Assumes structure of y_true, y_pred, uncertainty is:
- List of length num_outcomes, containing numpy arrays
- Each np array is either shape (num_tiles) [single linear outcome] or (num_tiles, num_categories) [categorical]
- Parameters
y_true (np.ndarray) – Tile-level labels.
y_pred (np.ndarray) – Tile-level predictions.
tile_to_slides (np.ndarray) – Slides corresponding to each tile.
outcome_names (np.ndarray) – List of outcome names.
uncertainty (bool, optional) – Tile-level uncertainty. Defaults to None.
- Raises
errors.StatsError – If len(y_pred) is 1 but >1 outcome_names provided.
errors.StatsError – If num outcomes in y_true and y_pred are unequal.
- Returns
Pandas DataFrame