slideflow.stats

In addition to containing functions used during model training and evaluation, this module provides the slideflow.SlideMap class designed to assist with visualizing tiles and slides in two-dimensional space.

Once a model has been trained, tile-level predictions and intermediate layer activations can be calculated across an entire dataset with slideflow.model.DatasetFeatures. The slideflow.SlideMap class can then perform dimensionality reduction on these dataset-wide activations, plotting tiles and slides in two-dimensional space. Visualizing the distribution and clustering of tile-level and slide-level layer activations can help reveal underlying structures in the dataset and shared visual features among classes.

The typical workflow is to first generate a slideflow.model.DatasetFeatures from a trained model, then create a slideflow.SlideMap instance with the from_features class method:

import slideflow as sf

df = sf.model.DatasetFeatures(model='/path/', ...)
slide_map = sf.SlideMap.from_features(df)

Alternatively, if you would like to map slides from a dataset in two-dimensional space using pre-calculated x and y coordinates, you can use the from_precalculated class method. In addition to X and Y, this method requires supplying tile-level metadata in the form of a list of dicts. Each dict must contain the name of the origin slide and the tile index in the slide TFRecord.

import numpy as np
import slideflow as sf

# `project` is assumed to be an existing slideflow Project instance.
dataset = project.dataset(tile_px=299, tile_um=302)
slides = dataset.slides()
x = np.array(...)
y = np.array(...)
meta = [{'slide': ..., 'index': ...} for _ in range(len(x))]
slide_map = sf.SlideMap.from_precalculated(slides, x, y, meta)
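
As a minimal sketch of the metadata format described above, the following builds a list of dicts pairing each point with its origin slide and tile index. The slide names, tile counts, and coordinates are hypothetical, and the sketch assumes points are ordered slide by slide. No slideflow calls are made.

```python
# Hypothetical slides and number of tiles per slide (illustration only).
tiles_per_slide = {'slide_a': 3, 'slide_b': 2}

# One dict per point, holding the origin slide and the tile's index
# within that slide's TFRecord.
meta = [
    {'slide': slide, 'index': idx}
    for slide, n_tiles in tiles_per_slide.items()
    for idx in range(n_tiles)
]

# Hypothetical precalculated coordinates; meta needs one entry per (x, y) pair.
x = [0.1, 0.2, 0.3, 0.7, 0.8]
y = [0.5, 0.4, 0.6, 0.1, 0.2]
assert len(meta) == len(x) == len(y)
```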

SlideMap

class slideflow.SlideMap(slides, cache=None)

Two-dimensional slide map for visualization & backend for mosaic maps.

Slides are mapped in 2D either explicitly with pre-specified coordinates, or with dimensionality reduction from post-convolutional layer activations, provided by a slideflow.model.DatasetFeatures.

__init__(slides, cache=None)

Backend for mapping slides into two-dimensional space. Can use a DatasetFeatures object to map slides according to a UMAP of features, or map according to pre-specified coordinates.

Parameters
  • slides (list(str)) – List of slide names

  • cache (str, optional) – Path to PKL file to cache activations. Defaults to None (caching disabled).

cluster(n_clusters)

Performs clustering on the data and adds the cluster labels to the point metadata. Requires a DatasetFeatures backend.

Clusters are saved to self.point_meta[i]['cluster'].

Parameters

n_clusters (int) – Number of clusters for K means clustering.

Returns

Array with cluster labels corresponding to tiles in self.point_meta.

Return type

ndarray
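
To illustrate what K-means clustering followed by writing labels into per-point metadata looks like, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not slideflow's implementation: the feature vectors, the `kmeans` helper, and the choice of the first points as initial centers are all assumptions made for the example.

```python
import math

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's algorithm; initial centers are the first n_clusters points."""
    centers = [tuple(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Hypothetical feature vectors and matching per-point metadata.
points = [(0.0, 0.1), (0.9, 1.0), (0.1, 0.0), (1.0, 0.9)]
point_meta = [{'slide': 's1', 'index': i} for i in range(len(points))]

labels = kmeans(points, n_clusters=2)
for meta, label in zip(point_meta, labels):
    meta['cluster'] = label  # mirrors the point_meta[i]['cluster'] layout
```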

export_to_csv(filename)

Exports calculated UMAP coordinates in csv format.

Parameters

filename (str) – Path to CSV file in which to save coordinates.

filter(slides)

Filters map to only show tiles from the given slides.

Parameters

slides (list(str)) – List of slide names.

classmethod from_features(df, exclude_slides=None, prediction_filter=None, recalculate=False, map_slide=None, cache=None, low_memory=False, umap_dim=2)

Initializes map from dataset features.

Parameters
  • df (slideflow.model.DatasetFeatures) – DatasetFeatures.

  • exclude_slides (list, optional) – List of slides to exclude.

  • prediction_filter (list, optional) – Restrict predictions to only these provided categories.

  • recalculate (bool, optional) – Force recalculation of umap despite presence of cache.

  • use_centroid (bool, optional) – Calculate/map centroid activations.

  • map_slide (str, optional) – Either None, ‘centroid’, or ‘average’. If None, will map all tiles from each slide. Defaults to None.

  • cache (str, optional) – Path to PKL file to cache coordinates. Defaults to None (caching disabled).

classmethod from_precalculated(slides, x, y, meta, labels=None, cache=None)

Initializes map from precalculated coordinates.

Parameters
  • slides (list(str)) – List of slide names.

  • x (list(int)) – List of X coordinates for tfrecords.

  • y (list(int)) – List of Y coordinates for tfrecords.

  • meta (list(dict)) – List of dicts containing metadata for each point on the map (representing a single tfrecord).

  • labels (list(str)) – Labels assigned to each tfrecord, used for coloring TFRecords according to labels.

  • cache (str, optional) – Path to PKL file to cache coordinates. Defaults to None (caching disabled).

get_tiles_in_area(x_lower, x_upper, y_lower, y_upper)

Returns a dictionary mapping slide names to tile indices, for tiles that fall within the specified location on the UMAP.

Parameters
  • x_lower (int, optional) – X-axis lower limit.

  • x_upper (int, optional) – X-axis upper limit.

  • y_lower (int, optional) – Y-axis lower limit.

  • y_upper (int, optional) – Y-axis upper limit.

Returns

Dict mapping slide names to matching tile indices

Return type

dict
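
The selection described above amounts to a bounding-box filter over the mapped points. The sketch below shows one way to do this in plain Python. The `tiles_in_area` helper and sample data are hypothetical, and treating the limits as inclusive is an assumption.

```python
from collections import defaultdict

def tiles_in_area(x, y, point_meta, x_lower, x_upper, y_lower, y_upper):
    """Return {slide: [tile indices]} for points inside the bounding box."""
    selected = defaultdict(list)
    for xi, yi, meta in zip(x, y, point_meta):
        # Inclusive bounds are an assumption of this sketch.
        if x_lower <= xi <= x_upper and y_lower <= yi <= y_upper:
            selected[meta['slide']].append(meta['index'])
    return dict(selected)

# Hypothetical coordinates and metadata for four tiles.
x = [0.1, 0.2, 0.8, 0.25]
y = [0.1, 0.3, 0.9, 0.2]
point_meta = [
    {'slide': 'slide_a', 'index': 0},
    {'slide': 'slide_a', 'index': 1},
    {'slide': 'slide_b', 'index': 0},
    {'slide': 'slide_b', 'index': 1},
]

result = tiles_in_area(x, y, point_meta, 0.0, 0.3, 0.0, 0.35)
# result == {'slide_a': [0, 1], 'slide_b': [1]}
```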

label_by_logits(index)

Displays each point with label equal to the logits (linear from 0-1)

Parameters

index (int) – Logit index.

label_by_meta(meta, translation_dict=None)

Displays each point labeled by tile metadata (e.g. ‘prediction’)

Parameters
  • meta (str) – Key to metadata from which to read.

  • translation_dict (dict, optional) – If provided, will translate the read metadata through this dictionary.
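
As a rough illustration of how a translation dictionary can turn raw metadata values into display labels, consider the sketch below. The `labels_from_meta` helper, the metadata entries, and the class names are hypothetical and do not reflect slideflow internals.

```python
# Hypothetical per-point metadata with an integer 'prediction' field.
point_meta = [
    {'slide': 'slide_a', 'index': 0, 'prediction': 0},
    {'slide': 'slide_a', 'index': 1, 'prediction': 1},
    {'slide': 'slide_b', 'index': 0, 'prediction': 1},
]
# Hypothetical mapping from raw values to readable labels.
translation_dict = {0: 'benign', 1: 'malignant'}

def labels_from_meta(point_meta, key, translation_dict=None):
    """Read `key` from each point's metadata, optionally translating values."""
    values = [m[key] for m in point_meta]
    if translation_dict is not None:
        values = [translation_dict[v] for v in values]
    return values

labels = labels_from_meta(point_meta, 'prediction', translation_dict)
# labels == ['benign', 'malignant', 'malignant']
```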

label_by_slide(slide_labels=None)

Displays each point as the name of the corresponding slide.

If slide_labels is provided, will use this dict to label slides.

Parameters

slide_labels (dict, optional) – Dict mapping slide names to labels.

label_by_uncertainty(index=0)

Labels each point with the tile-level uncertainty, if available.

Parameters

index (int, optional) – Uncertainty index. Defaults to 0.

load_cache()

Load coordinates from PKL cache.

neighbors(slide_categories=None, algorithm='kd_tree')

Calculates neighbors among tiles in this map, assigning neighboring statistics to tile metadata ‘num_unique_neighbors’ and ‘percent_matching_categories’.

Parameters
  • slide_categories (dict, optional) – Maps slides to categories. Defaults to None. If provided, will be used to calculate ‘percent_matching_categories’ statistic.

  • algorithm (str, optional) – NearestNeighbor algorithm, either ‘kd_tree’, ‘ball_tree’, or ‘brute’. Defaults to ‘kd_tree’.

save(filename, subsample=None, title=None, cmap=None, xlim=(-0.05, 1.05), ylim=(-0.05, 1.05), xlabel=None, ylabel=None, legend=None, dpi=300, **scatter_kwargs)

Saves plot of data to a provided filename.

Parameters
  • filename (str) – File path to save the image.

  • subsample (int, optional) – Subsample to only include this many tiles on plot. Defaults to None.

  • title (str, optional) – Title for plot.

  • cmap (dict, optional) – Dict mapping labels to colors.

  • xlim (list, optional) – List of float indicating limit for x-axis. Defaults to (-0.05, 1.05).

  • ylim (list, optional) – List of float indicating limit for y-axis. Defaults to (-0.05, 1.05).

  • xlabel (str, optional) – Label for x axis. Defaults to None.

  • ylabel (str, optional) – Label for y axis. Defaults to None.

  • legend (str, optional) – Title for legend. Defaults to None.

  • dpi (int, optional) – DPI for final image. Defaults to 300.

save_2d_plot(*args, **kwargs)

Deprecated function; please use save.

save_3d_plot(filename, z=None, feature=None, subsample=None)

Saves a plot of a 3D umap, with the 3rd dimension representing values provided by argument “z”.

Parameters
  • filename (str) – Filename to save image of plot.

  • z (list, optional) – Values for z axis. Must supply z or feature. Defaults to None.

  • feature (int, optional) – Int, feature to plot on 3rd axis. Must supply z or feature. Defaults to None.

  • subsample (int, optional) – Subsample to only include this many tiles on plot. Defaults to None.

save_cache()

Save cache of coordinates to PKL file.

show_neighbors(neighbor_df, slide)

Filters map to only show neighbors with a corresponding neighbor DatasetFeatures and neighbor slide.

Parameters
  • neighbor_df (slideflow.model.DatasetFeatures) – DatasetFeatures containing activations for neighboring slide.

  • slide (str) – Name of neighboring slide.

basic_metrics

slideflow.stats.basic_metrics(y_true, y_pred)

Generates metrics, including sensitivity, specificity, and accuracy.
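
For reference, sensitivity, specificity, and accuracy for binary labels follow directly from the confusion-matrix counts. The sketch below is a generic illustration with a hypothetical `binary_metrics` helper. The return format of basic_metrics itself is not specified here, so the dict layout is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels.

    Assumes both classes are present in y_true (otherwise a ratio is undefined).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'sensitivity': tp / (tp + fn),   # true positive rate
        'specificity': tn / (tn + fp),   # true negative rate
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```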

calculate_centroid

slideflow.stats.calculate_centroid(act)

Calculates slide-level centroid indices for a provided activations dict.

Parameters

act (dict) – Dict mapping slide names to ndarray of activations across tiles, of shape (n_tiles, n_features).

Returns

Dict mapping slides to index of tile nearest to centroid, and dict mapping slides to activations of tile nearest to centroid.

Return type

dict

concordance_index

slideflow.stats.concordance_index(y_true, y_pred)

Calculates concordance index from a given y_true and y_pred.
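
As background, a concordance index measures how often a model ranks pairs of observations in the correct order. The sketch below is a generic Harrell's C-index for survival data, not slideflow's implementation. The layout of y_true expected by concordance_index is not described here, so this hypothetical helper takes separate `times`, `risk_scores`, and `events` arguments, and assumes a higher risk score should correspond to a shorter survival time.

```python
from itertools import combinations

def harrell_c_index(times, risk_scores, events):
    """Fraction of comparable pairs ranked correctly (ties in risk count 0.5)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that i has the shorter observed time.
        if times[j] < times[i]:
            i, j = j, i
        # A pair is comparable only if the shorter time is an observed
        # event (not censored) and the two times differ.
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if risk_scores[i] > risk_scores[j]:
            concordant += 1
        elif risk_scores[i] == risk_scores[j]:
            concordant += 0.5
    return concordant / comparable

# Hypothetical data: the third observation is censored (event == 0).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c_index(times, risk_scores, events)  # 4 of 5 pairs concordant
```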

filtered_prediction

slideflow.stats.filtered_prediction(logits, filter=None)

Generates a prediction from a logits vector masked by a given filter.

Parameters

filter (list(int)) – List of logit indices to include when generating a prediction. All other logits will be masked.

Returns

index of prediction.

Return type

int
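
The idea of a filtered prediction can be sketched as an argmax restricted to the allowed logit indices. The `filtered_argmax` helper below is hypothetical, and slideflow may implement the masking differently.

```python
def filtered_argmax(logits, keep=None):
    """Return the index of the largest logit, considering only indices in `keep`."""
    if keep is None:
        keep = range(len(logits))
    return max(keep, key=lambda i: logits[i])

logits = [0.1, 0.7, 0.2]
unfiltered = filtered_argmax(logits)             # index 1 has the largest logit
filtered = filtered_argmax(logits, keep=[0, 2])  # index 1 is masked out
```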

generate_combined_roc

slideflow.stats.generate_combined_roc(y_true, y_pred, save_dir, labels, name='ROC', neptune_run=None)

Generates and saves overlapping ROCs.

generate_roc

slideflow.stats.generate_roc(y_true, y_pred, save_dir=None, name='ROC', neptune_run=None)

Generates and saves an ROC with a given set of y_true, y_pred values.
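
For context, the area under an ROC curve equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes AUC this way in plain Python. The `roc_auc` helper is hypothetical, and it does not plot or save anything, unlike generate_roc.

```python
def roc_auc(y_true, y_score):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 3 of 4 positive/negative pairs ranked correctly
```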

generate_scatter

slideflow.stats.generate_scatter(y_true, y_pred, data_dir, name='_plot', plot=True, neptune_run=None)

Generate and save scatter plots and calculate R2 for each outcome.

Parameters
  • y_true (np.ndarray) – 2D array of labels. Observations are in first dimension, second dim is the outcome.

  • y_pred (np.ndarray) – 2D array of predictions.

  • data_dir (str) – Path to directory in which to save plots.

  • name (str, optional) – Label for filename. Defaults to ‘_plot’.

  • plot (bool, optional) – Save scatter plots.

  • neptune_run (optional) – Neptune Run. If provided, will upload plot.

Returns

R squared.
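
As a sketch of a per-outcome R squared calculation on 2D inputs like those described above, the following computes the squared Pearson correlation for each outcome column. The exact definition of R squared used by generate_scatter is not stated here, so treating it as the squared correlation is an assumption. The `r_squared` helper and the data are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Observations in the first dimension, outcomes in the second.
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y_pred = [[2.0, 12.0], [4.0, 18.0], [6.0, 33.0], [8.0, 37.0]]

# zip(*rows) transposes the rows into one column per outcome.
r2_per_outcome = [
    r_squared(t_col, p_col)
    for t_col, p_col in zip(zip(*y_true), zip(*y_pred))
]
```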

gen_umap

slideflow.stats.gen_umap(array, dim=2, n_neighbors=50, min_dist=0.1, metric='cosine', **kwargs)

Generates and returns a umap from a given array, using umap.UMAP

get_centroid_index

slideflow.stats.get_centroid_index(arr)

Calculate index nearest to centroid from a given 2D input array.
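
Finding the index nearest to the centroid can be sketched as: average the rows, then pick the row with the smallest Euclidean distance to that average. The helpers and data below are hypothetical, and the distance metric used by slideflow is an assumption. The second helper applies the same idea per slide, in the spirit of calculate_centroid.

```python
import math

def centroid_index(arr):
    """Index of the row in a 2D list closest to the mean of all rows."""
    n = len(arr)
    centroid = [sum(col) / n for col in zip(*arr)]
    return min(range(n), key=lambda i: math.dist(arr[i], centroid))

def centroid_indices(activations):
    """Apply centroid_index to each slide in a {slide: 2D list} dict."""
    return {slide: centroid_index(act) for slide, act in activations.items()}

# Hypothetical activations of shape (n_tiles, n_features) per slide.
activations = {
    'slide_a': [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]],
    'slide_b': [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],
}
indices = centroid_indices(activations)
```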

metrics_from_dataset

slideflow.stats.metrics_from_dataset(model, model_type, labels, patients, dataset, outcome_names=None, label=None, data_dir=None, num_tiles=0, histogram=False, save_predictions=True, neptune_run=None, pred_args=None)

Evaluate performance of a given model on a given TFRecord dataset, generating a variety of statistical outcomes and graphs.

Parameters
  • model (tf.keras.Model) – Keras model to evaluate.

  • model_type (str) – ‘categorical’, ‘linear’, or ‘cph’.

  • labels (dict) – Dictionary mapping slidenames to outcomes.

  • patients (dict) – Dictionary mapping slidenames to patients.

  • dataset (tf.data.Dataset) – Tensorflow dataset.

  • outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.

  • label (str, optional) – Label prefix/suffix for saving. Defaults to None.

  • data_dir (str, optional) – Path to data directory for saving. Defaults to None.

  • num_tiles (int, optional) – Number of total tiles expected in dataset. Used for progress bar. Defaults to 0.

  • histogram (bool, optional) – Write histograms to data_dir. Defaults to False.

  • save_predictions (bool, optional) – Save tile, slide, and patient-level predictions to CSV. Defaults to True.

  • neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.

  • pred_args (namespace, optional) – Additional arguments to tensorflow and torch backends.

Returns

metrics [dict], accuracy [float], loss [float]

metrics_from_pred

slideflow.stats.metrics_from_pred(y_true, y_pred, tile_to_slides, labels, patients, model_type, y_std=None, outcome_names=None, label=None, data_dir=None, save_predictions=True, histogram=False, plot=True, neptune_run=None)

Generates metrics from a set of predictions.

For multiple outcomes, y_true and y_pred are expected to be a list of numpy arrays (each array corresponding to whole-dataset predictions for a single outcome)

Parameters
  • y_true (ndarray) – True labels for the dataset.

  • y_pred (ndarray) – Predicted labels for the dataset.

  • tile_to_slides (list(str)) – List of length y_true of slide names.

  • labels (dict) – Dictionary mapping slidenames to outcomes.

  • patients (dict) – Dictionary mapping slidenames to patients.

  • model_type (str) – Either ‘linear’, ‘categorical’, or ‘cph’.

  • outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.

  • label (str, optional) – Label prefix/suffix for saving. Defaults to None.

  • min_tiles (int, optional) – Min tiles per slide to include in metrics. Defaults to 0.

  • data_dir (str, optional) – Path to data directory for saving. Defaults to None.

  • save_predictions (bool, optional) – Save tile, slide, and patient-level predictions to CSV. Defaults to True.

  • histogram (bool, optional) – Write histograms to data_dir. Defaults to False.

  • plot (bool, optional) – Save scatterplot for linear outcomes. Defaults to True.

  • neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.

normalize_layout

slideflow.stats.normalize_layout(layout, min_percentile=1, max_percentile=99, relative_margin=0.1)

Removes outliers and scales layout to between [0,1].
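
One plausible reading of the parameters above is percentile-based clipping followed by min-max scaling. The sketch below shows that approach in plain Python. It is an assumption about the behavior rather than the function's actual source, and the helper names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def normalize_axis(values, min_percentile=1, max_percentile=99,
                   relative_margin=0.1):
    """Clip one axis to a percentile range (plus margin), then scale to [0, 1]."""
    lo = percentile(values, min_percentile)
    hi = percentile(values, max_percentile)
    margin = relative_margin * (hi - lo)
    lo, hi = lo - margin, hi + margin
    clipped = [min(max(v, lo), hi) for v in values]
    c_min, c_max = min(clipped), max(clipped)
    return [(v - c_min) / (c_max - c_min) for v in clipped]

def normalize_layout_sketch(layout, **kwargs):
    """Normalize each column of a list of [x, y] points independently."""
    columns = [normalize_axis(list(col), **kwargs) for col in zip(*layout)]
    return [list(point) for point in zip(*columns)]

# 100 evenly spaced points plus one extreme outlier.
layout = [[float(i), float(i)] for i in range(100)] + [[1000.0, 1000.0]]
normalized = normalize_layout_sketch(layout)
```

Without clipping, the outlier would compress the other 100 points into roughly the lowest tenth of the range. With clipping, they span most of [0, 1].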

read_predictions

slideflow.stats.read_predictions(predictions_file, level)

Reads predictions from a previously saved CSV file.

permute_importance

slideflow.stats.permute_importance(model, dataset, labels, patients, model_type, data_dir, outcome_names=None, label=None, num_tiles=0, feature_names=None, feature_sizes=None, drop_images=False, neptune_run=None)

Calculate metrics (tile, slide, and patient AUC) from a given model that accepts clinical, slide-level feature inputs, and permute to find relative feature performance.

Parameters
  • model (str) – Path to Tensorflow model.

  • dataset (tf.data.Dataset) – TFRecord dataset which include three items: raw image data, labels, and slide names.

  • labels (dict) – Dictionary mapping slidenames to outcomes.

  • patients (dict) – Dictionary mapping slidenames to patients.

  • model_type (str) – ‘categorical’, ‘linear’, or ‘cph’.

  • data_dir (str) – Path to output data directory.

  • outcome_names (list, optional) – List of str, names for outcomes. Defaults to None.

  • label (str, optional) – Label prefix/suffix. Defaults to None.

  • num_tiles (int, optional) – Number of total tiles expected in the dataset. Used for progress bar. Defaults to 0.

  • feature_names (list, optional) – List of str, names for each of the clinical input features.

  • feature_sizes (list, optional) – List of int, sizes for each of the clinical input features.

  • drop_images (bool, optional) – Exclude images (predict from clinical features alone). Defaults to False.

  • neptune_run (neptune.Run, optional) – Neptune run in which to log results. Defaults to None.

Returns

Dictionary of AUCs with keys ‘tile’, ‘slide’, and ‘patient’.
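
The general idea of permutation importance is to shuffle one input feature at a time and measure how much a performance metric drops. The sketch below illustrates this with a toy model and an accuracy metric. It is a generic illustration, not the procedure used by permute_importance, and every name in it is hypothetical.

```python
import random

def permutation_importance(predict, X, y, metric, seed=0):
    """Drop in `metric` after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = metric(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this column permuted.
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        importances.append(baseline - metric(y, predict(X_perm)))
    return importances

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_predict(X):
    # Toy model: uses only feature 0 and ignores feature 1.
    return [1 if row[0] > 0.5 else 0 for row in X]

X = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
y = toy_predict(X)  # labels the toy model predicts perfectly
importances = permutation_importance(toy_predict, X, y, accuracy)
```

Because the toy model ignores feature 1, shuffling it leaves accuracy unchanged and its importance is exactly zero.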

predict_from_layer

slideflow.stats.predict_from_layer(model, layer_input, input_layer_name='hidden_0', output_layer_index=None)

Generate predictions from a model, providing intermediate layer input.

Parameters
  • model (str) – Path to Tensorflow model

  • layer_input (ndarray) – Dataset to use as input for the given layer, to generate predictions.

  • input_layer_name (str, optional) – Name of intermediate layer, to which input is provided. Defaults to ‘hidden_0’.

  • output_layer_index (int, optional) – Excludes layers beyond this index. CPH models include a final concatenation layer (softmax + event tensor) that should be excluded. Defaults to None.

Returns

Model predictions.

Return type

ndarray

predict_from_tensorflow

slideflow.stats.predict_from_tensorflow(model, dataset, model_type, pred_args, num_tiles=0, uq_n=30)

Generates predictions (y_true, y_pred, tile_to_slide) from a given Tensorflow model and dataset.

Parameters
  • model (str) – Path to Tensorflow model.

  • dataset (tf.data.Dataset) – Tensorflow dataset.

  • model_type (str, optional) – ‘categorical’, ‘linear’, or ‘cph’. Will not attempt to calculate accuracy for non-categorical models. Defaults to ‘categorical’.

  • pred_args (namespace) – Namespace containing the property loss, loss function used to calculate loss.

  • num_tiles (int, optional) – Used for progress bar. Defaults to 0.

  • uq_n (int, optional) – Number of per-tile inferences to perform if calculating uncertainty via dropout.

  • evaluate (bool, optional) – Calculate and return accuracy and loss. Dataset must also return y_true.

Returns

y_true, y_pred, tile_to_slides, accuracy, loss

predict_from_torch

slideflow.stats.predict_from_torch(model, dataset, model_type, pred_args, uq_n=30, **kwargs)

Generates predictions (y_true, y_pred, tile_to_slide) from a given PyTorch model and dataset.

Parameters
  • model (str) – Path to PyTorch model.

  • dataset (torch.utils.data.DataLoader) – PyTorch dataloader.

  • pred_args (namespace) – Namespace containing slide_input, update_corrects, and update_loss functions.

  • model_type (str, optional) – ‘categorical’, ‘linear’, or ‘cph’. If multiple linear outcomes are present, y_true is stacked into a single vector for each image. Defaults to ‘categorical’.

Returns

y_pred, y_std, tile_to_slides

save_histogram

slideflow.stats.save_histogram(y_true, y_pred, outdir, name='histogram', neptune_run=None, subsample=500)

Generates histogram of y_pred, labeled by y_true, saving to outdir.

pred_to_df

slideflow.stats.pred_to_df(y_true, y_pred, tile_to_slides, outcome_names, uncertainty=None)

Save tile-level predictions.

Assumes the structure of y_true, y_pred, and uncertainty is:
  • List of length num_outcomes, containing numpy arrays.

  • Each numpy array is either shape (num_tiles) [single linear outcome] or (num_tiles, num_categories) [categorical].

Parameters
  • y_true (np.ndarray) – Tile-level labels.

  • y_pred (np.ndarray) – Tile-level predictions.

  • tile_to_slides (np.ndarray) – Slides corresponding to each tile.

  • outcome_names (np.ndarray) – List of outcome names.

  • uncertainty (bool, optional) – Tile-level uncertainty. Defaults to None.

Raises
  • errors.StatsError – If len(y_pred) is 1 but >1 outcome_names provided.

  • errors.StatsError – If num outcomes in y_true and y_pred are unequal.

Returns

Pandas DataFrame

to_onehot

slideflow.stats.to_onehot(val, max)

Converts value to one-hot encoding

Parameters
  • val (int) – Value to encode

  • max (int) – Maximum value (length of onehot encoding)
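
A one-hot encoding of this kind can be sketched in a few lines. The `to_onehot_sketch` helper below returns a plain list and assumes val is a zero-based index smaller than the encoding length. The return type of to_onehot itself is not stated here.

```python
def to_onehot_sketch(val, max_val):
    """Return a list of length max_val with a 1 at position val, 0 elsewhere."""
    onehot = [0] * max_val
    onehot[val] = 1
    return onehot

encoded = to_onehot_sketch(2, 4)  # [0, 0, 1, 0]
```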