slideflow.dataset

The Dataset class in this module is used to organize dataset sources, ROI annotations, clinical annotations, and dataset processing.

Dataset Organization

A source is a set of slides, corresponding Regions of Interest (ROI) annotations (if available), and any tiles extracted from these slides, either as loose tiles or in the binary TFRecord format. Sources are defined in the project dataset configuration JSON file, with the following format:

{
    "SOURCE":
    {
        "slides": "/directory",
        "roi": "/directory",
        "tiles": "/directory",
        "tfrecords": "/directory"
    }
}

A single dataset can have multiple sources. One example of this might be if you were performing a pan-cancer analysis; you would likely have a unique source for each cancer subtype, in order to keep each set of slides and tiles distinct. Another example might be if you are analyzing slides from multiple institutions, and you want to ensure that you are not mixing your training and evaluation datasets.
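
As a rough illustration of the multi-source layout described above, the following sketch builds a two-source configuration with Python's json module and reads it back to confirm that it is valid JSON. The source names (LUNG, BREAST), the directory paths, and the file name datasets.json are placeholders chosen for this example; they are not values required by slideflow.

```python
import json
import os
import tempfile

# Hypothetical source names and paths, for illustration only.
config = {
    name: {
        "slides": f"/data/{name}/slides",
        "roi": f"/data/{name}/roi",
        "tiles": f"/data/{name}/tiles",
        "tfrecords": f"/data/{name}/tfrecords",
    }
    for name in ("LUNG", "BREAST")
}

# Write the configuration to a temporary location.
config_path = os.path.join(tempfile.mkdtemp(), "datasets.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read it back; json.load fails loudly on malformed files
# (for example, a trailing comma after the last key).
with open(config_path) as f:
    loaded = json.load(f)

print(sorted(loaded))  # ['BREAST', 'LUNG']
```

Keeping one entry per cohort means each source has its own slide, ROI, tile, and tfrecord directories, so files from different cohorts are never mixed on disk.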

The Dataset class is initialized from a dataset configuration file, a list of source names to include from the configuration file, and tile size parameters (tile_px and tile_um). Clinical annotations can be provided to this object, which can then be used to filter slides according to outcomes and perform a variety of other class-aware functions.

Filtering

Datasets can be filtered with several different filtering mechanisms:

  • filters: A dictionary can be passed via the filters argument to a Dataset to perform filtering. The keys of this dictionary should be annotation headers, and the values of this dictionary indicate the categorical outcomes which should be included. Any slides with an outcome other than what is provided by this dict will be excluded.

  • filter_blank: A list of headers can be provided to the filter_blank argument; any slide with a blank annotation in one of these columns will be excluded.

  • min_tiles: An int can be provided to min_tiles; any tfrecords with fewer than this number of tiles will be excluded.

Filters can be provided at the time of Dataset instantiation by passing to the initializer:

dataset = Dataset(..., filters={'HPV_status': ['negative', 'positive']})

… or with the Dataset.filter() method:

dataset = dataset.filter(min_tiles=50)

Once applied, all dataset functions and properties will reflect these filtering criteria, including the Dataset.num_tiles property.
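
The semantics of the three filtering mechanisms can be sketched in plain Python. This is only an illustration of the rules described above, not slideflow's implementation: the annotation rows, the tile counts, and the helper select_slides() are all hypothetical.

```python
# Hypothetical annotations: one row per slide, keyed by annotation header.
annotations = [
    {"slide": "slide1", "HPV_status": "negative", "stage": "II"},
    {"slide": "slide2", "HPV_status": "positive", "stage": ""},
    {"slide": "slide3", "HPV_status": "unknown", "stage": "I"},
    {"slide": "slide4", "HPV_status": "positive", "stage": "III"},
]
# Hypothetical tile counts, assuming one tfrecord per slide.
tile_counts = {"slide1": 120, "slide2": 300, "slide3": 80, "slide4": 20}


def select_slides(rows, counts, filters=None, filter_blank=None, min_tiles=0):
    """Return the names of slides that pass all three filtering mechanisms."""
    kept = []
    for row in rows:
        # filters: the value under each header must be one of the allowed values.
        if any(row[h] not in allowed for h, allowed in (filters or {}).items()):
            continue
        # filter_blank: exclude slides with a blank value under any listed header.
        if any(row[h] == "" for h in (filter_blank or [])):
            continue
        # min_tiles: exclude tfrecords with fewer than this number of tiles.
        if counts[row["slide"]] < min_tiles:
            continue
        kept.append(row["slide"])
    return kept


hpv_known = select_slides(
    annotations, tile_counts, filters={"HPV_status": ["negative", "positive"]}
)
print(hpv_known)  # ['slide1', 'slide2', 'slide4']

strict = select_slides(
    annotations,
    tile_counts,
    filters={"HPV_status": ["negative", "positive"]},
    filter_blank=["stage"],
    min_tiles=50,
)
print(strict)  # ['slide1']
```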

Dataset Manipulation

A number of different functions can be applied to Datasets in order to manipulate filters (Dataset.filter(), Dataset.remove_filter(), Dataset.clear_filters()), balance datasets (Dataset.balance()), or clip tfrecords to a maximum number of tiles (Dataset.clip()). The full documentation of these functions is given below. Note: these functions do not modify the original dataset; each returns a copy of the Dataset with the change applied. Thus, for proper use, assign the result of the function to the original dataset variable:

dataset = dataset.clip(50)

This also means that these functions can be chained for simplicity:

dataset = dataset.balance('HPV_status').clip(50)
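
The consequence of this copy-returning design is easy to miss, so here is a small sketch using a hypothetical stand-in class, ToyDataset. It is not part of slideflow and only mimics the pattern: calling clip() without assigning the result leaves the original object unchanged.

```python
import copy


class ToyDataset:
    """Hypothetical stand-in that mimics the copy-returning pattern."""

    def __init__(self, tile_counts):
        self.tile_counts = dict(tile_counts)
        self.max_tiles = None

    def clip(self, max_tiles):
        # Work on a copy so that the original object is left unchanged.
        clipped = copy.deepcopy(self)
        clipped.max_tiles = max_tiles
        return clipped

    @property
    def num_tiles(self):
        if self.max_tiles is None:
            return sum(self.tile_counts.values())
        return sum(min(n, self.max_tiles) for n in self.tile_counts.values())


original = ToyDataset({"a": 1526, "b": 455})
original.clip(500)             # result discarded: `original` is unaffected
print(original.num_tiles)      # 1981

clipped = original.clip(500)   # keep the returned copy instead
print(clipped.num_tiles)       # 955
```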

Manifest

The Dataset manifest is a dictionary mapping each tfrecord to both its total number of tiles and the number of tiles remaining after any clipping or balancing. For example, after clipping:

dataset = dataset.clip(500)

… the Dataset.manifest() function would return something like:

{
    "/path/tfrecord1.tfrecords":
    {
        "total": 1526,
        "clipped": 500
    },
    "/path/tfrecord2.tfrecords":
    {
        "total": 455,
        "clipped": 455
    }
}
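
A manifest in this shape is an ordinary dictionary, so summary statistics can be computed directly from it. The sketch below reuses the example values shown above; the variable names are illustrative only.

```python
# Manifest in the format shown above (values copied from the example).
manifest = {
    "/path/tfrecord1.tfrecords": {"total": 1526, "clipped": 500},
    "/path/tfrecord2.tfrecords": {"total": 455, "clipped": 455},
}

# Tiles available before and after clipping.
total_tiles = sum(entry["total"] for entry in manifest.values())
tiles_after_clipping = sum(entry["clipped"] for entry in manifest.values())

# TFRecords that actually lost tiles to clipping.
affected = [path for path, e in manifest.items() if e["clipped"] < e["total"]]

print(total_tiles, tiles_after_clipping)  # 1981 955
print(affected)                           # ['/path/tfrecord1.tfrecords']
```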

Training/Validation Splitting

Datasets can be split into training and validation datasets with Dataset.train_val_split(), with full documentation given below. The result of this function is two datasets - the first training, the second validation - each a separate instance of Dataset.

Tile and TFRecord Processing

Datasets can also be used to process and extract tiles. Some example methods that support tile and tfrecord processing include:

  • Dataset.extract_tiles(): Performs tile extraction for all slides in the dataset.

  • Dataset.extract_tiles_from_tfrecords(): Extract tiles from saved TFRecords, saving in loose .jpg or .png format to a folder.

  • Dataset.resize_tfrecords(): Resizes all images in TFRecords to a new size.

  • Dataset.split_tfrecords_by_roi(): Splits a set of extracted tfrecords according to whether tiles are inside or outside the slide’s ROI.

  • Dataset.tfrecord_report(): Generates a PDF report of the tiles inside a collection of TFRecords.

Tensorflow & PyTorch Datasets

Finally, Datasets can return either a tf.data.Dataset or a torch.utils.data.DataLoader object, ready to be used as model input, with the Dataset.tensorflow() and Dataset.torch() methods, respectively.

Dataset

class slideflow.Dataset(config, sources, tile_px, tile_um, annotations=None, filters=None, filter_blank=None, min_tiles=0)

Object to supervise organization of slides, tfrecords, and tiles across one or more sources in a stored configuration file.

__init__(config, sources, tile_px, tile_um, annotations=None, filters=None, filter_blank=None, min_tiles=0)

balance(headers=None, strategy='category', force=False)

Returns a dataset with prob_weights reflecting balancing per tile, slide, patient, or category.

Saves balancing information to the dataset variable prob_weights, which is used by the interleaving dataloaders when sampling from tfrecords to create a batch.

Tile level balancing will create prob_weights reflective of the number of tiles per slide, thus causing the batch sampling to mirror random sampling from the entire population of tiles (rather than randomly sampling from slides).

Slide level balancing assembles batches by randomly sampling from each slide/tfrecord with equal probability. This is how batches are sampled when no balancing is applied, so slide level balancing is equivalent to no balancing.

Patient level balancing is used to randomly sample from individual patients with equal probability. This is distinct from slide level balancing, as some patients may have multiple slides.

Category level balancing takes a list of annotation header(s) and generates prob_weights such that each category is sampled equally. This requires categorical outcomes.

Parameters
  • headers (list of str, optional) – List of annotation headers if balancing by category. Defaults to None.

  • strategy (str, optional) – ‘tile’, ‘slide’, ‘patient’, or ‘category’. Determines how prob_weights are created, balancing dataset batches to evenly distribute tiles, slides, patients, or categories in a given batch. Tile-level balancing generates prob_weights reflective of the total number of tiles in a slide. Defaults to ‘category’.

  • force (bool, optional) – If using category-level balancing, interpret all headers as categorical variables, even if the header appears to be a float.

Returns

balanced slideflow.dataset.Dataset object.
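
To make the four strategies concrete, the sketch below computes per-slide sampling probabilities for a toy dataset. It is an interpretation of the descriptions above, not slideflow's code: the function name prob_weights(), the input layout, and the choice to normalize the weights so that they sum to 1 are all assumptions made for this example.

```python
from collections import Counter


def prob_weights(slides, strategy):
    """Return a sampling probability per slide (probabilities sum to 1).

    `slides` maps slide name -> dict with 'tiles', 'patient', and 'category'.
    """
    if strategy == "tile":
        # Proportional to tile count: mimics sampling from the pool of all tiles.
        raw = {s: float(info["tiles"]) for s, info in slides.items()}
    elif strategy == "slide":
        # Every slide is equally likely.
        raw = {s: 1.0 for s in slides}
    elif strategy in ("patient", "category"):
        # Every group is equally likely; slides within a group share its weight.
        group_sizes = Counter(info[strategy] for info in slides.values())
        raw = {s: 1.0 / group_sizes[info[strategy]] for s, info in slides.items()}
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


# Hypothetical dataset: patient p1 contributes two slides, p2 one slide.
toy = {
    "s1": {"tiles": 100, "patient": "p1", "category": "negative"},
    "s2": {"tiles": 300, "patient": "p1", "category": "negative"},
    "s3": {"tiles": 600, "patient": "p2", "category": "positive"},
}

tile_w = prob_weights(toy, "tile")        # s1: 0.1, s2: 0.3, s3: 0.6
slide_w = prob_weights(toy, "slide")      # one third each
patient_w = prob_weights(toy, "patient")  # s1: 0.25, s2: 0.25, s3: 0.5
```

In this toy example, patient-level balancing gives the single slide from p2 the same total probability as both slides from p1 combined, which is the distinction from slide-level balancing drawn above.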

build_index(force=True)

Builds index files for TFRecords, required for PyTorch.

clear_filters()

Returns a dataset with all filters cleared.

Returns

slideflow.dataset.Dataset object.

clip(max_tiles=0, strategy=None, headers=None)

Returns a dataset clipped to either a fixed maximum number of tiles per tfrecord, or to the minimum number of tiles per patient or category.

Parameters
  • max_tiles (int, optional) – Clip the maximum number of tiles per tfrecord to this number.

  • strategy (str, optional) – ‘slide’, ‘patient’, or ‘category’. Clip the maximum number of tiles to the minimum tiles seen across slides, patients, or categories. If ‘category’, headers must be provided. Defaults to None.

  • headers (list of str, optional) – List of annotation headers to use if clipping by minimum category count (strategy=‘category’). Defaults to None.

Returns

clipped slideflow.dataset.Dataset object.
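
The effect of clipping on per-tfrecord tile counts can be sketched as follows. Only the fixed max_tiles limit and the ‘slide’ strategy are illustrated. The helper clip_counts() is hypothetical, and two of its behaviors are assumptions rather than documented facts: that max_tiles=0 means no fixed limit, and that the smaller limit wins when both arguments are given.

```python
def clip_counts(tile_counts, max_tiles=0, strategy=None):
    """Return per-tfrecord tile counts after clipping (illustrative only)."""
    limit = None
    if strategy == "slide":
        # Clip every tfrecord to the smallest tile count seen across slides.
        limit = min(tile_counts.values())
    if max_tiles:
        # Assumption: when both limits apply, use the smaller one.
        limit = max_tiles if limit is None else min(limit, max_tiles)
    if limit is None:
        return dict(tile_counts)
    return {name: min(n, limit) for name, n in tile_counts.items()}


counts = {"tfrecord1": 1526, "tfrecord2": 455}
print(clip_counts(counts, max_tiles=500))     # {'tfrecord1': 500, 'tfrecord2': 455}
print(clip_counts(counts, strategy="slide"))  # {'tfrecord1': 455, 'tfrecord2': 455}
```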

extract_tiles(save_tiles=False, save_tfrecords=True, source=None, stride_div=1, enable_downsample=True, roi_method='inside', skip_missing_roi=False, skip_extracted=True, tma=False, randomize_origin=False, buffer=None, num_workers=1, q_size=4, qc=None, report=True, **kwargs)

Extract tiles from a group of slides, saving extracted tiles either as loose images or in binary TFRecord format.

Parameters
  • save_tiles (bool, optional) – Save images of extracted tiles to project tile directory. Defaults to False.

  • save_tfrecords (bool, optional) – Save compressed image data from extracted tiles into TFRecords in the corresponding TFRecord directory. Defaults to True.

  • source (str, optional) – Name of dataset source from which to select slides for extraction. Defaults to None. If not provided, will default to all sources in project.

  • stride_div (int, optional) – Stride divisor for tile extraction. A stride_div of 1 will extract non-overlapping tiles. A stride_div of 2 will extract overlapping tiles, with a stride equal to 50% of the tile width. Defaults to 1.

  • enable_downsample (bool, optional) – Enable downsampling for slides. This may result in corrupted image tiles if downsampled slide layers are corrupted or incomplete. Defaults to True.

  • roi_method (str, optional) – Either ‘inside’, ‘outside’ or ‘ignore’. Indicates whether tiles are extracted inside or outside ROIs, or if ROIs are ignored entirely. Defaults to ‘inside’.

  • skip_missing_roi (bool, optional) – Skip slides missing ROIs. Defaults to False.

  • skip_extracted (bool, optional) – Skip slides that have already been extracted. Defaults to True.

  • tma (bool, optional) – Reads slides as tissue microarrays (TMAs), detecting and extracting cores. Defaults to False. Experimental function with limited testing.

  • randomize_origin (bool, optional) – Randomize pixel starting position during extraction. Defaults to False.

  • buffer (str, optional) – Slides will be copied to this directory before extraction. Defaults to None. Using an SSD or ramdisk buffer vastly improves tile extraction speed.

  • num_workers (int, optional) – Extract tiles from this many slides simultaneously. Defaults to 1.

  • q_size (int, optional) – Size of queue when using a buffer. Defaults to 4.

  • qc (str, optional) – ‘otsu’, ‘blur’, ‘both’, or None. Quality control method: blur detection (discarding tiles with detected out-of-focus regions or artifact), Otsu’s method, or both. Increases tile extraction time. Defaults to None.

  • report (bool, optional) – Save a PDF report of tile extraction. Defaults to True.
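
The relationship between stride_div and tile overlap described above can be sketched in one dimension. The helper tile_origins() is hypothetical and makes simplifying assumptions: coordinates are in pixels, partial tiles at the edge are dropped, and ROIs and randomize_origin are ignored.

```python
def tile_origins(slide_width_px, tile_px, stride_div=1):
    """Return the x-coordinates of tile origins along one axis."""
    # The stride is the tile width divided by the stride divisor.
    stride = tile_px // stride_div
    # Only keep origins where a full tile still fits inside the slide.
    return list(range(0, slide_width_px - tile_px + 1, stride))


print(tile_origins(1000, 250, stride_div=1))  # [0, 250, 500, 750]
print(tile_origins(1000, 250, stride_div=2))  # [0, 125, 250, 375, 500, 625, 750]
```

With stride_div=2, neighboring tiles overlap by half their width, so the number of tiles along each axis nearly doubles; in two dimensions this roughly quadruples the tile count.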

Keyword Arguments
  • normalizer (str, optional) – Normalization strategy. Defaults to None.

  • normalizer_source (str, optional) – Path to normalizer source image. If None, will use slideflow.slide.norm_tile.jpg. Defaults to None.

  • whitespace_fraction (float, optional) – Range 0-1. Discard tiles with this fraction of whitespace. If 1, will not perform whitespace filtering. Defaults to 1.

  • whitespace_threshold (int, optional) – Range 0-255. Defaults to 230. Threshold above which a pixel (RGB average) is whitespace.

  • grayspace_fraction (float, optional) – Range 0-1. Defaults to 0.6. Discard tiles with this fraction of grayspace. If 1, will not perform grayspace filtering.

  • grayspace_threshold (float, optional) – Range 0-1. Defaults to 0.05. Pixels in HSV format with saturation below this threshold are considered grayspace.

  • img_format (str, optional) – ‘png’ or ‘jpg’. Defaults to ‘jpg’. Image format to use in tfrecords. PNG (lossless) for fidelity, JPG (lossy) for efficiency.

  • full_core (bool, optional) – Only used if extracting from TMA. If True, will save entire TMA core as image. Otherwise, will extract sub-images from each core using the given tile micron size. Defaults to False.

  • shuffle (bool, optional) – Shuffle tiles prior to storage in tfrecords. Defaults to True.

  • num_threads (int, optional) – Number of worker threads for each tile extractor.

  • qc_blur_radius (int, optional) – Quality control blur radius for out-of-focus area detection. Only used if qc is ‘blur’ or ‘both’. Defaults to 3.

  • qc_blur_threshold (float, optional) – Quality control blur threshold for detecting out-of-focus areas. Only used if qc is ‘blur’ or ‘both’. Defaults to 0.1.

  • qc_filter_threshold (float, optional) – Float between 0-1. Tiles with more than this proportion of blur will be discarded. Only used if qc is ‘blur’ or ‘both’. Defaults to 0.6.

  • qc_mpp (float, optional) – Microns-per-pixel indicating image magnification level at which quality control is performed. Defaults to 4 (effective magnification 2.5X).

  • dry_run (bool, optional) – Determine tiles that would be extracted, but do not export any images. Defaults to None.
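
The whitespace and grayspace keyword arguments can be read as a simple per-tile test, sketched below with the standard library only. This is an interpretation of the parameter descriptions, not slideflow's implementation. The helper keep_tile() is hypothetical, and the exact comparisons (for instance, discarding a tile when its fraction is greater than or equal to the limit) are assumptions.

```python
import colorsys


def keep_tile(pixels, whitespace_fraction=1.0, whitespace_threshold=230,
              grayspace_fraction=0.6, grayspace_threshold=0.05):
    """Return True if a tile passes whitespace and grayspace filtering.

    `pixels` is a list of (R, G, B) tuples with values in 0-255.
    """
    n = len(pixels)
    # Whitespace: pixels whose RGB average exceeds the threshold.
    white = sum(1 for r, g, b in pixels if (r + g + b) / 3 > whitespace_threshold)
    # Grayspace: pixels whose HSV saturation falls below the threshold.
    gray = sum(
        1 for r, g, b in pixels
        if colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] < grayspace_threshold
    )
    # A fraction of 1 disables the corresponding filter.
    if whitespace_fraction < 1 and white / n >= whitespace_fraction:
        return False
    if grayspace_fraction < 1 and gray / n >= grayspace_fraction:
        return False
    return True


tissue = (200, 100, 150)      # saturated pixel: neither white nor gray
background = (250, 250, 250)  # bright, unsaturated pixel: white and gray

mostly_tissue = [tissue] * 8 + [background] * 2
mostly_background = [tissue] * 3 + [background] * 7

print(keep_tile(mostly_tissue))                             # True
print(keep_tile(mostly_background))                         # False (70% grayspace)
print(keep_tile(mostly_background, grayspace_fraction=1))   # True (filters disabled)
```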

extract_tiles_from_tfrecords(dest)

Extracts tiles from a set of TFRecords.

Parameters

dest (str) – Path to directory in which to save tile images. If None, uses dataset default. Defaults to None.

filter(*args, **kwargs)

Return a filtered dataset.

Keyword Arguments
  • filters (dict) – Filters dict to use when selecting tfrecords. See get_dataset() documentation for more information on filtering. Defaults to None.

  • filter_blank (list) – Exclude slides blank in these columns. Defaults to None.

  • min_tiles (int) – Filter out tfrecords that have fewer than this minimum number of tiles.

Returns

slideflow.dataset.Dataset object.

property filter_blank

Returns the active filter_blank filter, if any.

property filters

Returns the active filters, if any.

harmonize_labels(*args, header=None)

Returns categorical label assignments as int, harmonized with one or more other datasets to ensure labels are consistent between datasets.

Parameters
  • *args (slideflow.Dataset) – Any number of Datasets.

  • header (str) – Categorical annotation header.

Returns

Dict mapping slide names to categories.
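
Why harmonization matters can be shown with a small sketch. If two datasets assign integer labels independently, the same category may receive different integers in each. The helper harmonize() below is hypothetical and does not mirror the signature or return value of harmonize_labels(); it only illustrates the idea of building one shared category list before assigning integers.

```python
def harmonize(*label_dicts):
    """Map category names to ints using one shared, sorted category list."""
    categories = sorted({v for d in label_dicts for v in d.values()})
    lookup = {name: i for i, name in enumerate(categories)}
    return [{slide: lookup[v] for slide, v in d.items()} for d in label_dicts]


# Hypothetical labels: the evaluation set happens to contain one category only.
train_labels = {"s1": "positive", "s2": "negative"}
eval_labels = {"s3": "positive"}

train_ints, eval_ints = harmonize(train_labels, eval_labels)
print(train_ints)  # {'s1': 1, 's2': 0}
print(eval_ints)   # {'s3': 1}
```

Labeled on its own, the evaluation set would map ‘positive’ to 0, which would disagree with the training set's assignment of 1.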

is_float(header)

Returns True if labels in the given header can all be converted to float, else False.

labels(headers, use_float=False, assign=None, format='index')

Returns a dict of slide names mapped to patient id and label(s).

Parameters
  • headers (str or list(str)) – Annotation header(s). May be a list or string.

  • use_float (bool, dict, or str, optional) – If True, convert data into float; if unable, raise TypeError. If False, interpret all data as categorical. If a dict is provided, look up each header to determine type. If ‘auto’, will try to convert all data into float; for each header in which this fails, will interpret as categorical. Defaults to False.

  • assign (dict, optional) – Dictionary mapping label ids to label names. If not provided, will map ids to names by sorting alphabetically.

  • format (str, optional) – Either ‘index’ or ‘name’. Indicates which format should be used for categorical outcomes when returning the label dictionary. If ‘name’, uses the string label name. If ‘index’, returns an int (index corresponding with the returned list of unique outcomes as str). Defaults to ‘index’.

Returns

  1. Dictionary mapping slides to outcome labels in numerical format (float for linear outcomes, int of outcome label id for categorical outcomes).

  2. List of unique labels. For categorical outcomes, this will be a list of str; indices correspond with the outcome label id.
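
For categorical outcomes, the relationship between the two return values can be sketched as follows. The helper assign_labels() is hypothetical and covers only the default alphabetical assignment and the two format options described above; float conversion and the assign argument are not modeled.

```python
def assign_labels(raw, fmt="index"):
    """Return (labels, unique) for categorical outcomes.

    `raw` maps slide name -> category name (str).
    """
    # Label ids follow alphabetical order of the category names.
    unique = sorted(set(raw.values()))
    if fmt == "name":
        return dict(raw), unique
    index = {name: i for i, name in enumerate(unique)}
    return {slide: index[name] for slide, name in raw.items()}, unique


raw_labels = {"s1": "positive", "s2": "negative", "s3": "positive"}

by_index, unique_labels = assign_labels(raw_labels)
print(by_index)       # {'s1': 1, 's2': 0, 's3': 1}
print(unique_labels)  # ['negative', 'positive']

# The label id indexes into the list of unique labels.
print(unique_labels[by_index["s1"]])  # positive
```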

load_indices()

Reads TFRecord indices. Needed for PyTorch.

manifest(key='path', filter=True)

Generates a manifest of all tfrecords.

Parameters

key (str) – Either ‘path’ (default) or ‘name’. Determines key format in the manifest dictionary.

Returns

Dict mapping key (path or slide name) to number of tiles.

Return type

dict

property min_tiles

Returns the active min_tiles filter, if any (defaults to 0).

property num_tiles

Returns the total number of tiles in the tfrecords in this dataset, after filtering/clipping.

patients()

Returns a list of patient IDs from this dataset.

remove_filter(**kwargs)

Removes a specific filter from the active filters.

Keyword Arguments
  • filters (list of str) – Filter keys. Will remove filters with these keys.

  • filter_blank (list of str) – Will remove these headers stored in filter_blank.

Returns

slideflow.dataset.Dataset object.

resize_tfrecords(tile_px)

Resizes images in a set of TFRecords to a given pixel size.

Parameters

tile_px (int) – Target pixel size for resizing TFRecord images.

rois()

Returns a list of all ROIs.

slide_paths(source=None, apply_filters=True)

Returns a list of paths to either all slides, or slides matching dataset filters.

Parameters
  • source (str, optional) – Dataset source name. Defaults to None (using all sources).

  • apply_filters (bool, optional) – Return only slide paths meeting filter criteria. If False, return all slides. Defaults to True.

slides()

Returns a list of slide names in this dataset.

split_tfrecords_by_roi(destination)

Split dataset tfrecords into separate tfrecords according to ROI.

Will generate two sets of tfrecords, with identical names: one with tiles inside the ROIs, one with tiles outside the ROIs. Will skip any tfrecords that are missing ROIs. Requires slides to be available.

tensorflow(labels=None, batch_size=None, **kwargs)

Returns a Tensorflow Dataset object that interleaves tfrecords from this dataset.

The returned dataset yields a batch of (image, label) for each tile. Labels may be specified either via a dict mapping slide names to outcomes, or via a parsing function which accepts an image and slide name, returning a dict {‘image_raw’: image (tensor)} and label (int or float).

Parameters
  • labels (dict or function, optional) – If dict, must map slide names to outcome labels. If function, function must accept an image (tensor) and slide name (str), and return a dict {‘image_raw’: image (tensor)} and label (int or float). If not provided, all labels will be None.

  • batch_size (int) – Batch size.

Keyword Arguments
  • onehot (bool, optional) – Onehot encode labels. Defaults to False.

  • incl_slidenames (bool, optional) – Include slidenames as third returned variable. Defaults to False.

  • infinite (bool, optional) – Infinitely repeat data. Defaults to True.

  • rank (int, optional) – Worker ID to identify which worker this represents. Used to interleave results among workers without duplications. Defaults to 0 (first worker).

  • num_replicas (int, optional) – Number of GPUs or unique instances which will have their own DataLoader. Used to interleave results among workers without duplications. Defaults to 1.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on images.

  • seed (int, optional) – Use the following seed when randomly interleaving. Necessary for synchronized multiprocessing distributed reading.

  • chunk_size (int, optional) – Chunk size for image decoding. Defaults to 16.

  • preload_factor (int, optional) – Number of batches to preload. Defaults to 1.

  • augment (str, optional) – Image augmentations to perform. String containing characters designating augmentations. ‘x’ indicates random x-flipping, ‘y’ y-flipping, ‘r’ rotating, and ‘j’ JPEG compression/decompression at random quality levels. Passing either ‘xyrj’ or True will use all augmentations.

  • standardize (bool, optional) – Standardize images to (0,1). Defaults to True.

  • num_workers (int, optional) – Number of DataLoader workers. Defaults to 2.

  • deterministic (bool, optional) – When num_parallel_calls is specified, if this boolean is specified (True or False), it controls the order in which the transformation produces elements. If set to False, the transformation is allowed to yield elements out of order to trade determinism for performance. Defaults to False.

  • drop_last (bool, optional) – Drop the last non-full batch. Defaults to False.
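
The augment argument packs several options into one string, so a small sketch of how such a string might be interpreted may help. The function parse_augment() and the readable names in AUGMENTATIONS are hypothetical and not part of slideflow. The ‘b’ flag is included because the torch() method below documents it; it is rejected under the default allowed set used here.

```python
# Hypothetical readable names for each documented single-character flag.
AUGMENTATIONS = {
    "x": "flip_x",
    "y": "flip_y",
    "r": "rotate",
    "j": "jpeg",
    "b": "blur",
}


def parse_augment(augment, allowed="xyrj"):
    """Return the list of augmentation names selected by `augment`."""
    if augment is True:
        # True is shorthand for every allowed augmentation.
        augment = allowed
    elif not augment:
        # None, False, or an empty string: no augmentation.
        augment = ""
    unknown = set(augment) - set(allowed)
    if unknown:
        raise ValueError(f"Unsupported augmentation flags: {sorted(unknown)}")
    return [AUGMENTATIONS[flag] for flag in allowed if flag in augment]


print(parse_augment("rx"))   # ['flip_x', 'rotate']
print(parse_augment(True))   # ['flip_x', 'flip_y', 'rotate', 'jpeg']
print(parse_augment(None))   # []
```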

tfrecord_report(dest, normalizer=None)

Creates a PDF report of TFRecords, including 10 example tiles per TFRecord.

Parameters
  • dest (str) – Directory in which to save the PDF report.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on image tiles. Defaults to None.

tfrecords(source=None)

Returns a list of all tfrecords.

Parameters

source (str, optional) – Only return tfrecords from this dataset source. Defaults to None (return all tfrecords in dataset).

Returns

List of tfrecords paths

tfrecords_by_subfolder(subfolder)

Returns a list of all tfrecords in a specific subfolder, ignoring filters.

tfrecords_folders()

Returns folders containing tfrecords.

tfrecords_from_tiles(delete_tiles=False)

Create tfrecord files from a collection of raw images, as stored in the project tiles directory.

torch(labels, batch_size=None, rebuild_index=False, **kwargs)

Returns a PyTorch DataLoader object that interleaves tfrecords.

The returned dataloader yields a batch of (image, label) for each tile.

Parameters
  • labels (dict or str) – If a dict is provided, expect a dict mapping slide names to outcome labels. If a str, will interpret as categorical annotation header. For linear outcomes, or outcomes with manually assigned labels, pass the first result of dataset.labels(…). If None, returns slide instead of label.

  • batch_size (int) – Batch size.

  • rebuild_index (bool) – Re-build index files even if already present. Defaults to False.

Keyword Arguments
  • onehot (bool, optional) – Onehot encode labels. Defaults to False.

  • incl_slidenames (bool, optional) – Include slidenames as third returned variable. Defaults to False.

  • infinite (bool, optional) – Infinitely repeat data. Defaults to True.

  • rank (int, optional) – Worker ID to identify which worker this represents. Used to interleave results among workers without duplications. Defaults to 0 (first worker).

  • num_replicas (int, optional) – Number of GPUs or unique instances which will have their own DataLoader. Used to interleave results among workers without duplications. Defaults to 1.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on images. Defaults to None.

  • seed (int, optional) – Use the following seed when randomly interleaving. Necessary for synchronized multiprocessing.

  • chunk_size (int, optional) – Chunk size for image decoding. Defaults to 16.

  • preload_factor (int, optional) – Number of batches to preload. Defaults to 1.

  • augment (str, optional) – Image augmentations to perform. Str containing characters designating augmentations. ‘x’ indicates random x-flipping, ‘y’ y-flipping, ‘r’ rotating, ‘j’ JPEG compression/decompression at random quality levels, and ‘b’ random gaussian blur. Passing either ‘xyrjb’ or True will use all augmentations. Defaults to ‘xyrjb’.

  • standardize (bool, optional) – Standardize images to (0,1). Defaults to True.

  • num_workers (int, optional) – Number of DataLoader workers. Defaults to 2.

  • pin_memory (bool, optional) – Pin memory to GPU. Defaults to True.

  • drop_last (bool, optional) – Drop the last non-full batch. Defaults to False.
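
The roles of rank, num_replicas, and seed can be illustrated with a generic sharding sketch. This shows one common way to split work among workers without duplication. It is an assumption made for illustration; slideflow's actual interleaving logic may differ. The function shard() is hypothetical.

```python
import random


def shard(items, rank=0, num_replicas=1, seed=0):
    """Return the subset of `items` handled by worker `rank`."""
    order = list(items)
    # Every worker shuffles with the same seed, so all agree on the order.
    random.Random(seed).shuffle(order)
    # Worker `rank` takes every num_replicas-th item, starting at its own offset.
    return order[rank::num_replicas]


tfrecord_names = [f"tfrecord{i}" for i in range(7)]
shards = [shard(tfrecord_names, rank=r, num_replicas=3, seed=42) for r in range(3)]

# Together the shards cover every item exactly once.
combined = [name for s in shards for name in s]
print(sorted(combined) == sorted(tfrecord_names))  # True
```

If the workers used different seeds, their shuffled orders would disagree and the shards could overlap, which is why a shared seed is described above as necessary for synchronized multiprocessing.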

train_val_split(model_type, labels, val_strategy, splits=None, val_fraction=None, val_k_fold=None, k_fold_iter=None, site_labels=None, read_only=False)

From a specified subfolder in the project’s main TFRecord folder, prepare a training set and validation set.

If a validation split has already been prepared (e.g. K-fold iterations were already determined), the previously generated split will be used. Otherwise, create a new split and log the result in the TFRecord directory so future models may use the same split for consistency.

Parameters
  • model_type (str) – Either ‘categorical’ or ‘linear’.

  • labels (dict) – Dictionary mapping slides to labels. Used for balancing outcome labels in training and validation cohorts.

  • val_strategy (str) – Either ‘k-fold’, ‘k-fold-preserved-site’, ‘bootstrap’, or ‘fixed’.

  • splits (str, optional) – Path to JSON file containing validation splits. Defaults to None.

  • outcome_key (str, optional) – Key indicating outcome label in slide_labels_dict. Defaults to ‘outcome_label’.

  • val_fraction (float, optional) – Proportion of data for validation. Not used if strategy is k-fold. Defaults to None.

  • val_k_fold (int) – K, required if using K-fold validation. Defaults to None.

  • k_fold_iter (int, optional) – Which K-fold iteration to generate, starting at 1. Required if using K-fold validation. Defaults to None.

  • site_labels (dict, optional) – Dict mapping patients to site labels. Used for site preserved cross validation.

  • read_only (bool) – Prevents writing validation splits to file. Defaults to False.

Returns

Two datasets: the training dataset, followed by the validation dataset.

Return type

slideflow.Dataset, slideflow.Dataset
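
The meaning of val_k_fold and the 1-based k_fold_iter can be sketched with a minimal K-fold partition. This is a deliberately simplified illustration and not slideflow's algorithm: the function k_fold_split() is hypothetical, it does not balance outcome labels, it ignores patients and sites, and it does not save the split to disk.

```python
import random


def k_fold_split(slides, val_k_fold, k_fold_iter, seed=0):
    """Return (training slides, validation slides) for one K-fold iteration."""
    if not 1 <= k_fold_iter <= val_k_fold:
        raise ValueError("k_fold_iter must be between 1 and val_k_fold")
    # Sort first so that the shuffle is reproducible for a given seed.
    order = sorted(slides)
    random.Random(seed).shuffle(order)
    # Deal slides into K folds in round-robin fashion.
    folds = [order[i::val_k_fold] for i in range(val_k_fold)]
    # k_fold_iter starts at 1, so subtract 1 to index the validation fold.
    val = folds[k_fold_iter - 1]
    train = [s for i, fold in enumerate(folds) if i != k_fold_iter - 1 for s in fold]
    return train, val


slide_names = [f"slide{i}" for i in range(10)]
train, val = k_fold_split(slide_names, val_k_fold=3, k_fold_iter=1)
print(len(train), len(val))  # 6 4
```

Because the shuffle is seeded, repeated calls with the same arguments return the same split, which loosely mirrors the goal stated above of reusing a previously generated split for consistency.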

unclip()

Returns a dataset object with all clips removed.

Returns

slideflow.dataset.Dataset object.

update_annotations_with_slidenames(annotations_file)

Attempts to automatically associate slide names from a directory with patients in a given annotations file, skipping any slide names that are already present in the annotations file.

update_manifest(force_update=False)

Updates tfrecord manifest.

Parameters

force_update (bool, optional) – Force regeneration of the manifest from scratch.

verify_annotations_slides()

Verify that annotations are correctly loaded.