
slideflow.io.torch

This module provides a performant, backend-agnostic TFRecord reader and interleaver for use as input to PyTorch models. Its TFRecord reader is a modified and optimized version of https://github.com/vahidk/tfrecord, included as the module slideflow.tfrecord. TFRecord file reading and interleaving are handled by slideflow.io.torch.interleave(), while slideflow.io.torch.interleave_dataloader() returns a PyTorch DataLoader object that can be used directly.

class slideflow.io.torch.InterleaveIterator(tfrecords, img_size, labels=None, incl_slidenames=False, incl_loc=False, rank=0, num_replicas=1, augment=False, standardize=True, num_tiles=None, infinite=True, max_size=None, prob_weights=None, normalizer=None, clip=None, chunk_size=16, preload=8, use_labels=True, model_type='categorical', onehot=False, indices=None, device=None)

PyTorch IterableDataset that interleaves tfrecords with the interleave() function below. Serves as a bridge between the Python generator returned by interleave() and the PyTorch DataLoader class.

__init__(tfrecords, img_size, labels=None, incl_slidenames=False, incl_loc=False, rank=0, num_replicas=1, augment=False, standardize=True, num_tiles=None, infinite=True, max_size=None, prob_weights=None, normalizer=None, clip=None, chunk_size=16, preload=8, use_labels=True, model_type='categorical', onehot=False, indices=None, device=None)

PyTorch IterableDataset that interleaves tfrecords with slideflow.io.torch.interleave().

Parameters
  • tfrecords (list(str)) – Path to tfrecord files to interleave.

  • img_size (int) – Image width in pixels.

  • labels (dict, optional) – Dict mapping slide names to labels. Defaults to None.

  • incl_slidenames (bool, optional) – Include slide names when iterated (returns image, label, slide). Defaults to False.

  • incl_loc (bool, optional) – Include location info. Returns samples in the form (…, loc_x, loc_y). Defaults to False.

  • rank (int, optional) – Which GPU replica this dataset is used for. Assists with synchronization across GPUs. Defaults to 0.

  • num_replicas (int, optional) – Total number of GPU replicas. Defaults to 1.

  • augment (str or bool, optional) – Image augmentations to perform. If a string, ‘x’ performs horizontal flipping, ‘y’ performs vertical flipping, ‘r’ performs rotation, and ‘j’ performs random JPEG compression (e.g. ‘xyr’, ‘xyrj’, ‘xy’). If a bool, True performs all augmentations and False performs none. Defaults to False.

  • standardize (bool, optional) – Standardize images to mean 0 and variance of 1. Defaults to True.

  • num_tiles (dict, optional) – Dict mapping tfrecord names to number of total tiles. Defaults to None.

  • infinite (bool, optional) – Infinitely loop through dataset. Defaults to True.

  • max_size (int, optional) – Artificially limit dataset size, useful for metrics. Defaults to None.

  • prob_weights (list(float), optional) – Probability weights for interleaving tfrecords. Defaults to None.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer. Defaults to None.

  • clip (list(int), optional) – Array of maximum tiles to take for each tfrecord. Defaults to None.

  • chunk_size (int, optional) – Chunk size for image decoding. Defaults to 16.

  • preload (int, optional) – Preload this many samples for parallelization. Defaults to 8.

  • use_labels (bool, optional) – Enable use of labels (disabled for non-conditional GANs). Defaults to True.

  • model_type (str, optional) – Used to generate random labels (for StyleGAN2). Not required. Defaults to ‘categorical’.

  • onehot (bool, optional) – Onehot encode outcomes. Defaults to False.

  • indices (numpy.ndarray, optional) – Indices in form of array, with np.loadtxt(index_path, dtype=np.int64) for each tfrecord. Defaults to None.
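The augment argument accepts either a string of flags or a bool. As a minimal sketch of that convention, the hypothetical helper parse_augment below (not part of slideflow) expands such a value into one flag per augmentation; how slideflow interprets the value internally is not shown here.

```python
def parse_augment(augment):
    """Expand an augment value into per-augmentation flags.

    Hypothetical helper for illustration (not part of slideflow). Follows the
    documented convention: 'x' = horizontal flip, 'y' = vertical flip,
    'r' = rotation, 'j' = random JPEG compression. True enables all
    augmentations and False enables none.
    """
    names = {
        "x": "flip_horizontal",
        "y": "flip_vertical",
        "r": "rotate",
        "j": "jpeg_compression",
    }
    if isinstance(augment, bool):
        return {name: augment for name in names.values()}
    if not isinstance(augment, str):
        raise TypeError("augment must be a str or bool")
    unknown = set(augment) - set(names)
    if unknown:
        raise ValueError(f"Unknown augmentation flags: {sorted(unknown)}")
    return {name: (flag in augment) for flag, name in names.items()}


print(parse_augment("xyr"))  # flips and rotation on, JPEG compression off
print(parse_augment(True))   # every augmentation on
```

Rejecting unknown characters is a design choice of this sketch, so that a typo such as ‘xz’ fails loudly instead of being ignored.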

get_label(idx)

Returns a random label. Used for compatibility with StyleGAN2.

property label_shape

For use with StyleGAN2

reinforce_type(expected_type)

Reinforce the type for a DataPipe instance. The ‘expected_type’ must be a subtype of the original type hint, which restricts the type requirement of the DataPipe instance.

slideflow.io.torch.detect_tfrecord_format(tfr)

Detects tfrecord format.

Parameters

tfr (str) – Path to tfrecord.

Returns
  • str: Image file type (png/jpeg)

  • dict: Feature description dictionary (including or excluding location data as supported)

Return type

str, dict
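The documentation above does not say how the image type is determined. As an illustration only, one common way to tell PNG from JPEG is to inspect the leading bytes of the raw image data (such as the image_raw value stored in a record). The helper sniff_image_type is hypothetical and not part of slideflow.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker


def sniff_image_type(image_raw: bytes) -> str:
    """Return 'png' or 'jpeg' based on the leading bytes of an encoded image.

    Hypothetical helper for illustration; this is an assumption about one
    possible detection strategy, not slideflow's implementation.
    """
    if image_raw.startswith(PNG_MAGIC):
        return "png"
    if image_raw.startswith(JPEG_MAGIC):
        return "jpeg"
    raise ValueError("Unrecognized image format (expected PNG or JPEG)")


print(sniff_image_type(PNG_MAGIC + b"rest-of-file"))
print(sniff_image_type(JPEG_MAGIC + b"\xe0rest-of-file"))
```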

slideflow.io.torch.get_tfrecord_parser(tfrecord_path, features_to_return=None, decode_images=True, standardize=False, normalizer=None, augment=False, **kwargs)

Gets a tfrecord parser using the dareblopy reader. Torch implementation; differs from sf.io.tensorflow.

Parameters
  • tfrecord_path (str) – Path to tfrecord to parse.

  • features_to_return (list or dict, optional) – Designates format for how features should be returned from parser. If a list of feature names is provided, the parsing function will return tfrecord features as a list in the order provided. If a dictionary of labels (keys) mapping to feature names (values) is provided, features will be returned from the parser as a dictionary matching the same format. If None, will return all features as a list.

  • decode_images (bool, optional) – Decode raw image strings into image arrays. Defaults to True.

  • standardize (bool, optional) – Standardize images into the range (0,1). Defaults to False.

  • normalizer (slideflow.norm.StainNormalizer) – Stain normalizer to use on images. Defaults to None.

  • augment (str or bool, optional) – Image augmentations to perform. String containing characters designating augmentations. ‘x’ indicates random x-flipping, ‘y’ y-flipping, ‘r’ rotating, and ‘j’ JPEG compression/decompression at random quality levels. Passing either ‘xyrj’ or True will use all augmentations. Defaults to False.

Returns
  • func: Parsing function

  • dict: Detected feature description for the tfrecord

Return type

func, dict
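The three accepted forms of features_to_return (list, dict, or None) determine the shape of what the parsing function returns. The sketch below imitates that behavior on an already-parsed record. The helper format_features is hypothetical, and the feature names (slide, image_raw, loc_x, loc_y) are assumed from the serialized_record() signature.

```python
def format_features(record, features_to_return=None):
    """Arrange parsed features as a list or dict, per features_to_return.

    Hypothetical helper for illustration (not part of slideflow).
    `record` is a dict mapping feature names to parsed values.
    """
    if features_to_return is None:
        # No selection given: return every feature as a list.
        return list(record.values())
    if isinstance(features_to_return, dict):
        # Keys are output labels; values are the feature names to look up.
        return {label: record[name] for label, name in features_to_return.items()}
    # Otherwise treat it as a list of feature names, returned in that order.
    return [record[name] for name in features_to_return]


# Example record; the values are made up for demonstration.
record = {"slide": "slide_A", "image_raw": b"<bytes>", "loc_x": 10, "loc_y": 20}

print(format_features(record, ["loc_x", "slide"]))
print(format_features(record, {"name": "slide", "x": "loc_x"}))
print(format_features(record))
```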

slideflow.io.torch.interleave(tfrecords, prob_weights=None, incl_loc=False, clip=None, infinite=True, augment=False, standardize=True, normalizer=None, num_threads=4, chunk_size=8, num_replicas=1, rank=0, indices=None, device=None)

Returns a generator that interleaves records from a collection of tfrecord files, sampling from tfrecord files randomly according to balancing if provided (requires manifest). Assumes TFRecord files are named by slide.

Differs from the TensorFlow backend implementation (sf.io.tensorflow). Supports PyTorch. Use interleave_dataloader for the torch DataLoader class; use this function directly to get images from a generator with no PyTorch data processing.

Parameters
  • tfrecords (list(str)) – List of paths to TFRecord files.

  • prob_weights (dict, optional) – Dict mapping tfrecords to probability of including in batch. Defaults to None.

  • incl_loc (bool, optional) – Include loc_x and loc_y as additional returned variables. Defaults to False.

  • clip (dict, optional) – Dict mapping tfrecords to number of tiles to take per tfrecord. Defaults to None.

  • infinite (bool, optional) – Create an infinite dataset. WARNING: If infinite is False and balancing is used, some tiles will be skipped. Defaults to True.

  • labels (dict, optional) – Dict mapping slide names to outcome labels, used for balancing. Defaults to None.

  • augment (str or bool, optional) – Image augmentations to perform. String containing characters designating augmentations. ‘x’ indicates random x-flipping, ‘y’ y-flipping, ‘r’ rotating, and ‘j’ JPEG compression/decompression at random quality levels. Passing either ‘xyrj’ or True will use all augmentations. Defaults to False.

  • standardize (bool, optional) – Standardize images to (0,1). Defaults to True.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on images. Defaults to None.

  • manifest (dict, optional) – Dataset manifest containing number of tiles per tfrecord.

  • num_threads (int, optional) – Number of threads to use decoding images. Defaults to 4.

  • chunk_size (int, optional) – Chunk size for image decoding. Defaults to 8.

  • num_replicas (int, optional) – Total number of workers reading the dataset with this interleave function, defined as the number of GPUs multiplied by the number of torch DataLoader workers. Used to interleave results among workers without duplication. Defaults to 1.

  • rank (int, optional) – Worker ID to identify which worker this represents. Used to interleave results among workers without duplications. Defaults to 0 (first worker).
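To make the roles of prob_weights, infinite, num_replicas, and rank concrete, here is a minimal pure-Python sketch of weighted interleaving with per-worker sharding. It is an illustration under stated assumptions, not slideflow's implementation: in-memory lists stand in for tfrecord files, the function name interleave_sketch and its seed argument are hypothetical, and in finite mode this sketch drains every source rather than skipping tiles.

```python
import random
from itertools import islice


def interleave_sketch(sources, prob_weights=None, infinite=True,
                      num_replicas=1, rank=0, seed=0):
    """Yield (source_name, record) pairs sampled from several sources.

    Illustrative only. `sources` maps a name to a list of records,
    standing in for tfrecord files.
    """
    done = object()  # sentinel marking an exhausted source
    # Every worker uses the same seed, so all workers walk the same global
    # sequence and each keeps only the positions assigned to its rank.
    rng = random.Random(seed)
    names = [n for n in sources if len(sources[n]) > 0]
    weights = [1.0 if prob_weights is None else prob_weights[n] for n in names]
    iterators = {n: iter(sources[n]) for n in names}
    position = 0
    while names and sum(weights) > 0:
        # Pick the next source at random, in proportion to its weight.
        name = rng.choices(names, weights=weights, k=1)[0]
        record = next(iterators[name], done)
        if record is done:
            if infinite:
                # Restart the exhausted source and keep going.
                iterators[name] = iter(sources[name])
                record = next(iterators[name])
            else:
                # Finite mode: retire the exhausted source.
                i = names.index(name)
                del names[i], weights[i]
                continue
        if position % num_replicas == rank:
            yield name, record
        position += 1


sources = {"slide_A": [1, 2, 3], "slide_B": [4, 5]}
# Two workers reading a finite dataset receive disjoint records.
for worker in range(2):
    shard = list(interleave_sketch(sources, infinite=False,
                                   num_replicas=2, rank=worker))
    print(worker, shard)
# An infinite stream must be truncated by the caller.
print(list(islice(interleave_sketch(sources), 6)))
```

Sharing one seed across workers is the design choice that prevents duplicates here: each worker computes the same global order and keeps only every num_replicas-th position.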

slideflow.io.torch.interleave_dataloader(tfrecords, img_size, batch_size, *, num_replicas=1, labels=None, preload_factor=1, num_workers=2, pin_memory=True, persistent_workers=True, drop_last=False, **kwargs)

Prepares a PyTorch DataLoader with a new InterleaveIterator instance, interleaving tfrecords and processing labels and tiles, with support for scaling the dataset across GPUs and dataset workers.

Parameters
  • tfrecords (list(str)) – List of paths to TFRecord files.

  • img_size (int) – Tile size in pixels.

  • batch_size (int) – Batch size.

Keyword Arguments
  • prob_weights (dict, optional) – Dict mapping tfrecords to probability of including in batch. Defaults to None.

  • clip (dict, optional) – Dict mapping tfrecords to number of tiles to take per tfrecord. Defaults to None.

  • onehot (bool, optional) – Onehot encode labels. Defaults to False.

  • incl_slidenames (bool, optional) – Include slidenames as third returned variable. Defaults to False.

  • incl_loc (bool, optional) – Include loc_x and loc_y as additional returned variables. Defaults to False.

  • infinite (bool, optional) – Infinitely repeat data. Defaults to True.

  • rank (int, optional) – Worker ID to identify this worker. Used to interleave results among workers without duplication. Defaults to 0 (first worker).

  • num_replicas (int, optional) – Number of GPUs or unique instances which will have their own DataLoader. Used to interleave results among workers without duplications. Defaults to 1.

  • labels (dict, optional) – Dict mapping slide names to outcome labels, used for balancing. Defaults to None.

  • normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on images. Defaults to None.

  • chunk_size (int, optional) – Chunk size for image decoding. Defaults to 16.

  • preload_factor (int, optional) – Number of batches to preload in each InterleaveIterator. Defaults to 1.

  • manifest (dict, optional) – Dataset manifest containing number of tiles per tfrecord.

  • balance (str, optional) – Batch-level balancing. Options: category, patient, and None. If infinite is not True, will drop tiles to maintain proportions across the interleaved dataset.

  • augment (str, optional) – Image augmentations to perform. String containing characters designating augmentations. ‘x’ indicates random x-flipping, ‘y’ y-flipping, ‘r’ rotating, and ‘j’ JPEG compression/decompression at random quality levels. Passing either ‘xyrj’ or True will use all augmentations.

  • standardize (bool, optional) – Standardize images to (0,1). Defaults to True.

  • num_workers (int, optional) – Number of DataLoader workers. Defaults to 2.

  • persistent_workers (bool, optional) – Sets the DataLoader persistent_workers flag. Defaults to True.

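  • pin_memory (bool, optional) – Sets the DataLoader pin_memory flag, which places batches in pinned (page-locked) host memory for faster transfer to the GPU. Defaults to True.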

  • drop_last (bool, optional) – Drop the last non-full batch. Defaults to False.
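A typical call might look like the sketch below. It uses only the parameter names documented above. The tile size, batch size, file path, and labels are hypothetical placeholders, and build_dataloader is a wrapper introduced here for illustration. slideflow is imported inside the function so that the snippet can be loaded on a machine without slideflow installed.

```python
# Keyword arguments taken from the documented signature; values are examples.
LOADER_KWARGS = dict(
    infinite=False,        # stop after one pass, e.g. for evaluation
    augment="xyr",         # flips and rotation, no JPEG compression
    standardize=True,
    incl_slidenames=True,  # also return the slide name for each tile
    num_workers=2,
    pin_memory=True,
    drop_last=False,
)


def build_dataloader(tfrecords, labels=None, img_size=299, batch_size=32):
    """Create a DataLoader over interleaved tfrecords (hypothetical wrapper)."""
    # Lazy import: requires slideflow with the PyTorch backend at call time.
    from slideflow.io.torch import interleave_dataloader

    return interleave_dataloader(
        tfrecords, img_size, batch_size, labels=labels, **LOADER_KWARGS
    )


# Example usage (the path and labels below are hypothetical):
#
#   loader = build_dataloader(
#       ["/data/tfrecords/slide_A.tfrecords"],
#       labels={"slide_A": 0},
#   )
#   for images, labels, slides in loader:
#       ...  # forward pass, metrics, etc.
```

With incl_slidenames=True, each batch is expected to unpack as (image, label, slide), following the incl_slidenames description for InterleaveIterator.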

slideflow.io.torch.serialized_record(slide, image_raw, loc_x=0, loc_y=0)

Returns a serialized example for TFRecord storage, ready to be written by a TFRecordWriter.