Partition based methods¶
Module for Partition-Based Selection Methods.
- class selector.partition.DirectedSphereExclusion(r0=None, ref_index=0, p=2.0, eps=0.0, tol=0.05, n_iter=10, random_seed=42)¶
Select samples using Directed Sphere Exclusion (DISE) algorithm.
In a nutshell, this algorithm iteratively excludes any sample within a given radius from any already selected sample. The radius of the exclusion sphere is an adjustable parameter. Compared to Sphere Exclusion algorithm, the Directed Sphere Exclusion algorithm achieves a more evenly distributed subset selection by abandoning the random selection approach and instead imposing a directed selection.
Reference sample is chosen based on the ref_index, which is excluded from the selected subset. All samples are sorted (ascending order) based on their Minkowski p-norm distance from the reference sample. Looping through sorted samples, the sample is selected if it is not already excluded. If selected, all its neighboring samples within a sphere of radius r (i.e., exclusion sphere) are excluded from being selected. When the selected number of points is greater than specified subset size, the selection process terminates. The r0 is used as the initial radius of exclusion sphere, however, it is optimized to select the desired number of samples.
Gobbi, A., and Lee, M.-L. (2002). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317–323. https://doi.org/10.1021/ci025554v
- algorithm(X, max_size)¶
Return selected samples based on directed sphere exclusion algorithm.
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- max_size: int
Maximum number of samples to select.
- selected: list
List of indices of selected samples.
- select_from_cluster(X, size, cluster_ids=None)¶
Return selected samples from a cluster based on directed sphere exclusion algorithm
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- size: int
Number of samples to be selected.
- cluster_ids: np.ndarray
Indices of samples that form a cluster.
- selected: list
List of indices of selected samples.
- class selector.partition.GridPartitioning(cells, grid_method='equisized_independent', max_dim=None, random_seed=42)¶
Selecting points using the Grid Partitioning algorithm.
Points are partitioned into grids using an algorithm (equisized independent or equisized dependent). A point is selected from each of the grids while the number of selected points is less than the number requested and while the grid has available points remaining, looping until the number of requested points is satisfied. If at the end, the number of points needed is less than the number of grids available to select from, the points are chosen from the grids with the greatest diversity.
Adapted from https://doi.org/10.1016/S1093-3263(99)00016-9.
- select_from_cluster(arr, num_selected, cluster_ids=None)¶
Grid partitioning algorithm for selecting points from cluster.
- arr: np.ndarray
Coordinate array of points
- num_selected: int
Number of molecules that need to be selected.
- cluster_ids: np.ndarray
Indices of molecules that form a cluster
- selected: list
List of ids of selected molecules
- class selector.partition.Medoid(start_id=0, func_distance=<function Medoid.<lambda>>, scaling=10)¶
Selecting points using an algorithm adapted from KDTree.
Points are initially used to construct a KDTree. Eucleidean distances are used for this algorithm. The first point selected is based on the starting_idx provided and becomes the first query point. An approximation of the furthest point to the query point is found using find_furthest_neighbor and is selected. find_nearest_neighbor is then done to eliminate close neighbors to the new selected point. Medoid is then calculated from previously selected points and is used as the new query point for find_furthest_neighbor, repeating the process. Terminates upon selecting requested number of points or if all available points exhausted.
Adapted from: https://en.wikipedia.org/wiki/K-d_tree#Construction
- select_from_cluster(arr, num_selected, cluster_ids=None)¶
Main function for selecting points using the KDTree algorithm.
- arr: np.ndarray
Coordinate array of points
- num_selected: int
Number of molecules that need to be selected.
- cluster_ids: np.ndarray
Indices of molecules that form a cluster
- selected: list
List of ids of selected molecules