Dissimilarity-based methods
Module for Dissimilarity-Based Selection Methods.
- class selector.dissimilarity.MaxMin(func_distance=None)
Select samples using the MaxMin algorithm.
MaxMin is possibly the most widely used method for dissimilarity-based compound selection. Given a dataset of samples, the initial point is chosen as the dataset's medoid. The second point is the one furthest from this initial point. Each subsequent point is then selected as follows:
1. For every unselected point, find the minimum distance to the already-selected points.
2. Select the point for which this minimum distance is largest.
[1] Ashton, Mark, et al., Identification of diverse database subsets using property‐based and fragment‐based molecular descriptions, Quantitative Structure‐Activity Relationships 21.6 (2002): 598-604.
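The two steps above can be sketched in plain NumPy. This is an illustrative sketch only, not the library's implementation: the function name `maxmin_select` is hypothetical, the input is assumed to be a square pairwise distance matrix, and the medoid is assumed to be the point with the smallest total distance to all others.

```python
import numpy as np


def maxmin_select(dist, size):
    """Greedy MaxMin selection on a square pairwise distance matrix (sketch)."""
    dist = np.asarray(dist, dtype=float)
    if size > len(dist):
        raise ValueError("size exceeds the number of samples")
    # Assumption: the medoid is the point with the smallest summed distance.
    start = int(np.argmin(dist.sum(axis=0)))
    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = dist[start].copy()
    while len(selected) < size:
        min_dist[selected] = -np.inf  # never pick an already-selected point
        new = int(np.argmax(min_dist))  # largest minimum distance wins
        selected.append(new)
        min_dist = np.minimum(min_dist, dist[new])
    return selected


# Five points on a line; index 2 is the medoid.
x = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])
print(maxmin_select(dist, 3))  # [2, 4, 3]
```

Keeping a running `min_dist` array means each new selection costs one pass over the points instead of a rescan of all selected pairs.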
- select_from_cluster(X, size, labels=None)
Return selected samples from a cluster based on the MaxMin algorithm.
- X: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in an n_features-dimensional feature space. If func_distance is None, X is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray
Indices of samples that form a cluster.
- selected: list
List of indices of selected samples.
- class selector.dissimilarity.MaxSum(func_distance=None)
Select samples using the MaxSum algorithm.
Whereas the MaxMin algorithm aims to maximize the minimum distance between any pair of distinct elements in the selected subset, the MaxSum algorithm aims to maximize the sum of distances between all pairs of elements in the selected subset. Given a dataset of samples, the initial point is chosen as the dataset's medoid. The second point is the one furthest from this initial point. Each subsequent point is then selected as follows:
1. For every unselected point, compute the sum of its distances to the already-selected points.
2. Select the point for which this sum is largest.
[1] Borodin, Allan, Hyun Chul Lee, and Yuli Ye, Max-sum diversification, monotone submodular functions and dynamic updates, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 2012.
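A comparable NumPy sketch of the MaxSum steps follows. As before, this is an assumption-laden illustration rather than the library's code: `maxsum_select` is a hypothetical name, the input is assumed to be a square pairwise distance matrix, and the medoid is assumed to be the point with the smallest total distance to all others.

```python
import numpy as np


def maxsum_select(dist, size):
    """Greedy MaxSum selection on a square pairwise distance matrix (sketch)."""
    dist = np.asarray(dist, dtype=float)
    if size > len(dist):
        raise ValueError("size exceeds the number of samples")
    # Assumption: the medoid is the point with the smallest summed distance.
    start = int(np.argmin(dist.sum(axis=0)))
    selected = [start]
    # sum_dist[i] holds the summed distance from point i to all selected points.
    sum_dist = dist[start].copy()
    while len(selected) < size:
        candidates = sum_dist.copy()
        candidates[selected] = -np.inf  # never pick an already-selected point
        new = int(np.argmax(candidates))  # largest summed distance wins
        selected.append(new)
        sum_dist += dist[new]
    return selected


# Same five points on a line as a small demonstration; index 2 is the medoid.
x = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])
print(maxsum_select(dist, 3))  # [2, 4, 0]
```

On this input the third pick is the point at 0, the extreme on the opposite side from the point at 10, because it has the largest summed distance to the two points already chosen. A criterion based on the minimum distance would instead favour the point at 5.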
- select_from_cluster(X, size, labels=None)
Return selected samples from a cluster based on the MaxSum algorithm.
- X: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in an n_features-dimensional feature space. If func_distance is None, X is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray
Indices of samples that form a cluster.
- selected: list
List of indices of selected samples.
- class selector.dissimilarity.OptiSim(r0=None, k=10, tol=0.01, eps=0, p=2, start_id=0, random_seed=42, n_iter=10)
Select samples using the OptiSim algorithm.
The OptiSim algorithm selects samples from a dataset by first choosing the medoid as the initial point. Next, points are drawn at random and added to a subsample if they lie more than a distance r from all previously selected points (otherwise, they are discarded). Once k points have been added to the subsample, the subsample point with the greatest minimum distance to the previously selected points is chosen. The subsample is then cleared and the process is repeated.
[1] J. Chem. Inf. Comput. Sci. 1997, 37, 6, 1181–1188. https://doi.org/10.1021/ci970282v
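The procedure described above can be sketched as follows. This is a simplified illustration under several assumptions, not the library's implementation: the function name `optisim_select` and its parameters `r` and `seed` are hypothetical, the radius is fixed rather than derived from `r0`, `tol`, or `n_iter`, distances are Euclidean, and subsample points that are not chosen are assumed to return to the candidate pool when the subsample is cleared.

```python
import numpy as np


def optisim_select(X, max_size, r, k=10, seed=42):
    """Simplified OptiSim-style selection with a fixed exclusion radius (sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Euclidean pairwise distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Assumption: the medoid is the point with the smallest summed distance.
    start = int(np.argmin(dist.sum(axis=0)))
    selected = [start]
    # Remaining points in random order; pop() draws the next random candidate.
    candidates = [int(i) for i in rng.permutation(len(X)) if i != start]
    while candidates and len(selected) < max_size:
        subsample = []
        while candidates and len(subsample) < k:
            idx = candidates.pop()
            # Keep the point only if it lies outside radius r of every selection.
            if dist[idx, selected].min() > r:
                subsample.append(idx)
            # Otherwise the point is discarded permanently.
        if not subsample:
            break  # no eligible points remain
        # Choose the subsample point furthest from its nearest selected point.
        best = max(subsample, key=lambda i: dist[i, selected].min())
        selected.append(best)
        # Assumption: unchosen subsample points go back to the pool, where
        # they are reconsidered after the points not yet drawn.
        candidates[:0] = [i for i in subsample if i != best]
    return selected


rng = np.random.default_rng(0)
points = rng.random((50, 2))
print(optisim_select(points, max_size=5, r=0.1, k=4))
```

Because every accepted point is checked against all earlier selections, the returned indices are pairwise more than `r` apart. The function may return fewer than `max_size` indices when the radius excludes too many points.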
- algorithm(X, max_size) → list
Return selected samples based on the OptiSim algorithm.
- X: np.ndarray
Coordinate array of samples.
- max_size: int
Maximum number of samples to select.
- selected: list
List of indices of selected samples.
- select_from_cluster(X, size, labels=None)
Return selected samples from a cluster based on the OptiSim algorithm.
- X: np.ndarray
Coordinate array of samples.
- size: int
Number of samples to be selected.
- labels: np.ndarray
Indices of samples that form a cluster.
- selected: list
List of indices of selected samples.