Datasets and code for ClustMe and ClustML visual quality measures of grouping patterns in monochrome scatterplots
Description
Code and datasets S1 and S2 used in the papers "ClustMe: A Visual Quality Measure for Ranking Monochrome Scatterplots based on Cluster Patterns", Computer Graphics Forum 38(3): 225-236 (2019), and "ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings", to appear in SAGE Information Visualization Journal.
Table of contents
Code is written in R (version 4.3.1). Data are stored in RData, image, and CSV formats.
CONTENT:
- /_1_TRAINING_MERGER_ON_GMM_PARAMETERS_S1
Pipeline used to train all CARET ML models and find the best merger used in ClustML.
These functions use the S1 data. Refer to the README.txt file therein.
- /_2_ClustMe_vs_ClustML_257data_S2
Run the script CompareClustMLvsClustMe_Data257.R to plot the comparative scatterplot of ClustMe and ClustML scores.
- /_3_USAGE_SCENARIO_GENOMICS
Check the script to set options, then run: run_analysis_of_genomic_data_with_ClustML.R
Processes 1000 Genomes Project data (provided as a PCA of IBD pairs, stored in PCA_of_genomic_data.RData).
Computes the plots for the usage scenario and a summary plot of statistics over all scatterplots based on pairs of PCA components.
Computes the interactive plot for selecting clusters and highlighting them in another scatterplot.
- /CLUSTML_VQM
Contains the main ClustML function (ClustML_Pipeline() in ClustML_VQM.R), which fits a GMM to scatterplot (x,y) data and computes the ClustML score. It uses treebag_up_PP_PCA_BoxCox_SpatialSign.RData, a CARET classification model, to make pairwise merging decisions. This model is the best one obtained by training on 2-component GMMs judged by 34 human subjects as containing one or more than one cluster.
- /DATASETS
Contains datasets from study S1 and S2, with ClustML (CARET model) results and human judgments.
Scatterplot stimuli can be plotted using the "plotSP" function from plotDataXY.R (see the example in that code).
- ./DATA_S1_ORIGINAL_PARAMETER_JUDGEMENT_DATA
1000_2gaussians_param_34judgment_ClustMe_EXP1.csv contains the 34 human judgments of each of the 1000 2-component GMM scatterplots and the 8 parameters used to generate a sample from each of these GMM models.
"XYposCSVfilename": name of the file in ../DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe
"Nsample": sample size generated from the GMM = number of points in the scatterplot.
"MuA1","MuA2": mean along axes 1 and 2 of component A of the GMM
"SigmaA1","SigmaA2": variance along axes 1 and 2 of component A of the GMM
"ThetaA": angle of the component A of the GMM
"MuB1","MuB2": mean along axes 1 and 2 of component B of the GMM
"SigmaB1","SigmaB2": variance along axes 1 and 2 of component B of the GMM
"ThetaB": angle of the component B of the GMM
"Tau": proportion of component A
"Alpha": rotation from horizontal of the full mixture
"Score_1",...,"Score_34": Human judgment (1 = see one cluster, 2 = see more-than-one cluster)
"probMore","probSingle": proportion of judgments seeing more-than-one/one clusters
- ./DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe
PNG image files of the stimuli shown to the human subjects. Their filenames are used in ../DATA_S1_ORIGINAL_PARAMETER_JUDGEMENT_DATA/1000_2gaussians_param_34judgment_ClustMe_EXP1.csv
- ./DATA_S1_ORIGINAL_Scatterplots_XY_ClustMe
Each zzzz.csv file contains the x and y coordinates of the points displayed in the file zzzz.png stored in folder ../DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe
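The relationship between the "Score_1",...,"Score_34" columns and the "probMore"/"probSingle" columns of the S1 judgment CSV described above can be sketched as follows. This is a minimal Python illustration (the repository code itself is in R), assuming the CSV has a header row using the column names listed above. The inline rows are made-up toy data with only 4 score columns, and the names toy_a and toy_b are hypothetical, not files from the dataset.

```python
import csv
import io

# Toy stand-in for the judgment CSV: made-up rows, 4 score columns instead of 34.
TOY_CSV = """XYposCSVfilename,Score_1,Score_2,Score_3,Score_4
toy_a,1,1,2,1
toy_b,2,2,2,1
"""

def judgment_proportions(row):
    """Return (prob_more, prob_single) computed from the Score_* columns of one row."""
    scores = [int(value) for key, value in row.items() if key.startswith("Score_")]
    prob_more = sum(s == 2 for s in scores) / len(scores)    # 2 = more than one cluster
    prob_single = sum(s == 1 for s in scores) / len(scores)  # 1 = one cluster
    return prob_more, prob_single

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
proportions = {r["XYposCSVfilename"]: judgment_proportions(r) for r in rows}
print(proportions)
```

To apply the same function to the real file, the io.StringIO object would be replaced by an opened file handle on the CSV, provided its layout matches the assumption above.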
- ./DATA_S2
Data used in Study S2
Data_257.RData: contains list of filenames and x,y positions of points of the scatterplot stimuli
Data257_435pairwiseRanking_CARETmodels.csv / .RData: rankings given by ClustML using various CARET models, trained on S1 data, as merging classifiers.
Data257_435pairwiseRanking_31HumanJudgments.csv / .RData: rankings given by 31 human judgments.
The row name is filename1@@@@@filename2, where filename1 and filename2 correspond to names in Data_257.
Each cell contains the filename judged by the column's model/subject as showing the more complex cluster pattern, or BOTH if the two are judged to be of similar complexity.
- /DEMO
Run Demo_ClustML_VQM.R to demonstrate how to use the ClustML_Pipeline function to compute the ClustML score of a scatterplot.
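The cell encoding of the S2 pairwise ranking files described above (row name filename1@@@@@filename2, each cell holding the preferred filename or BOTH) can be illustrated with a short tally. This is a hedged Python sketch on a toy in-memory table, not the repository's R code; the plot names and subject column headers are hypothetical.

```python
from collections import Counter

SEPARATOR = "@@@@@"  # separator between the two filenames in a row name

# Toy stand-in for the ranking table: row name -> {column header: cell value}.
toy_table = {
    "plotA@@@@@plotB": {"subject1": "plotA", "subject2": "BOTH", "subject3": "plotA"},
    "plotA@@@@@plotC": {"subject1": "plotC", "subject2": "plotC", "subject3": "BOTH"},
}

def tally_row(row_name, cells):
    """Count how often each scatterplot of the pair (or BOTH) was chosen."""
    first, second = row_name.split(SEPARATOR)
    counts = Counter({first: 0, second: 0, "BOTH": 0})
    for judged in cells.values():
        if judged not in counts:
            raise ValueError(f"unexpected cell value {judged!r} in row {row_name!r}")
        counts[judged] += 1
    return dict(counts)

tallies = {name: tally_row(name, cells) for name, cells in toy_table.items()}
print(tallies)
```

Keeping BOTH as its own count, rather than splitting it between the two plots, leaves the choice of how to treat ties to whatever analysis follows.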
Files
ClustML.zip
(3.0 GB)
md5:c0de5f090312a51769460fdcb34b04f3
Additional details
Related works
- Is described by
- Publication: 10.1111/cgf.13684 (DOI)
- Is referenced by
- Publication: 10.1109/VISUAL.2019.8933620 (DOI)
- Preprint: 10.48550/arXiv.2106.00599 (DOI)
- Publication: 10.1109/TVCG.2023.3327201 (DOI)
- Preprint: 10.48550/arXiv.2209.10042 (DOI)
Dates
- Available: 2023-11-27