Published November 27, 2023 | Version v1
Dataset Open

Datasets and code for ClustMe and ClustML visual quality measures of grouping patterns in monochrome scatterplots

  • 1. Geisinger Health System
  • 2. Qatar Computing Research Institute
  • 3. Hamad bin Khalifa University
  • 4. University of Stuttgart

Description

Code and datasets S1 and S2 used in the paper "ClustMe: A Visual Quality Measure for Ranking Monochrome Scatterplots based on Cluster Patterns", Computer Graphics Forum 38(3): 225-236 (2019), and in "ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings", to appear in the SAGE Information Visualization Journal.

Table of contents

Code is written in R (version 4.3.1). Data are stored in RData, image (PNG) and CSV formats.
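
The names of the objects stored inside each RData file are not listed in this description. One cautious way to find out what a file contains is to load it into a separate environment rather than the global workspace. The sketch below is only an illustration: inspect_rdata() is a hypothetical helper, not part of the archive, and the demonstration uses a temporary file with made-up objects instead of the real data.

```r
# Hypothetical helper (not part of the archive): load an .RData file into
# its own environment so that nothing in the global workspace is overwritten,
# then report the name and class of every object it contains.
inspect_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the names it created
  obj_class <- vapply(
    obj_names,
    function(n) class(get(n, envir = env))[1],
    character(1)
  )
  data.frame(
    object = obj_names,
    class = unname(obj_class),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# Self-contained demonstration on a temporary file with made-up objects.
tmp <- tempfile(fileext = ".RData")
toy_xy <- data.frame(x = c(0.1, 0.5), y = c(0.2, 0.9))
toy_names <- c("a.csv", "b.csv")
save(toy_xy, toy_names, file = tmp)

summary_tbl <- inspect_rdata(tmp)
print(summary_tbl)
```

The same call can be pointed at any RData file of the archive, for example Data_257.RData, once the zip has been extracted.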

CONTENT:

  • /_1_TRAINING_MERGER_ON_GMM_PARAMETERS_S1

Pipeline used to train all CARET ML models and to select the best merger used in ClustML.

These functions use the S1 data. Refer to the README.txt file therein.

  • /_2_ClustMe_vs_ClustML_257data_S2

Run the script CompareClustMLvsClustMe_Data257.R to plot the comparative scatterplot of ClustMe and ClustML scores.

  • /_3_USAGE_SCENARIO_GENOMICS

Check the script to set options, then run: run_analysis_of_genomic_data_with_ClustML.R

Processes 1000 Genomes Project data (provided as a PCA computed from IBD pairs, stored in PCA_of_genomic_data.RData).

Computes the plots for the usage scenario and a summary plot of statistics over all scatterplots built from pairs of PCA components.

Computes the interactive plot for selecting clusters and highlighting them in another scatterplot.

  • /CLUSTML_VQM  

Contains the main ClustML function (ClustML_Pipeline() in ClustML_VQM.R), which fits a GMM to scatterplot (x,y) data and computes the ClustML score. It uses treebag_up_PP_PCA_BoxCox_SpatialSign.RData, a CARET classification model, to make pairwise merging decisions. This model is the best one obtained by training on 2-component GMMs that 34 human subjects judged as containing one or more than one cluster.

  • /DATASETS

Contains datasets from study S1 and S2, with ClustML (CARET model) results and human judgments.

Scatterplot stimuli can be plotted using the "plotSP" function from plotDataXY.R (see the example in that file).

  • ./DATA_S1_ORIGINAL_PARAMETER_JUDGEMENT_DATA

1000_2gaussians_param_34judgment_ClustMe_EXP1.csv contains the 34 human judgments of each of the 1000 2-component GMM scatterplots, together with the 8 parameters used to generate a sample from each of these GMM models.

"XYposCSVfilename": name of the file in ../DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe

"Nsample": sample size generated from the GMM = number of points in the scatterplot.

"MuA1","MuA2": mean along axes 1 and 2 of component A of the GMM

"SigmaA1","SigmaA2": variance along axes 1 and 2 of component A of the GMM  

"ThetaA": angle of the component A of the GMM

"MuB1","MuB2": mean along axes 1 and 2 of component B of the GMM

"SigmaB1","SigmaB2": variance along axes 1 and 2 of component B of the GMM

"ThetaB": angle of the component B of the GMM

"Tau": proportion of component A

"Alpha": rotation from horizontal of the full mixture

"Score_1",...,"Score_34": Human judgment (1 = see one cluster, 2 = see more-than-one cluster)

"probMore","probSingle": proportion of judgments seeing more-than-one/one clusters

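As an illustration of how the "Score_*" columns relate to "probMore" and "probSingle", the following minimal sketch recomputes both proportions from the individual judgments. It assumes the coding described above (1 = one cluster, 2 = more than one cluster) and no missing judgments. The toy table, its file names and the helper judgment_proportions() are made up for the example; the real file has 34 score columns.

```r
# Toy stand-in for the judgment table: 3 scatterplots, 4 subjects.
# File names and scores are invented for this example.
toy <- data.frame(
  XYposCSVfilename = c("p1.csv", "p2.csv", "p3.csv"),
  Score_1 = c(1, 2, 2),
  Score_2 = c(1, 2, 1),
  Score_3 = c(1, 2, 2),
  Score_4 = c(2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: proportion of subjects answering 1 (single cluster)
# and 2 (more than one cluster) for each scatterplot (one row per plot).
judgment_proportions <- function(df) {
  score_cols <- grep("^Score_[0-9]+$", names(df), value = TRUE)
  scores <- as.matrix(df[, score_cols, drop = FALSE])
  data.frame(
    probSingle = unname(rowMeans(scores == 1)),
    probMore = unname(rowMeans(scores == 2))
  )
}

props <- judgment_proportions(toy)
print(props)
```

On the real file, the table would be obtained with read.csv() on 1000_2gaussians_param_34judgment_ClustMe_EXP1.csv, and the result could be compared against the stored "probMore" and "probSingle" columns.
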
  • ./DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe

PNG image files of the stimuli shown to the human subjects; their filenames are used in ../DATA_S1_ORIGINAL_PARAMETER_JUDGEMENT_DATA/1000_2gaussians_param_34judgment_ClustMe_EXP1.csv

  • ./DATA_S1_ORIGINAL_Scatterplots_XY_ClustMe

Each zzzz.csv file contains the x and y coordinates of the points displayed in the file zzzz.png stored in folder ../DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe
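
The correspondence between coordinate files and image files can be sketched as follows. stimulus_paths() is a hypothetical helper, and the folder names are passed as relative paths on the assumption that the working directory is the DATASETS folder.

```r
# Hypothetical helper: given the name of a coordinate file (zzzz.csv),
# build the relative paths of the XY file and of the matching PNG stimulus.
stimulus_paths <- function(csv_name,
                           xy_dir = "DATA_S1_ORIGINAL_Scatterplots_XY_ClustMe",
                           img_dir = "DATA_S1_ORIGINAL_Scatterplots_IMG_ClustMe") {
  stem <- sub("\\.csv$", "", basename(csv_name))  # strip folder and extension
  data.frame(
    xy = file.path(xy_dir, paste0(stem, ".csv")),
    img = file.path(img_dir, paste0(stem, ".png")),
    stringsAsFactors = FALSE
  )
}

paths <- stimulus_paths("zzzz.csv")
print(paths)
```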

  • ./DATA_S2

Data used in Study S2

Data_257.RData: contains list of filenames and x,y positions of points of the scatterplot stimuli

Data257_435pairwiseRanking_CARETmodels.csv /.RData: rankings given by ClustML using various CARET models, trained on S1 data, as merging classifiers.

Data257_435pairwiseRanking_31HumanJudgments.csv /.RData: rankings given by 31 human subjects.

The row name is filename1@@@@@filename2, where filename1 and filename2 correspond to names in Data_257.

Each cell contains the filename that the model or subject named in the column header judged as showing the more complex cluster pattern, or BOTH if the two scatterplots were judged to be of similar complexity.
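
The row-name and cell conventions above can be unpacked with a few lines of base R. The sketch below is a minimal illustration on a made-up table with three pairs and three judges. tally_pairs() is a hypothetical helper, and it assumes that file names never contain the separator "@@@@@".

```r
# Toy stand-in for a pairwise ranking table: rows are pairs of scatterplots,
# columns are judges (subjects or models). All names are invented.
toy_rank <- data.frame(
  subj1 = c("a.csv", "BOTH", "c.csv"),
  subj2 = c("a.csv", "c.csv", "c.csv"),
  subj3 = c("b.csv", "BOTH", "b.csv"),
  row.names = c("a.csv@@@@@b.csv", "a.csv@@@@@c.csv", "b.csv@@@@@c.csv"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each row name into its two file names and count
# how many judges picked the first file, the second file, or BOTH.
tally_pairs <- function(tbl) {
  parts <- do.call(rbind, strsplit(rownames(tbl), "@@@@@", fixed = TRUE))
  m <- as.matrix(tbl)
  # A vector of length nrow(m) is recycled down the columns, so each cell
  # in row i is compared with the file name belonging to row i.
  data.frame(
    file1 = parts[, 1],
    file2 = parts[, 2],
    votes_file1 = unname(rowSums(m == parts[, 1])),
    votes_file2 = unname(rowSums(m == parts[, 2])),
    votes_both = unname(rowSums(m == "BOTH")),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

tally <- tally_pairs(toy_rank)
print(tally)
```

For the real CSV files, reading them with read.csv(..., row.names = 1) is an assumption about their layout that should be checked against the files themselves.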

  • /DEMO

Run Demo_ClustML_VQM.R to demonstrate how to use the ClustML_Pipeline function to compute the ClustML score of a scatterplot.

Files

ClustML.zip (3.0 GB)
md5:c0de5f090312a51769460fdcb34b04f3

Additional details

Related works

Is described by
Publication: 10.1111/cgf.13684 (DOI)
Is referenced by
Publication: 10.1109/VISUAL.2019.8933620 (DOI)
Preprint: 10.48550/arXiv.2106.00599 (DOI)
Publication: 10.1109/TVCG.2023.3327201 (DOI)
Preprint: 10.48550/arXiv.2209.10042 (DOI)

Dates

Available
2023-11-27