cognitivefactory/interactive-clustering-comparative-study
Authors/Creators
Description
Interactive Clustering: Comparative Studies
Several comparative studies of cognitivefactory-interactive-clustering functionalities on NLP datasets.
- GitHub repository: https://github.com/cognitivefactory/interactive-clustering-comparative-study/tree/1.0.0
Quick description of Interactive Clustering
Interactive clustering is a method intended to assist in the design of a training dataset.
This iterative process begins with an unlabeled dataset, and each iteration consists of two substeps:
- the user defines constraints on data sampled by the machine;
- the machine performs data partitioning using a constrained clustering algorithm.
Thus, at each step of the process:
- the user corrects the clustering of the previous steps using constraints, and
- the machine offers a corrected and more relevant data partitioning for the next step.
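The loop described above can be sketched as follows. This is a minimal toy illustration, not the implementation from cognitivefactory-interactive-clustering: the function names (`constrained_clustering`, `interactive_clustering`), the 1-D numeric data, the random pair sampling, the simulated annotator that reads ground-truth labels, and the greedy centroid merging are all assumptions made for the example.

```python
import itertools
import random


def constrained_clustering(points, must_link, cannot_link, n_clusters):
    """Toy constrained agglomerative clustering on 1-D points (illustrative only)."""
    # Start with one cluster per point, then merge every MUST_LINK pair.
    clusters = [{i} for i in range(len(points))]

    def find(i):
        return next(c for c in clusters if i in c)

    for a, b in must_link:
        ca, cb = find(a), find(b)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb

    def violates(ca, cb):
        # A merge is forbidden if it would join a CANNOT_LINK pair.
        return any((a in ca and b in cb) or (a in cb and b in ca) for a, b in cannot_link)

    def centroid(c):
        return sum(points[i] for i in c) / len(c)

    # Greedily merge the two closest clusters that no constraint forbids.
    while len(clusters) > n_clusters:
        candidates = [
            (abs(centroid(ca) - centroid(cb)), ia, ib)
            for (ia, ca), (ib, cb) in itertools.combinations(enumerate(clusters), 2)
            if not violates(ca, cb)
        ]
        if not candidates:
            break  # the constraints forbid any further merge
        _, ia, ib = min(candidates)
        merged = clusters.pop(ib)  # ia < ib, so index ia stays valid
        clusters[ia] |= merged

    labels = [0] * len(points)
    for cluster_id, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = cluster_id
    return labels


def interactive_clustering(points, oracle_labels, n_clusters, n_iterations, batch_size, seed=0):
    """Alternate constraint annotation (simulated user) and constrained clustering."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    unseen = list(itertools.combinations(range(len(points)), 2))
    rng.shuffle(unseen)
    labels = constrained_clustering(points, must_link, cannot_link, n_clusters)
    for _ in range(n_iterations):
        # Substep 1: the "user" annotates constraints on pairs sampled by the machine.
        batch, unseen = unseen[:batch_size], unseen[batch_size:]
        for a, b in batch:
            if oracle_labels[a] == oracle_labels[b]:
                must_link.append((a, b))
            else:
                cannot_link.append((a, b))
        # Substep 2: the machine re-partitions the data under all constraints so far.
        labels = constrained_clustering(points, must_link, cannot_link, n_clusters)
    return labels


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
truth = [0, 0, 0, 1, 1, 1]
print(interactive_clustering(points, truth, n_clusters=2, n_iterations=3, batch_size=2))
```

In this sketch the annotator never makes mistakes, so the constraints are always consistent; handling annotation errors and conflicts is outside its scope.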
Description of studies
Several studies are provided here:
- efficience: Aims to confirm the technical efficiency of the method by verifying its convergence to a ground truth and by finding the best implementation to increase convergence speed.
- computation time: Aims to estimate the time algorithms need to reach their objectives.
- annotation time: Aims to estimate the time needed to annotate constraints.
- constraints number: Aims to estimate the number of constraints needed to obtain a relevant annotated dataset.
- relevance: Aims to confirm the relevance of clustering results.
- rentability: Aims to predict the profitability of one more iteration.
- inter annotator: Aims to estimate the inter-annotator agreement score during constraint annotation.
- annotation errors and conflicts fix: Aims to evaluate the impact of annotation errors and to verify the importance of fixing conflicts for labeling.
- annotation subjectivity: Aims to estimate the impact of labeling differences on clustering results.
Results
All results are zipped in .tar.gz files and versioned on Zenodo: Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255.
Warning! These experiments can use a large amount of disk space and contain hundreds or even thousands of files (one per execution attempt). See the table below before extracting the files.
| STUDY NAME | FOLDER SIZE | .tar.gz FILE SIZE |
|-----------------------------------|------------:|--------------------:|
| 1_efficience_study | 1.4 GB | 0.7 GB |
| 2_computation_time_study | 1.1 GB | 0.1 GB |
| 3_annotation_time_study | 0.1 GB | 0.1 GB |
| 4_constraints_number_study | 12.0 GB | 2.7 GB |
| 5_relevance_study | 0.1 GB | 0.1 GB |
| 6_rentability_study | 1.3 GB | 0.1 GB |
| 7_inter_annotators_score_study | 0.1 GB | 0.1 GB |
| 8_annotation_error_fix_study | 28.0 GB | 3.5 GB |
| 9_annotation_subjectivity_study | 82.0 GB | 11.3 GB |
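Before extracting an archive, its uncompressed size and file count can be checked against the free disk space. The sketch below uses only the Python standard library; the helper names (`uncompressed_size`, `can_extract`) and the archive file name in the usage comment are hypothetical.

```python
import shutil
import tarfile


def uncompressed_size(archive_path):
    """Return (number of regular files, total uncompressed bytes) of a .tar.gz archive."""
    n_files, total_bytes = 0, 0
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:  # iterates over TarInfo entries without extracting them
            if member.isfile():
                n_files += 1
                total_bytes += member.size
    return n_files, total_bytes


def can_extract(archive_path, destination="."):
    """Tell whether the destination volume has enough free space for the archive content."""
    _, total_bytes = uncompressed_size(archive_path)
    return total_bytes <= shutil.disk_usage(destination).free


# Usage (assumed file name, adapt it to the archive you downloaded):
# n_files, total_bytes = uncompressed_size("4_constraints_number_study.tar.gz")
# print(n_files, "files,", round(total_bytes / 1e9, 1), "GB uncompressed")
```

Listing a gzip-compressed archive requires reading it from start to end, so this check can take a while on the largest files.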
Associated PhD report
Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.
How to cite
Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255
Files
cognitivefactory/interactive-clustering-comparative-study-1.0.0.zip
Total size: 19.0 GB
| MD5 checksum | Size |
|---|---|
| 01b4418fb97f2389d712cfd9fc532941 | 764.1 MB |
| 49b54648d00f596f51816c5ac7475496 | 42.7 MB |
| 31e879e6f7298aa754b017686fe393cd | 5.2 MB |
| 0c42a038b545db08fe02a192b6a3d3b5 | 2.7 GB |
| e4edbad2e87d03e4dc5d4527dcc21bd0 | 94.2 MB |
| 6f99c0bf2cbb2501b91248f87df87bcc | 172.8 MB |
| 9069a4bd38cc78c85a62b427b9d514f1 | 5.0 MB |
| 58f55fcd39f1b31bea51a08482fa8ade | 3.5 GB |
| 1a531d004dd04ceb0301350305e510d1 | 11.6 GB |
| 22029c6b061e3ea959cb576394e50123 | 53.4 MB |
| 250484c697ec0d328479f0a0d7ab9e8e | 823 Bytes |
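The MD5 checksums listed for each file can be used to verify a download before extracting it. The following is a minimal sketch using the Python standard library; the helper names (`md5_of_file`, `matches_checksum`) and the file name in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file against an expected checksum, with or without the 'md5:' prefix."""
    expected = expected_md5.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected


# Usage (assumed file name, adapt it to the file you downloaded):
# matches_checksum("1_efficience_study.tar.gz", "md5:01b4418fb97f2389d712cfd9fc532941")
```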
Additional details
Related works
- Is supplement to
- Software: https://github.com/cognitivefactory/interactive-clustering-comparative-study/tree/1.0.0 (URL)
References
- Schild, E. (2021). cognitivefactory/interactive-clustering. Zenodo. https://doi.org/10.5281/zenodo.4775251
- Schild, E. (2022). cognitivefactory/interactive-clustering-gui. Zenodo. https://doi.org/10.5281/zenodo.4775270
- Schild, E. (2023). cognitivefactory/features-maximization-metric. Zenodo. https://doi.org/10.5281/zenodo.7646382
- Schild, E. (2022). French trainset for chatbots dealing with usual requests on bank cards (2.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7307432
- Schild, E., & Adler, M. (2023). Subset of 'MLSUM: The Multilingual Summarization Corpus' for constraints annotation experiment (1.0.0 [subset: fr+train+filtered]) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8399302