Synthetic UML Diagram Dataset (PlantUML)
Description
This dataset comprises synthetic UML diagrams, specifically activity and sequence diagrams, generated with PlantUML, a text-based tool for creating visual diagrams. By generating randomized text strings that follow PlantUML syntax, we produced a diverse and scalable collection that emulates standard UML diagrams. Each diagram is accompanied by its corresponding PlantUML code, which makes the textual basis of each visual representation explicit. The larger datasets reuse data from the smaller ones, because each model was trained on the data separately, as described in the original paper. To use the data in its entirety, we recommend using only the Extra Large dataset.
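To illustrate what "randomized text strings based on PlantUML syntax" can look like, here is a minimal sketch in Python. It is not the generator used to build this dataset: the function names, the naming scheme for participants and messages, and the diagram sizes are all assumptions made for illustration. It uses only basic PlantUML sequence-diagram syntax (`@startuml`, `participant`, `A -> B: message`, `@enduml`).

```python
import random
import string


def random_name(rng, length=6):
    """Return a random lowercase identifier (hypothetical naming scheme)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def random_sequence_diagram(rng, n_participants=3, n_messages=5):
    """Build PlantUML source text for a random sequence diagram.

    Assumes n_participants >= 2 so that sender and receiver can differ.
    """
    participants = [random_name(rng).capitalize() for _ in range(n_participants)]
    lines = ["@startuml"]
    # Declare every participant up front so the layout order is fixed.
    lines += [f"participant {p}" for p in participants]
    for _ in range(n_messages):
        # Pick two distinct participants for each message arrow.
        sender, receiver = rng.sample(participants, 2)
        lines.append(f"{sender} -> {receiver}: {random_name(rng)}")
    lines.append("@enduml")
    return "\n".join(lines)


if __name__ == "__main__":
    # A seeded generator makes the output reproducible.
    print(random_sequence_diagram(random.Random(0)))
```

The resulting text can be saved to a file and rendered with PlantUML to obtain a diagram image paired with its source, which mirrors the image and code pairing described above.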
Each diagram category (activity and sequence) is divided into four subsets by size; the counts below are approximate:
- Small: 6,000 training diagrams and 1,500 testing diagrams.
- Medium: 12,000 training diagrams and 3,000 testing diagrams.
- Large: 24,000 training diagrams and 6,000 testing diagrams.
- Extra Large: 120,000 training diagrams and 30,000 testing diagrams.
Files
| Name | Size | MD5 |
|---|---|---|
| PlantUML_Data_bundle.zip | 12.3 GB | c607c85079789f6ef04438b50a11e9b5 |
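Because the archive is large, it can be worth checking the download against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` and the local file path are assumptions, while the expected checksum is the one given in this record.

```python
import hashlib

# Checksum as listed for the archive in this record.
EXPECTED_MD5 = "c607c85079789f6ef04438b50a11e9b5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so a multi-gigabyte file never sits fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify("PlantUML_Data_bundle.zip")` would return `True` for an intact download, assuming the archive was saved under that name in the current directory.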