Published December 20, 2023
| Version 1.0.0
Dataset
Open
Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification
Description
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
- LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
- Improved multi-label classification: Fine-tuning Homepage2Vec on these datasets raises its macro F1 score from 38% to 43%, as evaluated on a human-labeled dataset, which indicates that the LLM-generated labels capture a broader range of website topics.
- Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
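Macro F1, the metric quoted above, is the unweighted mean of per-label F1 scores, so rare topics weigh as much as common ones. The following is a minimal pure-Python sketch of that computation, not the project's evaluation code. The function name `macro_f1`, the integer label IDs, the toy data, and the convention of scoring a label as 0 when it has neither true nor predicted instances are all assumptions made for illustration.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Return the unweighted mean of per-label F1 scores.

    y_true and y_pred are sequences of sets; each set holds the label IDs
    assigned to one website (multi-label setting).
    """
    scores = []
    for label in range(num_labels):
        tp = fp = fn = 0
        for truth, pred in zip(y_true, y_pred):
            if label in truth and label in pred:
                tp += 1
            elif label in pred:
                fp += 1
            elif label in truth:
                fn += 1
        denom = 2 * tp + fp + fn
        # Assumption: a label with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Hypothetical toy data: three websites, three topic labels (0, 1, 2).
toy_true = [{0, 1}, {1}, {2}]
toy_pred = [{0}, {1, 2}, {2}]
print(f"macro F1 = {macro_f1(toy_true, toy_pred, num_labels=3):.3f}")
```

Because every label contributes equally, a gain on an infrequent topic moves the score as much as a gain on a frequent one, which is why the metric suits multi-label topic classification with unbalanced categories.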
Dataset Composition:
- curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5 (context 2, one-shot)
- curlie-gpt4-10k: 10,000 websites labeled using GPT-4 (context 2, zero-shot)
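The column layout of the two subsets is not described here, so a reasonable first step is to inspect a file before building on it. The sketch below uses only the standard library to report the header, the row count, and a few sample rows. The helper name `summarize_csv`, the local path, and the UTF-8 encoding are assumptions; it makes no claim about which columns the files actually contain.

```python
import csv
import os


def summarize_csv(path, preview_rows=3):
    """Return (column names, number of data rows, first few rows as dicts)."""
    with open(path, newline="", encoding="utf-8") as handle:  # encoding assumed
        reader = csv.DictReader(handle)
        preview = []
        n_rows = 0
        for row in reader:
            if n_rows < preview_rows:
                preview.append(row)
            n_rows += 1
        columns = list(reader.fieldnames or [])
    return columns, n_rows, preview


if __name__ == "__main__":
    # Assumed local path: a copy of one subset saved under its subset name.
    local_path = "curlie-gpt3.5-10k.csv"
    if os.path.exists(local_path):
        columns, n_rows, preview = summarize_csv(local_path)
        print("columns:", columns)
        print("rows:", n_rows)
        for row in preview:
            print(row)
    else:
        print(f"{local_path} not found; download it first.")
```

If the subsets are complete, the reported row count should be 10,000 for each file.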
Intended Use:
- Fine-tuning and advancing Homepage2Vec or similar website classification models
- Research on LLM-generated datasets for text classification tasks
- Exploration of multilingual website classification
Additional Information:
- Project and report repository: https://github.com/CS-433/ml-project-2-mlp
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Files
Total size: 1.0 MB
Checksum | Size |
---|---|
md5:40fa7530cf7ca386423200d06ea54d23 | 501.9 kB |
md5:dd10e6b2a1c9b42fb8a76cf092f7a795 | 501.9 kB |
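The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below hashes a local file with `hashlib` and checks it against the two published values. The listing does not show which checksum belongs to which file, so the check accepts either. The function names and the example path are assumptions.

```python
import hashlib
import os

# Checksums copied from the file listing; the file-to-checksum mapping is
# not shown there, so both are accepted.
LISTED_CHECKSUMS = {
    "md5:40fa7530cf7ca386423200d06ea54d23",
    "md5:dd10e6b2a1c9b42fb8a76cf092f7a795",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed=LISTED_CHECKSUMS):
    """Return True if the file's MD5 equals any of the listed checksums."""
    return f"md5:{md5_of_file(path)}" in listed


if __name__ == "__main__":
    local_path = "curlie-gpt3.5-10k.csv"  # assumed local download path
    if os.path.exists(local_path):
        print(local_path, "OK" if matches_listed_checksum(local_path) else "MISMATCH")
    else:
        print(f"{local_path} not found; download it first.")
```

A mismatch usually means a truncated or altered download, in which case the file should be fetched again before use.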
Additional details
Related works
- Is supplement to: project deliverable, https://github.com/CS-433/ml-project-2-mlp