TalentCLEF 2025 corpus: Skill and Job Title Intelligence for Human Capital Management
Description
🚨 Current Status: Release of Task B Development set. To check when new data will be uploaded, please consult the calendar of the task
TalentCLEF2025 corpus - Task B Development set release
Introduction:
The first edition of TalentCLEF aims to develop and evaluate models designed to facilitate three essential tasks:
- Finding/ranking candidates for job positions based on their experience and professional skills.
- Implementing upskilling and reskilling strategies that promote the continuous development of workers.
- Detecting emerging skills and skills gaps of importance in organizations.
With that aim, the lab is divided into two tasks:
- Task A - Multilingual Job Title Matching. This task involves developing systems to identify and rank the job titles most similar to a given one by generating a ranked list of similar titles from a specified knowledge base for each job title in a provided test set.
- Task B - Job Title-Based Skill Prediction. Task B requires developing systems that can retrieve relevant skills associated with a specified job title.
This data repository contains the data for these two tasks. The data is being released progressively according to the task schedule.
The task evaluation takes place on Codabench (Task A and Task B). Participants must register through the CLEF Lab Registration Page to be part of the evaluation campaign.
File structure:
For a detailed description of the data structure, you can refer to the TalentCLEF2025 data description page, where it is thoroughly explained.
The data is organized into two *.zip files, TaskA.zip and TaskB.zip, each containing training, validation, and test folders to support different stages of model development. So far, the training sets for both tasks and the Task B development set have been released; in future releases, as the tasks progress, additional data will be added to the different subfolders for each task.
TaskA includes language-specific subfolders within the training and validation directories, covering English, Spanish, German, and Chinese job title data. The training folders for TaskA contain a language-specific .tsv file for each respective language. Validation folders include three essential files (queries, corpus_elements, and qrels) for evaluating model relevance to search queries. TaskA's test folder has queries and corpus_elements files for testing retrieval.
TaskA/
│
├── training/
│   ├── english/
│   │   └── taskA_training_en.tsv
│   ├── spanish/
│   │   └── taskA_training_es.tsv
│   └── german/
│       └── taskA_training_de.tsv
│
├── validation/
│   ├── english/
│   │   ├── queries
│   │   ├── corpus_elements
│   │   └── qrels
│   ├── spanish/
│   ├── german/
│   └── chinese/
│
└── test/
    ├── queries
    └── corpus_elements
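As a minimal sketch, a language-specific training file from the tree above can be read with Python's standard csv module. The tab-separated layout is assumed from the .tsv extension; the exact column schema is documented on the TalentCLEF2025 data description page, so adjust field handling accordingly:

```python
import csv

def load_training_tsv(path):
    """Load a TaskA training file (e.g. taskA_training_en.tsv) as a list of rows.

    Assumption: rows are tab-separated; the authoritative column layout
    is given on the TalentCLEF2025 data description page.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="\t")]

# Hypothetical usage (path taken from the tree above):
# rows = load_training_tsv("TaskA/training/english/taskA_training_en.tsv")
```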
TaskB follows a similar structure but without language-specific subfolders, providing general .tsv files for training, validation, and testing. This consistent file organization enables efficient data access and structured updates as new data versions are published.
TaskB/
│
├── training/
│   ├── job2skill.tsv
│   ├── jobid2terms.json
│   └── skillid2terms.json
│
├── validation/
│   ├── queries
│   ├── corpus_elements
│   └── qrels
│
└── test/
    ├── queries
    └── corpus_elements
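The TaskB training files can be loaded the same way; a hedged sketch follows. The assumption that each job2skill.tsv row pairs a job entry with a skill entry, and that the two JSON files map identifiers to term lists, is inferred from the filenames only; consult the data description page for the authoritative schema:

```python
import csv
import json

def load_job2skill(tsv_path):
    """Read job-skill rows from job2skill.tsv.

    Assumption: each tab-separated row links a job entry to a skill entry.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

def load_id2terms(json_path):
    """Read an id-to-terms mapping such as jobid2terms.json or skillid2terms.json."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)
```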
Tutorials:
| Notebook | Link |
|---|---|
| Data Download and Load using Python | Link to Colab |
| Prepare submission file and run evaluation | Link to Colab |
| Task A - Development set Baseline generation | Link to Colab |
Resources: