# Skill Extraction Dataset

## Dataset description

This dataset is a collection of hard skill entities extracted from a corpus of resumes.
It is designed to benchmark the differences in skill extraction performance between
human annotators and automatic systems. The resource contains two types of labels:

1. **Human-Annotated Labels:** Created during an organized student workshop at the EHL
   Business School. Multiple annotations per CV were collected to establish a reliable
   consensus for the ground truth.
2. **Automatic System Labels:** Generated by a state-of-the-art supervised machine
   learning system and a conversational LLM (see related paper).

## Data Structure

The TSV file contains 3 columns:

1. human – This column represents the combined set of skills identified by multiple
   annotators.
2. supervised – Contains the skills automatically extracted using a fine-tuned
   state-of-the-art supervised language model (see related paper for model details).
3. llm_mixtral – Contains the skills automatically extracted using a state-of-the-art
   conversational large language model (see related paper for details).
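
The three columns above can be read and compared with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that the TSV has a header row with the column names listed above and that the skills inside a cell are separated by commas; neither detail is stated here, so check the file and adjust `separator` if needed. The helper names (`read_skill_rows`, `jaccard`) and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the data structure description.
COLUMNS = ("human", "supervised", "llm_mixtral")


def read_skill_rows(handle, separator=","):
    """Parse a TSV stream into one dict of skill sets per row.

    Assumption: the file has a header row, and each cell holds skills
    joined by `separator`. Skills are lowercased and stripped so that
    "Python" and " python" compare as equal.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        rows.append({
            column: {
                skill.strip().lower()
                for skill in (record.get(column) or "").split(separator)
                if skill.strip()
            }
            for column in COLUMNS
        })
    return rows


def jaccard(reference, predicted):
    """Jaccard overlap between two skill sets (1.0 when both are empty)."""
    if not reference and not predicted:
        return 1.0
    return len(reference & predicted) / len(reference | predicted)


# Invented sample in the assumed layout, used instead of the real file.
SAMPLE = (
    "human\tsupervised\tllm_mixtral\n"
    "python, sql\tpython\tPython, SQL, excel\n"
    "accounting\t\taccounting\n"
)

rows = read_skill_rows(io.StringIO(SAMPLE))
for index, row in enumerate(rows):
    for system in ("supervised", "llm_mixtral"):
        score = jaccard(row["human"], row[system])
        print(f"row {index}: human vs {system}: {score:.2f}")
```

To run this on the actual data, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points at your copy of the TSV file. Jaccard overlap is only one possible comparison; the related paper may use different metrics.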
