OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets
Description
Overview
The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in our IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a resource for developing and benchmarking predictive models across a variety of career-related tasks. By applying anonymization and differential privacy techniques, OpenResume lets researchers work with resume data while protecting the privacy of the people it describes. The dataset is available in this repository. For more details, see the paper (DOI: 10.1109/BigData62323.2024.10825519).
If you find this paper useful in your research, or use this dataset in any publication, project, tool, or other work, please cite:
@inproceedings{yamashita2024openresume,
title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
booktitle={2024 IEEE International Conference on Big Data (BigData)},
year={2024},
organization={IEEE}
}
@inproceedings{yamashita2023james,
title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
year={2023},
organization={IEEE}
}
Data Contents and Organization
The dataset consists of two primary components:
- Realistic Data: An anonymized dataset of realistic resumes, protected with differential privacy techniques.
- Synthetic Data: A synthetic dataset generated from real-world job transition graphs.
The dataset includes the following features:
- Anonymized User Identifiers: Unique IDs for anonymized users.
- Anonymized Company Identifiers: Unique IDs for anonymized companies.
- Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
- Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.
Detailed information on how the OpenResume dataset is constructed can be found in our paper.
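As a rough illustration of how the features listed above fit together, the sketch below groups job records by user and orders them by start date to form career trajectories. The field names (`user_id`, `company_id`, `esco_title`, `start`, `end`) and the toy records are assumptions made for this example; they are not the dataset's actual schema, so check the released files before adapting it.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JobRecord:
    """One job entry. Field names are hypothetical, not the dataset's real schema."""
    user_id: str
    company_id: str
    esco_title: str
    start: date
    end: date


def build_trajectories(records):
    """Group records by user and sort each user's jobs by start date."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec.user_id].append(rec)
    return {
        user: sorted(jobs, key=lambda r: r.start)
        for user, jobs in by_user.items()
    }


def duration_months(rec):
    """Approximate job duration in whole calendar months."""
    return (rec.end.year - rec.start.year) * 12 + (rec.end.month - rec.start.month)


# Toy records, invented for illustration only.
records = [
    JobRecord("u1", "c7", "software developer", date(2018, 3, 1), date(2020, 6, 1)),
    JobRecord("u1", "c2", "data analyst", date(2015, 1, 1), date(2018, 2, 1)),
    JobRecord("u2", "c7", "accountant", date(2019, 5, 1), date(2021, 5, 1)),
]

trajectories = build_trajectories(records)
for user, jobs in trajectories.items():
    print(user, [(j.esco_title, duration_months(j)) for j in jobs])
```

Ordered per-user sequences like these are a common starting point for the career path prediction tasks that use this kind of data.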
Dataset Extension
Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks.
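The integration described above can be sketched as a simple lookup that attaches ESCO skills to each normalized job title. Everything in this example is a stand-in: the `esco_skills` table, its keys, and the resume rows are invented, and real ESCO exports are organized differently (for example, keyed by occupation identifiers rather than plain titles), so the lookup would need to be adapted to the files you download.

```python
# Hypothetical stand-in for an ESCO occupation-to-skill table.
# Real ESCO exports are keyed differently; adapt the lookup accordingly.
esco_skills = {
    "data analyst": {"essential": ["statistics", "data cleaning"], "optional": ["SQL"]},
    "software developer": {"essential": ["computer programming"], "optional": []},
}

# Toy resume rows as (user_id, normalized_title). Invented values.
resume_rows = [
    ("u1", "data analyst"),
    ("u1", "software developer"),
    ("u2", "accountant"),  # deliberately missing from the toy ESCO table
]


def attach_skills(rows, skills_by_title):
    """Left-join resume rows with skills, keeping rows whose title has no match."""
    joined = []
    for user_id, title in rows:
        skills = skills_by_title.get(title, {"essential": [], "optional": []})
        joined.append({
            "user_id": user_id,
            "title": title,
            "essential_skills": list(skills["essential"]),
            "optional_skills": list(skills["optional"]),
        })
    return joined


enriched = attach_skills(resume_rows, esco_skills)
for row in enriched:
    print(row)
```

Keeping unmatched titles with empty skill lists, rather than dropping them, makes gaps in the join easy to spot before training a skill prediction model.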
Applicable Tasks:
- Next Job Title Prediction (Career Path Prediction)
- Next Company Prediction (Career Path Prediction)
- Turnover Prediction
- Link Prediction
- Required Skill Prediction (with ESCO dataset integration)
- Existing Skill Prediction (with ESCO dataset integration)
- Job Description Classification (with ESCO dataset integration)
- Job Title Classification (with ESCO dataset integration)
- Text Feature-Based Model Development (with ESCO dataset integration)
- LLM Development for Resume-Related Tasks (with ESCO dataset integration)
- And more!
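To make the first task in the list concrete, here is a minimal next job title baseline that counts how often one title directly follows another and predicts the most frequent successors. This is only an illustrative first-order transition-count baseline, not the method used in the paper; the function names and toy sequences are invented for this sketch.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each title is directly followed by another title."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return counts


def predict_next(counts, current_title, k=1):
    """Return up to k most frequent successors of current_title."""
    if current_title not in counts:
        return []  # title never seen with a successor during fitting
    return [title for title, _ in counts[current_title].most_common(k)]


# Toy title sequences, one list per user. Invented for illustration.
train = [
    ["data analyst", "data scientist", "data science manager"],
    ["data analyst", "data scientist"],
    ["data analyst", "software developer"],
]

counts = fit_transitions(train)
print(predict_next(counts, "data analyst", k=2))
# -> ['data scientist', 'software developer']
print(predict_next(counts, "accountant"))
# -> []
```

A baseline like this gives a quick reference point against which sequence models can be compared on the same train and test split.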
Intended Uses
The primary objective of OpenResume is to provide an open resource for:
- Evaluating and comparing newly developed career models in a standardized manner.
- Fostering AI advancements in career trajectory modeling and job market analytics.
With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.
While OpenResume is well suited to research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. OpenResume also excludes sensitive attributes such as race and gender, which reduces the risk of propagating bias during model training.
Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.
Ethical and Responsible Use
The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.
Related Work
JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023
Fake Resume Attacks: Data Poisoning on Online Job Platforms
Michiharu Yamashita, Thanh Tran, and Dongwon Lee
The ACM Web Conference 2024 (WWW), 2024