OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets

Yamashita, Michiharu; Tran, Thanh; Lee, Dongwon

doi:10.1109/BigData62323.2024.10825519

Published January 16, 2025 | Version v1

Dataset Restricted


Overview

The OpenResume dataset is designed for researchers and practitioners in career trajectory modeling and job-domain machine learning, as described in the IEEE BigData 2024 paper. It includes both anonymized realistic resumes and synthetically generated resumes, offering a resource for developing and benchmarking predictive models across a variety of career-related tasks. By employing anonymization and differential privacy techniques, OpenResume allows research to be conducted while protecting the privacy of the individuals behind the data. The dataset is available in this repository. Please see the paper for more details: doi:10.1109/BigData62323.2024.10825519

If you find this paper useful in your research or use this dataset in any publications, projects, tools, or other forms, please cite:

@inproceedings{yamashita2024openresume,
  title={{OpenResume: Advancing Career Trajectory Modeling with Anonymized and Synthetic Resume Datasets}},
  author={Yamashita, Michiharu and Tran, Thanh and Lee, Dongwon},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  year={2024},
  organization={IEEE}
}

@inproceedings{yamashita2023james,
  title={{JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning}},
  author={Yamashita, Michiharu and Shen, Jia Tracy and Tran, Thanh and Ekhtiari, Hamoon and Lee, Dongwon},
  booktitle={2023 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
  year={2023},
  organization={IEEE}
}

Data Contents and Organization

The dataset consists of two primary components:

- Realistic Data: An anonymized dataset utilizing differential privacy techniques.
- Synthetic Data: A synthetic dataset generated from real-world job transition graphs.

The dataset includes the following features:

- Anonymized User Identifiers: Unique IDs for anonymized users.
- Anonymized Company Identifiers: Unique IDs for anonymized companies.
- Normalized Job Titles: Job titles standardized into the ESCO taxonomy.
- Job Durations: Start and end dates, either anonymized or synthetically generated with differential privacy.

Detailed information on how the OpenResume dataset is constructed can be found in our paper.
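To make the feature list concrete, the following minimal sketch shows one way such records could be grouped into per-user career trajectories. The record layout, the field names (`user_id`, `company_id`, `esco_title`, `start`, `end`), and the sample values are assumptions made for illustration only; the actual file format and column names are not described here and should be taken from the released files and the paper.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Experience(NamedTuple):
    """One job entry. Field names are hypothetical, not the dataset's schema."""
    user_id: str      # anonymized user identifier
    company_id: str   # anonymized company identifier
    esco_title: str   # job title normalized to an ESCO occupation
    start: date       # job start date
    end: date         # job end date


def build_trajectories(rows):
    """Group entries by user and order each user's jobs by start date."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row.user_id].append(row)
    return {user: sorted(jobs, key=lambda r: r.start)
            for user, jobs in by_user.items()}


# Toy records invented for this example (not taken from the dataset).
rows = [
    Experience("u1", "c9", "data scientist", date(2019, 1, 1), date(2021, 6, 1)),
    Experience("u1", "c3", "software developer", date(2016, 5, 1), date(2018, 12, 1)),
    Experience("u2", "c3", "software developer", date(2020, 2, 1), date(2022, 2, 1)),
]

trajectories = build_trajectories(rows)
u1_titles = [job.esco_title for job in trajectories["u1"]]
```

Here `u1_titles` lists the first user's titles in chronological order, which is the sequence form most career path models consume.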

Dataset Extension

Job titles in the OpenResume dataset are normalized into the ESCO occupation taxonomy. You can easily integrate the OpenResume dataset with ESCO job and skill databases to perform additional downstream tasks.
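As a rough illustration of this kind of integration, the sketch below attaches skill labels to normalized job titles through a lookup keyed by occupation label. The lookup table, its placeholder skill names, and the choice of the occupation label as the join key are assumptions for illustration; in practice the mapping would be built from the ESCO occupation and skill files, and the join key would follow whatever identifier the dataset actually stores.

```python
# Hypothetical lookup: ESCO occupation label -> skill labels.
# The skill names are placeholders, not real ESCO entries.
esco_skills = {
    "software developer": ["placeholder skill A", "placeholder skill B"],
    "data scientist": ["placeholder skill C"],
}


def attach_skills(titles, skills_by_occupation):
    """Pair each job title with its skills; unknown titles get an empty list."""
    return [(title, skills_by_occupation.get(title, [])) for title in titles]


# A toy title sequence for one user (invented for this example).
career = ["software developer", "data scientist", "unlisted occupation"]
enriched = attach_skills(career, esco_skills)
```

Returning an empty list for unmatched titles keeps the sequence length intact, so downstream models can decide how to treat jobs without skill information.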

Applicable Tasks:
- Next Job Title Prediction (Career Path Prediction)
- Next Company Prediction (Career Path Prediction)
- Turnover Prediction
- Link Prediction
- Required Skill Prediction (with ESCO dataset integration)
- Existing Skill Prediction (with ESCO dataset integration)
- Job Description Classification (with ESCO dataset integration)
- Job Title Classification (with ESCO dataset integration)
- Text Feature-Based Model Development (with ESCO dataset integration)
- LLM Development for Resume-Related Tasks (with ESCO dataset integration)
- And more!
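For the first task in the list above, a simple reference point is a first-order transition-count baseline: predict the title that most often followed the current one in the training sequences. This is a generic baseline sketched under the assumption that each career is available as an ordered list of title strings; it is not the method evaluated in the paper, and the toy sequences are invented.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each title is immediately followed by each other title."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return counts


def predict_next(counts, current, k=1):
    """Return up to k most frequent successors of `current`, or [] if unseen."""
    if current not in counts:
        return []
    return [title for title, _ in counts[current].most_common(k)]


# Toy training careers (invented for this example).
train = [
    ["intern", "software developer", "data scientist"],
    ["intern", "software developer", "data scientist"],
    ["software developer", "project manager"],
]

model = fit_transitions(train)
top1 = predict_next(model, "software developer")
```

In this toy data, "data scientist" follows "software developer" twice and "project manager" once, so `top1` contains "data scientist". The same counting scheme applies to next company prediction by swapping titles for company identifiers.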

Intended Uses

The primary objective of OpenResume is to provide an open resource for:

- Evaluating and comparing newly developed career models in a standardized manner.
- Fostering AI advancements in career trajectory modeling and job market analytics.

With its manageable size, the dataset allows for quick validation of model performance, accelerating innovation in the field. It is particularly useful for researchers who face barriers in accessing proprietary datasets.

While OpenResume is well suited to research and model development, it is not intended for commercial, real-world applications. Companies and job platforms are expected to rely on proprietary data for their operational systems. By excluding sensitive attributes such as race and gender, OpenResume reduces the risk of bias propagation during model training.

Our goal is to support transparent, open research by providing this dataset. We encourage responsible use to ensure fairness and integrity in research, particularly in the context of ethical AI practices.

Ethical and Responsible Use

The OpenResume dataset was developed with a strong emphasis on privacy and ethical considerations. Personal identifiers and company names have been anonymized, and differential privacy techniques have been applied to protect individual privacy. We expect all users to adhere to ethical research practices and respect the privacy of data subjects.

Related Work

JAMES: Normalizing Job Titles with Multi-Aspect Graph Embeddings and Reasoning
Michiharu Yamashita, Jia Tracy Shen, Thanh Tran, Hamoon Ekhtiari, and Dongwon Lee
IEEE Int'l Conf. on Data Science and Advanced Analytics (DSAA), 2023

Fake Resume Attacks: Data Poisoning on Online Job Platforms
Michiharu Yamashita, Thanh Tran, and Dongwon Lee
The ACM Web Conference 2024 (WWW), 2024

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

- The dataset will be used strictly for research purposes only.
- The request must be submitted from the official and active email address of your university, research institution, or company to allow for verification.
- You will not share the dataset with any individual or entity not included in this request.
- You will appropriately cite the paper referenced in the dataset description in any publication, project, or tool that utilizes this dataset.
- You acknowledge full responsibility for any use of the dataset and agree that the authors are not liable for any outcomes resulting from use beyond the intended purposes.

Please create a Zenodo account and submit your request after logging in. When submitting your request, please also include a statement confirming that you agree to the dataset usage conditions above.


