There is a newer version of the record available.

Published April 7, 2026 | Version 1.0.0
Dataset Open

Doctoral Theses in France (1985–2025)

  • 1. Université Paris Sciences et Lettres
  • 2. Université Paris Dauphine-PSL
  • 3. Friedrich-Alexander-Universität Erlangen-Nürnberg
  • 4. German Institute for Global and Area Studies

Description

Overview

This dataset provides a comprehensive, structured collection of doctoral theses defended in France between 1985 and 2025. Each record corresponds to a single defended PhD thesis and includes detailed information on the thesis itself, the individuals involved (author, supervisors, jury members), and associated institutions.

The dataset is provided in parquet and CSV formats. Because parquet preserves column types and loads faster, we strongly recommend using the parquet version of the dataset.
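The practical difference between the two formats can be seen with a toy round trip. The sketch below uses only the Python standard library and a made-up record (the column names `thesis_id`, `defense_year`, `is_online`, and `jury_size` are hypothetical, not taken from the codebook). Once written to CSV, every value comes back as a string and missing values become empty strings, whereas parquet stores the column types alongside the data.

```python
import csv
import io

# Made-up record; column names are illustrative, not the dataset's real schema.
record = {"thesis_id": "t0001", "defense_year": 2019, "is_online": True, "jury_size": None}

# Write the record to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# Read it back: the CSV format carries no type information.
buffer.seek(0)
restored = next(csv.DictReader(buffer))

for name, value in restored.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

With pandas and a parquet engine such as pyarrow installed, `pandas.read_parquet(path)` returns typed columns directly. With `pandas.read_csv`, types generally have to be declared or re-inferred through arguments such as `dtype` and `parse_dates`.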

Motivation

The dataset was created to support quantitative and computational analyses of doctoral training and academic networks in France. While existing platforms provide rich metadata, they are often fragmented, incomplete, or difficult to exploit at scale. This dataset addresses these limitations by:

  • Aggregating multiple authoritative sources
  • Correcting inconsistencies in identifiers and metadata
  • Enriching records with derived and external features

Data Sources

The dataset is derived from the following primary sources:

  • Thèses.fr: Main source of thesis metadata (titles, abstracts, participants, institutions); both the bulk dataset and the API were used
  • IdRef: Authority database for disambiguated person and institution identifiers
  • Thèses En Ligne (TEL): Open-access repository for thesis manuscripts and additional metadata
  • SUDOC: National academic library catalogue providing bibliographic records and identifiers (PPN)

Temporal Coverage

  • Theses defended: 1985–2025 (NB: Coverage for recent years may be incomplete due to reporting delays in source systems.)
  • Data collection date: March 31, 2026

Dataset Structure

  • Unit of observation: One row per thesis
  • Main categories of variables:
    • Thesis identifiers & status
    • Author information
    • Supervisor information
    • Jury composition
    • Institutional affiliations
    • Content & topics

The dataset includes both raw metadata and derived features.

A complete description of the features is available in the feature codebook (features_codebook.pdf).

Data Enhancements

Several transformations were applied to improve data quality:

  • Correction of invalid or inconsistent IdRef identifiers
  • Enrichment of person records using authority data
  • Gender inference based on first names (automated + manual review)
  • Feature engineering for temporal and relational analysis
  • Integration of external identifiers (TEL, SUDOC)
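As an illustration of what flagging an invalid identifier can involve, here is a minimal syntactic check. It is not the authors' correction procedure. It assumes that IdRef identifiers and SUDOC PPNs follow the 9-character pattern of eight digits followed by a digit or `X`, and it says nothing about whether an identifier actually exists or points to the right person. The function name `looks_like_ppn` and the sample values are made up.

```python
import re

# Assumption: identifiers follow the 9-character PPN pattern
# (eight digits, then a digit or "X"). This checks syntax only.
PPN_PATTERN = re.compile(r"[0-9]{8}[0-9X]")


def looks_like_ppn(value):
    """Return True if `value` is a string matching the assumed PPN pattern."""
    return isinstance(value, str) and PPN_PATTERN.fullmatch(value.strip()) is not None


# Made-up values: two well-formed, three malformed.
candidates = ["012345678", "12345678X", "2771507", "idref:012345678", None]
flagged = [c for c in candidates if not looks_like_ppn(c)]
print(flagged)
```

A check like this only separates malformed values from plausible ones. Deciding what the correct identifier should be requires looking it up in the authority database.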

Missing Data and Limitations

  • Jury information is often missing for older theses due to historical data collection practices
  • Doctoral school information is almost non-existent before 2006 due to institutional evolution
  • Gender data is incomplete and partially inferred
  • Recent years may be underrepresented due to reporting delays

Users should account for these factors, especially in longitudinal analyses.
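One way to account for these gaps is to measure, per defense year, how often a field is missing before comparing years. The sketch below does this on made-up rows with the standard library. The column names `defense_year` and `jury` are hypothetical; the real names are listed in the codebook.

```python
from collections import defaultdict

# Made-up rows; `defense_year` and `jury` are hypothetical column names.
rows = [
    {"defense_year": 1990, "jury": None},
    {"defense_year": 1990, "jury": None},
    {"defense_year": 2015, "jury": ["A", "B"]},
    {"defense_year": 2015, "jury": None},
]


def missing_share_by_year(rows, field):
    """Share of rows per defense year where `field` is None or empty."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        year = row["defense_year"]
        totals[year] += 1
        if not row.get(field):
            missing[year] += 1
    return {year: missing[year] / totals[year] for year in sorted(totals)}


shares = missing_share_by_year(rows, "jury")
print(shares)
```

Years with a high missing share can then be excluded or analysed separately, rather than being read as years with small or absent juries.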

Potential Use Cases

  • Analysis of academic networks (supervision, jury participation)
  • Study of doctoral education and career trajectories
  • Gender and diversity analyses in academia
  • Institutional collaboration and structure
  • Evolution of research fields and disciplines
  • Natural language processing on thesis content
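For the network use case, a one-row-per-thesis table can be turned into an edge list with a few lines of Python. The sketch below assumes each row exposes an author identifier and a list of supervisor identifiers. The names `author_id` and `supervisor_ids` are hypothetical, and the actual layout of the supervisor columns should be checked in the codebook.

```python
from collections import Counter

# Made-up rows; `author_id` and `supervisor_ids` are hypothetical column names.
theses = [
    {"author_id": "a1", "supervisor_ids": ["s1", "s2"]},
    {"author_id": "a2", "supervisor_ids": ["s1"]},
    {"author_id": "a3", "supervisor_ids": []},
]

# One directed edge (supervisor -> author) per supervision relationship.
edges = [
    (supervisor, thesis["author_id"])
    for thesis in theses
    for supervisor in thesis["supervisor_ids"]
]

# Out-degree: number of theses supervised by each person.
supervised_count = Counter(supervisor for supervisor, _ in edges)
print(edges)
print(supervised_count.most_common(1))
```

The resulting edge list can be passed to a graph library, and the same pattern applies to jury participation.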

The dataset is designed to be interoperable and extensible:

  • Persistent identifiers enable linkage with external datasets
  • Compatible with bibliometric, network analysis, and NLP workflows
  • Can be enriched with publication data, rankings, or full-text corpora

Ethical Considerations

  • Gender inference may introduce classification errors
  • Data represents academic individuals and should be used responsibly

Reproducibility

  • Data collected from publicly available sources
  • Code to reproduce the dataset is publicly available on GitHub
  • Data snapshot reflects the state of sources as of March 31, 2026. Future updates may improve completeness, especially for recent years.
  • The dataset's version numbers implement the Semantic Versioning 2.0.0 scheme. Files released are immutable and updates to a file shall trigger a version increment.
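When pinning or comparing dataset versions in an analysis script, version strings should be compared numerically rather than as text. The sketch below handles plain `MAJOR.MINOR.PATCH` strings only (no pre-release or build suffixes). The helper names are made up, and treating a MAJOR increment as a breaking change follows the Semantic Versioning convention; how the maintainers map it to changes in the data is an assumption here.

```python
def parse_version(version):
    """Split a plain MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def is_breaking_update(old, new):
    """True when the MAJOR component increases between two versions."""
    return parse_version(new)[0] > parse_version(old)[0]


# Tuples of integers order correctly where raw strings do not.
print(parse_version("1.10.0") > parse_version("1.9.0"))
print("1.10.0" > "1.9.0")
print(is_breaking_update("1.0.0", "2.0.0"))
```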

Citation

If you use this dataset, please cite the associated paper:

APA:

Aboucaya, W., & Jasim, D. (2026). Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions. Data in Brief, 112947. doi:10.1016/j.dib.2026.112947

BibTeX:

@article{10.1016/j.dib.2026.112947,
    title = {Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions},
    journal = {Data in Brief},
    pages = {112947},
    year = {2026},
    issn = {2352-3409},
    doi = {10.1016/j.dib.2026.112947},
    url = {https://www.sciencedirect.com/science/article/pii/S235234092600497X},
    author = {William Aboucaya and Dastan Jasim},
}

Contact

For questions, feedback, or contributions, please open an issue on the GitHub repository.

Files

features_codebook.pdf

Files (2.8 GB)

Checksum                                Size
md5:841d9f5e0a3acdc81ea5aaaa7d7d379d    175.4 kB
md5:0e6b713a92f506bfe1ed3a3e48fb6a61    1.9 GB
md5:cb6eefdec6c5bd1f416e727cb3fc56a1    929.1 MB
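The MD5 checksums listed with the files can be used to confirm that a download completed without corruption. The sketch below computes a file's MD5 in chunks, so that multi-gigabyte files do not need to fit in memory, and compares it with a checksum in the `md5:<hex>` form shown in the listing. The function names are made up, and the self-check at the end uses a throwaway temporary file rather than a dataset file.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 with a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].lower()
    return file_md5(path) == expected_hex


# Self-check on a throwaway file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = verify_download(tmp.name, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(tmp.name)
print(demo_ok)
```

For a real check, pass the local path of the downloaded file together with the checksum listed for that file.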

Additional details

Additional titles

Alternative title (English)
Doctoral Theses in France (1985–2025): A Linked Dataset of PhDs, Academic Networks, and Institutions
Translated title (French)
Thèses de doctorat en France (1985-2025) : un ensemble de données interconnectées sur les doctorats, les réseaux universitaires et les établissements

Dates

Collected
2026-03-31
Data harvesting (most importantly, queries to the Thèses.fr API) took place on this date. Theses published after this date are not included in the dataset, even if they were defended before 2026.

Software

Repository URL
https://github.com/WilliamAboucaya/phd-theses-france
Programming language
Python
Development Status
Active