Doctoral Theses in France (1985–2025)
Authors/Creators
Description
Overview
This dataset provides a comprehensive, structured collection of doctoral theses defended in France between 1985 and 2025. Each record corresponds to a single defended PhD thesis and includes detailed information on the thesis itself, the individuals involved (author, supervisors, jury members), and associated institutions.
The dataset is provided in parquet and csv formats. However, to facilitate retrieval of column types and to improve loading perfirmances, we strongly recommend to use the parquet version of the dataset.
Motivation
The dataset was created to support quantitative and computational analyses of doctoral training and academic networks in France. While existing platforms provide rich metadata, they are often fragmented, incomplete, or difficult to exploit at scale. This dataset addresses these limitations by:
- Aggregating multiple authoritative sources
- Correcting inconsistencies in identifiers and metadata
- Enriching records with derived and external features
Data Sources
The dataset is derived from the following primary sources:
- Thèses.fr: Main source of thesis metadata (titles, abstracts, participants, institutions), both the bulk dataset and API have been used
- IdRef: Authority database for disambiguated person and institution identifiers
- Thèses En Ligne (TEL): Open-access repository for thesis manuscripts and additional metadata
- SUDOC: National academic library catalogue providing bibliographic records and identifiers (PPN)
Temporal Coverage
- Theses defended: 1985–2025 (NB: Coverage for recent years may be incomplete due to reporting delays in source systems.)
- Data collection date: March 31, 2026
Dataset Structure
- Unit of observation: One row per thesis
- Main categories of variables:
- Thesis identifiers & status
- Author information
- Supervisor information
- Jury composition
- Institutional affiliations
- Content & topics
The dataset includes both raw metadata and derived features.
A complete description of the features is available in the feature codebook
Data Enhancements
Several transformations were applied to improve data quality:
- Correction of invalid or inconsistent IdRef identifiers
- Enrichment of person records using authority data
- Gender inference based on first names (automated + manual review)
- Feature engineering for temporal and relational analysis
- Integration of external identifiers (TEL, SUDOC)
Missing Data and Limitations
- Jury information is often missing for older theses due to historical data collection practices
- Doctoral school information is almost non-existent before 2006 due to institutional evolution
- Gender data is incomplete and partially inferred
- Recent years may be underrepresented due to reporting delays
Users should account for these factors, especially in longitudinal analyses.
Potential Use Cases
- Analysis of academic networks (supervision, jury participation)
- Study of doctoral education and career trajectories
- Gender and diversity analyses in academia
- Institutional collaboration and structure
- Evolution of research fields and disciplines
- Natural language processing on thesis content
The dataset is designed to be interoperable and extensible:
- Persistent identifiers enable linkage with external datasets
- Compatible with bibliometric, network analysis, and NLP workflows
- Can be enriched with publication data, rankings, or full-text corpora
Ethical Considerations
- Gender inference may introduce classification errors
- Data represents academic individuals and should be used responsibly
Reproducibility
- Data collected from publicly available sources
- Code to reproduce the dataset is publicly available on GitHub
- Data snapshot reflects the state of sources as of March 31, 2026. Future updates may improve completeness, especially for recent years.
- The dataset's version numbers implement the Semantic Versioning 2.0.0 scheme. Files released are immutable and updates to a file shall trigger a version increment.
Citation
If you use this dataset, please cite the associated paper:
APA:
Aboucaya, W., & Jasim, D. (2026). Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions. Data in Brief, 112947. doi:10.1016/j.dib.2026.112947
BibTex:
@article{10.1016/j.dib.2026.112947,
title = {Doctoral Theses in France (1985-2025): A Linked Dataset of PhDs, Academic Networks, and Institutions},
journal = {Data in Brief},
pages = {112947},
year = {2026},
issn = {2352-3409},
doi = {https://doi.org/10.1016/j.dib.2026.112947},
url = {https://www.sciencedirect.com/science/article/pii/S235234092600497X},
author = {William Aboucaya and Dastan Jasim},
}
Contact
For questions, feedback, or contributions, please send an issue to the GitHub repository.
Files
features_codebook.pdf
Additional details
Additional titles
- Alternative title (English)
- Doctoral Theses in France (1985–2025): A Linked Dataset of PhDs, Academic Networks, and Institutions
- Translated title (French)
- Thèses de doctorat en France (1985-2025) : un ensemble de données interconnectées sur les doctorats, les réseaux universitaires et les établissements
Dates
- Collected
-
2026-03-31Data harvesting (most importantly, queries to the Thèses.fr API) took place at this date. Theses published after this date, even if they were defended before 2026, are not available in the dataset.
Software
- Repository URL
- https://github.com/WilliamAboucaya/phd-theses-france
- Programming language
- Python
- Development Status
- Active