Published 2024 | Version v2
Dataset Open

DUMAS metadata for French Master students' electronic dissertations

  • 1. CY Cergy Paris Université

Description

In DUMAS, dissertations by Master's students are made available by the librarians of the higher education institution where the thesis was defended. A minimum grade, officially 16/20, is required in order to screen the quality of the archived manuscripts. In late 2023, the archive hosted 52,787 documents: Master's theses, medical theses, etc. Some of them date back to the 1940s, but these are mostly anecdotal, since the archive only gained momentum in the late 2000s. While DUMAS is the largest archive of electronic dissertations in France, specialized sites have also been created for certain disciplines; one example is the TRHAA database (Research Dissertations in Art History and Archaeology), maintained by the National Institute of Art History. This is a notable difference from doctoral dissertations, which in France are all centralized on the theses.fr site.

In France, Master's students' dissertations contribute only moderately to the advancement of scientific knowledge overall, because they are rarely published in scientific journals or even archived properly. However, when the work is of good quality, these writings are likely to provide interesting insights into their subject of study. Among other advantages, students produce this work within an academic year and often tackle current topics, quickly proposing initial avenues for reflection and literature reviews that experienced researchers can then explore further. Indeed, only a few months to a year pass between the choice of a research topic and the submission of the dissertation, which allows greater responsiveness to current themes. In contrast, the production cycle of a research article is measured in years. Furthermore, beyond their personal qualities, students in theory benefit from the supervision of an experienced researcher, who adds value to the writing in which they invest.

Hardly any research has been carried out on these dissertations, which is why we made this dataset available.

Methods (English)

For the vast majority of the theses it provides access to, the DUMAS archive lists the following metadata: author(s), title of the dissertation, type of document, defense date, type of dissertation (Master thesis, Medical thesis, etc.), men, discipline(s), keywords, language of the document, license and rights of use (e.g., Creative Commons license), unique identifier, number of pages, and the identity of both the student and their supervisor (we anonymized these data). The metadata were collected on December 24, 2023, for 52,787 manuscripts, using a web scraping script written in Python and based on the Selenium library.
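As a first check on the published file, the metadata fields listed above can be compared with the columns actually present in the CSV. The following is a minimal sketch using only the Python standard library; the function name `summarize_metadata`, the comma delimiter and the UTF-8 encoding are assumptions, since the record does not document the file layout.

```python
import csv
from collections import Counter


def summarize_metadata(path, delimiter=","):
    """Return (column_names, row_count, missing_per_column) for a CSV file.

    The delimiter and the UTF-8 encoding are assumptions; adjust them
    if the published file uses something else.
    """
    missing = Counter()
    row_count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        columns = list(reader.fieldnames or [])
        for row in reader:
            row_count += 1
            for name in columns:
                # Treat empty or whitespace-only cells as missing.
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    return columns, row_count, dict(missing)
```

Calling `summarize_metadata("dumas_Zenodo.csv")` on a local copy of the file would return the column names, the number of rows (to compare with the 52,787 manuscripts mentioned above) and the number of empty cells per column.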

To specify the discipline associated with a manuscript when posting the document, the depositor must choose from a list of predefined labels organized as a tree, and can select up to three. They can choose a label at the root of the tree, for instance Social Sciences and Humanities (SSH; SHS in French) (level 0), or a more detailed label, such as SSH/Educational Sciences (level 1). Many labels occur only a few times, such as SSH/Demography, while others, such as SSH/Educational Sciences, make up a substantial part of the database, with 5,384 entries. We grouped labels to construct meaningful categories, each comprising a reasonably large number of manuscripts (such as Life Sciences, or in French, Sciences de la Vie). Without aiming to be exhaustive, we provide some examples of these operations in the following paragraph.
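The label tree described above can be handled with a few lines of Python. This is a sketch under stated assumptions: the "/" separator is taken from the way labels are written in this description, and the function names and the per-manuscript list layout are hypothetical, not the dataset's documented encoding.

```python
from collections import Counter

MAX_LABELS = 3  # the description states that up to three labels can be selected


def parse_label(label):
    """Split a label such as 'SSH/Educational Sciences' into (root, level).

    Level 0 is a root label; level 1 is one step below the root.
    """
    parts = [part.strip() for part in label.split("/") if part.strip()]
    if not parts:
        raise ValueError("empty label")
    return parts[0], len(parts) - 1


def count_labels(records):
    """Count label occurrences over an iterable of per-manuscript label lists."""
    counts = Counter()
    for labels in records:
        if len(labels) > MAX_LABELS:
            raise ValueError("a manuscript carries more than three labels")
        counts.update(labels)
    return counts
```

For example, `parse_label("SSH/Educational Sciences")` yields the root `"SSH"` at level 1, and `count_labels` makes it easy to spot labels that occur only a handful of times.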

Using the labels, we identified 563 manuscripts in the field of computer science (level 0); although sub-disciplines were defined, we retained this level of granularity for the present analysis. A label questionably placed at the root of the tree, such as quantitative finance, is for example grouped with a level 1 label, SSH/Economy, to form the Economy and Finance category. Conversely, the level 0 Life Sciences label brings together 33,849 entries, covering both professional theses in medicine and pharmacy, which are predominant, and dissertations in plant biology, which are rarer. In this configuration, it appeared more relevant to retain level 1 categories, such as Agricultural Sciences or Veterinary Sciences, and to group together the manuscripts whose themes were poorly represented. It should be noted that we have retained certain taxonomic choices made by the designers of the database, which place disciplines such as Psychology within the broad field of social sciences and humanities (SSH), although this choice could be debated. At the scale of the entire archive, 2.9% of the dissertations were written in English, 96.9% in French, and the remaining 0.2% in various other languages (Spanish, Italian, etc.).
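The grouping examples above can be expressed as a small rule table. The sketch below is illustrative only: the mapping contains just the cases named in this description, the label spellings and the "/" separator are assumptions, and the default of falling back to the level 0 label is a simplification rather than the rule set actually used to build the dataset. Merging of poorly represented themes is not covered.

```python
# Labels merged by hand into a named category (cases taken from the description).
EXPLICIT_GROUPS = {
    "quantitative finance": "Economy and Finance",
    "ssh/economy": "Economy and Finance",
}

# Level 0 labels considered too broad, for which the level 1 label is kept.
SPLIT_ROOTS = {"life sciences"}


def categorize(label):
    """Map a raw discipline label to an analysis category."""
    cleaned = label.strip()
    key = cleaned.casefold()
    if key in EXPLICIT_GROUPS:
        return EXPLICIT_GROUPS[key]
    root, _, rest = cleaned.partition("/")
    if root.strip().casefold() in SPLIT_ROOTS and rest.strip():
        # Keep the level 1 label, e.g. 'Veterinary Sciences'.
        return rest.split("/")[0].strip()
    # Default (assumption): keep the level 0 label, as done for computer science.
    return root.strip()
```

With these rules, `categorize("Life Sciences/Veterinary Sciences")` gives `"Veterinary Sciences"`, while a computer science sub-discipline label would collapse to its level 0 label.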

Files

dumas_Zenodo.csv (20.5 MB)
md5:db7ad830a5ce69180770dd2c146d3d39
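
After downloading, the file can be checked against the MD5 checksum listed above. This is a minimal sketch using Python's standard `hashlib` module; the function names and the local file path are assumptions.

```python
import hashlib

# Checksum as listed on this record for dumas_Zenodo.csv.
EXPECTED_MD5 = "db7ad830a5ce69180770dd2c146d3d39"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

`is_intact("dumas_Zenodo.csv")` returning `False` would suggest an incomplete or altered download.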
