Published June 4, 2026 | Version v1
Presentation Open

Enhancing ICPSR's Subject Thesaurus For The Artificial Intelligence Era

  • 1. University of Michigan–Ann Arbor
  • 2. Inter-university Consortium for Political and Social Research

Description

A subject thesaurus is a controlled vocabulary that describes information resources using standardized language and hierarchical relationships between concepts. Consistent terms guide users toward preferred terminology, reduce ambiguity, and enable reliable searching and filtering across large collections.
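As a rough illustration of the structure described above, a thesaurus entry can be modeled as a preferred term plus its broader, narrower, related, and non-preferred labels. The sketch below is only an assumption about how such a record might look: the `Concept` class, its field names, and the sample terms are hypothetical and are not taken from the ICPSR thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus entry; all field names here are illustrative."""
    preferred: str
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    non_preferred: list[str] = field(default_factory=list)  # "use for" synonyms


def build_lookup(concepts):
    """Map every label (preferred or not), lowercased, to its preferred term."""
    lookup = {}
    for c in concepts:
        lookup[c.preferred.lower()] = c.preferred
        for alt in c.non_preferred:
            lookup[alt.lower()] = c.preferred
    return lookup


# Made-up sample entries, not taken from the ICPSR thesaurus.
concepts = [
    Concept("employment", narrower=["unemployment"], related=["labor markets"]),
    Concept("unemployment", broader=["employment"], non_preferred=["joblessness"]),
]
lookup = build_lookup(concepts)


def normalize(term):
    """Return the preferred term, or None if the term is outside the thesaurus."""
    return lookup.get(term.strip().lower())


print(normalize("Joblessness"))  # a synonym resolves to its preferred term
print(normalize("gig work"))     # a term outside the thesaurus has no match
```

Resolving every label to one preferred term is what lets a search or filter treat synonyms as the same concept.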

Between 1999 and 2003, ICPSR created a thesaurus of social science terms and applied those terms to its data resources. The thesaurus has grown since then, but it has not always been maintained consistently: new terms were added as needed without being applied retroactively, and subject terms from outside the thesaurus appear in many studies, most of them self-published. Efforts to align the subject thesaurus with other controlled vocabularies, such as Library of Congress Subject Headings, are hampered by differences in structure and context.

While research on ICPSR’s data curation practices has suggested that subject terms contribute significantly to usage, improvements in artificial intelligence raise questions about the best uses for subject thesauri in the context of large language models.

At ICPSR, we are working to refine and modernize our subject thesaurus to maximize its relevance in the AI discovery environment. Our first step is to explore the current state of the thesaurus and how it has changed over time. We’ll share early findings from this project, including: the number of subject terms; the distribution of terms across studies; a comparison between most-used and most-searched terms; hierarchical relationships between terms (broader, narrower, preferred, and related terms); and potentially, semantic relationships between terms (topical clustering).
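A few of the descriptive measures listed above can be sketched in a handful of lines. This is a minimal sketch, not the project's actual analysis: the study IDs, terms, and search counts are invented, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical inputs: study IDs mapped to assigned subject terms,
# a set of thesaurus terms, and a search-log tally. All values are made up.
study_terms = {
    "study-a": ["unemployment", "education"],
    "study-b": ["education", "gig work"],
    "study-c": ["education", "unemployment"],
}
thesaurus = {"unemployment", "education", "health"}
search_counts = Counter({"health": 40, "education": 25, "unemployment": 5})

# Number of studies each term is applied to (each term counted once per study).
usage = Counter(t for terms in study_terms.values() for t in set(terms))

distinct_terms = len(usage)                                  # terms in use
off_thesaurus = sorted(t for t in usage if t not in thesaurus)
unused = sorted(thesaurus - set(usage))                      # never applied


def top(counter, n):
    """Return the n highest-count terms, most frequent first."""
    return [term for term, _ in counter.most_common(n)]


most_used = top(usage, 2)
most_searched = top(search_counts, 2)

# Terms users search for often that are not among the most-applied terms.
searched_not_most_used = [t for t in most_searched if t not in most_used]

print(distinct_terms, off_thesaurus, unused)
print(most_used, most_searched, searched_not_most_used)
```

Comparing the two ranked lists is one simple way to spot gaps between how studies are described and how users search.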

Additionally, we’ll touch on the role of subject thesauri in the age of AI. How can controlled vocabularies be integrated into AI-driven discovery systems? What do LLMs do better than thesauri, and when are thesauri the most useful? Our findings will inform improvements to the ICPSR subject thesaurus, with potential applications to other repositories facing similar challenges.

Files

IASSIST subject thesaurus presentation.pdf (601.6 kB)
md5:565d51fbc98c49474560de8c2d3498da