robomustib/TextSimilarityGrader: TextSimilarityGrader: A Python Tool for Automated Fuzzy Evaluation of Speech-to-Text Transcripts in Research Contexts
Description
Abstract
In large-scale psychological and linguistic studies, manual coding of speech-to-text transcripts is time-consuming and prone to human error. Furthermore, automated transcription services (ASR) often introduce phonetic errors, typos, or misinterpretations (e.g., "Appple" instead of "Apple"), rendering exact-string-matching algorithms ineffective for automated grading.
TextSimilarityGrader is an open-source Python utility designed to solve this problem. It automates the evaluation of transcript files (JSON or TXT) against a set of expected keywords/answers. By utilizing fuzzy string matching (based on Gestalt Pattern Matching), the tool identifies correct answers even when the transcript contains spelling errors, dialect variations, or ASR artifacts. This allows for rapid, standardized scoring (0/1) of thousands of audio transcripts with high reliability.
Motivation and Problem Statement
Researchers utilizing ASR (Automatic Speech Recognition) tools like Gladia, OpenAI Whisper, or Google STT often face a "post-processing bottleneck." While the audio is transcribed quickly, verifying if a participant said a specific target word requires reading through thousands of files. Simple "Ctrl+F" search scripts fail when the ASR makes minor mistakes (e.g., transcribing "Buß" instead of "Bus").
Methodology
The software implements a multi-stage evaluation pipeline:
- Data Ingestion: The tool parses various transcript formats, including nested JSON structures (common in API outputs) and plain text.
- Normalization: Input text is cleaned (lowercased, punctuation removed, special character normalization) to ensure comparability.
- Fuzzy Logic Matching & Mathematical Foundation: The core engine uses the difflib.SequenceMatcher class, whose matching is based on the Ratcliff/Obershelp (Gestalt) pattern matching algorithm. The similarity ratio S is calculated as:
  S = (2 * M) / T
  where:
  - M is the number of matching characters.
  - T is the total number of characters in both sequences (T = len(a) + len(b)).
  This yields a normalized score S between 0.0 and 1.0, where 1.0 indicates an identical match.
- Threshold-Based Grading: A similarity threshold (default 0.75) determines validity. Score assignment is binary:
  - Score = 1 (Correct) if S ≥ 0.75
  - Score = 0 (Incorrect) if S < 0.75
  Note: A dynamic constraint is applied to short words (≤ 3 characters) to minimize false positives.
- Reporting: Results are exported to an Excel file, listing the detected word, the full context sentence, the calculated similarity score, and the final point allocation.
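The normalization, matching, and thresholding stages above can be illustrated with a minimal sketch. It is not the tool's actual code: the function names (normalize, similarity, grade) are hypothetical, the normalization rules (casefolding, stripping ASCII punctuation) are assumptions, and the short-word constraint is left out because its exact rule is not specified here. Only difflib.SequenceMatcher.ratio() and the 0.75 default come from the description.

```python
import string
from difflib import SequenceMatcher

THRESHOLD = 0.75  # default threshold stated in the description


def normalize(text: str) -> str:
    """Assumed normalization: casefold, drop ASCII punctuation, collapse whitespace."""
    text = text.casefold()  # like lower(), but also maps "ß" to "ss"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """S = 2 * M / T, as returned by difflib.SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a, b).ratio()


def grade(target: str, transcript: str, threshold: float = THRESHOLD):
    """Return (best_word, best_score, point) for the closest transcript word."""
    target_n = normalize(target)
    best_word, best_score = "", 0.0
    for word in normalize(transcript).split():
        score = similarity(target_n, word)
        if score > best_score:
            best_word, best_score = word, score
    # Binary scoring: 1 if the best similarity reaches the threshold, else 0.
    return best_word, best_score, int(best_score >= threshold)


if __name__ == "__main__":
    print(grade("Apple", "I see an Appple."))
    print(grade("Bus", "Der Buß kommt!"))
```

For "appple" against "apple", M = 5 and T = 11, so S = 10/11 ≈ 0.91 and the answer is scored 1. In this sketch, "Buß" is casefolded to "buss", giving S = 6/7 ≈ 0.86 against "bus"; without that step, "buß" against "bus" gives only 4/6 ≈ 0.67, which falls below the threshold and shows why special-character normalization precedes matching.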
Key Features
- ASR-Agnostic: Works with Gladia JSON, generic JSON, and .txt files.
- Error Tolerance: Robust against ASR hallucinations, stuttering, and phonetic misspellings.
- Batch Processing: Capable of processing thousands of files in a single run.
- Visual Validation: The output Excel sheet allows researchers to manually verify "close calls" by reviewing the similarity percentage and extracted context.
- Reproducibility: Includes a test suite (tests/) to generate mock data with intentional typos, validating the grading logic before real data processing.
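As an illustration of the mock-data idea in the last feature, the sketch below builds a few deliberately misspelled words and checks that a 0.75 threshold accepts them while rejecting an unrelated word. The helper names (duplicate_char, passes) and the test cases are hypothetical and are not taken from the repository's tests/ folder.

```python
from difflib import SequenceMatcher


def duplicate_char(word: str, index: int) -> str:
    """Simulate an ASR-style typo by doubling the character at `index`."""
    return word[: index + 1] + word[index] + word[index + 1:]


def passes(target: str, observed: str, threshold: float = 0.75) -> bool:
    """True if the lowercased strings reach the similarity threshold."""
    ratio = SequenceMatcher(None, target.lower(), observed.lower()).ratio()
    return ratio >= threshold


# (target, observed, expected outcome) triples with intentional typos.
MOCK_CASES = [
    ("Apple", duplicate_char("Apple", 1), True),    # "Appple"
    ("banana", duplicate_char("banana", 0), True),  # "bbanana"
    ("Apple", "Orange", False),                     # unrelated word
]

# One boolean per case: did the grading decision match the expectation?
results = [passes(t, o) == expected for t, o, expected in MOCK_CASES]

if __name__ == "__main__":
    print(results)
```

Running checks of this kind before processing real transcripts gives a quick signal that the chosen threshold behaves as intended on known inputs.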
Workflow
The tool operates in three steps:
- Template Generation: Scans the data folder and creates an Excel template (Solutions.xlsx).
- Definition: The researcher enters the expected target words into the Excel template.
- Evaluation: The script evaluate.py processes the files and generates Grading_Results.xlsx.
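The template-generation step could look roughly like the following sketch, which scans a folder for transcript files and prepares one row per file with an empty cell for the expected word. The function name build_template_rows and the column names "file" and "expected_word" are assumptions for illustration; the real template layout may differ.

```python
from pathlib import Path

SUPPORTED = {".json", ".txt"}  # transcript formats named in the description


def build_template_rows(data_dir):
    """List transcript files and pair each with an empty target-word cell."""
    rows = []
    for path in sorted(Path(data_dir).iterdir()):
        # Skip subfolders and files in other formats (e.g. the audio itself).
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            rows.append({"file": path.name, "expected_word": ""})
    return rows


if __name__ == "__main__":
    for row in build_template_rows("."):
        print(row)
```

With pandas and openpyxl installed, such rows can be written to a workbook with pandas.DataFrame(rows).to_excel("Solutions.xlsx", index=False), after which the researcher fills in the expected words by hand.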
Technical Implementation
- Language: Python 3.x
- Dependencies: pandas (DataFrame manipulation), openpyxl (Excel I/O).
- License: MIT License
Related Works
This tool serves as the evaluation module for the Gladia Batch Transcriber workflow but can be used independently with any text-based data source.
Files
- robomustib/TextSimilarityGrader-TextSimilarityGrader.zip (47.8 kB), md5:8125957f59109da9eecda901c582e04d
Additional details
Related works
- Is supplement to: Software, https://github.com/robomustib/TextSimilarityGrader/tree/TextSimilarityGrader
Software
- Repository URL: https://github.com/robomustib/TextSimilarityGrader