Published February 17, 2026 | Version v1.0.0
Software Open

Samuel & Audrey — YouTube Transcripts (EN) Corpus (2012–2026)

  • Samuel & Audrey Media Network

Description

The Samuel & Audrey — YouTube Transcripts (EN) Corpus is a canonical, machine-readable dataset containing the complete English transcript archive from the "Samuel and Audrey - Travel and Food Videos" YouTube channel.

Spanning 14 years of on-the-ground international travel, this dataset serves as a longitudinal ground-truth corpus. Unlike polished travel articles or synthesized text, these transcripts capture unedited human decision-making, conversational pacing, logistical planning, price mentions, food reactions, and real-world constraints. This makes it an ideal resource for researchers and developers building travel assistants that need to sound human and be grounded in real travel experience.

DATASET SNAPSHOT

  • Total Transcripts: 1,397 full-length episodic videos
  • Total Words: 2,288,859 spoken conversational tokens
  • Time Span: 14 Years (2012-09-16 to 2026-02-03)
  • Data Types: Full transcripts, cue-level RAG segments, and parallel visual metadata
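As a quick sanity check on the snapshot figures, the following minimal sketch derives the mean transcript length and the elapsed time between the first and last video. It uses only the numbers stated above; the constant and variable names are illustrative and are not part of the dataset. The elapsed time comes out somewhat under 14 full years, so the "14 Years" figure presumably counts started years rather than completed ones.

```python
from datetime import date

# Figures copied from the dataset snapshot above.
TOTAL_TRANSCRIPTS = 1_397
TOTAL_WORDS = 2_288_859
FIRST_VIDEO = date(2012, 9, 16)
LAST_VIDEO = date(2026, 2, 3)

# Average number of spoken words per transcript.
mean_words = TOTAL_WORDS / TOTAL_TRANSCRIPTS

# Elapsed time between the first and last transcript dates.
span_days = (LAST_VIDEO - FIRST_VIDEO).days
span_years = span_days / 365.25  # average year length, leap days included

print(f"Mean words per transcript: {mean_words:,.0f}")
print(f"Elapsed time: {span_days} days (about {span_years:.1f} years)")
```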

INTENDED ACADEMIC & AI USE CASES

  • Conversational AI & Voice Agents: Fine-tuning models with natural, unscripted speech patterns and uncertainty ("Should we take the bus?", "How much is this?").
  • Retrieval-Augmented Generation (RAG): Grounding LLM responses in real-world, verified travel logistics and experiences.
  • Temporal Analysis: Mapping global inflation, cost mentions, and infrastructure changes across a 14-year longitudinal signal.
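To illustrate the temporal-analysis use case, here is a minimal sketch that pulls dollar amounts out of transcript text and summarizes them per publication year. Everything about the input is an assumption: the record fields ("published", "text") and the sample sentences are invented placeholders, not the corpus schema or corpus content, and the regular expression covers only two simple ways of stating a price in dollars. A per-year median of raw mentions is a crude signal at best; a real analysis would need currency handling and some notion of what each price refers to.

```python
import re
from collections import defaultdict
from statistics import median

# Hypothetical records: field names and texts are invented placeholders,
# not taken from the corpus or its actual schema.
SAMPLE = [
    {"published": "2013-05-01", "text": "The bus was 3 dollars and lunch came to $7.50."},
    {"published": "2013-08-12", "text": "Street food here is about $2 a plate."},
    {"published": "2024-03-20", "text": "Lunch for two was $31, plus 4 dollars for coffee."},
]

# Matches "$7.50" or "3 dollars"; other currencies would need more patterns.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{1,2})?)|(\d+(?:\.\d{1,2})?)\s+dollars?\b")


def price_mentions(text):
    """Return every dollar amount mentioned in text as a float."""
    return [float(a or b) for a, b in PRICE_RE.findall(text)]


def median_price_by_year(records):
    """Group price mentions by publication year and take the median."""
    by_year = defaultdict(list)
    for rec in records:
        year = int(rec["published"][:4])
        by_year[year].extend(price_mentions(rec["text"]))
    return {year: median(vals) for year, vals in sorted(by_year.items()) if vals}


print(median_price_by_year(SAMPLE))
```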

Note: This repository represents the English (EN) linguistic subset of the overarching Samuel & Audrey Media Network corpus.

Notes (English)

LICENSE & COMMERCIAL USE: This dataset is published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. It is free for academic research, open-source experimentation, and non-commercial projects. For commercial model training or enterprise data licensing inquiries, please contact: nomadicsamuel@gmail.com

SUGGESTED BIBTEX:

    @dataset{samuel_audrey_youtube_transcripts_en,
      title={Samuel \& Audrey — YouTube Transcripts (EN) Corpus (2012–2026)},
      author={Samuel \& Audrey Media Network},
      year={2026},
      publisher={Zenodo},
      doi={10.5281/zenodo.18665704},
      note={License: CC BY-NC 4.0}
    }

Files

samuelandaudreymedianetwork/samuel-and-audrey-youtube-transcripts-en-ledger-v1.0.0.zip
