AI-Culture Commons Cultural Datasets: DOLMA + JSON + CSV
Description
AI-Culture Commons DOLMA/JSON/CSV Corpus
12 languages · CC-BY-4.0
The AI-Culture multilingual corpus contains 5K articles of philosophical and cultural content exploring the intersection of technology, artificial intelligence, and human culture, aligned across 12 languages. All translations share an identical parallel structure, with zero duplication and editor-curated quality.
This project is maintained by a non-profit digital humanities team committed to advancing humane AI through meticulously curated, thoroughly clean cultural datasets.
Dataset Overview
| File | Format | Size | Structure |
|---|---|---|---|
| ai-culture.jsonl.gz | DOLMA JSONL (gzipped) | 66 MB | 5K Plain Text + 5K Original HTML |
| ai-culture.json | JSON | 254 MB | 5K Plain Text + 5K Original HTML |
| ai-culture.csv | CSV | 460 MB | Parallel pairs: Original ↔ Translation |
Languages
Perfect machine-validated alignment across 12 languages: English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin Chinese, Hindi, Hebrew.
Content Characteristics
All our datasets guarantee four core principles:
- Extremely clean: All content is original, editor-curated text without any user comments, scraped texts, ads, tracking scripts, JavaScript, cookies, or unwanted noise. All source articles were produced by our editorial team and professionally edited.
- Transparent process: Both clean text and original HTML source are preserved in all datasets, with full pipeline documentation (see below).
- Free license: Clear free license. Usage is free for any purpose including commercial use, with attribution required only when feasible.
- Rich intellectual content: Long-form essays that foster philosophical reasoning, cultural awareness, and literary sensitivity in models. Our datasets provide models with deep philosophical-intellectual context and diverse connections between culture, philosophy, literature, and technology, particularly AI. The content curation is specifically designed to help train more intellectually critical and philosophically grounded AI models.
Pipeline & Validation
The corpus was created with an open-source pipeline [GitHub link] that:
- Processes files from local project directories (no web crawling required)
- Extracts and processes content through a multi-stage pipeline:
  - HTML files: Compacts HTML structure, extracts titles via BeautifulSoup, and converts body content to clean text using html2text with enhanced CJK character handling
  - PDF files: Reads pre-converted TXT files from Word document sources that generated the PDFs
  - Text processing: Removes control characters, normalizes Unicode (NFKC), handles bidirectional text spacing, and collapses excessive whitespace
- Runs language-aware word counting (smart algorithms for Chinese/Japanese/Korean vs. space-separated languages) and assigns domain labels based on file paths
- Generates:
  - ai-culture.jsonl.gz: DOLMA-compatible newline-delimited JSON
  - ai-culture.json: one compact record per article
  - ai-culture.csv: parallel text pairs with metadata
- Runs multi-layer integrity validation, including dataset loading, structure verification, and sample inspection across all formats. Supplementary tests check compatibility with the Hugging Face datasets library for Hub integration.
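The text-processing and word-counting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_text` and `count_words` are hypothetical, the CJK ranges (Han, Hiragana, Katakana, Hangul) are an assumption about what "language-aware" counting covers, and bidirectional text spacing is not handled here.

```python
import re
import unicodedata

# Assumption: CJK text is approximated by these Unicode ranges
# (Han, Hiragana/Katakana, Hangul syllables). The real pipeline may differ.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def clean_text(raw: str) -> str:
    """Normalize to NFKC, drop control characters, collapse excess whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep newlines and tabs; drop every other control character (category Cc).
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


def count_words(text: str) -> int:
    """Count each CJK character as one word, plus whitespace-separated tokens."""
    cjk_count = len(CJK_CHAR.findall(text))
    remainder = CJK_CHAR.sub(" ", text)
    return cjk_count + len(remainder.split())
```

Counting one word per CJK character is a common approximation for scripts without spaces; for Korean, which does use spaces, a whitespace-based count may be closer to what the pipeline reports.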
All scripts include a zero-duplicate guarantee. We maintain machine-validated alignment between languages.
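One way to check a zero-duplicate property independently is to hash each record's text and look for collisions. The sketch below is an assumption about how such a check could work, not the project's validation script; `find_duplicates` is a hypothetical helper, and whether the dataset's own `sha256` field is computed over the text or the HTML is not stated here.

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """Return {sha256_hex: [ids]} for every text that occurs more than once."""
    seen = defaultdict(list)
    for rec in records:
        # Assumption: duplicates are defined as byte-identical UTF-8 text.
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        seen[digest].append(rec["id"])
    return {digest: ids for digest, ids in seen.items() if len(ids) > 1}
```

An empty result means no two records share identical text; near-duplicates (for example, texts differing only in whitespace) would need a normalization step first.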
CC BY 4.0 Licenses
- Multicultural Project: https://degeneration-of-nation.org - Critical philosophical commentary
  - License Page: CC-BY-4.0
- Original Project: https://hitdarderut-haaretz.org - Cultural, philosophical, and literary analysis
  - License Page: CC-BY-4.0
Data Schema
DOLMA Format Schema
The DOLMA file uses newline-delimited JSON with gzip compression, compatible with RedPajama/Dolma training pipelines:
{
  "id": "en/philosophy-of-learning81",
  "text": "The First Algorithmic Era...",
  "added": "2025-08-01T14:37:12Z",
  "source": "hitdarderut-haaretz",
  "metadata": {
    "language": "en",
    "title": "An Essay on the Fermi Paradox",
    "url": "https://degeneration-of-nation.org/en/philosophy-of-learning81",
    "translation_of": "https://hitdarderut-haaretz.org/filosofia81",
    "source_format": "html",
    "domain": "philosophy",
    "license": "CC-BY-4.0",
    "timestamp": "2025-07-15T00:00:00Z",
    "word_count": 1250,
    "char_count": 7500,
    "sha256": "a1b2c3d4...",
    "html_raw": "<!DOCTYPE html>..."
  }
}
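Because the DOLMA file is gzipped newline-delimited JSON, it can be streamed one record at a time without loading everything into memory. The sketch below uses only the Python standard library; `iter_dolma` and `count_by_language` are hypothetical helper names, and the file path is whatever local copy of ai-culture.jsonl.gz you have.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_dolma(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def count_by_language(records: Iterable[dict]) -> dict:
    """Tally records by the metadata.language field shown in the schema."""
    return dict(Counter(rec["metadata"]["language"] for rec in records))
```

For example, `count_by_language(iter_dolma("ai-culture.jsonl.gz"))` would give a per-language record count, which is a quick way to confirm that all 12 languages are present.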
JSON Format Schema
{
  "id": "string",           // e.g., "he/actualia6" or "en/alternative-commentary6"
  "language": "string",     // Language code
  "title": "string",        // Article title from HTML
  "content": "string",      // Full text content without HTML
  "html": "string",         // Complete HTML source
  "url": "string",          // URL of the translated content
  "original_url": "string"  // URL of original content
}
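The `original_url` field suggests a way to collect all language versions of one article. The sketch below assumes that every translation of the same article carries the same `original_url`, which the schema implies but does not state outright; `group_by_original` is a hypothetical helper that works on any iterable of records, so it makes no assumption about the top-level layout of ai-culture.json.

```python
from collections import defaultdict

# Field names taken from the JSON schema above.
REQUIRED_FIELDS = ("id", "language", "title", "content", "html", "url", "original_url")


def group_by_original(records):
    """Map original_url -> {language: record}, checking required fields first."""
    groups = defaultdict(dict)
    for rec in records:
        missing = [field for field in REQUIRED_FIELDS if field not in rec]
        if missing:
            raise ValueError(f"record {rec.get('id')!r} is missing fields: {missing}")
        groups[rec["original_url"]][rec["language"]] = rec
    return dict(groups)
```

A group with fewer languages than expected would point to a missing translation for that article.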
CSV Format Schema
The CSV file contains parallel text pairs with the following columns:
{
  "article_code": "string",     // Unique identifier for each article
  "source_lang": "string",      // Source language code
  "target_lang": "string",      // Target language code (en, fr, de, etc.)
  "section_name": "string",     // Content section (philosophy-of-learning, culture&literature, etc.)
  "source_text": "string",      // Clean text content in source language
  "translated_text": "string",  // Clean text content in target language
  "source_html": "string",      // Complete HTML source (original)
  "translated_html": "string",  // Complete HTML source (translation)
  "source_url": "string",       // URL of original article
  "translated_url": "string"    // URL of translated article
}
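Reading the parallel pairs with Python's `csv` module might look like the sketch below. It assumes the file is UTF-8 and has a header row using the column names above; `iter_pairs` is a hypothetical helper. Since the HTML columns hold whole pages, the default per-field size limit of the `csv` module is raised first.

```python
import csv

# Full-page HTML fields can exceed the csv module's default field size limit.
# 2**31 - 1 fits in a C long on all common platforms.
csv.field_size_limit(2**31 - 1)


def iter_pairs(path, target_lang=None):
    """Yield (source_text, translated_text), optionally for one target language."""
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if target_lang is None or row["target_lang"] == target_lang:
                yield row["source_text"], row["translated_text"]
```

For example, `iter_pairs("ai-culture.csv", target_lang="fr")` would stream only the pairs whose translation is French, without holding the 460 MB file in memory.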
Related works
- Is supplement to: Software, https://github.com/AI-Culture-Commons/ai-culture-html-multilingual-datasets/tree/v1.0.3