Published August 11, 2025 | Version v7
Dataset Open

AI-Culture Commons Cultural Datasets: DOLMA + JSON + CSV

Description

AI-Culture Commons DOLMA/JSON/CSV Corpus

12 languages · CC-BY-4.0

The AI-Culture multilingual corpus contains 5K articles of philosophical and cultural content on the intersection of technology, artificial intelligence, and human culture, aligned across 12 languages. Every translation keeps the same parallel structure as its original, and all content is editor-curated with zero duplication.

This project is maintained by a non-profit digital humanities team committed to advancing humane AI through meticulously curated, thoroughly clean cultural datasets.

Dataset Overview

File | Format | Size | Structure
ai-culture.jsonl.gz | DOLMA JSONL (gzipped) | 66 MB | 5K Plain Text + 5K Original HTML
ai-culture.json | JSON | 254 MB | 5K Plain Text + 5K Original HTML
ai-culture.csv | CSV | 460 MB | Parallel pairs: Original ↔ Translation

Languages

Perfect machine-validated alignment across 12 languages: English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin Chinese, Hindi, Hebrew.

Content Characteristics

All our datasets follow four core principles:

  1. Extremely clean: All content is original, editor-curated text without any user comments, scraped texts, ads, tracking scripts, JavaScript, cookies, or unwanted noise. All source articles were produced by our editorial team and professionally edited.

  2. Transparent process: Both clean text and original HTML source are preserved in all datasets, with full pipeline documentation (see below).

  3. Free license: Usage is free for any purpose, including commercial use, with attribution required only when feasible.

  4. Rich intellectual content: Long-form essays that foster philosophical reasoning, cultural awareness, and literary sensitivity in models. Our datasets provide models with deep philosophical-intellectual context and diverse connections between culture, philosophy, literature, and technology—particularly AI. The content curation is specifically designed to help train more intellectually critical and philosophically grounded AI models.

Pipeline & Validation

The corpus was created with an open-source pipeline [GitHub link] that:

  1. Processes files from local project directories (no web crawling required)
  2. Extracts and processes content through a multi-stage pipeline:
    • HTML files: Compacts HTML structure, extracts titles via BeautifulSoup, and converts body content to clean text using html2text with enhanced CJK character handling
    • PDF files: Instead of parsing the PDFs, reads TXT files pre-converted from the Word documents that generated the PDFs
    • Text processing: Removes control characters, normalizes Unicode (NFKC), handles bidirectional text spacing, and collapses excessive whitespace
  3. Runs language-aware word counting (separate handling for Chinese/Japanese/Korean vs. space-separated languages) and assigns domain labels based on file paths
  4. Generates:
    • ai-culture.jsonl.gz – DOLMA-compatible newline-delimited JSON
    • ai-culture.json – one compact record per article
    • ai-culture.csv – parallel text pairs with metadata
  5. Runs multi-layer integrity validation across all formats, including dataset loading, structure verification, and sample inspection. It also includes supplementary compatibility tests with the Hugging Face datasets library for Hub integration.
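
The text-processing and word-counting steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline code: the function names clean_text and count_words are hypothetical, the CJK ranges are a simplified assumption (Han, kana, and Hangul syllables each counted as one word), and bidirectional-text spacing is left out.

```python
import re
import unicodedata

# Assumed character ranges: kana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
_CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def clean_text(raw: str) -> str:
    """Normalize to NFKC, drop control characters, collapse excess whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep newlines and tabs; drop every other character in Unicode category Cc.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs become one space
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def count_words(text: str) -> int:
    """Count each CJK character as one word, plus whitespace-separated tokens."""
    cjk_chars = len(_CJK.findall(text))
    remainder = _CJK.sub(" ", text)
    return cjk_chars + len(remainder.split())
```

Counting one word per CJK character is only one possible convention; a segmenter-based count would give different numbers for the same text.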

All scripts include a zero-duplicate guarantee. We maintain machine-validated alignment between languages.
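
One way a reader could spot-check the zero-duplicate property is to hash each record's text and look for collisions. The sketch below is an assumption about how such a check might look, not the project's validation script; find_duplicates is a hypothetical name, and it expects records shaped like the DOLMA entries in the Data Schema section (an "id" and a "text" field).

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """Return {sha256 digest: [ids]} for every text that appears under more than one id."""
    seen = defaultdict(list)
    for rec in records:
        # Whitespace is normalized first so trivial spacing differences still collide.
        normalized = " ".join(rec["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        seen[digest].append(rec["id"])
    return {digest: ids for digest, ids in seen.items() if len(ids) > 1}
```

An empty result means no two records share the same normalized text.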

License: CC BY 4.0

Data Schema

DOLMA Format Schema

The DOLMA file uses newline-delimited JSON with gzip compression, compatible with RedPajama/Dolma training pipelines:

{
  "id": "en/philosophy-of-learning81",
  "text": "The First Algorithmic Era...",
  "added": "2025-08-01T14:37:12Z",
  "source": "hitdarderut-haaretz",
  "metadata": {
    "language": "en",
    "title": "An Essay on the Fermi Paradox",
    "url": "https://degeneration-of-nation.org/en/philosophy-of-learning81",
    "translation_of": "https://hitdarderut-haaretz.org/filosofia81",
    "source_format": "html",
    "domain": "philosophy",
    "license": "CC-BY-4.0",
    "timestamp": "2025-07-15T00:00:00Z",
    "word_count": 1250,
    "char_count": 7500,
    "sha256": "a1b2c3d4...",
    "html_raw": "<!DOCTYPE html>..."
  }
}
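
A minimal sketch of streaming the DOLMA file with the Python standard library, assuming the record layout shown above (one JSON object per line, language stored under metadata.language). The helper names iter_dolma and count_by_language are illustrative only.

```python
import gzip
import json


def iter_dolma(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_language(path):
    """Count records per language code without loading the whole file into memory."""
    counts = {}
    for rec in iter_dolma(path):
        lang = rec["metadata"]["language"]
        counts[lang] = counts.get(lang, 0) + 1
    return counts
```

Calling count_by_language("ai-culture.jsonl.gz") on the downloaded file would give a per-language record count that can be compared across the 12 languages.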

JSON Format Schema

{
  "id": "string",          // e.g., "he/actualia6" or "en/alternative-commentary6"
  "language": "string",    // Language code
  "title": "string",       // Article title from HTML
  "content": "string",     // Full text content without HTML
  "html": "string",        // Complete HTML source
  "url": "string",         // URL of the translated content
  "original_url": "string" // URL of original content
}
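
The fields above are enough to sketch a simple alignment check: group records by original_url and report any language that is missing for an article. This assumes that all translations of one article share the same original_url value, which the schema suggests but does not state; missing_translations is a hypothetical helper, and it takes already-parsed records because the top-level layout of ai-culture.json is not described here.

```python
from collections import defaultdict


def missing_translations(records, expected_langs):
    """Map each original_url to the sorted list of expected languages it lacks."""
    by_original = defaultdict(set)
    for rec in records:
        by_original[rec["original_url"]].add(rec["language"])
    expected = set(expected_langs)
    return {
        url: sorted(expected - langs)
        for url, langs in by_original.items()
        if expected - langs
    }
```

An empty dictionary means every article has a record in each expected language.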

CSV Format Schema

The CSV file contains parallel text pairs with the following columns:

{
  "article_code": "string",    // Unique identifier for each article
  "source_lang": "string",     // Source language code
  "target_lang": "string",     // Target language code (en, fr, de, etc.)
  "section_name": "string",    // Content section (philosophy-of-learning, culture&literature, etc.)
  "source_text": "string",     // Clean text content in source language
  "translated_text": "string", // Clean text content in target language
  "source_html": "string",     // Complete HTML source (original)
  "translated_html": "string", // Complete HTML source (translation)
  "source_url": "string",      // URL of original article
  "translated_url": "string"   // URL of translated article
}
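
A minimal sketch of reading parallel pairs from the CSV with the standard csv module. It assumes the file has a header row using the column names listed above; iter_pairs is an illustrative name. The field size limit is raised because full-HTML columns can exceed the csv module's default limit of 131072 characters per field.

```python
import csv

# Raise the per-field limit; 2**31 - 1 fits a C long on all common platforms.
csv.field_size_limit(2**31 - 1)


def iter_pairs(path, target_lang=None):
    """Yield (source_text, translated_text) tuples, optionally for one target language."""
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if target_lang is None or row["target_lang"] == target_lang:
                yield row["source_text"], row["translated_text"]
```

For example, iter_pairs("ai-culture.csv", target_lang="fr") would stream only the pairs whose target language code is fr, one row at a time.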

Files (778.9 MB)

Name | Size | MD5
ai-culture.csv | 459.4 MB | f4c8a2dbbe7a89ece116e4117c1e8763
ai-culture.json | 253.7 MB | 2784cfd157929f93cd86cce8d04fc3d2
ai-culture.jsonl.gz | 65.9 MB | d10db7cb390a3f7a4f4667468851ebc9
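
After downloading, a file can be checked against the md5 values listed above. The sketch below uses only hashlib and reads in chunks so that the larger files do not need to fit in memory; md5_of and verify are illustrative names.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True when the file's digest matches the published checksum (case-insensitive)."""
    return md5_of(path) == expected_md5.strip().lower()
```

Pass the published value without the md5: prefix, for example verify(path, "d10db7cb390a3f7a4f4667468851ebc9") for the file that checksum belongs to.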
