AI-Culture Commons Cultural Datasets: DOLMA + JSON + CSV

Ben Zippor

doi:10.5281/zenodo.16789405

Published August 11, 2025 | Version v7

Dataset Open

AI-Culture Commons DOLMA/JSON/CSV Corpus

12 languages · CC-BY-4.0

The AI-Culture multilingual corpus contains 5K articles of philosophical and cultural content that explore the intersection of technology, artificial intelligence, and human culture, aligned across 12 languages. Every translation keeps the same parallel structure as its source, and the corpus is editor-curated with zero duplication.

This project is maintained by a non-profit digital humanities team committed to advancing humane AI through meticulously curated, thoroughly clean cultural datasets.

Dataset Overview

File	Format	Size	Structure
`ai-culture.jsonl.gz`	DOLMA JSONL (gzipped)	66 MB	5K Plain Text + 5K Original HTML
`ai-culture.json`	JSON	254 MB	5K Plain Text + 5K Original HTML
`ai-culture.csv`	CSV	460 MB	Parallel pairs: Original ↔ Translation

Languages

Perfect machine-validated alignment across 12 languages: English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin Chinese, Hindi, Hebrew.

Content Characteristics

All our datasets guarantee four core principles:

Extremely clean: All content is original, editor-curated text without any user comments, scraped texts, ads, tracking scripts, JavaScript, cookies, or unwanted noise. All source articles were produced by our editorial team and professionally edited.
Transparent process: Both clean text and original HTML source are preserved in all datasets, with full pipeline documentation (see below).
Free license: Use is free for any purpose, including commercial use, with attribution required only when feasible.
Rich intellectual content: Long-form essays that foster philosophical reasoning, cultural awareness, and literary sensitivity in models. Our datasets provide models with deep philosophical-intellectual context and diverse connections between culture, philosophy, literature, and technology—particularly AI. The content curation is specifically designed to help train more intellectually critical and philosophically grounded AI models.

Pipeline & Validation

The corpus was created with an open-source pipeline (https://github.com/AI-Culture-Commons/ai-culture-html-multilingual-datasets) that:

Processes files from local project directories (no web crawling required)
Extracts and processes content through a multi-stage pipeline:
- HTML files: Compacts HTML structure, extracts titles via BeautifulSoup, and converts body content to clean text using html2text with enhanced CJK character handling
- PDF files: Reads pre-converted TXT files from Word document sources that generated the PDFs
- Text processing: Removes control characters, normalizes Unicode (NFKC), handles bidirectional text spacing, and collapses excessive whitespace
Runs language-aware word counting (smart algorithms for Chinese/Japanese/Korean vs. space-separated languages) and assigns domain labels based on file paths
Generates:
- ai-culture.jsonl.gz – DOLMA-compatible newline-delimited JSON
- ai-culture.json – one compact record per article
- ai-culture.csv – parallel text pairs with metadata
Runs multi-layer integrity validation across all formats, including dataset loading, structure verification, and sample inspection. It also includes supplementary compatibility tests with the Hugging Face `datasets` library for Hub integration.
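
The text-processing and word-counting steps above can be illustrated with a minimal sketch. It is not the project's code: the function names, the character ranges, and the rule "one Han or kana character counts as one word" are assumptions made for this example, and how the real pipeline counts Korean is not stated here, so Hangul is simply left to whitespace splitting.

```python
import re
import unicodedata

# Han and kana ranges only; an assumption for illustration, not the
# ranges used by the project's scripts.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")


def clean_text(raw: str) -> str:
    """Normalize to NFKC, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep newlines and tabs; drop every other character in category Cc.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


def count_words(text: str) -> int:
    """Count each Han/kana character as one word, plus whitespace tokens."""
    cjk = len(CJK_CHAR.findall(text))
    rest = CJK_CHAR.sub(" ", text)
    return cjk + len(rest.split())
```

Counting characters for Chinese and Japanese is a crude stand-in for real segmentation, but it gives comparable orders of magnitude across languages without an external tokenizer.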

All scripts include a zero-duplicate guarantee. We maintain machine-validated alignment between languages.
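
The description does not say how the zero-duplicate guarantee is enforced. One way a reader could check it independently is to hash each record's text and look for collisions, as in the sketch below. The function name and the sample records are hypothetical; only the `id` and `text` field names come from the DOLMA schema in this record.

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """Return {sha256: [ids]} for every text that occurs more than once."""
    by_hash = defaultdict(list)
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        by_hash[digest].append(rec["id"])
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


# Hypothetical records, only to show the shape of the result.
sample = [
    {"id": "en/a1", "text": "First essay."},
    {"id": "en/a2", "text": "Second essay."},
    {"id": "en/a3", "text": "First essay."},
]
dupes = find_duplicates(sample)  # one entry listing en/a1 and en/a3
```

This only catches byte-identical texts; near-duplicates would need a fuzzier comparison.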

CC BY 4.0 Licenses

Multicultural Project: https://degeneration-of-nation.org - Critical philosophical commentary
- License Page: CC-BY-4.0
Original Project: https://hitdarderut-haaretz.org - Cultural, philosophical, and literary analysis
- License Page: CC-BY-4.0

Data Schema

DOLMA Format Schema

The DOLMA file uses newline-delimited JSON with gzip compression, compatible with RedPajama/Dolma training pipelines:

{
  "id": "en/philosophy-of-learning81",
  "text": "The First Algorithmic Era...",
  "added": "2025-08-01T14:37:12Z",
  "source": "hitdarderut-haaretz",
  "metadata": {
    "language": "en",
    "title": "An Essay on the Fermi Paradox",
    "url": "https://degeneration-of-nation.org/en/philosophy-of-learning81",
    "translation_of": "https://hitdarderut-haaretz.org/filosofia81",
    "source_format": "html",
    "domain": "philosophy",
    "license": "CC-BY-4.0",
    "timestamp": "2025-07-15T00:00:00Z",
    "word_count": 1250,
    "char_count": 7500,
    "sha256": "a1b2c3d4...",
    "html_raw": "<!DOCTYPE html>..."
  }
}
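
A minimal sketch of reading the DOLMA file with the Python standard library, assuming one JSON object per line as described above. The helper names are made up for this example.

```python
import gzip
import json
from collections import Counter


def iter_dolma(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def count_by_language(records):
    """Tally records by their metadata.language value."""
    return Counter(rec["metadata"]["language"] for rec in records)
```

For example, `count_by_language(iter_dolma("ai-culture.jsonl.gz"))` streams the file once and never holds more than one record in memory, which matters because each record also carries its raw HTML.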

JSON Format Schema

{
  "id": "string",          // e.g., "he/actualia6" or "en/alternative-commentary6"
  "language": "string",    // Language code
  "title": "string",       // Article title from HTML
  "content": "string",     // Full text content without HTML
  "html": "string",        // Complete HTML source
  "url": "string",         // URL of the translated content
  "original_url": "string" // URL of original content
}
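
The schema above can be turned into a small structural check, and `original_url` can be used to collect the translations of one article. This is a sketch under assumptions: the helper names are invented, the sample URLs are placeholders, and whether `ai-culture.json` is a single top-level array is not stated in this description.

```python
import json
from collections import defaultdict

# Field names taken from the schema above; every value is documented as a string.
FIELDS = ("id", "language", "title", "content", "html", "url", "original_url")


def check_record(rec: dict) -> list:
    """Return the problems found in one record (empty list if it fits)."""
    problems = [f"missing field: {name}" for name in FIELDS if name not in rec]
    problems += [
        f"not a string: {name}"
        for name in FIELDS
        if name in rec and not isinstance(rec[name], str)
    ]
    return problems


def group_by_original(records):
    """Map each original_url to {language: id} for the records sharing it."""
    groups = defaultdict(dict)
    for rec in records:
        groups[rec["original_url"]][rec["language"]] = rec["id"]
    return dict(groups)


# Placeholder records; the example.org URLs are not real dataset values.
sample = json.loads("""[
  {"id": "en/alternative-commentary6", "language": "en", "title": "T",
   "content": "c", "html": "<p>c</p>",
   "url": "https://example.org/en/6", "original_url": "https://example.org/6"},
  {"id": "fr/alternative-commentary6", "language": "fr", "title": "T",
   "content": "c", "html": "<p>c</p>",
   "url": "https://example.org/fr/6", "original_url": "https://example.org/6"}
]""")
groups = group_by_original(sample)
```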

CSV Format Schema

The CSV file contains parallel text pairs with the following columns:

{
  "article_code": "string", // Unique identifier for each article
  "source_lang": "string", // Source language code
  "target_lang": "string", // Target language code (en, fr, de, etc.)
  "section_name": "string", // Content section (philosophy-of-learning, culture&literature, etc.)
  "source_text": "string", // Clean text content in source language
  "translated_text": "string", // Clean text content in target language
  "source_html": "string", // Complete HTML source (original)
  "translated_html": "string", // Complete HTML source (translation)
  "source_url": "string", // URL of original article
  "translated_url": "string" // URL of translated article
}
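
A minimal sketch of streaming the parallel pairs for one target language. It assumes the CSV is comma-delimited with a header row that uses the column names listed above, which this description implies but does not state outright; `iter_pairs` is a name chosen for the example, and the demo rows are invented and abbreviated to five columns.

```python
import csv
import io

# The HTML columns can exceed the csv module's default field size limit,
# so raise it to a value that is safe on every platform.
csv.field_size_limit(2**31 - 1)


def iter_pairs(fh, target_lang):
    """Yield (source_text, translated_text) for rows in one target language."""
    for row in csv.DictReader(fh):
        if row["target_lang"] == target_lang:
            yield row["source_text"], row["translated_text"]


# Invented rows, only to show the call pattern.
demo = io.StringIO(
    "article_code,source_lang,target_lang,source_text,translated_text\r\n"
    'a1,he,en,"shalom, olam","hello, world"\r\n'
    "a1,he,fr,shalom,bonjour\r\n"
)
pairs = list(iter_pairs(demo, "en"))
```

With the real file, the same function can be called on `open("ai-culture.csv", newline="", encoding="utf-8")`; passing `newline=""` lets the csv module handle line breaks inside quoted fields itself.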

Files

Files (778.9 MB)

Name	MD5	Size
ai-culture.csv	f4c8a2dbbe7a89ece116e4117c1e8763	459.4 MB
ai-culture.json	2784cfd157929f93cd86cce8d04fc3d2	253.7 MB
ai-culture.jsonl.gz	d10db7cb390a3f7a4f4667468851ebc9	65.9 MB
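
The MD5 values listed for the three files can be used to confirm a download. The sketch below hashes a file in chunks and compares the result with the published checksum; the function names and the 1 MiB chunk size are choices made for this example.

```python
import hashlib

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "ai-culture.csv": "f4c8a2dbbe7a89ece116e4117c1e8763",
    "ai-culture.json": "2784cfd157929f93cd86cce8d04fc3d2",
    "ai-culture.jsonl.gz": "d10db7cb390a3f7a4f4667468851ebc9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at path matches the checksum listed for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("ai-culture.csv", "ai-culture.csv")` returns True only for an intact copy of the CSV. MD5 is adequate here for detecting a corrupted transfer, not for guarding against deliberate tampering.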

Additional details

Is supplement to: Software: https://github.com/AI-Culture-Commons/ai-culture-html-multilingual-datasets/tree/v1.0.3 (URL)

Repository URL: https://github.com/AI-Culture-Commons/ai-culture-html-multilingual-datasets
