Selective divergence between Grokipedia and Wikipedia articles

Mohammadi, Saeedeh; Yasseri, Taha

doi:10.5281/zenodo.19286583

Published March 28, 2026 | Version v3

Dataset Open

1. University College Dublin
2. Trinity College Dublin
3. Technological University Dublin

`main_dataset.csv`

This dataset consists of paired articles on identical topics collected from Grokipedia (G) and Wikipedia (W). Each row corresponds to a single topic and contains metadata, structural features, linguistic statistics, similarity measures, and bias/factuality scores for both sources.

Identification and Existence Flags

title: Canonical topic title.
slug: URL-friendly identifier for the topic.
exists_grokipedia: Binary indicator of whether the topic exists in Grokipedia.
exists_wikipedia: Binary indicator of whether the topic exists in Wikipedia.

Structural and Content Features

Variables prefixed with a_ refer to Grokipedia; variables prefixed with b_ refer to Wikipedia.
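As a minimal sketch of how the prefix convention can be used, the snippet below groups column names by source. The concrete column names (such as a_paragraph_count) are assumptions formed by joining the prefix to a feature name, and the in-memory CSV is a made-up stand-in for `main_dataset.csv`.

```python
import csv
import io


def split_by_source(columns):
    """Group column names by their a_ (Grokipedia) or b_ (Wikipedia) prefix."""
    groups = {"grokipedia": [], "wikipedia": [], "shared": []}
    for name in columns:
        if name.startswith("a_"):
            groups["grokipedia"].append(name)
        elif name.startswith("b_"):
            groups["wikipedia"].append(name)
        else:
            groups["shared"].append(name)
    return groups


# Tiny in-memory stand-in for main_dataset.csv; names and values are illustrative.
sample = io.StringIO(
    "title,a_paragraph_count,b_paragraph_count\n"
    "Example topic,12,15\n"
)
reader = csv.DictReader(sample)
groups = split_by_source(reader.fieldnames)
print(groups)
```

To run it against the real file, replace the `io.StringIO` object with `open("main_dataset.csv", newline="", encoding="utf-8")`.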

Document Structure

paragraph_count: Number of paragraphs.
heading_count_h1–h4: Number of headings at each HTML level.
section_count_h2_h4: Number of sections defined by H2–H4 headings.
link_count: Number of internal and external hyperlinks.
image_count: Number of embedded images.
reference_count: Number of references/citations.

Normalized Density Measures

refs_per_1k_words: References per 1,000 words.
links_per_1k_words: Links per 1,000 words.
headings_per_1k_words: Headings per 1,000 words.
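The density measures can be read as a count divided by article length and multiplied by 1,000. The sketch below shows that arithmetic; which of the word-count columns serves as the denominator is not stated in this description, so the `word_count` argument is an assumption.

```python
def per_1k_words(count, word_count):
    """Return `count` normalized per 1,000 words.

    Assumption: `word_count` is whichever word-count column the authors
    used as denominator; the description does not say which one.
    """
    if word_count <= 0:
        return 0.0  # avoid division by zero for empty articles
    return 1000.0 * count / word_count


# Example: 25 references in a 5,000-word article.
refs_per_1k = per_1k_words(25, 5000)
print(refs_per_1k)
```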

Word Counts

clean_word_count: Number of cleaned (tokenized, stopword-filtered) words.
clean_words_alpha: Alphabetic cleaned words only.
raw_visible_words_alpha: Alphabetic visible words before cleaning.

Lexical and Semantic Similarity

Lexical Similarity

lexical_tfidf_cosine: Cosine similarity between TF-IDF vectors.
lexical_jaccard_unigram: Jaccard similarity over unigram sets.
ngram_overlap_1/2/3: Overlap of unigrams, bigrams, and trigrams.
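For illustration, Jaccard similarity over n-gram sets can be sketched as below. Whitespace tokenization and the use of Jaccard for bigrams are assumptions made for this example; the exact overlap formula behind the ngram_overlap columns is not specified here.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(set_a, set_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Two toy "articles"; whitespace tokenization is an assumption.
g_tokens = "the cat sat on the mat".split()
w_tokens = "the cat lay on the mat".split()

unigram_jaccard = jaccard(ngrams(g_tokens, 1), ngrams(w_tokens, 1))
bigram_jaccard = jaccard(ngrams(g_tokens, 2), ngrams(w_tokens, 2))
print(unigram_jaccard, bigram_jaccard)
```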

Semantic Similarity

semantic_embed_cosine: Cosine similarity between sentence embeddings.
bertscore_f1: BERTScore F1 semantic similarity.
stylistic_similarity: Composite stylistic similarity metric.
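Both cosine-based metrics (over TF-IDF vectors and over sentence embeddings) rest on the same formula. The sketch below implements cosine similarity for two plain Python lists; how the vectors themselves were produced (vectorizer settings, embedding model) is not described here and is left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```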

Linguistic and Readability Features

Computed separately for Grokipedia and Wikipedia.

Syntactic and Lexical Properties

avg_sentence_len: Mean sentence length (words).
lexical_diversity: Type-token ratio.
lexical_density: Proportion of content words.
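A rough sketch of two of these properties follows: mean sentence length and the type-token ratio. The regular-expression tokenizer and sentence splitter are simplifying assumptions and will not match the output of a full NLP pipeline.

```python
import re


def tokenize(text):
    """Lowercase alphabetic tokens; a simplification of real tokenization."""
    return re.findall(r"[A-Za-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def avg_sentence_len(text):
    """Mean number of tokens per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


text = "The cat sat. The cat ran away."
ttr = type_token_ratio(tokenize(text))
mean_len = avg_sentence_len(text)
print(ttr, mean_len)
```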

Readability

flesch_kincaid: Flesch–Kincaid grade level.
gunning_fog: Gunning Fog index.
reading_time_min: Estimated reading time in minutes.
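The standard published formulas for the two readability indices can be sketched from raw counts as below. Syllable and complex-word counting is left to the caller, and the reading speed of 200 words per minute is an assumption; the speed used for reading_time_min is not given in this description.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex words are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def reading_time_min(words, words_per_minute=200):
    """Estimated reading time; the default speed is an assumed value."""
    return words / words_per_minute


# Example counts for a short passage (made-up numbers).
print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
print(gunning_fog(words=100, sentences=5, complex_words=10))
print(reading_time_min(600))
```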

POS Distributions

pos_noun, pos_verb, pos_adj, pos_adv: Proportions of part-of-speech categories.

Raw Text Statistics

char_count: Character count.
word_count: Word count.
sentence_count: Sentence count.

Topic and Clustering Metadata

topic_gpt: Topic label generated by GPT-based topic modeling.
clst_k_means: Cluster ID from k-means clustering.
topic_k_means: Human-interpretable topic label from k-means.

Bias, Leaning, and Factuality Measures

Political Leaning

The party leaning metric from this dataset was used to extract the following metrics.

leaning_Grokipedia: Estimated political leaning score for Grokipedia.
leaning_Wikipedia: Estimated political leaning score for Wikipedia.
leaning_diff_G_minus_W: Difference in leaning (Grokipedia − Wikipedia).

Bias Scores

The bias score from this dataset was used to extract the following metrics.

bias_Grokipedia: Overall bias score for Grokipedia.
bias_Wikipedia: Overall bias score for Wikipedia.
bias_diff_G_minus_W: Difference in bias (Grokipedia − Wikipedia).

Factuality

The factuality score from this dataset was used to extract the following metrics.

factual_Grokipedia: Factuality score for Grokipedia.
factual_Wikipedia: Factuality score for Wikipedia.
factual_diff_G_minus_W: Difference in factuality (Grokipedia − Wikipedia).
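The three difference columns follow the naming pattern G minus W, so a positive value means the Grokipedia score is higher. A minimal sketch of that sign convention, with missing scores passed through as None (an assumption about how gaps might be handled):

```python
def g_minus_w(grokipedia_score, wikipedia_score):
    """Difference Grokipedia - Wikipedia; None if either score is missing."""
    if grokipedia_score is None or wikipedia_score is None:
        return None
    return grokipedia_score - wikipedia_score


# Made-up scores: Grokipedia 0.25, Wikipedia 0.75.
diff = g_minus_w(0.25, 0.75)
print(diff)
```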

Combined Similarity Score

combined_score: Composite score aggregating multiple similarity metrics.
combined_score_scaled: The composite score rescaled to the range -1 to 1.
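One common way to map scores onto the range -1 to 1 is min-max rescaling, sketched below. This is an assumption for illustration only; the description does not say which scaling method produced combined_score_scaled.

```python
def scale_to_minus_one_one(values):
    """Min-max rescale a list of numbers to [-1, 1] (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: map everything to the midpoint
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


scaled = scale_to_minus_one_one([0, 5, 10])
print(scaled)
```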

`ref_domains_per_article_all.csv`

Each row represents a single referenced domain within a specific article.

Identification and Metadata

title: Title of the article in which the reference appears.
platform: Platform hosting the article (Grokipedia or Wikipedia).
rank: Rank of the domain within the article, ordered by frequency of appearance.
domain: Referenced domain name (e.g., nytimes.com, foxnews.com).

Reference Frequency Measures

count_in_article: Number of times the domain is cited within the article.
n_refs_found_article: Total number of references found in the article.
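To illustrate how rank, count_in_article, and n_refs_found_article relate, the sketch below builds per-article rows from a list of cited domains. The alphabetical tie-break, the derived `share` field, and the assumption that the article total equals the sum of the per-domain counts are all hypothetical choices made for this example.

```python
from collections import Counter


def rank_domains(domains):
    """Build one row per domain, ranked by citation frequency within an article."""
    counts = Counter(domains)
    total = sum(counts.values())
    # Sort by descending count; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {
            "rank": position,
            "domain": domain,
            "count_in_article": count,
            "n_refs_found_article": total,
            "share": count / total,  # hypothetical helper field, not in the dataset
        }
        for position, (domain, count) in enumerate(ordered, start=1)
    ]


rows = rank_domains(["nytimes.com", "foxnews.com", "nytimes.com"])
for row in rows:
    print(row)
```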

Source Quality and Ideology

These variables are extracted from this dataset.

bias: Political bias score of the domain.
factual_reporting: Factual reliability score of the domain.

These variables are extracted from this dataset.

leaning_score: Political leaning score of the domain.

Files (137.4 MB)

Name	MD5	Size
main_dataset.csv	1630d22a4cebf3a1b6e26c39f0e81d1e	15.9 MB
ref_domains_per_article_all.csv	25be63d1a0bab7a3ef01c41034ab9d46	121.5 MB
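After downloading, the MD5 checksums listed for the two files can be checked with the standard library. The sketch below assumes the files were saved under their original names; the `verify` helper is a hypothetical convenience function.

```python
import hashlib

# Checksums as listed in the record's file table.
EXPECTED_MD5 = {
    "main_dataset.csv": "1630d22a4cebf3a1b6e26c39f0e81d1e",
    "ref_domains_per_article_all.csv": "25be63d1a0bab7a3ef01c41034ab9d46",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

A call such as `verify("main_dataset.csv", "main_dataset.csv")` returns True only when the local copy is byte-identical to the published file.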

Additional details

Funding

Taighde Éireann - Research Ireland, grant IRCLA/2022/3217
Taighde Éireann - Research Ireland, grant 18/CRT/6049
