Published March 28, 2026 | Version v3
Dataset Open

Selective divergence between Grokipedia and Wikipedia articles

  • 1. University College Dublin
  • 2. Trinity College Dublin
  • 3. Technological University Dublin

Description

main_dataset.csv

This dataset consists of paired articles on identical topics collected from Grokipedia (G) and Wikipedia (W). Each row corresponds to a single topic and contains metadata, structural features, linguistic statistics, similarity measures, and bias/factuality scores for both sources. 

Identification and Existence Flags

  • title: Canonical topic title.

  • slug: URL-friendly identifier for the topic.

  • exists_grokipedia: Binary indicator of whether the topic exists in Grokipedia.

  • exists_wikipedia: Binary indicator of whether the topic exists in Wikipedia.
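A minimal sketch of how the existence flags might be used to keep only topics present on both platforms. It assumes the flags are encoded as 1/0 or True/False (the description only says "binary"), and the sample rows are invented for illustration.

```python
import csv
import io

# Invented sample rows standing in for main_dataset.csv.
SAMPLE = """title,slug,exists_grokipedia,exists_wikipedia
Example A,example-a,1,1
Example B,example-b,1,0
Example C,example-c,0,1
"""

def is_true(flag):
    # Assumption: flags are stored as 1/0 or True/False.
    return str(flag).strip().lower() in {"1", "1.0", "true"}

def paired_rows(handle):
    """Yield rows for topics that exist on both platforms."""
    for row in csv.DictReader(handle):
        if is_true(row["exists_grokipedia"]) and is_true(row["exists_wikipedia"]):
            yield row

paired = list(paired_rows(io.StringIO(SAMPLE)))
print([row["slug"] for row in paired])
```

For the real file, the same function could be given `open("main_dataset.csv", newline="", encoding="utf-8")` in place of the in-memory sample.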

Structural and Content Features

Variables prefixed with a_ refer to Grokipedia; variables prefixed with b_ refer to Wikipedia.
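As an illustration of the a_/b_ convention, the sketch below splits one row into per-platform feature dictionaries. The exact column names (for example a_paragraph_count) are assumed from the prefix rule and the feature names listed in this section, and the row values are invented.

```python
def split_by_platform(row):
    """Split one row into Grokipedia (a_), Wikipedia (b_) and shared fields."""
    grokipedia, wikipedia, shared = {}, {}, {}
    for name, value in row.items():
        if name.startswith("a_"):
            grokipedia[name[2:]] = value
        elif name.startswith("b_"):
            wikipedia[name[2:]] = value
        else:
            shared[name] = value
    return grokipedia, wikipedia, shared

# Invented example row; column names are assumed from the prefix convention.
row = {
    "title": "Example",
    "a_paragraph_count": 12,
    "b_paragraph_count": 30,
    "a_link_count": 40,
    "b_link_count": 95,
}
g, w, shared = split_by_platform(row)
print(g, w, shared)
```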

Document Structure

  • paragraph_count: Number of paragraphs.

  • heading_count_h1–h4: Number of headings at each HTML level.

  • section_count_h2_h4: Number of sections defined by H2–H4 headings.

  • link_count: Number of internal and external hyperlinks.

  • image_count: Number of embedded images.

  • reference_count: Number of references/citations.

Normalized Density Measures

  • refs_per_1k_words: References per 1,000 words.

  • links_per_1k_words: Links per 1,000 words.

  • headings_per_1k_words: Headings per 1,000 words.
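The density measures can be sketched as a simple rate per 1,000 words. The description does not say which word count serves as the denominator, so the choice below is an assumption, as is returning None for empty articles.

```python
def per_1k_words(count, word_count):
    """Return `count` normalized per 1,000 words, or None for an empty article."""
    if word_count <= 0:
        return None
    return 1000.0 * count / word_count

# Invented numbers: 25 references in a 5,000-word article.
print(per_1k_words(25, 5000))
```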

Word Counts

  • clean_word_count: Number of cleaned (tokenized, stopword-filtered) words.

  • clean_words_alpha: Alphabetic cleaned words only.

  • raw_visible_words_alpha: Alphabetic visible words before cleaning.

Lexical and Semantic Similarity

Lexical Similarity

  • lexical_tfidf_cosine: Cosine similarity between TF-IDF vectors.

  • lexical_jaccard_unigram: Jaccard similarity over unigram sets.

  • ngram_overlap_1/2/3: Overlap of unigrams, bigrams, and trigrams.
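A minimal sketch of set-based lexical overlap. The tokenizer is a deliberately simple stand-in, and treating the n-gram overlap as a Jaccard ratio over n-gram sets is an assumption; the description does not specify the exact overlap formula.

```python
import re

def tokens(text):
    # Simplistic tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def ngrams(toks, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ngram_jaccard(text_a, text_b, n):
    return jaccard(ngrams(tokens(text_a), n), ngrams(tokens(text_b), n))

# Invented sentences standing in for a Grokipedia/Wikipedia article pair.
g_text = "the cat sat on the mat"
w_text = "the cat lay on the mat"
for n in (1, 2, 3):
    print(n, round(ngram_jaccard(g_text, w_text, n), 3))
```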

Semantic Similarity

  • semantic_embed_cosine: Cosine similarity between sentence embeddings.

  • bertscore_f1: BERTScore F1 semantic similarity.

  • stylistic_similarity: Composite stylistic similarity metric.
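Both lexical_tfidf_cosine and semantic_embed_cosine are cosine similarities and differ only in how the vectors are built. The sketch below shows the cosine step alone; the vectors are invented, and the TF-IDF weighting and embedding model actually used are not stated in this description.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented vectors standing in for two article representations.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```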

Linguistic and Readability Features

Computed separately for Grokipedia and Wikipedia.

Syntactic and Lexical Properties

  • avg_sentence_len: Mean sentence length (words).

  • lexical_diversity: Type-token ratio.

  • lexical_density: Proportion of content words.
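A rough sketch of the first two properties. The sentence splitter and word pattern are naive assumptions, not the tooling used to build the dataset; lexical_density is left out because it needs a part-of-speech tagger to identify content words.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(text):
    return re.findall(r"[A-Za-z']+", text.lower())

def avg_sentence_len(text):
    """Mean number of words per sentence."""
    sents = split_sentences(text)
    return len(words(text)) / len(sents) if sents else 0.0

def type_token_ratio(text):
    """Distinct words divided by total words."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0

sample = "The cat sat. The cat ran away!"
print(avg_sentence_len(sample), round(type_token_ratio(sample), 3))
```

Type-token ratio tends to fall as texts get longer, so comparisons between articles of very different lengths should be read with that in mind.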

Readability

  • flesch_kincaid: Flesch–Kincaid grade level.

  • gunning_fog: Gunning Fog index.

  • reading_time_min: Estimated reading time in minutes.
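The two readability indices have standard published formulas, sketched below from pre-computed counts. How syllables and complex words were counted for this dataset is not stated, and the 200 words-per-minute reading speed is an assumed default, not a documented parameter.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex words are usually those with 3+ syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))

def reading_time_min(words, words_per_minute=200):
    # Assumption: 200 wpm; the rate used for the dataset is not documented here.
    return words / words_per_minute

# Invented counts: 100 words, 5 sentences, 150 syllables, 10 complex words.
print(round(flesch_kincaid_grade(100, 5, 150), 2))
print(round(gunning_fog(100, 5, 10), 2))
print(reading_time_min(1000))
```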

POS Distributions

  • pos_noun, pos_verb, pos_adj, pos_adv: Proportions of part-of-speech categories.

Raw Text Statistics

  • char_count: Character count.

  • word_count: Word count.

  • sentence_count: Sentence count.

Topic and Clustering Metadata

  • topic_gpt: Topic label generated by GPT-based topic modeling.

  • clst_k_means: Cluster ID from k-means clustering.

  • topic_k_means: Human-interpretable topic label from k-means.

Bias, Leaning, and Factuality Measures

Political Leaning

The party leaning metric from this dataset was used to extract the following metrics. 

  • leaning_Grokipedia: Estimated political leaning score for Grokipedia.

  • leaning_Wikipedia: Estimated political leaning score for Wikipedia.

  • leaning_diff_G_minus_W: Difference in leaning (Grokipedia − Wikipedia).

Bias Scores

The bias score from this dataset was used to extract the following metrics. 

  • bias_Grokipedia: Overall bias score for Grokipedia.

  • bias_Wikipedia: Overall bias score for Wikipedia.

  • bias_diff_G_minus_W: Bias difference between sources.

Factuality

The factuality score from this dataset was used to extract the following metrics. 

  • factual_Grokipedia: Factuality score for Grokipedia.

  • factual_Wikipedia: Factuality score for Wikipedia.

  • factual_diff_G_minus_W: Difference in factuality.
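The three *_diff_G_minus_W columns share one sign convention: the Grokipedia score minus the Wikipedia score. A small sketch with invented scores is shown below; returning None when either score is missing is an assumption about how gaps could be handled.

```python
def g_minus_w(grokipedia_score, wikipedia_score):
    """Grokipedia minus Wikipedia; None if either score is missing."""
    if grokipedia_score is None or wikipedia_score is None:
        return None
    return grokipedia_score - wikipedia_score

def add_differences(row):
    """Return the three difference columns for one row (dict of scores)."""
    return {
        f"{name}_diff_G_minus_W": g_minus_w(
            row.get(f"{name}_Grokipedia"), row.get(f"{name}_Wikipedia")
        )
        for name in ("leaning", "bias", "factual")
    }

# Invented scores for illustration only.
row = {
    "leaning_Grokipedia": 0.25, "leaning_Wikipedia": -0.5,
    "bias_Grokipedia": 1.5, "bias_Wikipedia": 1.0,
    "factual_Grokipedia": None, "factual_Wikipedia": 2.0,
}
print(add_differences(row))
```

A positive value means the Grokipedia article scores higher on that measure than its Wikipedia counterpart.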

Combined Similarity Score

  • combined_score: Composite score aggregating multiple similarity metrics.

  • combined_score_scaled: The combined_score rescaled to the range [-1, 1].
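One common way to map a score onto the range from -1 to 1 is min-max rescaling, sketched below. This is an assumption for illustration; the description does not state which scaling method or which component weights were used to build the combined score.

```python
def scale_to_signed_unit(values):
    """Min-max rescale a list of numbers to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the midpoint.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

# Invented composite scores.
print(scale_to_signed_unit([0.0, 5.0, 10.0]))
```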

 

ref_domains_per_article_all.csv

Each row represents a single referenced domain within a specific article.

Identification and Metadata

  • title: Title of the article in which the reference appears.

  • platform: Platform hosting the article (Grokipedia or Wikipedia).

  • rank: Rank of the domain within the article, ordered by frequency of appearance.

  • domain: Referenced domain name (e.g., nytimes.com, foxnews.com).

Reference Frequency Measures

  • count_in_article: Number of times the domain is cited within the article.

  • n_refs_found_article: Total number of references found in the article.
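The sketch below shows how rows of this shape could be produced from an article's reference URLs. The URLs are invented, stripping a leading "www." is an assumption about domain normalization, and the tie-breaking rule for equal counts in the actual file is not documented.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    host = urlparse(url).netloc.lower()
    # Assumption: a leading "www." is dropped when naming the domain.
    return host[4:] if host.startswith("www.") else host

def rank_domains(title, platform, reference_urls):
    """Build one row per domain, ranked by how often it is cited."""
    counts = Counter(domain_of(u) for u in reference_urls)
    total = len(reference_urls)
    rows = []
    for rank, (domain, count) in enumerate(counts.most_common(), start=1):
        rows.append({
            "title": title,
            "platform": platform,
            "rank": rank,
            "domain": domain,
            "count_in_article": count,
            "n_refs_found_article": total,
        })
    return rows

# Invented reference URLs for a hypothetical article.
urls = [
    "https://www.nytimes.com/a",
    "https://nytimes.com/b",
    "https://www.foxnews.com/c",
]
rows = rank_domains("Example", "Wikipedia", urls)
for r in rows:
    print(r["rank"], r["domain"], r["count_in_article"])
```

Dividing count_in_article by n_refs_found_article gives the share of an article's references that point to a given domain.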

 

Source Quality and Ideology

These variables are extracted from this dataset.

  • bias: Political bias score of the domain.

  • factual_reporting: Factual reliability score of the domain.

These variables are extracted from this dataset.

  • leaning_score: Political leaning score of the domain.
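Domain-level scores can be summarized per article in several ways. The sketch below uses a citation-count-weighted mean as one hypothetical option; it is not claimed to be how the article-level scores in main_dataset.csv were derived, and the values are invented.

```python
def weighted_domain_score(rows, field):
    """Mean of `field` over domains, weighted by count_in_article.

    Domains with a missing score are skipped; returns None if none are scored.
    """
    total_weight = 0
    weighted_sum = 0.0
    for row in rows:
        value = row.get(field)
        if value is None:
            continue
        weighted_sum += value * row["count_in_article"]
        total_weight += row["count_in_article"]
    return weighted_sum / total_weight if total_weight else None

# Invented rows for one article; the third domain has no bias score.
article_rows = [
    {"domain": "a.example", "count_in_article": 3, "bias": 1.0},
    {"domain": "b.example", "count_in_article": 1, "bias": -1.0},
    {"domain": "c.example", "count_in_article": 2, "bias": None},
]
print(weighted_domain_score(article_rows, "bias"))
```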

 

Files

main_dataset.csv

Files (137.4 MB)

  • 15.9 MB (md5:1630d22a4cebf3a1b6e26c39f0e81d1e)
  • 121.5 MB (md5:25be63d1a0bab7a3ef01c41034ab9d46)
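After downloading, a file can be checked against its listed MD5 checksum. The sketch below uses only the Python standard library; the file path passed to it is up to the reader, since the listing above does not show which checksum belongs to which file name.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("main_dataset.csv", "md5:...")` with the appropriate checksum from the list should return True for an intact download.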

Additional details

Funding

Taighde Éireann - Research Ireland
IRCLA/2022/3217
Taighde Éireann - Research Ireland
18/CRT/6049