Selective divergence between Grokipedia and Wikipedia articles
Description
main_dataset.csv
This dataset consists of paired articles on identical topics collected from Grokipedia (G) and Wikipedia (W). Each row corresponds to a single topic and contains metadata, structural features, linguistic statistics, similarity measures, and bias/factuality scores for both sources.
Identification and Existence Flags
- title: Canonical topic title.
- slug: URL-friendly identifier for the topic.
- exists_grokipedia: Binary indicator of whether the topic exists in Grokipedia.
- exists_wikipedia: Binary indicator of whether the topic exists in Wikipedia.
Structural and Content Features
Variables prefixed with a_ refer to Grokipedia; variables prefixed with b_ refer to Wikipedia.
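As a minimal sketch of how the prefix convention can be used, the snippet below splits one row into a Grokipedia part and a Wikipedia part. It assumes the prefix is joined directly to the variable name (for example a_paragraph_count), which is not confirmed by the description; the helper name split_by_source and the sample values are hypothetical.

```python
import csv
import io


def split_by_source(row):
    """Split one row dict into Grokipedia (a_) and Wikipedia (b_) feature dicts.

    Assumption: per-source columns are named a_<feature> and b_<feature>.
    """
    grok = {k[2:]: v for k, v in row.items() if k.startswith("a_")}
    wiki = {k[2:]: v for k, v in row.items() if k.startswith("b_")}
    return grok, wiki


# Tiny in-memory stand-in for main_dataset.csv; the values are made up.
sample = io.StringIO(
    "title,a_paragraph_count,b_paragraph_count,a_link_count,b_link_count\n"
    "Example topic,12,20,30,85\n"
)
row = next(csv.DictReader(sample))
grok, wiki = split_by_source(row)
```

To apply the same idea to the real file, the in-memory sample can be replaced by an open handle on main_dataset.csv passed to csv.DictReader.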
Document Structure
- paragraph_count: Number of paragraphs.
- heading_count_h1–h4: Number of headings at each HTML level.
- section_count_h2_h4: Number of sections defined by H2–H4 headings.
- link_count: Number of internal and external hyperlinks.
- image_count: Number of embedded images.
- reference_count: Number of references/citations.
Normalized Density Measures
- refs_per_1k_words: References per 1,000 words.
- links_per_1k_words: Links per 1,000 words.
- headings_per_1k_words: Headings per 1,000 words.
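The density measures above are rates per 1,000 words. A minimal sketch of that normalization follows; which word count serves as the denominator in the dataset is not stated, so the word_count argument here is an assumption, as is the helper name per_1k_words.

```python
def per_1k_words(count, word_count):
    """Return occurrences per 1,000 words, or NaN when there are no words."""
    if word_count <= 0:
        return float("nan")
    return 1000.0 * count / word_count


# Example: 45 references in a 3,000-word article.
refs_rate = per_1k_words(45, 3000)
```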
Word Counts
- clean_word_count: Number of cleaned (tokenized, stopword-filtered) words.
- clean_words_alpha: Alphabetic cleaned words only.
- raw_visible_words_alpha: Alphabetic visible words before cleaning.
Lexical and Semantic Similarity
Lexical Similarity
- lexical_tfidf_cosine: Cosine similarity between TF-IDF vectors.
- lexical_jaccard_unigram: Jaccard similarity over unigram sets.
- ngram_overlap_1/2/3: Overlap of unigrams, bigrams, and trigrams.
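To illustrate the set-based measures, here is a small sketch of Jaccard similarity over n-gram sets. The tokenizer (lowercased alphabetic runs) and the use of Jaccard for the bigram case are assumptions made for illustration; the exact overlap definition behind ngram_overlap_1/2/3 is not given in the description.

```python
import re


def tokens(text):
    """Lowercase the text and keep alphabetic runs (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(toks, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


g = tokens("The cat sat on the mat")
w = tokens("The cat lay on the mat")
unigram_jaccard = jaccard(ngrams(g, 1), ngrams(w, 1))
bigram_jaccard = jaccard(ngrams(g, 2), ngrams(w, 2))
```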
Semantic Similarity
- semantic_embed_cosine: Cosine similarity between sentence embeddings.
- bertscore_f1: BERTScore F1 semantic similarity.
- stylistic_similarity: Composite stylistic similarity metric.
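Both lexical_tfidf_cosine and semantic_embed_cosine are cosine similarities; only the vectors differ. The sketch below shows the cosine computation on plain Python lists. The embedding or TF-IDF model that produced the dataset's vectors is not specified, so the inputs here are placeholder numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```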
Linguistic and Readability Features
Computed separately for Grokipedia and Wikipedia.
Syntactic and Lexical Properties
- avg_sentence_len: Mean sentence length (words).
- lexical_diversity: Type-token ratio.
- lexical_density: Proportion of content words.
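A rough sketch of the first two properties is shown below, assuming a naive regular-expression sentence splitter and word tokenizer; the dataset's actual tokenization is not described, so results on real text may differ. Lexical density is omitted because it needs a part-of-speech tagger.

```python
import re


def split_sentences(text):
    """Naive sentence split on ., ! and ? (an assumption, not the dataset's method)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text):
    """Lowercased alphabetic tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def avg_sentence_len(text):
    """Mean number of words per sentence."""
    sentences = split_sentences(text)
    return len(split_words(text)) / len(sentences) if sentences else 0.0


def type_token_ratio(text):
    """Distinct words divided by total words."""
    words = split_words(text)
    return len(set(words)) / len(words) if words else 0.0


sample_text = "The cat sat. The cat ran away!"
mean_len = avg_sentence_len(sample_text)
ttr = type_token_ratio(sample_text)
```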
Readability
- flesch_kincaid: Flesch–Kincaid grade level.
- gunning_fog: Gunning Fog index.
- reading_time_min: Estimated reading time in minutes.
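For reference, the standard Flesch–Kincaid grade and Gunning Fog formulas can be written directly from word, sentence, syllable, and complex-word counts, as sketched below. How the dataset counted syllables and complex words is not stated, and the reading speed of 200 words per minute is an assumed default rather than the value used by the authors.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Standard Gunning Fog index; complex words are usually those with 3+ syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def reading_time_min(words, words_per_minute=200):
    """Estimated reading time; the 200 wpm default is an assumption."""
    return words / words_per_minute


fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
fog = gunning_fog(words=100, sentences=5, complex_words=10)
minutes = reading_time_min(1000)
```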
POS Distributions
- pos_noun, pos_verb, pos_adj, pos_adv: Proportions of part-of-speech categories.
Raw Text Statistics
- char_count: Character count.
- word_count: Word count.
- sentence_count: Sentence count.
Topic and Clustering Metadata
- topic_gpt: Topic label generated by GPT-based topic modeling.
- clst_k_means: Cluster ID from k-means clustering.
- topic_k_means: Human-interpretable topic label from k-means.
Bias, Leaning, and Factuality Measures
Political Leaning
The party leaning metric from this dataset was used to extract the following metrics.
- leaning_Grokipedia: Estimated political leaning score for Grokipedia.
- leaning_Wikipedia: Estimated political leaning score for Wikipedia.
- leaning_diff_G_minus_W: Difference in leaning (Grokipedia − Wikipedia).
Bias Scores
The bias score from this dataset was used to extract the following metrics.
- bias_Grokipedia: Overall bias score for Grokipedia.
- bias_Wikipedia: Overall bias score for Wikipedia.
- bias_diff_G_minus_W: Bias difference between sources.
Factuality
The factuality score from this dataset was used to extract the following metrics.
- factual_Grokipedia: Factuality score for Grokipedia.
- factual_Wikipedia: Factuality score for Wikipedia.
- factual_diff_G_minus_W: Difference in factuality.
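The three *_diff_G_minus_W columns follow the same pattern: the Grokipedia score minus the Wikipedia score. A small sketch for recomputing or sanity-checking them from a row is given below; the helper name g_minus_w and the sample values are hypothetical.

```python
def g_minus_w(row, metric):
    """Return <metric>_Grokipedia minus <metric>_Wikipedia for one row dict."""
    return float(row[f"{metric}_Grokipedia"]) - float(row[f"{metric}_Wikipedia"])


# Made-up row using the column names listed above.
row = {
    "leaning_Grokipedia": "0.25",
    "leaning_Wikipedia": "-0.5",
    "bias_Grokipedia": "1.5",
    "bias_Wikipedia": "1.0",
}
leaning_diff = g_minus_w(row, "leaning")
bias_diff = g_minus_w(row, "bias")
```

A positive difference means the Grokipedia score is higher than the Wikipedia score for that topic.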
Combined Similarity Score
- combined_score: Composite score aggregating multiple similarity metrics.
- combined_score_scaled: Composite score aggregating multiple similarity metrics, scaled to the range −1 to 1.
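The description does not say how combined_score is mapped to the range −1 to 1. One common choice is min-max rescaling, sketched below purely as an assumption; the authors may have used a different transformation.

```python
def scale_to_minus1_plus1(values):
    """Min-max rescale a list of numbers to [-1, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


scaled = scale_to_minus1_plus1([0.0, 5.0, 10.0])
```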
ref_domains_per_article_all.csv
Each row represents a single referenced domain within a specific article.
Identification and Metadata
- title: Title of the article in which the reference appears.
- platform: Platform hosting the article (Grokipedia, Wikipedia).
- rank: Rank of the domain within the article, ordered by frequency of appearance.
- domain: Referenced domain name (e.g., nytimes.com, foxnews.com).
Reference Frequency Measures
- count_in_article: Number of times the domain is cited within the article.
- n_refs_found_article: Total number of references found in the article.
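The per-article rows can be pictured as the output of counting reference domains and ranking them by frequency. The sketch below reproduces that shape from a list of reference URLs. Stripping a leading "www.", breaking ties alphabetically, and the function name rank_domains are assumptions for illustration, not documented behavior of the dataset.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(reference_urls):
    """Count domains in one article's references and rank them by frequency."""
    domains = []
    for url in reference_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: www.example.com and example.com are merged
        if host:
            domains.append(host)
    counts = Counter(domains)
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {
            "rank": rank,
            "domain": domain,
            "count_in_article": count,
            "n_refs_found_article": len(domains),
        }
        for rank, (domain, count) in enumerate(ordered, start=1)
    ]


rows = rank_domains([
    "https://www.nytimes.com/a",
    "https://nytimes.com/b",
    "https://www.foxnews.com/c",
])
```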
Source Quality and Ideology
These variables are extracted from this dataset.
- bias: Political bias score of the domain.
- factual_reporting: Factual reliability score of the domain.
These variables are extracted from this dataset.
- leaning_score: Political leaning score of the domain.
Files
Total size: 137.4 MB
| MD5 | Size |
|---|---|
| 1630d22a4cebf3a1b6e26c39f0e81d1e | 15.9 MB |
| 25be63d1a0bab7a3ef01c41034ab9d46 | 121.5 MB |
Additional details
Funding
- Taighde Éireann - Research Ireland
- IRCLA/2022/3217
- Taighde Éireann - Research Ireland
- 18/CRT/6049