Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar

Roberto, Ulloa

doi:10.5281/zenodo.10636247

Published February 8, 2024 | Version v1

Dataset Open

Roberto, Ulloa (Contact person)

Main dataset (main.csv)

The main file contains one entry per search result across all collected pages (N=28530). It comprises the following columns:

id: Unique identifier of the file (corresponds to the last part of the filename)
filename: Name of the file associated with the row (the file is in serp_html.zip)
engine: The search engine used (Google Scholar or Semantic Scholar)
browser: The web browser used for the search (Firefox or Chrome)
region: The geographical region where the search was made
year: The year when the search was made
month: The month when the search was made
day: The day when the search was made
query: The full search query that was used
query_type: The type of the search query (health or technology)
topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
trt: Treatment variable associated with the search (benefits or risks)
url: The URL of the (article) search result
title: The title of the (article) search result
authorship: The author(s) of the (article) search result
abstract_id: Unique identifier for the abstract of the (article) search result, which links to annotated-abstracts_v0.6.xlsx
abstract_hash: Hash value of the abstract, for data integrity
link_n: The total number of results on the search page
rank: The rank of the search result on the search engine results page
annotation: The annotation associated with the (article's abstract) search result. One of: '1. Confirms benefits', '2. Confirms risks', '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '5. Abstract not related to {topic}'
valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts containing only benefits
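
As a minimal sketch of how these columns can be combined, the following Python snippet averages valence per engine and trt using only the standard library. The function names (load_rows, mean_valence) and the three sample rows are illustrative assumptions and are not taken from the dataset or its repository. The sketch also assumes that main.csv is comma-separated with a header row, and it skips rows whose valence is empty or non-numeric.

```python
import csv
from collections import defaultdict


def load_rows(path="main.csv"):
    """Read main.csv into a list of dicts (assumes comma-separated data with a header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def mean_valence(rows):
    """Average the valence column for every (engine, trt) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (engine, trt) -> [sum, count]
    for row in rows:
        try:
            value = float(row["valence"])
        except (KeyError, TypeError, ValueError):
            continue  # assumption: rows without a usable valence are skipped
        key = (row["engine"], row["trt"])
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows that only mimic the documented columns.
sample_rows = [
    {"engine": "Google Scholar", "trt": "benefits", "valence": "1"},
    {"engine": "Google Scholar", "trt": "benefits", "valence": "0"},
    {"engine": "Semantic Scholar", "trt": "risks", "valence": "-1"},
]
print(mean_valence(sample_rows))
```

With the real file, mean_valence(load_rows("main.csv")) would give one average per engine and treatment, provided the column values match the descriptions above.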

Annotated abstracts (annotated-abstracts_v0.6.xlsx)

Manually annotated abstracts resulting from the searches.

Raw search engine result pages (serp_html.zip)

The zip contains one HTML file per collected search engine result page (N=2853). See the filename column of the main dataset.
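
The link between the filename column and the archive could be checked along the following lines. This is a sketch under assumptions: the helper names (html_names_in_zip, missing_pages) are hypothetical, the pages are assumed to carry an .html extension, and names are compared by base name in case the archive stores them inside a folder. A tiny in-memory archive with made-up file names stands in for serp_html.zip.

```python
import io
import posixpath
import zipfile


def html_names_in_zip(zip_source):
    """Return the base names of all .html members in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return {
            posixpath.basename(name)
            for name in archive.namelist()
            if name.lower().endswith(".html")
        }


def missing_pages(filenames, zip_source):
    """Return the listed file names that have no matching HTML file in the archive."""
    available = html_names_in_zip(zip_source)
    return sorted({posixpath.basename(name) for name in filenames} - available)


# In-memory stand-in for serp_html.zip (made-up file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("serp_html/page_a.html", "<html></html>")
buffer.seek(0)
print(missing_pages(["page_a.html", "page_b.html"], buffer))
```

For the real data, the first argument would be the distinct values of the filename column and the second the path to serp_html.zip; an empty list would mean every row points at a page present in the archive.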

Files (493.8 MB)

Name	MD5	Size
annotated-abstracts_v0.6.xlsx	3362fb62f275d65c4b98b6857df220f9	168.8 kB
main.csv	13f2c8b84be54d5e8e187868a7523eca	15.9 MB
serp_html.zip	19093853f17068435fd50aba340d76b8	477.7 MB
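
The MD5 checksums listed for the three files allow a download to be verified. A minimal sketch with Python's hashlib follows; the helper names (md5_of_stream, md5_of_file, verify_download) are illustrative and the local paths passed to them are up to the reader.

```python
import hashlib
import io

# Checksums as listed for the files of this record.
EXPECTED_MD5 = {
    "annotated-abstracts_v0.6.xlsx": "3362fb62f275d65c4b98b6857df220f9",
    "main.csv": "13f2c8b84be54d5e8e187868a7523eca",
    "serp_html.zip": "19093853f17068435fd50aba340d76b8",
}


def md5_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are not loaded into memory at once."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hexadecimal MD5 digest of the file at the given path."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def verify_download(path, name):
    """Compare a local file against the checksum listed for the given record file name."""
    return md5_of_file(path) == EXPECTED_MD5[name]


# Sanity check on an empty stream, whose MD5 digest is well known.
print(md5_of_stream(io.BytesIO(b"")))  # d41d8cd98f00b204e9800998ecf8427e
```

For example, verify_download("serp_html.zip", "serp_html.zip") should return True for an intact copy of the archive saved in the working directory.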

Additional details

Is supplement to: Preprint: arXiv:2311.09969 (arXiv)

Repository URL: https://github.com/robertour/search-scholar/
Programming language: R, Python

	All versions	This version
Views	123	123
Downloads	116	116
Data volume	14.4 GB	14.4 GB
