Published February 8, 2024 | Version v1
Dataset Open

Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar

Description

Main dataset (main.csv)

The main file contains one entry per search result across all collected pages (N=28530). It comprises the following columns:

  1. id: Unique identifier of the file (corresponds to the last part of the filename)
  2. filename: Name of the file associated with the row (the file is in serp_html.zip)
  3. engine: The search engine used (Google Scholar or Semantic Scholar)
  4. browser: The web browser used for the search (Firefox or Chrome)
  5. region: The geographical region where the search was made
  6. year: The year when the search was made
  7. month: The month when the search was made
  8. day: The day when the search was made
  9. query: The full search query that was used
  10. query_type: The type of the search query (health or technology)
  11. topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
  12. trt: Treatment variable associated with the search (benefits or risks)
  13. url: The URL of the (article) search result
  14. title: The title of the (article) search result
  15. authorship: The author(s) of the (article) search result
  16. abstract_id: Unique identifier for the abstract of the (article) search result, which links to annotated-abstracts_v0.6.xlsx
  17. abstract_hash: Hash value of the abstract, for data integrity
  18. link_n: The total number of results in the search page
  19. rank: The rank of the search result on the search engine results page
  20. annotation: The annotation assigned to the abstract of the (article) search result. One of: '1. Confirms benefits', '2. Confirms risks', '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '5. Abstract not related to {topic}'
  21. valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts containing only benefits
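
As a minimal sketch of how the column list above might be used, the following Python snippet reads the main file with the standard csv module, checks that every documented column is present, and averages valence per group. It assumes that main.csv is comma-delimited with a header row using exactly the column names listed above, and that missing valence values appear as empty or non-numeric text (for example NA); neither detail is confirmed by this description. The helper names load_rows and mean_valence are hypothetical.

```python
import csv
import math

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "id", "filename", "engine", "browser", "region", "year", "month", "day",
    "query", "query_type", "topic", "trt", "url", "title", "authorship",
    "abstract_id", "abstract_hash", "link_n", "rank", "annotation", "valence",
]


def load_rows(handle):
    """Read all rows from an open CSV handle after checking the header.

    Assumption: the file has a header row with the documented names.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def _to_number(text):
    """Return text as a float, or None if it is empty, NaN, or non-numeric."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def mean_valence(rows, key):
    """Average the valence column per distinct value of `key`.

    Rows whose valence cannot be parsed as a number are skipped
    (assumption about how missing values are encoded).
    """
    totals = {}
    for row in rows:
        value = _to_number(row["valence"])
        if value is None:
            continue
        running_sum, count = totals.get(row[key], (0.0, 0))
        totals[row[key]] = (running_sum + value, count + 1)
    return {group: s / n for group, (s, n) in totals.items()}
```

With the file downloaded, something like `with open("main.csv", newline="", encoding="utf-8") as f: rows = load_rows(f)` followed by `mean_valence(rows, "engine")` or `mean_valence(rows, "trt")` would give a first overview; the encoding is also an assumption.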

Annotated abstracts (annotated-abstracts_v0.6.xlsx)

Manually annotated abstracts resulting from the searches.

Raw search engine result pages (serp_html.zip)

The zip archive contains one HTML file per collected search engine result page (N=2853). See the filename column of the main dataset.
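
The link between the filename column and the archive could be checked along the following lines. This is a sketch under assumptions: it supposes that the archive members end in .html and that the values in the filename column match the members' base names, ignoring any folder prefix inside the zip. The internal layout of serp_html.zip is not described here, and the function names are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath


def html_names_in_zip(zip_source):
    """Return the base names of all .html members of the archive.

    `zip_source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return {
            PurePosixPath(name).name
            for name in archive.namelist()
            if name.lower().endswith(".html")
        }


def compare_filenames(csv_filenames, zip_source):
    """Compare filename values from main.csv with the archive contents.

    Returns two sorted lists: names referenced in the CSV but absent from
    the zip, and names present in the zip but never referenced.
    Assumption: matching on base names is sufficient.
    """
    referenced = {PurePosixPath(name).name for name in csv_filenames}
    archived = html_names_in_zip(zip_source)
    return sorted(referenced - archived), sorted(archived - referenced)
```

If the assumptions hold, both returned lists should be empty for the published files, and the number of distinct names should equal the number of collected pages.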

Files

Files (493.8 MB)

Checksum Size
md5:3362fb62f275d65c4b98b6857df220f9 168.8 kB
md5:13f2c8b84be54d5e8e187868a7523eca 15.9 MB
md5:19093853f17068435fd50aba340d76b8 477.7 MB
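
A downloaded file can be checked against the md5 values listed above. The sketch below uses Python's standard hashlib module and reads the file in chunks so that the largest file does not have to fit in memory. The function names are hypothetical, and the checksum to compare against has to be picked by the reader for the file in question.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file with a listed value such as 'md5:0123...'.

    The optional 'md5:' prefix is dropped and case is ignored.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_checksum("main.csv", "md5:...")` with the value listed for that file should return True for an intact download.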

Additional details

Related works

Is supplement to
Preprint: arXiv:2311.09969 (arXiv)

Software

Repository URL
https://github.com/robertour/search-scholar/
Programming language
R, Python