Published February 8, 2024 | Version v1
Dataset
Open
Examining bias perpetuation in academic search engines: an algorithm audit of Google and Semantic Scholar
Authors/Creators
Description
Main dataset (main.csv)
The main file contains one entry per search result across all collected pages (N=28530). It comprises the following columns:
- id: Unique identifier of the file (corresponds to the last part of the filename)
- filename: Name of the file associated with the row (the file is in serp_html.zip)
- engine: The search engine used (Google Scholar or Semantic Scholar)
- browser: The web browser used for the search (Firefox or Chrome)
- region: The geographical region where the search was made
- year: The year when the search was made
- month: The month when the search was made
- day: The day when the search was made
- query: The full search query that was used
- query_type: The type of the search query (health or technology)
- topic: The topic associated with the search query ('covid vaccines', 'cryptocurrencies', 'internet', 'social media', 'vaccines', 'coffee')
- trt: Treatment variable associated with the search (benefits or risks)
- url: The URL of the (article) search result
- title: The title of the (article) search result
- authorship: The author(s) of the (article) search result
- abstract_id: Unique identifier for the abstract of the (article) search result, which links to the corresponding entry in annotated-abstracts_v0.6.xlsx
- abstract_hash: Hash value of the abstract for data integrity
- link_n: The total number of results on the search page
- rank: The rank of the search result on the search engine results page
- annotation: The annotation assigned to the abstract of the (article) search result. One of: '1. Confirms benefits', '2. Confirms risks', '3. Confirms both benefits and risks', '4. Confirms neither benefits nor risks', '5. Abstract not related to {topic}'
- valence: -1 for abstracts containing risks, 0 for neutral abstracts, 1 for abstracts containing only benefits
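As an illustration of how the columns above might be used, the following minimal sketch averages valence within each combination of engine and trt using only the Python standard library. The function name mean_valence_by_group, the sample rows, and the handling of missing values (empty or non-numeric cells are skipped) are assumptions made for this example; they are not taken from the dataset or its repository.

```python
import csv
import io
import math
from collections import defaultdict


def mean_valence_by_group(csv_file, keys=("engine", "trt")):
    """Average the valence column within each combination of `keys`.

    Rows whose valence is empty or non-numeric are skipped. How missing
    values are actually encoded in main.csv is an assumption here.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        try:
            valence = float(row["valence"])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(valence):
            continue
        group = tuple(row[k] for k in keys)
        totals[group] += valence
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}


# Made-up rows that only mimic the documented column names.
sample = io.StringIO(
    "engine,trt,valence\n"
    "Google Scholar,benefits,1\n"
    "Google Scholar,benefits,0\n"
    "Google Scholar,risks,-1\n"
    "Semantic Scholar,risks,\n"
)
result = mean_valence_by_group(sample)
for group, mean in sorted(result.items()):
    print(group, mean)
```

With the real file, the same function could be called on an open file handle, for example `open("main.csv", newline="", encoding="utf-8")`; the encoding is an assumption.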
Annotated abstracts (annotated-abstracts_v0.6.xlsx)
Manually annotated abstracts resulting from the searches.
Raw search engine result pages (serp_html.zip)
The zip archive contains one HTML file per collected search engine result page (N=2853). Each file is referenced by the filename column of the main dataset.
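The link between a row of the main dataset and its raw page can be sketched as follows. The helper read_serp_html is hypothetical, and the internal layout of serp_html.zip is not documented here, so the sketch matches archive members by base name in case the files sit inside a folder. The demo builds a small in-memory archive with a made-up file name instead of reading the real one.

```python
import io
import posixpath
import zipfile


def read_serp_html(archive, filename):
    """Return the raw bytes of the archive member whose base name is `filename`.

    `archive` may be a path or a binary file-like object. Matching on the
    base name is an assumption about how the zip is organised.
    """
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if posixpath.basename(member) == filename:
                return zf.read(member)
    raise KeyError(f"{filename!r} not found in archive")


# Build a tiny stand-in archive; the member name is invented for the demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("serp_html/example_0001.html", "<html><body>demo</body></html>")
buffer.seek(0)

html = read_serp_html(buffer, "example_0001.html")
print(len(html), "bytes read")
```

With the real data, `archive` would be the path to serp_html.zip and `filename` a value taken from the filename column of main.csv.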
Files
main.csv
Additional details
Related works
- Is supplement to
- Preprint: arXiv:2311.09969 (arXiv)
Software
- Repository URL
- https://github.com/robertour/search-scholar/
- Programming language
- R, Python