This Pozsgai_et_al_search_inconsistency_Readme.txt file was generated on 2021-09-17 by Gabor Pozsgai



GENERAL INFORMATION

1. Title of Dataset: Irreproducibility in searches of scientific literature: A comparative analysis

2. Author Information
	A. Principal Investigator Contact Information
		Name: Gabor Pozsgai
		Institution: Fujian Agriculture and Forestry University
		Address: 15 Shangxiadian Road, Cangshan, Fuzhou, Fujian, China 350002
		Email: pozsgaig@coleoptera.hu

	B. Associate or Co-investigator Contact Information
		Name: Gábor L. Lövei
		Institution: Aarhus University
		Address: Department of Agroecology, Flakkebjerg Research Centre, Aarhus University, 4200 Slagelse, Denmark

	C. Associate or Co-investigator Contact Information
		Name: Liette Vasseur
		Institution: Brock University
		Address: UNESCO Chair on Community Sustainability: From Local to Global, Dept. Biol. Sci., Brock University, Canada

	D. Associate or Co-investigator Contact Information
		Name: Geoff Gurr
		Institution: Charles Sturt University
		Address: Graham Centre for Agricultural Innovation (Charles Sturt University and NSW Department of Primary Industries), Orange NSW 2800, Australia

	E. Associate or Co-investigator Contact Information
		Name: Péter Batáry
		Institution: University of Goettingen
		Address: Agroecology, University of Goettingen, 37077 Goettingen, Germany

	F. Associate or Co-investigator Contact Information
		Name: János Korponai
		Institution: Eötvös Lórand University
		Address: Eötvös Lórand University, 1053 Budapest, Hungary

	G. Associate or Co-investigator Contact Information
		Name: Nick A. Littlewood
		Institution: University of Cambridge
		Address: University of Cambridge, Cambridge, UK

	H. Associate or Co-investigator Contact Information
		Name: Jian Liu
		Institution: Charles Sturt University
		Address: Graham Centre for Agricultural Innovation (Charles Sturt University and NSW Department of Primary Industries), Orange NSW 2800, Australia

	I. Associate or Co-investigator Contact Information
		Name: Arnold Móra
		Institution: University of Pécs
		Address: University of Pécs, 7622 Pécs, Hungary

	J. Associate or Co-investigator Contact Information
		Name: John Obrycki
		Institution: University of Kentucky
		Address: University of Kentucky, Lexington, USA

	K. Associate or Co-investigator Contact Information
		Name: Olivia Reynolds
		Institution: cesar
		Address: cesar, 293 Royal Parade, Parkville, Victoria 3052, Australia

	L. Associate or Co-investigator Contact Information
		Name: Jenni A. Stockan
		Institution: The James Hutton Institute
		Address: Department of Ecological Sciences, The James Hutton Institute, Aberdeen, UK

	M. Associate or Co-investigator Contact Information
		Name: Heather VanVolkenburg
		Institution: Brock University
		Address: UNESCO Chair on Community Sustainability: From Local to Global, Dept. Biol. Sci., Brock University, Canada

	N. Associate or Co-investigator Contact Information
		Name: Jie Zhang
		Institution: Fujian Agriculture and Forestry University
		Address: 15 Shangxiadian Road, Cangshan, Fuzhou, Fujian, China 350002

	O. Associate or Co-investigator Contact Information
		Name: Wenwu Zhou
		Institution: Zhejiang University
		Address: State Key Laboratory of Rice Biology, Key Laboratory of Molecular Biology of Crop Pathogens and Insects, Ministry of Agriculture, Zhejiang University, Hangzhou, China

	P. Associate or Co-investigator Contact Information
		Name: Minsheng You
		Institution: Fujian Agriculture and Forestry University
		Address: 15 Shangxiadian Road, Cangshan, Fuzhou, Fujian, China 350002
		Email: msyou@fafu.edu.cn



3. Date of data collection (single date, range, approximate date): 2018-11-07

4. Geographic location of data collection: Worldwide (China, Denmark, Canada, Australia, Germany, Hungary, United Kingdom, USA)  

5. Information about funding sources that supported the collection of the data: 
Operational Funds of the 111 Program by the State Administration of Foreign Experts Affairs, PR China KRA16001A
Operational Funds of the State Key Laboratory by Fujian Provincial Administration of Science and Technology, PR China KJG18001A
Higher Education Institutional Excellence Programme of the Ministry of Human Capacities in Hungary 20765‐3/2018/FEKUTSTRAT
“Innovation for sustainable and healthy living and environment” thematic programme of the University of Pecs TUDFO/47138/2019‐ITM


SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: 

2. Links to publications that cite or use the data: Pozsgai, G., Lövei, G. L., Vasseur, L., Gurr, G., Batáry, P., Korponai, J., Littlewood, N. A., Liu, J., Móra, A., Obrycki, J., Reynolds, O., Stockan, J. A., VanVolkenburg, H., Zhang, J., Zhou, W., & You, M. (2020). Irreproducibility in searches of scientific literature: a comparative analysis. Ecology and Evolution https://doi.org/10.1002/ece3.8154

3. Links to other publicly accessible locations of the data: 

4. Links/relationships to ancillary data sets: 

5. Was data derived from another source? yes
	A. If yes, list source(s): 
Web of Science
Scopus
PubMed
Google Scholar

6. Recommended citation for this dataset: 
Pozsgai et al. (2021): Irreproducibility in searches of scientific literature: a comparative analysis. https://doi.org/10.5061/dryad.djh9w0w17

DATA & FILE OVERVIEW

1. File List: 
ecol_search_hits.csv: The number of result hits for each search on the queried search platforms. 

ecol_search_twenty.csv: The first twenty papers returned by each search on the queried search platforms. 


ecol_search_lines.csv: The exact search expressions for each search conducted.


2. Relationship between files, if important: ecol_search_lines.csv is linked to ecol_search_hits.csv and ecol_search_twenty.csv through the 'Search_line' field.
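
The link through 'Search_line' can be sketched as below. This is a minimal illustration in Python, assuming that 'Search_line' values match exactly between files; the two inline tables are hypothetical stand-ins for ecol_search_lines.csv and ecol_search_twenty.csv, and their values are invented for the example rather than taken from the dataset.

```python
import csv
import io

# Hypothetical miniature stand-ins for ecol_search_lines.csv and
# ecol_search_twenty.csv; the values are illustrative only.
LINES_CSV = """Search_line,Search_engine,Keyword_complexity
1,Scopus,simple_KWE
2,PubMed,complex_KWE
"""
TWENTY_CSV = """Search_line,Affiliation,Title
1,Institution_A,Example paper one
1,Institution_B,Example paper two
2,Institution_A,Example paper three
"""


def join_on_search_line(lines_text, records_text):
    """Attach the search description to every record via 'Search_line'."""
    lines = {
        row["Search_line"]: row
        for row in csv.DictReader(io.StringIO(lines_text))
    }
    joined = []
    for record in csv.DictReader(io.StringIO(records_text)):
        merged = dict(lines[record["Search_line"]])  # copy the search description
        merged.update(record)                        # add the record's own fields
        joined.append(merged)
    return joined


rows = join_on_search_line(LINES_CSV, TWENTY_CSV)
for row in rows:
    print(row["Search_line"], row["Search_engine"], row["Title"])
```

To use the real files, replace the io.StringIO objects with file handles opened on the downloaded .csv files.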

3. Additional related data collected that was not included in the current data package: none

4. Are there multiple versions of the dataset? no


METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: 
Three major scientific search platforms (PubMed, Scopus, and Web of Science) and Google Scholar were used in this study. We generated keyword expressions (search strings) at two complexity levels, using keywords that focused on an ecological topic, and ran standardized searches from various institutions around the world (see below), all within a limited timeframe.

Simple search strings contained only one main key phrase and no logical (Boolean) operators, whereas complex ones used Boolean operators to combine inclusion and exclusion criteria for additional, related keywords and key phrases (i.e. two-word expressions within quotation marks). The simple search string was “ecosystem services”, while the complex one was “ecosystem service” AND “promoting” AND “crop” NOT “livestock”. Search language was set to English in every case, and only titles, abstracts and keywords were searched. Since Google Scholar offers no option to limit the search to titles, keywords, and abstracts, we used the default search in this case. Since different search platforms use slightly different expressions for the same query, exact search term formats were generated for each search.

Searches were conducted on one or two machines at each of the 12 institutions in Australia, Canada, China, Denmark, Germany, Hungary, UK, and the USA (Supplementary material 2), using three commonly used browsers (Mozilla Firefox, Internet Explorer, and Google Chrome). Searches were run manually (i.e. no APIs were used) according to strict protocols, which allowed standardization of the search date, the exact search term for every run, and the data recording procedure. Not all platforms were queried from every location: Google products are not available in China, and Scopus was not available at some institutions (Supplementary material 2). The original version of the protocol is provided in Supplementary material 3. The first search run was conducted at 11:00 Australian Eastern Standard Time (01:00 GMT) on 13 April 2018 and the last at 18:16 Eastern Daylight Time (22:16 GMT) on 13 April 2018. After each search run, the number of hits was recorded and the bibliographic data of the first 20 articles were extracted and saved in a file format that the website offered (.csv, .txt). Once all search combinations were completed, the browsers’ cache was emptied to make sure the testers’ previous searches did not influence the results, and the process was repeated. At four locations (Flakkebjerg, Denmark; Fuzhou, China; St. Catharines, Canada; Orange, Australia) the searches were also repeated on two different computers. This resulted in 228, 132, 228, and 144 search runs for Web of Science, Scopus, PubMed, and Google Scholar, respectively.
https://doi.org/10.1002/ece3.8154
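
The local-to-GMT conversions quoted for the first and last search runs can be checked with a short Python sketch. It assumes fixed offsets of UTC+10 for Australian Eastern Standard Time and UTC-4 for Eastern Daylight Time; the function name to_gmt is hypothetical and is not part of the authors' workflow.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets assumed for the two time zones named in the text.
AEST = timezone(timedelta(hours=10), "AEST")  # Australian Eastern Standard Time
EDT = timezone(timedelta(hours=-4), "EDT")    # Eastern Daylight Time


def to_gmt(local_dt):
    """Convert a timezone-aware local datetime to GMT (UTC)."""
    return local_dt.astimezone(timezone.utc)


first_run = to_gmt(datetime(2018, 4, 13, 11, 0, tzinfo=AEST))
last_run = to_gmt(datetime(2018, 4, 13, 18, 16, tzinfo=EDT))
window = last_run - first_run  # time elapsed between first and last run

print(first_run.strftime("%Y-%m-%d %H:%M"), "GMT")
print(last_run.strftime("%Y-%m-%d %H:%M"), "GMT")
print(window)
```

Under these assumptions, all search runs fall within a single window of under 24 hours.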

2. Methods for processing the data: 
Results were collected from each contributor. Bibliographic information was automatically extracted from the identically structured saved files using a loop in the R statistical software (R Core Team, 2012) and stored in a standardized MySQL database, allowing unique publications to be distinguished. If unique identifiers for individual articles were missing, authors, titles, or the combination of these were searched for, and uniqueness was double-checked across the entire dataset. Saved data files with non-standard structures were dealt with manually. All data cleaning and manipulations were done in R.
https://github.com/pozsgaig/search_location

3. Instrument- or software-specific information needed to interpret the data: 

R Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria. Retrieved from http://www.r-project.org/, Version 3.5.3, accessed 11/03/2019

Villacorta, P. J. (2018). welchADF: Welch-James Statistic for robust hypothesis testing under heterocedasticity and non-normality. Retrieved from https://cran.r-project.org/package=welchADF. Version 0.3, accessed 11/03/2019.

4. Standards and calibration information, if appropriate: 

5. Environmental/experimental conditions: 

6. Describe any quality-assurance procedures performed on the data: Duplicated papers were removed.

7. People involved with sample collection, processing, analysis and/or submission: 
All authors.

DATA-SPECIFIC INFORMATION FOR: ecol_search_hits.csv

1. Number of variables: 17

2. Number of cases/rows: 542

3. Variable List: 
"Search_line": Unique identifier for each search, links data to unique searches through ecol_search_lines.csv.
"Search_engine": The platform used to conduct search. Values can be "Scopus", "PubMed", "WoS" (Web of Science), and "GScholar" (Google Scholar).
"Topic": Unused variable
"Keyword_complexity": Two levels of complexity in the search expression. Values "simple_KWE" and "complex_KWE"
"Browser": Internet browser the search was conducted with.
"Cache": Whether or not cache was emptied during search. Values can be "Cache" and "Cleaned_cache".
"Search.term": The exact search term used.
"Affiliation": Affiliation hosting the server the search ran through.
"Search_date": The date on which the search was conducted.
"Search_start_time": The local time the search process was started at.
"Search_end_time": The local time the search process was terminated at.
"Time_zone": The local time zone the search was conducted in.
"Number_hits": The number of search hits yielded from one particular search (data unit).
"Computer": Identifier for the replicating computer.
"Replica": If search was replicated from the same computer, this is a unique identifier for the replication.
"GMT_Search_start_time": The time the search process was started at, converted to Greenwich Mean Time.
"GMT_Search_end_time": The time the search process was terminated at, converted to Greenwich Mean Time.
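
As an illustration of how these variables might be used, the Python sketch below groups 'Number_hits' by 'Search_engine' and 'Keyword_complexity' and flags the groups in which identical searches returned different hit counts. The inline sample rows and the function name hit_ranges are hypothetical; the numbers are invented and are not results from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using four of the columns described above.
SAMPLE = """Search_engine,Keyword_complexity,Affiliation,Number_hits
WoS,simple_KWE,Institution_A,1000
WoS,simple_KWE,Institution_B,1000
GScholar,simple_KWE,Institution_A,5000
GScholar,simple_KWE,Institution_B,5200
"""


def hit_ranges(csv_text):
    """Return {(platform, complexity): (min_hits, max_hits)}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Search_engine"], row["Keyword_complexity"])
        groups[key].append(int(row["Number_hits"]))
    return {key: (min(hits), max(hits)) for key, hits in groups.items()}


ranges = hit_ranges(SAMPLE)
# A group is inconsistent when the same search gave different hit counts.
inconsistent = [key for key, (low, high) in ranges.items() if low != high]
print(inconsistent)
```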

4. Missing data codes: 
No missing data

5. Specialized formats or other abbreviations used: 

DATA-SPECIFIC INFORMATION FOR: ecol_search_twenty.csv

1. Number of variables: 15


2. Number of cases/rows: 8108

3. Variable List: 
"Author": Authors of the article.
"Title": Title of the article.
"Publication": The journal the article was published in.
"Publication_Year": The year in which the article was published.
"DOI": DOI of the article.
"Own_ID": The unique identifier given to the article by the search platform. 
"PM_ID": PubMed identifier of the article.
"Search_line": An identifier linking data to unique searches through ecol_search_lines.csv.
"Affiliation": Affiliation hosting the server the search ran through.
"Computer": Identifier for the replicating computer.
"Replica": If search was replicated from the same computer, this is a unique identifier for the replication.
"First_auth": First author of the paper.
"V2": Simplified title of the paper
"all_id": Unique identifier of the paper, consisting of merged strings of the first author, year of publication, and title of the paper.
"num_id": A unique numerical identifier for the paper. 
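
The relationship between "First_auth", "Publication_Year", "V2", "all_id", and "num_id" can be sketched in Python as follows. The exact simplification and merging rules used for the dataset are not stated in this file, so the normalisation below (lowercasing and dropping non-alphanumeric characters), the underscore separator, and all function names are assumptions made for illustration only.

```python
import re


def simplify_title(title):
    """Assumed simplification: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())


def make_all_id(first_author, year, title):
    """Merge first author, year, and simplified title into one key (assumed format)."""
    return f"{first_author}_{year}_{simplify_title(title)}"


def assign_num_ids(all_ids):
    """Give each distinct key a running integer, in order of first appearance."""
    mapping = {}
    for key in all_ids:
        mapping.setdefault(key, len(mapping) + 1)
    return mapping


# Hypothetical records: the same paper exported with different title formatting.
a = make_all_id("Smith", 2015, "Ecosystem services: a review")
b = make_all_id("Smith", 2015, "Ecosystem Services - A Review")
c = make_all_id("Jones", 2017, "Crop pollination")
num_ids = assign_num_ids([a, b, c])
print(a == b, num_ids[c])
```

A composite key of this kind lets the same article be recognised across platforms that export slightly different title formatting and do not share a common identifier.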


4. Missing data codes: "", NA
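
When reading this file outside R, both missing data codes need to be handled explicitly. The Python sketch below maps the empty string and NA to None; the inline sample rows and the function name read_with_missing are hypothetical.

```python
import csv
import io

# The two missing data codes listed above.
MISSING_CODES = {"", "NA"}

# Hypothetical sample rows using three of the columns described above.
SAMPLE = """Title,DOI,PM_ID
Paper one,10.1000/example,NA
Paper two,,12345
"""


def read_with_missing(csv_text):
    """Read CSV text, replacing missing data codes with None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {name: (None if value in MISSING_CODES else value)
             for name, value in row.items()}
        )
    return rows


records = read_with_missing(SAMPLE)
print(records)
```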


5. Specialized formats or other abbreviations used: 


DATA-SPECIFIC INFORMATION FOR: ecol_search_lines.csv

1. Number of variables: 7

2. Number of cases/rows: 48

3. Variable List: 
"Search_line": Unique identifier for each search, links unique searches to data.
"Search_engine": The platform used to conduct search. Values can be "Scopus", "PubMed", "WoS" (Web of Science), and "GScholar" (Google Scholar).
"Topic": Unused variable
"Keyword_complexity": Two levels of complexity in the search expression. Values "simple_KWE" and "complex_KWE"
"Browser": Internet browser the search was conducted with.
"Cache": Whether or not cache was emptied during search. Values can be "Cache" and "Cleaned_cache".
"Search.term": The exact search term used.


4. Missing data codes: 
No missing data

5. Specialized formats or other abbreviations used: 
