## Raw data for "COVID-19 evidence syntheses with artificial intelligence: an empirical study of systematic reviews"
## --------------------------------
## !!! DEPENDENCIES: "wget", "curl", "jq", "fdupes", "rename", and common GNU/Linux or MinGW commands (grep, sed, ls, wc...). Tested with "brew" on macOS 11.
##
##
## The following commands were used to download and analyse the systematic reviews included in our article.
##
## STEP 1. Go to https://app.iloveevidence.com/loves/5e6fdb9669c00e4ac072701d > Advanced search > Make an empty search > Filter results by "Systematic review" (leave the "Reporting data?" option empty) > "Epistemonikos date" from 01 Dec 2019 to 15 Aug 2021 > Export results > A file named "References.ris" is downloaded (you need to sign up and be logged in). The original file used in the analysis is provided.
##
## STEP 2. Execute the first block of commands in the folder where "References.ris" is located. It will produce several files (also provided: "dois.txt", "json.txt", "url.txt"), as well as a folder named "files" with all downloaded reviews. Hashes for our version of the downloaded reviews are provided in "SHASUM.asc"; you can re-download the reviews from the links in "url.txt" by executing the 4th command.
##
## Comment: Each review is named after the line of "url.txt" where its download link can be found: file "1.pdf" corresponds to the first link, "2.pdf" to the second, and so on. Be aware that the results might not be exactly reproducible due to changes in the original files at the hosting servers, in the Unpaywall database, or in the "References.ris" file.
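##
## Comment: the naming scheme can be inverted to recover a review's download link. A minimal sketch with a fabricated two-line "url-sample.txt" (with the real data, use "url.txt" instead):

```shell
# Illustrative only: file N.pdf was fetched from line N of url.txt,
# so the source link of a given file is simply that line.
printf 'https://example.org/a.pdf\nhttps://example.org/b.pdf\n' > url-sample.txt
f='2.pdf'                            # hypothetical review filename
sed -n "${f%.pdf}p" url-sample.txt   # prints "https://example.org/b.pdf"
```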
##
## STEP 3. Download OpenSemanticSearch (opensemanticsearch.org) and add the "files" folder. Let it index the files (this will take several hours, even on a capable machine).
##
## STEP 4. Execute the commands in the second block to get "search-results.txt" (each keyword followed by the files where it appears) and "search-files.txt" (only the matching files, without repetitions); both are also provided.
##
## The third block of commands provides 1) the articles whose publication date matches the one given on standard input (in YYYY-MM-DD format), and 2) the title, publication date and filename for the DOIs given on standard input (for example, DOI "10.3390/ijerph18115888" returns filename 3036.pdf).
##
## Finally, the fourth block of commands provides the numbers used to calculate Figure 1's data (some additions and subtractions are required, as the "Duplicated" numbers were consolidated in the published figure, but the totals add up).
##
## Statistical tests are also reproducible. The data file is "stats-data.csv", and commands and results can be found in "stats-tests.txt". Install "R", import the data with 'data <- read.csv("/PATH/TO/stats-data.csv")', and execute the provided commands in any order.
##
## If you have any doubts, please contact jrterceroh@gmail.com. Thank you.


### GET LINKS AND DOWNLOAD FILES ###
# Extract the DOIs from the RIS export, strip carriage returns, and deduplicate
grep 'DO  - ' References.ris|sed 's/^.*- //'|sed -e 's/\r//g'|sort|uniq -i > dois.txt
# Query Unpaywall for each DOI (one JSON record per line)
while IFS= read -r i; do curl -s -f "https://api.unpaywall.org/v2/$i?email=medicina@ugr.es" >> json.txt; done <dois.txt
# Take the best open-access PDF link, falling back to the first alternative location
jq -r '.best_oa_location.url_for_pdf // ([.oa_locations[].url_for_pdf // empty]|.[0]) // empty' json.txt | sort | uniq > url.txt
# Download every link; file names follow the line numbers of url.txt
mkdir files; num=1; while IFS= read -r i; do wget "$i" -O "files/$num.pdf"; let num++; done < url.txt
### EXTRACT BROKEN FILES AND RETRY ###
# Delete duplicate downloads, quarantine non-PDF files, and rename HTML error pages
# (xargs -J is BSD xargs; with GNU xargs use "-I % mv % broken/" instead)
fdupes -dN files ; mkdir broken ; grep -L '^%PDF-' files/* | xargs -J % mv % broken/ ; grep -l -i '<!DOCTYPE html' broken/* | rename 's/\.pdf$/.html/'
# Retry the downloads that are now missing from "files"
count=1; while IFS= read -r i; do [ ! -e "files/$count.pdf" ] && wget "$i" -O "files/$count.pdf"; let count++; done <url.txt


### SEARCH COMMANDS, FOR LISTS OF RESULTS AND FILES (execute in the OpenSemanticSearch VM) ###
# For each keyword (one per line in keywords.txt), query the Solr core and list the matching files
while IFS= read -r i; do echo "$i"; curl -s "localhost:8983/solr/opensemanticsearch/select?q=($(echo "$i"|sed 's/ /%20/g' | sed 's/.*/"&"/g'))&fl=path_basename_s&wt=json&rows=10000" | jq -r '.response.docs[].path_basename_s';echo; done <keywords.txt >search-results.txt
# Same queries, but keep only the deduplicated list of matching files
while IFS= read -r i; do curl -s "localhost:8983/solr/opensemanticsearch/select?q=($(echo "$i"|sed 's/ /%20/g' | sed 's/.*/"&"/g'))&fl=path_basename_s&wt=json&rows=10000" | jq -r '.response.docs[].path_basename_s'; done <keywords.txt | sort | uniq >search-files.txt
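# Comment: a minimal sketch of what the inner pipeline above does to each keyword before it reaches Solr (spaces become %20 and the whole phrase is wrapped in quotes for an exact-phrase query):

```shell
# Illustrative only: reproduce the query-building step for one keyword.
i='machine learning'                             # hypothetical keyword
echo "$i" | sed 's/ /%20/g' | sed 's/.*/"&"/g'   # prints "machine%20learning" (quotes included)
```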


### GET POSSIBLE CONTROLS (execute in the data folder) ###
# Reads publication dates (YYYY-MM-DD) from standard input; prints the title and DOI of every article published on each date
while IFS= read -r i; do echo "$i"; jq -r 'select(.published_date=="'"$i"'")|.title,.doi' json.txt ; echo; done
# Reads DOIs from standard input; prints each article's title and publication date, then the matching line number of url.txt (i.e. its filename)
while IFS= read -r i; do jq 'select(.doi=="'"$i"'")|.title,.published_date' json.txt; jq -r 'select(.doi=="'"$i"'")|.best_oa_location.url_for_pdf // ([.oa_locations[].url_for_pdf // empty]|.[0]) // empty' json.txt|sort|uniq|xargs -J % grep -i -n "%" url.txt|cut -f1 -d:; echo; done
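# Comment: a minimal sketch of the DOI lookup against a one-record, fabricated sample of "json.txt" (with the real data, pipe real DOIs into the loop above instead):

```shell
# Illustrative only: one fabricated Unpaywall-style record, then the same
# jq select used by the second command of this block.
printf '{"doi":"10.1234/example","title":"Sample review","published_date":"2021-01-01"}\n' > json-sample.txt
echo '10.1234/example' | while IFS= read -r i; do
  jq -r 'select(.doi=="'"$i"'")|.title,.published_date' json-sample.txt
done
# prints:
# Sample review
# 2021-01-01
```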


### FIGURE 1 DATA ###
# COLUMN 1: TOTALS
# In order: titles in the RIS file; DOIs in the RIS file; Unpaywall records; open-access records; PDF links; downloaded files
grep 'TI  - ' References.ris|wc -l; grep 'DO  - ' References.ris|wc -l ; jq '.title' json.txt |wc -l ; jq 'select(.is_oa==true)|.title' json.txt |wc -l ; jq -r '.best_oa_location.url_for_pdf // ([.oa_locations[].url_for_pdf // empty]|.[0]) // empty' json.txt|wc -l; ls -1 files|wc -l
# COLUMN 2: AFTER DEDUPLICATION
# In order: unique DOIs; unique Unpaywall titles; unique open-access titles; unique PDF links
wc -l dois.txt ; jq '.title' json.txt |sort|uniq -i|wc -l ; jq 'select(.is_oa==true)|.title' json.txt |sort|uniq -i|wc -l ; wc -l url.txt
