# Dataset to reproduce the analysis in "The APC-Effect: Stratification in Open Access Publishing".

The code used to create this dataset is available at https://doi.org/10.5281/zenodo.7198844.

This release contains three files:

- `papers_with_concepts.parquet`: This is the main file that we used to conduct 
our analysis (exported from Hadoop via `hdfs dfs -getmerge`). However, due to the 
relatively old versions of Hadoop and Spark on our cluster, we were not able to 
re-read the file using current software packages, such as the `arrow` package
for `R`. For this reason, we provide a .csv version below.
- `papers_with_concepts_wo_title.csv.zip`: This is a derivative of the above file,
with two differences: (1) We removed the titles. Some titles contain nested quotes
(e.g., "Reply to: "The APC-Effect", by Thomas Klebel, and Tony Ross-Hellauer"),
which complicates importing the file. Since titles are not used in any part of
the analysis, we removed them. (2) The file is a compressed .csv rather than a
.parquet, which should make it easier to import and re-use in the long term.
- `multilevel_sample_large.csv`: This file was used to conduct the mixture
modelling presented in the paper. Its generation is documented in the file
`21-sample-for-multilevel-model.R`, which is contained in the code available at
https://doi.org/10.5281/zenodo.7198844. In basic terms, it is a sub-sample of
about 75,000 papers drawn from the larger files. The only
addition is the column `total_weight`, which was calculated as
`work_frac * concept_frac`.
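
The calculation of `total_weight` can be illustrated on a few made-up rows. This is only a sketch: the values in `toy` are hypothetical and not taken from the dataset, and the object names `toy` and `toy_weighted` are chosen for illustration.

```r
library(dplyr)

# Hypothetical rows: one paper whose author has two affiliations
# (work_frac = 1 / 2) and which is tagged to two top-level concepts
# (concept_frac = 1 / 2 within each affiliation).
toy <- tibble(
  id             = c("W1", "W1", "W1", "W1"),
  institution_id = c("I1", "I1", "I2", "I2"),
  field          = c("Biology", "Medicine", "Biology", "Medicine"),
  work_frac      = c(0.5, 0.5, 0.5, 0.5),
  concept_frac   = c(0.5, 0.5, 0.5, 0.5)
)

# Combine the two fractional weights into a single row weight.
toy_weighted <- toy %>%
  mutate(total_weight = work_frac * concept_frac)

# In this toy case the four row weights add up to one paper.
sum(toy_weighted$total_weight)
```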

Below we only discuss handling the .csv.zip file.

## File handling
The uncompressed file is fairly large (~3.5GB), which makes processing in memory
difficult. For our paper, we analysed the data using a full Hadoop/Spark cluster,
but more lightweight alternatives are available.

For example, it is possible to use the `arrow` package to open the dataset 
without reading it fully into memory (after locally unzipping). Data can then be
aggregated using the `dplyr` package.

```r
library(arrow)
library(dplyr)

# open the (unzipped) file lazily, without loading it into memory
df <- open_dataset("papers_with_concepts_wo_title.csv", format = "csv")

# the top part of the data
df %>%
  head() %>%
  collect()

# validating that the file was read correctly:
# - level_sum should be 0 (all concepts are at level 0)
# - level_na should be 0 (no missing values)
df %>%
  summarise(level_sum = sum(level),
            level_na = sum(is.na(level))) %>%
  collect()

# Find number of cases (papers)
df %>%
  distinct(id) %>%
  count() %>%
  collect()
# this should return: 1,572,417
```
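
Aggregating with `dplyr` on a lazily opened dataset could look like the sketch below. So that it runs without the real file, it first writes a few made-up rows to a temporary CSV; the values and the object names `toy`, `ds` and `apc_by_field` are assumptions for illustration. With the real data, `open_dataset()` would point at the unzipped CSV instead.

```r
library(arrow)
library(dplyr)

# Hypothetical rows standing in for the real file.
toy <- data.frame(
  id            = c("W1", "W1", "W2", "W3"),
  field         = c("Biology", "Medicine", "Biology", "Physics"),
  APC_in_dollar = c(1500, 1500, 2000, 900)
)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

ds <- open_dataset(path, format = "csv")

# The query is built lazily; only the small aggregated result is
# pulled into memory by collect().
apc_by_field <- ds %>%
  group_by(field) %>%
  summarise(n_rows = n(), mean_apc = mean(APC_in_dollar)) %>%
  arrange(field) %>%
  collect()

apc_by_field
```

Note that this aggregates over rows, not over papers. For substantive results, the duplication described in the next section has to be handled first.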


## Deduplication
The file contains 1,572,417 cases, but 10,621,177 rows. There are four
levels of "duplication" that need to be accounted for when analysing:

1. A given paper might appear twice in the data if both first and last author are
at an institution that is in the Leiden Ranking. Both cases will have
`work_frac = 1`, because we use full counting.
2. A given researcher might appear more than once per paper, because they have
more than one affiliation. Here: `work_frac = 1 / n_affiliations`.
3. A given paper might appear more than once because it is tagged to more than one
top-level concept in OpenAlex. `concept_frac` will sum to one for this paper,
after other duplications are taken into account.
4. Even if there is no duplication from above (single author, single institution,
single concept), a paper might appear more than once, due to the way we matched
the Leiden Ranking data. For each paper, we have a `publication_year`. The
Leiden Ranking reports its indicators for date ranges spanning four years (e.g.,
2016-2019). For analyses over time, we matched `publication_year`
with `last_year_of_period`. For our static analyses of the years 2016-2019,
we filtered on `first_year_of_period`, i.e. `filter(first_year_of_period == 2016)`,
in order to hold the data from the Leiden Ranking constant.

For these reasons, any analysis using this dataset must first consider these
levels of "duplication" and find the unique set of papers that can serve to 
answer a specific query. The file `multilevel_sample_large.csv` does not contain
duplication due to case 4 from above; cases 1-3 still apply.
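
The following sketch shows, on made-up rows, how the period filter (case 4) and the fractional weights (cases 2 and 3) might be combined. It assumes that a paper can appear with more than one Leiden Ranking period and that the product `work_frac * concept_frac` is used as the row weight. The values and object names are hypothetical, and this is not the code used for the paper.

```r
library(dplyr)

# Hypothetical rows: paper W1 appears with two Leiden Ranking periods,
# paper W2 is tagged to two top-level concepts within one period.
toy <- tibble(
  id                   = c("W1", "W1", "W2", "W2"),
  publication_year     = c(2018, 2018, 2019, 2019),
  first_year_of_period = c(2015, 2016, 2016, 2016),
  last_year_of_period  = c(2018, 2019, 2019, 2019),
  field                = c("Biology", "Biology", "Biology", "Medicine"),
  work_frac            = c(1, 1, 1, 1),
  concept_frac         = c(1, 1, 0.5, 0.5)
)

# Static view: hold the Leiden Ranking period constant (case 4).
static <- toy %>%
  filter(first_year_of_period == 2016)

# View over time: keep the period that ends in the publication year.
over_time <- toy %>%
  filter(publication_year == last_year_of_period)

# Number of unique papers in the static view.
n_papers <- static %>%
  distinct(id) %>%
  nrow()

# Fractionally weighted number of papers per field (cases 2 and 3).
by_field <- static %>%
  mutate(total_weight = work_frac * concept_frac) %>%
  group_by(field) %>%
  summarise(weighted_papers = sum(total_weight))
```

In this toy example the weighted counts per field add up to the number of unique papers, because each paper has a single author row. With both first and last author present (case 1), full counting applies and the sum would be larger.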

## Column description

- id: Paper ID from OpenAlex
- doi: DOI of Paper
- venue_id: char, Venue id (journal) of paper (OpenAlex)
- author_position: char, Author position (OpenAlex)
- institution_id: char, ID of author institution (OpenAlex)
- work_frac: numeric, fractional weight, for authors with multiple affiliations
- APC: logical, whether journal has an APC or not (DOAJ)
- waiver: logical, whether journal offers waivers in principle (DOAJ) (not used for analysis)
- APC_in_dollar: numeric, APC value, converted to USD. If `APC == FALSE`, this is `NA`
- University: char, University Name (Leiden Ranking)
- country: char, University Country (Leiden Ranking)
- country_code: char, Country Code (World Bank Metadata)
- Period: char, Measuring period for indicators (Leiden Ranking)
- P_top10: numeric, Main study indicator (Leiden Ranking)
- publication_year: numeric, Year of paper publication (OpenAlex)
- first_year_of_period: numeric, first year of `Period`
- last_year_of_period: numeric, last year of `Period`
- concept_id: char, concept ID (OpenAlex)
- score: numeric, score for concept ID (OpenAlex)
- field: char, full name of concept (OpenAlex)
- level: numeric, hierarchy level of concept (OpenAlex) (all 0)
- concept_frac: numeric, fractional weight towards concept

