Characterise respondents who answer “Not applicable” to the question on code sharing.
# X53 is personal view on code sharing
df %>%
  mutate(X53 = factor(X53, levels = opinion_levels)) %>%
  tabyl(disciplines_recoded_wos, X53) %>%
  custom_adorns() %>%
  knitr::kable()
## mutate: converted 'X53' from character to factor (0 new NA)
| disciplines_recoded_wos | Very important | Somewhat important | Neither important nor unimportant | Somewhat unimportant | Very unimportant | I don’t know | Not applicable | Total |
|---|---|---|---|---|---|---|---|---|
| Arts & Humanities | 14.3% (1) | 28.6% (2) | 0.0% (0) | 14.3% (1) | 14.3% (1) | 14.3% (1) | 14.3% (1) | 100.0% (7) |
| Life Sciences & Biomedicine | 29.5% (26) | 33.0% (29) | 17.0% (15) | 4.5% (4) | 4.5% (4) | 3.4% (3) | 8.0% (7) | 100.0% (88) |
| Physical Sciences | 25.9% (7) | 33.3% (9) | 25.9% (7) | 3.7% (1) | 7.4% (2) | 0.0% (0) | 3.7% (1) | 100.0% (27) |
| Social Sciences | 22.2% (8) | 25.0% (9) | 25.0% (9) | 5.6% (2) | 5.6% (2) | 2.8% (1) | 13.9% (5) | 100.0% (36) |
| Technology | 30.8% (12) | 33.3% (13) | 25.6% (10) | 0.0% (0) | 0.0% (0) | 2.6% (1) | 7.7% (3) | 100.0% (39) |
| NA | 0.0% (0) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 100.0% (1) | 100.0% (1) |
| Total | 27.3% (54) | 31.3% (62) | 20.7% (41) | 4.0% (8) | 4.5% (9) | 3.0% (6) | 9.1% (18) | 100.0% (198) |
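The answers “I don’t know” and “Not applicable” are over-represented among SSH respondents in the question on the personal relevance of code sharing: together they account for 28.6% (2 of 7) in Arts & Humanities and 16.7% (6 of 36) in Social Sciences, compared with 12.1% (24 of 198) overall. The Arts & Humanities group is very small, so this pattern is tentative.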
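To make this comparison explicit, the counts from the table above can be collapsed into one combined share of “I don’t know” and “Not applicable” per discipline. The sketch below re-enters the counts by hand; the `counts` data frame and its column names are introduced here for illustration and are not part of the original analysis. The single respondent without a discipline is left out.

```r
# Counts copied from the cross-table above; column names are illustrative.
counts <- data.frame(
  discipline = c("Arts & Humanities", "Life Sciences & Biomedicine",
                 "Physical Sciences", "Social Sciences", "Technology"),
  dont_know  = c(1, 3, 0, 1, 1),
  not_applic = c(1, 7, 1, 5, 3),
  total      = c(7, 88, 27, 36, 39)
)

# Combined share of "I don't know" and "Not applicable" per discipline
counts$share <- (counts$dont_know + counts$not_applic) / counts$total

# Sort from highest to lowest share and print as percentages
counts <- counts[order(-counts$share), ]
print(transform(counts, share = sprintf("%.1f%%", 100 * share)),
      row.names = FALSE)
```

With only seven Arts & Humanities respondents, the share for that group rests on two answers and should be read with caution.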
# OA publishing
# X68 = pubs total
# X69 = pubs OA
# X51 = personal opinion on OA
# X27 = institutional opinion on OA
oa_practice <- df %>%
  select(X27, X51, X68, X69) %>%
  mutate(OA_rate = X69/X68,
         X51 = factor(X51, levels = opinion_levels),
         X27 = factor(X27, levels = opinion_levels))
## select: dropped 69 variables (X1, X3, X7, X8, X9, …)
## mutate: converted 'X27' from character to factor (0 new NA)
## converted 'X51' from character to factor (0 new NA)
## new variable 'OA_rate' (double) with 57 unique values and 1% NA
oa_practice
## # A tibble: 198 × 5
## X27 X51 X68 X69 OA_rate
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Neither important nor unimportant Neither important nor … 15 6 0.4
## 2 Neither important nor unimportant Very important 75 30 0.4
## 3 I don't know Very important 60 3 0.05
## 4 Neither important nor unimportant Very unimportant 12 11 0.917
## 5 Neither important nor unimportant Neither important nor … 10 7 0.7
## 6 Very important Very important 33 25 0.758
## 7 Neither important nor unimportant Very important 3 2 0.667
## 8 Neither important nor unimportant Somewhat important 10 6 0.6
## 9 Neither important nor unimportant Somewhat important 55 35 0.636
## 10 Somewhat important Very important 4 4 1
## # … with 188 more rows
ggplot(oa_practice, aes(OA_rate)) +
  geom_histogram(bins = 20)
Some values should not exist: two respondents report more OA articles than total articles (OA_rate > 1). These two cases are removed.
oa_practice <- oa_practice %>%
  filter(OA_rate <= 1 | is.na(OA_rate))
## filter: removed 2 rows (1%), 196 rows remaining
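The filter condition keeps missing rates on purpose: a comparison with `NA` returns `NA`, and `filter()` would otherwise drop those rows silently. A minimal base-R sketch on made-up numbers (the vectors `total` and `oa` are hypothetical, not survey data) shows which cases survive the condition:

```r
# Hypothetical publication counts, not taken from the survey
total <- c(10, 4, 5, 0, NA)
oa    <- c( 6, 4, 7, 0,  2)

rate <- oa / total                 # 0/0 gives NaN, 2/NA gives NA
keep <- rate <= 1 | is.na(rate)    # same condition as in the filter above

data.frame(total, oa, rate, keep)
```

Only the third case (more OA publications than total publications) is flagged for removal. `0/0` yields `NaN`, which `is.na()` also treats as missing, so that case is kept.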
set.seed(123)
oa_practice %>%
  pivot_longer(c(X27, X51), names_to = "variable") %>%
  mutate(variable_rec = recode(variable, X27 = "Perceived institutional view",
                               X51 = "Personal view")) %>%
  ggplot(aes(OA_rate, fct_rev(value))) +
  geom_boxplot(width = .5, outlier.alpha = 0) +
  geom_jitter(height = .1, width = .01, colour = custom_blue, alpha = .5) +
  facet_wrap(vars(variable_rec), nrow = 2) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "% of publications which are OA", y = "Importance of OA publishing")
## pivot_longer: reorganized (X27, X51) into (variable, value) [was 196x5, now 392x5]
## mutate: new variable 'variable_rec' (character) with 2 unique values and 0% NA
The general trend holds: respondents who consider OA more important also
tend to have a higher share of OA publications, and vice versa.
Stricter tests for the personal view:
m1 <- aov(OA_rate ~ X51, data = oa_practice)
summary(m1)
## Df Sum Sq Mean Sq F value Pr(>F)
## X51 4 1.723 0.4307 4.39 0.00204 **
## Residuals 189 18.542 0.0981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
pairwise.t.test(oa_practice$OA_rate, oa_practice$X51, p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: oa_practice$OA_rate and oa_practice$X51
##
## Very important Somewhat important
## Somewhat important 0.491 -
## Neither important nor unimportant 0.042 0.020
## Somewhat unimportant 0.042 0.020
## Very unimportant 0.179 0.076
## Neither important nor unimportant
## Somewhat important -
## Neither important nor unimportant -
## Somewhat unimportant 0.485
## Very unimportant 0.975
## Somewhat unimportant
## Somewhat important -
## Neither important nor unimportant -
## Somewhat unimportant -
## Very unimportant 0.491
##
## P value adjustment method: BH
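Some group differences are “statistically significant” at the 5% level: “Very important” and “Somewhat important” each differ from “Neither important nor unimportant” and from “Somewhat unimportant” (adjusted p = 0.042 and 0.020, respectively). Which differences pass the threshold also depends on the correction method used. Overall, it seems reasonable to state that a higher OA share is associated with a more favourable opinion of OA.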
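How much the verdict depends on the correction can be checked by applying several adjustment methods to the same set of unadjusted p-values. The sketch below uses `stats::p.adjust()` on ten invented p-values (one per pairwise comparison among five groups). They are placeholders, not the unadjusted values behind the table above.

```r
# Ten hypothetical unadjusted p-values (choose(5, 2) pairwise comparisons)
p_raw <- c(0.004, 0.008, 0.012, 0.030, 0.045,
           0.150, 0.300, 0.450, 0.490, 0.975)

adjusted <- data.frame(
  raw        = p_raw,
  BH         = p.adjust(p_raw, method = "BH"),
  holm       = p.adjust(p_raw, method = "holm"),
  bonferroni = p.adjust(p_raw, method = "bonferroni")
)
round(adjusted, 3)

# Number of comparisons below 0.05 under each method
colSums(adjusted < 0.05)
```

BH controls the false discovery rate and is less conservative than Holm or Bonferroni, which control the family-wise error rate, so more comparisons tend to stay below 0.05 under BH.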
# effect size
eta <- DescTools::EtaSq(m1)[1]
eta
## [1] 0.08501282
I am not well versed in these values and would need to look into how exactly to interpret them.
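For orientation, eta squared in a one-way ANOVA is the between-group sum of squares divided by the total sum of squares, i.e. the share of variance in `OA_rate` accounted for by `X51`. The sketch below recomputes it from the rounded sums of squares printed in the ANOVA table above. The benchmark values in the comment are Cohen's commonly cited rules of thumb, added here as an assumption about how one might read the number, not as part of the original analysis.

```r
# Sums of squares as printed in summary(m1) above (rounded to three decimals)
ss_effect   <- 1.723   # X51
ss_residual <- 18.542  # Residuals

# Eta squared = SS_effect / SS_total
eta_sq <- ss_effect / (ss_effect + ss_residual)
round(eta_sq, 3)

# Common rules of thumb (Cohen): ~0.01 small, ~0.06 medium, ~0.14 large
```

The result agrees with the `DescTools::EtaSq()` value up to rounding: roughly 8.5% of the variance in OA share is associated with the personal view.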
# Data sharing
# X72 = datasets total
# X73 = datasets open
# X52 = personal opinion on OD
# X28 = institutional opinion on OD
od_practice <- df %>%
  select(X28, X52, X72, X73) %>%
  mutate(OD_rate = X73/X72,
         X52 = factor(X52, levels = opinion_levels),
         X28 = factor(X28, levels = opinion_levels))
## select: dropped 69 variables (X1, X3, X7, X8, X9, …)
## mutate: converted 'X28' from character to factor (0 new NA)
## converted 'X52' from character to factor (0 new NA)
## new variable 'OD_rate' (double) with 19 unique values and 28% NA
od_practice
## # A tibble: 198 × 5
## X28 X52 X72 X73 OD_rate
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Neither important nor unimportant Neither important nor … 0 NA NA
## 2 Neither important nor unimportant Very important 20 18 0.9
## 3 Neither important nor unimportant Very important 10 10 1
## 4 Neither important nor unimportant Very important 2 2 1
## 5 Neither important nor unimportant Somewhat important 3 0 0
## 6 Very important Very important 0 NA NA
## 7 Neither important nor unimportant Very important 3 0 0
## 8 Neither important nor unimportant Somewhat important 100 0 0
## 9 Neither important nor unimportant Somewhat important 1 0 0
## 10 Very important Very important 1 1 1
## # … with 188 more rows
ggplot(od_practice, aes(OD_rate)) +
  geom_histogram(bins = 20)
Some values should not exist: one respondent reports more open datasets than total
datasets (OD_rate > 1). This case is removed, along with the 55 respondents for whom no rate could be computed.
od_practice <- od_practice %>%
  filter(OD_rate <= 1 | is.na(OD_rate)) %>%
  filter(!is.na(OD_rate))
## filter: removed one row (1%), 197 rows remaining
## filter: removed 55 rows (28%), 142 rows remaining
set.seed(123)
od_practice %>%
  pivot_longer(c(X28, X52), names_to = "variable") %>%
  mutate(variable_rec = recode(variable, X28 = "Perceived institutional view",
                               X52 = "Personal view")) %>%
  ggplot(aes(OD_rate, fct_rev(value))) +
  geom_boxplot(width = .5, outlier.alpha = 0) +
  geom_jitter(height = .1, width = .01, colour = custom_blue, alpha = .5) +
  facet_wrap(vars(variable_rec), nrow = 2) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "% of datasets which are open", y = "Importance of OD sharing")
## pivot_longer: reorganized (X28, X52) into (variable, value) [was 142x5, now 284x5]
## mutate: new variable 'variable_rec' (character) with 2 unique values and 0% NA
m2 <- aov(OD_rate ~ X28 + X52, data = od_practice)
summary(m2)
## Df Sum Sq Mean Sq F value Pr(>F)
## X28 6 1.501 0.2501 1.368 0.232
## X52 6 0.953 0.1589 0.869 0.520
## Residuals 129 23.582 0.1828
Differences are not statistically significant. However, this model is likely not suitable for the data at hand (zero and one inflation, a response bounded to the interval [0, 1], etc.).
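One model-based route, offered only as a sketch, is to model the counts behind each proportion directly with a binomial-type GLM, which respects the [0, 1] range and weights respondents by their number of datasets. The example below uses simulated data: `n_total`, `n_open` and `view` are hypothetical stand-ins for `X72`, `X73` and an opinion variable, and the `quasibinomial` family is an assumption chosen to allow for overdispersion, not something used in the original analysis.

```r
set.seed(123)

# Simulated stand-in data (not survey data)
n <- 120
view    <- factor(sample(c("Important", "Neutral", "Unimportant"),
                         n, replace = TRUE))
n_total <- rpois(n, lambda = 4) + 1   # at least one dataset per respondent
p_open  <- c(Important = 0.6, Neutral = 0.4,
             Unimportant = 0.3)[as.character(view)]
n_open  <- rbinom(n, size = n_total, prob = p_open)

# Two-column response: open datasets (successes), closed datasets (failures)
fit <- glm(cbind(n_open, n_total - n_open) ~ view, family = quasibinomial)

summary(fit)$coefficients

# F test for the overall effect of 'view'
anova(fit, test = "F")
```

A zero- and one-inflated beta regression would be another option, at the cost of an additional package.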
An alternative that avoids these modelling assumptions is to look at contingency tables with chi-square tests. The following table is for the institutional view.
inst_table <- od_practice %>%
  filter(!(X28 %in% c("Not applicable", "I don't know"))) %>%
  mutate(X28 = fct_drop(X28)) %>%
  mutate(od_cats = cut_interval(OD_rate, 4) %>% fct_rev()) %>%
  tabyl(od_cats, X28)
## filter: removed 10 rows (7%), 132 rows remaining
## mutate: changed 0 values (0%) of 'X28' (0 new NA)
## mutate: new variable 'od_cats' (factor) with 4 unique values and 0% NA
chisq.test(inst_table)
##
## Pearson's Chi-squared test
##
## data: inst_table
## X-squared = 22.109, df = 12, p-value = 0.03632
inst_table %>%
  custom_adorns() %>%
  knitr::kable()
| od_cats | Very important | Somewhat important | Neither important nor unimportant | Somewhat unimportant | Very unimportant | Total |
|---|---|---|---|---|---|---|
| (0.75,1] | 11.3% (6) | 22.6% (12) | 39.6% (21) | 15.1% (8) | 11.3% (6) | 100.0% (53) |
| (0.5,0.75] | 20.0% (2) | 20.0% (2) | 20.0% (2) | 10.0% (1) | 30.0% (3) | 100.0% (10) |
| (0.25,0.5] | 14.3% (3) | 42.9% (9) | 28.6% (6) | 14.3% (3) | 0.0% (0) | 100.0% (21) |
| [0,0.25] | 2.1% (1) | 12.5% (6) | 39.6% (19) | 16.7% (8) | 29.2% (14) | 100.0% (48) |
| Total | 9.1% (12) | 22.0% (29) | 36.4% (48) | 15.2% (20) | 17.4% (23) | 100.0% (132) |
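The four row categories in this table come from `cut_interval(OD_rate, 4)`, which splits the observed range into four bins of equal width. Assuming the rates span the full range from 0 to 1 (as the bin labels suggest), the same binning can be sketched in base R. The vector `x` holds made-up rates for illustration.

```r
# Hypothetical rates, chosen to sit on and between the bin edges
x <- c(0, 0.25, 0.26, 0.5, 0.75, 0.9, 1)

# Four equal-width bins on [0, 1]; the lowest bin is closed on both sides
bins <- cut(x, breaks = seq(0, 1, length.out = 5), include.lowest = TRUE)

levels(bins)
table(bins)
```

Values exactly on an inner edge (0.25, 0.5, 0.75) fall into the lower bin, so a respondent with exactly half of their datasets open is counted in `(0.25,0.5]`.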
The following table is for the personal view.
personal_table <- od_practice %>%
  filter(!(X52 %in% c("Not applicable", "I don't know"))) %>%
  mutate(X52 = fct_drop(X52)) %>%
  mutate(od_cats = cut_interval(OD_rate, 4) %>% fct_rev()) %>%
  tabyl(od_cats, X52)
## filter: removed 4 rows (3%), 138 rows remaining
## mutate: changed 0 values (0%) of 'X52' (0 new NA)
## mutate: new variable 'od_cats' (factor) with 4 unique values and 0% NA
chisq.test(personal_table)
##
## Pearson's Chi-squared test
##
## data: personal_table
## X-squared = 10.622, df = 12, p-value = 0.5616
personal_table %>%
  custom_adorns() %>%
  knitr::kable()
| od_cats | Very important | Somewhat important | Neither important nor unimportant | Somewhat unimportant | Very unimportant | Total |
|---|---|---|---|---|---|---|
| (0.75,1] | 41.8% (23) | 41.8% (23) | 12.7% (7) | 1.8% (1) | 1.8% (1) | 100.0% (55) |
| (0.5,0.75] | 50.0% (5) | 30.0% (3) | 20.0% (2) | 0.0% (0) | 0.0% (0) | 100.0% (10) |
| (0.25,0.5] | 34.8% (8) | 39.1% (9) | 13.0% (3) | 8.7% (2) | 4.3% (1) | 100.0% (23) |
| [0,0.25] | 24.0% (12) | 38.0% (19) | 26.0% (13) | 6.0% (3) | 6.0% (3) | 100.0% (50) |
| Total | 34.8% (48) | 39.1% (54) | 18.1% (25) | 4.3% (6) | 3.6% (5) | 100.0% (138) |
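Based on these tables: the association between the share of open datasets and the perceived institutional view is statistically significant (p = 0.036), while the association with the personal view is not (p = 0.56).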
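Several cells in these tables are sparse, so the chi-squared approximation may be unreliable. As a rough check, the sketch below re-enters the counts of the institutional-view table by hand (the matrix `inst_counts` is introduced here for illustration), inspects the expected cell counts, and adds Fisher's exact test with a simulated p-value as one possible alternative. That alternative is a suggestion, not part of the original analysis.

```r
# Counts copied from the institutional-view table above
inst_counts <- matrix(
  c(6, 12, 21, 8,  6,
    2,  2,  2, 1,  3,
    3,  9,  6, 3,  0,
    1,  6, 19, 8, 14),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    od_cats = c("(0.75,1]", "(0.5,0.75]", "(0.25,0.5]", "[0,0.25]"),
    X28 = c("Very important", "Somewhat important",
            "Neither important nor unimportant",
            "Somewhat unimportant", "Very unimportant")
  )
)

# Pearson test; R warns when expected counts are small, silenced here
res <- suppressWarnings(chisq.test(inst_counts))
res$statistic
round(res$expected, 1)

# Share of cells with an expected count below 5
mean(res$expected < 5)

# Fisher's exact test with a Monte Carlo p-value
set.seed(123)
fisher.test(inst_counts, simulate.p.value = TRUE, B = 10000)
```

A common rule of thumb asks for expected counts of at least 5 in most cells; where that fails, the p-values from the Pearson test are best treated as indicative only.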
Question: In the last three years, my choice of publication venue (journal, book publisher, conference, etc.) was influenced by the promotion criteria of my university.
Including the “I don’t know” and “Does not apply” responses:
df %>%
  select(X71) %>%
  mutate(X71 = fct_relevel(X71,
                           "Very often", "Often", "Neither often nor rarely",
                           "Rarely", "Very rarely") %>% fct_rev()) %>%
  plot_bar(X71, reorder = FALSE, nudge_y = .01)
## select: dropped 72 variables (X1, X3, X7, X8, X9, …)
## mutate: converted 'X71' from character to factor (0 new NA)
## count: now 7 rows and 2 columns, ungrouped
## drop_na: no rows removed
## mutate: new variable 'prop' (double) with 6 unique values and 0% NA
## new variable 'label' (character) with 6 unique values and 0% NA
## new variable 'xvar' (factor) with 7 unique values and 0% NA
## filter: no rows removed
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
Excluding the “I don’t know” and “Does not apply” responses:
df %>%
  select(X71) %>%
  filter(!(X71 %in% c("I don't know", "Does not apply"))) %>%
  mutate(X71 = fct_relevel(X71,
                           "Very often", "Often", "Neither often nor rarely",
                           "Rarely", "Very rarely") %>% fct_rev()) %>%
  plot_bar(X71, reorder = FALSE, nudge_y = .01)
## select: dropped 72 variables (X1, X3, X7, X8, X9, …)
## filter: removed 20 rows (10%), 178 rows remaining
## mutate: converted 'X71' from character to factor (0 new NA)
## count: now 5 rows and 2 columns, ungrouped
## drop_na: no rows removed
## mutate: new variable 'prop' (double) with 4 unique values and 0% NA
## new variable 'label' (character) with 4 unique values and 0% NA
## new variable 'xvar' (factor) with 5 unique values and 0% NA
## filter: no rows removed
This version has 178 cases.