---
title: "Game Engines Analysis"
output:
  html_document: default
  html_notebook: default
---

Are game engines more complex than frameworks?

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(GGally)
library(ggridges)
library(forcats)
library(psych)
library(pwr)
library(effsize)
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='figs/', echo=FALSE)
```

# General pre-processing

```{r}
langs <- c("C", "C++", "Java", "C#", "JavaScript", "Objective-C", "Swift", "Python", "Ruby", "TTCN-3", "PHP", "Scala", "GDScript", "Go", "Lua", "TypeScript")

engines.blacklist <- read.csv("../../datasets/game-engines/game-engines-blacklist.csv", sep = ",") %>%
  mutate(fullname = paste(author, engine, sep = "_"))

engines.dataset <- read.csv("../../datasets/game-engines/game-engines-full.csv", header = TRUE, sep = ";") %>%
  mutate(fullname = paste(owner, name, sep = "_")) %>%
  filter(!(fullname %in% engines.blacklist$fullname)) %>%
  filter(!(url %in% c("https://github.com/kooparse/engine", "https://github.com/aleksijuvani/polygonist"))) %>% # removing engines that we couldn't download
  mutate(created_at = as.Date(created_at)) %>%
  mutate(last_push = as.Date(last_push)) %>%
  mutate(lifespan = as.numeric(difftime(last_push, created_at, units = "weeks"))) %>%
  filter(main_language %in% langs) %>%
  filter(contributors_count > 1 & stargazers_count > 1 & archived == "False" & main_language != "" & main_language_size != "" & last_push >= as.Date("2017-01-01"))

frameworks.blacklist <- read.csv("../../datasets/frameworks/frameworks-blacklist.csv", sep = ";") %>%
  mutate(fullname = paste(owner, name, sep = "_"))

frameworks.dataset <- read.csv("../../datasets/frameworks/frameworks-full.csv", sep = ";", header = TRUE) %>%
  mutate(fullname = paste(owner, name, sep = "_")) %>%
  filter(!(fullname %in% frameworks.blacklist$fullname)) %>%
  mutate(created_at = as.Date(created_at)) %>%
  mutate(last_push = as.Date(last_push)) %>%
  mutate(stargazers_count = as.numeric(stargazers_count)) %>%
  mutate(lifespan = as.numeric(difftime(last_push, created_at, units = "weeks"))) %>%
  filter(contributors_count > 1 & stargazers_count > 1 & archived == "False" & main_language != "" & main_language_size != "" & last_push >= as.Date("2017-01-01")) %>%
  as.data.frame %>%
  filter(main_language %in% langs)
  # top_n(x = ., n = nrow(engines.dataset), wt = stargazers_count)

collected_date <- as.Date("2019-08-07")

engines.lizardmetrics <- read.csv("../../datasets/game-engines/lizardmetrics-engines.csv", sep = ",") %>%
  na.exclude()

frameworks.lizardmetrics <- read.csv("../../datasets/frameworks/lizardmetrics-frameworks.csv", sep = ",") %>%
  na.exclude()

dataset.eng <- merge(x = engines.dataset, y = engines.lizardmetrics, by = "fullname", all = FALSE) %>%
  select(-X) %>%
  arrange(desc(stargazers_count)) %>%
  mutate(type = "engine")
  # %>% inner_join(truckfactors, by = c("name" = "name", "owner" = "owner"))

dataset.frameworks <- merge(x = frameworks.dataset, y = frameworks.lizardmetrics, by = "fullname", all = FALSE) %>%
  # top_n(x = ., n = nrow(dataset.eng)) %>%
  select(-X) %>%
  arrange(desc(stargazers_count)) %>%
  mutate(type = "framework")
  # %>% inner_join(truckfactors, by = c("name" = "name", "owner" = "owner"))

dataset <- rbind(dataset.eng, dataset.frameworks) %>%
  mutate(main_language_size = main_language_size * 0.000001) %>%
  mutate(total_size = total_size * 0.000001) %>%
  # pmax() divides each project by its own lifespan; max() would use one value for all rows
  mutate(commits_per_time = commits_count / pmax(lifespan, 1))
  # separate_rows(tags_releases, sep = "\n")
  # mutate(tags_releases = as.Date(str_extract(tags_releases, "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"))) %>%
  # mutate(tag_interval = lag(tags_releases) - tags_releases) %>%
  # mutate(tag_interval = replace_na(tag_interval, 0))

truckfactors <- read.csv("../../datasets/truckfactor.csv", sep = ";") %>%
  group_by(project_name) %>%
  summarise(truck_factor = n()) %>%
  mutate(fullname = as.character(project_name)) %>%
  select(truck_factor, fullname)
  # %>% separate(project_name, c("owner", "name"), sep = "_")

dataset <- merge(x = dataset, y = truckfactors, by = "fullname", all = FALSE)

rm(dataset.eng, dataset.frameworks, engines.blacklist, engines.dataset, engines.lizardmetrics, frameworks.blacklist, frameworks.dataset, frameworks.lizardmetrics, truckfactors)

engines_low_star_commit <- c("https://github.com/alexiusacademia/CoronaSDKGridView",
                             "https://github.com/Anders1232/RattletrapEngine",
                             "https://github.com/ShawnPConroy/MUSE",
                             "https://github.com/JakesCode/somePythonAdventureGame",
                             "https://github.com/RetroFireStudios/Charcoal-Engine",
                             "https://github.com/JayhawkZombie/SFEngine",
                             "https://github.com/LorenzoLeonardini/Badger-Engine")

dataset.f <- dataset %>% filter(type == "framework")
dataset.e <- dataset %>%
  filter(type == "engine") %>%
  filter(!url %in% engines_low_star_commit)

dataset <- rbind(dataset.f, dataset.e)

# Designate as a categorical factor
dataset$type <- as.factor(dataset$type)

# write.csv(dataset, file = "../../datasets/dataset.csv")
```

# Dataset without outliers

```{r}
# removing outliers
'%!in%' <- function(x,y)!('%in%'(x,y)) # negated %in% helper

# d.engine.noout <- dataset %>% filter(type == "engine") %>%
d.e.out.total_size <- dataset.e %>% filter(total_size %!in% boxplot.stats(total_size, coef = 1.5)$out)
d.e.out.main_language_size <- dataset.e %>% filter(main_language_size %!in% boxplot.stats(main_language_size, coef = 1.5)$out)
d.e.out.n_file <- dataset.e %>% filter(n_file %!in% boxplot.stats(n_file, coef = 1.5)$out)
d.e.out.n_func <- dataset.e %>% filter(n_func %!in% boxplot.stats(n_func, coef = 1.5)$out)
d.e.out.nloc_mean <- dataset.e %>% filter(nloc_mean %!in% boxplot.stats(nloc_mean, coef = 1.5)$out)
d.e.out.cc_mean <- dataset.e %>% filter(cc_mean %!in% boxplot.stats(cc_mean, coef = 1.5)$out)
d.e.out.func_per_file_mean <- dataset.e %>% filter(func_per_file_mean %!in% boxplot.stats(func_per_file_mean, coef = 1.5)$out)

d.e.out <-
  merge(d.e.out.total_size, d.e.out.main_language_size, all = TRUE, no.dups = TRUE)
d.e.out <- merge(d.e.out, d.e.out.n_file, all = TRUE, no.dups = TRUE)
head(d.e.out)

d.framework.noout <- dataset %>%
  filter(type == "framework") %>%
  filter(total_size %!in% boxplot.stats(total_size, coef = 1.5)$out) %>%
  filter(main_language_size %!in% boxplot.stats(main_language_size, coef = 1.5)$out) %>%
  filter(n_file %!in% boxplot.stats(n_file, coef = 1.5)$out) %>%
  filter(n_func %!in% boxplot.stats(n_func, coef = 1.5)$out) %>%
  filter(nloc_mean %!in% boxplot.stats(nloc_mean, coef = 1.5)$out) %>%
  filter(cc_mean %!in% boxplot.stats(cc_mean, coef = 1.5)$out) %>%
  filter(func_per_file_mean %!in% boxplot.stats(func_per_file_mean, coef = 1.5)$out)

# d.engine.noout is never defined (its assignment is commented out earlier in this chunk),
# so this rbind fails as written; kept disabled so that `dataset` still holds every project.
# dataset <- rbind(d.engine.noout, d.framework.noout)
```

# Functions

```{r}
# Mann-Whitney U
# https://stat-methods.com/home/mann-whitney-u-r/
library("gmodels")
library("car")
library("DescTools")
library("ggplot2")
library("qqplotr")
library("dplyr")

desc_stats <- function(dep_variable) {
  # Produce descriptive statistics by group
  dep_variable <- enquo(dep_variable)
  show(paste(as_label(dep_variable)))
  tab_ds <- dataset %>%
    select(type, !!dep_variable) %>%
    group_by(type) %>%
    summarise(n = n(),
              mean = mean(!!dep_variable, na.rm = TRUE),
              sd = sd(!!dep_variable, na.rm = TRUE),
              stderr = sd/sqrt(n),
              LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
              UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
              median = median(!!dep_variable, na.rm = TRUE),
              min = min(!!dep_variable, na.rm = TRUE),
              max = max(!!dep_variable, na.rm = TRUE),
              IQR = IQR(!!dep_variable, na.rm = TRUE),
              LCLmed = MedianCI(!!dep_variable, na.rm=TRUE)[2],
              UCLmed = MedianCI(!!dep_variable, na.rm=TRUE)[3])
  # write.csv(tab_ds, file = paste("../../tables/tab-", as_label(dep_variable), ".csv", sep = ""), append = TRUE)
  return(tab_ds)
}

# Boxplots
boxplot_no_outlier <- function(var_column, var, var_name, coef_out = 1.5) {
  var <- enquo(var)
  # smallest value flagged as an outlier; assumes the outliers lie above the box
  min_out <- min(boxplot.stats(var_column, coef = coef_out)$out)
  # >= keeps the smallest outlier in the listing
  dataset.out <- dataset %>% select(!!var, type, fullname) %>% arrange(!!var) %>% filter(!!var >= min_out)
  show(dataset.out)
  dataset.no_out <- dataset %>% filter(!!var < min_out)
  plot <- ggplot(dataset.no_out, aes(x = type, y = !!var, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none")+
    labs(title="", x="", y="") +
    theme(axis.text=element_text(size=16), axis.text.x = element_text(angle = 90, vjust = 0.5))
  show(plot)
  ggsave(filename=paste("../../plots/boxplot-",var_name,"-no-out.pdf", sep = ""), plot=plot, dpi=300, width = 6, units = "cm")
  return(dataset.out)
}

boxplot_no_outlier_flip <- function(var_column, var, var_name, coef_out = 1.5) {
  var <- enquo(var)
  min_out <- min(boxplot.stats(var_column, coef = coef_out)$out)
  dataset.out <- dataset %>% select(!!var, type, fullname) %>% arrange(!!var) %>% filter(!!var >= min_out)
  show(dataset.out)
  dataset.no_out <- dataset %>% filter(!!var < min_out)
  plot <- ggplot(dataset.no_out, aes(x = type, y = !!var, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    coord_flip() +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none")+
    labs(title="", x="", y="") +
    theme(axis.text=element_text(size=16), axis.text.x = element_text(angle = 0, vjust = 0.5))
  show(plot)
  ggsave(filename=paste("../../plots/boxplot-",var_name,"-no-out.pdf", sep = ""), plot=plot, dpi=300, height = 6, units = "cm")
  return(dataset.out)
}

box_plot <- function(dep_variable) {
  dep_variable <- enquo(dep_variable)
  dp_str <- as_label(dep_variable)
  # Produce Boxplots and visually check for outliers
  plot <- ggplot(dataset, aes(x = type, y = !!dep_variable, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    # scale_y_log10() +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none")
  ggsave(filename=paste("../../plots/boxplot-",dp_str,".pdf", sep = ""), plot=plot, dpi=300, width = 6, height = 6, units = "cm")
  show(plot)

  # scaling values
  plot <- ggplot(dataset, aes(x = type, y = !!dep_variable, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    scale_y_log10() +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none")
  ggsave(filename=paste("../../plots/boxplot-scaled-",dp_str,".pdf", sep = ""), plot=plot, dpi=300, width = 6, height = 6, units = "cm")
  show(plot)

  # removing outliers
  # '%!in%' <- function(x,y)!('%in%'(x,y)) # negated %in% helper
  # dataset.out <- rbind(dataset %>% filter(type == "engine") %>% filter( !!dep_variable %!in% boxplot.stats(!!dep_variable, coef = 1.5)$out),
  #                      dataset %>% filter(type == "framework") %>% filter( !!dep_variable %!in% boxplot.stats(!!dep_variable, coef = 1.5)$out))
  # pull() extracts the quoted column; dataset$dep_variable would be NULL
  min_out <- min(boxplot.stats(pull(dataset, !!dep_variable), coef = 1.5)$out)
  # filter on the requested variable rather than a hard-coded cc_mean
  dataset.out <- dataset %>% select(!!dep_variable, type, fullname) %>% arrange(!!dep_variable) %>% filter(!!dep_variable >= min_out)
  dataset.no_out <- dataset %>% filter(!!dep_variable < min_out)
  plot.out <- ggplot(dataset.no_out, aes(x = type, y = !!dep_variable, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    # geom_text(hjust=0.5, vjust=1.5, size=5) +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none")+
    labs(title="", x="", y="") +
    theme(axis.text=element_text(size=16), axis.text.x = element_text(angle = 90, vjust = 0.5))
  ggsave(filename=paste("../../plots/boxplot-",dp_str,"-no-outliers.pdf", sep = ""), plot=plot.out, dpi=300, width = 6, units = "cm")
  show(plot.out)

  # removing projects at or above the 75% quantile of tags_releases_count
  # dataset.out <- rbind(dataset %>% filter(type == "engine") %>% filter(tags_releases_count < quantile(tags_releases_count, 0.75)),
  #                      dataset %>% filter(type == "framework") %>% filter(tags_releases_count < quantile(tags_releases_count, 0.75)))
  dataset.out <- dataset %>% filter(tags_releases_count < quantile(tags_releases_count, 0.75))
  plot.out <- ggplot(dataset.out, aes(x = type, y = !!dep_variable, fill = type)) +
    stat_boxplot(geom ="errorbar", width = 0.5) +
    stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
    # scale_y_log10() +
    ggtitle("") +
    geom_boxplot(alpha=0.75) +
    scale_fill_brewer(palette=6)+
    theme_minimal()+
    theme(legend.position="none") +
    theme(axis.text=element_text(size=14))
  show(plot.out)
}

test_normality <- function(dep_variable){
  # Test each group for normality
  dep_variable <- enquo(dep_variable)
  dp_str <- as_label(dep_variable)
  # A p-value < 0.05 would indicate that we should reject the assumption of normality.
  shapiro <- dataset %>%
    group_by(type) %>%
    summarise(`W Stat` = shapiro.test(!!dep_variable)$statistic,
              p.value = shapiro.test(!!dep_variable)$p.value)
  # write.csv(shapiro, file = paste("../../tables/tab-norm-", dp_str, ".csv", sep = ""))
  # write.csv(shapiro, file = paste("../../tables/tab-", as_label(dep_variable), ".csv", sep = ""), append = TRUE)
  show(shapiro)

  # Perform QQ plots by group
  plot <- ggplot(data = dataset, mapping = aes(sample = !!dep_variable, color = type, fill = type)) +
    stat_qq_band(alpha=0.5, conf=0.95, qtype=1, bandType = "boot") +
    stat_qq_line(identity=TRUE) +
    stat_qq_point(col="black") +
    facet_wrap(~ type, scales = "free") +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    scale_fill_brewer(palette=6)+
    theme_minimal() +
    theme(legend.position="none")
  ggsave(filename=paste("../../plots/plot-norm-",dp_str,".pdf", sep = ""), plot=plot, dpi=300, units = "cm")
  show(plot)
  return(shapiro$p.value)
}

mw <- function(dep_variable){
  # Perform the Mann-Whitney U test
  # The Wilcoxon test statistic is the sum of the ranks in sample 1 minus n1*(n1+1)/2. n1 is the number of observations in sample 1.
  # A p-value < 0.05 indicates that we should reject the null hypothesis that the distributions
  # are equal and conclude that the samples differ significantly on the variable (e.g. CC).
  # The test is unpaired by default; missing values are dropped by the formula interface.
  mw <- wilcox.test(dep_variable ~ type, data=dataset, exact=FALSE, conf.int=TRUE)
  print(data.frame(mw$p.value, mw$estimate))
  # write.csv(data.frame(mw$p.value, mw$estimate), file = paste("../../tables/tab-", dp_str, ".csv", sep = ""), append = TRUE)
  # Hodges-Lehmann estimator
  # mw$estimate is the estimated location shift between the two groups (the median of all
  # pairwise differences), e.g. about 0.5 more CC for engine projects.
  # In that case, functions in engine projects tend to have higher CC than those in framework projects.
  return(data.frame(mw$p.value, mw$estimate))
}

effect_size <- function(dep_variable){
  # Effect size is a simple way of quantifying the difference between two groups.
  # This matters in experimentation: with a sufficiently large number of subjects, a
  # difference can be statistically significant and still not be meaningful in practice,
  # either because it is too small or because the cost of exploiting it is too high.
  # 0.10 – < 0.40 [small]
  # 0.40 – < 0.60 [med]
  # ≥ 0.60 [large]
  library(rcompanion)
  # NOTE: wilcoxonPairedR() targets paired samples, whereas mw() runs an unpaired test;
  # rcompanion::wilcoxonR() may be the closer match.
  wilcoxon <- wilcoxonPairedR(x = dep_variable, g = dataset$type, histogram = TRUE)
  # Classify on the magnitude; values below 0.10 fall under the "small" threshold
  r_abs <- abs(wilcoxon)
  if (r_abs < 0.10) {
    wil_str <- "(negligible)"
  } else if (r_abs < 0.40) {
    wil_str <- "(small)"
  } else if (r_abs < 0.60) {
    wil_str <- "(medium)"
  } else {
    wil_str <- "(large)"
  }
  return(paste(wilcoxon, wil_str))
}
```

# GOAL 1

## RQ 1.3, 1.4, 1.5 - DS

```{r}
# box_plot(main_language_size)
ds <- data.frame(
  desc_stats(main_language_size),
  test_normality(main_language_size)
) %>% mutate(variable = "main_language_size")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

# box_plot(total_size)
ds <- data.frame(
  desc_stats(total_size),
  test_normality(total_size)
) %>% mutate(variable = "total_size")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# box_plot(n_file)
ds <- data.frame(
  desc_stats(n_file),
  test_normality(n_file)
) %>% mutate(variable = "n_file")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# box_plot(n_func)
ds <- data.frame(
  desc_stats(n_func),
  test_normality(n_func)
) %>% mutate(variable = "n_func")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# box_plot(nloc_mean)
ds <- data.frame(
  desc_stats(nloc_mean),
  test_normality(nloc_mean)
) %>% mutate(variable = "nloc_mean")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# box_plot(cc_mean)
ds <- data.frame(
  desc_stats(cc_mean),
  test_normality(cc_mean)
) %>% mutate(variable = "cc_mean")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8",
            row.names = FALSE, col.names = FALSE)

# box_plot(func_per_file_mean)
ds <- data.frame(
  desc_stats(func_per_file_mean),
  test_normality(func_per_file_mean)
) %>% mutate(variable = "func_per_file_mean")
write.table(x = ds, file = "../../tables/tab-ds.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

boxplot_no_outlier(dataset$main_language_size, main_language_size, "main_language_size")
boxplot_no_outlier(dataset$total_size, total_size, "total_size")
boxplot_no_outlier(dataset$n_file, n_file, "n_file")
boxplot_no_outlier(dataset$nloc_mean, nloc_mean, "nloc_mean")
boxplot_no_outlier(dataset$n_func, n_func, "n_func")
boxplot_no_outlier(dataset$func_per_file_mean, func_per_file_mean, "func_per_file_mean")

# CC
cc_out <- min(boxplot.stats(dataset$cc_mean, coef = 1.5)$out)
# >= keeps the smallest outlier in the listing
dataset.out <- dataset %>% select(cc_mean, type, fullname) %>% arrange(cc_mean) %>% filter(cc_mean >= cc_out)
dataset.no_out <- dataset %>% filter(cc_mean < cc_out)
plot <- ggplot(dataset.no_out, aes(x = type, y = cc_mean, fill = type)) +
  stat_boxplot(geom ="errorbar", width = 0.5) +
  stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") +
  # scale_y_log10() +
  coord_flip() +
  ggtitle("") +
  geom_boxplot(alpha=0.75) +
  scale_fill_brewer(palette=6)+
  theme_minimal()+
  # theme(legend.position="none", axis.title.y=element_blank())
  theme(legend.position="none")+
  labs(title="", x="", y="") +
  theme(axis.text=element_text(size=16))
ggsave(filename=paste("../../plots/boxplot-cc_mean-no-outliers.pdf", sep = ""), plot=plot, dpi=300, height = 6, units = "cm")
```

## RQ 1.3, 1.4, 1.5 - MW

```{r}
ds <- data.frame(
  mw(dataset$main_language_size),
  effect_size(dataset$main_language_size)
) %>% mutate(variable = "main_language_size")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

ds <- data.frame(
  mw(dataset$total_size),
  effect_size(dataset$total_size)
) %>% mutate(variable = "total_size")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# funcs c(n_file, n_func, nloc_mean, cc_mean, func_per_file_mean)
ds <- data.frame(
  mw(dataset$n_file),
  effect_size(dataset$n_file)
) %>% mutate(variable = "n_file")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

ds <- data.frame(
  mw(dataset$n_func),
  effect_size(dataset$n_func)
) %>% mutate(variable = "n_func")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

ds <- data.frame(
  mw(dataset$nloc_mean),
  effect_size(dataset$nloc_mean)
) %>% mutate(variable = "nloc_mean")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

ds <- data.frame(
  mw(dataset$cc_mean),
  effect_size(dataset$cc_mean)
) %>% mutate(variable = "cc_mean")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

ds <- data.frame(
  mw(dataset$func_per_file_mean),
  effect_size(dataset$func_per_file_mean)
) %>% mutate(variable = "func_per_file_mean")
write.table(x = ds, file = "../../tables/tab-mw.csv", append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)
```

## Checking the outliers

```{r}
'%!in%' <- function(x,y)!('%in%'(x,y))

# keep only the rows whose nloc_mean is flagged as an outlier
dataset %>%
  filter(nloc_mean %in% boxplot.stats(nloc_mean, coef = 1.5)$out) %>%
  select(type, name, nloc_mean , url, main_language) %>%
  mutate(nloc_mean = round(nloc_mean, 0)) %>%
  arrange(desc(nloc_mean))

dataset %>%
  filter(type == "engine") %>%
  filter(nloc_mean %in% boxplot.stats(nloc_mean, coef = 1.5)$out) %>%
  select(name, nloc_mean, main_language, url) %>%
  arrange(desc(nloc_mean))

dataset %>%
  filter(type == "framework") %>%
  filter(nloc_mean %in% boxplot.stats(nloc_mean, coef = 1.5)$out) %>%
  select(name, nloc_mean, main_language, url) %>%
  arrange(desc(nloc_mean))
```

## RQ 2 - Licenses

```{r}
# plotting a treemap
# library(treemap)
# dataset %>% filter(type == "engine") %>% select(license) %>% table(.) %>% data.frame(.) %>% treemap(., index=".", vSize="Freq", type="index", title="Engines")
# dataset %>% filter(type == "framework") %>% select(license) %>% table(.) %>% data.frame(.) %>% treemap(., index=".", vSize="Freq", type="index", title="Frameworks")

# table
t.e <- dataset %>%
  select(fullname, license, type) %>%
  filter(type == "engine") %>%
  group_by(license) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  mutate(p = freq / sum(freq)) %>%
  arrange(desc(freq))

t.f <- dataset %>%
  select(fullname, license, type) %>%
  filter(type == "framework") %>%
  group_by(license) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  mutate(p = freq / sum(freq) ) %>%
  arrange(desc(freq))

lic_table <- merge(x = t.e, y = t.f, by = "license", all = TRUE) %>%
  replace(is.na(.), 0) %>%
  mutate(sum = freq.x + freq.y) %>%
  mutate(p.t = sum / nrow(dataset)) %>%
  arrange(desc(sum))
# write.table(lic_table, file="../../tables/tab-licenses.csv", sep = ",", fileEncoding = "UTF-8")

# barplot
# TODO fix the problem with the factor not appearing
lic_table_top <- lic_table %>% top_n(x = ., n = 10, wt = sum)

lic <- dataset %>%
  select(fullname, license, type) %>%
  # mutate_if(is.factor, funs(factor(replace(., .=="", "Not defined")))) %>%
  # mutate_if(license == "", license = "GG")
  group_by(license, type) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  filter(license %in% lic_table_top$license)

barplot <- ggplot(lic, aes(reorder(license, -freq), freq)) +
  geom_bar(aes(fill = type), position = "stack", stat="identity", inherit.aes = TRUE)+
  coord_flip() +
  scale_fill_brewer(palette=6)+
  theme_minimal() +
  theme(legend.position="bottom") +
  labs(title="", x="", y="Frequency") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))
# ggsave(filename=paste("../../plots/barplot-license.pdf", sep = ""), plot=barplot, dpi=300, units = "cm")

dataset %>% filter(license == "Other") %>% select(license)
```

* The majority of the projects use the MIT License.

## RQ 1 - Languages

```{r}
# plotting a treemap
# library(treemap)
# dataset %>% filter(type == "engine") %>% select(main_language) %>% table(.) %>% data.frame(.) %>% treemap(., index=".", vSize="Freq", type="index", title="Engines")
# dataset %>% filter(type == "framework") %>% select(main_language) %>% table(.) %>% data.frame(.) %>% treemap(., index=".", vSize="Freq", type="index", title="Frameworks")

# table
t.e <- dataset %>%
  select(fullname, main_language, type) %>%
  filter(type == "engine") %>%
  group_by(main_language) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  mutate(p = freq / sum(freq)) %>%
  arrange(desc(freq))

t.f <- dataset %>%
  select(fullname, main_language, type) %>%
  filter(type == "framework") %>%
  group_by(main_language) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  mutate(p = freq / sum(freq) ) %>%
  arrange(desc(freq))

lic_table <- merge(x = t.e, y = t.f, by = "main_language", all = TRUE) %>%
  replace(is.na(.), 0) %>%
  mutate(sum = freq.x + freq.y) %>%
  mutate(p.t = sum / nrow(dataset)) %>%
  arrange(desc(sum))
  # mutate(sum = rowSums(.[c(2,4)])) %>%
  # mutate(sum = rowSums(.[c(3,5)])) %>%
write.table(lic_table, file="../../tables/tab-languages.csv", sep = ",", fileEncoding = "UTF-8")

# barplot
lic_table_top <- lic_table %>% top_n(x = ., n = 10, wt = sum)

lic <- dataset %>%
  select(fullname, main_language, type) %>%
  mutate_if(is.factor, funs(factor(replace(., .=="", "Not defined")))) %>%
  group_by(main_language, type) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  filter(main_language %in% lic_table_top$main_language)

barplot <- ggplot(lic, aes(reorder(main_language, -freq), freq)) +
  geom_bar(aes(fill = type), position = "stack", stat="identity", colour="gray20")+
  scale_fill_brewer(palette=6)+
  theme_minimal() +
  theme(legend.position="bottom") +
  labs(title="", x="", y="Frequency") +
  theme(axis.text.x = element_text(angle = 0, vjust = 1), axis.text=element_text(size=14))
ggsave(filename=paste("../../plots/barplot-language.pdf", sep = ""), plot=barplot, dpi=300, units = "cm")
```

* For engines, C++ is the most used language.
* For frameworks, JavaScript is the most used language.

# GOAL 2 - Temporal Analysis

## RQ 5 - How frequent are releases in the projects?

```{r}
engines.tags_dataset = dataset %>%
  filter(type == "engine") %>%
  select(c("name", "owner", "tags_releases", "tags_releases_count", "created_at", "last_push", "lifespan")) %>%
  mutate(tags_per_year = tags_releases_count / (lifespan / 52.25))

frameworks.tags_dataset = dataset %>%
  filter(type == "framework") %>%
  select(c("name", "owner", "tags_releases", "tags_releases_count", "created_at", "last_push", "lifespan")) %>%
  mutate(tags_per_year = tags_releases_count / (lifespan / 52.25))
```

We removed projects with fewer than one tag released per year for the next analysis.
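The filter described above can be sketched on toy data. This is a minimal illustration rather than the study's pipeline: `toy_tags` and its values are hypothetical, and only the column names (`tags_releases_count`, `lifespan`, `tags_per_year`) are taken from the preceding chunk.

```r
library(dplyr)

# Hypothetical toy data; lifespan is in weeks, as in the notebook.
toy_tags <- data.frame(
  name = c("a", "b", "c"),
  tags_releases_count = c(12, 1, 0),
  lifespan = c(104.5, 209, 52.25)
)

toy_filtered <- toy_tags %>%
  # same rate as above: tags divided by the lifespan expressed in years
  mutate(tags_per_year = tags_releases_count / (lifespan / 52.25)) %>%
  # keep projects with at least one tag per year (and at least one tag at all)
  filter(tags_per_year >= 1 & tags_releases_count != 0)

toy_filtered
```

Only project `a` survives here: `b` has a single tag over roughly four years, and `c` has no tags.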
```{r}
# engines.tags_dataset = engines.tags_dataset %>% filter(tags_per_year >= 1 & tags_releases_count != 0)
engines.tags_dataset %>%
  filter(tags_releases_count > 0) %>%
  ggplot(aes(x = "", y = tags_releases_count)) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  scale_y_log10() +
  ylab("# of Tags in each project") +
  coord_flip() +
  theme_classic() +
  ggtitle("Engines")

# frameworks.tags_dataset = frameworks.tags_dataset %>% filter(tags_per_year >= 1 & tags_releases_count != 0)
frameworks.tags_dataset %>%
  filter(tags_releases_count > 0) %>%
  ggplot(aes(x = "", y = tags_releases_count)) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  scale_y_log10() +
  ylab("# of Tags in each project") +
  coord_flip() +
  theme_classic() +
  ggtitle("Frameworks")

# barplot
barplot <- ggplot(dataset, aes(reorder(type, -tags_releases_count), tags_releases_count)) +
  geom_bar(aes(fill = type), position = "stack", stat="identity", inherit.aes = TRUE) +
  coord_flip() +
  scale_fill_brewer(palette=6)+
  theme_minimal() +
  theme(legend.position="bottom") +
  labs(title="", x="", y="tags_releases_count") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1), axis.text.y = element_blank())
ggsave(filename=paste("../../plots/barplot-releases.pdf", sep = ""), plot=barplot, dpi=300, units = "cm")

# box_plot(tags_releases_count)
dataset %>% filter(type == "engine" & tags_releases_count == 0)
dataset %>% filter(type == "framework" & tags_releases_count == 0)

dataset %>% filter(type == "engine") %>% group_by(tags_releases_count) %>% summarise(n = n()) %>% arrange(desc(n))
dataset %>% filter(type == "framework") %>% group_by(tags_releases_count) %>% summarise(n = n()) %>% arrange(desc(n))
```

* `r ecdf(engines.tags_dataset$tags_releases_count)(10) * 100`% of the engines have at most 10 tags.
* `r ecdf(frameworks.tags_dataset$tags_releases_count)(10) * 100`% of the frameworks have at most 10 tags.
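The percentages in the bullets above come from `ecdf()`, which returns a function giving the share of observations less than or equal to its argument. A small sketch with a hypothetical vector of tag counts (`tags` is made up, not study data):

```r
# Hypothetical tag counts for eight projects.
tags <- c(0, 2, 5, 10, 11, 40, 120, 3)

# ecdf(tags) is a step function; evaluating it at 10 gives P(tags <= 10).
share_at_most_10 <- ecdf(tags)(10) * 100

# The same share computed directly, as a cross-check.
share_direct <- mean(tags <= 10) * 100

share_at_most_10
```

Because the comparison is "less than or equal", the result reads as "at most 10 tags"; the complementary share, `100 - share_at_most_10`, covers projects with more than 10 tags.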
```{r}
library(tidyr)
library(stringr)

engines.tags_intervals = engines.tags_dataset %>%
  separate_rows(tags_releases, sep = "\n") %>%
  mutate(tags_releases = as.Date(str_extract(tags_releases, "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"))) %>%
  group_by(name, owner, created_at, last_push, tags_releases_count, tags_per_year, lifespan) %>%
  arrange(desc(tags_releases)) %>%
  mutate(tag_interval = lag(tags_releases) - tags_releases) %>%
  mutate(tag_interval = replace_na(tag_interval, 0))

engines.tags_summary = engines.tags_intervals %>%
  summarise(interval_avg = as.numeric(mean(tag_interval))) %>%
  ungroup()

engines.tags_summary %>%
  mutate(type = "engine") %>%
  select(tags_releases_count, tags_per_year, lifespan, interval_avg) %>%
  summary()

engines.tags_summary %>%
  filter(interval_avg > 0 ) %>%
  ggplot(aes(x = "", y = interval_avg)) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  scale_y_log10() +
  ylab("Average interval (in days) of tags creation") +
  coord_flip() +
  theme_classic() +
  ggtitle("Engines")

# frameworks
frameworks.tags_intervals = frameworks.tags_dataset %>%
  separate_rows(tags_releases, sep = "\n") %>%
  mutate(tags_releases = as.Date(str_extract(tags_releases, "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"))) %>%
  group_by(name, owner, created_at, last_push, tags_releases_count, tags_per_year, lifespan) %>%
  arrange(desc(tags_releases)) %>%
  mutate(tag_interval = lag(tags_releases) - tags_releases) %>%
  mutate(tag_interval = replace_na(tag_interval, 0))

frameworks.tags_summary = frameworks.tags_intervals %>%
  summarise(interval_avg = as.numeric(mean(tag_interval))) %>%
  ungroup()

frameworks.tags_summary %>%
  mutate(type = "framework") %>%
  select(tags_releases_count, tags_per_year, lifespan, interval_avg) %>%
  summary()

frameworks.tags_summary %>%
  filter(interval_avg > 0 ) %>%
  ggplot(aes(x = "", y = interval_avg)) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  scale_y_log10() +
  ylab("Average interval (in days) of tags creation") +
  coord_flip() +
  theme_classic() +
  ggtitle("Frameworks")
```

* On average, engine releases are created every `r mean(engines.tags_summary$interval_avg)` days. For half of the projects, the average interval between releases is at most `r median(engines.tags_summary$interval_avg)` days.
* On average, framework releases are created every `r mean(frameworks.tags_summary$interval_avg)` days. For half of the projects, the average interval between releases is at most `r median(frameworks.tags_summary$interval_avg)` days.

```{r}
monthOrder <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

engines.tags_intervals %>%
  mutate(month = factor(format(tags_releases, "%b"), levels = monthOrder)) %>%
  group_by(name, owner, month) %>%
  summarise(n_releases = n()) %>%
  ungroup() %>%
  group_by(month, n_releases) %>%
  summarize(n_systems = n()) %>%
  ungroup() %>%
  ggplot(aes(y = n_releases, x = month, size = n_releases, alpha = n_releases)) +
  geom_boxplot() +
  theme_light()

monthOrder <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

frameworks.tags_intervals %>%
  mutate(month = factor(format(tags_releases, "%b"), levels = monthOrder)) %>%
  group_by(name, owner, month) %>%
  summarise(n_releases = n()) %>%
  ungroup() %>%
  group_by(month, n_releases) %>%
  summarize(n_systems = n()) %>%
  ungroup() %>%
  ggplot(aes(y = n_releases, x = month, size = n_releases, alpha = n_releases)) +
  geom_boxplot() +
  theme_light()
```

**Findings**

* The middle 50% of engines have between `r quantile(engines.tags_dataset$tags_releases_count)[2]` and `r quantile(engines.tags_dataset$tags_releases_count)[4]` tags.
* The middle 50% of frameworks have between `r quantile(frameworks.tags_dataset$tags_releases_count)[2]` and `r quantile(frameworks.tags_dataset$tags_releases_count)[4]` tags.

## RQ 6 - What is the lifetime of the projects?
```{r}
engines.lifetime_dataset = dataset %>%
  filter(type == "engine") %>%
  select(c("created_at", "last_push", "lifespan", "main_language")) %>%
  arrange(desc(lifespan))

# projects creation through time
ggplot(engines.lifetime_dataset) +
  geom_histogram(aes(created_at), bins = 30) +
  geom_vline(xintercept = median(engines.lifetime_dataset$created_at), linetype = "dashed", color = "red") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Engines")

frameworks.lifetime_dataset = dataset %>%
  filter(type == "framework") %>%
  select(c("created_at", "last_push", "lifespan", "main_language"))

# projects creation through time
ggplot(frameworks.lifetime_dataset) +
  geom_histogram(aes(created_at), bins = 30) +
  geom_vline(xintercept = median(frameworks.lifetime_dataset$created_at), linetype = "dashed", color = "red") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Frameworks")
```

```{r}
# Projects lifetime (lifespan is measured in weeks)
engines.lifetime_dataset %>%
  select(lifespan) %>%
  summary

engines.lifetime_dataset %>%
  ggplot(aes(y=lifespan, x="Engines")) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  coord_flip() +
  theme_classic()

frameworks.lifetime_dataset %>%
  select(lifespan) %>%
  summary

frameworks.lifetime_dataset %>%
  ggplot(aes(y=lifespan, x="Frameworks")) +
  geom_violin(fill = "lightgrey") +
  geom_boxplot(width = 0.1) +
  coord_flip() +
  theme_classic()
```

* Regarding engine lifetime, `r 100 - ecdf(engines.lifetime_dataset$lifespan)(100) * 100`% of the projects are more than 100 weeks old.
* Regarding framework lifetime, `r 100 - ecdf(frameworks.lifetime_dataset$lifespan)(100) * 100`% of the projects are more than 100 weeks old.

## RQ 7 - How frequent are contributions to the projects?
```{r}
# commits_per_time is a per-project rate (commits per week).
# pmax() floors each project's lifespan at one week; max(c(lifespan, 1)) would
# instead divide every project by the same constant (the longest lifespan).
engines.general_contributions = dataset %>%
  filter(type == "engine") %>%
  select(c(owner, name, created_at, tags_releases_count:lifespan)) %>%
  mutate(commits_per_time = commits_count / pmax(lifespan, 1))

engines.general_contributions %>%
  select(commits_count, commits_per_time) %>%
  summary

frameworks.general_contributions = dataset %>%
  filter(type == "framework") %>%
  select(c(owner, name, created_at, tags_releases_count:lifespan)) %>%
  mutate(commits_per_time = commits_count / pmax(lifespan, 1))

frameworks.general_contributions %>%
  select(commits_count, commits_per_time) %>%
  summary
```

* Half of the engines have more than `r quantile(engines.general_contributions$commits_count)[3]` commits. Considering commits over time, `r ecdf(engines.general_contributions$commits_per_time)(1) * 100`% of the projects average at most one commit per week.
* Half of the frameworks have more than `r quantile(frameworks.general_contributions$commits_count)[3]` commits. Considering commits over time, `r ecdf(frameworks.general_contributions$commits_per_time)(1) * 100`% of the projects average at most one commit per week.

```{r}
engines.general_contributions %>%
  ggplot() +
  geom_histogram(aes(x=commits_per_time), alpha = 0.3, fill = "red", bins = 50) +
  geom_histogram(aes(x=commits_count), alpha = 0.3, fill = "blue", bins = 50) +
  scale_x_log10() +
  labs(x = "Commits number (log scale)") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Engines")

frameworks.general_contributions %>%
  ggplot() +
  geom_histogram(aes(x=commits_per_time), alpha = 0.3, fill = "red", bins = 50) +
  geom_histogram(aes(x=commits_count), alpha = 0.3, fill = "blue", bins = 50) +
  scale_x_log10() +
  labs(x = "Commits number (log scale)") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Frameworks")
```

* `commits_count` and `commits_per_time` have nearly identical distributions when the rate is computed as `commits_count / max(c(lifespan, 1))`, because that expression divides every project by the same constant. With the per-project denominator `pmax(lifespan, 1)`, the two distributions should be compared again before concluding that they increase in the same proportion.
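The commit rate depends on how the denominator is built. A minimal sketch on made-up numbers (the `toy` data frame is hypothetical and not part of the datasets) of the difference between `max(c(lifespan, 1))`, which returns one value for the whole column, and `pmax(lifespan, 1)`, which works element-wise:

```r
# Hypothetical toy data; not taken from the engines or frameworks datasets.
toy <- data.frame(
  commits_count = c(100, 400, 50),
  lifespan      = c(10, 200, 0.5)  # weeks
)

# max() collapses the column to a single number, so every project is
# divided by the same constant (here 200, the longest lifespan).
rate_global <- toy$commits_count / max(c(toy$lifespan, 1))

# pmax() compares element-wise, so each project is divided by its own
# lifespan, floored at one week.
rate_per_project <- toy$commits_count / pmax(toy$lifespan, 1)

print(rate_global)
print(rate_per_project)
```

With the global denominator the rate is just `commits_count` rescaled, so both variables share the same shape on a log axis; the element-wise version can reorder projects.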
## RQ 2.1, 2.2, 2.3 - DS

```{r}
str(frameworks.general_contributions)

# commits_count
tab_name_ds <- "../../tables/tab-ds-rq2.csv"
dout <- boxplot_no_outlier_flip(dataset$commits_count, commits_count, "commits_count")
dout %>% group_by(type) %>% summarise(freq = n())
ds_table <- data.frame(
  desc_stats(commits_count),
  test_normality(commits_count)
) %>%
  mutate(variable = "commits_count")
write.table(x = ds_table, file = tab_name_ds, append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

# commits_per_time
# box_plot(commits_per_time)
dout <- boxplot_no_outlier_flip(dataset$commits_per_time, commits_per_time, "commits_per_time")
dout %>% group_by(type) %>% summarise(freq = n())
ds_table <- data.frame(
  desc_stats(commits_per_time),
  test_normality(commits_per_time)
) %>%
  mutate(variable = "commits_per_time")
write.table(x = ds_table, file = tab_name_ds, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# tags_releases_count
# box_plot(tags_releases_count)
dout <- boxplot_no_outlier_flip(dataset$tags_releases_count, tags_releases_count, "tags_releases_count")
dout %>% group_by(type) %>% summarise(freq = n())
ds_table <- data.frame(
  desc_stats(tags_releases_count),
  test_normality(tags_releases_count)
) %>%
  mutate(variable = "tags_releases_count")
write.table(x = ds_table, file = tab_name_ds, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# lifespan
# box_plot(lifespan)
dout <- boxplot_no_outlier_flip(dataset$lifespan, lifespan, "lifespan")
dout %>% group_by(type) %>% summarise(freq = n())
ds_table <- data.frame(
  desc_stats(lifespan),
  test_normality(lifespan)
) %>%
  mutate(variable = "lifespan")
write.table(x = ds_table, file = tab_name_ds, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# created_at
plot <- ggplot(dataset, aes(x = type, y = created_at, fill = type)) +
  stat_boxplot(geom ="errorbar", width = 0.5) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun=mean, geom="point", shape=10, size=3.5, color="black") +
  # scale_y_log10() +
  coord_flip() +
  ggtitle("") +
  geom_boxplot(alpha=0.75) +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  theme(legend.position="none") +
  labs(title="", x="", y="") +
  theme(axis.text=element_text(size=16)) +
  expand_limits(y = max(dataset$created_at) + 100)
ggsave(filename = "../../plots/boxplot-created_at.pdf", plot=plot, dpi=300, height = 6, units = "cm")
```

## RQ 2.1, 2.2, 2.3 - MW

```{r}
tab_name_mw <- "../../tables/tab-mw-rq2.csv"

mw_table <- data.frame(
  mw(dataset$tags_releases_count),
  effect_size(dataset$tags_releases_count)
) %>%
  mutate(variable = "tags_releases_count")
write.table(x = mw_table, file = tab_name_mw, append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

mw_table <- data.frame(
  mw(dataset$lifespan),
  effect_size(dataset$lifespan)
) %>%
  mutate(variable = "lifespan")
write.table(x = mw_table, file = tab_name_mw, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

mw_table <- data.frame(
  mw(dataset$commits_count),
  effect_size(dataset$commits_count)
) %>%
  mutate(variable = "commits_count")
write.table(x = mw_table, file = tab_name_mw, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

mw_table <- data.frame(
  mw(dataset$commits_per_time),
  effect_size(dataset$commits_per_time)
) %>%
  mutate(variable = "commits_per_time")
write.table(x = mw_table, file = tab_name_mw, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)
```

## [TODO] Analyzing commits (TODO: we need to use pydriller)

```{r}
# contributions_intervals = read.csv("../datasets/pydriller/all_engines.csv", head = T, sep = ",") %>%
#   distinct(project_name, author, author_date, hash) %>%
#   mutate(author_date = as.Date(author_date)) %>%
#   group_by(project_name) %>%
#   arrange(author_date) %>%
#   mutate(
#     rolling_n_commits = row_number(),
#     commits_interval =
#       as.numeric(author_date - lag(author_date, default = author_date[1])),
#     rolling_commits_interval = cumsum(commits_interval)
#   ) %>%
#   ungroup %>%
#   drop_na
```

```{r}
# contributions_intervals %>%
#   group_by(project_name) %>%
#   summarise(
#     median_commits_interval = median(commits_interval),
#     median_rolling_commits_interval = median(rolling_commits_interval),
#     max_inactive_interval = max(commits_interval)
#   ) %>%
#   ungroup %>%
#   group_by(median_commits_interval) %>%
#   summarise(n_projects = n())
```

A closer look at the commits shows that the vast majority are made in bursts (short time intervals). 157 out of 171 projects committed most of their code on the same day (median interval = 0).

```{r}
# commits_intervals = contributions_intervals %>%
#   group_by(project_name) %>%
#   summarise(
#     max_inactive_interval = max(commits_interval)
#   ) %>%
#   arrange(desc(max_inactive_interval))
```

```{r}
# commits_intervals %>%
#   summarize(
#     d_30 = length(project_name[max_inactive_interval >= 30]),
#     d_90 = length(project_name[max_inactive_interval >= 90]),
#     d_180 = length(project_name[max_inactive_interval >= 180]),
#     d_365 = length(project_name[max_inactive_interval >= 365])
#   )
```

```{r}
# monthOrder <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
# contributions_intervals %>%
#   select(project_name, author_date, commits_interval, rolling_n_commits) %>%
#   mutate(month = factor(format(author_date, "%b"), levels = monthOrder)) %>%
#   group_by(project_name, month) %>%
#   summarise(n_commits = n()) %>%
#   ungroup() %>%
#   ggplot(aes(x = month, y = n_commits)) +
#   geom_boxplot() +
#   scale_y_log10() +
#   theme_light()
```

# GOAL 3: Community Engagement

## RQ 8 - What is the popularity of game engines considering their main languages?
* M: correlate main language with stars and contributors

```{r}
engines.community_dataset = dataset %>%
  filter(type == "engine") %>%
  select(name, owner, main_language, stargazers_count, contributors_count)

frameworks.community_dataset = dataset %>%
  filter(type == "framework") %>%
  select(name, owner, main_language, stargazers_count, contributors_count)

engines.community_dataset %>%
  ggplot(aes(x = stargazers_count)) +
  geom_histogram(bins = 10) +
  geom_vline(xintercept = median(engines.community_dataset$stargazers_count), linetype = "dashed", color = "red") +
  annotate("text", x = median(engines.community_dataset$stargazers_count) + 10, y = 40, label = median(engines.community_dataset$stargazers_count)) +
  scale_x_log10() +
  theme_light() +
  ggtitle("Engines")

frameworks.community_dataset %>%
  ggplot(aes(x = stargazers_count)) +
  geom_histogram(bins = 10) +
  geom_vline(xintercept = median(frameworks.community_dataset$stargazers_count), linetype = "dashed", color = "red") +
  annotate("text", x = median(frameworks.community_dataset$stargazers_count) + 10, y = 40, label = median(frameworks.community_dataset$stargazers_count)) +
  scale_x_log10() +
  theme_light() +
  ggtitle("Frameworks")

# engines with exactly two contributors
engines.community_dataset %>%
  arrange(desc(contributors_count)) %>%
  filter(contributors_count == 2)
```

* [Game engines] 50% of the projects have `r quantile(engines.community_dataset$stargazers_count)[3]` stars or less. In fact, only `r 100 - ecdf(engines.community_dataset$stargazers_count)(1000) * 100`% of the projects have more than 1K stars.
* [Frameworks] 50% of the projects have `r quantile(frameworks.community_dataset$stargazers_count)[3]` stars or less. In fact, only `r 100 - ecdf(frameworks.community_dataset$stargazers_count)(1000) * 100`% of the projects have more than 1K stars.
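The percentages in the two bullets above come from the empirical CDF: `ecdf(x)(t)` is the share of values less than or equal to `t`, so the share strictly above `t` is its complement. A small sketch with invented star counts (the `stars` vector is hypothetical, not data from the study):

```r
# Hypothetical star counts for eight projects.
stars <- c(3, 15, 40, 120, 800, 1000, 2500, 9000)

# Share of projects with at most 1000 stars, and its complement.
share_at_most_1k <- ecdf(stars)(1000)
share_above_1k   <- 1 - share_at_most_1k

# Equivalent direct computation of the share above the threshold.
share_above_direct <- mean(stars > 1000)

# quantile() returns 0%, 25%, 50%, 75%, 100%; element 3 is the median.
median_stars <- unname(quantile(stars)[3])

print(c(share_at_most_1k, share_above_1k, share_above_direct, median_stars))
```

Because `ecdf` counts values equal to the threshold as "at most", a project with exactly 1000 stars is not included in the "more than 1K" share.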
```{r}
# ten most frequent main languages per project type
top_langs_eng <- dataset %>%
  filter(type == "engine") %>%
  group_by(main_language) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  top_n(x = ., n = 10, wt = freq)

top_langs_fra <- dataset %>%
  filter(type == "framework") %>%
  group_by(main_language) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  top_n(x = ., n = 10, wt = freq)

# ggsave() is called as its own statement (it saves the last plot);
# adding it to a plot with `+` fails in recent ggplot2 versions.
engines.community_dataset %>%
  group_by(main_language) %>%
  mutate(n = n()) %>%
  # mutate(med = median(stargazers_count)) %>%
  ungroup() %>%
  filter(main_language %in% top_langs_eng$main_language) %>%
  ggplot(aes(x = fct_reorder(main_language, stargazers_count, .fun = median, .desc = TRUE), y = stargazers_count)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +
  labs(x = "", y = "stargazers_count") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Engines")
ggsave(filename = "../../plots/boxplot-stars-lang-eng.pdf", dpi=300, units = "cm", width = 14, height = 7)

frameworks.community_dataset %>%
  group_by(main_language) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(main_language %in% top_langs_fra$main_language) %>%
  ggplot(aes(x = fct_reorder(main_language, stargazers_count, .fun = median, .desc = TRUE), y = stargazers_count)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +
  labs(x = "", y = "stargazers_count") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Frameworks")
ggsave(filename = "../../plots/boxplot-stars-lang-fra.pdf", dpi=300, units = "cm", width = 14, height = 7)

## contributors_count
engines.community_dataset %>%
  group_by(main_language) %>%
  mutate(n = n()) %>%
  # mutate(med = median(contributors_count)) %>%
  ungroup() %>%
  filter(main_language %in% top_langs_eng$main_language) %>%
  ggplot(aes(x = fct_reorder(main_language, contributors_count, .fun = median, .desc = TRUE), y = contributors_count)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +
  labs(x = "", y = "contributors_count") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Engines")
ggsave(filename = "../../plots/boxplot-contrib-lang-eng.pdf", dpi=300, units = "cm")

frameworks.community_dataset %>%
  group_by(main_language) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(main_language %in% top_langs_fra$main_language) %>%
  ggplot(aes(x = fct_reorder(main_language, contributors_count, .fun = median, .desc = TRUE), y = contributors_count)) +
  geom_boxplot(width = 0.5) +
  scale_y_log10() +
  labs(x = "", y = "contributors_count") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  ggtitle("Frameworks")
ggsave(filename = "../../plots/boxplot-contrib-lang-fra.pdf", dpi=300, units = "cm")
```

* In contrast to what the literature reports for open source projects in general, *C* game engines are the most popular (median `r median(engines.community_dataset[engines.community_dataset$main_language == "C",]$stargazers_count)` stars), followed by *JavaScript* (median `r median(engines.community_dataset[engines.community_dataset$main_language == "JavaScript",]$stargazers_count)` stars).

**Findings**

* [Engines] `r 100 - ecdf(engines.community_dataset$stargazers_count)(1000) * 100`% of the projects have more than 1000 stars.
* [Frameworks] `r 100 - ecdf(frameworks.community_dataset$stargazers_count)(1000) * 100`% of the projects have more than 1000 stars.
* *C* game engines are the most popular (median of `r median(engines.community_dataset[engines.community_dataset$main_language == "C",]$stargazers_count)` stars).
* *JavaScript* is the most popular scripting language for game engines.
### RQ8 - barplot and table

```{r}
# total number of stars across all engines
dataset %>%
  select(name, owner, main_language, stargazers_count, contributors_count, type) %>%
  filter(type == "engine") %>%
  summarise(p = sum(stargazers_count))

# table
t.e <- dataset %>%
  select(name, owner, main_language, stargazers_count, contributors_count, type) %>%
  filter(type == "engine") %>%
  group_by(main_language) %>%
  summarise(freq = n()) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  top_n(x = ., n = 10, wt = freq)
  # summarise(med = median(stargazers_count))
  # ungroup %>%
  # summarise(freq = n()) %>%
  # summarise(freq = sum(stargazers_count)) %>%
  # arrange(desc(med))

dataset %>%
  group_by(main_language) %>%
  # summarise(med = median(stargazers_count)) %>%
  summarise(mean = mean(stargazers_count)) %>%
  arrange(desc(mean)) %>%
  filter(main_language %in% t.e$main_language)

t.f <- dataset %>%
  select(name, owner, main_language, stargazers_count, contributors_count, type) %>%
  filter(type == "framework") %>%
  group_by(main_language) %>%
  # summarise(freq = sum(stargazers_count)) %>%
  # Assumption: freq is the number of projects per language, as in t.e;
  # it is needed by the lines below and by lic_table.
  summarise(freq = n(), med = median(stargazers_count)) %>%
  ungroup %>%
  mutate(p = round(freq / sum(freq), 2)) %>%
  # mutate(type="framework") %>%
  arrange(desc(freq))

lic_table <- merge(x = t.e, y = t.f, by = "main_language", all = TRUE) %>%
  replace(is.na(.), 0) %>%
  mutate(sum = freq.x + freq.y) %>%
  mutate(p.t = round(sum / sum(freq.x + freq.y), 2)) %>%
  arrange(desc(sum))

write.table(lic_table, file="../../tables/tab-stars.csv", sep = ",", fileEncoding = "UTF-8")

# barplot
lic_table_top <- lic_table %>%
  top_n(x = ., n = 10, wt = sum)

lic <- dataset %>%
  select(fullname, main_language, stargazers_count, type) %>%
  # convert to character first so that "Not defined" is not dropped as an unknown factor level
  mutate_if(is.factor, ~ factor(replace(as.character(.x), .x == "", "Not defined"))) %>%
  group_by(type, main_language) %>%
  summarise(freq = sum(stargazers_count)) %>%
  ungroup %>%
  arrange(desc(freq)) %>%
  filter(main_language %in% lic_table_top$main_language)

# same order as lang freq
# add median
ggplot(lic, aes(reorder(main_language, -freq), freq)) +
  geom_bar(aes(fill = type), position = "dodge", stat="identity") +
  scale_fill_brewer(palette=6) +
  theme_minimal() +
  theme(legend.position="bottom") +
  labs(title="", x="", y="stargazers_count (sum)") +
  theme(axis.text.x = element_text(angle = 0, vjust = 1))
# no `plot` argument: ggsave() saves the last plot created above
ggsave(filename = "../../plots/barplot-stars.pdf", dpi=300, units = "cm")
```

## Goal 3 - DS

```{r}
tab_name_ds <- "../../tables/tab-ds-rq3.csv"

boxplot_no_outlier(dataset$stargazers_count, stargazers_count, "stargazers_count")
boxplot_no_outlier(dataset$contributors_count, contributors_count, "contributors_count")
boxplot_no_outlier(dataset$truck_factor, truck_factor, "truck_factor")

# stargazers_count
ds_table <- data.frame(
  desc_stats(stargazers_count),
  test_normality(stargazers_count)
) %>%
  mutate(variable = "stargazers_count")
write.table(x = ds_table, file = tab_name_ds, append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

# contributors_count
ds_table <- data.frame(
  desc_stats(contributors_count),
  test_normality(contributors_count)
) %>%
  mutate(variable = "contributors_count")
write.table(x = ds_table, file = tab_name_ds, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

# truck factor
ds_table <- data.frame(
  desc_stats(truck_factor),
  test_normality(truck_factor)
) %>%
  mutate(variable = "truck_factor")
write.table(x = ds_table, file = tab_name_ds, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)
```

## Goal 3 - MW

```{r}
tab_name_mw <- "../../tables/tab-mw-rq3.csv"

mw_table <- data.frame(
  mw(dataset$stargazers_count),
  effect_size(dataset$stargazers_count)
) %>%
  mutate(variable = "stargazers_count")
write.table(x = mw_table, file = tab_name_mw, append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = TRUE)

mw_table <- data.frame(
  mw(dataset$contributors_count),
  effect_size(dataset$contributors_count)
) %>%
  mutate(variable = "contributors_count")
write.table(x = mw_table, file = tab_name_mw, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)

mw_table <- data.frame(
  mw(dataset$truck_factor),
  effect_size(dataset$truck_factor)
) %>%
  mutate(variable = "truck_factor")
write.table(x = mw_table, file = tab_name_mw, append = TRUE, sep = ",", fileEncoding = "UTF-8", row.names = FALSE, col.names = FALSE)
```

## RQ 9 - What is the truck/bus factor of these projects?

```{r}
# number of projects per truck factor value
dte <- dataset.e %>%
  group_by(truck_factor) %>%
  summarise(n_projects_e = n()) %>%
  ungroup()
dtf <- dataset.f %>%
  group_by(truck_factor) %>%
  summarise(n_projects_f = n())
dt <- merge(dtf, dte, by = "truck_factor", all = TRUE)

# median number of contributors per truck factor value
dce <- dataset.e %>%
  group_by(truck_factor) %>%
  summarise(contrib_e = median(contributors_count))
dcf <- dataset.f %>%
  group_by(truck_factor) %>%
  summarise(contrib_f = median(contributors_count))
dc <- merge(dcf, dce, by = "truck_factor", all = TRUE)

d <- merge(dt, dc, by = "truck_factor", all = TRUE)
write.table(x = d, file = "../../tables/tab-truckfactor.csv", append = FALSE, sep = ",", fileEncoding = "UTF-8", row.names = TRUE, col.names = TRUE)

dataset %>% filter(truck_factor > 8)
```

The vast majority of the game engines have a truck factor of 1 (`r ecdf(dataset.e$truck_factor)(1) * 100`%). The maximum truck factor is `r max(dataset.e$truck_factor)` (`r sum(dataset.e$truck_factor == max(dataset.e$truck_factor))` project(s)). For frameworks, `r ecdf(dataset.f$truck_factor)(1) * 100`% of the projects have a truck factor of 1. The maximum truck factor is `r max(dataset.f$truck_factor)` (`r sum(dataset.f$truck_factor == max(dataset.f$truck_factor))` project(s)).
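The inline figures in the paragraph above can be derived from the data rather than from hard-coded maxima. A minimal sketch, assuming a plain numeric vector of truck factors (the `tf` values below are invented for illustration):

```r
# Hypothetical truck factors for nine projects.
tf <- c(1, 1, 1, 2, 1, 3, 1, 2, 8)

# Share of projects with truck factor 1; since 1 is the smallest possible
# value here, this equals ecdf(tf)(1) * 100.
share_tf1 <- mean(tf == 1) * 100

# Largest truck factor and how many projects reach it.
max_tf   <- max(tf)
n_at_max <- sum(tf == max_tf)

# Full frequency table, one count per truck factor value.
tf_counts <- table(tf)

print(share_tf1)
print(c(max_tf, n_at_max))
print(tf_counts)
```

Counting with `sum(tf == max(tf))` keeps the sentence correct if the dataset changes, and it avoids reading the framework count from the engines data frame.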
```{r}
dataset.e %>%
  ggplot(aes(x = as.factor(truck_factor), y = contributors_count)) +
  geom_boxplot() +
  # scale_y_log10() +
  theme_light()

dataset.f %>%
  ggplot(aes(x = as.factor(truck_factor), y = contributors_count)) +
  geom_boxplot() +
  # scale_y_log10() +
  theme_light()
```

**Game Engines**

* Projects with truck factor = 1 have `r median(dataset.e[dataset.e$truck_factor == 1,]$contributors_count)` contributors (median value).
* Projects with truck factor = 2 have `r median(dataset.e[dataset.e$truck_factor == 2,]$contributors_count)` contributors (median value).

**Frameworks**

* Projects with truck factor = 1 have `r median(dataset.f[dataset.f$truck_factor == 1,]$contributors_count)` contributors (median value).
* Projects with truck factor = 2 have `r median(dataset.f[dataset.f$truck_factor == 2,]$contributors_count)` contributors (median value).

## How do developers contribute to the game engines? (TODO)

* M: correlate contributors with some other variable

```{r}
# developers_contributions = contributions_intervals %>%
#   group_by(project_name, author) %>%
#   summarise(
#     n_commits = n(),
#     first_commit = min(author_date),
#     last_commit = max(author_date)
#   ) %>%
#   ungroup

# projects_contributions = developers_contributions %>%
#   group_by(project_name) %>%
#   summarise(n_contributors = n()) %>%
#   ungroup %>%
#   inner_join(general_dataset, by = c("project_name" = "name")) %>%
#   mutate(
#     total_issues = closed_issues_count + open_issues_count,
#     total_prs = open_pulls_count + closed_pulls_count,
#     closed_issues_rate = closed_issues_count / total_issues,
#     closed_pulls_rate = closed_pulls_count / total_prs
#   )
```

```{r}
# projects_contributions %>%
#   ggplot(aes(x = "", y = n_contributors)) +
#   geom_violin(fill = "lightgrey") +
#   geom_boxplot(width = 0.1) +
#   scale_y_log10() +
#   coord_flip() +
#   theme_light()
```

```{r}
# projects_contributions %>%
#   select(total_issues, total_prs, closed_issues_rate, closed_pulls_rate, open_issues_count, closed_issues_count,
#          open_pulls_count, closed_pulls_count) %>%
#   summary
```

Overall, projects are well maintained. Half of the projects closed `r #quantile(na.omit(projects_contributions$closed_issues_rate))[3] * 100`% of their issues. `r #ecdf(projects_contributions$closed_pulls_rate)(0.999) * 100`% of the projects closed all pull requests.

## What is the community size in these projects? [postpone]

Postponed for now as we need more data (pull requests, issues, etc).

* M: pull requests
* M: issues

# Notes

## Empirical data

* A robust statistical analysis should consider a priori power analysis, which is widely supported by modern statistical tools.
* Researchers should consider not only related work but also reason explicitly about the strength of previous research (e.g., existing datasets).

## Power analysis

```{r}
# https://www.statmethods.net/stats/power.html
# engines and frameworks are independent groups, so the test is unpaired
wt <- wilcox.test(dataset.f$cc, dataset.e$cc, paired = FALSE, conf.int = TRUE, conf.level = 0.95)
et <- effsize::cliff.delta(dataset.f$cc, dataset.e$cc, conf.level=.95)
# NOTE: Cliff's delta is passed as the effect size h, and the framework
# sample size as n; the second call overwrites the first result
p.out <- pwr.2p.test(et$estimate, length(dataset.f$cc))
p.out <- pwr.p.test(et$estimate, length(dataset.f$cc))
plot(p.out)
```

# Descriptive Statistics

* Researchers should report on different types of descriptive statistics in their data, including:
    * central tendency, dispersion and shape.
* Such information can act as a first checkpoint of the statistical analysis.
* Plots and visualization of data should enhance and complement understanding of the descriptive statistics.
    * kernel density plots (emphasizing dispersion in terms of shape of the data),
    * scatterplots and beehive plots (emphasizing dispersion in terms of individual datapoints).
* Researchers should use distribution tests to support assumptions about the data (e.g., normality), instead of simple visual analysis of density plots and histograms.
    * We recommend Shapiro-Wilk or Anderson-Darling instead of the widely used Kolmogorov-Smirnov.

## What are the key properties of the data?
```{r}
ds <- psych::describe(dataset, omit = TRUE, trim = .1)
write.csv(ds, file = "../../tables/descriptive-statistics.csv")
```

## What does the data look like?

```{r}
str(dataset.e)

# kernel density plots (better than boxplots)
library("ggpubr")
ggdensity(data=dataset, x=c("cc","nloc"), facet.by="type")

# scatter plot
plot(dataset.e$nloc, dataset.e$len, pch=1)

# beehive plot
```

## Which assumptions are supported by the data?

```{r}
# Shapiro-Wilk
# http://www.sthda.com/english/wiki/normality-test-in-r
sha <- function(x) {
  # run the test once and reuse the p-value
  p <- shapiro.test(x)$p.value
  show(p)
  # p > 0.05: normality is not rejected at the 5% level
  if (p > 0.05) {
    return("normal distributed")
  } else {
    return("NOT normal distributed")
  }
}

apply(dataset.e %>%
        select(total_size, main_language_size, cc, nloc, len, lifespan, tags_releases_count, forks_count,
               watchers_count, stargazers_count, open_issues_count, closed_issues_count, commits_count,
               contributors_count, open_pulls_count, closed_pulls_count),
      2, sha)

# sanity check on a sample drawn from a normal distribution
data_n <- rnorm(310)
sha(data_n)
```

* The numeric data is NOT normally distributed!

# Statistical testing

* Researchers should understand the assumptions and constraints inherent to the different statistical tests in their toolkit.
* Researchers should perform distribution tests (e.g., normality tests) on their data before using parametric tests.
* Neglecting to check the assumptions of a test can lead to wrong results, hence nonparametric tests are preferred since they make fewer assumptions and place fewer constraints on the data.
* Researchers should inform themselves about alternatives to p-values and hypothesis testing (as currently being discussed and questioned within the area of statistics itself) and evolve their methods as a wider consensus is reached.

*I'd like to use the language variable as a random factor. We could use it with an LMM, but the data is non-normal, so we can't.
The easy way is to compare the variables one at a time.*

## Parametric or nonparametric models?

* To check for a difference between two treatment groups (type: engine and framework) in a dataset that is not normally distributed: *Mann-Whitney U*, also called the *Wilcoxon Rank Sum Test*.

### Mann-Whitney U

* Also known as Mann-Whitney-Wilcoxon, Wilcoxon-Mann-Whitney, and the Wilcoxon Rank Sum test.
* Formally, the null hypothesis is that the distribution functions of both populations are equal. The alternative hypothesis is that the distribution functions are not equal.
* Requirements:
    * Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
    * The response variable of interest is ordinal or continuous.
    * Both samples are random.
* Dependent response variables:
    * total_size, main_language_size, cc, nloc, len, lifespan, tags_releases_count, forks_count, watchers_count, stargazers_count, open_issues_count, closed_issues_count, commits_count, contributors_count, open_pulls_count, closed_pulls_count.
* Categorical independent variable (treatments):
    * type ("engine" or "framework")
* It has lower power than the independent samples t-test when the assumptions of the t-test hold.
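The test described above can be sketched in base R. This is an illustration under stated assumptions, not the notebook's own `mw()` or `effect_size()` helpers: the two samples are simulated, and Cliff's delta is computed from its definition instead of through the `effsize` package.

```r
# Simulated, hypothetical response values for two independent groups.
set.seed(42)
engine    <- rlnorm(40, meanlog = 5,   sdlog = 1)
framework <- rlnorm(50, meanlog = 5.5, sdlog = 1)

# Two-sided Mann-Whitney U (Wilcoxon rank-sum) test for independent samples.
mw_result <- wilcox.test(engine, framework, paired = FALSE,
                         alternative = "two.sided")

# Cliff's delta from its definition:
# (pairs with engine > framework - pairs with engine < framework) / all pairs.
diffs <- outer(engine, framework, "-")
cliff_delta <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)

print(mw_result$p.value)
print(cliff_delta)
```

Without ties, the statistic `W` reported by `wilcox.test` is the number of pairs in which the first sample exceeds the second, so the effect size and the test are two views of the same pairwise comparison.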