Published May 17, 2025 | Version v2
Dataset Open

Artifact of the paper "An Empirical Investigation on the Challenges in Scientific Workflow Systems Development"

  • 1. University of Saskatchewan

Description

Scientific Workflow Systems (SWSs) play a critical role in the contemporary scientific landscape. By augmenting productivity and fostering collaboration, SWSs elevate the standard of scholarly inquiry and strengthen its pillars of reproducibility and ethical adherence. Essentially, they serve as the bedrock upon which efficient, transparent, and impactful research is built, propelling knowledge and innovation across diverse fields. SWSs automate mundane yet essential tasks intrinsic to scientific inquiry, ranging from data acquisition to analysis and reporting. By freeing researchers from manual labor, SWSs enable them to channel their energies toward more intellectually demanding pursuits, thereby enhancing the pace and quality of research outcomes. Moreover, SWSs help standardize workflows across research cohorts, bringing uniformity to experimental methodologies and data-handling practices. This standardization not only cultivates a culture of rigor and coherence but also fosters cross-disciplinary dialogue and collaboration.

Integral to the operation of SWSs is their capacity to integrate diverse tools, software, and data sources, effectively functioning as centralized hubs for research management. This integration expedites the research process and facilitates seamless data exchange and interoperability—a pivotal asset in an era characterized by the deluge of data and the imperative of interdisciplinary collaboration. Furthermore, SWSs afford researchers and project managers real-time insights into the progress of research endeavors, empowering them to identify bottlenecks, allocate resources judiciously, and optimize workflow execution. This granular oversight enhances project transparency and accountability and serves as a catalyst for informed decision-making.

Crucially, SWSs are engineered to accommodate the complexities inherent in scientific inquiry, adeptly handling vast volumes of data and supporting parallel processing to meet the evolving demands of research projects. This scalability underscores their adaptability to diverse research paradigms, ensuring their relevance across a spectrum of scientific disciplines. Facilitating collaboration across geographic and temporal divides, SWSs offer a suite of collaborative features—including version control, shared workspaces, and communication tools—that transcend the constraints of physical proximity. By fostering a culture of inclusivity and knowledge exchange, SWSs catalyze innovation and synergy among distributed research teams.

Moreover, SWSs serve as custodians of reproducibility, meticulously documenting each facet of the research workflow—from data sources to analysis methods—thus safeguarding the integrity of scientific inquiry. This commitment to transparency and methodological rigor underpins the credibility of research findings, engendering trust within the scientific community and beyond. The customizable nature of SWSs empowers research teams to tailor their workflows to suit their unique needs and preferences, further amplifying their utility and versatility. In essence, SWSs emerge not merely as tools of convenience but as indispensable allies in the relentless pursuit of scientific excellence.

Numerous developers actively participate in the advancement of SWSs through diverse roles, including designing system architectures to ensure flexibility and performance, developing algorithms for data processing and analysis, crafting user-friendly interfaces, handling backend logic, integrating with external tools, and ensuring quality, security, and compliance. They address challenges such as optimizing performance and scalability by leveraging parallel processing and distributed computing techniques. In carrying out these diverse tasks, developers encounter numerous challenges and often turn to crowd-sourced platforms like Stack Overflow and GitHub to discuss and address them. Stack Overflow serves as a vital resource for developers to seek solutions, learn new technologies, validate best practices, and engage with the programming community. Similarly, GitHub facilitates collaborative development by allowing developers to report problems, propose enhancements, and contribute to open-source projects. Our research draws insights from Stack Overflow discussions, GitHub issues, and pull request reports related to SWSs, reflecting the dynamic and collaborative nature of software development in this domain.

Notes

scientific-workflow-systems-list.xlsx contains the full list of SWSs. Its "Used SWSs For the analysis" sheet lists the projects we selected for our analysis.

Collected Stack Overflow Data.zip contains the data we extracted from Stack Overflow. After unzipping, you will find a folder, Popular SWfMS Filtered Data, which stores the data for our selected projects. On inspection, we found many posts that were unrelated to SWSs. We filtered those posts out and stored them in the Popular SWfMS Filtered Data/Unrelated data folder for verification. AllpostsforSelectedSWS.csv contains the merged Stack Overflow data used in our analysis.
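As a starting point for inspecting the merged file, the following minimal sketch locates a CSV inside the archive by its file name and reads a few rows with the Python standard library. The helper names `find_member` and `read_rows` are hypothetical, the exact location of the CSV inside the archive is not assumed (the archive is searched instead), and the UTF-8 encoding is an assumption that may need adjusting.

```python
import csv
import io
import zipfile


def find_member(archive_path, basename):
    """Return the first archive member whose file name equals `basename`."""
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Zip member names always use "/" as the separator.
            if name.rsplit("/", 1)[-1] == basename:
                return name
    raise FileNotFoundError(f"{basename} not found in {archive_path}")


def read_rows(archive_path, basename, limit=None):
    """Read CSV rows from the archive as dicts keyed by the header row.

    Assumption: the CSV is UTF-8 encoded (a leading BOM is tolerated).
    """
    member = find_member(archive_path, basename)
    rows = []
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            for row in csv.DictReader(text):
                if limit is not None and len(rows) >= limit:
                    break
                rows.append(row)
    return rows
```

For example, `read_rows("Collected Stack Overflow Data.zip", "AllpostsforSelectedSWS.csv", limit=5)` would return the first five records without extracting the archive to disk. The column names are whatever the header row of the file provides.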

The GitHub Data.zip archive contains three main folders. The SelectedGitHubData folder includes the raw data we initially downloaded. The PreprocessedGitHubData folder contains the data after cleaning and preprocessing. Finally, the CombinedGitHubData folder holds the merged dataset that was used for our analysis.

SO Data Other Fileds.zip contains data for other software engineering domains (i.e., mobile, security, webapp, chatbot). These datasets are used to compare SWSs with other software engineering domains. As there are more than 1 million mobile development posts, we share only the IDs of the posts. Interested readers can look up the posts using these IDs.
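Where only post IDs are shared, one simple way to look a post up is to build its Stack Overflow URL from the ID. The sketch below assumes the IDs are question IDs and that they arrive as plain strings, one per entry. The function name `post_urls` is hypothetical, and the layout of the ID files in the archive is not assumed.

```python
def post_urls(raw_ids):
    """Turn raw ID strings into Stack Overflow question URLs.

    Blank entries and non-numeric entries (for example a header such as "Id")
    are skipped. Assumption: every ID refers to a question.
    """
    urls = []
    for raw in raw_ids:
        token = raw.strip()
        if token.isascii() and token.isdigit():
            urls.append(f"https://stackoverflow.com/questions/{token}")
    return urls
```

Calling `post_urls(line for line in open(path))` on a file with one ID per line would yield one URL per post, where `path` points at whichever ID file is of interest.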

Scripts.zip contains the scripts for downloading the issues and pull requests, preprocessing the data, and running topic modeling. You may need to change the directory paths in each script before running it.

Final_Topics.zip contains the topics (RQ1) generated by the BERTopic modeling algorithm. We obtained 10 topics for the SO data and 13 topics for the GitHub data.

RQ2 Types Analysis.zip contains the results of the question type analysis (How, Why, What, and Other) for RQ2. For each topic, we selected a statistically significant sample to identify the types.
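The notes do not state how the per-topic sample was sized. One common way to obtain a statistically significant sample from a finite set of posts is Cochran's formula with a finite population correction, sketched below. The 95% confidence level (z = 1.96), the 5% margin of error, the conservative proportion p = 0.5, and the function name `sample_size` are all assumptions for illustration, not values taken from the paper.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size from Cochran's formula with finite population correction.

    n0 = z^2 * p * (1 - p) / margin^2
    n  = n0 / (1 + (n0 - 1) / population), rounded up
    """
    if population <= 0:
        raise ValueError("population must be positive")
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    # A sample can never be larger than the population itself.
    return min(population, math.ceil(n))
```

Applied per topic, `sample_size(number_of_posts_in_topic)` gives the number of posts to draw at random for manual labeling under these assumed settings.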

Files

Collected Stack Overflow Data.zip

Files (436.6 MB)

Size      MD5 checksum
61.0 MB   md5:c476bd4cab8c39f3a02a11812f14feda
83.7 MB   md5:650be96e505509d9bac719c9e6d2455d
158.8 MB  md5:2247268e02d2c1e8bf32af9034ebd1aa
2.1 MB    md5:089e3ee1d86d9a8dd39c175ad36a559c
62.9 kB   md5:eb5f736ad8e11b759520bcd643fbbcc5
254.1 kB  md5:f26b836c36d4f7c97d85c9d3d12a1e7a
130.8 MB  md5:c7b39090aa11d7981ccc89285508c06e
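
The md5: values listed above can be used to confirm that a download completed without corruption. The following minimal sketch computes a file's MD5 digest with Python's standard `hashlib` module and compares it with a listed value. The helper names `md5_of_file` and `matches_listing` are hypothetical, and which checksum belongs to which file has to be read from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's digest with a listed value such as 'md5:c476...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A result of `False` from `matches_listing` suggests the file should be downloaded again before use.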

Additional details

Funding

Natural Sciences and Engineering Research Council