Research Data Management and Data Stewardship Competences in University Curriculum

Skills for data governance and management are critical for the wide adoption of Open Science practices and for the effective use of data in research, industry, business and other economic sectors. The FAIR (Findable, Accessible, Interoperable, Reusable) data management principles and data stewardship provide a foundation for effective research data management. The 2018 “Turning FAIR into Reality” report and other documents recommend that data skills should be more widely included in university curricula and that a concerted effort should be made to coordinate and accelerate the pedagogy for professional data roles. Throughout Europe and beyond, many organisations, projects and initiatives provide training on FAIR data competences. However, wider adoption of the FAIR data culture can be achieved by including FAIR competences in university curricula. This paper presents the ongoing work of the FAIRsFAIR project to develop a Data Stewardship competence framework and to provide recommendations for implementing this framework in university curricula by defining the Data Stewardship Body of Knowledge and Model Curricula. The proposed approach and the identified competences and knowledge topics are supported by a job market analysis. The presented work uses the EDISON Data Science Framework as a basis for the definition of Data Stewardship competences and as a methodology for linking competences, skills, knowledge, and intended learning outcomes when designing curricula.


I. INTRODUCTION
The growing importance of data in the modern data driven economy, research and industry, requires special attention to including data management and governance related topics in university education. The future specialists should understand the role and value of data in research and industry and be able to derive actionable value from data collected from research, technological process or business/social activity, and be able to use open data and public data. Modern data-driven research and industry require new types of specialists capable of supporting all stages of the data lifecycle from data production to data processing and actionable results delivery, visualisation and reporting, which can be jointly defined as the Data Science professions family [1,2].
Data Management and Data Analytics are critical aspects of digital transformation; however, this transformation requires a change of the whole organisational culture, which is often referred to as data literacy. The research community has responded to this with the formulation of the FAIR data principles, which suggest that data must be Findable, Accessible, Interoperable, and Reusable [3].
The education and training of Data Stewards should not be limited to general data management or the FAIR data principles. The presented research identified a number of competences, skills, and knowledge areas covering technology and data management that are required of Data Stewards for successful work in their future organisation. Besides data-related competences and knowledge, Data Stewards are required to have an understanding of project management and organisational processes (research or business, depending on the organisation). At the present time, most existing university curricula and training programs cover a limited set of competences and knowledge and fall short of what is required for the multiple Data Science and Data Stewardship professional profiles and organisational roles within research and industry. In conditions of continuous technology development and shortened technology change cycles, Data Science and Data Stewardship education requires an effective combination of theoretical, practical and workplace skills.
Recent European initiatives and projects such as the European Open Science Cloud (EOSC) [4] and the Research Data Alliance (RDA) [5] facilitate implementation of the FAIR (Findable, Accessible, Interoperable, Reusable) data principles [6,7]. They aim to allow for a more effective data exchange and integration across scientific domains, making scientific data a valuable resource and a growth factor for the whole digital economy and society.
The proposed work has been done in the framework of the EU-funded FAIRsFAIR project [8] that recognises the importance of establishing the new profession of the Data Steward and introducing FAIR principles and culture at the early stage of professional education and careers by including FAIR principles into university curricula. The FAIR competences and corresponding Knowledge Areas can be introduced as a special course and/or a part of other courses typically taught at universities such as Research Methods, Research Data Management, or Professional Issues [9]. Research Data Management and FAIR principles are currently attributed to the emerging profession of the Data Steward.
The proposed Data Stewardship Professional Competence Framework (CF-DSP) is based on the EDISON Data Science Framework (EDSF) [2] and defines the main competences required of Data Stewards in their work in different organisations. CF-DSP is complemented by the DSP Body of Knowledge (DSP-BoK), which is defined as a subset of the Data Science Body of Knowledge. This allows reusing the whole EDSF toolkit developed for customised curriculum design [10]. The paper refers to the authors' previous works on defining the EDISON Data Science Framework (EDSF) [10] and on its application to individual competences management and to customised curricula design based on required competences and intended learning outcomes, which can be targeted at specific professional profiles including Data Stewards [12].
The paper is organized as follows. Section II provides a reference to European and international initiatives related to research data management and growing demand for the Data Stewardship profession. Section III summarises the job market analysis for Data Steward and Data Management vacancies to identify demanded competences, skills and knowledge. Section IV provides an overview of existing frameworks defining Data Stewardship and related competences, including EDSF. Section V discusses the proposed definition of the Data Stewardship Professional Competence Framework (CF-DSP) as an extension to EDSF. Section VI provides suggestions about new knowledge topics to be included in the DSP Body of Knowledge. A conclusion in section VII provides a summary and refers to ongoing and future developments.

II. RESEARCH DATA MANAGEMENT AND DATA STEWARDSHIP
The importance of data and research information sharing has been central to a number of European-wide initiatives and projects addressing Open Access, Open Data, and Open Science in general. The Research Data Alliance (RDA), created in 2012 jointly by the US National Science Foundation (NSF) and the European Commission, became a key community coordination body for exchanging and developing best practices in research data management.
To facilitate research data sharing and implementation of the FAIR principles, the European Commission introduced the Open Research Data (ORD) Pilot [13]. EU-funded projects by default are required to develop and implement the Data Management Plan (DMP) at the initial stage of the project [14]. Data produced in projects must be made as 'open as possible' and deposited in data repositories (operated by the project or using national or European data archive services). Metadata must be published, and quality of data ensured, in particular through compliance with the FAIR principles. The DMP template provided by the Commission is structured to ensure that the data produced by funded research are open and FAIR [15].
The FAIR data principles and Data Stewardship are among the key objectives of the European Open Science Cloud (EOSC) initiative, started in 2016 as part of the "European Cloud Initiative - Building a competitive data and knowledge economy in Europe" [16], which aimed to capitalise on the data revolution. Under this initiative, EOSC federates existing and emerging e-Infrastructures to provide European science, industry, and public authorities with world-class data infrastructure connected to high performance computers (HPC).
The goals of EOSC are to enable the Open Science Commons [17] and to achieve FAIRness in research data management and in the services provided. At the present time, the EOSC projects have created the foundation for research data interoperability and integration for European IRs. The Minimum Viable EOSC (MVE), to be achieved by the end of 2020, will create a starting point for future EOSC development [18].

A. Job Market Analysis
The presented study and the proposed definition of Data Stewardship competences and skills are based on data collected from job advertisements on popular job search and employment portals such as indeed.com, the IEEE Jobs portal and LinkedIn Jobs. These advertisements provided rich information for defining Data Stewardship competences, skills and required knowledge of data management, Big Data and data analytics tools and software. indeed.com provides a rich selection of job vacancies by country, both for companies and universities. LinkedIn posts vacancies related to the region or country from which the request originates, and many job ads are posted in the national language. For this study, job advertisements were collected for positions available in the Netherlands, the UK and Germany in Europe, and in the USA.

B. EDISON methodology to collect and analyse job market and competence related data [1,2]
To verify existing frameworks and potentially identify new competences, different sources of information have been investigated:
• Job advertisements, which represent the demand side for Data Stewards and data management specialists and are based on the practical tasks and functions that organisations identify for specific positions. This source of information provided factual data to define demanded competences and skills.
• Structured presentations of Data Steward related competences and skills produced by different studies as mentioned above, in particular the EDSF definition of Data Science and Data Stewardship, which identifies the following groups of competences: Data Analytics, Data Science Engineering, Data Management, Research Data, and Domain expertise. This information was correlated with the information obtained from job advertisements.
• Blog articles and community forum discussions, which represent valuable community opinion. This information was particularly important for defining practical skills and required tools.
The following approach has been used when analysing the job advertisement data:
1) Collect data on required competences and skills.
2) Extract information related to competences, skills, knowledge, qualification level, and education; translate and/or reformulate if necessary.
3) Split the extracted information by initial classification or taxonomy facets, first of all by required competences, skills, and knowledge; suggest a mapping if necessary.
4) Apply an existing taxonomy or classification: for the purpose of this study, we used the skills and knowledge groups as defined by the EDSF definition of Data Science and Data Stewardship (i.e. Data Analytics, Data Engineering, Data Management, Research Methods, Domain Knowledge).
5) Identify competences and skills groups that do not fit into the initial/existing taxonomy and create new competences and skills groups.
6) Cluster and aggregate individual records/samples in each identified group.
7) Verify the proposed competence groups definition by applying it to the originally collected data and to new data.
8) Validate the proposed competence framework via community surveys and individual interviews.
The process outlined above has been applied to the collected information, and all steps are tracked in the two Excel workbooks provided as supplementary material, which is available at the FAIRsFAIR project shared storage.
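The classification and aggregation steps (3 to 6) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in the study, which was tracked in Excel workbooks. The five group names come from step 4; the keyword lists, the sample phrases, and the function names `classify` and `aggregate` are hypothetical.

```python
from collections import Counter, defaultdict

# Group names follow the EDSF groups listed in step 4. The keywords are
# illustrative assumptions and are not part of EDSF.
GROUP_KEYWORDS = {
    "Data Analytics": ["statistic", "machine learning", "analytics"],
    "Data Engineering": ["database", "cloud", "software"],
    "Data Management": ["metadata", "data quality", "repository", "fair"],
    "Research Methods": ["research method", "project management"],
    "Domain Knowledge": ["domain", "business process"],
}
UNCLASSIFIED = "Unclassified"


def classify(phrase):
    """Return the first group whose keyword occurs in the phrase (steps 3-4)."""
    text = phrase.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return group
    return UNCLASSIFIED  # candidates for new groups (step 5)


def aggregate(phrases):
    """Cluster phrases by group and count the records per group (step 6)."""
    clusters = defaultdict(list)
    for phrase in phrases:
        clusters[classify(phrase)].append(phrase)
    counts = Counter({group: len(items) for group, items in clusters.items()})
    return clusters, counts


# Hypothetical phrases of the kind that might be extracted in step 2.
sample = [
    "Maintain metadata standards and data quality rules",
    "Experience with SQL database administration",
    "Knowledge of FAIR data principles",
    "Excellent communication with stakeholders",
]
clusters, counts = aggregate(sample)
for group, n in counts.most_common():
    print(group, n)
```

Phrases that match no keyword end up in the `Unclassified` cluster, which corresponds to the material reviewed in step 5 when deciding whether new competence or skills groups are needed. Simple substring matching is a deliberate simplification; a real analysis would rely on manual review or a richer vocabulary.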

C. Demanded Data Stewardship competences
A preliminary analysis has been done based on data collected from the job advertisements on the job search and employment portals mentioned above. This provided a sufficient number of vacancies for a conclusive analysis. The collected data were used to extract information on the competences, skills and knowledge demanded from prospective Data Steward candidates (following the EDSF methodology as explained above). Figure 1 illustrates which competences, skills and knowledge topics have been extracted from the collected vacancies data and their mapping to the CF-DS competence groups. The analysis suggests that (1) EDSF and CF-DS are suitable for defining the Data Stewardship Professional Competence Framework as a special CF-DS profile; (2) certain extensions may be needed for the general CF-DS, as described below. Figure 1 illustrates the variety of knowledge that is expected from Data Stewards, covering all main competence and knowledge groups defined for Data Scientists: DSENG, DSDM, DSRMP, DSDK/DS Biz, with less demand for Data Analytics, which will remain an area for Data Scientists. This knowledge profile can be explained by the expected role of the Data Steward as coordinating multiple cross-organisational (horizontal) activities to ensure data quality and data management infrastructure operation, and to promote best practices in data management, in particular the FAIR data principles.

D. The outcome of the job vacancies analysis
The following conclusions and assumptions can be made from the initial vacancies analysis:
• The published Data Steward vacancies demonstrate the variety of competences, skills and knowledge required from the candidates.
• The extracted competences can be successfully mapped to the competence groups defined for the Data Science professional family, which includes Data Stewards.
• The presented analysis confirms the applicability of EDSF to the analysis and further structured development of the intended FAIR4HE framework.
The most populated competence group is Data Management, which reflects the nature of the Data Steward profession and responsibilities. Two other populated groups are Domain Knowledge and Data Science Engineering, which reflects another side of the Data Steward profession: acting as a bridge between ICT teams operating data facilities and domain specialists. This imposes the need for related knowledge at a level sufficient for coordination and communication. This fact is clearly reflected in the distribution of required knowledge topics.
The collected/extracted set of competences, skills and knowledge topics will be used for detailed competences analysis and mapping to current definitions and vocabulary in EDSF and necessary updates and extensions/additions will be suggested. This information is presented in the next section.

A. EOSCpilot FAIR4S Framework
The EOSCpilot project defines data stewardship as a shared responsibility of professional groups involved in data management: data management and curation, data science and analytics, data services engineering, and domain research [20]. The EOSCpilot deliverable "D7.5: Strategy for Sustainable Development of Skills and Capabilities" [20] describes the comprehensive 'FAIR4S' framework, which defines 6 skill profiles grouped around the research data lifecycle stages and 4 professional groups (researchers, data scientists, data advisors, and data services providers) involved in different aspects of data management, data curation and related services provisioning. FAIR4S is primarily focused on the EOSC services as they were defined at the stage of the EOSCpilot project (2017-2019).
A total of 31 individual competences and capabilities are defined in FAIR4S. They are grouped into six groups around typical processes and stages in the research data lifecycle:
• Plan and design: Plan stewardship and sharing of FAIR outputs
• Capture and process: Reuse data from existing sources
• Integrate and analyse: Use or develop FAIR research tools/services
• Apprise and present: Prepare and document data/code to make outputs FAIR
• Publish and release: Publish FAIR outputs on recommended repositories
• Expose and discover: Recognise, cite and acknowledge contributions
The FAIR4S framework defines 2 templates, one for describing Skills profiles and one for Role profiles. The Skills profile template includes knowledge, skills and attitude (which can also be treated as aptitude) for 3 levels of proficiency: Basic, Intermediate, Expert. The template also includes a list of professional groups and roles to which the competence group applies. The Role profile includes the list of suggested skills, an explanation of their importance and suggestions on where these skills can be learned.
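The structure of the Skills profile template described above can be sketched as a small Python data model. This is a minimal sketch under stated assumptions and not the FAIR4S template itself: the class and field names (`SkillsProfile`, `LevelDescriptor`, `applies_to`, `is_complete`) and the sample entry are hypothetical; only the three facets (knowledge, skills, attitude), the three proficiency levels, and the group name are taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The three proficiency levels named in the text."""
    BASIC = "Basic"
    INTERMEDIATE = "Intermediate"
    EXPERT = "Expert"


@dataclass
class LevelDescriptor:
    """Knowledge, skills and attitude statements for one proficiency level."""
    knowledge: list = field(default_factory=list)
    skills: list = field(default_factory=list)
    attitude: list = field(default_factory=list)


@dataclass
class SkillsProfile:
    """Hypothetical container for one skills profile."""
    group: str        # one of the six lifecycle groups
    applies_to: list  # professional groups and roles the profile applies to
    levels: dict = field(
        default_factory=lambda: {level: LevelDescriptor() for level in Level}
    )

    def is_complete(self):
        """True when every level has at least one entry in each facet."""
        return all(
            bool(d.knowledge) and bool(d.skills) and bool(d.attitude)
            for d in self.levels.values()
        )


# Hypothetical usage: a partially filled profile is reported as incomplete.
profile = SkillsProfile(
    group="Plan and design",
    applies_to=["researchers", "data advisors"],
)
profile.levels[Level.BASIC].knowledge.append("Purpose of a data management plan")
print(profile.is_complete())
```

Keeping the three facets separate per level makes it straightforward to check which parts of a profile are still undefined before it is used for training design.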

B. ELIXIR Data Stewardship Competency Framework for Life Sciences (DSP4LS)
The ELIXIR Data Stewardship Competency Framework for Life Sciences [21] (hereafter referred to as DSP4LS, Data Steward Professional for Life Sciences) is a complete framework that defines the competencies, skills and knowledge related to Data Stewardship as a distinct profession in the modern data driven science ecosystem, and in life sciences in particular. The DSP4LS framework allows translating the organisational responsibilities and tasks of Data Stewards, together with the required knowledge, skills and abilities, into practical learning objectives that provide a basis for developing tailored training. In this way, the framework provides a strong foundation for professionalizing Data Stewardship.
The DSP4LS starts from defining the Data Steward Roles and Competence Profiles in the following 3 areas:
• Policy: institute and policy focused
• Research: project and research focused
• Infrastructure: data handling and e-infrastructure focused
For all Data Steward roles, 8 competence areas are defined: Policy/strategy; Compliance; Alignment with FAIR data principles; Services; Infrastructure; Knowledge management; Network; Data archiving. In the extended definition, for each competence the following attributes are defined:
• Activities and tasks (in the organisational context)

The Data Stewardship competences are defined in 6 competence groups comprising 22 competences: Open Science, Data Collection and Data Processing, Data publishing and data preservation, and competences related to research data lifecycle phases: Planning phase, Active research phase, and Dissemination/publication phase. The report defined four roles for Data Stewards: Administrator; Analyst; Developer; Agent of change.
The report proposed three modes for Data Stewards education (based on the prospective student/learner background and entry level):
• Student with Bachelor degree
• Student with PhD and equivalent
• Continuing and professional education

D. GO FAIR Metadata Management Requirements and FAIR Data Maturity Model
The GO FAIR initiative [23], which is devoted to promoting the sustainable adoption of the FAIR data principles, has provided recommendations on FAIR metadata management. These recommendations can be used to link the general requirements for FAIR implementation with the underlying technology and infrastructure and, consequently, to define technical expertise areas. It is important to note that these requirements call for both advanced data management infrastructure tools and corresponding competences from Data Engineers and Data Stewards.
The FAIR Data Maturity Model [24], which was developed and is maintained by the RDA community [25], provides a set of compliance indicators to assess the level of implementation of the FAIR principles. It can be used for defining policy, research and infrastructure related competences in Data Stewardship and data management.

E. DAMA-DMBOK: Data Governance and Stewardship
The Data Management Body of Knowledge (DMBOK) Framework by Data Management Association International (DAMAI) is an industry standard summarizing best practices in Data Management [26]. It is a valuable document that provides a basis for setting up organisational policy and structure to ensure consistent data management and governance. The DMBOK is directly used for training and certification of several data management and governance professions and roles. It goes into depth about the Knowledge Areas that make up the overall scope of data management.
The DMBOK defines 11 main Knowledge Areas and several additional areas related to the technologies used. Each Knowledge Area is provided with a detailed context diagram that includes: Definition, Goals, Inputs, Activities, Deliverables, Suppliers, Participants, Consumers, Tools, Techniques and Metrics. These diagrams can be used as direct guidance for organisations setting up their data management and governance structure.

The Data Governance and Stewardship Knowledge Area is central to the whole DMBOK. The DMBOK also explains the relation between Data Governance and Data Management, where Data Governance is focused on Legal and Judicial views ("Do right things") and Data Management deals with Executive issues ("Do things right"). This also defines staffing of the Data
• Data management requirements must drive Information Technology decisions
• Data management is cross-functional; it requires a range of skills and expertise
• Data management requires an enterprise perspective
• Data management must account for a range of perspectives
• Data management is lifecycle management
• Different types of data have different lifecycle characteristics
• Managing data includes managing the risks associated with data
• Effective data management requires leadership commitment
The Data Steward definition and organisational roles include the following responsibilities and activities:
• Creating and managing core Metadata: Definition and management of business terminology, valid data values, and other critical Metadata.
• Documenting rules and standards: Definition/documentation of business rules, data standards, and data quality rules.
  ◦ High quality data are often formulated in terms of rules rooted in the business processes that create or consume data. Stewards help surface these rules and ensure their consistent use.
• Managing data quality issues: Stewards are often involved with the identification and resolution of data related issues or in facilitating the process of resolution.
• Executing operational data governance activities: Stewards are responsible for ensuring that, day-to-day and project-by-project, data governance policies and initiatives are adhered to. They should influence decisions to ensure that data is managed in ways that support the overall goals of the organization.
To stress the uniqueness of the Data Stewardship competences, the DMBOK reads: "Best Data Steward is not made but found" [26].

V. EDISON DATA SCIENCE FRAMEWORK (EDSF)
The EDISON Data Science Framework [2] provides a basis for the definition of the Data Science profession and enables the definition of other components related to Data Science education, training, organisational roles definition and skills management. EDSF provides a common semantic basis for interoperability of the different forms of Data Science curriculum definition and education or training delivery, as well as for knowledge assessment and professional certification based on the fully enumerated definition of EDSF components and individual units. Figure 2 illustrates the main EDSF components and their inter-relations. The CF-DS provides the overall basis for the whole framework. The CF-DS includes the core competences required for the successful work of a Data Scientist in different work environments in industry and in research, and throughout the whole career path.

A. EDSF Components
The following core CF-DS competence and skills groups have been identified (refer to CF-DS specification [2] for details): Knowledge and Expertise (Subject/Scientific domain related)
Data Science competences must be supported by knowledge, which is acquired primarily through education and training, and by skills, which are acquired through work experience. The CF-DS defines two types of skills (refer to CF-DS [2] for the full definition of the identified knowledge and skills groups):
• Skills Type A, which are related to professional experience and major competences, and
• Skills Type B, which are related to a wide range of practical computational skills, including the use of programming languages, development environments and cloud based platforms.
The DS-BoK defines the Knowledge Areas (KA) for building Data Science curricula that are required to support the identified Data Science competences. DS-BoK is organised by Knowledge Area Groups (KAG) that correspond to the CF-DS competence groups. DS-BoK is based on ACM/IEEE Classification Computer Science (CCS2012) [27], incorporates best practices in defining domain specific BoKs and provides references to existing related BoKs. It also includes proposed new KAs to incorporate new technologies and scientific subjects required for consistent Data Science education and training.
The MC-DS is built based on DS-BoK and linked to CF-DS where Learning Outcomes are defined based on CF-DS competences (specifically skills type A), and Learning Units are mapped to Knowledge Units in DS-BoK. Three mastery (or proficiency) levels are defined for each Learning Outcome to allow for flexible curricula development and profiling for different Data Science professional profiles. The proposed Learning Outcomes are enumerated to have a direct mapping to the enumerated competences in CF-DS. The practical curriculum should be supported by a corresponding educational environment for hands-on labs and educational projects development. Figure 3 below illustrates relations between Competence framework components: competence, skills, knowledge, attitude, and academic domain, including Body of Knowledge, Model Curriculum, Learning Outcomes, and educational profiles.
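The mapping between enumerated competences, Learning Outcomes and mastery levels can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the MC-DS definition: the identifiers (`DSDM01`, `DSENG02`, `LO-DSDM01-1`), the outcome statements, the numeric encoding of the three mastery levels, and the rule that a higher target level includes the lower-level outcomes are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningOutcome:
    lo_id: str          # enumerated so that it maps directly to a competence
    competence_id: str  # enumerated competence the outcome is derived from
    mastery: int        # 1, 2 or 3 (assumed encoding of the three levels)
    statement: str


def outcomes_for_profile(outcomes, required):
    """Select the outcomes needed for a professional profile.

    `required` maps a competence id to the target mastery level (1..3).
    Assumption: reaching a level includes all outcomes of the lower levels.
    """
    selected = []
    for outcome in outcomes:
        target = required.get(outcome.competence_id)
        if target is not None and outcome.mastery <= target:
            selected.append(outcome)
    return sorted(selected, key=lambda o: (o.competence_id, o.mastery))


# Hypothetical catalogue with placeholder identifiers and statements.
catalogue = [
    LearningOutcome("LO-DSDM01-1", "DSDM01", 1, "Describe a data management plan"),
    LearningOutcome("LO-DSDM01-2", "DSDM01", 2, "Develop a data management plan"),
    LearningOutcome("LO-DSDM01-3", "DSDM01", 3, "Evaluate data management plans"),
    LearningOutcome("LO-DSENG02-1", "DSENG02", 1, "Explain data storage options"),
]

# A hypothetical profile that requires one competence at mastery level 2.
profile_requirements = {"DSDM01": 2}
for outcome in outcomes_for_profile(catalogue, profile_requirements):
    print(outcome.lo_id, outcome.statement)
```

Because each Learning Outcome carries the identifier of its competence, a curriculum for a given professional profile can be assembled by filtering the catalogue against the competences and levels that the profile requires.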

B. DSPP and Data Steward Professional Definition
The DSPP is defined as an extension to the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy [28] using the ESCO top classification groups. The DSPP definition provides an important instrument for defining effective organisational structures and roles related to Data Science positions. It can also be used for building an individual career path and for supporting the transferability of competences and skills between organisations and sectors.
Recognising the importance of the Data Steward role in a typical research institution, the DSPP provides the initial definition of the Data Steward professional profile: the Data Steward is a data handling and management professional whose responsibilities include planning, implementing and managing (research) data input, storage, search, and presentation. The Data Steward creates a data model for domain specific data, and supports and advises domain scientists/researchers during the whole research cycle and data management lifecycle.

VI. CF-DSP: DATA STEWARDSHIP AND FAIR COMPETENCES AS EXTENSION FOR EDSF CF-DS
Based on the above analysis and community discussions organised through the FAIRsFAIR project, the following section provides a short summary of the additional competences that are proposed to be added to the current Data Science Competence Framework (CF-DS/EDSF) competence groups (only the additional full or partial definition is provided).
The Data Management competence group is the most important one for the Data Steward and FAIR principles implementation. Therefore, most extensions are proposed for this group. Specific extensions are suggested for the Data Science Engineering competence group and to Domain related competences to ensure linkage and interaction with different organisational units. In this case, a Data Steward would play the role of liaison, coordinator and communicator to ensure the FAIR principles are implemented and sustained.
The following are suggested extensions for the Data Management competence group (refer to the EDSF [2] and Deliverable D7.3 [19] for full list and details).
masters as well as in the professional training programs [30,31,32].

IX. CONCLUSION AND FURTHER DEVELOPMENTS
The presented work on the CF-DSP definition has benefitted from contributions from a number of communities such as the EOSC Skills and Training Working Group, the Research Data Alliance (RDA) Interest Groups on Data Stewardship Professionalisation, GO FAIR and the Dutch e-Science Center, and the FAIRsFAIR project. The progress of this work has been discussed at a number of workshops and events organised by FAIRsFAIR in cooperation with RDA and CODATA. See FAIRsFAIR deliverable D7.3 for details and an overview [19].
Further steps in the CF-DSP definition will address general workplace skills, also referred to as "soft" skills or professional attitude/aptitude skills, which are becoming increasingly important in the modern and future data driven economy. This would include two groups of skills that are increasingly demanded by employers: Data Stewardship professional skills, and general 21st Century skills, which include critical thinking, design thinking, communication, collaboration, organizational awareness, ethics, and others. This will be done by assessing whether the similar skill groups defined in EDSF can be used directly or need to be modified.
The next stages in the FAIRsFAIR project will target the definition of the required extensions to the Body of Knowledge and the development of the Data Stewardship Model Curriculum, which is expected to reuse the EDSF definition and methodology.