Data warehousing for Open Data sharing and decision support in agriculture: a case study of the VDSA Knowledge Bank and its development process

Acceptance of Open Data by governments, foundations, science councils and other sponsors of research across the world has prompted the trend towards Open Data. Many researchers and organizations have embraced the Open Data policy. Most often, the datasets are released through a Dataverse. However, just making datasets available is not enough. We need inbuilt user-friendly data extraction and transformation techniques, as well as analytical features that can facilitate analyses of research data and generate new knowledge and insights from such data. The Village Dynamics in South Asia (VDSA) Knowledge Bank (http://vdsakb.icrisat.ac.in/) is a unique data warehouse with user-friendly data retrieval and online analytical processing features for household-level panel data on India and Bangladesh. It is the first of its kind in the world for managing household survey data. It exemplifies how digital innovation tools (Microsoft SQL and Business Intelligence tools) can be used to process diverse, complex long-term databases, promote the cause of Open Data and facilitate decision support. This paper documents the development process, technical features and impacts of the VDSA Knowledge Bank. It provides an in-depth understanding of how to plan, partner with information technology firms, design and implement a data warehouse project. Thus, the main contribution of this paper is showing how developments in information technology can be innovatively used to promote Open Data and enhance impacts. Hopefully, this study will stimulate interest among researchers and donors in building data warehouses for the management and dissemination of Open Data.


Introduction
Data have long been handled as the private property of those who developed them. Databases were, and often still are, created and stored in different ways and analysed by different methods, and thus can be deeply siloed [1]. The advancement of the global Open Data movement and its acceptance by governments, foundations, science councils and other sponsors of research all over the world have prompted the trend towards Open Data. The two main criteria for Open Data are that they must be freely available online and in a format that allows re-use [1]. Such data are new global public goods [2].
There have been substantial advances in Open Data over the last decade. The Open Data Charter (opendatacharter.net) reports that more than 100 governments and organisations have committed to opening up data based on a set of global principles since 2015. The charter has influenced global data policies, helping shift focus to the purpose-driven publication of Open Data.
Many researchers as well as national and international research organizations have embraced the Open Data policy. Most often, the datasets are released through a Dataverse (e.g. the ICRISAT Dataverse, dataverse.icrisat.org, containing 420 datasets). A Dataverse is a container for datasets, files, and metadata. This approach has undoubtedly improved access to data, advancing science and scientific communication. However, one major limitation is that data are typically stored in raw formats, and external users need to invest considerable effort in processing the data to fit their research needs. Therefore, inbuilt user-friendly data extraction and transformation techniques can further improve data access.
Data warehouse (DW) approaches and online analytical processing (OLAP) have the potential to facilitate accessibility and analytical explorations of big data [3]. DWs can take vast amounts of data and present them in meaningful formats for making better decisions [4]. Data warehousing is currently one of the most important database technology applications in practice [5]. High levels of user demand and return on investment have been reported in the literature for such applications [6]. Additionally, developers face many challenges [7]. Compared to software engineering, it is still quite a young discipline and does not yet offer well-established strategies and techniques for development processes [8]. It has been reported that approximately 40% of DW projects fail to meet their design objectives [9]. Therefore, it is important to document and share information about successful DW projects and elaborate the process followed in the project, problems faced by the DW team and actions taken to overcome the problems and challenges.
The International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the Indian Council of Agricultural Research (ICAR), the International Rice Research Institute (IRRI) and other partners took up the challenge and developed a centralized agricultural DW. This required making smart choices about suitable design features and taking measures to ensure appropriate data quality following the DW protocol. At its core, the warehouse was built around a multidimensional panel dataset collected as part of the Village Level Studies (VLS) and the Village Dynamics Studies in South Asia (VDSA) projects between 1975 and 2014 in India and Bangladesh.
The specific case has relevance as there is limited experience with agricultural DW developments [10][11][12]. At the same time, there is strong demand for data capturing the complexity of agricultural systems [13,14] as support for improved policymaking [15].
The overall goal of this paper is to stimulate interest among researchers and donors in building data warehouses for the management and dissemination of Open Data. We have documented the development process, technical features and impacts of the VDSA Knowledge Bank. This paper provides an in-depth understanding of how to plan, partner with information technology firms, design and implement a data warehouse project. Thus, the main contribution of this paper is showing how developments in information technology have been innovatively used to promote Open Data and enhance the impacts of research funded by governments, foundations and philanthropists.

Literature review
DW design methodologies can be classified as data driven, goal driven, or user driven [8]. The basic steps to follow in warehouse development are analysis of the existing information system, requirement specification, conceptual design, workload refinement, dimensional scheme validation (or design of the dimensional fact model), logical design and physical design [16]. Important aspects to consider while building a DW are datastore characteristics, data modelling, and architecture [17]. Data volumes and expected queries are the characteristics of the workload, an important aspect in the logical and physical design phases [16]. The conceptual model needs special attention in converting user requirements into an error-free, understandable, easily extendable schema [18]. Data cleaning and data curation handle data redundancy, integrity and inconsistency in data warehouse development. The extraction, transformation and loading (ETL) process, which extracts data from various sources, transforms the data into a suitable format and loads it into DW storage [19], has to be dynamic.
Learning from previous development processes is difficult, given the complexity and diversity of data warehousing methodologies and tools [10][11][12]. In this paper, we therefore adopt an analytical framework for the development process, as presented by Sen and Sinha [20]. The framework includes the following attributes:
1. The core competency of the technical team includes experience with (1) both hardware and software DW technologies, (2) handling big data volumes, (3) understanding data structures, and (4) knowledge of appropriate statistical methods to analyse specific data types [21].
2. Modelling requirements refers to the techniques for capturing user requirements and modelling them. Procedures need to be in place to clarify the users' expectations, as any missing elements usually make later changes in the architecture costly [11].
3. Data modelling refers to the methodologies of developing conceptual, logical and physical models. The two most popular data modelling techniques for data warehousing are entity-relational and dimensional modelling.
4. Support for normalization/de-normalization is the process of removing data redundancy by storing all data strictly in one place and ensuring data dependency. This improves the efficiency of the DW in terms of storage space and functionality.
5. Architecture design philosophy refers to choices among architecture approaches, which depend on the type of data and the purpose for which it will be used [17]. Integration with metadata allows users to communicate more effectively with experts involved in DW architectural design.
6. Implementation strategies are often iterative processes. The classic system development life cycle and rapid application development are widespread practices. The first is the standard process for planning, creating, testing and deploying. The second focuses more on prototyping and less on planning.
7. Metadata management is an essential component of a DW. Metadata comprises all the information necessary to design, build, use and interpret the DW contents [22]. Well organized and structured metadata management enhances the efficiency of DW functionality [23].
8. Query design needs to support fast and efficient retrieval.
9. Scalability and change management addresses the question of whether and how the warehouse can handle future additional data loads.

Case description
The DW is at its core developed around the following data: (a) The household-level database provides multi-generational perspectives of social, agricultural and economic changes, along with a wealth of information on activities related to agricultural and household economies [24][25][26]. For more details on the data collection process, please see Binswanger and Jodha [27] and Walker and Ryan [28]. A description of the survey modules can be found on the VDSA website: vdsa.icrisat.ac.in/vdsa-microdoc.aspx.
The respective datasets are available in various formats, such as MS-Excel, ASCII, CS-Pro and STATA. ICRISAT has a strong open access policy, and the data were already accessible before the DW development at vdsa.icrisat.ac.in. Given the complexity of the underlying relations between the different tables, it was very inconvenient for users to extract data in formats appropriate for a specific research interest. Downloading data for a specific village or household type was not possible. The historical data were in coded form, and understanding them required referring to extensive codebooks and data manuals. This created a strong barrier for researchers and policymakers to use the data. These were the core pain points of the intended user base. To meet the information needs of policymakers and practitioners, the team envisioned a VDSA-DW with additional features for data analytics, OLAP, and basic data downloading.

Methodology
In this section, we briefly list the data sources used to describe the VDSA-DW development process. We group them into three categories based on their source of origin: (a) VDSA team, (b) technical team and (c) documents prepared by both teams.

VDSA team
A number of documents were prepared to specify the needs for the DW, the data characteristics and the selection of the technical team:

1. A conceptual diagram depicting the relationships between concepts. It helps to demonstrate the data structure and relationships to the technical team and provides an overall view of the dataset (http://vdsa.icrisat.ac.in/Include/Posters/ConceptMap-micro.pdf).
2. A mind map showing the different ideas associated with a particular concept.
3. The user requirement(s) document (URD) and the technical evaluation criteria, which gave both the data experts and the Information Technology (IT) experts a clear understanding of the expectations for the DW project.
4. The terms of reference for the technical team.
5. Ad hoc reports and variable lists.
6. Detailed documentation of the source datasets.

Technical team
The following documents of the technical team were used:
1. The proposal for the business intelligence solution, containing details of the technical team's understanding of the requirements, the proposed high-level solution architecture, the team structure, roles and responsibilities of the team, assumptions and dependencies, the delivery model and the governance model.
2. The hardware and software requirements for the application.
3. The presentation/user guidelines on the ad hoc analysis tool using Microsoft (MS) reporting services.
4. The data cleaning guidelines.
5. The administrator module, which describes the navigation flow of the application front end, user roles, etc.
6. The website user guide, which provides guidelines on the flow and structure of the front end of the DW management system.
7. The user acceptance testing documentation.
8. Metadata or data dictionaries, which include the list of all variables for each of the integrated VDSA modules along with their data types, such as integer, character, and decimal. This is a logical view of the data tables.

Other documents
1. Terms and conditions of the contract.
2. Agreement of relationship between ICRISAT and the vendor.
The documents were used by the authors to describe and reflect on the VDSA-DW design and development process.
In addition, open interviews were conducted with members of the VDSA and the technical team asking for remarkable aspects of the specific DW process, referring to the attributes of the analytical framework. The analytical approach was thereby qualitative and descriptive. The paper's results are, therefore, subject to the authors' judgements, which should be considered when interpreting the results.

Results
The VDSA DW architecture is presented in Fig. 1. It illustrates how the different attributes of DW development created the basis for the emerging architecture.
In this section, we describe the process of VDSA-DW development using the structure of the analytical framework.

Core competency
At the beginning of the DW development process, the VDSA team was flexible in terms of design features. The call for proposals asked proponents to respond to the information provided on the data structure. Concept maps, mind maps and user requirements with use cases were used as tools to explain the dynamics of the VDSA source data and the expected outputs from the solution. Submitted proposals were evaluated on the basis of the following criteria: architecture, hardware and software platforms, team competency, track record of the company, strengths of the proposed solution, time required for completion of the project, overall technical proposal, software development costs, costs for hardware, yearly maintenance costs, compatibility with existing ICRISAT infrastructure and degree of understanding of the VDSA data.
The last criterion turned out to be of major importance in the final selection. The proponents who best understood the VDSA dataset were also best able to tailor the other criteria to the needs of ICRISAT. For instance, while the structure of the data tables is quite complex, the number of users at any point in time was not expected to exceed twenty-five, and the size of the data was less than one TB. The warehouse was expected to serve mainly academics.
As a consequence, speed was not a constraint, and Microsoft technology, as a comparatively cheaper package to develop and maintain, served the purpose. The Microsoft SQL Server methodology was chosen as the most appropriate option for developing the VDSA application. Microsoft SQL Server 2008 R2 comes with the necessary tools required for warehouse development. It consists of the SQL Server database engine for creating and driving relational databases; SQL Server Analysis Services (SSAS) for online data analysis, creating OLAP cubes, data mining and responding to ad hoc queries; SQL Server Reporting Services (SSRS) for reporting; and SQL Server Integration Services (SSIS) for the ETL process that cleans and formats raw data from source systems for inclusion in the database as ready-to-use information. The cube offers the slice-and-dice options necessary for the development of summary reports. Facts (measures) and dimensions (attributes) are the essential components of a cube. For reporting, PerformancePoint Services (PPS) and Excel Services were hosted on the SharePoint 2010 site. The Microsoft Report Builder tool was used for ad hoc reporting, since it allows users without any knowledge of SQL to build queries. The reports were built on the SSAS cube or the DW, depending on report requirements.
The technical team conducted sessions on the different Report Builder options available and helped the customer choose a suitable option based on the requirements. The main theme of this phase was specifying the requirements with greater clarity, which formed the basis for the software requirements specification (SRS) developed by the technical team.
The presentation of the expected reports from the application, with all details including their look and appearance, was a point of discussion. Eliciting the required reports, along with the detailed business rules to be applied to the source data to obtain them, was an important task for the customer. Continuous interaction between the two teams in reviewing progress and addressing issues with proper technical guidance was an integral part of project development.

Requirement modelling
The VDSA-DW was developed through intensive interactions between the selected technical development team and ICRISAT. The intended uses were intensively discussed, and a list of required reports was prepared. Continuous interactions between key stakeholder subject matter specialists, the developers and the ICRISAT team formed the basis for the requirement modelling. This further deepened the technical team's understanding of the data structure, the expected outputs, the detailed list of summary and ad hoc reports, and the targeted user groups. Based on this information, the technical team developed a software requirements specification (SRS). It established the basis for the agreement between ICRISAT and the technical team and provided the details of what the DW would deliver. It incorporated the systematic development of the DW and a detailed procedure for the development of the reports, including all the formulae. It permitted a rigorous assessment of requirements before the design began and reduced costly adjustments later. It further defined how the DW interacts with system hardware, other programs and the user community. The SRS was strictly followed throughout the development process.

Data modelling
An entity-relational (ER) conceptual model was used for data modelling. A data mart architecture was used for VDSA-DW development. Data marts are datastores that are subordinate to a DW. They fulfil specific application requirements of a certain user group [20]. Data were split into marts by subregion and time of data collection. Separate data marts were created for data sets of different geographies, since this lowers the query response time.
Dimensional modelling and star schema were used because they yield simpler designs and efficient retrievals, which is a prime requirement for large DWs [3]. The design is composed of a fact table and several dimension tables.
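To make the design concrete, the sketch below shows what such a star schema could look like in T-SQL. The table and column names (DimHousehold, DimCrop, DimTime, FactCultivation) are illustrative assumptions for this paper, not the actual VDSA-DW schema, which is far more extensive.

```sql
-- Minimal, illustrative star schema in T-SQL (all names are hypothetical).
CREATE TABLE DimHousehold (
    HouseholdKey   INT IDENTITY(1,1) PRIMARY KEY,
    HouseholdCode  VARCHAR(20) NOT NULL,   -- source survey identifier
    Village        VARCHAR(50),
    State          VARCHAR(50),
    Country        VARCHAR(50)
);

CREATE TABLE DimCrop (
    CropKey  INT IDENTITY(1,1) PRIMARY KEY,
    CropName VARCHAR(50) NOT NULL
);

CREATE TABLE DimTime (
    TimeKey    INT IDENTITY(1,1) PRIMARY KEY,
    SurveyYear SMALLINT NOT NULL,
    Season     VARCHAR(10)
);

-- The fact table holds the measures and links to the dimensions via foreign keys.
CREATE TABLE FactCultivation (
    HouseholdKey INT NOT NULL REFERENCES DimHousehold(HouseholdKey),
    CropKey      INT NOT NULL REFERENCES DimCrop(CropKey),
    TimeKey      INT NOT NULL REFERENCES DimTime(TimeKey),
    AreaHa       DECIMAL(10,2),   -- measure: cultivated area
    YieldKg      DECIMAL(12,2),   -- measure: production
    GrossIncome  DECIMAL(14,2)    -- measure: value of output
);
```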

Support for normalization/denormalization
Organizing the source data in a relational database, after data cleaning and curation, was the basic requirement. Data curation required some capacity building by the technical team on the guidelines to be followed. With continuous interactions and the technical guidance of the vendor, the customer accomplished this important task. This also involved adding systematic code books for some of the variables held in coded form, and support in developing unique keys throughout the data sets to establish the links among them. Table joins were essential for complex querying in the DW. Ad hoc and summary report preparation were supported by these joins. Primary keys and foreign keys were defined to support the joins for each fact table.
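As an illustration of how such keys support the joins, the hypothetical constraints below define a composite primary key on a source household table and a matching foreign key on a detail table; the table and column names are assumptions for this sketch only, not the VDSA-DW definitions.

```sql
-- Hypothetical constraints: a composite primary key on the household master
-- table and a foreign key on a detail table referencing it, so that the joins
-- used for ad hoc and summary reports resolve unambiguously.
ALTER TABLE HouseholdMaster
    ADD CONSTRAINT PK_HouseholdMaster
        PRIMARY KEY (SurveyYear, VillageCode, HouseholdCode);

ALTER TABLE CultivationDetail
    ADD CONSTRAINT FK_CultivationDetail_Household
        FOREIGN KEY (SurveyYear, VillageCode, HouseholdCode)
        REFERENCES HouseholdMaster (SurveyYear, VillageCode, HouseholdCode);
```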

Architecture design philosophy
Kimball's bottom-up approach was followed in the VDSA-DW development (Fig. 1). Data marts were created first, data were loaded into the data marts, and then loaded into the warehouse through the ETL process. The data flow in the bottom-up approach starts by extracting data from the various source systems into the staging area, where it is processed and loaded into the data marts that handle specific business processes. After the data marts were refreshed, the current data were once again extracted into the staging area, and transformations were applied to create data in the data mart structure. The data extracted from the data marts to the staging area were aggregated, summarized and loaded into the enterprise data warehouse (EDW), and then made available to end users for analysis and critical business decisions. The technical team interacted continuously with the VDSA team in designing the appropriate architecture for the application. Any abnormalities identified or modifications requested in the design at a later stage would have been a huge cost to the customer in terms of finances, manpower and deviations from timelines.
Discussions related to website development were an integral part of this phase. These included estimating the expected number of concurrent users at any single point in time, the time taken to display results, the different types of user groups, and their roles and privileges. The registration process, password policy, administrator roles and ETL record-tracking were among the many points of discussion.

Implementation strategy
This was the physical development of the application. An iterative implementation strategy was followed. The developed solution was tested with test cases, and identified bugs were addressed in the revised solution. The application was deployed successfully on a production server to enable open access for the global user community. A single server jointly hosted SQL Server Integration Services (SSIS) and the SQL Server 2008 R2 database server (including the staging databases and the DW). User test cases were prepared by the VDSA team for all types of planned reports. These test cases were used to critically test the developed product for producing the desired outputs, and any possible bugs were noted. Any deviations in terms of appearance, text and results were immediately reported to the technical team for incorporation in the development process. This was repeated until the reports complied with the expectations of the VDSA team. There are three critical functions in this crucial part of DW development: (a) extract, (b) transform and (c) load (ETL). The ETL migrates data from a source to a target database. The extraction process extracts the VLS/VDSA data from the source systems (Excel, text, CS-Pro), validates, curates, consolidates and transforms it into a common format that is compatible with the DW. The transformation process converts the data into the SQL Server format. It also applies business rules, combining different types of data and calculating new measures and dimensions. The loading process pulls data out of the diverse source files of different formats situated in different tables, transforms them along the lines of the business rules, and loads the data into target databases known as data marts. This process is responsible for converting the various tables in the DW into a form that end users can analyse, visualize or model. Special attention was paid to implementing the ETL efficiently [7].
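The following T-SQL fragment is a minimal sketch of such a transform-and-load step, reusing the hypothetical star-schema tables introduced above; the staging table, the unit conversion and the validation rule are assumptions chosen for illustration, not the actual VDSA business rules.

```sql
-- Illustrative transform-and-load step: move cleaned staging rows into the
-- fact table, applying a unit-conversion rule and resolving dimension
-- surrogate keys through lookups (all names and rules are hypothetical).
INSERT INTO FactCultivation (HouseholdKey, CropKey, TimeKey, AreaHa, YieldKg, GrossIncome)
SELECT  h.HouseholdKey,
        c.CropKey,
        t.TimeKey,
        s.AreaAcres * 0.4047,           -- example rule: convert acres to hectares
        s.ProductionKg,
        s.ProductionKg * s.PricePerKg   -- example derived measure
FROM    Staging_Cultivation AS s
JOIN    DimHousehold AS h ON h.HouseholdCode = s.HouseholdCode
JOIN    DimCrop      AS c ON c.CropName      = s.CropName
JOIN    DimTime      AS t ON t.SurveyYear    = s.SurveyYear
WHERE   s.AreaAcres IS NOT NULL;        -- basic validation from the cleaning guidelines
```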
Deliberations during this stage focused on resolving issues in implementing the ETL. Although a clear metadata schema had been developed, some anomalies and issues emerged during the actual implementation of the ETL. Some adjustments had to be negotiated without compromising data quality or data integrity.

Metadata management
Microsoft SQL Server technology has a built-in metadata management component. The VDSA-DW uses a metadata repository to integrate all of its components. The metadata stores definitions of the source data, data models for the target databases, and the transformation rules that convert source data into target data [18]. It also stores lists of tables and columns along with their data types. Metadata development was a major activity of the VDSA-DW development; it took place in the logical phase of ER modelling. The development of metadata required regular meetings of the teams with an obvious focus on quality metadata. The clear specification of the data structures of the harmonized data sets was a milestone achieved in this phase. Sometimes there were abnormalities, contradictions or outliers in the source data in this regard. Harmonizing these aspects was aided by the technical support of the vendor. Some variables were qualitative in some data sets and quantitative in others, and a common structure needed to be established. For example, age was designed as an integer variable, but it was later found that some cases recorded age in months. After discussions, it was decided to convert it to decimal form (a rational number).
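As a rough illustration of what such a metadata repository and the age harmonization could look like, the sketch below defines a hypothetical data dictionary table and a conversion of age recorded in months into decimal years; the structure and names are assumptions, not the actual VDSA-DW metadata schema.

```sql
-- Hypothetical slice of a data dictionary: each harmonized variable is
-- described once, with its agreed data type, so that source files can be
-- checked against it during ETL.
CREATE TABLE MetaDataDictionary (
    ModuleName   VARCHAR(50),
    VariableName VARCHAR(50),
    DataType     VARCHAR(20),    -- e.g. INT, DECIMAL(5,2), VARCHAR(50)
    Description  VARCHAR(200),
    CodeListRef  VARCHAR(50)     -- link to the relevant code book, where applicable
);

-- Illustration of the age harmonization described above: where age arrived in
-- months, convert it to decimal years during transformation.
UPDATE Staging_HouseholdMember
SET    AgeYears = CAST(AgeMonths AS DECIMAL(5,2)) / 12.0
WHERE  AgeYears IS NULL
  AND  AgeMonths IS NOT NULL;
```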

Query design
Parallel processing can handle massive data sets efficiently for loading as well as querying [28]. Microsoft technology supports parallel processing of queries and hence minimizes the query response time. Queries were designed by the technical experts based on the needs of the users. MDX queries were used for the summary reports. During continuous testing of both the tabular and graphical reports, many issues arose, such as graphical presentations that did not match the specifications or tabular reports that did not present correct values. These were resolved through continuous interaction, feedback, revisions and re-checking of the reports.
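The production summary reports were implemented as MDX queries against the SSAS cube; as a simplified relational equivalent, the hypothetical query below shows the kind of aggregation a predefined report performs, e.g. average yield and total gross income by crop and survey year, using the illustrative star-schema tables from the sketches above.

```sql
-- Simplified relational equivalent of a summary-report query (the actual
-- reports used MDX against the cube; all names here are hypothetical).
SELECT  t.SurveyYear,
        c.CropName,
        AVG(f.YieldKg / NULLIF(f.AreaHa, 0)) AS AvgYieldKgPerHa,
        SUM(f.GrossIncome)                   AS TotalGrossIncome
FROM    FactCultivation AS f
JOIN    DimCrop AS c ON c.CropKey = f.CropKey
JOIN    DimTime AS t ON t.TimeKey = f.TimeKey
GROUP BY t.SurveyYear, c.CropName
ORDER BY t.SurveyYear, c.CropName;
```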

Scalability and change management
The VDSA-DW is based on multilevel data collected over time. Additional data may be collected in the future, and this possibility was considered in the VDSA-DW development. The data mart architecture used offers the required flexibility. In addition, it is possible to include macro-level (country and state) and meso-level (district) data in the DW. The VDSA team expects the integration of additional databases related to meso-level and special-purpose surveys into the VDSA-DW at a later stage with technical support.

Final DW product
The VLS/VDSA dataset is an international public good that became much more accessible due to the VDSA-DW (vdsakb.icrisat.ac.in). The VDSA-DW has special features that enable users to select and aggregate data across different dimensions and platforms for analytical reporting. It also has business intelligence tools, user feedback systems, key performance indicators, dashboard facilities, differential access for different types of users, security, project management, user management, admin reports, recent reports and survey instruments. The DW facilitates trend reports based on common fields from all households, as well as region-specific households. As a result, policy researchers can quickly address issues of agricultural growth and the development of rural farm households. Its effectiveness is visible from the number of downloads across the globe. Summary reports (predefined or canned reports) on key indicators of the rural economy, such as productivity and profitability in farming, cropping patterns, household income by source and class, consumption and expenditure on food and non-food items, asset ownership and value, distribution of land ownership, coping mechanisms adopted by farmers to face natural calamities, wages and prices, credit, benefits received from government welfare programmes, and rainfall, are used by researchers to analyse the performance of the rural economy and articulate the implications for agricultural research and development policies.

Impacts of the VDSA data warehouse
The VDSA-DW supported the intensified use of the specific data set. As of 15 May 2020, 3317 unique users (including 1200 Ph.D. students) from over 300 universities and research organizations in 74 countries of Asia, Africa, Australia, Europe and North America had obtained access to the data. Approximately one-third of the users are female. The access statistics clearly show that the number of users increased substantially once the DW went online in July 2013 (Fig. 2).

Discussion
This paper used the conceptual framework and the attributes described in [20] to reflect on the experience of developing the VDSA-DW. ICRISAT's experience in developing an agricultural DW suggests that a hybrid and iterative approach is required that is simultaneously data, goal and user driven [8]. Most critical for the success of the development process was a clear understanding by the whole team of the structure of the data and its intended use by the target groups. Clarifying this at an early development stage supported efficient project implementation.
At the same time, VDSA-DW development faced numerous challenges that were also reported earlier [4]. Frequent and regular interactions of all partners involved in the VDSA-DW development process were essential to address these challenges.
1. It requires effort to explain the goals of the DW to the technical team, especially if the dataset has complex content and structure that requires domain expertise to understand. Systematic requirement modelling, as described in Sect. 5, was therefore critical.
2. It was a difficult task to normalize data into an integrated structure, given that the information was collected over many years across 42 sites by different interviewers. Some time-specific situations that occurred during data collection affected the data structure. The DW development process was, therefore, instrumental in finding appropriate integration strategies and, in this way, improved the usability of the VDSA data in general. Data modelling and a suitable selection of primary and foreign keys enabled addressing many of these challenges.
3. During the data harmonization process, it was revealed that there was a gap in understanding the standards and actions required for delivering the compatible dataset needed by the technical team. We overcame this challenge through a 6-week-long mandatory training on Microsoft SQL for all members of the data team.
4. Similarly, the DW development process revealed a substantial need for data cleaning to enable the ETL processes. The preparation of guidelines for data cleaning was essential for a successful ETL. This process also improved the general data quality.
5. A data dictionary was developed, and data had to be rearranged according to the data types prescribed in the dictionary. This work successfully supported the harmonization of data across locations and across various sources.
6. Even for scientists involved in the VDSA project, the source-to-code linking of variables and tables was a challenge. The formulation of common codes and descriptions of the various components of the codes enabled linking all variables, tables of code files and tables of data files.
7. Smooth running of the ETL was a difficult task, as it needed multiple iterations before the errors in the source data could be rectified. Operator-level training from the technical experts helped to overcome this challenge.
8. From the perspective of the technical team, the lack of proper business understanding on the customer side was the main issue. Converting the business logic into technical terminology was an important aspect that they found satisfying.

Conclusion
Data warehouse (DW) approaches and online analytical processing (OLAP) have the potential to facilitate accessibility and analytical exploration of agricultural big data and to steer forward the Open Data Charter. It has been reported that approximately 40% of DW projects fail to meet their objectives. Therefore, it is important to document and share information about successful DW projects and to elaborate the process followed, the problems faced by the DW team and the actions taken to overcome those problems and challenges. This paper contributed to sharing the lessons learnt. The VDSA-DW has greatly facilitated the harmonization of, and user-friendly access to, longitudinal data collected at different scales, in different geographies, and across a wide range of domains of agricultural systems. This created an opportunity for a global community of scientists to study South Asian agricultural systems across geographies over almost half a century. The DW substantially reduced the time and effort needed to generate reports and extract data for more in-depth analysis.
An important lesson from the VDSA-DW experience is that the DW development process provided a frame for better data management in general. One unique feature of this project concerned the data formats that had to be migrated to the data warehouse. The source data were in a variety of formats, as they were captured from a time when electronic media were not prevalent. The DW structure forced the data owners to clean and harmonize the data. Earlier, this work had to be performed individually by each data user, which created strong constraints on using the data. This points to an important issue in Open Data management. The probability of data actually being used increases substantially if potential users are provided with tools to quickly understand the data, orient themselves to its structure, intuitively merge module data into data formats fitting their research interests and analyse the data for immediate adaptive and dynamic decision support. This is of special importance given that a DW for agricultural data targets more diverse users than, e.g., most private sector DWs. Another lesson learnt was that the internal data team can now formulate future data collection efforts according to these better data management practices, so that integration with a DW will require less effort in the future. Structuring the DW development process according to the framework attributes helps ensure that these objectives are achieved. To this end, the VDSA-DW can be a role model for Open Data management and sharing for research and decision support.
The need for continuous interactions between the technical team and scientists is a key lesson learned from the VDSA-DW process. Only such an inclusive, teamwork approach ensures that the DW product fits the data and the needs of the users and fulfils its envisioned function.
This study has documented how developments in information technology can be innovatively used to promote Open Data and enhance the impacts of research funded by governments, foundations and philanthropists. Hopefully, this paper will stimulate interest among researchers and donors in building data warehouses for better management and use of data.

Fig. 2 Year-wise access of VDSA data by users across the world