Semantic approach for multi-objective optimisation of the ENTICE distributed Virtual Machine and container images repository

Summary New software engineering technologies facilitate development of applications from reusable software components, such as Virtual Machine and container images (VMI/CIs). Key requirements for the storage of VMI/CIs in public or private repositories are their fast delivery and cloud deployment times. ENTICE is a federated storage facility for VMI/CIs that provides optimisation mechanisms through the use of fragmentation and replication of images and a Pareto Multi-Objective Optimisation (MO) solver. The operation of the MO solver is, however, time-consuming due to the size and complexity of the metadata, specifying various non-functional requirements for the management of VMI/CIs, such as geolocation, operational cost, and delivery time. In this work, we address this problem with a new semantic approach, which uses an ontology of the federated ENTICE repository, a knowledge base, and a constraint-based reasoning mechanism. Open Source technologies such as Protégé, Jena Fuseki, and Pellet were used to develop a solution. Two specific use cases, (1) repository optimisation with offline and (2) online redistribution of VMI/CIs, are presented in detail. In both use cases, data from the knowledge base are provided to the MO solver. It is shown that Pellet-based reasoning can be used to reduce the input metadata size used in the optimisation process by taking into consideration the geographic location of the VMI/CIs and the provenance of the VMI fragments. It is shown that this process leads to a reduction of the input metadata size for the MO solver by up to 60% and a reduction of the total optimisation time of the MO solver by up to 68%, while fully preserving the quality of the solution, which is significant.

enough without an optimised knowledge management at the level of VMIs. 24 Knowledge about cloud resources and virtualisation environments, such as similarities among VMIs, image clusters, and data centre topologies, generally supports the design of de-duplication and other management mechanisms for VMI files. 25 Jayaram et al. 25 concluded that empirical analysis of VMI files can be leveraged to design smart image distribution schemes, take information about the operational environment into account, and help VM provisioning, cloning, and migration.
Sack et al. 26 proposed a semantic approach for addressing an NP-complete decision problem over a bibliographic domain. Our work differs from the cited literature in a major way: the presented automatic distribution and decomposition of VMIs uses a knowledge-intensive approach based on W3C interoperability standards, such as OWL2, 12 and aims to generally improve software engineering practices. Even though there are methodologies describing the knowledge base ontology with various graph-based models 27 and interactive semantic approaches, 28 the ENTICE system knowledge base consists of a single graph-based ontology. The use of semantics (ontology, knowledge base, and reasoning mechanisms, including heuristic reasoning rules 29 ) may contribute significantly to the operation of the ENTICE distributed repository for VMI/CIs by providing important logistics: starting from the management of SLA agreements between the software engineers and the ENTICE environment, then by providing input data and information for important functionalities of the environment, such as MO, and other logistics for VMI/CI migration between the repositories, portability, availability, reliability, costs, and other non-functional properties.
The present work focuses on the area of semantic modelling and knowledge engineering for multi-cloud environments. Semantic approaches have already been used to address problems in distributed, grid, and cloud computing environments in a variety of studies. Some past projects covering various semantic aspects have been myExperiment, 30 InteliGrid, 31 OntoGrid, 32 mOSAIC, 33 and others. Currently ongoing related projects, to which SWITCH 35 and ENTICE 36 can be compared, include Smart Cloud Engine. 34 Among those, ENTICE is the only project focusing on the development of a distributed repository; the projects Smart Cloud Engine and SWITCH are concerned with the runtime of cloud applications. These projects have concentrated on the delivery of high-quality, fully functional, and elastic cloud applications. An example is the design and development of a component-based cloud application for solving fluid and solid mechanics problems. It is composed of fully elastic components called Cloudlets, which can be managed across multiple clouds. 8 This application has compute-, memory-, and communication-intensive software components, which can greatly influence the user experience. It is therefore important to provide elasticity to applications like this one, for example, an ability to move, deploy, and scale the number of running VMs or containers.
This research domain poses some very specific challenges, such as the need to support both strategic decisions, when the software engineer negotiates an SLA for her/his VMI with the ENTICE repository, and dynamic decisions, when automated elasticity has to be obtained by the cloud application at runtime. Our goals are therefore not only to semantically model the ENTICE distributed repository environment but also to provide mechanisms in the knowledge base that will support both strategic and dynamic reasoning and information support. In a wider context, this work is poised to contribute to more optimised and modern software engineering approaches that include more machine-processable knowledge and information in the software engineering life cycle.

The ENTICE federated repository
The ENTICE federated repository is developed to allow for efficient operations, including the moving, replication, and delivery of VMI/CIs, as illustrated in Figure 1. The developer can access VMI/CIs via a graphical User Interface (UI) and, from there, deploy them on a cloud provider.
The overall software engineering process is informed by metadata, which is systematically gathered, exchanged, and otherwise managed in a knowledge base. The key components of the ENTICE environment that encapsulate all the use cases are presented as follows:
1. Development tools (such as Fabric8 5 ). These are basic instruments used by a developer of a cloud application. Such tools do not yet provide enough information to the developer concerning the deployment stage (e.g., the best deployment location). Programs, APIs, and other services have to be deployed using a server that runs on a specific operating system. Creating VMI/CIs that contain specific services, even for testing purposes only, takes time, and the developer must have some knowledge about the library-dependent software of the service, the scalability of the software, and similar. By using the ENTICE environment, a developer can access the images through a RESTful service or through a specifically designed UI facilitating repositories-wide software search and discovery.
2. ENTICE knowledge base service, which comprises an RDF store and mechanisms, such as validations, reasoners, rules, and similar, serving information to all other components of the architecture. The service interchanges the data through the ENTICE external interfaces, the developer API, and the ENTICE dedicated services. All the data are organised according to a domain ontology. Based on the ontology, sets of reasoning mechanisms are implemented.

FIGURE 1 The ENTICE environment uses knowledge for optimal operations with Virtual Machine and container images
3. The ENTICE User Interface (UI) provides the developer with several features, such as optimising and searching VMI/CIs, uploading new images to the repositories that include pre-installed user-specific services and applications, or even system-based settings (e.g., network configuration, SSH, and system credentials). Besides the image upload, the user has the possibility to generate images by using a script (e.g., a Chef 37 or Puppet recipe 38 ) that includes functional requirements. In the UI search screen, the developer is able to search among all available public VMI/CIs, which are stored in public VMI/CI repositories, such as those of Amazon S3 10 or Docker Hub. 11
4. The services of the ENTICE environment provide functionalities that are exposed through the UI. These services are elastic and geographically distributed in order to optimise the operation based on information and knowledge on network performance (e.g., latency and current bandwidth) obtained through monitoring services. For each VMI/CI containing user-developed software components (e.g., services, databases, APIs, and other applications), a special ENTICE service can be used to reduce the image size (e.g., remove unused libraries and documentation) in order to significantly reduce the VMI/CI size while not affecting the functionalities of the mandatory running applications. By performing an operation like this, the user may save significantly on the storage cost as soon as the software asset is deployed in the VMI/CI repositories and experience faster auto-scaling operations of a running VM.

Using the ENTICE environment
A VMI/CI distributed upload is a more specific use case, which best describes the intent of this paper. Figure 2 shows the steps involved from the viewpoint of the ENTICE user (e.g., the VMI/CI developer).
First, the image upload request is made to the ENTICE environment. Then the user is able to specify QoS requirements about the image, such as the geographic location of the service, storage and/or network latency, image availability, image delivery time to the cloud, and the cost of storing and downloading the image. Some of these requirements might be specified as constraints to the system (e.g., due to country-specific legislation on storing data, or due to storage quota/capacity limitations), while others as optimisation objectives (e.g., find the storage with the best price). A multi-objective optimisation algorithm then considers the constraints and optimisation goals and tries to find the best storage for the request. An outcome of the algorithm is a set of feasible storages (with respect to the constraints) that are at the same time the best fit (with respect to the optimisation objectives); thus, the outcome facilitates the user in decision making. Because the user is able to specify several optimisation objectives at the same time, the best solution does not result in a single storage, but instead in a set of best solutions, known as a Pareto front. The final decision upon the image placement is then left to the user to allow for fine-tuning her optimisation objectives before the image is stored into the best storage.

FIGURE 2 VMI/CI distributed upload by specification of non-functional properties
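The selection of non-dominated storages described above can be sketched in a few lines. The storage records, attribute names, and objective values below are invented for illustration and do not reflect the actual ENTICE data model:

```python
# Sketch of Pareto-front selection over candidate storages.
# Each storage is scored on two objectives to be minimised:
# storage cost and expected delivery latency. Records are illustrative.
storages = [
    {"name": "eu-west", "cost": 0.023, "latency_ms": 40},
    {"name": "eu-east", "cost": 0.020, "latency_ms": 55},
    {"name": "us-east", "cost": 0.018, "latency_ms": 120},
    {"name": "eu-north", "cost": 0.025, "latency_ms": 60},  # dominated by eu-west
]

def dominates(a, b):
    """True if storage a is at least as good as b on all objectives
    and strictly better on at least one (minimisation)."""
    return (a["cost"] <= b["cost"] and a["latency_ms"] <= b["latency_ms"]
            and (a["cost"] < b["cost"] or a["latency_ms"] < b["latency_ms"]))

def pareto_front(candidates):
    """Keep only the candidates not dominated by any other candidate."""
    return [s for s in candidates
            if not any(dominates(o, s) for o in candidates if o is not s)]

front = pareto_front(storages)
print(sorted(s["name"] for s in front))  # ['eu-east', 'eu-west', 'us-east']
```

All three remaining storages are incomparable trade-offs (cheaper versus closer), which is why the final choice among them is left to the user.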

Semantic approach
The role of the knowledge base in the above use case ranges from collecting and storing the QoS for every image to providing information about the storages and supporting the decision making process. The relation between the knowledge base and the multi-objective optimisation service is of particular interest in this study. We experimented with two different procedures, which mainly differ in the information exchange between the knowledge base and the multi-objective optimisation service. They are described in the following.
In the first procedure, the knowledge base supplies the multi-objective optimisation service with the list of storages and their properties, as well as the list of constraints for the image upload request. The optimisation algorithm then performs the Pareto front computation by considering the whole set of storages and takes care of the constraints after the Pareto front is computed. This serves as a reference model that we want to improve with the second procedure.
The second procedure is therefore designed to efficiently reduce the initial set of storages provided to the Pareto front computation. This way, less data have to be exchanged between the knowledge base and the multi-objective optimisation service, and the optimisation is performed on a feasible set of storages only, with respect to the user's constraints. Therefore, the multi-objective optimisation algorithm in this case does not need the list of constraints.
The second procedure has performance and cost advantages over the first one. Applying constraints to the set of storages directly in the knowledge base consumes less network bandwidth, which can in some cloud settings result in lower cost. It can also improve transfer times, if the application of rules is not too computationally intensive for the knowledge base. Next, if the whole set of storages is very large and the applied constraints eliminate most of the storages, the computation of the Pareto front might become difficult, because the multi-objective optimisation algorithm works on random subsets of its input set, as described in Section 6. Even if that is unlikely to occur with a carefully designed genetic algorithm, it is more likely that a small subset of storages fits into memory than the larger set, which should also improve the time required for the Pareto computation.
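The difference between the two procedures can be illustrated with a minimal sketch. The storage attributes and constraints below are hypothetical; in ENTICE the filtering of the second procedure is performed inside the knowledge base through reasoning and SPARQL, not in application code:

```python
# Procedure A passes all storages plus the constraints to the optimiser;
# procedure B applies the constraints first and sends only the feasible set.
storages = [
    {"name": "ljubljana", "region": "EU", "free_gb": 900},
    {"name": "budapest",  "region": "EU", "free_gb": 10},
    {"name": "virginia",  "region": "US", "free_gb": 2000},
    {"name": "innsbruck", "region": "EU", "free_gb": 450},
]

# Hypothetical user constraints: image must stay in Europe
# and the storage needs at least 100 GB of free space.
constraints = [
    lambda s: s["region"] == "EU",
    lambda s: s["free_gb"] >= 100,
]

def feasible(candidates, constraints):
    """Procedure B: keep only storages satisfying every constraint."""
    return [s for s in candidates if all(c(s) for c in constraints)]

subset = feasible(storages, constraints)
print([s["name"] for s in subset])  # ['ljubljana', 'innsbruck']
```

Only the two feasible storages reach the Pareto computation, so the optimiser never wastes effort on infeasible candidates.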
Nevertheless, if the multi-objective optimisation procedure needs to combine storages in order to meet the constraints (e.g., replicate an image over a subset of storages to meet a high image availability requirement), the number of combinations grows exponentially with the number of storages. Some of these issues are summarised in Figure 3.
The detailed analysis of the various use cases leads to a collection of functional and non-functional requirements for the development of the ENTICE ontology and the associated knowledge base.

FIGURE 3 Comparison of two different procedures (A and B) of data preparation for the Pareto front computation. Black dots denote storages subject to the user's constraint, e.g., a particular VMI has to be stored in Europe; white dots denote storages that do not meet the constraints. Plots below the map signify the difference between the two procedures. The first (A) procedure starts with all storages and computes the Pareto front on the samples taken from the whole set of storages. In the case when the white dots outnumber the black dots by a large margin, the method may require significant computations to find the Pareto front satisfying the constraints. This is not the case for the second (B) procedure

REQUIREMENTS ANALYSIS AND DEVELOPMENT OF THE ENTICE ONTOLOGY
In order to develop an ENTICE ontology and associated knowledge base, it was first necessary to collect and analyse key requirements. Here, we first identify the functional requirements, which are needed to integrate the semantic technology with other services of the ENTICE environment.
Following this, non-functional requirements are elaborated. These are necessary to preserve elevated overall system Quality of Service (QoS) and Quality of Experience (QoE). The QoE measures the actual user's experience with the ENTICE environment. The second part of the section explains the development of the ENTICE ontology.

Functional requirements
The new system can be accessed only by registered users. Therefore, the knowledge base must support different authorisation and authentication levels (e.g., only administrators can access special panels and further interact with operations in progress, check the current status of services, and compare them to the knowledge base content). Various search mechanisms should be provided, from simple basic search queries to obtain data stored in a single entity (e.g., search VMI/CIs by different criteria and check a repository resource status) to more complex search mechanisms in order to satisfy the capabilities for:
• search between individuals of the same type and their chronological differences, usually derived from the same ancestor, due to updates (e.g., updating a VMI operating system and adding new applications or functionalities). In those cases, the mechanism should be able to find redundant and outdated individuals in order to remove them; and
• rule-based constraint verification (e.g., a compatibility check for newly added software in a CI/VMI).
Due to the heterogeneity of the ENTICE system, the knowledge base service must also support connectivity through external interfaces. Thus, the knowledge base has to support commonly used connectivity protocols like REST. In some cases, the knowledge base service has to access sensitive data, and a certification mechanism must be supported (e.g., Secure Shell (SSH)) to fulfil the security requirement aspect.

Non-functional requirements
The knowledge base needs to address a variety of non-functional requirements that can be identified in the majority of ENTICE environment components. The most important of them are the following:
• achieving high performance of the services inter-connected via the knowledge base (e.g., fast query responses);
• ability to distribute the knowledge base service or even the knowledge base environment itself to avoid a single point of failure;
• scalable knowledge base architecture and ontology (e.g., by adding new cloud providers, the system must be able to store new pricing metrics) and the possibility to increase resources for the growing knowledge base storage;
• availability, where monitoring of the RDF store service has to be running, minimising the downtime of the knowledge base service or notifying the administrators of major system faults;
• security, which overlaps with the functional requirements; and
• data integrity, by using different validation-supporting reasoners (e.g., HermiT and Pellet) and adequate action in case of a data integrity violation (e.g., data types or relationships do not match the ontology schema definition).
In the following, we elaborate the design and development of the ENTICE ontology.

Ontology development
The development of the ENTICE ontology was done iteratively, by identifying the functionalities and their matching entity classes among all subsystems, their dependencies in the form of relationships, and the constraints on entity attributes to satisfy the data exchange between ENTICE services.
The aim of the ontology design process is to create a robust ontology schema that can be scaled to support new functionalities while minimally affecting the existing ones (e.g., by only adding new relationships and entity attributes). Particular care was taken in developing reasoning capabilities with Pellet and possibilities to use the RDFS 40 and SWRL 41 rule engines. By using reasoners, it is possible to facilitate the environment's information flow, minimise human errors while inserting new RDF data, check the existing ontology data for inconsistencies, and even simplify data transfer between the ENTICE services.
The entire ENTICE ontology with interconnected entities is presented in Figure 4. Since the knowledge base is the main database of the ENTICE environment, all essential use case data must be covered. The most important ontology entities are described in Table 1 and are used as the main knowledge assets in the experiments. Thus, besides the ontology scalability aspect, other important criteria are also satisfied, including:
• data must be efficiently accessed through queries and reasoning mechanisms by following Ontology Design Patterns; 42
• new expressiveness must be achieved by inferencing over the ontology content; therefore, new relationships should be generated to facilitate querying through matching the ontology;
• a minimum redundancy level should be reached during the ontology development; and
• classes must be designed to support future scalability by generalising concepts (e.g., the ProvenanceData class should not be used only for Fragments).
The developed ontology stores important concepts about the entire ENTICE environment, such as:
• concepts of software resources (e.g., VMI/CIs and cloud-based environmental settings);
• programming concepts (e.g., storage complexity and taxonomy of functional properties);
• virtual organisation concepts (e.g., privileges, credentials, and ownership);
• resource negotiation-related concepts (Pareto SLAs);
• QoS concepts; and
• runtime environment concepts (e.g., monitoring).
Based on the ontology design, it may be understood that the ENTICE knowledge base focuses on the cloud domain with specific use cases covering broad aspects, such as VMI/CI distribution, fragmentation, and SLAs management.

Role of the ENTICE knowledge base
The general complexity of today's cloud-based systems is increasing due to the integration of new functionalities and the arrival of new users, which leads to the need for continuous maintenance of datasets. There are common approaches to improve, or at least maintain, reasonable data flow performance, such as migration to new technologies (e.g., from MySQL to NoSQL storage systems), hardware upgrades, and software optimisation approaches. The latter, in terms of the semantic design and optimised access of ENTICE metadata, is the main role of the ENTICE knowledge base. The semantic approach is used not only to describe the data in an interoperable graph-based representative ontology, but also to enrich the relationships among the entities by using reasoning mechanisms.
Through the process of system implementation and integration, new entity data can be seamlessly created, including attributes with constraints, relationships, and rules which have to be fulfilled in order to infer new knowledge. Moreover, the complexity of specific SPARQL queries can be significantly reduced, which leads to smaller query results and faster metadata management for the functioning of the overall ENTICE environment.
The concrete result, which is evaluated further in this study, is the simplification of SPARQL queries by using pre-inferred knowledge created with rule-based reasoning mechanisms.
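The idea of materialising pre-inferred knowledge can be shown with a minimal forward-chaining sketch. The triples and predicate names (`storedAt`, `locatedIn`, `storedInRegion`) are hypothetical stand-ins, not the actual ENTICE ontology vocabulary, and real rule application is done by the reasoner rather than in application code:

```python
# Forward-chaining sketch: pre-infer storedInRegion(image, region)
# from storedAt(image, storage) and locatedIn(storage, region),
# so a later query is a single lookup instead of a two-hop join.
triples = {
    ("vmi-42", "storedAt", "ljubljana"),
    ("vmi-42", "storedAt", "virginia"),
    ("ljubljana", "locatedIn", "EU"),
    ("virginia", "locatedIn", "US"),
}

def apply_rule(triples):
    """Materialise storedInRegion triples implied by the rule."""
    inferred = set()
    for (s, p1, o1) in triples:
        if p1 != "storedAt":
            continue
        for (s2, p2, o2) in triples:
            if p2 == "locatedIn" and s2 == o1:
                inferred.add((s, "storedInRegion", o2))
    return inferred

materialised = triples | apply_rule(triples)
# The "simplified query": regions a given image is stored in,
# now answerable without joining storedAt and locatedIn.
regions = {o for (s, p, o) in materialised
           if s == "vmi-42" and p == "storedInRegion"}
print(sorted(regions))  # ['EU', 'US']
```

Once the inferred triples are stored, the SPARQL query over them reduces to a single triple pattern instead of a join, which is the effect evaluated in the experiments.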

DESIGNING AND IMPLEMENTING THE KNOWLEDGE BASE
In order to be able to use the ENTICE ontology, it is necessary to develop specific mechanisms for managing complex-structured data. In the course of this study, experiments were performed with various degrees of expressiveness to model the relationships among the entities, the use of advanced constraint mechanisms, RDF validation, 43 complex data querying, inferencing new knowledge using reasoning mechanisms, and other approaches.
The primary use of the knowledge base is to provide RDF metadata to all subsystems and services of the ENTICE environment through a main API.
The API is developed in a way that supports various queries and reasoning mechanisms and other aspects, such as security. The knowledge base is also designed in a way that its software components can further scale, e.g., via the use of new container instances.
In order to fulfil some basic interoperability requirements, the knowledge base service must support the exchange of RDF data using simple queries, different reasoners, and rule-based constraint verification to minimise human errors that may occur through the graphical UI when executing write-based requests into the knowledge base or scripts (e.g., functional descriptions of VMI/CIs). Those mechanisms are indirectly described in the following section, which explains how the knowledge base provides metadata to other ENTICE services, particularly the Pareto SLA and Multi-objective optimisation part, with substantial calculation time improvements demonstrated through experimental results.

Design
The ENTICE environment can be seen as a repository-based system that encapsulates a variety of subsystems. Basically, it provides a universal backbone for Infrastructure as a Service (IaaS) VMI/CIs, which supports different use cases with dynamic resource requirements (e.g., running resources for a few seconds or continuously for years) and other QoS requirements. The ENTICE technology is strongly decoupled from the applications and their specifics, such as runtime environments, but continuously supports them through optimised VMI/CI image creation, assembly, migration, and storage. The ENTICE environment inputs are unmodified and functionally complete VMIs or CIs from users. Unlike the other cloud provider environments on the market, our environment transparently tailors and optimises them for specific Cloud infrastructures with respect to their size, configuration, and geographical distribution, such that they are loaded, delivered (across Cloud boundaries), and executed faster, with improved QoS and decreased final cost for the end users.
The proposed high-level architecture is shown in Figure 5 and comprises the following key subsystems:
• knowledge base service, which is the central part of our work presented in this study and supports all the other services of the ENTICE environment with the information needed for strategic and dynamic decision making;
• VMI/CI portal, which is the ultimate graphical UI used by the developer to search for software artefacts across the distributed repositories;
• VMI/CI Synthesis, which facilitates the synthesis of VMI/CIs based on recipes;
• VMI/CI Analysis, which facilitates the optimisation of the size of VMI/CIs;
• VMI/CI Distribution, which facilitates VMI/CI movements and other operations among the potentially unlimited set of geographically distributed repositories, thus optimising VMI/CI delivery time at particular geographic locations and storage costs;
• Multi-objective Optimisation (MO) Framework, which addresses the needs for optimised operation of the overall environment through its MO methods implemented via jMetal and can also support VMI/CI Distribution and Online VMI/CI Assembly;
• Pareto Service Level Agreements (Pareto SLA), which is a method used by the users to negotiate terms of contract with the distributed repositories of VMI/CIs;
• Online VMI/CI Assembly, which is in charge of assembling the VMI/CIs from fragments; and
• VMI/CI Management Template, which represents available Cloud management systems (e.g., OpenNebula and OpenStack) that are deployed in different geographical locations (Slovenia, Hungary, Austria, and the UK) and that the ENTICE environment can access through their APIs. In the future, new Cloud management systems can be added.
As can be seen in Figure 5, knowledge management and information supply play a crucial role in the operation of the overall ENTICE environment.
Due to space limitations, in the following, we focus on the use of the newly developed knowledge base and its reasoning methods in relation to some NP-hard optimisation problems, which are addressed by a combination of MO and the Pareto SLA technique. An NP-hard optimisation problem is, for example, the optimal distribution of VMIs across the distributed repositories, which makes it possible to agree (e.g., via an SLA) on a specific maximum allowed delivery time at a specific maximum allowed cost.
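A toy instance of this placement problem can be brute-forced to illustrate its combinatorial nature. Site names, costs, delivery times, and the bound are invented for illustration; the actual MO framework uses heuristic search rather than exhaustive enumeration, precisely because the number of subsets grows exponentially:

```python
# Toy VMI placement: pick replica sites so that every target region
# can be served within a delivery-time bound, at minimum storage cost.
from itertools import combinations

sites = {"lju": 3.0, "bud": 2.0, "lon": 5.0}  # monthly storage cost per site
delivery = {                                   # delivery time (s) to region
    ("lju", "EU"): 20, ("lju", "US"): 90,
    ("bud", "EU"): 25, ("bud", "US"): 95,
    ("lon", "EU"): 30, ("lon", "US"): 60,
}
regions, max_delivery = ["EU", "US"], 70       # SLA bound on delivery time

def best_placement():
    """Exhaustively try every subset of sites (exponential in general)."""
    best = None
    for r in range(1, len(sites) + 1):
        for combo in combinations(sites, r):
            # Feasible if each region is served within the bound
            # by at least one chosen site.
            ok = all(min(delivery[(s, reg)] for s in combo) <= max_delivery
                     for reg in regions)
            cost = sum(sites[s] for s in combo)
            if ok and (best is None or cost < best[1]):
                best = (combo, cost)
    return best

print(best_placement())  # (('lon',), 5.0)
```

Even in this three-site toy, the cheapest sites alone cannot meet the US delivery bound, so the single more expensive site wins; with many sites, enumerating all subsets quickly becomes intractable.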

Implementation
The ontology of the ENTICE project was developed using the ontology editor Protégé. The ENTICE ontology is implemented in OWL2 44 using the Turtle format, due to its human readability. The knowledge base service was developed using Java-based technologies and frameworks, such as Java Jersey for the RESTful web service, Apache Maven for software management, and Apache Jena Fuseki for serving RDF data. The last was chosen because it supports various powerful reasoners (e.g., Pellet, TrOWL, and ELK), 45 provides a default integrated reasoner, and offers satisfactory performance. 46 During the implementation phase of the project, new mechanisms will be integrated to facilitate the knowledge base service deployment, testing, and security (e.g., Apache Shiro).
The integration process of the ENTICE services followed a well-defined path: (i) identification of services/APIs as presented in the previous chapter. The detailed workflow for the supported use cases was identified through a detailed UML diagram definition, which followed the request definitions supporting the data scheme of the ENTICE ontology. To reduce the risks concerning errors during development, a testing system based on unit tests was introduced into a continuous build integration using Jenkins; 47 for example, at each code update (e.g., git push), the updated ENTICE services were built and the unit tests were executed.
The final stage of the implementation and integration mainly involves implementation of the graphical UI and identification of possible boundary conditions that can affect the system.

MULTI-OBJECTIVE OPTIMISATION FRAMEWORK
This section elaborates the development and integration of the ENTICE multi-objective optimisation framework for optimised VMI distribution. 48 The optimisation framework can be applied on multiple distinct application levels within the ENTICE environment. For the implementation of each application level within the optimisation framework, diverse heuristic tracks have been pursued. Above all, a consolidated service-based application program interface has been provided for easy integration of the framework within heterogeneous cloud environments.

Background
Optimisation is a process of denoting one or multiple solutions that relate to the extreme values of one or more specific objective functions within given constraints. When the optimisation task encompasses a single objective function, it typically results in a single solution, called an optimal solution. Furthermore, the optimisation could also consider several conflicting objectives simultaneously. In such circumstances, the process results in a set of alternative trade-off solutions, so-called Pareto solutions, or simply non-dominated solutions. The task of finding the optimal set of non-dominated solutions is known as multi-objective optimisation. In what follows, an outline of the basic concepts of multi-objective optimisation theory is provided. The set of all optimal solutions in the objective space is called the Pareto frontier. The Pareto front can be considered an efficient tool for aiding the decision-making process. The shape of the front can provide insights which allow the space of non-dominated solutions with certain properties to be explored efficiently, and reveal regions of particular interest which cannot be seen in advance, before the optimisation process has started. Therefore, the users do not need to set their preferences before finding a set of optimal solutions. Furthermore, when considering multiple different Pareto solutions, a specific method is required to compare the quality of the solution sets. A Pareto set is considered to be of good quality if it provides accuracy and diversity. One way of determining the quality of a set of Pareto solutions is the hypervolume. Given a set of trade-off solutions X, the hypervolume HV(X) calculates the area enclosed between the points in X and a given reference point W. This way, the better the points contained in X and the more diverse they are, the higher HV(X) will be.
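For two minimisation objectives, HV(X) can be computed by sweeping the non-dominated points sorted along one objective and summing the dominated rectangles up to the reference point W. The points and reference below are illustrative values only; in ENTICE the indicator is provided by the jMetal framework:

```python
# 2-D hypervolume for minimisation: area dominated by the Pareto
# points X and bounded by the reference point W.
def hypervolume_2d(points, ref):
    """points: non-dominated (f1, f2) pairs; ref: reference point W,
    worse than every point on both objectives."""
    hv = 0.0
    prev_f2 = ref[1]
    for f1, f2 in sorted(points):          # sweep by increasing f1
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

front = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))  # 12.0
```

Adding a better or more widely spread point can only enlarge the dominated area, which is why a higher hypervolume indicates a Pareto set with better accuracy and diversity.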

Designing Multi-Objective Optimisation (MO) framework for ENTICE
VM images are currently stored by cloud providers in proprietary centralised repositories without considering application characteristics and their runtime requirements, causing high deployment and instantiation overheads. Moreover, users are expected to manually manage the VM image storage, which is a tedious, error-prone, and time-consuming process, especially when working with multiple cloud providers. The current state of the art does not provide any substantial means for streamlined adaptation of distributed repositories and efficient utilisation of the storage resources. The vast majority of existing work in this field has focused on optimising the utilisation of computational resources. Regrettably, limited research has been conducted on the management of VM images as essential storage resources in federated environments. Inadequate management of those crucial resources can easily lead to inefficient utilisation and overall degradation of the computational performance of the whole system. In this context, the optimisation of the VMIs' distribution across federated repositories is required both by the applications and by the underlying cloud providers for improved resource usage, operational costs, elasticity, storage use, and other desired QoS-related features. Multiple optimisation requirements have been identified, and suitable strategies have been proposed for overcoming those barriers.
Based on these considerations, the following optimisation modules have been developed: (i) initial VMI distribution, (ii) offline VMI redistribution, and (iii) online VMI redistribution. In this paper, the focus is on (ii) and (iii), because the knowledge base can collaborate in solving these two problems.
ENTICE is developed to support VM image redistribution both offline, as well as online during application execution by proactively moving the most demanded VM image fragments close to the resources the application is currently running on. 49 In the case of online image delivery, ENTICE will automatically discover user demand patterns by analysing the metadata (e.g., the sequence and number of downloads of particular images or fragments) published by the provider-operated repositories (e.g., similar to Glance from OpenStack) and replicate the highly demanded images or fragments according to user demands. For example, if some VM images are always instantiated at a high frequency, they will be placed at other providers where the users might need them. Furthermore, based on the performance requirements, usage patterns, and structure of images or the location of input data, ENTICE (assisted by its knowledge base) will automatically optimise in the background the distribution and placement of VM images to significantly lower their provisioning time for complex resource requests (which can be on the order of hours using today's provider lock-in technologies) and for executing the user applications (thus focusing on their functional use scenarios). The optimisation will consider the requirements of applications built as compositions of VMs and arrange for the simultaneous delivery of multiple VM images to selected clouds, optionally enhanced with application input data. Moreover, the online discovery and assembly of VM image fragments employs the same multi-objective optimisation framework, assembling a VM image in a running VM at the provider with the best trade-off between performance, instantiation overheads (fragment transfers and VM deployment), and execution cost. Additionally, the ENTICE multi-objective optimisation framework has been envisioned to provide a specific module for efficient initial distribution of the VM images across the distributed storage repositories.
The advantage of implementing such a module is twofold: it provides means for a balanced distribution of the VMIs, and it reduces the complexity of selecting an initial storage repository for the VMI's owner.
Afterwards, multiple solution vectors are created and randomly populated with values in the range from one to the number of available storage sites, thus creating the initial population. Each individual represents one possible distribution solution that has to be evaluated. The evaluation of each individual is performed by reading the values stored in the vector fields. Based on these values, starting from every element in the vector, a neighbouring sub-graph is constructed and the appropriate objective functions are applied. The resulting values are then grouped together, and the median value is selected as the overall fitness of the given individual.
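The population initialisation and median-based fitness described above can be sketched as follows (a minimal illustration; the neighbouring sub-graph construction is abstracted into a per-element objective callback, and all names are hypothetical):

```python
import random

def init_population(pop_size, num_images, num_sites):
    """Create the initial population: each individual is a vector assigning
    every VM image a storage site drawn uniformly from 1..num_sites."""
    return [[random.randint(1, num_sites) for _ in range(num_images)]
            for _ in range(pop_size)]

def median(values):
    """Median of a sequence of numbers."""
    s = sorted(values)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def fitness(individual, element_objective):
    """Evaluate one individual: apply the objective starting from every vector
    element (standing in for the neighbouring sub-graph evaluation) and take
    the median of the per-element values as the overall fitness."""
    return median(element_objective(image_idx, site)
                  for image_idx, site in enumerate(individual))
```

In a full genetic algorithm, this fitness would then drive selection, crossover, and mutation over successive generations.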

Online VMI redistribution
One very important aspect that should be considered in federated cloud environments and repositories is the optimisation of specific users' VM images and corresponding data sets while the correlated applications are being executed. Even though the offline VM image redistribution should place the VM images in the optimal storage site, there may be cases where optimisation is required only "locally", for some particular images or data sets. For example, if a user repeatedly deploys a particular VM image within a short period of time, the position where that image is stored can be further optimised based on the newly available data. Consequently, the image can "temporarily" be transferred to a more suitable location for the given scenario. The same principle can be applied to the associated data sets, which can be redistributed "closer" to the physical machines where the VM images are deployed. The online VM image provisioning can be managed using the same methods implemented in the offline VM image redistribution; as both processes are analogous, the only difference lies in the scope and the time interval in which the optimisation is performed.
With the online VM image redistribution, the optimisation is executed only at the user's request, and only on the user's own images. When the user asks for optimisation of a VM image's storage position during deployment, the algorithm is initiated with a limited scope: the input data of the optimisation module is narrowed to the user's images, and the optimisation takes into account only the user's usage patterns within a previously set time interval.
In this way, it becomes possible to further optimise the position of VM images in cases when they are frequently deployed within a short interval of time.
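The scope narrowing described above, restricting the input to one user's images and a recent time window, can be sketched as follows (illustrative Python; the record layout with 'user', 'image', and 'time' keys is an assumption, not the ENTICE schema):

```python
from datetime import datetime, timedelta

def scope_online_input(deployments, user_id, now, window=timedelta(hours=24)):
    """Keep only this user's deployment events that fall inside the time
    window, and collect the distinct images they involve; only those images
    enter the online optimisation."""
    recent = [d for d in deployments
              if d["user"] == user_id and now - d["time"] <= window]
    images = {d["image"] for d in recent}
    return images, recent
```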

REASONING FOR THE MULTI-OBJECTIVE OPTIMISATION
Semantic representation of the data using ontologies and associated knowledge bases can be exploited to identify new knowledge through constraint-based reasoning. Constraint-based reasoning, as a concept, has connections to a wide variety of fields, including formal logic, graph theory, relational databases, combinatorial algorithms, operations research, neural networks, truth maintenance, and logic programming. RDF-based stores can be viewed as a combination of relational databases with formal logic and graph theory. For ontologies such as ENTICE's, which satisfy the syntactic constraints, the most suitable candidates are rule-based reasoners, tableaux reasoners, and ABox query engines. One of the most widely used reasoners for the Web Ontology Language OWL 1 is Jena's default reasoner; however, for processing OWL 2, the Pellet reasoner offers more powerful reasoning capabilities.
The ENTICE knowledge base and its reasoning mechanisms are used in two steps of the MO process: (i) for the online repository redistribution and (ii) for the offline redistribution.

Using the ENTICE knowledge base
In the first step of online distribution, the knowledge base is used as an asset to speed up the execution of the MO algorithm. Due to the need for fast execution to assure a reasonable QoE, the ENTICE environment combines different subsystems and their approaches, in this case RDF-based data retrieval from the knowledge base. More concretely, the knowledge base provides only the data that are reasonable inputs for the MO algorithm, and it constructs these data by taking into account knowledge-base-defined constraints sorted by their relevance, as shown in SPARQL query 1. Other constraints considered at this stage are presented in the following section. The online redistribution flow is depicted in Figure 6.
For the offline redistribution, more data are needed for MO because they derive from different entities that store information about the VMI or CI and the entire life cycle of redistributions, including delivery and deployment times. These two times are used to represent the practical SLA, which can differ from the theoretical one. The approach that provides relevant input data for MO is the following:
1. With a SELECT query, the 20% cheapest repositories with the lowest delivery and deployment times are removed from the further steps.
2. With an SWRL-based rule executed by the Pellet reasoner, property inference is applied to data with the same provenance stored in the same repository, so that the inferred properties are attached to each of those individuals. The rules are mainly applied at the entity property level, using the Assertional Box (ABox) to represent OWL facts.
For example, by using owl:propertyDisjointWith, data are additionally marked as not relevant for additional querying.
3. The VMIs/CIs that should be redistributed are ranked through a SELECT query, where the selection criteria are higher deployment time, delivery time, and cloud SLA. For these experiments, additional QoS metrics (e.g., a constraint that services can be deployed only on one continent) were not applied.
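In ENTICE the filtering and ranking steps above are performed with SPARQL and SWRL; purely to make the data flow concrete, an equivalent sketch in plain Python might look like this (field names are illustrative):

```python
def prefilter_repositories(repos):
    """Step 1: rank repositories by cost and combined delivery + deployment
    time, then drop the best-placed 20% -- their images are already cheap
    and fast to deliver, so they are excluded from redistribution."""
    ranked = sorted(repos, key=lambda r: (r["cost"],
                                          r["delivery_time"] + r["deployment_time"]))
    cutoff = len(ranked) // 5
    return ranked[cutoff:]

def rank_images(images):
    """Step 3: order VMIs/CIs by decreasing deployment time, delivery time,
    and cloud SLA, so the worst-served images are considered first."""
    return sorted(images,
                  key=lambda v: (v["deployment_time"], v["delivery_time"], v["sla"]),
                  reverse=True)
```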
The use of the knowledge base in connection to the MO algorithm is analysed in the following section.

Integration with the MO framework
The multi-objective optimisation framework relies on the users' usage patterns to properly optimise the distribution of the VMIs and associated data sets across the federation. To this aim, the knowledge base is an essential tool that provides crucial information for proper modelling of the usage patterns, thus enabling efficient operation of the optimisation framework. Furthermore, to accurately evaluate the objective functions, the framework requires information on previous data transfers within the distributed repository. In addition, various other parameters, such as storage cost, interconnection bandwidth, and latency, are necessary.
There are two pivotal points of interest in the integration of the knowledge base within the domain of the optimisation framework: (i) provisioning of decision variables and (ii) applying the decision-making policy to the obtained Pareto trade-off solutions.
For the purpose of the presented research work, the multi-objective optimisation framework is viewed in terms of its inputs and outputs, without delving into its internal workings. The interaction between the knowledge base and the MO framework is performed through a RESTful service-based API. As previously described, the multi-objective optimisation framework can be applied on multiple distinct levels; nevertheless, on each level, the provisioning of the input variables is performed in a similar manner. When the optimisation process is initiated, the framework sends a query for the required data to the knowledge base. The knowledge base has been constructed in such a way as to provide only the relevant input data, as described in the following subsection and substantiated through the presented evaluation.
The flow of interactions between the knowledge base and the optimisation algorithm, in the cases of online redistribution, is depicted in Figure 7.
The entities involved in the process are (i) the client, who initiates the upload process; (ii) the knowledge base service, which manages the data; (iii) the MO service, which computes the Pareto front; and (iv) the Image redistribution system, which actually performs the requested action. The process starts when the client requests an upload of a VMI/CI to the ENTICE federated repository, specifying the non-functional properties of the image, such as geographic location (e.g., preferred location and political legislation). Then the knowledge base serves to the MO service a subset of the repository metadata, depending on the constraints selected by the client in the request. The MO service then computes the Pareto front and returns the result to the client through the knowledge base service. The client selects a desired cost/performance trade-off from the Pareto front of optimal solutions (i.e., selects the single point from the set that best matches the client's requirements) and passes the information to the Image redistribution service through the knowledge base service, which then uploads the VMI/CI. Finally, the Image redistribution service notifies the client upon the success of the upload action.

FIGURE 7 Interactions between the knowledge base and MO framework for online redistribution

FIGURE 8 Interactions between the knowledge base and MO framework for offline redistribution

The offline redistribution flow, as shown in Figure 8, involves the same entities as the online redistribution, albeit with a different aim: to redistribute VMIs/CIs residing in the ENTICE federated repositories in order to minimise the overall cost while preserving the performance (e.g., image deployment times). The major difference compared to the online redistribution lies in steps 2 to 5, which include several metadata exchanges between the knowledge base service and the MO service.
Basically, the knowledge base service triggers the MO service with a VMI/CI redistribution request, which returns the list of provenance metadata. The provenance metadata contains the records of previous redistribution executions of the fragments, comprising deployment times of images to the clouds, delivery times of images from one repository to another, and timestamp information. The MO service uses this information to calculate the new candidate redistribution placements and passes them to the client through the knowledge base service. After the client confirms the new redistribution placement, the Image redistribution service is notified through the knowledge base service and starts the redistribution of images. Finally, the Image redistribution service notifies the client upon the success of the redistribution action.

EXPERIMENTAL EVALUATION
The performance and behaviour of the knowledge base, when providing the input data for the MO framework, have been evaluated implicitly by assessing and analysing the outcomes of the multi-objective optimisation framework in different simulation scenarios. Moreover, a comprehensive examination was conducted to determine the dependencies between the aforementioned modules. Essentially, the experimental evaluation provided crucial insights into the influence of the knowledge base reasoner on the efficiency of the optimisation process. Lastly, a broad analysis was performed in isolation, on both the knowledge base and the optimisation framework, to determine the most suitable execution parameters for both modules.
The evaluation activities were conducted on the ENTICE test bed environment, which is distributed across multiple locations in Europe. More concretely, for the purposes of this research work, the multi-objective framework was deployed at the University of Innsbruck on a machine with a Xeon processor with 8 logical cores per socket, 8 GB of RAM, and 2 x 2 TB of storage. The physical interconnection between the two sites was established over the Internet, while the logical communication between the processes was based upon RESTful and SOAP services.
To begin with, both the online and offline redistribution modules share the same heuristics, thus allowing a unified assessment of the most suitable execution parameters. It is therefore essential to properly evaluate the optimisation framework and thus enable the specification of the proper inputs for the reasoning mechanism behind the knowledge base. Table 2 provides a comprehensive examination of the quality of the Pareto optimal set of solutions and the execution time required by the optimisation framework as a function of the number of evaluations within the genetic algorithm. To properly assess the quality of the Pareto solutions, a comparison is presented against a set of mapping solutions determined by a "round robin" mapping model for storing VMIs in the ENTICE federation. The statistical significance of the results was analysed by applying an ANOVA test, which showed a significant difference between the proposed algorithm and the "round robin" mapping strategy, with respect to both the cost and the performance objective. The cost objective was calculated based on Amazon's publicly available price list for storing data in the cloud. The performance objective was modelled based on the reported communication performance measures for 10 and 1 Gbit Ethernet. 50 For readability reasons, the bandwidth values were converted to the delivery time needed to transfer 1 Mbit of data from the source to the destination. The optimisation framework was configured to search for a set of optimal trade-off distribution solutions for a problem size of 1000 VM images.
The experimental results clearly show that the number of evaluations within the genetic algorithm has a substantial impact on the execution time and the quality of the solutions. For example, increasing the number of evaluations from 10 000 to 100 000 can lead to 140% better quality at the price of 700% longer execution time. It can therefore be deduced that for online VMI redistribution, which requires real-time optimisation, it is essential to select the minimal number of evaluations that guarantees satisfactory quality of the solutions. By contrast, the offline VMI redistribution is not time-dependent, which implies that a higher limit on the number of evaluations can be specified.
Once proper execution parameters for the multi-objective optimisation framework have been determined, it is possible to proceed with the evaluation of the knowledge base and its role in the reduction of the optimisation search space. Optimisation problems are typically constrained by some bounds. Constraints divide the search space into two distinct regions: feasible and infeasible. The stage at which the constraints are applied can have a great effect on the computational performance of the algorithm. If the constraints are applied after the evaluation of the solutions, they induce unnecessary computational overhead. The knowledge base provides means for setting the constraints in advance and reducing the input data set before the evaluation of the solutions, thus allowing higher computational efficiency. Some of the constraints that are currently considered are:
• the actual free space of the (private) repository,
• the geographical distance, determined from the user's IP address or selected manually by the user via a preferred continent (e.g., targeted audience), and
• the SLA, which is taken into account based on the user's requirements.
Table 3 presents the correlation between the execution time, the quality of the online redistribution, and the input data set provided by the ENTICE knowledge base reasoner. The experiments were conducted on a set of 500 VMIs, with 1000 repetitions of the optimisation algorithm and a population size of 50 individuals. For each scenario, the experiment was repeated 10 times.
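Applied before the genetic search, the three constraints listed above reduce to a simple feasibility filter over the candidate repositories (a sketch under assumed field names, not the actual knowledge base query):

```python
def feasible_repositories(repos, image_size, preferred_continent, min_sla):
    """Keep only repositories that have enough free space, sit on the user's
    preferred continent, and meet the minimum SLA; infeasible sites never
    reach the optimisation algorithm."""
    return [r for r in repos
            if r["free_space"] >= image_size
            and r["continent"] == preferred_continent
            and r["sla"] >= min_sla]
```

Filtering here, rather than penalising infeasible individuals during evaluation, is what avoids the unnecessary computational overhead mentioned above.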
In the case of online VM image redistribution, the number of redistributed VM images is usually fixed, thus limiting the opportunities for reducing the search space. This implies that the knowledge base can only constrain the optimisation by reducing the number of possible repository sites.
Note: The execution times and spread are given as median values over fifteen distinct executions per experiment.

FIGURE 9 Online and offline redistribution: execution times and quality of the solution on different data set subsets

This process results in a lower execution time of the optimisation process by up to 8%, which can be essential for real-time applications. Additionally, reducing the search space induces lower query times for the data access and a reduced network bandwidth. Furthermore, to guarantee the quality of the solutions, the hypervolume of the optimal solution was measured and compared across different input data sets. From the analysis of the hypervolume values, it can be concluded that there is no statistically significant difference between the distributions, implying that the quality of the solutions is not affected by reducing the input data set, except for the case in which the data set was reduced to 20% of the original size.
Lastly, Table 4 shows the correlation between the execution time, the quality of the offline redistribution, and the input data set provided by the knowledge base reasoner. The experiments were conducted on a varying set of VM images determined by the reasoner. The genetic search algorithm was executed with a population size of 100 individuals until it reached 10 000 distinct evaluations. For each scenario, the experiment was repeated 10 times.
The offline VM image redistribution is usually conducted across all repositories that fulfil the relevant SLA criteria. This implies that the knowledge base can apply reasoning in advance to constrain the number of VM images that need to be redistributed, leading to a significant reduction of the search space. This process results in a lower execution time of the optimisation framework by up to 68%. It should be noted that in the case of offline redistribution, it is difficult to measure the quality of the solutions directly. The main reason for this limitation is that every reduction of the number of redistributed images results in a smaller size of each individual in the population, making the hypervolume unsuitable for comparison. To overcome this limitation, the quality of the Pareto solutions was compared based on the spread of the individual solutions relative to a given centroid. From the analysis of the spread values, it can be concluded that there is no statistically significant difference between the distributions, implying that the quality of the solutions is not affected by reducing the input data set, as is clearly shown in Figure 9.
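The spread measure used here can be sketched as the mean distance of each solution to the set's centroid (one plausible formalisation; the text does not give the exact formula):

```python
import math

def centroid(points):
    """Component-wise mean of a set of objective vectors."""
    n, dims = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def spread(points):
    """Mean Euclidean distance from each solution to the centroid; unlike the
    hypervolume, this remains comparable when fronts differ in size."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)
```

Because the measure is an average per solution, two Pareto sets of different cardinality can still be compared on equal footing, which is exactly the property the hypervolume lacks in this setting.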

CONCLUSIONS
This study presents viable usage scenarios for knowledge management in the cloud computing domain in general, with an application to the area of distributed VMI/CI storage repositories. It is shown that semantics can be used to facilitate a faster optimisation process and the management of complex non-functional requirements, including the QoS and QoE requirements of software engineers and applications' end users. This work complements recent research and innovation projects that use semantic technologies in the cloud computing domain, including the Smart Cloud Engine, SWITCH, and mOSAIC projects. However, all existing developments concentrate on using semantics for the runtime of cloud applications and services, not for the storage of software components.
The multi-objective optimisation problem addressed by this study is well known to be NP-hard. 51 This poses significant computational complexity as the input data size to the MO solver increases. In such circumstances, it is shown that the knowledge management and reasoning approach developed in this study can effectively reduce the input for the optimisation algorithm, which in turn reduces the total computational time. The use of the approach leads to more efficient operation of the ENTICE environment and better resulting performance.
Another area where the ENTICE knowledge and information management approaches are useful is the actual cloud application design stage. The software engineer may directly pose queries to the ENTICE knowledge base and receive guidance, leading to a more informed selection of software components (VMI/CIs) and better overall quality and self-adaptive properties of the resulting cloud application.
The lesson learned from this work is that a knowledge management approach can be very instrumental when dealing with heterogeneous federated cloud environments. This area poses some new challenges for the use of semantics, such as the need for greater expressiveness and for more complex reasoning mechanisms (beyond constraint-based reasoning).
An important area that has not been addressed in the present study is the possibility of geographically distributing the ENTICE knowledge base similarly to the federated storage. The overall ENTICE environment is designed to be distributed and non-centrally managed. Hence, our future goal is to improve the design of the knowledge base so that centralised RDF storage and management will not be needed.