Category Theory Framework for Variability Models with Non-Functional Requirements

Software Product Lines (SPLs) make use of Variability Models (VMs) as an input to automated reasoners, which are mainly used to generate optimal product configurations according to certain Quality Attributes (QAs). However, VMs, and more specifically those including numerical features (i.e., NVMs), do not natively support QAs, and consequently neither do the automated reasoners commonly used in variability resolution. Such satisfiability and optimisation problems have nevertheless been covered and refined in other relational models such as databases. Category Theory (CT) is an abstract mathematical theory typically used to capture the common aspects of seemingly dissimilar algebraic structures. We propose a unified relational modelling framework subsuming the structured objects of VMs and QAs and their relationships into algebraic categories. This abstraction allows a combination of automated reasoners over different domains to analyse SPLs. Solution optimisation can then be natively performed by a combination of automated theorem proving, hashing, balanced trees and chasing algorithms. We validate this approach by means of the edge computing SPL tool HADAS.


Introduction
Variability Models (VMs) [22] are used in highly configurable systems to represent their common and variable features, usually as a rooted tree graph together with some constraints. These models define two types of constraints: the hierarchical ones or tree constraints, and the cross-tree constraints, where the absence or value of some features instantiates or precludes other features (e.g., feature A implies/excludes feature B). Variability models are the key asset in Software Product Lines (SPLs) [31], where valid configurations are generated by reasoners called solvers, such as Choco [21] and Z3 [11], that take into account some external requirements. The most used VMs are Feature Models (FMs), but our problem formulation is agnostic of the VM type, so we refer to VMs throughout the rest of the paper.
One of the most valuable uses of VMs is the generation of optimal solutions [8] based on Quality Attributes (QAs) or Non-Functional Requirements (NFRs), e.g., maximise performance or minimise energy consumption [27]. This becomes a tough issue when tackling emergent domains characterised by intensive variability, such as Internet of Things (IoT) and/or Edge Computing (EC) systems [32], which present variations in the hardware (e.g., sensors and edge devices), communication network (e.g., WiFi, BLE), application (e.g., filtering, mixing, collecting tasks) and infrastructure (e.g., virtualization) dimensions. In these application domains, one possible scenario is to use VMs to specify the variability dimensions and use a solver to generate optimal application deployments in certain IoT/EC environments considering certain NFRs, such as latency or energy consumption. However, VMs cannot natively represent non-functional properties, especially if we need to express a relationship between one product and an NFR measured with a quality metric represented as a measurement function [16]. For example, the feature 'WiFi' of an IoT device consumes more or less energy depending on another feature, the 'distance' to the Edge or Cloud device. The same holds for the automated reasoners of VMs, which consider neither NFRs nor quality metrics as a built-in characteristic.
This problem has been tackled in different ways in recent years. For instance, Extended VMs [5] extend features with attached attributes, which are used to indicate a QA value (e.g., energy consumption, latency) of that specific feature. For example, we can only express that the 'WiFi' feature consumes 'x' Joules or has a latency of 'y' Seconds, where Joules and Seconds are attributes. Extended VMs cannot represent that a certain QA is measured as a function of several features, yet QAs usually depend on several features representing a complete running product [30]. Another approach is to have independent VMs extended with a set of variables representing QA measurements and many cross-model constraints as part of a constraint satisfaction problem [20], but not as part of the VM itself. This results in improper semantics and in an overload of variables and constraints. A hybrid model that rudimentarily links a VM with a QA database is our previous work HADAS [29]. But again, managing two different, interconnected models as well as two independent reasoners (i.e., Choco/database) is complex and computationally expensive.
The goal of this work is to extend the core definition of VMs with NFRs associated with product solutions, so that we can reason about and generate optimal solutions that fit certain QAs. We propose Category Theory (CT) as a means to abstract and unify dissimilar relational models. We present a CT framework aiming to represent that "each SPL product, which is defined as a set of 'n' features, is related to a set of QAs with concrete values that fulfil certain NFRs". In our CT framework, relational models are specified as objects and their relationships. As a result, we have unified in a category both VMs and the definition of NFRs and QA metrics as measurement functions.
In IoT/EC, it is common that some features (e.g., different message sizes) are numerical; different numerical values can influence the energy consumption or the computation time NFRs. Therefore, our CT proposal aims to represent and reason about Numerical VMs (NVMs) with little effort, something that is not straightforward with traditional VMs [28]. This is possible only if numerical features are part of the variability tree hierarchy when generating valid products. Unlike many of the existing Boolean VMs (e.g., FeatureIDE, Glencoe, UVL), NVMs (e.g., Clafer [2], Z3 [11]) additionally support numerical features and the relationships between them (i.e., variables and equations). However, the limitations of NVM solvers have prevented software developers from intensively modelling numerical features [28]. Our contributions are:
1. A unified CT framework to model NVMs and QAs with NFRs, and their relationship, to generate products as solutions with sufficient quality.
2. As a proof of concept, we transform HADAS [29], an SPL to reason about the energy consumption of IoT/Edge applications, into a category. We perform optimisation analyses with a combination of different reasoners, including a theorem prover and relational search algorithms, which together act as a single reasoner able to reason about both VMs and quality metrics at the same time.
This paper is organised as follows. Section 2 focuses on VMs and QAs, while Section 3 defines CT and presents the framework to subsume VMs and QAs into categories. In Section 4 we test our approach by transforming an SPL tool into a category, and then reasoning about an EC case study. Section 5 reviews and discusses the pertinent related work, and Section 6 ends with a summary highlighting the next steps of this research.

Motivation
Our goal is to use CT to define a joint model encompassing variability and NFR modelling to reason about solutions that satisfy certain quality attributes. In this section we discuss some background on both variability and NFR modelling. The third part of our proposal, CT, is explained in detail in Section 3.

Variability Modelling
Feature-oriented Domain Analysis (FODA) was the first formalisation of variability modelling and reasoning [22] as FMs. FMs are used to model commonality and variability, and external solvers are used mainly to automatically generate the product variants. FMs are represented as a rooted tree graph (one parent, many children) composed of features as Boolean variables, and relationships (see Figure 1). Relationships among features are specified in propositional logic, including tree (e.g., And, Or) and cross-tree constraints. Consequently, it is possible to reason about FMs as a Boolean satisfiability (SAT) problem [6].
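To make the reduction of an FM to a satisfiability problem concrete, the following minimal sketch encodes a toy feature model as propositional constraints and enumerates its valid configurations by exhaustive search. The features (Root, A, B, C) and their constraints are hypothetical and are not the model of Figure 1; a real solver would replace the brute-force loop.

```python
from itertools import product

# Hypothetical toy feature model (not the one in Figure 1):
# Root is mandatory; A and B form an Or-group under Root; C is optional.
FEATURES = ["Root", "A", "B", "C"]

def is_valid(cfg):
    """Return True if the Boolean assignment satisfies every constraint."""
    root, a, b, c = (cfg[f] for f in FEATURES)
    tree = [
        root,                    # the root is always selected
        (not a) or root,         # a child implies its parent
        (not b) or root,
        (not c) or root,
        (not root) or (a or b),  # Or-group: at least one of A, B
    ]
    cross_tree = [
        (not c) or a,            # C implies A
        not (b and c),           # B excludes C
    ]
    return all(tree) and all(cross_tree)

def valid_configurations():
    """Enumerate all satisfying assignments by exhaustive search."""
    for values in product([False, True], repeat=len(FEATURES)):
        cfg = dict(zip(FEATURES, values))
        if is_valid(cfg):
            yield cfg

solutions = list(valid_configurations())
```

Each entry of `solutions` is one valid product of the toy model. Exhaustive enumeration grows exponentially with the number of features, which is why SAT solvers are used in practice.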
Several application domains, such as our IoT/EC illustrative example, require additional constructs that traditional feature models do not include. In fact, more than 45 extensions have been proposed for different needs [7], one of them being the NVM [28], which represents systems that also contain numerical features along with arithmetic cross-tree relationships (e.g., the automated reasoner GreenScaler [9]). NVMs consider both Boolean (B) and discrete numerical domains such as the Integers (Z). As the rooted tree graph at the top of Figure 1 depicts, an NVM supports the variables and operations of those domains together, allowing Boolean conditions and arithmetic to be mixed (e.g., FeatureY1 * 3 ≥ 9 → FeatureA, where FeatureY1 is an integer (Z) feature and FeatureA is a Boolean (B) feature). Another extension to FODA that we also consider in this work (see Figure 1) is the specification of the exact number of children features (i.e., feature cardinality [10]). One more extension required by variability-intensive systems is the sub-tree labelling presented at the top of Figure 1, which allows: (1) variability tree composition of layered NVMs [29], (2) partial instances [2], (3) cloneables like T3 [14], and (4) intrinsic hierarchy between trees [17]. In summary, VMs cover the functional requirements, that is, the actions that a system must be capable of doing. However, optimisation analyses of SPLs require NFRs, the non-behavioural aspects under which the system must operate [15], and, as stated in the introduction, these are not natively included as part of VMs.
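As an illustration of how an NVM mixes Boolean and integer domains, the sketch below checks a constraint of the kind quoted above (an integer feature multiplied by 3 being at least 9 implies a Boolean feature) over a small finite domain. The domain bounds and the dictionary-based encoding are assumptions made for the example, not part of any NVM tool.

```python
from itertools import product

# Hypothetical domains: FeatureA is Boolean, FeatureY1 ranges over a small
# finite slice of the integers so that the search space stays enumerable.
DOMAINS = {
    "FeatureA": [False, True],
    "FeatureY1": range(0, 6),
}

def satisfies(cfg):
    """Mixed constraint: FeatureY1 * 3 >= 9 implies FeatureA."""
    return (not (cfg["FeatureY1"] * 3 >= 9)) or cfg["FeatureA"]

def solutions():
    """Enumerate every assignment over DOMAINS that meets the constraint."""
    names = list(DOMAINS)
    for values in product(*(DOMAINS[n] for n in names)):
        cfg = dict(zip(names, values))
        if satisfies(cfg):
            yield cfg

valid = list(solutions())
```

The only assignments rejected are those where FeatureY1 is 3 or more while FeatureA is deselected. A numerical solver would handle unbounded or much larger domains symbolically instead of by enumeration.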

Non-Functional Requirements Modelling
For this work, we define a QA model (QAM) as any model that specifies and hosts QAs, expressed as name-domain-metric triples, together with NFRs. Quality models [1] are a broader type of model not used in this work. QAs whose values can be quantified, such as Performance and EnergyConsumption at the bottom of Figure 1, can be modelled as a set of measurements. To reason about the quality of a certain VM solution, these measurements need to be linked to the variability model. At the bottom of Figure 1 there is an example of a user NFR, "Performance < 10 Seconds", for the defined QA (e.g., performance measured as execution time). To clarify, the QAM in the example potentially encodes the total performance and energy consumption of each possible valid solution (i.e., product).
There is no consensus on how QA measurements should be linked to features in a VM; there are two main approaches. In the first, measurements are linked to individual features (i.e., each feature contributes individually to the system QA). The second, which is in line with our approach, considers that the set of measurements of a QAM should be univocally linked to a valid VM solution.
The first specific SPL solution for QAs is the Extended Variability Model [5], already cited in Section 1, where the concept of feature in FODA is extended with attributes. Attributes have a name and a domain, and they are linked to individual features. This is useful if we consider that a feature can be assessed by a single quality measurement (e.g., an encryption code consumes 1.3 Joules). However, a quality usually cannot be assessed from a single feature. For example, to adequately assess the energy consumption of an encryption code, we need to specify several features modelling the different key sizes, modes and paddings [25]. Therefore, we cannot model the energy consumption of an encryption code with an attribute; we need a way to link a complete solution, for example one composed of the features AES algorithm, Mode CBC, NoPadding and key size = 256, to an energy measurement. Another argument is that with this approach we can only assess the overall quality of the system as a simple direct addition of individual QAs. But, first, not even linear equations fit real-world QA metrics, and second, the process of fitting a function to a set of measurements is computationally costly, not immutable, and probably inaccurate on average [34].
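The difference between per-feature attributes and per-solution measurements can be sketched as a lookup keyed by the complete solution. In the minimal example below, the helper names and the energy values are invented placeholders for illustration only; they are not measurements from HADAS or from [25].

```python
def solution(**features):
    """Build a hashable, order-independent solution from feature=value pairs."""
    return frozenset(features.items())

# Each complete solution is the key; the QA value is looked up as a whole,
# not summed from per-feature attributes.
# The numbers are invented placeholders, not real measurements.
energy_joules = {
    solution(algorithm="AES", mode="CBC", padding="NoPadding", key_size=256): 2.0,
    solution(algorithm="AES", mode="CBC", padding="NoPadding", key_size=128): 1.5,
}

def energy_of(sol):
    """Return the measured energy, or None if the solution was never measured."""
    return energy_joules.get(sol)

# The order in which features are listed does not matter.
query = solution(key_size=256, padding="NoPadding", mode="CBC", algorithm="AES")
measured = energy_of(query)
```

A partial selection such as `solution(algorithm="AES")` has no entry, which reflects the point made above: the measurement belongs to the whole product, not to any single feature.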
Another common approach to modelling QAs in SPLs is a balanced tree graph, akin to hierarchical activity/data models, that describes metrics top-down [15]. However, hierarchical trees cause extreme repetition, as each solution must be intrinsically modelled in order to connect it to its NFRs [17]. To overcome this limitation, multi-NVMs interconnected with a set of cross-model constraints have been proposed [20]. However: (1) hierarchical trees are useless for optimisation-type metrics (e.g., energy consumption constraining runtime metrics), and (2) so many cross-model constraints complicate the model while decreasing reasoning performance. Moreover, most of these solutions are not directly compatible with automated reasoners.
We use the HADAS tool [29] as a running example, where an NVM defines the system components and an entity-relationship schema defines the QAM. The Clafer [2] and database reasoners are connected through a direct solution-to-QAs mapping procedure, which allows automatic reasoning over that hybrid model. Database functionality, such as querying in batches or random sampling, offers potential advantages for SPL analyses. However, the drawback of maintaining two separate and different models, and the computational overhead of two co-existing reasoners, tip the balance negatively for very large NVMs. Hence, SPL reasoning lacks a unified model that appropriately supports Boolean and numerical variability with non-functional metrics.
Every alternative contains a high degree of interlocking relationships; in other words, we are dealing with relational models. While they originally dealt with different problems and developed different methods, there are overlaps, that is, different methods to solve the same problem. Still, each alternative has specific limitations [19]. In contrast to these approaches, we propose to abstract SPL systems into a single relational modelling framework, where a unified semantics can jointly define seemingly dissimilar structures and the connections between them.

Category Theory for Software Product Lines
In this section we give a light-weight description of Category Theory (CT) and the way we use it as a unifying modelling and reasoning framework. For a deeper introduction to CT, we refer the interested reader to, e.g., [3].
Category Theory is a general mathematical theory of algebraic structures that allows the common aspects of different structures to be captured and related, while abstracting from the individual specifics. Informally speaking, a category C is any collection of objects representing spaces that can be related to each other via arrows (i.e., morphisms). Two standard examples are the categories: (1) Vec with objects vector spaces and arrows linear maps, and (2) Bool with objects Boolean algebra entities and arrows first-order logic. CT is built from the following main concepts:
- Object: a structured class X ∈ Ob(C), graphically depicted as a node •X.
- Arrow: a structure-preserving function a ∈ Arr(C) with source and target objects X = src(a) and Y = tgt(a), respectively, depicted as a labelled arrow X --a--> Y.
In addition, we shall need the following concepts and terminology, borrowed from a CT framework for algebraic data integration [33]:
- Path: a concrete sequence of arrows.
- Element: one of the distinct components x ∈ X that belong to an object X. Its domain is defined with an arrow, for example x --Integer--> Z.
- Instance: a set-valued functor that populates Ob(C) with elements.
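The concepts listed above can be sketched as a small data structure. The following is a minimal illustration only: the class names, the example objects and the arrow names are hypothetical, and identity arrows and the composition laws of a full category are deliberately omitted.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Arrow:
    name: str
    src: str  # source object
    tgt: str  # target object

@dataclass
class Category:
    objects: set = field(default_factory=set)
    arrows: dict = field(default_factory=dict)  # arrow name -> Arrow

    def add_arrow(self, name, src, tgt):
        """Register an arrow; both endpoints must already be objects."""
        if src not in self.objects or tgt not in self.objects:
            raise ValueError("both endpoints must be objects of the category")
        self.arrows[name] = Arrow(name, src, tgt)

    def is_path(self, names):
        """A path is a sequence of arrows whose endpoints chain together."""
        steps = [self.arrows[n] for n in names]
        return all(a.tgt == b.src for a, b in zip(steps, steps[1:]))

# Hypothetical example: two tree objects and a data-type object.
c = Category(objects={"T1", "T2", "Integer"})
c.add_arrow("child", "T1", "T2")
c.add_arrow("value", "T2", "Integer")

# An instance populates each object with a set of elements.
instance = {"T1": {"root"}, "T2": {"size"}, "Integer": {64, 128}}
```

Here `c.is_path(["child", "value"])` holds because the target of `child` is the source of `value`, whereas the reversed sequence is not a path.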
In the following subsections we illustrate with intuitive examples how to represent NVMs and QAMs as related categories. In summary, each model will be represented as a category whose objects are variability trees (for NVMs) or metric sets (for QAMs), and relationships will be represented as arrows. This will allow us to generate joint solution spaces (i.e., SPL products with their QAs) with any automated reasoner for any type of model.
The NVM from Figure 1 is transformed into the category NVM, depicted in Figure 2. Since the NVM is a composition of trees, Ob(NVM) is a set of four variability trees: T1 and T2, which have numerical features, and T3 and its clone T3*, which have Boolean features. Arr(NVM) is the set of relationships in NVMs: hierarchy (i.e., Parent/Child), cardinality, and Boolean and arithmetic cross-tree constraints. A tree trace is a path in NVM, and an instance populates the NVM features with values. The basic data-type objects are the library types of programming languages. In summary, the category NVM is Ob(NVM) ∪ Arr(NVM). While the example objects are mono-type, multi-type objects are supported.

Category of Quality Attributes Model (QAM)
The QAM from Figure 1 is transformed into the category QAM, depicted in Figure 3. Since the QAM is a set of QAs, Ob(QAM) consists of just a single object, QA, and Arr(QAM) represents the data-type and the NFRs. Its elements measurement ∈ MS are name-domain-metric triples, where the arrow is measurement --metric--> String × Z × String. Now we connect NVM and QAM.
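A measurement element and an NFR arrow over it can be sketched as a typed triple plus a predicate. In this minimal example the class and function names are hypothetical, and the two measurement values are placeholders; only the NFR "Performance < 10 Seconds" comes from the running example.

```python
from typing import NamedTuple

class Measurement(NamedTuple):
    """An element of the QA object, typed as String x Z x String."""
    name: str    # e.g. "Performance"
    value: int   # value in the integer domain Z
    metric: str  # e.g. "Seconds"

def performance_nfr(m: Measurement) -> bool:
    """NFR from the running example: Performance < 10 Seconds."""
    return m.name == "Performance" and m.metric == "Seconds" and m.value < 10

# Placeholder values for illustration only.
ms = [
    Measurement("Performance", 7, "Seconds"),
    Measurement("Performance", 12, "Seconds"),
]
accepted = [m for m in ms if performance_nfr(m)]
```

Only the first measurement satisfies the NFR, so `accepted` contains a single element.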

Solution Space Categories Isomorphism
While we have unified the models, we have not yet covered how to connect a specific set of features with its specific set of QAs and values. NVM and QAM are solution-space related categories, as illustrated in Figure 4. Each solution
The basic automated reasoners for categories are mathematical theorem provers; however, they are typically supported by other optimisation engines for specific tasks, such as a Knuth-Bendix completion prover with a Chase searching algorithm [33]. Some CS_x solutions do not correspond to MS_x measurements (Figure 4). The reason is that not every system has been measured. Hence, MNVM is the sub-category of measured NVM, where the CS object has a bijective (i.e., one-to-one) arrow cs/ml to the ML object of QAM. Consequently, there is an isomorphic functor [18] between MNVM and QAM.
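The restriction to measured solutions and the one-to-one correspondence it yields can be checked mechanically. The sketch below uses hypothetical identifiers (cs1, ms1, and so on) and plain dictionaries in place of arrows; it only illustrates the bijection argument and is not how a CT reasoner represents functors.

```python
def is_bijection(mapping, targets):
    """True if `mapping` hits every element of `targets` exactly once."""
    images = list(mapping.values())
    return len(set(images)) == len(images) and set(images) == set(targets)

# Hypothetical identifiers: three valid solutions, two of them measured.
cs = {"cs1", "cs2", "cs3"}
ms = {"ms1", "ms2"}
cs_to_ms = {"cs1": "ms1", "cs2": "ms2"}  # cs3 has not been measured

# Elements of the measured sub-category: solutions that have a measurement.
measured = set(cs_to_ms)

# The inverse arrow exists because the mapping is one-to-one and onto.
ms_to_cs = {m: s for s, m in cs_to_ms.items()}
```

Restricting to `measured` is what makes the mapping invertible: on the full set `cs` the arrow would be partial, since `cs3` has no image.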

Validation and Discussion
To validate our framework, we deploy a CT prototype of a running SPL tool: HADAS [29], the NVM and QAM optimisation assistant for Edge Computing.
In HADAS, edge devices are defined in a composed Clafer NVM, as illustrated in Figure 5 on the left, with two main trees: (1) Hardware and (2) Software. The latter is composed of four trees: Operating System (OS), Programming Language (PL), Operation and Context. In turn, Context is composed of the Libraries and Numerical Parameters trees. All the trees have Boolean features, except Numerical Parameters, which only contains Integer features (e.g., Encryption Key: 64 bytes [27]). The HADAS QAM is a relational database that links NVM solution-tree leaves with QAM dynamic identifiers (since parent features are irrelevant when traceability is considered).
In Figure 5 we can see on the left the HADAS base NVM and QAM structures, and on the right its unified category HADAS obtained by means of the framework presented in Section 3. Our framework is as flexible as CT; existing models can be transformed into categories differently and yet perform equally. For example, an object could be modelled as a category with a single object and vice versa. Our philosophy in this proof-of-concept is to keep the category simple; hence, we combined NVM and the single-object category QAM into HADAS, where the technical implication is replacing the functor between categories with an arrow between objects. In summary, the data-types, NVM trees and QAM amount to 12 HADAS objects, and the variability-tree relationships, cross-tree constraints and NFRs amount to a minimum of 6 arrows. HADAS consists of the following components:
- Elements: based on Arr(HADAS), there are Boolean and integer features, QA metrics with format name-domain-metric, QA metadata with string domain, and solutions as sets of object feature leaves corresponding to QAs (i.e., the HADAS solution space).

Optimal deployment
The next step in this proof-of-concept is to instantiate (i.e., populate) HADAS, to later generate the solution space (e.g., application deployments in IoT/EC environments), and optimise its QAs. EC and IoT systems require fast real-time processing of random amounts of data, and have relatively strict NFRs on performance and energy consumption [32]. Hence, we propose to turn the model shown in Figure 6 on the left into a category, aiming to gain insight into which features and solutions affect those QAs in transmitting and/or compressing operations. The NVM contains 28 Boolean features and two numerical features, while the QAM contains two QAs: performance in Seconds and energy rate in milliWatts. Operations are partially configurable benchmarks of the Phoronix Test Suite.
Having a clear picture of the category base model in Figure 5, we need to program and deploy it. While there are libraries aiming to add CT support to SPL reasoners (e.g., Conal Elliott's libraries for Z3 [12]), the only production-ready Integrated Development Environment (IDE) is the Categorical Query Language (CQL) IDE, an open-source software commercialised by Conexus AI. It is a canonical functional IDE that generates CT graphs such as those in the presented figures.
On the right of Figure 6 there is a small snapshot of the code; the complete CQL model can be downloaded from the HADAS server. There one can find the 30 NVM features and the 2 QAs distributed in the HADAS objects shown in Figure 6 on the right. While cross-tree constraints exist in the NVM, we did not include them in the graph due to space limitations; however, they can be found as cross-object arrows in HADAS.

Results and Discussion
CQL IDE reasoning is performed automatically with a combination of different algorithms. The ones that apply to our work, in order of usage, are: an automated theorem prover with Knuth-Bendix completion [26] for logic and equations, and hashing, balanced trees and chasing for data-type and cross-object arrows.
We obtained 162 valid solutions with their respective 324 measurements (two QAs per solution) in 0.1 Seconds. If we reduce the category, the runtime is still 0.1 Seconds. Extending the category to a supra-category formed by a self-cross-product 3 times results in 0.2 Seconds. Running CQL IDE on another computer did not change the runtimes. This suggests that CQL IDE scales linearly, and that the minimum runtime is 0.1 Seconds independently of the computer, probably because it is a Java application running on a Java virtual machine.
Optimisation arrows are a step beyond the solution space. Maximising performance or minimising energy rate increases the reasoning runtime by 0.1 Seconds, independently of the solution space size. However, we expect linear increments for larger models (i.e., linear scalability). Regarding the interaction of features with the QAs in our EC case study, the main insights are:
- Compressing/uncompressing while sending/receiving data improves the runtimes for large batches of data, but for small ones it is the opposite, independently of the original data size. In any case, compressing increased the energy rate (more Joules per Second).
- The more powerful the CPU, the shorter the compressing time and the higher the energy rate; the maximum energy rate of the Snapdragon 855 was 3.4 Watts, of the A53 was 3.7 Watts, and of the M5Y71 was 11.8 Watts. In the case of communication without compression, CPUs barely affected the QAs.
- Communication peripherals affected the QAs equally. WiFi and Bluetooth channels performed equally for small batches of data. WiFi tends to be faster above 300 MegaBytes, while the Bluetooth energy rate is substantially lower (0.5 Watts on average) beyond those 300 MegaBytes.
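The optimisation step discussed above (keep the solutions that satisfy an NFR, then minimise or maximise a QA) can be sketched as a filter followed by a selection. This is a plain linear scan written for illustration, not the mechanism CQL IDE uses; the solution identifiers and QA values are invented placeholders rather than results from the case study.

```python
# (solution id, performance in Seconds, energy rate in milliWatts)
# Placeholder values for illustration only.
rows = [
    ("s1", 4.0, 900.0),
    ("s2", 2.5, 1500.0),
    ("s3", 12.0, 400.0),
]

def optimise(rows, max_seconds):
    """Keep solutions meeting the performance NFR, then minimise energy rate."""
    feasible = [r for r in rows if r[1] < max_seconds]
    return min(feasible, key=lambda r: r[2]) if feasible else None

best = optimise(rows, max_seconds=10.0)
```

With the NFR "Performance < 10 Seconds", solution s3 is discarded despite having the lowest energy rate, and s1 is selected among the remaining two. If no solution satisfies the NFR, the function returns `None`.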
To check the internal validity of our proposal, we tested it first by transforming an SPL tool into a category, and second by modelling an EC case study in a category. Additionally, we went further by implementing that category in CQL IDE, performing reasoning to generate the solution space, and performing an optimal search in order to obtain quality insights from the EC category.
To mitigate wrong scalability assumptions, we ran CQL IDE with different solution spaces on different computers.
Concerning the external validity, we have identified two threats. First, we have not tested our approach on large models with other IDEs, since our aim was a proof-of-concept. Second, as it is, to the best of our knowledge, the first CT NVM/QAM framework, and we implemented it in just one IDE, we cannot compare it with other CT alternatives.

Related Work
Having already presented the relevant publications for the foundations of this paper, we now discuss further related work. Firstly, while we have discussed the advantages of CT, one could argue that simpler structures could be used instead to unify NVMs with QAMs. Table 1 summarises the alternatives, where we highlight first the needs of NVMs, and second what CT provides as a reference.
Whether we are talking about NVMs with or without QAMs, we need complete Boolean and numerical domains. FODA, the first VM formalisation [22], has already been discussed and discarded due to its lack of support for the numerical features and constraints necessary in EC analyses. In fact, as identified in Table 1, most of the alternatives lack numerical support. One of them is Set Theory (ST), which, similarly to CT, is a branch of mathematical logic; it studies sets, which informally are collections of objects [13]. ST lacks support for numerical equations, inequalities, and infinite data-types. Similarly, Order Logic deals just with declarative propositions, predicates and quantification (e.g., ∀x) [24]. Codd Theory is the first and only formalisation of relational algebra, which uses algebraic structures with well-founded semantics for modelling data and defining queries on it. While databases support a wide range of numerical components such as data-types, counting, grouping, arithmetic, etc., these are programming workarounds outside of Codd Theory. In other words, it is not yet clear that Codd relational algebra should be extended beyond a pure Boolean domain [23]. On the other hand, Arithmetic is the study of numbers and their operations; the logic domain is only partially supported, through a pseudo-Boolean (i.e., [0,1]) domain [23]. As an illustration of the expressiveness of CT, ST, Order Logic, Codd Algebra, etc. are already well-formalised as categories in CT.
A computational design framework based solely on objects and arrows was proposed in [4], where Model Driven Engineering meets (Boolean) SPLs. This approach was extended with an explicit use of CT in [35], where VMs and Domain Models are unified. In the Clafer SPL suite, VMs are modelled as abstract classes, an idea borrowed directly from CT [2]. A generic CT approach for the integration of different data domains is formalised in [33], where, as a case study, entity-relational models (i.e., database models) are transformed into a category in which tables are objects, columns are elements, and foreign keys are arrows.
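The table-to-object and foreign-key-to-arrow mapping described for [33] can be sketched in a few lines. The two-table schema below is hypothetical and only loosely inspired by a solution/measurement database; the function and the dictionary layout are assumptions for illustration, not the formalisation of [33].

```python
# Hypothetical two-table schema; all names are illustrative only.
schema = {
    "Solution": {
        "columns": ["id", "leaf_features"],
        "foreign_keys": {},
    },
    "Measurement": {
        "columns": ["id", "name", "value", "solution_id"],
        "foreign_keys": {"solution_id": "Solution"},
    },
}

def schema_to_category(schema):
    """Tables become objects; each foreign key becomes an arrow (name, src, tgt)."""
    objects = set(schema)
    arrows = {
        (f"{table}.{col}", table, target)
        for table, spec in schema.items()
        for col, target in spec["foreign_keys"].items()
    }
    return objects, arrows

objects, arrows = schema_to_category(schema)
```

The result has two objects and a single arrow from Measurement to Solution, mirroring the foreign key. Columns are kept in the schema description and would be handled as the elements populating each object.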

Conclusions and Future Work
In this paper, we uncovered the lack of automated tools to model and optimise SPLs defined as an NVM related to sets of QAs with values. We aimed to define a unified model supporting: (1) Boolean and numerical domains in the form of features and their relationships, and (2) a map between the solution spaces of NVMs and QAMs. To that end, we proposed a CT framework with two categories. The first one is NVM, where variability trees and data-types are objects, and hierarchical and cross-tree constraints are arrows. The second one is QAM, where the sets of QAs and their data-types are objects, and NFRs are arrows. Finally, we established a functorial relationship between the measured products of NVM and the QA sets of QAM. As a proof-of-concept we transformed the SPL HADAS into the category HADAS. We then implemented and deployed it in the CQL IDE, and performed a brief EC case study using a combination of theorem provers and database algorithms as automated reasoners. As future work, we plan to improve the framework to support other proposed extended functionalities of NVMs, as well as to integrate quality models. We are also in the process of evaluating this approach with large SPLs.