A Model-Based Approach Towards Real-Time Analytics in NFV Infrastructures

Abstract—Network Functions Virtualization (NFV) has recently gained much popularity in the research scene for the flexibility and programmability that it will bring with the software implementation of network functions on commercial off-the-shelf (COTS) hardware. To substantiate its roll out, a number of issues (e.g., COTS' inherent performance and energy efficiency, virtualization overhead, etc.) must be addressed in a scalable and sustainable manner. Numerous works in the scientific literature manifest the strong correlation of network key performance indicators (KPIs) with the burstiness of the traffic. This paper proposes a novel model-based analytics approach for profiling virtualized network function (VNF) workloads, towards real-time estimation of network KPIs (specifically, power and latency), based on an M^X/G/1/SET queueing model that captures both the workload burstiness and system setup times (caused by interrupt coalescing and power management actions). Experimental results show good estimation accuracies for both VNF workload profiling and network KPI estimation, with respect to the input traffic and actual measurements, respectively. This demonstrates that the proposed approach can be a powerful tool for scalable and sustainable network/service management and orchestration.

environments would entail rethinking network/service management and orchestration to realize scalable and sustainable 5G networks.
Network Functions Virtualization (NFV) - an emerging softwarization solution - explores the software implementation of network functionalities that would run on COTS hardware [7]. Such a paradigm grants customization and portability to virtualized network functions (VNFs), which would accelerate service innovation and facilitate seamless service support, while minimizing capital expenditures (CAPEX). Despite the numerous gains attainable with NFV, some operational issues that stem from the underlying COTS hardware and the virtualization approach adopted need to be handled effectively and efficiently; otherwise, the operational expenditures (OPEX) incurred in meeting future demands will prove unsustainable.
Contrary to the special-purpose hardware mostly deployed within classical network infrastructures, COTS hardware is intrinsically lower in performance and energy efficiency. While the Advanced Configuration and Power Interface (ACPI) specification [8] equips most - if not all - of it with power management mechanisms (e.g., Low Power Idle (LPI) and Adaptive Rate (AR)), power savings come at the price of performance degradation [9] and, consequently, such features are typically disabled for more predictable (and manageable) operations [10]. Moreover, virtualization typically adds extra layer(s) in the networking stack that result in additional processing delays, further lowering the performance. Based on the analyses presented in [11]- [14], VNFs can exhibit much longer execution times, and even consume more energy, than their physical counterparts for a given amount of workload; the latter work also pointed out that simple workload consolidation to dynamically turn servers on/off would not suffice.
Furthermore, given the highly modular and customizable nature of the virtualized network architecture, coping with the ensuing management complexity entails automated configuration, provisioning and anomaly detection. The ETSI NFV Management and Orchestration (NFV-MANO) framework [15] designates these responsibilities to the virtual infrastructure manager (VIM) of the NFV infrastructure (NFVI). The VIM seeks to obtain performance and anomaly information about virtualized resources based on capacity/usage reports and event notifications, and then to manage them accordingly -yet usually, measurable data do not directly expose network key performance indicators (KPIs). A way to add intelligence to this evolved architecture is exposing meaningful information to suitable stakeholders, such as supporting user view/access to underlying resources [16]. Along these lines, depicting VNF power consumption according to actual packet processing could enable more accurate and fair pricing models, while exposing real-time power consumption and latency trade-offs according to ACPI configurations to the VNF owners could further advance the notion of on-demand services.
Starting from available and easily measurable performance monitor counters (PMCs) in Linux host servers, this paper tries to bridge this gap through a model-based analytics approach for real-time VNF workload profiling and network KPI (i.e., power and latency) estimation. Specifically, the contribution of this work is two-fold:
• a complete analytical characterization of the power- and performance-aware virtualized system, taking into account the inherent workload burstiness, and
• a novel model-based analytics approach for profiling VNF workloads, towards the real-time estimation of the ensuing power consumption and system latency.
An initial version of this work has been presented in [17], in which various PMCs are evaluated for the black-box estimation of key statistical features of the VNF workload, considering a fairly general renewal model (the M^X/G/1/SET queue [18]) that captures traffic burstiness and system setup times. In this extended version, we provide a complete analytical characterization of the M^X/G/1/SET queue, which includes power and latency models. This not only augments the capabilities of the VIM, but is also suitable for state-of-the-art dynamic resource and service provisioning approaches. Moreover, we present a new and more thorough experimental analysis and validation of the models adopted.
The remainder of this paper is organized as follows. Firstly, some technological background and related work are presented in Section II. Section III then describes the system under test (SUT), giving details on key power- and performance-aware parameters foreseen to impact its behavior. Section IV provides the analytical characterization of the different aspects of an M^X/G/1/SET queue; then, the key model parameters are exposed from available and easily measurable PMCs in Section V. Experimental results are presented in Section VI, and finally, conclusions are drawn in Section VII.

II. BACKGROUND AND RELATED WORK
In this section, a brief background on the technological scenario is presented, along with some related work.

A. The ACPI Specification
Most - if not all - of the COTS hardware in today's market is already equipped with power management mechanisms through the ACPI specification. The ACPI exposes the LPI and AR functionalities at the software level through the power (C_x) and performance (P_y) states, respectively. The former comprise the active state C_0 and the sleeping states {C_1, ..., C_X}, while the latter correspond to different processing performances {P_0, ..., P_Y} at C_0. Higher values of the x and y indexes indicate deeper sleeping states and lower working frequencies and/or voltages, respectively. Sleeping states, although resulting in lower power consumption, incur performance degradation due to wakeup times, whereas reduced processing capacity increases the service times. Some pre-defined control algorithms for AR are implemented in the Linux kernel as CPUFreq governors, which include statically setting the CPU to the maximum (performance), minimum (powersave) and user-/program-defined (userspace) frequencies, as well as setting it dynamically according to the load (ondemand) [19].
It can be noted how LPI and AR have opposite effects on the burstiness of the traffic (i.e., the former clusters packets into bursts, while the latter smoothens the traffic profile). The joint adoption of both mechanisms does not guarantee greater savings [20], [21]; negative savings may even result from naïve use of the ACPI [22]. The optimum configuration largely depends on the burstiness of the incoming traffic.

B. Performance vs Flexibility in NFV
Basically, the NFVI can employ various virtualization layer solutions towards the deployment of VNFs. This involves selection among (or mixing of) different virtualization technologies, as well as their corresponding platforms and I/O technologies, which govern the overall performance and flexibility of the implementation [23]- [25].
In more detail, the performance yardstick in NFV is a set of network KPIs (e.g., power consumption, latency, response time, maximum throughput, isolation, mobility management complexity, instantiation time, etc.); this set varies (or at least the weight of each component does) with the application. Nonetheless, the overall performance is closely linked to the level of abstraction, and hence, to the virtualization overhead introduced in the chosen implementation. For instance, as described in [23], typical hypervisor-based solutions create isolated virtual machines (VMs) that are highly abstracted and flexible but with relatively high overhead, while container-based solutions create isolated guests (referred to as containers) that directly share the host operating system (OS), thus avoiding much of the overhead, but with a number of flexibility limitations (e.g., consolidation of heterogeneous VNFs, mobility support, etc.).
Other ways for reducing the virtualization overhead regard the handling of network I/O. A number of works in the scientific literature (e.g., [26]- [28], among others) consider technologies like Single Root I/O Virtualization (SR-IOV) and Intel's Data Plane Development Kit (DPDK), which bypass the OS network stack. However, this entails building a specialized network stack on applications that require one, and the device cannot be shared with other applications [25].
In this work, we focus on the power consumption and latency as network KPIs, and consider a traditional VM-based VNF implementation in order to minimize the dependence of the proposed approach on the virtualization and I/O technologies. Nevertheless, the methodology adopted can also be applied to container-based and bypass VNF implementations.

C. Modeling and Analytics of Network KPIs in NFV
The power models proposed and used in the scientific literature vary in complexity - among the simplest are: constant values (e.g., from datasheets) [14], linear equations as a function of the load [3], [4], [29], [30], and, more recently, an exponential equation as a function of the number of VMs sharing a CPU, as proposed in [13]. The same holds for latency models: the simple delay formula of the M/M/1 queueing system is widely used (e.g., [3], [30]), while the authors in [13] have recently proposed a linear model that seeks to account for the system setup and virtualization overhead. It can be noted that, across different aspects of 5G (e.g., [3], [13], [30]), power and latency modeling have been jointly taken into account in network/service management and orchestration.
A large part of the state-of-the-art software-level modeling approaches use machine learning (ML) techniques, also based on measurable PMCs. For instance, numerous PMC-based weighted sum power models have already been proposed at the VM and core levels, which employ linear regression to obtain the weight coefficients, as elaborated in [31] and [32]. The authors in [33] use Support Vector Machine (SVM) in exploring correlations between application-level quality of service (QoS) parameters like throughput and response time with the power readings from Intel's Running Average Power Limit (RAPL) interface [34]. On the other hand, by building on the nonlinear relationship between the incoming traffic and VNF code, Artificial Neural Networks (ANNs) have been used in [10] to go beyond the performance and ondemand governors in finding the optimum AR configuration of the CPU hosting a VNF, based on the workload and RAPL measurements. High levels of accuracy can be obtained with ML-based approaches, provided that the appropriate set of PMCs is considered and an extensive dataset is available for training.
As previously anticipated, another well-known approach for modeling and analyzing telecommunications systems is the application of queueing theory and optimization principles which, more recently, is being adopted in the context of NFV as well. Most of the works in the literature regard estimating the system or queueing latencies towards efficient (QoS-aware) network service provisioning, considering networks of queues to model interactions among service chain/virtual system/VNF components, with each component modeled as a (unique) queueing system [35]- [38]. Delving deeper into the infrastructure level, [39] takes into account the impact of interrupt coalescing (IC) in the Network Interface Card (NIC) on VNF performance (in terms of latency and packet loss), while [40] considers the burstiness of the VNF workloads (both incoming and aggregated) in the power modeling. Among other works, we can also cite [41], which seeks to optimize the use of coalescing with LPI, and [42] which, though not explicitly addressing NFV, considers the trade-off between energy and processing delay in Edge Computing architectures.
In this work, we adopt the queueing model considered in [40], analytically characterizing the different aspects of the system; starting from there, a model-based analytics approach that uses -and adds value to -available PMCs is proposed towards real-time VNF workload profiling, as well as power and latency estimation.

III. SYSTEM DESCRIPTION
Considering that the system behaviour highly depends on the ACPI configuration of the host, as well as the virtualization and I/O technologies used in the VNF implementation, more details on these aspects are provided in the following Subsections.

A. ACPI Configuration
For a given (C x , P y ) pair, a number of power-and performance-aware parameters can be defined.
1) Power Requirements: The instantaneous power requirements vary with the core's state. Specific values of idle (Φ_i) and active (Φ_a) power consumption are associated with each available power and performance state, respectively.
Moreover, transitions between C_0 and C_x are not instantaneous; hence, the power consumed in these periods must also be taken into account. Since the average power consumption during sleeping transitions (C_0 → C_x) approximates Φ_i, we only consider the wakeup transitions (C_x → C_0) in this work. In particular, the power spike in the latter is associated with a wakeup power consumption Φ_w that is approximately 2.5 Φ_a, as pointed out in [43].
2) System Latencies: The total delay experienced by packets can be broken down into contributions of different system operations. As packets arrive at the RX queue, NICs may wait either for some time interval (i.e., time-based IC) or some number of arrival events (i.e., frame-based IC) before raising interrupts to notify the core of pending work. Generally, such service requests can occur while the core is in idle or active mode; in the former, there is an additional setup period due to the wakeup and reconfiguration operations before the actual packet processing begins.
At the NIC level, we consider time-based IC, for which we define the period τ_ic. At the core level, we consider two setup contributions (i.e., due to wakeup and due to reconfiguration), for which we define the periods τ_p and τ_r, respectively. For the Sandy Bridge EP platform, core wakeup latencies are in the order of nano/microseconds [44], yet power spikes during the wakeup transitions can last somewhat longer [43]; in any case, the value of τ_p depends on the core's power state C_x. Once in active mode, the core then performs some reconfiguration operations; the value of τ_r depends on the core's performance state P_y. In the context of power and latency modeling, τ_p and τ_l = τ_ic + τ_p + τ_r will be considered, respectively.
After the completion of the setup period, backlogged packets are supposed to be served exhaustively (considering that packets have already been transferred to the main memory via standard Direct Memory Access (DMA)) with an average processing capacity μ, which corresponds to the operating energy point of the performance state P_y.
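As a quick numerical illustration of how these parameters combine (a sketch using the indicative values reported later in Section VI; the variable names are ours):

```python
# Indicative setup/power parameters (values approximated in Section VI;
# any other host would need its own calibration).
tau_ic = 3e-6    # time-based interrupt-coalescing window (s)
tau_p  = 10e-6   # wakeup latency of the power state C_x (s)
tau_r  = 10e-6   # reconfiguration time of the performance state P_y (s)
phi_a  = 53.38   # active power consumption (W)

# Setup time as seen by the two models:
tau_power   = tau_p                    # power model: only the wakeup spike
tau_latency = tau_ic + tau_p + tau_r   # latency model: full setup chain

# Wakeup power spike, approximated as 2.5x the active power [43]
phi_w = 2.5 * phi_a

print(tau_latency)  # about 23 microseconds
print(phi_w)        # about 133.45 W
```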

B. VNF Implementation
With the current ubiquity of Linux servers and x86 hardware with virtualization extensions, Kernel-based Virtual Machine (KVM) [45] -being the default virtualization infrastructure of Linux that is already integrated in the kernel -offers simplicity in the VNF deployment and mobility management. In this respect, we consider a KVM-based VNF running on a Linux host in this work, but the approach is also applicable to (or easily adaptable for) other platforms.
Particularly, as a full virtualization solution, KVM is able to run VMs with unmodified guest OSs. Guest networking is implemented by the user space process Quick Emulator (QEMU), as detailed in [46]. VM processes are allocated a certain number of vCPUs, each one seen as a physical CPU by the guest OS. Then, VNFs run as guest user space processes in the corresponding VMs. Fig. 1 illustrates the traditional KVM architecture for VNF implementation.
For simplicity, but without loss of generality, we suppose a one-to-one correspondence between the VM and the core to match the core workload, utilization, power consumption and latency with those of the VNF. More complex VNFs may consist of multiple VMs -each one running a VNF component (VNFC), but the overall performance can be derived from the individual performances of the components.
We note that with such a traditional VM-based implementation, switching between the VNF and interrupt handler codes can be rather costly. To reduce this overhead, the interrupt and VM process affinities are set to different cores in this work. In a pipeline model of sorts, the core tasked with interrupt handling then notifies the one running the VM process via an inter-processor interrupt (IPI) for backlogged packets. Moreover, as also illustrated in [47], setting affinities or core pinning in such fashion improves the energy efficiency of the system.
Using the ethtool command [48], a number of parameters can be tuned in the NIC. In order to preserve as much as possible the shapes of the incoming/outgoing traffic of the VNF, we look into the IC and RX/TX ring parameter settings. For the former, we decided to keep the default settings since they are more or less equivalent with respect to the generated input traffic -specifically, options for adaptive IC (i.e., adaptive-rx and adaptive-tx) are off, the parameters for frame-based IC (i.e., rx-frames and tx-frames) are set to 0, and the parameters for time-based IC (i.e., rx-usecs and tx-usecs) are set to 3 μs and 0, respectively. On the other hand, the RX/TX ring buffer sizes are again set to the pre-set maximums (i.e., 4096) in order to maximize the NIC's ability to handle burst arrivals.

IV. ANALYTICAL MODEL
The energy-aware core hosting the VNF (or VNFC) is modeled as an M^X/G/1/SET queue, as in [17], [20], [21]. This model generalizes the well-known M^X/G/1 queue [49] for burst arrivals, by also covering the cases in which an additional setup period SET is necessary before service can be resumed.
In more detail, batches of packets arrive at the system at exponentially distributed inter-arrival times with a random batch size X. If the system is empty at the arrival instant, SET is initiated; service only begins after the completion of SET. Packets are queued as they arrive and served individually with generally-distributed service times S. Moreover, we approximate the loss probability in a queue with finite buffer N by the stationary probability that the number n of customers in the infinite-buffer queue at a generic time t be greater than N (Pr{n > N}). Therefore, hereinafter, we will consider the infinite buffer case.
More details on the different model components are presented in this section -from the arrival, setup and service processes, to key networking KPIs. The model notation is given in Table I.

A. Traffic Model
In telecommunications networks, where burst packet arrivals are more representative of the traffic behaviour than single arrivals, effectively capturing the burstiness is essential. The Batch Markov Arrival Process (BMAP) has long been established in this respect [50], [51]; BMAP allows for dependent and non-exponentially distributed packet inter-arrival times, while keeping the tractability of the Poisson process [52]. Starting from this, we suppose that packets arrive according to a BMAP with batch arrival rate λ.
To characterize the random batch size X, let β_j be the probability that an incoming batch is composed of j packets (j = 1, 2, ...). Then, the Probability Generating Function (PGF) of X is given by

β(z) = Σ_{j=1}^∞ β_j z^j, (1)

from which we obtain the first and second factorial moments of the batch size,

β^(1) = Σ_{j=1}^∞ j β_j,  β^(2) = Σ_{j=1}^∞ (j^2 − j) β_j. (2)

The offered load in packets per second (pps) is then obtained as OL = λ β^(1).
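These moments are straightforward to evaluate numerically; a minimal sketch with a hypothetical batch-size distribution {β_j} of our own choosing (not from the paper):

```python
# Example batch-size distribution beta_j (hypothetical values):
# batches of 1, 2 or 4 packets with the given probabilities.
beta = {1: 0.5, 2: 0.3, 4: 0.2}
assert abs(sum(beta.values()) - 1.0) < 1e-12

lam = 1000.0  # batch arrival rate (batches/s), also hypothetical

# First and second factorial moments of the batch size X
beta1 = sum(j * p for j, p in beta.items())            # E[X]
beta2 = sum(j * (j - 1) * p for j, p in beta.items())  # E[X(X-1)]

# Offered load in packets per second: OL = lambda * beta^(1)
OL = lam * beta1

print(round(beta1, 6), round(beta2, 6), round(OL, 3))  # 1.9 3.0 1900.0
```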
Given λ, β (1) and β (2) , the burstiness of the traffic can already be well estimated. However, a common difficulty stems from the fact that the discrete probability distribution {β j , j = 1, 2, . . .} may not be given, and typically requires detailed analysis of packet-level traces [20]. As an alternative approach, we propose to estimate the factorial moments from easily measurable parameters (e.g., VNF workload, idle and busy times), which will be discussed in Section V.

B. Setup Model
With the deterministic nature of the setup period due to core wakeup transitions and reconfigurations considered in this work, the Laplace transform of the probability density τ(t) reduces to

τ*(θ) = e^{−θτ}. (3)

From this, the first and second moments of the setup time are simply given by τ^(1) = τ and τ^(2) = τ^2, respectively, with τ = τ_p in the context of power consumption, and τ = τ_l in the context of latency.

C. Service Model
Once setup is completed, the core starts to serve backlogged packets and remains busy until the system becomes empty. We suppose that the VNF (or VNFC) running on the core has deterministic service times.

1) Service Process:
Generally, multiple VMs may be consolidated on the same core, and the service process can be captured by a discrete set of service rates μ_m with corresponding probabilities π_m, m ∈ {1, ..., M}, where M is the number of VMs sharing the core. The Laplace transform of the probability density s(t) is then obtained as

s*(θ) = Σ_{m=1}^M π_m e^{−θ/μ_m}. (4)

However, in the special case of one-to-one correspondence between cores and VMs, Eq. (4) reduces to s*(θ) = e^{−θ/μ}, giving the first and second moments of the service time as s^(1) = 1/μ and s^(2) = 1/μ^2, respectively. Note that the core utilization due to actual packet processing is obtained as ρ = OL/μ (< 1 for system stability).
2) Busy Period Distribution: From the derivations presented in Appendix A, the Laplace transform of the busy period density B(t) can be specialized for deterministic setup and service times; from it, we obtain the first moment of the busy periods as

B^(1) = (τ + β^(1)/μ) / (1 − ρ), (5)

and, by further differentiation, the second moment B^(2). It is important to note that these expressions are particularly useful for estimating the power consumption and system latency, as we shall see further on.
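Under the one-to-one VM/core mapping, these quantities reduce to simple closed forms. The sketch below uses hypothetical traffic values, with the mean delay busy period written in the standard form (τ + β^(1)/μ)/(1 − ρ) (setup plus the initiating batch's work, expanded by 1/(1 − ρ)); it should be checked against the full derivation in Appendix A:

```python
mu = 199628.0     # service capacity in pps (Section VI value)
lam = 1000.0      # batch arrival rate (hypothetical)
beta1 = 1.9       # mean batch size (hypothetical)
tau = 23e-6       # deterministic setup time tau_l (from Section VI values)

OL = lam * beta1          # offered load (pps)
rho = OL / mu             # utilization due to packet processing
assert rho < 1.0          # stability condition

s1 = 1.0 / mu             # deterministic service time moments
s2 = 1.0 / mu**2

# Mean delay busy period: setup plus the work of the initiating batch,
# expanded by the factor 1/(1 - rho)
B1 = (tau + beta1 * s1) / (1.0 - rho)

print(round(rho, 4))       # around 0.0095 with these values
print(round(B1 * 1e6, 1))  # mean busy period in microseconds
```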

D. Power Model
We adopt the power consumption model proposed in [40] for an energy-aware core running VMs. The model works according to a renewal process, where the idle (I) and delay busy (SET + B) periods constitute independent and identically distributed (iid) "cycles" (R), as illustrated in Fig. 2. A delay busy period, as defined in [18], starts with the arrival of the batch initiating the setup, and ends with the departure of the last packet in the system.
Based on classical renewal theory principles, the steady-state behavior of the stochastic process can be studied by looking at a representative cycle [49]. With this in mind, the average power consumption of the core is expressed as a sum of the average contributions incurred during the idle, setup (due to wakeups) and busy periods,

Φ = (Φ_i I^(1) + Φ_w τ^(1) + Φ_a B^(1)) / R^(1), (6)

where R^(1) = I^(1) + τ^(1) + B^(1) is the average length of a renewal cycle, and I^(1) is the average length of an idle period. These are then specialized to the case of BMAP arrivals (i.e., I^(1) = 1/λ), and deterministic setup and service times (i.e., τ^(1) = τ_p, with B^(1) as given in Section IV-C2).
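The renewal-cycle structure (a time-weighted average of the per-phase power draws over one cycle) can be evaluated directly; a sketch with the Section VI parameter values and hypothetical traffic values (lam, beta1), using the cycle decomposition described above:

```python
# Per-phase power draws (Section VI values) and hypothetical traffic
phi_i, phi_a, phi_w = 8.33, 53.38, 133.45   # idle/active/wakeup power (W)
mu, tau_p = 199628.0, 10e-6                 # capacity (pps), wakeup setup (s)
lam, beta1 = 1000.0, 1.9                    # batch rate (1/s), mean batch size

rho = lam * beta1 / mu                      # utilization
I1 = 1.0 / lam                              # mean idle period (BMAP arrivals)
B1 = (tau_p + beta1 / mu) / (1.0 - rho)     # mean delay busy period
R1 = I1 + tau_p + B1                        # mean renewal cycle length

# Time-weighted average of the per-phase contributions over one cycle
phi = (phi_i * I1 + phi_w * tau_p + phi_a * B1) / R1
print(round(phi, 2))  # roughly 10 W here: a bit above the idle floor
```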

E. Latency Model
The system latency D is derived as the sum of the average waiting time W of a packet in the queue and its average service time (i.e., s (1) = 1/μ).
In more detail, Little's law defines the former as W = L/(λβ^(1)), where L is the average queue length, which can be derived from the PGF P(z) of the number of packets in the M^X/G/1/SET system at a random epoch as L = dP(z)/dz evaluated at z = 1. By specializing the general expression for P(z) presented in Appendix B to the case of deterministic setup and service times, and with τ = τ_l, we obtain

W = (ρ β^(1) + β^(2)) / (2 β^(1) μ (1 − ρ)) + τ_l (2 + λ τ_l) / (2 (1 + λ τ_l)), (8)

whence D = s^(1) + W, i.e.,

D = 1/μ + (ρ β^(1) + β^(2)) / (2 β^(1) μ (1 − ρ)) + τ_l (2 + λ τ_l) / (2 (1 + λ τ_l)). (9)

It is interesting to note that Eq. (8) can also be derived starting from the Laplace transform of the waiting time density, as in [53].
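The latency can be assembled from three pieces: the service time, an M^X/D/1-type queueing term, and a deterministic-setup term. The sketch below uses hypothetical traffic values; this decomposition is our reading of the specialized latency expression (Eq. (9)) and should be checked against the paper's final closed form:

```python
mu = 199628.0                          # processing capacity (pps, Section VI)
lam, beta1, beta2 = 1000.0, 1.9, 3.0   # hypothetical traffic parameters
tau_l = 23e-6                          # setup seen by packets: tau_ic+tau_p+tau_r

rho = lam * beta1 / mu                 # utilization due to packet processing

# M^X/D/1-type queueing delay: classical M/D/1 wait plus the within-batch
# wait driven by the second factorial moment beta^(2)
w_queue = (rho * beta1 + beta2) / (2.0 * beta1 * mu * (1.0 - rho))

# Additional delay due to the deterministic setup period
w_setup = tau_l * (2.0 + lam * tau_l) / (2.0 * (1.0 + lam * tau_l))

D = 1.0 / mu + w_queue + w_setup       # total system latency (s)
print(round(D * 1e6, 2))               # latency in microseconds
```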

V. EXPOSING MODEL PARAMETERS
With the M^X/G/1/SET queue as a basis, here we expose the key model parameters starting from available and easily measurable PMCs in the Linux host, in effect profiling the VNF workloads. From these, the corresponding power consumption and system latency can then be readily derived from Eqs. (6) and (9), respectively.

A. Performance Monitor Counters
Linux has different utilities for performance monitoring - among them, the PMCs described in the following are considered in this work; other PMCs used in [17] did not work well with the BMAP emulation. Note that in the syntax of these utilities, the term 'CPU' refers to a core (or logical core, in the case of hyperthreading).
1) Idlestat: As a tool for CPU power/performance state analysis, the idlestat command [54] in trace mode is able to monitor and capture the C- and P-state transitions of CPUs over a user-defined interval. To run it in trace mode, the -trace option is used together with the <filename> and <time> parameters to specify the trace output filename and the capture interval in seconds, respectively. With the -c and -p options, C-state (including the POLL state, in which the CPU is idle but did not yet enter a power state) and P-state statistics are reported in terms of the time spent in each state per CPU.
2) VnStat: As a network traffic monitoring tool, the vnstat command [55] is able to report how much traffic (in terms of average rates) goes through a specific interface over a user-defined interval. This is done by using the -tr option together with the <time> parameter to specify the monitoring interval in seconds, and the -i option together with the <interface> parameter to specify the interface.

B. Estimation With PMCs
In this Sub-section, we seek to expose the model parameters from the considered PMCs, for a given (C x , P y ) pair.

1) Offered Load and Utilization:
Measuring OL with the vnstat command is straightforward, as it corresponds to the rate of incoming traffic on the network interface bound to the VM process.
On the other hand, the utilization measurable with the idlestat command as

ρ̄ = T̃_Py / (T̃_Py + T̃_Cx + T̃_POLL), (10)

where T̃_Py, T̃_Cx and T̃_POLL are the measured average times spent in the corresponding states, encompasses all operations (including reconfigurations, context switching, sleep transitions, etc.) performed by the active core. While this gives indications on the utilization overheads incurred in this VNF implementation, we prefer to estimate the utilization due to actual packet processing, for which we consider

ρ̂ = ÕL/μ, (11)

where ÕL is the offered load measured with the vnstat command.
2) Batch Arrival Rate: By considering exponentially distributed inter-arrival times, λ can be estimated from the average idle times measurable with the idlestat command. Theoretically, Ĩ^(1) ≈ T̃_Cx + T̃_POLL; however, given the high variance observed on T̃_POLL, we propose to consider only T̃_Cx for stable estimates. Linear regression is then used to compensate for the discrepancy, giving

λ̂ = α_1 (1/T̃_Cx) + α_0, (12)

where α_1 and α_0 are the computed regression coefficients.
3) Factorial Moments of the Batch Size: Given ÕL and λ̂, β^(1) (or the average batch size) can be directly estimated by definition as

β̂^(1) = ÕL / λ̂. (13)

While estimating β^(2) is a bit more involved, it is essential for estimating D. In this regard, we propose to start from the expression of the second moment of the busy times B^(2) (see Section IV-C2) and invert it with respect to β^(2) (Eq. (14)), with the max function ensuring β̂^(2) ≥ 0. Now we are left with how to obtain B̂^(2), for which we adopt the well-known theorems in statistics regarding the mean and variance of sample means.
In more detail, let {B_n^(1), n = 1, ..., η} be a set of η sample means of the busy times, each obtained over an observation period Δt; from the well-known relationship between the variance of sample means and the population variance [56], we can then estimate the second moment of the busy times as in Eq. (15). In this work, we consider B̃^(1) ≈ T̃_Py − ΔT, where ΔT is the busy overhead due to operations other than actual packet processing (which includes τ_r, context switching, sleep transitions, etc.). Eq. (15) is then applied on the set of samples {B_n^(1), n = 1, ..., η} to obtain an estimate of B^(2).
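The sample-mean trick can be checked on synthetic data. The key fact is that the variance of a mean of k iid busy periods is Var(B)/k, so B^(2) = Var(B) + (B^(1))^2 can be recovered as k·Var(sample means) + (mean)^2. In the sketch below, all names, values and the exponential stand-in distribution are our own illustrative choices:

```python
import random
import statistics

random.seed(7)

# Synthetic ground truth: exponential busy periods (a stand-in for the
# real busy-time distribution)
TRUE_B1 = 20e-6              # true mean busy period (s)
k = 400                      # busy periods averaged into each sample mean
eta = 200                    # number of sample means collected

# Each sample is the mean of k busy periods observed over a window
samples = [
    statistics.fmean(random.expovariate(1.0 / TRUE_B1) for _ in range(k))
    for _ in range(eta)
]

B1_hat = statistics.fmean(samples)
# Var(sample mean) = Var(B)/k  =>  Var(B) = k * Var(sample mean),
# and B^(2) = Var(B) + (B^(1))^2
B2_hat = k * statistics.variance(samples) + B1_hat**2

# For an exponential distribution, B^(2) = 2 (B^(1))^2,
# so this ratio should be close to 1
print(B2_hat / (2 * TRUE_B1**2))
```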

VI. EXPERIMENTAL RESULTS
The proposed approach is evaluated considering a SUT equipped with two Intel Xeon E5-2643 v3 3.40GHz processor packages, running an OpenWrt [57] virtual firewall (VF). The latter is pinned (or with affinity set) to a single core and the interrupt request (IRQ) handling to another one. The SUT is connected via RX/TX Gigabit Ethernet links to an Ixia NX2 traffic generator, as shown in Fig. 3. The setup creates a controlled environment that allows monitoring of the SUT's PMCs, power consumption and system latency, as the burstiness of incoming traffic is varied.
The ACPI configuration of the pinned core is set to (C_1E, P_0T) to maximize the system throughput, where the power state C_1E corresponds to the Enhanced Halt - the lightest sleeping state with improved power requirements - and the performance state P_0T to the maximum turbo frequency; the rest of the cores are put into the deep power saving state C_6. Under this configuration, we approximate the values of the following model parameters as: μ ≈ 199628 pps, τ_ic ≈ 3 μs, τ_p ≈ 10 μs, τ_r ≈ 10 μs, Φ_i ≈ 8.33 W, Φ_a ≈ 53.38 W and Φ_w ≈ 133.45 W.
In more detail, the processing capacity is estimated as the maximum system throughput measured from the user interface of the traffic generator. Setup components are derived from the IC configuration, wake-up latencies specified in the kernel's cpuidle sysfs and the results of [58]. Power related parameters are estimated starting from actual package-level (i.e., core part) measurements obtained with the turbostat command [59] -one of the many tools that expose power measurements from Intel's RAPL interfaces; [60] confirms from extensive tests that RAPL exposes true averages that are updated at fine-grained intervals.
Lastly, the results obtained with the model-based approach are validated with respect to the inputs (for the workload profiling) and actual measurements (for the network KPI estimation); for each test point, 100 samples are collected, from which the 99% confidence intervals are obtained and indicated with error bars.

A. Emulating BMAP Arrivals
Rather than simply tuning deterministic parameters in the Ixia NX2 traffic generator, as in [17], incoming traffic is more accurately generated to emulate BMAP arrivals in this extended version. Tcl scripts are used to specify batch inter-arrival times and batch sizes. Realizations of the batch inter-arrival times are drawn from exponential distributions (setting the desired mean value), while those of batch sizes from truncated generalized Pareto distributions (with default shape parameter, and varying the scale and location parameters to approximate the desired mean value). The resulting pdfs of the inter-arrival times and batch sizes are illustrated in Figs. 4 and 5, respectively.
In the Tcl scripts, BMAP is emulated by assigning each batch (of X packets) to a stream, as well as an inter-stream gap (ISG) that approximates the inter-arrival time in the model. Starting from the first stream/batch, the next one is generated after the specified ISG, and so on. Then, the system loops back to the first stream/batch after the ISG of the last one, as shown in Fig. 6. Since the traffic generator allows up to 4096 streams when using 1 Gigabit ports, and only 512 streams when using 10 Gigabit ports, we use the former to achieve better approximations of the batch size and inter-arrival time distributions.

B. Validation of Workload Profiling
As initially motivated in [17], the proposed approach seeks to profile VNF workloads beyond offered loads and utilization; specifically, to capture the workload burstiness, as characterized by the model parameters λ, β^(1) and β^(2).

1) Offered Load and Utilization:
Using the vnstat command, we obtain Õ_L with maximum and mean absolute percent errors of 5.69% and 2.15%, respectively. Looking at Eq. (11), the same accuracy is expected for ρ̃ with a constant value of μ. Fig. 7 shows a comparison between the measured (ρ) and estimated (ρ̃) utilization values, together with the values computed from the input model parameters (ρ_in, the utilization due to actual packet processing). It can be observed how ρ̃ fits with the model, while ρ exhibits an overhead that appears highly correlated with the batch size; this further motivates the need to capture the traffic burstiness.
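Assuming Eq. (11) takes the form ρ = O_L/μ, as the surrounding text suggests, the estimation chain from interface packet counters can be sketched as (the counter-differencing step mirrors what vnstat does from /proc/net/dev; the function names are illustrative):

```python
def offered_load_pps(rx_packets_t0, rx_packets_t1, dt_s):
    """Estimate the offered load (packets/s) from two readings of an
    interface's cumulative RX packet counter taken dt_s seconds apart."""
    return (rx_packets_t1 - rx_packets_t0) / dt_s

def utilization(ol_pps, mu_pps=199_628):
    """Assumed Eq. (11) form: rho = O_L / mu, with mu the processing
    capacity measured for the SUT in Sec. VI (199628 pps)."""
    return ol_pps / mu_pps
```

Since μ is held constant, any relative error in Õ_L propagates one-to-one into ρ̃, which is why the two estimates share the same accuracy.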
2) Burstiness: In this work, it was observed that, with BMAP emulation, the idlestat command gave reliable estimates for λ that also comply with the theory of Poisson processes, while the software PMCs initially considered in [17] failed. Fig. 8a shows the estimates obtained based on Eq. (12), with α_1 = 0.910651 and α_0 = −60.418339. Tight confidence intervals and low absolute errors are achieved, even with varying values of β^(1) aggregated in each test point. A similar level of accuracy is expected for the β^(1) estimates, as they are solely based on Õ_L and λ̃, as indicated in Eq. (13); the obtained estimates are shown in Fig. 8b. On the other hand, Fig. 8c shows the β^(2) estimates obtained based on Eq. (14). Although the input and estimated values follow the same trend, a relatively higher variance (and hence, higher errors) is observed in β̃^(2), stemming from the fact that the starting point was the busy times (i.e., Eq. (15)), which also depend on other model parameters.
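A sketch of the first two burstiness estimators, under the assumption that Eq. (12) applies the reported linear correction to the wake-up rate exposed by idlestat and that Eq. (13) is β^(1) = Õ_L/λ̃ (the exact regressor is an assumption here; only the α coefficients are taken from the text):

```python
# Calibration coefficients reported in Sec. VI-B for Eq. (12).
ALPHA_1, ALPHA_0 = 0.910651, -60.418339

def est_lambda(wakeups_per_s):
    """Assumed Eq. (12): linear correction mapping the core wake-up rate
    (as reported by idlestat) to the batch arrival rate (batches/s)."""
    return ALPHA_1 * wakeups_per_s + ALPHA_0

def est_beta1(ol_pps, lam):
    """Assumed Eq. (13): mean batch size = offered load / batch arrival rate."""
    return ol_pps / lam
```

Because β̃^(1) is a simple ratio of Õ_L and λ̃, its accuracy tracks that of its two inputs, as noted in the text.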

C. Validation of Network KPI Estimation
In this subsection, we apply the results obtained from the workload profiling to the real-time estimation of network KPIs; specifically, the VNF power consumption and the system latency. Then, the estimates obtained from the power (i.e., Eq. (6)) and latency (i.e., Eq. (9)) models are compared with actual measurements.
As regards the power consumption, we suppose that the core power consumption due to the VNF can be obtained as Φ = Φ̃_cpkg − ΔΦ, where Φ̃_cpkg is the power consumed by the core part of the package (measurable with the turbostat command), and ΔΦ is the overhead due to the other cores in the package. Recalling that the VNF is pinned to a core under (C1E, P0T), while the rest of the cores are in state C6, we consider ΔΦ ≈ 15.93 W in this work.
On the other hand, we suppose that the VNF latency can be obtained as D = D̃_ixia − ΔD, where D̃_ixia is the store-and-forward latency measurable from the user interface of the traffic generator, and ΔD is the overhead due to the 2-way transmission of 64-byte Ethernet frames with the 20-byte per-frame overhead on a Gigabit link (i.e., ≈ 2β/1488095), plus the busy overhead Δ_T ≈ 90 μs (which includes τ_r, context switching (for which [39] proposed a rule of thumb of 30 μs), sleep transitions, etc.).

1) Power: Fig. 9a illustrates the behaviour of the power model for varying traffic burstiness. Intuitively, looking at Eq. (6), the average core power consumption is linearly dependent on both λ and β^(1) (embedded in ρ), although a stronger correlation is observed with the former. Results on the model-based power estimation, and its comparison with the actual measurements (in terms of absolute error), are shown in Fig. 9b. As before, tight confidence intervals and low absolute errors are achieved, even with varying values of β^(1) aggregated in each test point.
2) Latency: Fig. 10a illustrates the behaviour of the latency model for varying traffic burstiness. Contrary to the power consumption, the average VNF latency incurred is more strongly linked to β^(1) (and β^(2)) than to λ. Results on the model-based latency estimation, and its comparison with the actual measurements (in terms of absolute error), are shown in Fig. 10b. Tight confidence intervals and low absolute errors are also achieved in this case, even with varying values of λ aggregated in each test point.
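The serialization component of ΔD used above follows from the Gigabit Ethernet line rate for 64-byte frames: each frame occupies 84 bytes on the wire (frame plus 8-byte preamble and 12-byte inter-frame gap), giving the 1,488,095 fps figure in the text. A minimal check:

```python
def gbe_frame_rate(frame_bytes=64, preamble_bytes=8, ifg_bytes=12,
                   link_bps=1e9):
    """Maximum frame rate on a link: each frame occupies
    frame + preamble + inter-frame gap bytes on the wire
    (64-byte frames -> 84 bytes -> ~1,488,095 frames/s on GbE)."""
    wire_bits = (frame_bytes + preamble_bytes + ifg_bytes) * 8
    return link_bps / wire_bits

def wire_overhead_s(beta1, fps=1_488_095):
    """2-way serialization overhead for a batch of beta1 frames,
    Delta_D ~ 2*beta1/1488095, as used in the text."""
    return 2.0 * beta1 / fps
```

For a mean batch size of 10 frames this contributes about 13.4 μs, which is why the busy overhead Δ_T ≈ 90 μs dominates the correction at small batch sizes.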

D. Validation on Facebook's Dataset
Finally, to further support our assumptions and exhaustive validation results over the wide range of (λ, β) values, we sample Facebook's Web server cluster dataset [61], [62]. The timestamps corresponding to each (source) IP address are analyzed to derive the batch inter-arrival times and sizes.
Looking at the top 500 addresses in the trace files (in terms of occurrences), Figs. 11a and 11b show the average batch inter-arrival times and sizes, respectively. A sort of steady-state phase can be observed from the 185th address onwards (marked by the red dotted lines); intuitively, this means that addresses in this subset have similar traffic burstiness and, hence, comparable system behaviors. With this in mind, 5 IP addresses are randomly chosen from the said subset for the detailed validation.
In the following, the considered batch inter-arrival times and sizes are first fitted to exponential and generalized Pareto distributions, respectively. Then, considering the traffic of each address as input to the SUT, the network KPI values obtained with the proposed model-based estimation are compared with actual measurements.

1) Distribution Fitting:
Using MATLAB's fitdist function, we obtain the distribution parameters resulting from fitting the input samples to the considered models. The goodness of fit is then measured in terms of the coefficients of correlation (R) and determination (R²).
Figs. 12 and 13 show how well the distributions fit for the 5 IP addresses. Particularly, it can be observed in Fig. 12 that the input batch inter-arrival times have R > 95% and R² > 90% with the exponential distribution, while the batch sizes in Fig. 13 have R and R² values over 99% with the generalized Pareto distribution. Such high values of R and R² confirm that the samples considered from Facebook's dataset are well-represented by the models.
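An equivalent fit outside MATLAB can be sketched with SciPy, scoring the fit with the correlation between empirical and fitted-model quantiles, i.e., a probability-plot flavor of R and R² (the synthetic data standing in for one address' inter-arrival times is illustrative):

```python
import numpy as np
from scipy import stats

def fit_and_r2(samples, dist, **fit_kwargs):
    """Fit a scipy.stats distribution to samples and score the fit with the
    coefficients of correlation (R) and determination (R^2) between the
    empirical and fitted-model quantiles."""
    params = dist.fit(samples, **fit_kwargs)
    n = len(samples)
    probs = (np.arange(1, n + 1) - 0.5) / n    # plotting positions
    emp_q = np.sort(samples)                   # empirical quantiles
    mod_q = dist.ppf(probs, *params)           # model quantiles at same probs
    r = np.corrcoef(emp_q, mod_q)[0, 1]
    return params, r, r * r

# Illustrative stand-in for one address' batch inter-arrival times.
rng = np.random.default_rng(1)
ia = rng.exponential(0.01, size=5000)
_, r_exp, r2_exp = fit_and_r2(ia, stats.expon, floc=0.0)
```

The same helper applies unchanged to the batch sizes with stats.genpareto in place of stats.expon.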
2) Network KPI Testing: The traffic of each address is fed as input to the SUT in order to evaluate the corresponding power consumptions and latencies. As in Section VI-C, the values obtained with the proposed model-based estimation and the actual measurements are compared in terms of absolute errors. Fig. 14a shows the average power consumption for the 5 IP addresses, while Fig. 14b shows the average system latencies. Tight confidence intervals and low absolute errors (i.e., ≈ 3% for power and ≈ 6% for latency) can be observed for both KPIs across all cases. Such accuracies for real-time network KPI estimation demonstrate how the proposed approach can be a powerful tool towards achieving the required scalability and sustainability levels in next-generation network/service management and orchestration.

VII. CONCLUSION
NFV is an emerging softwarization solution that brings flexibility and programmability through the software implementation of network functions (i.e., as VNFs) on COTS hardware. A number of issues surround the performance and energy efficiency of such virtual implementations, with respect to their physical counterparts. This work seeks to facilitate scalable and sustainable network/service management and orchestration mechanisms, through a novel model-based analytics approach for profiling VNF workloads, towards real-time estimation of network KPIs.
Particularly, the M^X/G/1/SET core model is considered to capture both the workload burstiness and the system setup times. A complete analytical characterization of the system is presented, on which the model-based analytics approach is built. Key model parameters are exposed from available and easily measurable PMCs in Linux host servers. In terms of generalizability, the proposed approach goes beyond current trends in ML-based analytics, where models are tightly coupled with the training data.
Experimental evaluations have been performed on a SUT equipped with Intel Xeon E5-2643 v3 3.40 GHz processors, with input traffic generated to emulate BMAP arrivals through scripting in an Ixia NX2 traffic generator, as well as with samples from Facebook's Web server cluster traces. Results show good estimation accuracies for both VNF workload profiling and network KPI estimation, with respect to the input traffic and actual measurements, respectively. This demonstrates how the proposed approach can be a powerful tool, not only for augmenting the capabilities of an NFVI's VIM, but also in the development of next-generation resource/service provisioning solutions.

APPENDIX A
BUSY PERIOD ANALYSIS
We adopt the approach presented in [63], decomposing the busy period B of an M^X/G/1/SET queue into two components: (a) the initial busy period B_τ, in which all the customers that arrived during the setup SET are served, and (b) the ordinary busy period B_X, which corresponds to the busy period of an M^X/G/1 queue, in which the batch initiating the setup and the rest that arrived while the core is busy are served.
Considering that the busy period density B(t) is given by the convolution of the probability densities B_τ(t) and B_X(t), its Laplace transform is simply obtained as the product B*(θ) = B*_τ(θ) B*_X(θ).

Let the random variable η_τ denote the number of batch arrivals during SET. Given that SET = t and η_τ = m, B_τ is distributed as the sum of the lengths of m independent ordinary busy periods B_X1, ..., B_Xm [49]. By first conditioning on SET and η_τ, and then averaging out, we obtain the Laplace transform of B_τ(t) as:

B*_τ(θ) = ∫_0^∞ Σ_{m=0}^∞ e^{−λt} (λt)^m/m! · E[e^{−θ(B_X1 + ... + B_Xm)}] f_SET(t) dt = SET*(λ − λB*_X(θ)),  (A.1)

where f_SET(t) and SET*(·) denote the density and the Laplace transform of SET, and E[e^{−θ(B_X1 + ... + B_Xm)}] = [B*_X(θ)]^m.

Similarly, let the random variables S_X1, X_1 and η_1 denote the service time of the initiating batch, the number of customers in this batch, and the number of batch arrivals during S_X1, respectively. Given that S_X1 = t, X_1 = j and η_1 = n, then, in the same way as before, B_X is distributed as the sum of the lengths of t and n independent ordinary busy periods. By conditioning on S_X1, X_1 and η_1, and proceeding as before, we obtain the Laplace transform of B_X(t) as:

B*_X(θ) = Σ_{j=1}^∞ β_j ∫_0^∞ Σ_{n=0}^∞ e^{−λt} (λt)^n/n! · E[e^{−θ[t + B_X1 + ... + B_Xn]}] · s_1(t) * ··· * s_j(t) dt = Σ_{j=1}^∞ β_j [s*(θ + λ − λB*_X(θ))]^j = X(s*(θ + λ − λB*_X(θ))),  (A.2)

where s_1(t) * ··· * s_j(t) is the j-fold convolution of the service-time density and X(·) is the PGF of the batch size.
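Equation (A.2) defines B*_X(θ) only implicitly, but it can be evaluated by fixed-point iteration: starting from b = 0 the iteration increases monotonically to the smallest root, which is the valid transform value for ρ < 1. A sketch for an assumed concrete case (exponential service, geometric batch sizes), cross-checked against the classical mean busy period E[B_X] = E[X]E[S]/(1 − ρ):

```python
def bx_lst(theta, lam, mu, p, iters=500):
    """Solve Eq. (A.2), b = X(s*(theta + lam - lam*b)), by fixed-point
    iteration, for exponential service (s*(u) = mu/(mu+u)) and geometric
    batch sizes (X(z) = p*z/(1-(1-p)*z), mean batch size 1/p)."""
    s = lambda u: mu / (mu + u)
    X = lambda z: p * z / (1.0 - (1.0 - p) * z)
    b = 0.0
    for _ in range(iters):
        b = X(s(theta + lam - lam * b))
    return b

# Cross-check: -dB*_X/dtheta at 0 equals E[B_X] = E[X]E[S]/(1-rho).
lam, mu, p = 100.0, 1000.0, 0.2     # rho = lam * (1/p) * (1/mu) = 0.5
h = 1e-6
eb_numeric = (1.0 - bx_lst(h, lam, mu, p)) / h   # B*_X(0) = 1
eb_closed = (1.0 / p) * (1.0 / mu) / (1.0 - 0.5)  # 0.01 s
```

The numerical derivative at the origin matches the closed form to well under 1%, which also checks the sign and scaling conventions of (A.2).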

APPENDIX B
SYSTEM STATE PROBABILITIES
The Probability Generating Function (PGF) P(z) of the number of customers in an M^X/G/1/SET queueing system at a random epoch can be expressed as the product P(z) = P_τ(z) P_X(z), with P_τ(z) being the PGF of the number of customer arrivals during the residual life of the vacation period (i.e., I + SET), and

P_X(z) = (1 − ρ)(1 − z) s*(λ − λX(z)) / [s*(λ − λX(z)) − z]  (B.1)

the well-known PGF of the number of customers in the ordinary M^X/G/1 system.
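As a numerical sanity check of (B.1), P_X(z) must behave as a proper PGF: P_X(0) equals the empty-system probability 1 − ρ, and P_X(z) → 1 as z → 1. A sketch with the same assumed exponential-service, geometric-batch example:

```python
def px(z, lam=100.0, mu=1000.0, p=0.2):
    """Eq. (B.1) for exponential service, s*(u) = mu/(mu+u), and geometric
    batch sizes with PGF X(z) = p*z/(1 - (1-p)*z) (mean batch size 1/p)."""
    X = p * z / (1.0 - (1.0 - p) * z)
    s = mu / (mu + lam - lam * X)           # s*(lam - lam*X(z))
    rho = lam * (1.0 / p) * (1.0 / mu)      # rho = lam * E[X] * E[S] = 0.5
    return (1.0 - rho) * (1.0 - z) * s / (s - z)
```

Both the numerator and the denominator vanish at z = 1, so the limit is checked just below 1 rather than at the removable singularity itself.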