Quality estimation models for gaming video streaming services using perceptual video quality dimensions

The gaming industry has been one of the largest digital markets for decades and is steadily developing, as evidenced by newly emerging gaming services such as gaming video streaming, online gaming, and cloud gaming. While the market is rapidly growing, the quality of these services depends strongly on network characteristics as well as resource management. With the advancement of encoding technologies such as hardware-accelerated engines, fast encoding has become possible for delay-sensitive applications such as cloud gaming. However, existing video quality models do not perform well for content encoded this way. Thus, in this paper, we provide a gaming video quality dataset that considers hardware-accelerated engines for video compression using the H.264 standard. In addition, we investigate the performance of signal-based and parametric video quality models on the new gaming video dataset. Finally, we build two novel parametric models, a planning model and a monitoring model, for gaming quality estimation. Both models are based on perceptual video quality dimensions and can be used to optimize the resource allocation of gaming video streaming services.


Introduction
Gaming has recently expanded its associated services by stepping into live streaming. Many players broadcast their gameplay to thousands of viewers, who passively watch the user-generated content on popular platforms such as Twitch.tv. While the game industry is growing, games are becoming more complex in terms of processing power. In order to play these high-end games, players are required to update their end devices every few years. One solution is to move the heavy processes such as rendering to the cloud and cut the need for high-end hardware on the customer side. Additionally, cloud gaming offers more flexibility to users by allowing them to play any game on any type of device which is capable of displaying a video and is equipped with a suitable input device. Apart from processing power, additional advantages of cloud gaming are its platform independence, i.e. a game can be played on every client operating system independently of the operating system of the server, and security against piracy, as the game content remains in the hands of the service providers. Although cloud gaming has been seen as a promising application on top of the IP network to grab a huge share of gaming industry revenue, so far only a few companies provide such a service. Cloud gaming as a real-time application suffers from the additional delay due to video encoding and decoding, end-to-end transmission delay, as well as other network constraints such as limited bandwidth and packet loss. The Quality of Experience (QoE) of customers can be negatively affected by these limitations.

MMSys '20, June 8-11, 2020, Istanbul, Turkey. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-6845-2/20/06. https://doi.org/10.1145/3339825.3391872

In order to optimize user satisfaction despite these challenges, strategies for resource and network management are highly necessary. As a ground truth for these strategies, subjective user ratings for various system configurations are required. Due to the high costs of subjective tests, the number of available datasets of gaming video quality, and with that also the number of proposed models, is very limited. Currently, typical quality models for video content are the only candidates for quality prediction of gaming video streaming services. However, video games are artificial and synthetic content, a property that can potentially be used to improve quality predictions. Furthermore, cloud gaming is a highly delay-sensitive service which requires the use of hardware-accelerated video encoding and decoding such as NVIDIA's encoder (NVENC). This encoder performs differently compared to software encoders [1]. In addition, due to the latency constraints of cloud gaming services, a special set of encoding parameters (e.g. the llhq and llhp modes of NVENC or a fixed macroblock size) might be used to reduce the compression delay. The differences in the video encoding process for cloud gaming compared to other video streaming applications raise the question whether currently available video quality models can be used for gaming content. This paper focuses on the development of video quality models for gaming content in a passive viewing-and-listening paradigm. It must be noted that for an overall gaming QoE model, other quality features related to the interactive nature of such a service also have to be considered.
In the following section, we first review the existing quality models for gaming and non-gaming content. In section 3, we present a large-scale gaming video dataset, which includes popular gaming content processed with state-of-the-art cloud gaming settings. The performance of popular signal-based and parametric video quality models is presented in sections 4 and 5, respectively. In section 6, we propose two parametric models for quality prediction of streamed gaming videos. Finally, the discussion and conclusion are presented in sections 7 and 8, respectively.

Related Work
In this section, we first review recent work on gaming quality assessment and prediction. Next, we review a classification of video quality models.

Gaming Quality Models
We have seen several attempts to build quality models that predict gaming QoE. While several quality dimensions construct gaming QoE [17], two important dimensions are considered in the recently published ITU-T recommendation G.1072 [26]: interaction quality and video quality. Interaction quality, referred to as input quality in [26], takes into account the technical factors that may influence the interaction of users with the game. In addition to the interaction quality, video quality is important, which refers to the visual aspect of the video stream. It has to be noted that gaming QoE is a multidimensional construct consisting of several dimensions such as immersion, flow, and presence. Some of these dimensions are mainly influenced by the game content itself, e.g. due to the game design, challenges, and rules, or by preferences of the player. However, from the perspective of a cloud gaming provider, who is not per se a content creator, these dimensions are not feasible to measure or track. Therefore, most of the state-of-the-art work models gaming QoE based on interaction and video quality.
With respect to interaction quality, several works have been done to model the impact of packet loss and delay on gaming QoE. Schmidt et al. [24] model the effect of delay on gaming QoE for two different games and showed that the delay sensitivity level of games could be different even between two levels of the same game.
Furthermore, a considerable amount of work has focused on the development of models that predict video quality based on encoding parameters or the video signal. Zadtootaghaj et al. [31] proposed an NR machine learning-based video quality metric named NR-GVQM for gaming content, which is focused on frame-level feature extraction. The idea was to collect low-level image features from each frame and train the model on the VMAF score. Göring et al. [7] proposed an NR metric called Nofu, which is a pixel-based video quality model for gaming content. Nofu uses 12 different frame-based values and a center-crop approach for the fast computation of frame-level features. It further pools the frame-level features at video level and feeds them to a machine learning-based model.
Finally, a few works consider both video and interaction quality. Among them, Zadtootaghaj et al. [34] investigate the effect of temporal and spatial video compression on gaming QoE and build a model based on bitrate and frame rate. Wang et al. [29] conducted a study investigating the effect of a wide range of influencing factors on perceived quality and modeled the quality based on the assessed factors. They proposed a model named GMOS, which is inspired by the ITU-T E-model [5] and considers seven influence factors: game type, codec, resolution, frame rate, PSNR, delay, and packet loss. Undoubtedly, the most comprehensive work has been done at ITU Telecommunication Standardization Sector (ITU-T) Study Group 12, which models gaming QoE based on several influencing factors ranging from encoding parameters and game types to network degradations. The result is published as the recommendation G.1072 [26]. This recommendation takes into account three quality dimensions for gaming QoE: spatial video quality, temporal video quality, and input quality (interaction quality). Since we focus on video quality, in this paper we only consider the video quality dimension of G.1072.

Classification of Video Quality Models
We can categorize objective quality assessment techniques according to different aspects. The first classification is based on the amount of source signal information that is needed to run the instrumental assessment. Typically, three types of objective quality assessment metrics are considered: no-reference models (NR), reduced-reference models (RR), and full-reference models (FR). NR methods require no knowledge of the original signal before transmission. RR methods use some extracted features of the source signal and predict the quality by combining this information with measurements of the received signal. FR methods have full access to the original signal and can compare it to the received video.
Alternatively, objective quality assessment models can be classified according to the level of information used in the network and the transmitted packets. Based on this approach, objective quality assessment metrics fall into four categories: planning models, bitstream models, signal-based models, and hybrid models.
Planning models take the assumed network and client parameters as input and calculate an expected quality score. Bitstream models, which are divided into two categories, use packet information to predict the quality. The first category of bitstream models predicts the quality from packet-header information (e.g., HTTP or RTP) without accessing the payload. Therefore, these models do not explicitly consider source and encoding information to evaluate the quality. Instead, they try to predict the quality by considering coding and IP network impairments. Examples of such models include the ITU-T P.1201 recommendation series [18].
The second category of the bitstream models uses the payload information. The quality prediction using bitstream payload-based models is not only achieved by application and transport layer information (e.g., HTTP/TCP or RTP/UDP), but also by extracting and analyzing content features from the coded bitstream. An example of this case is the ITU-T P.1202 and P.1203 family of recommendations [13].
Finally, in a signal-based model, quality is determined by analyzing the decoded media signal, and in hybrid models, decoded signals are combined with information from the bitstream or the packet layer. The above classification is illustrated in Figure 1 according to [19].
These classifications are important for QoS/QoE monitoring since the possibilities for the implementation of monitoring systems heavily depend on the amount of information that is available. For example, the use of a full-reference model for large-scale monitoring of an IPTV system would be practically unfeasible, as the amount of information that would have to be transmitted from the source to the monitoring points over the network is too large. For cloud gaming services, as the providers have access to the bitstream information, bitstream models are a practical choice: they are less complex compared to signal-based models while providing a decent prediction.

Dataset
There exist many video quality datasets, but to the best knowledge of the authors, only three video quality datasets use gaming content [2,4]. For delay-sensitive cloud gaming services, currently most providers (e.g. GeForce Now and Parsec) use the hardware-accelerated implementation of the H.264/MPEG-AVC standard. Due to the lack of available datasets using such a fast encoding setting, we created a new gaming video dataset called CGVDS (please contact the authors to get access to the dataset for private use). Compared to the three other gaming video datasets, CGVDS uses the hardware-accelerated implementation of H.264/MPEG-AVC (NVENC), a wider spread of video games, more encoding parameters, and videos captured at 60 fps. Table 1 gives a short overview of the three available datasets compared to the CGVDS dataset presented in this paper. In a series of subjective tests with over 100 participants, the influence of three important encoding parameters on the perceived video quality was assessed using 15 different games. The subjective tests were conducted as passive viewing-and-listening tests as described in ITU-T Rec. P.809 [10]. In order to avoid fatigue due to lengthy subjective tests, five studies were designed, each with 3 games, four different bitrate levels, and two or three resolution and frame rate levels, resulting in a total of 72 stimuli and a test duration of about one hour per study. Two studies were designed using three resolutions and two frame rate levels (60 and 30 fps), called part-1, while the three other studies used two different resolutions (1080p and 720p) and three frame rate levels, called part-2. This split was done in order to fit 15 games into five series of subjective tests using a within-subject design for at least a block of three games in each study. Each study was conducted with a minimum of 20 subjects. In order to link all these studies, three video sequences were added to each subjective test as anchor conditions.
Prior to the test, three video sequences are shown to the participants for training purposes and to ensure that the participants understand the used measurement scales (cf. Section 3.3). Table 2 gives a summary of encoding parameters used in the subjective test. In addition, Table 3 shows the selected bitrate per resolution-frame rate pairs.
GPU hardware-accelerated encoding (NVENC) was used, as current industry strategies apply this type of encoding to reduce the time for the encoding process and, with that, the overall round-trip delay for players. The reference gaming video sequences were captured losslessly in the RGB format at a frame rate of 60 fps. The sequences were recorded using Fraps and encoded with FFmpeg. Details about the subjective tests are described in the following sections. The subjective methodology used for the experiments is mainly based on the work summarized in ITU-T Rec. P.809 [10]; however, a few differences were applied in the procedure of the experiments.
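To illustrate such a hardware-accelerated, low-latency encoding setup, the following Python sketch builds an FFmpeg command line for NVENC. The chosen preset, rate-control mode, and parameter values are assumptions for illustration only, not the exact settings used to produce CGVDS.

```python
# Illustrative sketch: assemble an FFmpeg command for low-latency H.264
# encoding with NVIDIA's NVENC. Values are placeholders, not the CGVDS settings.
def nvenc_command(src, dst, bitrate_kbps, fps, width, height):
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "h264_nvenc",   # NVIDIA hardware-accelerated H.264 encoder
        "-preset", "llhq",      # low-latency high-quality preset
        "-rc", "cbr",           # constant bitrate for predictable delay
        "-b:v", f"{bitrate_kbps}k",
        "-r", str(fps),
        "-s", f"{width}x{height}",
        dst,
    ]
```

Such a command list can be handed directly to a process runner; keeping it as a list avoids shell quoting issues when file names contain spaces.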

Test Setup
For the subjective tests, the light conditions in the test rooms, its acoustical properties, as well as the viewing distance and position of participants were consistent for all experiments. In general, ITU-T Rec. P.910 [11] and P.911 [12] were considered. Participants were offered an adjustable chair and table to sit in a proper position (the chair allows participants to have their feet flat on the floor and their knees equal to, or slightly lower than, their hips). The viewing distance (D) was equal to three times the picture height (H).
For all five series of subjective tests, the following elements remained constant:
§ Participants were recruited as described in [10].
§ Test duration: maximum of 1 hour for each experiment.
§ Video presentation: LCD monitor, 24" display for standard gaming, HD1080 resolution, no G-Sync, no FreeSync (others see [12]).
§ Audio presentation: with the aim of modeling the video quality, no audio was presented to participants.
§ Participants were asked to sit in the center of the video display at the specified viewing distance.
Additionally, the following monitor settings were used:
§ Default values (auto) for luminance, contrast, and color shade of white.
§ Brightness was adjusted according to Rec. ITU-T P.910, but not the contrast (it might change the balance of the color temperature).
§ The gamma was set to 2.2 and the color temperature to 6500 K (default value on most LCDs).
§ The refresh rate of the monitor was 60 Hz.

Participants
Gaming QoE is highly influenced by many user factors. While it is not the aim of this study to accurately predict user types and their preferences, certain characteristics of the participants can be controlled, and the instructions for the participants should aim to maximize the reliability, validity, and objectivity of the results.

Test Methodology
In addition to video quality, three perceptual sub-dimensions of video quality (cf. Table 4) as well as an acceptance rating of video quality were assessed. Each dimension was explained to the participants in a written introduction using describing adjectives and in the form of example videos; here, typical gaming content was used. The rating scales are continuous, using the dimension name as an item label and antonym pairs to describe the range of the scales. An example for Discontinuity is given in Figure 2. For more information about the method, the reader is referred to [22], where insights about the usage of the different rating scales as well as the test procedure are given. The dimension discontinuity is considered as temporal video quality, whereas the remaining dimensions, fragmentation and unclearness, form the spatial video quality. The dimension luminosity, as proposed in ITU-T Rec. P.918, was not used, as the video content was not changed in this respect and its use would have increased the test duration unnecessarily. At the end of the test, a few questions about the importance of the different quality dimensions were asked in order to better investigate the judgement process of the participants.

Video Complexity
The selection of game content is a difficult challenge, as the research community still lacks a consensus on a suitable content classification which also takes the influence of network impairments or encoding settings into account.
The importance of game characteristics for QoE has been studied and established in the work presented in [24][32]. These characteristics were considered in the selection of material for the dataset. Figure 3 shows how the subjective ratings are distributed for the 15 selected video sequences of CGVDS at 1080p and 60 fps, which illustrates the proper selection of video games based on their complexity. We used the classification proposed in [32] to determine the complexity of the video games. Three game characteristics are considered for the game classification: Static Area (SA), Degree of Freedom (DoF), and Amount of Camera Movement (ACM). ACM describes how often the camera is moving, which plays an important role in the temporal complexity of a video. DoF refers to the movement of the camera which, e.g. when attached to a character, can have up to six degrees of freedom. SA is defined as the ratio of static pixels between two consecutive frames to the overall number of pixels of a frame. We refer the reader to [32] for more details on how to classify games based on these three characteristics.
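The Static Area definition above can be sketched directly. The following is a minimal NumPy implementation under one assumption: [32] does not fix a per-pixel difference threshold here, so the `threshold` parameter is a hypothetical choice for deciding when a pixel counts as static.

```python
import numpy as np

def static_area(frame_a, frame_b, threshold=2):
    """Static Area (SA): ratio of pixels that stay (nearly) unchanged between
    two consecutive frames to the total number of pixels. The difference
    threshold is an assumption, not a value prescribed by [32]."""
    diff = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32))
    if diff.ndim == 3:                  # collapse color channels, if present
        diff = diff.max(axis=2)
    return float((diff <= threshold).mean())
```

An SA near 1.0 indicates a mostly static scene (low temporal complexity), while an SA near 0.0 indicates heavy motion.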

Performance of Signal-Based Models
We evaluated a total of nine video quality assessment (VQA) metrics on the dataset as follows: Peak Signal to Noise Ratio (PSNR) is the most widely used VQA metric and relies on the computation of the logarithmic difference between corresponding pixels in the original and the impaired frame. Structural Similarity Index Metric (SSIM) measures the structural similarity between two images and usually provides better video quality predictions than PSNR.
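As a concrete reference point for the simplest of these metrics, the following is a minimal NumPy implementation of per-frame PSNR for 8-bit content:

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """PSNR between a reference and an impaired frame (8-bit by default):
    10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")             # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

Video-level PSNR is then typically obtained by averaging the per-frame values over the sequence.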

Multi-Scale Structural Similarity Index Metric (MS-SSIM)
is a refined version of SSIM which considers variations of the viewing conditions. Video Multi-Method Assessment Fusion (VMAF), developed by Netflix, fuses three different metrics to obtain a single score as the estimate of the video quality [35]. Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) uses locally normalized luminance coefficients to quantify the possible loss of "naturalness" [15]. Natural Image Quality Evaluator (NIQE), which is based on a space-domain natural scene statistics (NSS) model, is a learning-based quality estimation metric [16].

Perception based Image Quality Evaluator (PIQE)
is an NR metric that uses cues from the human visual system to build a blind image quality model [28].

No Reference Gaming Video Quality Metric (NR-GVQM):
An NR machine learning-based video quality metric for gaming content which is trained on low-level image features with the aim of predicting VMAF without having access to a reference video [31].
NDNetGaming (NDG) is a deep learning-based model built on a prominent Convolutional Neural Network (CNN) architecture called DenseNet [8]. This metric is trained on gaming content and has shown a very high correlation with subjective tests for gaming content [27].
For the computation of PSNR and SSIM, we used the VQMT tool available in [12]. For the VMAF calculation, we used the Linux-based implementation made available by the developers in [35]. The results are summarized in Table 5. Since most of the metrics are image-based quality metrics and we have different frame rate levels, we evaluated the performance of the metrics once only at 60 fps and once on all frame rate levels. Surprisingly, for some metrics the performance goes up if all frame rates are considered. We believe that this is due to more data being available for the calculation of the correlation. We also observed that the three NR metrics, PIQE, NIQE, and BRISQUE, perform poorly in the presence of blurriness, which is in line with the findings of [3]. It can be observed that NDG outperforms all other metrics. However, it should be noted that NDG is a deep learning-based model which is trained on the gaming datasets GVSET and KUGVD.
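The performance figures above rest on two standard statistics, the Pearson linear correlation coefficient (PLCC) between metric scores and MOS, and the root mean square error (RMSE). Both can be sketched in a few lines:

```python
import math

def plcc(x, y):
    """Pearson linear correlation coefficient between predictions and MOS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def rmse(x, y):
    """Root mean square error between predictions and MOS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

Note that PLCC measures how well the prediction follows the shape of the MOS distribution, while RMSE also penalizes a constant over- or underestimation; this distinction matters for the P.1203 results discussed later.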

Performance of Parametric Models
In the remainder of this section, the performance of three standardized video quality models, ITU-T G.1071, G.1072, and the P.1203 series, when applied to CGVDS will be described.

Classification of Video Games
Gaming videos are diverse in terms of video complexity in both the temporal and the spatial domain. Depending on the design of the game, a game could be abstract, which leads to lower spatial complexity, or designed with highly complex textures, which leads to higher spatial complexity. Also, depending on the camera movement and camera position (e.g. third-person perspective vs. omnipresent), the temporal complexity might be affected [32][6]. Without knowledge or assumptions about the video complexity of games, it would be difficult to build a planning model that can accurately predict the video quality. Thus, planning models (G.1071, G.1072, and P.1203 mode 0) suffer strongly from the varying complexity of video games due to the lack of information about the video complexity. However, for most games, the complexity can be determined by investigating the game design. Thus, there have been efforts in the research community [6,32] and in standardization bodies [26] to discover the characteristics that make a game complex in terms of encoding complexity. In this paper, we follow the classification proposed in [32] for the development of the planning model, while the proposed monitoring model does not require any classification.

ITU-T Planning Models
ITU-T Rec. G.1071 Annex A provides models for estimating the quality experienced by a user of audio and video streaming services. The models are designed to be network planning tools; hence they only use properties of the network as parameters. Since the model was trained on non-gaming datasets, without using hardware-accelerated engines and with no knowledge of the game complexity, it was retrained to build a game-specific model, presented in G.1072, which focuses on predicting the QoE of cloud gaming services based on a huge gaming dataset. Therefore, both ITU-T planning models, G.1071 and G.1072, use the same structure to predict the video coding impairment.
The video coding impairment module takes six input arguments, of which two are set constant (slicesPerFrame = 0, TSburstinessV = 0) because they are not present in the given dataset. The remaining parameters, bitrate, number of pixels per frame, and frame rate, are used to estimate the experienced quality of a video that is transmitted with these characteristics. The module further depends on 21 coefficients, of which only 7 are needed for training on the dataset. ITU-T Rec. G.1072 uses the same equation for the prediction of the coding impairment, but the model is retrained on a huge gaming video dataset built using a hardware-accelerated engine and based on the three complexity classes. Table 6 shows the change of the coefficients. Significant changes of values can be seen especially for a3v, which is a balancing factor for the calculation of the content complexity (equation 2). This may be taken as an argument for changing the calculation of the content complexity for gaming content. In G.1072, the high-complexity class is considered the default mode in case the game complexity class cannot be determined.
The model predicts the video impairment based on an exponential function of a parameter defined as the average number of bits assigned per pixel of a video sequence, as calculated in equation 2, in which the number of pixels is the product of the width and height of the video resolution. It must be noted that both models considered in this section, ITU-T G.1071 Annex A and G.1072, follow an impairment-factor approach similar to the well-known E-model. Therefore, the video quality ratings are transformed from a 5-point ACR (absolute category rating) scale to the R-scale, which ranges from 0 to 100, using the transformation presented in [14]. Since no transmission errors are present in the dataset, only the coding impairment factor, which estimates the video quality impairment due to video compression artefacts on the R-scale, is considered for the evaluation of the performance of the models, and not the video quality impairment due to video transmission errors. This results in the following implementation of the models, which only differ in the coefficients trained on the different training datasets.
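The driving parameter and the shape of the coding impairment can be sketched as follows. This is only an illustration of the structure described above: `a1` and `a2` are placeholder coefficients, not the trained values of Table 6, and the actual G.1071/G.1072 formula additionally involves the content-complexity balancing factor a3v.

```python
import math

def bits_per_pixel(bitrate_kbps, framerate, width, height):
    """Average number of bits assigned per pixel of the video sequence,
    the driving parameter of the coding-impairment term (cf. equation 2)."""
    return bitrate_kbps * 1000.0 / (framerate * width * height)

def coding_impairment(bpp, a1, a2):
    """Illustrative exponential coding impairment on the R-scale.
    a1, a2 are placeholders, not the trained coefficients of Table 6."""
    return a1 * math.exp(-a2 * bpp)
```

The impairment decreases as more bits per pixel become available, which matches the intuition that higher bitrate at fixed resolution and frame rate reduces compression artefacts.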

Performance of ITU-T G.1071 and G.1072
The performance evaluation results of G.1071 and G.1072 on the presented dataset in terms of PLCC as well as Root Mean Square Error (RMSE) are shown in Table 7. As can be observed from Table 7, the classification used to retrain G.1072 improves the performance over G.1071. We observe that for both models the performance for the low-complexity class is very low, while G.1072 has the highest performance for high-complexity video games. It must be noted that both ITU-T models are planning models and, in general, they are not supposed to have access to video complexity information. However, network planners and service providers can make an assumption about the type of games they might stream; therefore, such a classification can be assumed in the prediction process.
Table 6: Coefficients of the G.1071 and G.1072 impairments for each encoding complexity, in which classes 1, 2, and 3 represent the low, medium, and high complexity classes, respectively.

ITU-T Rec. P.1203
ITU-T P.1203 is a series of models designed to predict the audio-visual quality experienced by a user of adaptive and progressive-download-type media streaming; the models can be placed at end points or at mid-network monitoring points. ITU-T Rec. P.1203 has four modes, depending on the level of access to the information due to encryption, as shown in Table 8. Mode 0 of ITU-T Rec. P.1203 can be considered another potential planning model candidate for predicting the gaming video quality. We investigate the video quality module of Mode 0, which only uses bitrate, frame rate, and coding resolution. Equation 5 shows the function that predicts the video quality of Mode 0.
Due to the change of coefficients for the different modes and limited space, we do not provide them in detail in this paper. For Mode 1, however, the coefficients are updated, and the frame size is used instead of the video bitrate. It must be noted that the content complexity is taken into account based on the ratio of the average sizes of I-frames and non-I-frames. In addition, the MOS is predicted using a sigmoid function.
Mode 2 can access 2% of the media stream. Therefore, the quant parameter is computed with higher accuracy using the quantization parameters (QP): it is based on the QP values in the bitstream for the non-I-frames. The MOS is then predicted based on equation 5.
Mode 3 has access to the full bitstream, in which the QP values are parsed for all frames in the measurement window, and uses the same equations as Mode 2. The main difference between Mode 2 and Mode 3 is the amount of information from the bitstream that can be accessed for training and predicting the video quality. In Mode 2, due to the limited access to the bitstream, some QP values are estimated based on the neighboring frames, while in Mode 3 all QP values are used in training and prediction. The results for these modes as well as new fits for Mode 0 are presented in Table 9. It has to be noted that results for Mode 2 are not provided in Table 9, as the implementation of Mode 2 was not available to the authors. It can also be seen that Mode 0 has a higher correlation than Mode 1, while its RMSE is significantly higher. This implies that the Mode 0 prediction follows the video quality distribution slightly better than Mode 1 but overestimates the video quality significantly more than Mode 1. In general, both modes overestimate the video quality due to the higher bitrate that NVENC, in the presence of llhq, requires to reach a certain quality compared to a software implementation of H.264.

Perceptual Dimension based Gaming Video Quality Model
In this section, we propose two models, a planning model and a monitoring (bitstream) model, that predict the overall gaming video quality based on the perceptual video quality dimensions Fragmentation, Unclearness, and Discontinuity presented in section 3.3. The planning model solely uses the encoding parameters bitrate, frame rate, and coding and display resolution, while the monitoring model, in addition to the encoding parameters, uses packet-header information such as the sizes of I- and P-frames to predict the perceptual dimensions of video quality. The latter follows a similar approach to ITU-T Rec. P.1203 Mode 1 in terms of the level of access to the data. We believe that such a structured model allows a service provider to update the model more easily when new parameters or a wider range of parameters become relevant. Moreover, such a model can act as a diagnostic model that explains which degradation causes a potentially low video quality.
Both proposed models share the same impairment factor structure which is inspired by the E-model [20] as illustrated in equation 8.

Data Preparation
Before starting the modeling phase, we applied a few steps to prepare the data. As mentioned in Section 3, we followed ITU-T Rec. P.809 for the quality assessment of video games. Thus, the ACR method with hidden reference was used in the subjective tests to collect ratings on a 7-point continuous scale according to [10].
In order to derive suitable impairment factors for the modeling, the extended continuous 7-point ratings had to be transformed to the R-scale (cf. ITU-T Rec. G.107 [20]), which ranges from 0 to 100.
This was done in the following manner:
- Transformation of the extended continuous 7-point (EC) ratings to 5-point ACR ratings.
- Mapping of the resulting MOS values to the R-scale. The highest MOS of a video quality rating (MOS_max) in the dataset is 4.58, which corresponds to an R_max of 100 in Eq. 8.
To train the models presented in the following sections, the full CGVDS dataset containing 15 games for various encoding settings was split into a training and a test dataset. For the test dataset, we randomly selected three source video sequences (one per complexity class). The remaining games were used as the training dataset to derive the model equations.
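The two-step rating transformation can be sketched as follows; both linear mappings are simplifying assumptions (the cited recommendations define the exact transformations), with the R-scale anchored so that the dataset's highest MOS of 4.58 maps to 100:

```python
def ec7_to_acr5(ec):
    """Hypothetical linear mapping from the extended continuous 7-point
    scale (assumed to span 1..7) to the 5-point ACR scale (1..5)."""
    return 1.0 + (ec - 1.0) * 4.0 / 6.0

def mos_to_r(mos, mos_max=4.58):
    """Hypothetical linear mapping of a 5-point MOS to the R-scale (0..100),
    anchored at the dataset maximum MOS of 4.58 -> R = 100."""
    return 100.0 * (mos - 1.0) / (mos_max - 1.0)
```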

Planning Model
The planning model predicts the video quality based on the impact of compression impairments and targets cloud gaming services or passive gaming video streaming encoded using NVENC. The model is a network planning tool which can be used by network and cloud gaming providers for purposes such as resource allocation and the configuration of IP-network transmission settings, such as the selection of resolution and bitrate. Each dimension is predicted solely from the parameters relevant to that dimension, which also reduces the number of model parameters.

Impairment of Video Discontinuity (IVD):
The video discontinuity dimension measures the perceived jerkiness of a video due to a low encoding frame rate. Since VD is only triggered by the frame rate, we modeled it with an exponential function of the frame rate, which gave the best fit on our data. Two classes are derived based on the ITU-T contribution [25], which suggests using information about the characteristic movements of the virtual camera, the frequency of game object movements, and the pace of interaction with the game to classify game content with respect to frame losses. If no information about the class is available, the high-complexity class should be used as the default. IVD is predicted using equation 12 with the coefficients per class presented in Table 10.
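The exponential form can be sketched as follows; the per-class coefficients here are hypothetical placeholders, not the fitted values from Table 10:

```python
import math

# Hypothetical coefficients per game complexity class
# (the fitted values are reported in Table 10).
IVD_COEFFS = {"high": (80.0, 0.08), "low": (60.0, 0.10)}

def impairment_vd(fr, game_class="high"):
    """Video Discontinuity impairment as an exponential function of the
    encoding frame rate (structure of Eq. 12): it decays as fr grows."""
    a, b = IVD_COEFFS[game_class]
    return a * math.exp(-b * fr)
```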

Impairment of Video Fragmentation (IVF):
The video fragmentation dimension is affected by the bitrate, resolution, and frame rate. Therefore, we define a variable bpp, which describes the average number of bits spent per pixel:

bpp = bitrate / (fr · coding width · coding height) (13)

Three classes are derived based on the ITU-T contribution [25], which suggests using information about the characteristic movements of the virtual camera, the texture details, and the frequency of movements of game objects to classify game content with respect to its encoding complexity. If no information about the class is available, the high-complexity class should be used as the default. IVF is predicted using equation 14 with the coefficients per class presented in Table 11.
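The bits-per-pixel variable, together with an exponential sketch of how the fragmentation impairment could decay with it, can be written as follows; the coefficient values are hypothetical (the fitted per-class values are in Table 11), and the exponential form of Eq. 14 is an assumption:

```python
import math

def bits_per_pixel(bitrate_bps, fr, width, height):
    """Average number of bits spent per coded pixel (Eq. 13)."""
    return bitrate_bps / (fr * width * height)

# Hypothetical per-class coefficients (fitted values: Table 11).
IVF_COEFFS = {"high": (90.0, 35.0), "medium": (75.0, 40.0), "low": (60.0, 45.0)}

def impairment_vf(bpp, game_class="high"):
    """Fragmentation impairment shrinks as more bits are spent per pixel;
    this exponential form is a sketch, the published Eq. 14 may differ."""
    a, b = IVF_COEFFS[game_class]
    return a * math.exp(-b * bpp)
```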

Impairment of Video Unclearness (IVU):
The video unclearness dimension is mainly affected by upscaling impairments. However, strong blockiness can also be perceived as unclearness. Thus, we used the frame rate and bitrate in addition to the resolution of the video for the prediction of VU. Therefore, we define a variable sr, which describes the ratio of the encoding resolution to the display resolution.
Core model: The core model, as given in equation 8, follows the structure introduced at the beginning of Section 6.

Performance of Proposed Planning Model
The goodness of fit for the prediction of each dimension is reported in Table 13 in terms of RMSE and adjusted R-squared. It has to be noted that the RMSE is reported on the R-scale (ranging from 0 to 100). The performance of the core model is reported on the MOS scale using the test dataset. Table 14 reports the performance of the proposed planning model on the training and test datasets.

Bitstream Model
Bitstream models can be used by both service and network providers to monitor the quality of the streamed videos of a service. Cloud gaming providers can benefit from bitstream models for live quality monitoring. As ground truth, VQ_i denotes the measured quality of frame i based on NDNetGaming, and VQ_video denotes the final predicted video quality of a video based on NDNetGaming.
Lastly, the MOS of an interval from frame i to frame j of a video sequence is calculated as the average of the per-frame qualities:

MOS_{i,j} = (1 / (j − i + 1)) · Σ_{k=i}^{j} VQ_k

where MOS_video denotes the Mean Opinion Score of the whole video sequence. It must be noted that MOS does not only refer to the MOS of video quality; the same interval-based approach was used for the prediction of the MOS of VF and VU. In order to build the model, we extract multiple features from each 5-second interval of each video sequence. The following features are used for training the model:
- fr: frame rate
- sr_video: based on Equation 15
- br_avg: average bitrate of the five-second interval
- num_I-frame: number of I-frames
- br_avg_I: average I-frame bitrate
- stat_P-frame: statistics of the P-frame sizes (average, standard deviation, etc.)
- std_P-frame: standard deviation of the P-frame bitrate
- CP_video: video complexity, estimated, as in P.1203, as the ratio of the average I-frame size to the average P-frame size in the measurement window
CP_video estimates the spatio-temporal complexity of different gaming video contents based on the ratio of the I-frame and P-frame sizes in the measurement window. It must be noted that, in case of no I-frame in the 5-second window, the last I-frame from the previous windows of the same video was used. We build the model based on the perceptual video dimensions.
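A sketch of the per-window feature extraction, under the assumptions that frame types and sizes have been parsed from the packet headers and that `last_i_size` carries the most recent I-frame size from earlier windows (all names here are ours):

```python
from statistics import mean, stdev

def window_features(frames, fr, last_i_size=None):
    """Extract bitstream features from one 5-second measurement window.
    frames: list of (frame_type, size_bytes) tuples from the packet headers.
    last_i_size: size of the most recent I-frame from a previous window,
    used as a fallback when the current window contains no I-frame."""
    i_sizes = [s for t, s in frames if t == "I"]
    p_sizes = [s for t, s in frames if t == "P"]
    if not i_sizes and last_i_size is not None:
        i_sizes = [last_i_size]
    duration_s = len(frames) / fr  # assumes a constant frame rate
    return {
        "fr": fr,
        "br_avg": 8 * sum(s for _, s in frames) / duration_s,  # bit/s
        "num_I_frame": sum(1 for t, _ in frames if t == "I"),
        "br_avg_I": mean(i_sizes) if i_sizes else 0.0,
        "std_P_frame": stdev(p_sizes) if len(p_sizes) > 1 else 0.0,
        "CP_video": mean(i_sizes) / mean(p_sizes) if i_sizes and p_sizes else 0.0,
    }
```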
Similar to the proposed planning model, the perceptual-dimension-based bitstream model first predicts the impairments of VF, VU, and VD and then predicts the VQ impairment based on the core model, equation 8. Each perceptual dimension is trained on the relevant extracted information using Random Forest regression. The features used for training and prediction of each dimension are listed as follows:
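Per-dimension training could look like the following scikit-learn sketch; the feature subsets follow the lists given for each dimension, while the data layout and hyperparameters are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

# Feature subsets per perceptual dimension (see the per-dimension lists).
DIMENSION_FEATURES = {
    "IVD": ["fr", "std_P_frame", "CP_video"],
    "IVF": ["fr", "std_P_frame", "CP_video", "sr", "sr_video",
            "stat_P_frame", "br_avg"],
    "IVU": ["fr", "std_P_frame", "CP_video", "sr", "sr_video",
            "stat_P_frame"],
}

def train_dimension_models(columns, targets):
    """Fit one Random Forest regressor per dimension on its feature subset.
    columns: dict feature name -> list of per-window values;
    targets: dict dimension name -> list of impairment values."""
    models = {}
    for dim, feats in DIMENSION_FEATURES.items():
        X = list(zip(*(columns[f] for f in feats)))
        rf = RandomForestRegressor(n_estimators=50, random_state=0)
        rf.fit(X, targets[dim])
        models[dim] = rf
    return models
```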

Impairment of Video Discontinuity (IVD):
Video Discontinuity is the perception of jerkiness in the video due to a low frame rate or frame loss, and is affected by the frame rate and the temporal complexity of a video. Among the extracted features, the two that can be expected to represent the temporal video complexity are std_P-frame and CP_video. Therefore, we used fr, std_P-frame, and CP_video for the prediction of IVD.

Impairment of Video Fragmentation (IVF):
VF can be affected by all extracted features. Therefore, we used all extracted features to train the model that predicts IVF, and applied the Recursive Feature Elimination (RFE) method to eliminate those that do not significantly contribute to the model. After applying RFE, the following relevant factors remained in the IVF model: fr, std_P-frame, CP_video, sr, sr_video, stat_P-frame, br_avg.

Impairment of Video Unclearness (IVU):
Video Unclearness is mainly affected by the scaling ratio. However, as mentioned before, VF can also cause video unclearness. Therefore, all features were added to the model, similar to IVF, and the RFE method was again applied to keep only the relevant features. After applying RFE, the following relevant factors remained in the IVU model: fr, std_P-frame, CP_video, sr, sr_video, stat_P-frame.
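The RFE step can be sketched with scikit-learn as follows (the estimator choice and the number of features to keep are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def select_features(X, y, feature_names, n_keep):
    """Recursively drop the least important feature (by Random Forest
    feature importance) until n_keep features remain; return their names."""
    selector = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
                   n_features_to_select=n_keep)
    selector.fit(X, y)
    return [f for f, kept in zip(feature_names, selector.support_) if kept]
```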
Core model: we used equation 8 with updated coefficients, as given in equation 20. In order to fairly train and test the model, we split the dataset into five bins, each containing three randomly selected games. The model was trained five times; each time, one bin was held out and the model was trained on the remaining videos. The performance of the model is calculated as the average over the five training runs. Figure 4 shows the performance of the model on all videos in a scatter plot. The model achieves an RMSE of 0.34 and a PLCC of 0.90.

R_VQ = R_max − 0.243 · I_VF − 0.412 · I_VU − 0.434 · I_VD (20)
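The five-bin game split used for this hold-one-bin-out evaluation can be sketched as follows (game identifiers are placeholders):

```python
import random

def five_bin_split(games, seed=0):
    """Randomly partition 15 games into five bins of three, so the model
    can be trained five times, each time holding one bin out."""
    games = list(games)
    random.Random(seed).shuffle(games)
    return [games[i:i + 3] for i in range(0, len(games), 3)]
```

Splitting by game (rather than by video) keeps all conditions of a source sequence on the same side of the split, avoiding content leakage between training and test.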

Discussion
The two proposed models are lightweight models that can be used for the video quality assessment of cloud gaming services for planning and monitoring purposes. Since cloud gaming providers have access to the encoding settings as well as the bitstream information, we believe that these models are more suitable than signal-based models, which require substantial computation and storage. In addition, network planners and providers can benefit from these two models for planning purposes as well as for monitoring of gaming quality. The monitoring model predicts quality in 5-second intervals, which allows cloud gaming providers to monitor the QoE live.
In this work, we only assessed the video quality in passive viewing-and-listening tests. However, gaming QoE in cloud gaming services is determined not only by video quality; interaction quality also plays a role. While gaming QoE is beyond the scope of this paper, it is shown in [23] that passive viewing-and-listening tests can be used to predict the gaming QoE when only the spatial video quality is affected.
The proposed planning model relies strongly on knowledge of the game complexity class. It has been shown in ITU-T Rec. G.1072 as well as in this paper that, without such a classification, the model suffers strongly from the variation of the spatio-temporal video complexity. Therefore, we followed the game classification proposed in [33], which is also used in ITU-T Rec. G.1072. Similar to ITU-T Rec. G.1072, the high-complexity class is used as the default mode.

Conclusion
In this paper, we present the first large-scale video quality dataset for gaming content based on hardware-accelerated encoding settings. This dataset can be used for building gaming video quality models, which can be employed in the quality assessment of cloud gaming services. In the presented dataset, we evaluate three perceptual quality dimensions in addition to the overall video quality, which can help the research community to build models for specific types of distortion, such as blurriness or blockiness estimation.
In addition, we investigate the performance of well-known signal-based and parametric models on gaming content, which helps the research community in the selection of proper quality models for different purposes. Finally, we build two parametric models for planning and monitoring purposes based on the perceptual video dimensions. The models outperform the existing quality models; however, we did not retrain the existing models on the new dataset, as implementations for training them were not available. The dimension-based approach is new and offers promising results and analytic insights, while adding only minor errors due to the two-phase modeling.