(2018). Edge dissimilarity reduced-reference quality metric with low overhead bitrate. Indonesian

Abstract


INTRODUCTION
Video quality assessment is crucial in everyday life, as it is a key function in visual processing and in the interaction between humans, machines and systems. The ultimate end-users are humans; hence, it is ideal to have a perceptual video quality assessment that reflects the requirements and satisfaction of human end-users. However, subjective assessment is time-consuming, unsuitable for real-time applications and restrictive in nature, as it requires a standardized environment. Therefore, an automatic, computable, quantitative measure of visual quality is needed to evaluate visual content and to assist any further visual processing.
ISSN: 2502-4752, Indonesian J Elec Eng & Comp Sci, Vol. 9, No. 2, February 2018: 631-640
The aim of this paper is to provide automated quality-of-experience indices that reflect human visual perception and can be used for quality control. The challenge is to formulate an objective measure of video information content that is consistent with its subjective counterpart. To be consistent, the assessment needs to be sensitive to the underlying substantial intensity dissimilarities in an image or video. Knowledge of the human visual system (HVS) can be used in video processing by recognizing the characteristics of human perception. HVS characteristics may be divided into physical structure, visual perception and image processing theories [1]. The two most common theories are the line-edge detection theory and the spatial frequency theory. The line-edge detection theory is based on the finding that cells in the primary visual cortex have areas of excitation and inhibition that respond to a luminance edge as well as to bright or dark lines, and thus detect edges and lines, respectively. This finding is the basis for the proposed metric, which relies on edges as well as the shapes of objects. In this work, a novel Edge Dissimilarity Reduced-Reference (EDIRR) metric for reduced-reference video quality assessment with low overhead bitrate is proposed. The metric is motivated by the observation that human perception understands an image mainly through its low-level features, specifically the edges, which measure the significance of a local structure.
Detecting the distortion level of received videos accurately is important for measuring the visual quality of video transmitted over unreliable wireless channels. During acquisition, pre- or post-processing, compression, transmission and storage, video frames may be altered by various artifacts or noise, which are perceived as distortion that degrades the visual quality [2]. The artifacts can be divided into two types: compression artifacts and transmission errors. In this paper, both types of errors are introduced into the test sequences, since a diverse set of distorted test sequences is important. The last frames in each Group-of-Pictures (GOP) of the reference and distorted sequences are used as inputs. The system outputs a value that quantifies the quality of the distorted visual signal.
In wireless video transmission networks, the available bandwidth is rather limited. Moreover, when a reduced-reference metric is applied to the transmitted video, side information extracted from the reference video needs to be sent over an ancillary channel. It is therefore pertinent to keep the overhead bitrate of the side information low. In [3], a novel EDIRR metric was proposed, and its assessment was found to correlate well with the DMOS obtained from LIVE. However, since the overhead bitrate of that metric is not sufficiently low (2.8 Mbps), this paper investigates compressing the side information and evaluates how well the metric performs after compression. The compression performance is investigated for different combinations of spatial and temporal resolutions. The resulting algorithm allows an effective reduced-reference video quality assessment at a much lower overhead cost.

RESEARCH METHOD
The proposed RR metric compares the edge information of the original and distorted sequences in terms of structural distortion. Edge degradation can be detected in this manner because it is strongly associated with the structural edges. The edge-based distortion measure is developed by assessing the decoded grey-scale images. Low-level, visually important or salient image features such as the edge intensity carry image information that is perceptually important. Based on this observation, the edge information is incorporated into the RR dissimilarity measure to develop the video quality index EDIRR: Edge-based Dissimilarity Reduced-Reference metric.
The test sequences are obtained from the Laboratory for Image and Video Engineering (LIVE) Video Database [26], and the proposed metric is tested against the subjective quality score, the Differential Mean Opinion Score (DMOS), provided by the subjective study carried out by LIVE [27,28]. The DMOS values range from 0 to 100, where a smaller value indicates better quality and a larger value indicates worse quality; they were collected using the subjective test model specified in ITU-R BT.500-11 [29][30][31]. The subjective study was conducted using a single-stimulus procedure with hidden reference removal, and the subjects indicated the quality of each video on a continuous scale. Subjects also viewed each of the reference videos to facilitate the computation of difference scores using hidden reference removal. From the database, 80 distorted video sequences derived from 10 different high-quality reference videos with a wide variety of content were obtained.
A set of 80 distorted videos covering two distortion types is tested: H.264 compression and simulated transmission of H.264 compressed bitstreams through error-prone wireless networks, as these types of distortion are the most relevant to the work in this paper. The diversity of distortion types tests the ability of the proposed objective model to predict visual quality consistently across distortions. H.264 compression produces fairly uniform spatial and temporal distortions in the video. Network losses, however, cause distortions that are transient both spatially and temporally. The H.264 compressed videos exhibit typical compression artifacts such as blur, blocking, ringing and motion compensation mismatches around the edges of the main body in the frame. Videos subjected to wireless transmission errors exhibit errors that are restricted to small regions of a frame. Because of the small packet sizes, errors sustained by an H.264 compressed video stream in a wireless environment are spatio-temporally localized; they are temporally transient and appear as glitches in the video. A packet transmitted over a wireless channel is susceptible to transmission errors due to factors such as shadowing, attenuation, fading and multi-user interference.
All ten uncompressed high-quality YUV sequences have a resolution of 768 x 432 pixels. Each sequence was assessed by 29 valid human subjects in a single-stimulus study in which the scores were given on a continuous quality scale. The DMOS from the subjective evaluations are compared with some of the similarity measures. In this work, all frames are used to determine the most suitable similarity measure. However, only the last frames in each GOP of the reference and received sequences serve as inputs to the proposed quality assessment system. This keeps the overhead bitrate as low as possible and makes the approach practical and realistic for real-time wireless transmission in a multicast network scenario. The system outputs a value that quantifies the quality of the distorted video. The LIVE Video Database has been evaluated by many researchers and verified with various objective performance metrics.
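The selection of the last frame in each GOP can be illustrated with a minimal Python sketch. The GOP size is not fixed in the text, so `gop_size` is a hypothetical parameter, and both function names are illustrative rather than part of the proposed system.

```python
def last_frame_indices(num_frames, gop_size):
    """Return the 0-based index of the last frame in each complete GOP."""
    if gop_size <= 0:
        raise ValueError("gop_size must be positive")
    return list(range(gop_size - 1, num_frames, gop_size))


def select_rr_inputs(reference, received, gop_size):
    """Pair the last frame of each GOP from the reference and received sequences.

    Both sequences are lists of frames; only the frames common to both
    (by index) are considered.
    """
    n = min(len(reference), len(received))
    return [(reference[i], received[i]) for i in last_frame_indices(n, gop_size)]
```

Only these frame pairs would then be passed to the edge extraction and dissimilarity steps, which is what keeps the side information small.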

RESULTS AND METHOD

Edge Detector Evaluation Methodology
Ideally, applying an edge detector yields a set of connected curves that indicate the boundaries of objects while preserving the important structural properties of an image. However, it is not always possible to obtain such ideal edges from real-life images of moderate complexity. In practice, edges extracted from non-trivial images are often fragmented: the edge curves are not connected, edge segments are missing, and false or weak edges appear that do not correspond to meaningful phenomena in the image. This complicates the subsequent interpretation of the image data. For standardisation and simplification, both the Canny and Sobel edge detectors are implemented throughout this paper, as opposed to only the Sobel edge detector used in [3].
Figure 1 depicts the process flow for filtering the edges and computing the quality metric so that, after several sub-sampling and image processing steps, the metric score remains correlated with the subjective quality assessment from the LIVE database.
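As an illustration of the edge extraction step, the following minimal Python sketch computes a binary Sobel edge map from a grey-scale frame using the standard 3x3 Sobel kernels. The threshold value and the border handling are assumptions made for this sketch and are not taken from the paper; the Canny detector, which adds smoothing, non-maximum suppression and hysteresis, is not shown.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edges(gray, threshold):
    """Return a binary edge map (rows of 0/1) for a grey-scale frame.

    gray is a list of equal-length rows of intensities. Border pixels are
    left at 0 because the 3x3 kernels do not fit there. The threshold is a
    hypothetical tuning parameter, not a value from the paper.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    p = gray[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * p
                    gy += SOBEL_Y[dr][dc] * p
            # Mark the pixel as an edge when the gradient magnitude is large.
            if math.hypot(gx, gy) > threshold:
                edges[r][c] = 1
    return edges
```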

Spatial Resolution Reduction Methodology
There are various ways to reduce spatial and temporal resolution. It is important to compare the effects of the compression methods on the images used as side information for the Soergel distance proposed in [3]. One of the compression processes is down-sampling, in which the reduced sequence has horizontal and vertical sampling frequencies lower than the maximum value of all components in the original sequence. Sampling frequencies of one half and one third of the maximum value are used with nearest-neighbour interpolation, which estimates the image value at a point between pixels from the pixel that the point falls within. Sampling reduces the size of an image by extracting pixels: the reduced image takes one interpolated value for every block of two by two pixels when down-sampling by a factor of two, and one interpolated value for every block of three by three pixels when down-sampling by a factor of three. This type of interpolation has low computational complexity, since it considers fewer pixels than other types such as bilinear or bicubic interpolation. It also does not apply a low-pass filter to prevent aliasing, which would be an unnecessary computational cost since no high-contrast images are used in this experiment. The nearest-neighbour assignment resampling technique is used because the centres of the input cells rarely align with the transformed cell centres of the desired resolution.
The other spatial resolution reduction process is pixel sampling. The reduced sequence takes one sample value from every block of two by two pixels for spatial reduction by a factor of two, and one sample value from every block of three by three pixels for spatial reduction by a factor of three, thereby reducing the spatial resolution of a component in an image. Both spatial resolution reduction methods are tested in order to select the better method with low complexity in mind. The results of both methods are presented in Figure 2. Figure 2(a) and (b) show the edges extracted from the first frame of the Tractor sequence using the Canny and Sobel edge detectors, respectively, and down-sampled using pixel sampling. Figure 2(c) and (d) show the edges extracted from the same frame using the Canny and Sobel edge detectors, respectively, and down-sampled using nearest-neighbour interpolation.
Referring to Figure 1, the step after reducing the spatial resolution ensures that the extracted and down-sampled edges can be processed further. The morphology functions in MATLAB are a set of image processing operations that process images based on shapes. The built-in MATLAB function bwmorph with the operation 'remove' is applied to the binary image in order to discard unnecessary information that would increase the overhead cost. This operation removes interior pixels to leave an outline of the edges: it sets a pixel to 0 if all of its 4-connected neighbours are 1, thus leaving only the boundary pixels, which is observed to reconnect many disconnected edges. This process is well suited to reducing the complexity and the side-information overhead, as it removes non-edge information and suppresses noisy pixels from the main structure within the frames. The morphological filter is suited to extracting the main objects or structures from the scenes.
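The two spatial reduction schemes can be illustrated with a minimal Python sketch. It assumes that pixel sampling keeps the top-left pixel of each block and that nearest-neighbour interpolation maps each output pixel centre back onto the input grid; the exact conventions used in the experiments may differ, and the function names are hypothetical.

```python
def pixel_sample(image, factor):
    """Keep the top-left pixel of every factor-by-factor block (assumed convention)."""
    return [row[::factor] for row in image[::factor]]


def nearest_neighbour_downsample(image, factor):
    """Down-sample by mapping each output pixel centre back to the input grid.

    Output pixel (i, j) takes the value of the input pixel that contains the
    point ((i + 0.5) * factor, (j + 0.5) * factor). No low-pass filtering is
    applied, so the cost is one lookup per output pixel.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h // factor, w // factor
    return [
        [image[int((i + 0.5) * factor)][int((j + 0.5) * factor)]
         for j in range(out_w)]
        for i in range(out_h)
    ]
```

Under these assumptions both methods read a single pixel per block and differ only in which pixel of the block is chosen, so their computational cost is comparable.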
The purpose of the whole process is to reduce the overhead while capturing and preserving the main structure of the frame, as this is one of the most important aspects in video quality evaluation. Human vision focuses on certain areas of interest within the frame, and its sensitivity is greatly reduced outside those areas. In this case, the boundary of the edge pixels constitutes the region of interest. The proposed quality metric takes this into account and attempts to model the focus of attention, that is, the main structure of the frame, when computing the overall video quality score. Figure 3 shows example outputs after applying the function to the Tractor test sequence down-sampled by factors of two and three. Figure 3 indicates that nearest-neighbour interpolation is preferable to pixel sampling, as the latter has no constructive effect and none of the edges appear connected when the remove morphological function is applied. Instead, the outputs obtained with pixel sampling contain either too much information (edges extracted by the Canny edge detector) or too little information (edges extracted by the Sobel edge detector), and neither captures the main structures or essence of the image. Therefore, the proposed algorithm reduces the spatial resolution using nearest-neighbour interpolation, applies the remove morphological function to both the Canny and Sobel edge-extracted frames, and then applies the Soergel distance measure. Down-sampling discards a large amount of information within the sequences and might affect the correlation of the Soergel distance measure with subjective quality found in [3]. It is therefore necessary to apply the Soergel distance measure to the sub-sampled and processed sequences and to observe how well the resulting metric quantifies video quality and correlates with DMOS.
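For reference, the Soergel distance between two non-negative vectors x and y is the sum of |x_i - y_i| divided by the sum of max(x_i, y_i). A minimal Python sketch for two equally sized edge maps is given below; how the distance is applied per frame and pooled per sequence in [3] is not reproduced here, and the handling of two empty maps is an assumption of this sketch.

```python
def soergel_distance(ref_edges, dist_edges):
    """Soergel distance between two equally sized, non-negative edge maps.

    Returns 0.0 for identical maps and 1.0 for maps with no overlap.
    """
    num = 0
    den = 0
    for ref_row, dist_row in zip(ref_edges, dist_edges):
        for a, b in zip(ref_row, dist_row):
            num += abs(a - b)
            den += max(a, b)
    # Assumption: two empty edge maps are treated as identical.
    return num / den if den else 0.0
```

For binary edge maps this value equals one minus the Jaccard index of the two edge sets, so larger values indicate greater edge dissimilarity.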
To standardise the results, only nearest-neighbour interpolation is applied to the edge-detected frames, followed by the remove morphological operation, before the Soergel distance measure is computed on the processed sequences. The reason is to reduce weak edges as well as unwanted noise. Eliminating the noisiest bit levels from the original image gives better results for edge image analysis or edge detection. Mathematical morphology is attractive because it involves simple logical operations, which makes real-time application possible.
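The logical simplicity of the remove operation can be seen in a minimal Python sketch of the rule described above: a foreground pixel is cleared when all four of its 4-connected neighbours are foreground. This is an illustrative re-implementation rather than the MATLAB routine itself, and treating pixels outside the image as background is an assumption of this sketch.

```python
def morph_remove(binary):
    """Clear interior pixels of a binary image, leaving only boundary pixels.

    A pixel is set to 0 when it is 1 and its four 4-connected neighbours are
    all 1. Pixels outside the image are treated as 0 (an assumption), so
    foreground pixels on the image border are always kept.
    """
    h, w = len(binary), len(binary[0])

    def at(r, c):
        return binary[r][c] if 0 <= r < h and 0 <= c < w else 0

    out = [row[:] for row in binary]
    for r in range(h):
        for c in range(w):
            if (binary[r][c] and at(r - 1, c) and at(r + 1, c)
                    and at(r, c - 1) and at(r, c + 1)):
                out[r][c] = 0
    return out
```

Each output pixel depends only on five input pixels and a logical AND, which is consistent with the real-time suitability noted above.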

Temporal Resolution Variations
All sub-sampled sequences are compressed using CALIC, and the bitrate of each sequence is calculated. CALIC, which stands for Context Adaptive Lossless Image Compression, uses context to obtain the distribution of the symbol being encoded and predicts pixel values from previous values in the sequence to obtain a prediction of the symbol being encoded. The bitrate of the side information is calculated using Equation (1). After the side information is compressed using CALIC at the frame rate of the original sequences (25 fps or 50 fps), the side-information overhead cost is found to be still high. In multicast video streaming, quality generally deteriorates as the transmitted frames accumulate; therefore, the highest distortion can usually be observed in the last frame of each GOP. For practicality, the side information is transmitted at the end of every GOP, where the frames have the worst signal compared with the previous frames. The size of the GOP can be decided by the video transmitter and provider depending on the bandwidth available on the user's channel. It is therefore decided to reduce the temporal frequency. The effect of reducing the temporal frequency is investigated on the test sequences using three scenarios: transmitting side information at one frame per second (fps), two fps and three fps. The test sequences in the LIVE database have a frame rate of either 25 fps or 50 fps, and the scenarios for each temporal down-sampling are described in Table 1. Temporal down-sampling serves to reduce the overhead cost: the original frame rate of 25 fps or 50 fps is reduced by dropping frames.
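The effect of frame dropping on the overhead can be illustrated with a minimal Python sketch. It assumes that the overhead bitrate is the total compressed size of the transmitted side-information frames divided by the sequence duration, and that the last frame of each interval is kept; this is one plausible reading and is not claimed to be the paper's Equation (1) or the scenarios of Table 1. All names are hypothetical.

```python
def side_info_frame_indices(num_frames, source_fps, side_info_fps):
    """Indices of the frames kept when side information is sent at side_info_fps.

    The last frame of each interval is kept, for example frames 24 and 49 of
    a 50-frame, 25 fps sequence at 1 fps.
    """
    if not 0 < side_info_fps <= source_fps:
        raise ValueError("side_info_fps must be in (0, source_fps]")
    indices = []
    k = 1
    while True:
        idx = (k * source_fps) // side_info_fps - 1
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


def side_info_bitrate(compressed_sizes_bits, source_fps, side_info_fps):
    """Average overhead in bits per second (assumed definition).

    compressed_sizes_bits[i] is the compressed size, in bits, of the side
    information extracted from frame i.
    """
    num_frames = len(compressed_sizes_bits)
    kept = side_info_frame_indices(num_frames, source_fps, side_info_fps)
    total_bits = sum(compressed_sizes_bits[i] for i in kept)
    duration_s = num_frames / source_fps
    return total_bits / duration_s
```

Under this definition, the overhead scales roughly linearly with the side-information frame rate for a fixed per-frame compressed size.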

Results and Analysis
Experimental tests are conducted on the LIVE Video Database in order to validate the proposed metric and its correlation with the HVS. Correlations between the subjective scores and the objective metric are computed, using various performance metrics, to verify the usefulness of the objective metric. The first analysis is the correlation between the DMOS and the quality indices from the various similarity measures. The correlation coefficients between the two are obtained in order to compare the relative performances. The correlations of every sequence are averaged, and the findings are reported in Table 2. Table 2 shows the correlations of the spatially down-sampled sequences with the remove morphological function applied, and of the original-resolution sequences from [3], with the LIVE DMOS scores. The Soergel distance values were averaged for each sequence over the range of wireless and compression distortions.
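The Pearson correlation reported in Table 2 can be computed as in the following minimal Python sketch. The function name is illustrative, and any inputs passed to it in an example would be placeholders rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

Since a larger DMOS indicates worse quality and a larger Soergel distance indicates greater edge dissimilarity, a well-behaved metric is expected to yield a positive coefficient close to 1.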

Figure 1. Methodology of Finding Out the Correlation between Metric Score for Spatially and Temporally Down-sampled Test Sequences with LIVE Database Score

Figure 2. Down-sampled by Factor of 2 on Canny (left) and Sobel (right) Edges using Pixel Sampling (top) and Nearest Neighbour Interpolation (bottom)

Figure 3. Sequence of Boundary Images Produced by Canny (left) and Sobel (right) Detectors. The Images are Down-sampled by Factor of Two in (a)(b)(c)(d) and by Factor of Three in (e)(f), All with Morphological Post-Processing

Table 2. Pearson Correlation between Spatially Down-sampled then Morphologically Removed and Original Resolution Sequences with LIVE DMOS Score