Improvement of HEVC inter-coding mode using multiple transforms

Multiple transforms have received considerable attention recently, especially in the course of an exploration conducted by MPEG and ITU toward the standardization of the next-generation video compression algorithm. This joint team has developed software, called the Joint Exploration Model (JEM), which outperforms the HEVC standard by over 25%. The transform step in JEM consists of Adaptive Multiple Transforms (AMT) and Non-Separable Secondary Transforms (NSST), which are designed for and adapted to the intra-coding modes. In inter-coding, only the AMT is allowed and it is restricted to a single set of five transforms. In this paper, adaptive transform schemes suitable for inter-predicted residuals are designed and proposed to improve the coding efficiency. Two configurations are evaluated for the proposed designs, providing an average bitrate saving of roughly 1% over HEVC with unchanged decoding time.


I. INTRODUCTION
HEVC/H.265 is the latest video coding standard [1], released in January 2013 as the successor of AVC/H.264 [2]. HEVC reduces the bandwidth by more than 50% compared to AVC, for the same perceived visual quality. Consequently, it is well adapted to larger resolutions such as Ultra-High-Definition (UHD) content [3]. With next-generation formats in focus, such as 360-degree video, MPEG and ITU jointly established the Joint Video Exploration Team (JVET) in October 2015 to prepare the next generation of video coding standard, beyond HEVC. A Joint Exploration Model (JEM) has been developed, and this software improves the coding efficiency by more than 25% over HEVC in the Random-Access (RA) configuration [4].
The JEM, initially built on top of the HEVC Test Model 16.6 (HM-16.6) [5], introduces many new tools [6]. Among them, the transform stage introduces the notion of transform competition through two stages.
The first stage, called Adaptive Multiple Transforms (AMT) [7], proposes a block-level flag that signals whether the classical DCT2 (Discrete Cosine Transform kernel of type II) is used. If not, additional indexes are transmitted to signal the selected horizontal and vertical transforms, in a list of trigonometric kernels [8] (DCT and DST of types I to VIII). It must be noted that the indexes point to transform sets that depend on the Intra Prediction Mode (IPM) for intra residuals, while a single set is considered for inter-predicted residuals. This single set contains the DCT8 and DST7, combined in the horizontal and vertical directions.
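The kernels named above can be sketched in a few lines. The following is a minimal illustration, assuming the usual orthonormal floating-point definitions of the DCT-II, DST-VII and DCT-VIII (actual codecs use scaled integer approximations), and assuming the convention that the coefficients of a block X are V X H^T for a vertical kernel V and a horizontal kernel H. The helper names are hypothetical and not taken from the JEM software.

```python
import math

def dct2(n):
    """Orthonormal DCT-II kernel; row k holds the k-th basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows

def dst7(n):
    """Orthonormal DST-VII kernel (floating-point definition)."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (j + 1) / (2 * n + 1))
             for j in range(n)] for k in range(n)]

def dct8(n):
    """Orthonormal DCT-VIII kernel (floating-point definition)."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.cos(math.pi * (2 * k + 1) * (2 * j + 1) / (4 * n + 2))
             for j in range(n)] for k in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def forward_2d(block, vertical, horizontal):
    """Separable forward transform: C = V * X * H^T."""
    return matmul(matmul(vertical, block), transpose(horizontal))

def inverse_2d(coeffs, vertical, horizontal):
    """Separable inverse transform for orthonormal kernels: X = V^T * C * H."""
    return matmul(matmul(transpose(vertical), coeffs), horizontal)

# Example: transform a 4x4 block with DST7 vertically and DCT8 horizontally.
block = [[float(4 * r + c) for c in range(4)] for r in range(4)]
coeffs = forward_2d(block, dst7(4), dct8(4))
restored = inverse_2d(coeffs, dst7(4), dct8(4))
```

Because the three kernels are orthonormal in this floating-point form, the inverse only needs transposes, which is why `restored` matches `block` up to rounding error.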
A second transform stage, called Non-Separable Secondary Transforms (NSST), can be added for intra-coded blocks. These transforms are based on hypercube Givens rotations [9] applied to the lower-frequency coefficients after the AMT transformation. The impact of these tools has been evaluated for the JEM, where they each provide around 2% of bit rate savings [10]. The transform-related tools therefore account for a significant part of the coding gains in the JEM (JEM5 at the time of writing this article).
Several technologies have been proposed to improve the transform stage. For example, in [11], an extension of the AMT transform set is proposed by introducing two additional transform kernels. This enables bitrate savings of roughly 0.3%. In addition, an alternative transform set design has been proposed in [12] to reduce the required computational power by replacing the most expensive transform kernels, reducing the encoding complexity by about 50% for equivalent compression performance.
It can be noticed that none of these methods has been optimized or deeply investigated for inter-coded residuals. The AMT authorizes a wide variety of transform kernels for intra residuals, while only the DST7 (Discrete Sine Transform of type VII) and DCT8 (DCT of type VIII) are considered in inter. Regarding the secondary transform stage, NSST is only activated for intra slices.
Several approaches have been proposed in the literature to improve the transform stage efficiency on motion-compensated residuals. In [13], the authors propose to adaptively rotate the DCT2 for inter residuals (ROT). In this approach, the DCT2 is multiplied by a cascade of rotational transforms, where the angles composing the global rotation are estimated with a gradient-based search algorithm. A syntax scheme is then proposed to signal the angles and whether a region uses the rotational transforms, which leads to 3.9% of coding gains compared to AVC. In [14], inter-frame prediction residuals are modeled under the assumption that image intensities follow a first-order Markov model in the direction of the motion trajectory. An adaptive transform, which requires updates at the decoder side, competes with the DCT2, and the results show that gains of 2% are achieved compared to AVC. In [15], the application of graph-based transforms (GBT) is explored on residuals generated with the HEVC inter mode, where GBT achieves substantial gains compared to the DCT2 and KLT. Overall, transform competition for inter-coded residuals remains sparsely covered in the literature.
Although the ROT and GBT approaches mentioned above are promising in terms of performance, they require the transmission or update of the transform coefficients, which can be an issue for hardware implementations: this typically prevents fast implementations of these transforms.
In this paper, an improved AMT scheme with adaptive transform set selection for inter-coded slices is proposed to resolve these issues. The proposed scheme extends the JEM approach using the same transform kernels, dynamically adapts the transform sets used on inter residuals, and provides an improved coding efficiency over HEVC.
This paper is organized as follows. The RDOT criterion is first introduced as a means to select an appropriate set of transforms for inter-coding. Then, the selection of the number of transforms is discussed, and the coding performance obtained as the number of transforms increases is presented. In the subsequent section, an adaptive transform set approach is discussed and evaluated.

A. Rate-Distortion Optimized Transforms
The Rate-Distortion Optimized Transforms (RDOT) have been introduced in [16] to efficiently learn transforms for a given set of residuals. In [17], the RDOT method is used to learn optimal sets of transforms for intra-predicted residuals in HEVC for the general case of non-separable transforms, then extended to separable transforms and Discrete Trigonometric Transforms (DTT). In this paper, a set of DTTs is considered as a support to learn the transform sets used in the proposed design, in a fashion similar to the transforms adopted in JEM [7]. Hence, the RDOT learning aims at finding an optimal pair of vertical and horizontal transforms {A_v, A_h} for a set of residuals {x_i}, defined by solving the following minimization problem:
$$\{A_v, A_h\} = \arg\min_{A_v, A_h} \sum_i \left( \| x_i - A_v^T c_i A_h \|_2^2 + \lambda \| c_i \|_0 \right)$$
with A_v and A_h the vertical and horizontal transforms and c_i the transformed and quantized residual. As demonstrated in [17], the Lagrangian multiplier λ depends on the quantization accuracy. In this paper, a transform set is learned based on inter-predicted residuals extracted from bitstreams coded with HEVC in RA configuration, for 70 sequences (with resolutions varying from 240p to 2160p). Over 10 million residual blocks are considered at this stage.
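As an illustration of how such a criterion can be evaluated for one residual block, the sketch below assumes the usual RDOT form, that is, squared reconstruction error plus λ times the number of nonzero quantized coefficients. The uniform quantizer with step `step`, the function name `rdot_cost` and the toy 2x2 kernels are assumptions made for the example; they are not taken from [16] or [17].

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rdot_cost(block, vertical, horizontal, step, lam):
    """Distortion + lam * (number of nonzero quantized coefficients).

    Assumes orthonormal kernels and a uniform quantizer of size `step`.
    """
    # Forward separable transform: C = V * X * H^T.
    coeffs = matmul(matmul(vertical, block), transpose(horizontal))
    # Uniform quantization followed by reconstruction of the levels.
    quantized = [[step * round(c / step) for c in row] for row in coeffs]
    # Inverse transform: X' = V^T * Q * H.
    recon = matmul(matmul(transpose(vertical), quantized), horizontal)
    distortion = sum((x - y) ** 2
                     for row_x, row_y in zip(block, recon)
                     for x, y in zip(row_x, row_y))
    nonzero = sum(1 for row in quantized for q in row if q != 0)
    return distortion + lam * nonzero

# Toy comparison on a flat 2x2 residual: identity versus a 2-point
# orthonormal sum/difference kernel.
s = 1.0 / math.sqrt(2.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
sum_diff = [[s, s], [s, -s]]
flat = [[1.0, 1.0], [1.0, 1.0]]
cost_identity = rdot_cost(flat, identity, identity, step=1.0, lam=2.0)
cost_sum_diff = rdot_cost(flat, sum_diff, sum_diff, step=1.0, lam=2.0)
```

On the flat block, the sum/difference kernel compacts the energy into a single coefficient, so its cost is lower than that of the identity, which is the behavior the criterion is meant to reward.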
For the purpose of this article, the learning process is performed to select a set of transforms for inter-predicted residuals. Therefore, the learning process is turned into a selection process of M pairs of vertical and horizontal transforms within the set of all possible discrete trigonometric transforms.
The learning design is illustrated in Algorithm 1 [17], [18]. For all possible transform sets, the residuals are clustered into classes, one per transform pair, according to the RDOT metric (Class_m). When a set minimizing the RDOT metric is reached, the convergence criterion is met and the current set is selected.
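The clustering-then-selection idea can be sketched as follows. This is only an exhaustive-search illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes that a table `costs[i][p]` holding the RDOT cost of residual i under candidate pair p has already been computed, and the function names are hypothetical.

```python
from itertools import combinations

def cluster(costs, chosen):
    """Assign each residual to its cheapest pair among `chosen`.

    Returns the total cost and the classes (pair index -> residual indices).
    """
    classes = {pair: [] for pair in chosen}
    total = 0.0
    for i, row in enumerate(costs):
        best = min(chosen, key=lambda pair: row[pair])
        classes[best].append(i)
        total += row[best]
    return total, classes

def select_transform_set(costs, num_pairs, set_size):
    """Try every set of `set_size` pairs and keep the one with the lowest total cost."""
    best = None
    for candidate in combinations(range(num_pairs), set_size):
        total, classes = cluster(costs, candidate)
        if best is None or total < best[0]:
            best = (total, candidate, classes)
    return best

# Toy cost table: 4 residuals, 3 candidate transform pairs.
toy_costs = [[5, 1, 9],
             [2, 8, 9],
             [7, 6, 1],
             [3, 4, 9]]
total, chosen, classes = select_transform_set(toy_costs, num_pairs=3, set_size=2)
```

An exhaustive search like this is only tractable for small candidate lists, since the number of sets grows combinatorially with the number of candidate pairs.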
Algorithm 1: RDOT learning design (data: inter-predicted residuals x of a given size).
With the considered learning design, the transform sets are built independently for each block size. In a second pass, the obtained sets are homogenized to obtain a set of transforms common to all sizes, from 4x4 to 32x32 blocks. Table I gives the transform sets obtained after the learning process; they contain from 1 to 9 transforms. According to the HEVC terminology, each TU (Transform Unit) uses one of those transforms for each inter-residual block. The number of transforms per set is chosen to be 1 + 2^b to anticipate the signaling of the selected transform from the encoder to the decoder: a flag indicates whether the first transform is used and, if not, an additional code of b bits is conveyed to signal the selected transform. As can be seen, the learning confirms that the DCT2, applied in both the row and column directions, is the optimal transform when a single transform is used. The DST7 and DCT8 are the most frequent transform kernels for transform sets of up to 5 transforms (transform sets 2, 3 and 5).
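The flag-plus-code signaling for a set of 1 + 2^b transforms can be illustrated with a short sketch. The function name and the string representation of the bits are assumptions made for readability; the sketch only counts raw bins and ignores any entropy coding applied afterwards.

```python
def encode_transform_index(index, set_size):
    """Return the bins signaling transform `index` in a set of 1 + 2**b transforms.

    Index 0 is the first transform (flag "0"); any other index is sent as
    flag "1" followed by (index - 1) written on b bits.
    """
    if set_size == 1:
        return ""  # a single transform needs no signaling
    b = (set_size - 1).bit_length() - 1
    if 1 + (1 << b) != set_size:
        raise ValueError("set size must be of the form 1 + 2**b")
    if not 0 <= index < set_size:
        raise ValueError("index out of range")
    if index == 0:
        return "0"
    suffix = format(index - 1, "b").zfill(b) if b > 0 else ""
    return "1" + suffix

# Bins for every transform of a 5-transform set (b = 2).
bins_for_set_5 = [encode_transform_index(i, 5) for i in range(5)]
```

With this layout, the first transform always costs a single bin, which keeps the overhead low when it is selected most of the time.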
It must be noted that TrSet 5, although designed independently, matches the transform set used in the JEM. For TrSet 9, it is remarked that one single additional transform kernel (DST1) is added to those of HEVC (as DCT8 and DST7 are dual, i.e. identical up to a reversal of the basis vectors). Some of the 2D transforms are illustrated in Figures 1a to 1f. Figure 1a displays the 2D-DCT2 as frequently encountered in video coding. Different combinations of DCT8 and DST7 are displayed in Figures 1b to 1e. As can be noticed, each of them targets a particular spatial localization. Figure 1f performs the transform decomposition on the vertical axis; as such, it is appropriate for residual patterns with banded vertical textures.
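The duality between DST7 and DCT8 can be checked numerically. The sketch below assumes the usual orthonormal floating-point definitions of both kernels; under those definitions, the k-th DCT8 basis vector is the k-th DST7 basis vector read in reverse order, multiplied by (-1)^k. The function names are hypothetical.

```python
import math

def dst7(n):
    """Orthonormal DST-VII kernel; row k holds the k-th basis vector."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * k + 1) * (j + 1) / (2 * n + 1))
             for j in range(n)] for k in range(n)]

def dct8(n):
    """Orthonormal DCT-VIII kernel; row k holds the k-th basis vector."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.cos(math.pi * (2 * k + 1) * (2 * j + 1) / (4 * n + 2))
             for j in range(n)] for k in range(n)]

def max_duality_error(n):
    """Largest deviation from DCT8[k][j] == (-1)**k * DST7[k][n - 1 - j]."""
    s, c = dst7(n), dct8(n)
    return max(abs(c[k][j] - (-1) ** k * s[k][n - 1 - j])
               for k in range(n) for j in range(n))

# Deviation for the block sizes considered in the paper (4 to 32).
errors = {n: max_duality_error(n) for n in (4, 8, 16, 32)}
```

The deviations stay at the level of floating-point rounding, so one kernel can be obtained from the other by flipping the samples and alternating the signs.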

B. Coding performance with transform sets
Five transform sets have been determined in the learning process. This section evaluates each of them in a coding environment.
The HEVC coding scheme is extended to allow the usage of the proposed multiple transforms. Consistent with the approach in [17], a flag indicates whether the legacy HEVC transform (DCT2) is used. If not, an additional code of 0, 1, 2 or 3 bits is conveyed for Transform Sets 2, 3, 5 and 9 respectively, in line with the 1 + 2^b set sizes. The flag and code are coded at the Transform Unit level of the HEVC syntax when the luma residual signal is significant (i.e. it contains at least one nonzero coefficient).
The performance is evaluated under the Common Test Conditions (CTC), as defined by the JVET group. The test set includes 25 video clips with resolutions from 240 lines to 4096x2160 pixels [19]. The coding configuration is Random Access: an intra picture is inserted approximately every second, and the intermediate frames are coded with a hierarchical B-frame structure with a GOP size of 16. Both the HEVC implementation and the proposed coding schemes are evaluated in this configuration, and both codecs are based on the latest HEVC reference model (HM-16.6). Table II presents the results expressed with the BD-rate metric commonly used in video coding [20]. The percentage expresses the relative bit rate decrease over HEVC, which serves as the anchor for this study. The BD-rate is estimated over a bit rate range driven by a quantization parameter (QP) from 22 to 37. It can be noticed that the gains increase with the number of transforms, from -0.2% to -0.9% for Transform Set 9. The added encoding complexity with respect to HEVC also increases with the number of transforms, up to 28%.
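For readers unfamiliar with the BD-rate metric, the computation can be sketched as follows. This is a minimal version assuming exactly four rate-distortion points per codec (for instance, one per QP), in which case the third-order polynomial fit of log-rate against PSNR is an exact interpolation. The function names and the sample figures are hypothetical and are not results from this paper.

```python
import math

def _cubic_through(xs, ys, x):
    """Evaluate at x the unique cubic through four points (Lagrange form)."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

def _integrate(xs, ys, lo, hi):
    """Integrate the cubic over [lo, hi]; Simpson's rule is exact for cubics."""
    mid = 0.5 * (lo + hi)
    f = lambda x: _cubic_through(xs, ys, x)
    return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

def bd_rate(anchor, test):
    """BD-rate in percent from four (bitrate, psnr) pairs per codec.

    Negative values mean that `test` needs less bitrate than `anchor`
    for the same quality.
    """
    a_psnr = [p for _, p in anchor]
    a_log = [math.log10(r) for r, _ in anchor]
    t_psnr = [p for _, p in test]
    t_log = [math.log10(r) for r, _ in test]
    # Only the PSNR interval covered by both curves is compared.
    lo = max(min(a_psnr), min(t_psnr))
    hi = min(max(a_psnr), max(t_psnr))
    avg_diff = (_integrate(t_psnr, t_log, lo, hi)
                - _integrate(a_psnr, a_log, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0

# Hypothetical anchor curve, and a test codec using 10% less bitrate everywhere.
anchor_points = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test_points = [(0.9 * rate, psnr) for rate, psnr in anchor_points]
saving = bd_rate(anchor_points, test_points)
```

Averaging in the log-rate domain makes the metric report a relative bitrate difference, which is why a uniform 10% reduction yields a BD-rate of about -10%.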
These results highlight the performance of transform competition in the context of inter-coding alone, as the transform competition is enabled only for inter-predicted residuals. The coding gains are lower than the ones obtained with AMT for intra: one explanation is that inter-coding includes a significant number of perfectly predicted blocks for which there is no residual. Those blocks do not benefit from the additional transforms.
It can also be noticed, notably on Class F, which comprises mostly static screen content scenes, that increasing the number of transforms has no effect on the coding performance: the potential gain vanishes because the benefit for the rare coded residuals is counterbalanced by the cost of the transform signaling.

III. ADAPTIVE TRANSFORM SETS
On the one hand, the benefit of an increased number of transforms is justified for inter-predicted residuals, as stated in the previous section. On the other hand, allowing the encoder to limit its number of transforms also seems motivated because in some cases, such as easily predictable regions (e.g. motionless areas), a flat residual is more probable and thus a DCT2 could be sufficient.
Consequently, to further increase the compression efficiency, this paper proposes to dynamically adapt the number of transforms per Coding Tree Unit (CTU, i.e. per 64x64 pixel block).
To summarize, the advantages of an adaptive transform set design are the following:
• Enlarge transform sets when necessary: reduced distortion in the R-D trade-off, as complex residuals take advantage of the multiple transforms.
• Reduce transform sets when necessary: reduced bitrate in the R-D trade-off, by avoiding wasting bits on signaling the transforms when unnecessary.

A. Principles of Adaptive Transform Sets (ATS)
In the proposed design, the encoder is allowed to modulate the number of transforms used in inter-prediction mode. To give the encoder enough flexibility, it is proposed to dynamically adjust the transform set at the CTU level, signaled in a differential way.
The five transform sets defined in Table I are directly used in this ATS design. The first transform set is basically a disabled-AMT mode (DCT2 only), while the four other sets use the DCT2 plus 1, 2, 4 or 8 transforms.

B. Transform Set Signaling
The transform set index is signaled at the top of the CTU in a differential way. Indeed, it has been observed that the probability of a given transform set index in a temporal layer is strongly correlated with the transform set index of the collocated (same position) CTU in the lower temporal layer. Thus, the transform set is signaled using a code conditioned on this information, as shown in Table III. Using that method, the average cost for the current transform set is reduced to 1.11 bits when the collocated transform set includes only the DCT2. Note that the first bit of the code in Table III is encoded using CABAC. Consequently, the signaling is efficient for sequences where fewer transforms are required.
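To make the idea of conditional signaling concrete, the sketch below shows one possible shape of such a code. It is purely hypothetical: the codewords of Table III are not reproduced here, and the layout chosen for the example (one bin telling whether the set equals the collocated one, then two bits selecting among the four remaining sets) is an assumption made only for illustration.

```python
NUM_SETS = 5  # the five transform sets, numbered 0 to 4 in this sketch

def encode_set_index(current, collocated):
    """Hypothetical code: "0" if the set equals the collocated one,
    otherwise "1" followed by a 2-bit index among the four other sets."""
    if current == collocated:
        return "0"
    others = [s for s in range(NUM_SETS) if s != collocated]
    return "1" + format(others.index(current), "02b")

def decode_set_index(bits, collocated):
    """Inverse of encode_set_index for the same hypothetical code."""
    if bits[0] == "0":
        return collocated
    others = [s for s in range(NUM_SETS) if s != collocated]
    return others[int(bits[1:3], 2)]

def average_length(prob_same):
    """Average number of bins when the collocated set is reused with
    probability `prob_same` (a hypothetical figure, before entropy coding)."""
    return prob_same * 1 + (1.0 - prob_same) * 3

# Codewords when the collocated CTU used set 0.
codewords = [encode_set_index(s, 0) for s in range(NUM_SETS)]
```

The stronger the correlation with the collocated CTU, the closer the average length gets to a single bin, which is the property the conditional code exploits.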

C. Performance and Encoding complexity consideration
For the adaptive transform set, the encoder successively encodes each CTU with each transform set; therefore, one of the main impacts of the proposed design is its increased encoding complexity. Indeed, multiple redundant passes of partitioning and prediction are performed.
To accelerate the encoding decisions, two techniques are considered. First, an early-termination method is implemented to stop the Rate-Distortion Optimized (RDO) encoding: if a CTU does not contain any residual for the first transform set, it is judged unnecessary to explore alternate transform sets. In addition, the quad-tree partitioning derived using a particular transform set can be reused for another one. In this case, transform sets are tested starting from the richest in terms of transforms (e.g. from transform set 9).
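The per-CTU search with early termination can be sketched as follows. This is a simplified illustration, not the encoder used in the experiments: `encode_ctu` is a hypothetical callable standing in for a full RDO encoding pass, assumed to return the rate-distortion cost and whether any residual was coded, and the partition-reuse technique is not modeled.

```python
def choose_transform_set(ctu, set_ids, encode_ctu):
    """Encode `ctu` with each transform set in turn and keep the cheapest.

    `encode_ctu(ctu, set_id)` is a hypothetical callback returning
    (rd_cost, has_residual). The search stops after the first set if
    that pass produced no residual at all.
    """
    best_set, best_cost = None, float("inf")
    for position, set_id in enumerate(set_ids):
        cost, has_residual = encode_ctu(ctu, set_id)
        if cost < best_cost:
            best_set, best_cost = set_id, cost
        if position == 0 and not has_residual:
            break  # early termination: other sets cannot help without residual
    return best_set, best_cost

# Stand-in encoder for demonstration: fixed costs, residual only for moving content.
calls = []

def fake_encode(ctu, set_id):
    calls.append(set_id)
    return {1: 10.0, 2: 8.0, 3: 9.0}[set_id], ctu != "static"

moving_choice = choose_transform_set("moving", [1, 2, 3], fake_encode)
moving_calls = list(calls)
calls.clear()
static_choice = choose_transform_set("static", [1, 2, 3], fake_encode)
static_calls = list(calls)
```

In the static case only one encoding pass is run, which is where the early termination saves time on content with many residual-free CTUs.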
The ATS schemes are evaluated through several configurations: the simplest configurations use two transform sets, e.g. {1, 2} can code a CTU either with the DCT2 alone or using the pair of transforms selected for transform set 2 (refer to Table I). The number of transforms is progressively increased up to 9. Both the AMT and the ATS systems ensure that the decoding complexity is kept practically unchanged compared to HEVC, as roughly the same number and sizes of inverse transforms are applied.
The main drawback of the ATS approach remains the added complexity, although it significantly outperforms the AMT systems (gains progress from 0.8% to 1.2%). This is why the relationship between transform selection and partitioning should be better understood, in order to reduce the redundancies in the encoding process.

IV. CONCLUSION
In this paper, an adaptive transform set design, using trigonometric kernels, is proposed to improve the coding efficiency of inter-predicted residuals in HEVC.
To further increase the performance, transform sets, called AMT, with 1 to 9 transforms are designed in a rate-distortion sense. This design confirms the value of the DCT2 for inter-coding when used alone and also confirms that the DST7/DCT8 are efficient kernels for these residuals. The AMT with 5 transforms is identical to the one derived independently for the ITU/MPEG JEM software. The design is conservative with respect to the HEVC transforms, as a single transform kernel (the DST1) is added, and the decoding complexity remains that of the standard. Under strict testing conditions, it is shown that AMT can provide bitrate savings from 0.2% up to 0.8% with a reasonable increase of the encoder complexity.
The Adaptive Transform Sets are also introduced to further increase the coding gains. Thanks to ATS, the bit rate gains, which were in the range of 0.8% for the AMT, reach 1.2% at the expense of a significant complexity increase at the encoder side. Hence, further work is required to reduce the complexity of the best-performing configuration and increase the attractiveness of such a solution.