Enhanced frame-based video coding to support content-based functionalities

This paper presents an enhanced frame-based video coding scheme that enables useful content-based functionalities such as user interactivity with video content, assignment of different quality to video objects, and content-based video search and retrieval. The input video source for the proposed enhanced frame-based video encoder consists of rectangular-size video frames and the shapes of arbitrarily-shaped video objects in the frames. The rectangular frames are encoded using a scheme similar to conventional frame-based video coding, and the shapes of video objects are encoded using contour-based vertex coding. It is possible to achieve several useful content-based functionalities by utilizing the shape information in the bitstream at the cost of a very small overhead of shape bits.


1: Introduction
Over the years, digital video technology has been evolving quickly in terms of the ways video content is produced, delivered, and consumed. Users are no longer content with only the compression efficiency of a video coding technology; they are now demanding more features such as content-based interactivity and content-based video indexing and retrieval. Furthermore, new video coding schemes are required to cope with the bandwidth and bit error rates of various transmission networks and to achieve backward compatibility with existing video coding schemes, in addition to providing very good compression efficiency.
For conventional frame-based video encoders, the input video source is in the form of a sequence of rectangular frames. In contrast, the new video encoders which support content-based functionalities require the source input video to be in such a form that the video content can be easily identified and characterized. The video source for the object-based video coding approach in the MPEG-4 standard [1] is in the form of arbitrarily-shaped foreground and background video objects and their shapes. We have not yet seen the widespread usage and adoption of object-based video coding, probably because the accurate segmentation of semantically meaningful objects from a rectangular video still remains a challenging problem. Furthermore, the bitmap representation of shape employed by MPEG-4 is not well suited for semantic shape characterization for content-based video indexing and retrieval.
In the sub-picture coding [2] proposed towards the H.264 standardization activities, a video frame is partitioned into one or more user-defined non-overlapping rectangular foreground objects called sub-pictures and a remaining background picture. However, the rectangular sub-pictures do not generally represent the true video objects with which a user would like to interact. The proposed enhanced frame-based video coding technique is described in Section 2.

2: The enhanced frame-based video coding scheme

In the proposed enhanced frame-based video encoding scheme (EFBE), the input source video consists of rectangular-size video frames and shapes of arbitrarily-shaped objects in each video frame. We define the object of interest (OOI) in a rectangular frame video as the arbitrarily-shaped, semantically meaningful video object which is of interest to a user. There are three steps involved in the process of obtaining the coded representation of video using the enhanced frame-based video coding scheme: 1) pre-processing, 2) encoding shape and texture, and 3) post-processing. In the first step, the OOI is identified in each video frame and its shape information is obtained in the form of either an outline contour sketch or a segmentation mask. The outline contour sketch of an OOI can be obtained by marking the outline of the OOI on the screen manually; this way the shape information can be easily generated by a user. Several techniques have been proposed to obtain the segmentation masks using chroma-key [3], automatic segmentation [4], and semi-automatic segmentation techniques [5]. For the proposed EFBE, the boundaries of segmented video objects need only represent an approximate outline of the area belonging to a video object, and there is no requirement for the segmentation to be accurate. In the second step, the rectangular frame texture and the video object's shape are encoded. The block diagram of the EFBE is shown in Fig. 1(a). The major components of the EFBE are frame-based texture coding and shape coding. The post-processing step involves linking the shape bits of the OOI with the texture bits belonging to the OOI region on the rectangular frame. The link information is multiplexed into the bitstream along with the texture and shape information. A video bitstream in which link and content information is associated with a region on a video frame is generally referred to as Hypervideo [6]. In this paper, we focus mainly on the second step and propose a new video encoding scheme to enable content-based functionalities whatever the methods used for pre-processing and post-processing.

2.1: Shape coding
The proposed shape coding in the EFBE employs a contour-based technique which lends itself to semantic shape characterization. The shape contour, in the form of either the segmentation mask's boundary or the outline contour sketch of the OOI, is approximated by a polygon such that the distance between the polygon and the contour is less than or equal to a given tolerable approximation error δ. The vertices of the polygon are coded using the object-adaptive vertex encoding method described in [7]. The amount of shape distortion is controlled by varying the value of δ; the larger the value of δ, the higher the shape distortion. Lossless shape coding is achieved by setting δ = 0.
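The paper does not name the algorithm that produces the bounded-error polygon (only that the resulting vertices are coded as in [7]). As a minimal sketch, the standard recursive split (Ramer-Douglas-Peucker) gives the same guarantee that every contour point lies within δ of the polygon:

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the line segment a-b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection onto the segment to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def approximate_polygon(contour, delta):
    """Split recursively until every contour point is within delta of the
    polygon; delta = 0 yields a zero-error (lossless) approximation."""
    if len(contour) <= 2:
        return list(contour)
    dmax, idx = -1.0, 0
    for i in range(1, len(contour) - 1):
        d = point_segment_dist(contour[i], contour[0], contour[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= delta:
        return [contour[0], contour[-1]]
    left = approximate_polygon(contour[:idx + 1], delta)
    right = approximate_polygon(contour[idx:], delta)
    return left[:-1] + right  # drop the duplicated split vertex
```

A larger δ prunes more vertices, trading shape fidelity for fewer vertex bits, which mirrors the distortion control described above.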
The distance between the polygonal approximations of the OOI shape in the current and the reference frame is used to detect the amount of temporal variation in shape. The distance between two polygons is computed as follows. Let P_c and P_r be the polygonal approximations of the current and the reference OOI shape. Let h_i and v_i be the horizontal and vertical distance of the i-th vertex of P_c from P_r. Then the distance of P_c from P_r is defined as

    D(P_c, P_r) = Σ_{i=1}^{N} d_i,
where d_i = min(h_i, v_i) and N is the number of vertices of P_c. For objects of interest such as a talking head, the variation in OOI shape over a sequence of contiguous frames is usually very small. Therefore, when lossy shape coding is desired, the shape information is not transmitted with every frame.
Instead, the polygonal approximation of the current OOI shape is coded and transmitted only if the distance between the polygonal approximations of the current and reference OOI shapes is greater than or equal to a threshold T (i.e., if D(P_c, P_r) ≥ T). Otherwise, no shape information is transmitted in the current frame. At the decoder, the most recently decoded shape is used to identify the OOI if no shape information is present in the current frame.
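The transmit-or-skip decision can be sketched as below. Two simplifying assumptions are made: the per-vertex distances d_i are aggregated by summation, and h_i and v_i are measured to the nearest reference vertex rather than to the reference polygon's boundary, so this is an approximation of the measure defined in Section 2.1, not a verbatim implementation:

```python
def polygon_distance(P_c, P_r):
    """D(P_c, P_r): for the i-th vertex of the current polygon, take the
    horizontal (h_i) and vertical (v_i) offsets to its nearest reference
    vertex (a simplification of the distance to the reference polygon),
    keep d_i = min(h_i, v_i), and sum over all vertices."""
    total = 0
    for cx, cy in P_c:
        rx, ry = min(P_r, key=lambda r: (r[0] - cx) ** 2 + (r[1] - cy) ** 2)
        total += min(abs(cx - rx), abs(cy - ry))
    return total

def shape_update_needed(P_c, P_r, T):
    """Transmit the current shape only when D(P_c, P_r) >= T; otherwise
    the decoder reuses the most recently decoded shape."""
    return polygon_distance(P_c, P_r) >= T
```

With T = 0 (as used for Foreman in Section 4) the shape is transmitted in every coded frame; with T = 5 (Akiyo) a nearly static shape is skipped.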

2.2: Texture coding
The basic steps of texture coding in the EFBE are essentially the same as those in a typical rectangular frame-based video encoder. The basic steps consist of dividing a video frame into an array of basic units called macroblocks and processing each macroblock by applying the discrete cosine transform, quantization, and variable length coding. In fact, the texture coding block in the EFBE can be any rectangular frame-based video encoder (e.g., an MPEG-1, MPEG-2 or MPEG-4 (simple profile) video encoder). In our implementation of the EFBE, we have used the MPEG-4 (simple profile) encoder for frame-based video texture coding.
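The transform-and-quantize step can be illustrated for a single 8x8 block. This is a generic textbook sketch (naive 2-D DCT-II plus uniform quantization), not the MPEG-4 simple-profile code path; the variable length coding stage is omitted:

```python
import math

def dct2_8x8(block):
    """Naive 2-D DCT-II of an 8x8 block (written for clarity, not speed)."""
    N = 8
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = cu * cv * s
    return out

def quantize(coeffs, Q):
    """Uniform quantization of the DCT coefficients with step Q."""
    return [[int(round(c / Q)) for c in row] for row in coeffs]
```

A flat block produces a single DC coefficient and zero AC coefficients, which is why smooth regions compress well after quantization and run-length/VLC coding.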
In the DST mode, the shape information is used to adjust the texture coding parameters. The main idea is to encode the texture in the region belonging to the polygonal approximation of the OOI shape with a finer quantization as compared to the rest of the video frame. The quantizer values used for the two regions are embedded in the header information. At the decoder, the decoded shape information is utilized to correctly identify the region belonging to the video object on the video frame, and the quantizer information in the bitstream header is utilized for decoding the frame texture.
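One way to realize the DST-mode quantizer adjustment is to classify each macroblock by whether its centre lies inside the OOI polygon. The centre-point test is an assumption (the paper does not specify how boundary macroblocks are classified), and the quantizer values Q=8/Q=31 are taken from the experiment in Section 4:

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test against the decoded polygon vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def macroblock_quantizers(width, height, poly, q_ooi=8, q_bg=31, mb=16):
    """One quantizer per 16x16 macroblock: the finer q_ooi when the
    macroblock centre falls inside the OOI polygon, q_bg otherwise."""
    rows = []
    for my in range(0, height, mb):
        row = []
        for mx in range(0, width, mb):
            cx, cy = mx + mb / 2, my + mb / 2
            row.append(q_ooi if point_in_polygon(cx, cy, poly) else q_bg)
        rows.append(row)
    return rows
```

The decoder repeats the same classification from the decoded shape, so only the two quantizer values (not a per-macroblock map) need to travel in the header.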

2.3: Multiplexing shape and texture bits
We call the combined bits consisting of the shape bits, the bits required for coding the quantization parameters in the DST mode, and the bits that are generated by the post-processing stage for linking the objects of interest the additional bits. In order to achieve backward compatibility with conventional frame-based coding, we need to place the additional bits into the bitstream such that conventional frame-based decoders would simply ignore these additional bits and decode the texture bits as usual. The proposed enhanced frame-based video decoders would utilize the additional bits to provide the content-based functionalities. We employ the user data packet insertion scheme described in [8] for this purpose. We combine the additional bits into user data and place the user data into the bitstream generated by the frame-based texture coding block of the EFBE.
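A rough sketch of this multiplexing, assuming an MPEG-style user_data packet (start code 0x000001B2, which conforming frame-based decoders skip). The "EFBE" payload tag is a hypothetical marker, not from the paper, and a real implementation would also have to prevent start-code emulation (three consecutive zero-ish bytes) inside the payload:

```python
# user_data start code as defined in MPEG-2 / MPEG-4 Visual bitstreams.
USER_DATA_START_CODE = b"\x00\x00\x01\xB2"

def wrap_additional_bits(additional_bits: bytes) -> bytes:
    """Wrap the additional bits in a user-data packet. A conventional
    frame-based decoder skips user data; an EFBE decoder parses it."""
    # Hypothetical tag so an EFBE decoder can recognise its own packets
    # among other user-data packets in the stream.
    return USER_DATA_START_CODE + b"EFBE" + additional_bits

def multiplex(texture_bitstream: bytes, additional_bits: bytes) -> bytes:
    """Place the user-data packet after the frame's texture bits."""
    return texture_bitstream + wrap_additional_bits(additional_bits)
```

Because the additional bits ride inside user data, the same bitstream decodes on legacy decoders (texture only) and on EFBE decoders (texture plus shape, quantizer, and link information).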

3: Features of the EFBE
The architecture of the proposed EFBE is designed such that it can be implemented as a simple extension of the existing frame-based encoder architecture. The embedded shape information in the bitstream of the proposed EFBE can be utilized to support several content-based functionalities. At the receiver, the shape information of a video object in the bitstream facilitates the identification of the region belonging to the arbitrarily-shaped video object as a hotspot on the rectangular frame for content-based interactivity (see Fig. 1(b)). For example, a hyperlink can be provided for the object on the rectangular frame when a user activates the object by clicking on it. Such an application does not require the shape of the OOI to be accurate. Furthermore, lossy shape coding can be employed to achieve higher compression. Annotations of video hyperlinks, in the form of small icons containing the polygonal approximation of the object contour, can be displayed at the bottom of the rectangular video for a user to identify the hotspots in a frame. During fast-forward or fast-reverse operations, only the shape can be decoded and displayed instead of decoding the entire rectangular frame texture. The vertex-based shape representation carries the semantic meaning of the object, and therefore it can be made use of in future multimedia applications such as video indexing and retrieval.
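The hotspot interactivity above amounts to a hit test of the click position against the decoded contour polygons. The ray-casting test and the (polygon, hyperlink) pairing below are illustrative assumptions, since the paper leaves the link-table format to the post-processing stage:

```python
def point_in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test on a decoded OOI contour."""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def on_click(x, y, hotspots):
    """hotspots: list of (polygon, hyperlink) pairs decoded from the
    additional bits; return the hyperlink hit by the click, if any."""
    for poly, link in hotspots:
        if point_in_polygon(x, y, poly):
            return link
    return None
```

Because the test runs on the polygonal approximation, a loosely segmented (lossy) contour still yields a usable hotspot, consistent with the observation that the OOI shape need not be accurate.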

4: Experimental results
The conventional frame-based video encoder (CFBE) in the MPEG-4 Verification Model (VM) software [1] is used as the basis to implement the proposed EFBE. The CFBE forms the texture coding block of the EFBE. We modified the VM software to incorporate the following: 1) our contour-based shape coding technique, 2) an algorithm for adjustment of the quantization step based on the polygonal approximation of the shape contour in the DST mode of operation, and 3) the technique of multiplexing the additional header information, the encoded shape information, and the link and content information of hotspots.
For our experiments, we use the first 100 frames of the 30Hz CIF-size Akiyo and Foreman video sequences and the associated OOI shape contours (see Fig. 2). Akiyo is a low-motion talking-head video and Foreman is a moderate-motion video. The video sequences are encoded at 10 frames/sec, so there are 34 coded frames in the bitstream. In the experimental results presented in this paper, we do not consider the bits required for embedding the link and content information of hotspots.
First, we compare the performance of the IST mode of the EFBE with the performance of the CFBE. A fixed quantization step of Q=16 is used for all the macroblocks. In the case of our EFBE, we set the shape coding parameters δ and T as follows: {δ = 10, T = 0} for Foreman and {δ = 10, T = 5} for Akiyo. The texture coding in the IST mode of the EFBE is the same as that in the CFBE; therefore both encoders yield the same video quality. The comparison of the number of encoded bits is shown in Table 1. The number of additional shape bits in the EFBE bitstream is shown separately. Since T = 5 was used for coding the OOI shape of Akiyo in our tests, we observed that the OOI shape was encoded in only four out of the 100 frames, and the total number of shape bits in the entire bitstream was 720; thus, the average number of shape bits per frame is 720/34 ≈ 21, which is a negligibly small value as compared to the average number of texture bits. In the case of Foreman, the shape is encoded for the OOI in each coded frame because we set T = 0. We observe that the shape bits are 0.125% and 1.29% of the texture bits for Q = 8 and Q = 16, respectively. However, this additional overhead of shape bits in the EFBE bitstream as compared to the CFBE bitstream is greatly justified by the benefit achieved in terms of the several useful content-based functionalities that the shape information enables.
In Table 2, we present the comparison of the performance of the DST mode of the EFBE with that of the CFBE using the Foreman video. In the case of the CFBE, a fixed quantization step of Q=16 is used for all the macroblocks in a frame. Whereas in the DST mode operation of the EFBE, a lower quantizer (Q=8) is used for the OOI region and a higher quantizer (Q=31) is used for the remaining part of the frame. For the shape coding in the DST mode of the EFBE, we set {δ = 10, T = 0}. The quality and bitrate for the first Intra-frame encoded with the two encoders are presented in Table 2. Using the EFBE, a higher PSNR for the OOI region is achieved at the cost of a lower PSNR for the background region as compared to the overall PSNR obtained with the CFBE.

5: Conclusions

The architecture and design of the proposed enhanced frame-based video encoder is presented. The main aim of the proposed encoder is to provide an enhancement to conventional frame-based coding. By embedding the coded representation of a video object contour along with the coded texture in the bitstream, it is possible to achieve several useful content-based functionalities. In our experimental results, the overhead of the additional bits required for shape coding is less than 2% of the total bits of the conventional frame-based video coding.

Figure 1. (a) The enhanced frame-based video encoder (EFBE). (b) An example of EFBE video display that shows a user clicking on the hotspot to trigger a message display.