Video Analysis for Interactive Story Creation: The Sandmännchen Showcase

This paper presents a method to interactively create a new Sandmännchen story. We built an application which is deployed on a smart speaker, interacts with a user, selects appropriate segments from a database of Sandmännchen episodes and combines them to generate a new story that is compatible with the user's requests. The underlying video analysis technologies are presented and evaluated. We additionally showcase example results from using the complete application, as a proof of concept.


INTRODUCTION
The days of a passive public depending upon a handful of selected broadcasters for their information and entertainment are long gone. Thanks to the internet, and to smartphones in particular, users can access the content they want, where and when they want it. Content covering every topic and niche is today available in every format, whether audio, video, image or text. Content can now be individually tailored by broadcasters and professional content creators for a particular playout channel and a relevant target group. Professional content creators and owners can create new, or reinvent existing, broadcast channels to successfully find an audience for their content.
AI4TV'20, October 12, 2020, Seattle, WA, USA. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-8146-8/20/10. https://doi.org/10.1145/3422839.3423061
In the ReTV project we have intensively explored and researched how end users can benefit from AI-based recommendation and user profiling systems. The result is the "4u2" use case, which aims to provide consumers with quick and easy access to personalised content from broadcasters and media archives via novel publication channels. To realize and test this use case, we have developed an application for use with smart speakers. This application, the Abendgruß, is based on the well-known children's programme "Unser Sandmännchen", from Rundfunk Berlin-Brandenburg (rbb), which is a seven-minute show broadcast daily. Targeted at pre-school children, "Unser Sandmännchen" accompanies children to bed with a bedtime story at 18:00.
Sandmännchen episodes are simply structured: there is always a framing story, consisting of an intro and an outro (the Sandmännchen arrives and leaves a selected setting in a specified way), which surrounds a main story, usually an adventure of one or more of his friends. There is a continuously growing amount of archive material from which these structural elements can be drawn.
Currently, Sandmännchen episodes are created primarily for TV broadcasting. This means that the selection of production elements (i.e., which framing story to use, which main story elements to combine with which other) depends strongly on the requirements of rbb's daily (live) TV programme. Live broadcasting requires that Sandmännchen elements are limited by available broadcast time, and rbb's on-demand platforms, such as the Mediathek, host only these TV versions of the programme.
Here the Abendgruß application for smart speakers comes into play. Thanks to the app, there are unrestricted possibilities for the creation of personalised Sandmännchen episodes. These can be created according to user preference, and are no longer defined by the requirements of live TV broadcast. Users create their own Sandmännchen episode completely free from editorial restrictions, and their choices are driven only by their personal preferences. The Abendgruß application is supported by functions that we developed for video adaptation and re-purposing that are based on Artificial Intelligence (AI) techniques, most notably deep neural networks.

SANDMÄNNCHEN VIDEO ANALYSIS
2.1 Related Work
To be able to properly (and with minimal human intervention) select segments from a Sandmännchen episode, we constructed a video analysis framework. Such video analysis starts with temporally fragmenting the video into meaningful segments, and subsequently employing annotation methods for generating high-level metadata (e.g., concept / object detection results).
There exists a plethora of video temporal fragmentation methods. Most of them deal with segmenting a video into elementary structural units called shots, which are defined as groups of consecutive frames captured without interruption by a single camera [1]. Older methods employed handcrafted rules based on color features [17,25,39,40], local descriptors [1,4,6,8,22] or a fusion of such features [10,15]. Due to the success of deep convolutional neural networks (DCNN) on various computer vision tasks, most recent attempts are based on the use of DCNNs [9,11,18,37,38]. Beyond techniques that segment videos into shots, a substantial number of works deal with fragmentation at different granularities (coarser or finer): some methods decompose a video into coarse and semantically coherent temporal segments, known as scenes [5,16,29,32], while others work on the finer side of fragmentation and further decompose shots into visually coherent parts that correspond to individual video capturing activities, usually referred to as sub-shots [2]. However, none of the aforementioned methods has direct application in our scenario: we need to detect specific parts of the structure of Sandmännchen episodes. For this, we need to design a domain-specific method, like [43], where soccer games are segmented according to the specific semantics of this sport, or [21], where the individual stories (on different topics) within a TV news broadcast are determined.
Regarding concept-based annotation, and again as a result of the widespread use of DCNNs, the focus has moved from employing Support Vector Machines (SVM) [3,41] or local descriptors [23,24] to an explosion of DCNN model architectures for concept detection [13,14,30,33,34,44] as well as object detection [12,19,20,27] in image / video. Of particular interest are methods that can be used to "adjust" (i.e., retrain) DCNN models that were trained for a visual annotation task on one dataset, to a new, considerably different dataset, a task known as "finetuning" [26,28,35].
Our goal is to construct a method that fragments a Sandmännchen episode taking into consideration the peculiarities of the application domain as discussed in Section 1, i.e., the presence of an intro / outro and a main story part, and annotates the main story part with the main involved character. For this we adjust and employ methods from the literature, combining them to construct a complete Sandmännchen story creation framework.

Temporal Segmentation
Each Sandmännchen episode has three parts. In chronological order, these are: (1) The introductory part, where the Sandmännchen arrives in a different vehicle and setting every time, enters a room with children, and starts narrating his story. (2) The main part of the episode, which is the story that the Sandmännchen narrates. The Sandmännchen is not visible; it is assumed we are fully immersed in his story. This part deals with a different character, or set of characters, each time. (3) The closing part, when the Sandmännchen has finished his narration and is leaving in the same way that he arrived. For the temporal segmentation of Sandmännchen episodes, following visual inspection of a large set of episodes, we decided to detect the intro transition (i.e., the transition from the introductory part to the main story) and the outro transition of an episode (i.e., the transition from the main story to the closing part of the episode). In most cases, the frames around the intro and outro transitions contain a characteristic camera zooming in and out from a screen, respectively. The screen is different every time, sometimes being a TV screen, other times being just a projection on a wall. The zooming is accompanied by a fading transition, where in most cases the camera zooming fades out to a white frame.
We performed a statistical analysis of the (temporal) position and the duration of the intro and outro transitions on a set of 80 randomly selected episodes. We calculated the mean position of the intro transition to be at second 84 from the start of the video, with a standard deviation of 39 seconds. Similarly, the mean position of the outro transition was found to be at second 346, with a standard deviation of 151 seconds. The duration of the transitions also varied among episodes; they were on average 1.4 seconds long, with a standard deviation of a little over 1 second. From this analysis it becomes clear that a simple heuristic for segmenting a Sandmännchen episode into its three main parts based on, e.g., just time information, would fail.
We first implemented a DCNN-based method to segment the video into shots by adopting and extending the method of [9]. Our extension marginally improves the accuracy of the original method when applied to Sandmännchen videos, and concerns two directions: (1) The inclusion of a post-processing stage, similar to the technique used in [1], to analyse the computed similarity scores between consecutive frames. Specifically, the time series formed by the shot transition probabilities is first smoothed using a moving average filter with a temporal window of 5 frames. Then, the first-order derivative of the smoothed time series is calculated to discover the local minima and maxima. Each discovered local maximum is considered a shot transition. (2) The introduction of a trick to use a larger temporal window for the input to the network without affecting speed. Specifically, instead of analysing 10 consecutive frames, we choose to analyse frames using a quadratic incremental step, i.e., the inference of our model for the frame with index i is based on the analysis of the frames with indices i − 8, i − 4, i − 2, i − 1, i, i + 1, i + 2, i + 4, i + 8, i + 16. This way, we allow the model to look at a larger temporal window while still using just 10 frames as input to our model (i.e., the time efficiency remains unaffected).
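The two extensions can be illustrated with a minimal Python sketch. It is not the implementation used for the reported results: the function names, the clamping of indices at the video borders and the optional `min_prob` filter are assumptions made for illustration, and the per-frame transition probabilities are taken as given rather than produced by a network.

```python
import numpy as np

# Offsets around the current frame index i, as listed in the text.
OFFSETS = (-8, -4, -2, -1, 0, 1, 2, 4, 8, 16)


def sample_indices(i, num_frames):
    """Return the 10 frame indices analysed for frame i.

    Clamping to [0, num_frames - 1] at the borders is an assumption.
    """
    return [min(max(i + o, 0), num_frames - 1) for o in OFFSETS]


def detect_transitions(probs, window=5, min_prob=0.0):
    """Smooth the transition probabilities and return local maxima.

    probs:    per-frame shot transition probabilities.
    window:   moving average length (5 frames in the text).
    min_prob: hypothetical extra filter on the smoothed peak height.
    """
    probs = np.asarray(probs, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(probs, kernel, mode="same")
    deriv = np.diff(smoothed)  # first-order derivative
    peaks = []
    for k in range(len(deriv) - 1):
        # A local maximum sits where the derivative stops being positive.
        if deriv[k] > 0 and deriv[k + 1] <= 0 and smoothed[k + 1] >= min_prob:
            peaks.append(k + 1)
    return peaks
```

For instance, `sample_indices(100, 1000)` spans frames 92 to 116 while still selecting only 10 frames, which is the point of the enlarged temporal window.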
We also trained a Random Forest classifier on a set of simple (and cheap to compute) frame features that, based on our intuition, are able to capture the variations of the sought transitions. These features are the following:
• ECR: We compute the Edge Change Ratio (ECR), which represents the amplitude of edge changes between two consecutive frames [42].
• Homogeneity: We convert the video frame to grayscale and we compute the range of the pixel intensity values, as a means to quantify the visual information contained in the frame.
• Blackness: We convert the input frame to grayscale and we compute the average of all pixel intensity values, to quantify how black the frame appears to be.
• Whiteness: We convert the input frame to the HLS [31] colorspace and we compute the range of all pixel values in the L (i.e., "lightness") component, to quantify how white the frame appears to be.
• Blurriness: We calculate the variance of the Laplacian of each frame, a well-known practice in image analysis, to quantify how blurry the frame is.
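As a rough illustration of how cheap these features are, the sketch below computes four of them with NumPy only. It rests on several assumptions not stated in the text: frames are RGB arrays with values in [0, 255], grayscale uses the common BT.601 luma weights, the Laplacian is a plain 4-neighbour stencil over interior pixels, and ECR is left out because it needs an edge detector and the previous frame. All function names are hypothetical.

```python
import numpy as np


def to_gray(rgb):
    """Luma-weighted grayscale (BT.601 weights, an assumed choice)."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ np.array([0.299, 0.587, 0.114])


def lightness(rgb):
    """L component of HLS: mean of the per-pixel max and min channel."""
    rgb = np.asarray(rgb, dtype=float)
    return (rgb.max(axis=-1) + rgb.min(axis=-1)) / 2.0


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian computed on interior pixels."""
    g = np.asarray(gray, dtype=float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def frame_features(rgb):
    """Return the per-frame features (ECR omitted in this sketch)."""
    gray = to_gray(rgb)
    light = lightness(rgb)
    return {
        "homogeneity": float(gray.max() - gray.min()),  # intensity range
        "blackness": float(gray.mean()),                # average intensity
        "whiteness": float(light.max() - light.min()),  # range of L
        "blurriness": laplacian_variance(gray),         # Laplacian variance
    }
```

The resulting feature vectors, one per frame, would then be fed to any standard Random Forest implementation.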
To detect the three parts of a Sandmännchen episode (i.e., the intro, outro and main story) we first perform temporal segmentation of the input video into shots. In parallel, we employ our Random Forest classifier model trained on the above-mentioned image analysis features for classifying the video frames into two classes: "normal frame" and "transition frame". Then, taking a further step and not relying solely on this frame-level prediction (i.e., on whether a frame was correctly classified as belonging / not belonging to a transition), we incorporate the results of shot segmentation for making a video-level prediction for the intro and outro transitions. This is accomplished by employing the following simple domain rules:
• For a frame to be considered a "transition frame", besides having a high inferred probability from the Random Forest classifier, it must also belong to either the first or the last 1/3 of the video. This rule was employed since the main story part in all analysed Sandmännchen episodes was the largest part of each episode, and is always in the middle of the video.
• For a frame to be considered a "transition frame", it must additionally have a temporal distance of no more than four seconds from a shot boundary. We employed this rule since the longest transition was observed to be four seconds long, and the intro and outro transitions are always marked as a shot change by the shot segmentation module.
After the application of these additional domain rules we select the shot that contains the highest ranked "transition frame" and belongs to the first 1/3 portion of the video as the last shot of the introductory part. Correspondingly, we select the shot that contains the highest ranked "transition frame" and belongs to the last 1/3 portion of the video as the last shot of the main story.
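The video-level decision described above can be sketched as follows. This is an illustrative reading of the two domain rules, not the code behind the reported results: the function name, the `min_prob` threshold standing in for a "high" probability, and the representation of shots by their first frame index are all assumptions.

```python
from bisect import bisect_right


def select_transition_shots(frame_probs, shot_starts, fps,
                            min_prob=0.5, max_gap_s=4.0):
    """Pick the last shot of the intro and the last shot of the main story.

    frame_probs: per-frame "transition frame" probability (classifier output).
    shot_starts: sorted first-frame indices of the detected shots, starting at 0.
    min_prob:    assumed threshold for a "high" probability.
    Returns shot indices, or None where no frame passes both rules.
    """
    n = len(frame_probs)
    boundaries = shot_starts[1:]           # frames where a new shot begins
    max_gap = max_gap_s * fps              # four seconds expressed in frames
    best = {"intro": None, "outro": None}  # (probability, frame index)

    for i, p in enumerate(frame_probs):
        if p < min_prob:
            continue
        # Rule 1: only the first or the last third of the video qualifies.
        if i < n / 3:
            region = "intro"
        elif i >= 2 * n / 3:
            region = "outro"
        else:
            continue
        # Rule 2: at most four seconds away from a shot boundary.
        if not any(abs(i - b) <= max_gap for b in boundaries):
            continue
        if best[region] is None or p > best[region][0]:
            best[region] = (p, i)

    def shot_of(entry):
        # Index of the shot that contains the selected frame.
        return None if entry is None else bisect_right(shot_starts, entry[1]) - 1

    return {"intro_last_shot": shot_of(best["intro"]),
            "main_last_shot": shot_of(best["outro"])}
```

Note how a very confident frame in the middle third, or one far from any shot boundary, is discarded before ranking; only the surviving frames compete for the highest probability.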

Character Annotation
As discussed in Section 2.2, the Sandmännchen appears in the introduction and closing part. The main story deals with a different protagonist each time. The protagonist can be a single character (e.g., Kalli, a blonde boy) or a character set, whose members always appear together throughout the whole main story (e.g., Rita und das Krokodil: Rita and her very hungry friend, Crocodile, who lives in the bathtub). For the sake of brevity, in the sequel the term character may refer to a single character or a character set. There are 30 characters in total. We employ a DCNN model of the state-of-the-art EfficientNet architecture [36]. We utilized the weights of an EfficientNet instance trained on the 1000 classes of the ImageNet challenge [7] as the initial weights of our model and then fine-tuned it to detect the character of the main story, with a technique similar to [26]. We decided to select a subset of 11 characters out of the total 30, since we consider this a good starting point for bringing our application to life. The 11 detectable characters are: 1) Herr Fuchs und Frau Elster, 2) Jan und Henry, 3) Kalli, 4) Der kleine König, 5) Der kleine Rabe Socke, 6) Die Moffels, 7) Meine Schmusedecke, 8) Pittiplatsch, Schnatterinchen und Moppi, 9) Plumps, 10) Pondorondo, and 11) Rita und das Krokodil.
Our model annotates each frame of an input video with the detection score for each one of the 11 characters. However, we do not rely solely on these frame-level character predictions; we also calculate a video-level prediction. For this we perform majority voting over the frame-level predictions, since a Sandmännchen episode deals with a single character in its main story part. Although the audio stream could also have been used for performing this classification, our results indicate that this is not needed, as perfect results are observed at the video level (following the majority voting) by using just the visual classifiers. The framework of all utilized video analysis methods described in Section 2.2 as well as in the current section is summarized in Fig. 2.
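The majority voting step is simple enough to sketch in a few lines of Python. The function name and the input layout (one list of 11 detection scores per frame) are assumptions for illustration; ties are resolved in favour of the label that reached the top count first, which the text does not specify.

```python
from collections import Counter


def video_level_character(frame_scores, labels):
    """Majority vote over frame-level predictions.

    frame_scores: one sequence of per-character detection scores per frame.
    labels:       character names, in the same order as the scores.
    """
    votes = Counter()
    for scores in frame_scores:
        # Frame-level prediction: the character with the highest score.
        top = max(range(len(scores)), key=scores.__getitem__)
        votes[labels[top]] += 1
    winner, _ = votes.most_common(1)[0]
    return winner
```

Even if a sizeable share of the frames is misclassified (for example, frames that do not show the main character at all), the vote still returns the right character as long as the correct class is the most frequent one.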
Regarding the selection of a framing story, our application (Section 3) will initially rely on a finite set of previously-annotated episodes. Therefore, it was deemed that for the identification of the vehicle that Sandmännchen uses in the intro and outro sections of an episode there is no need for developing a video analysis method to automate it, at least for this first phase of the Abendgruß application.

Video analysis service
The video analysis techniques discussed in previous sections have all been incorporated into a video analysis component. This component is deployed as a REST service that: a) retrieves a video file, b) performs the temporal segmentation of a Sandmännchen episode, c) analyzes the main part to identify the main character, and d) stores the results in a JSON-structured file which can be downloaded using a specific type of call.
The REST service works in an asynchronous way, i.e., through a 3-step process. The first step relates to an HTTP POST call that enables the submission of a video for analysis and the initiation of a relevant session in the REST service. The second step is associated to an HTTP GET call that queries the status of the initialized session and the progress of the analysis. Finally, the third step is performed by another HTTP GET call that enables the retrieval of the results of a successfully completed session.
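The three-step interaction can be sketched from the client's point of view as below. Everything concrete here is hypothetical: the endpoint paths, the JSON field names (`session_id`, `state`) and the in-memory stand-in for the service are invented for illustration, since the actual API of the service is not specified in this paper. The transport functions are injected so that the sketch runs without a network.

```python
import time


def run_analysis(http_post, http_get, video_url,
                 poll_interval_s=1.0, max_polls=60, sleep=time.sleep):
    """Submit a video, poll the session status, then fetch the results."""
    # Step 1: HTTP POST submits the video and opens a session.
    session_id = http_post("/analysis", {"video_url": video_url})["session_id"]
    for _ in range(max_polls):
        # Step 2: HTTP GET queries the status of the session.
        state = http_get(f"/analysis/{session_id}/status")["state"]
        if state == "completed":
            # Step 3: another HTTP GET retrieves the JSON results.
            return http_get(f"/analysis/{session_id}/results")
        if state == "failed":
            raise RuntimeError(f"analysis session {session_id} failed")
        sleep(poll_interval_s)
    raise TimeoutError(f"analysis session {session_id} did not finish")


class InMemoryService:
    """Toy stand-in for the REST service; completes after a few polls."""

    def __init__(self, polls_needed=3):
        self.polls_needed = polls_needed
        self.polls = 0

    def post(self, path, payload):
        return {"session_id": "demo"}

    def get(self, path):
        if path.endswith("/status"):
            self.polls += 1
            done = self.polls >= self.polls_needed
            return {"state": "completed" if done else "running"}
        return {"main_character": "Kalli"}  # made-up result payload


service = InMemoryService()
result = run_analysis(service.post, service.get,
                      "https://example.org/episode.mp4",
                      sleep=lambda seconds: None)
```

In a real client, `http_post` and `http_get` would wrap an HTTP library of choice; the polling loop itself stays the same.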

SANDMÄNNCHEN APPLICATION
3.1 Application overview
Since our focus is on video content, Abendgruß is designed primarily for use with smart speakers that have a display. The first prototype was developed as an action for Google Assistant, focusing on the Google Nest Hub. It works as follows:
• To start Abendgruß, the user says "OK, Google, mit Abendgruß sprechen" (OK, Google, speak to Abendgruß).
• The Nest Hub answers: "In Ordnung, ich starte die Testversion von Abendgruß" (All right, I'm starting the test version of Abendgruß).
• The application opens.
• The user sees the start screen and gets a welcome combined with a call to action: "Hallo! Um deinen eigenen Abendgruß zu sehen, sage das Wort Abendgruß" (Hello! To watch your own Abendgruß, say the word "Abendgruß").
• After the user says "Abendgruß", two options are shown. First, the user can choose how the Sandmännchen should arrive, for example, "Zu Fuß oder auf dem Elefanten?" (On foot or on the elephant?). In other words, the framing story (see Section 1) is defined in this step.
• Secondly, the user determines the main story by answering the question "Und welche Geschichte möchtest Du heute sehen?" (And what story do you want to see today?). Again two options are presented, e.g., "Rita und das Krokodil oder Die Moffels?" (Rita and the crocodile or The Moffels?).
• Finally, the Abendgruß application shows an automatically generated Sandmännchen video which consists of the respective framing and main story elements.

Communication with audio and video analysis APIs
The Abendgruß application aims to be conversational, which means that it needs to deal with users speaking a command in many different ways. A user might just say "Jan" instead of "Jan und Henry". Mapping all possible inputs to clearly defined API calls is usually done with a chatbot framework. We use Google's Dialogflow, since it is tightly integrated with the Google Assistant. When a user speaks to the Abendgruß application on the Google Assistant, their commands are sent to Dialogflow and mapped to API calls. Those calls are then sent to the Abendgruß API, which either returns options for the user to choose from, or the customized video in the final step. If no mapping is possible, Dialogflow will tell the user that their command could not be understood. The Abendgruß API periodically checks if new Sandmännchen episodes have been published, by monitoring selected Web sources. If this is the case, they are sent through the Video analysis service, and the results are stored in the Abendgruß database, ready to be integrated into future stories. See Fig. 3 for an overview of how the different software services work together.
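To make the mapping problem concrete, the following Python sketch resolves a free-form utterance to one of the offered stories using a small keyword table. It is only a simplified stand-in for what a chatbot framework such as Dialogflow provides and does not use or mimic its API; the `SYNONYMS` table and the function names are hypothetical.

```python
# Hypothetical keyword table: story title -> words that identify it.
SYNONYMS = {
    "Jan und Henry": {"jan", "henry"},
    "Rita und das Krokodil": {"rita", "krokodil"},
    "Die Moffels": {"moffels"},
    "Kalli": {"kalli"},
}


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.casefold())
    return set(cleaned.split())


def resolve_choice(utterance, options):
    """Return the single offered option the utterance refers to, else None."""
    spoken = tokens(utterance)
    matches = [name for name in options
               if spoken & SYNONYMS.get(name, tokens(name))]
    # Ambiguous or unknown input is reported as "not understood".
    return matches[0] if len(matches) == 1 else None
```

With the options "Jan und Henry" and "Kalli", the utterance "Jan" resolves to "Jan und Henry", while an unrelated word yields `None`, which corresponds to the case where the user is told that the command could not be understood.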

Video Analysis Results
Our models for the identification of the main story character were trained and evaluated on a dataset we manually curated. The specifications of this dataset are reported in Table 1. The training dataset is in the form of a set of selected frames, while the testing dataset consists of videos in order to be able to evaluate the whole character identification process, i.e., including the video-level character inference. For the Sandmännchen episode structure segmentation, we compiled a training dataset of 50 Sandmännchen episodes, by manually annotating the structure of each episode. We also annotated 30 Sandmännchen videos to create a testing dataset. In Table 2 we report the frame-level identification results in the form of a confusion matrix. Overall, for the detection of the transitions, using the Random Forest classifier at frame level we achieve an F-score of 88.54%. Employing the additional domain rules, as discussed in Section 2.2, for the video-level prediction of transitions we reach an F-score of 91.67%.
Regarding the Sandmännchen character identification frame-level predictions, in Tables 3 and 4 we report the evaluation results and confusion matrix, respectively. We observe that by just using the DCNN model for the frame-level predictions, there are classes that perform very well (e.g., Herr Fuchs und Frau Elster with 92.8% accuracy) but also classes with noticeably poor performance (e.g., Meine Schmusedecke with 51.3% accuracy). Our intuition for explaining this sometimes low accuracy, besides the varying difficulty of detecting each character due to its specific characteristics, is that the video frames that are analysed do not always depict the main character (see Fig. 4 for indicative examples). However, we should highlight that after employing majority voting to infer video-level predictions, as discussed in Section 2.3, we achieve a perfect score on our test dataset, i.e., 100% accuracy for all classes.

Sandmännchen Application Results and Examples
We have already presented our Abendgruß application prototype to project stakeholders and a wider audience at various fairs such as the IFA. A small user study, with a questionnaire for parents regarding the concept of the application, was conducted. The aim was to find out to what extent the application meets the expectations of parents and their children, in the context of a smart speaker application for children, and where adjustments may still be necessary. The feedback was very positive throughout. In particular, the idea of enabling the end user to interact directly with the content of a broadcaster in order to personalise it was met with great approval. The fact that the smart speaker was the device of choice was considered reasonable and relevant to the times. With regard to the target group of the Abendgruß, i.e., pre-school children or parents with pre-school children, the application also convinced the audience with its simple, child-friendly operation. Moreover, the approach of using AI techniques to realise the concept generated great interest and was considered highly innovative. In addition, rbb confirmed that the idea of the Abendgruß for smart speakers fits perfectly into the broadcaster's plans to 1) open up new distribution channels and 2) expand its digital offer for the Sandmännchen. Additionally, we should clarify that the application always allows the user to select between two options for each of the intro/outro and main story, and these two options are selected randomly each time the application is used; thus, the user cannot select the exact same characters/stories again and again. By offering a variety of episodes, we ensure that the educational effect of the Sandmännchen series is not lost.
A typical session with our application starts with the question of how the user wants the Sandmännchen to arrive. We provide two randomly selected options, as can be seen in Fig. 5. After the user answers, the next screen provides another two randomly selected options for the main story, as illustrated in Fig. 6. We provide below three indicative usage examples of the developed application, as a proof of concept. In the first example, the application selects the appropriate videos and segments from the database, as seen in Fig. 7, and starts playing the video of the constructed story after saying "Hier ist Dein Video" (Here is your video). In the second example, the application again selects the appropriate videos and segments from the database, as seen in Fig. 8, and starts playing the video of the constructed story after saying "Hier ist Dein Video" (Here is your video). Similarly to the previous examples, in the third example the application selects the appropriate videos and segments from the database, as seen in Fig. 9, and starts playing the video of the constructed story after saying "Hier ist Dein Video" (Here is your video).

CONCLUSIONS AND NEXT STEPS
We presented an application for smart speaker devices equipped with a display to interactively create custom videos of Sandmännchen episodes. We presented the underlying video analysis technologies that enable the automatic generation of such videos, and performed experimental tests to evaluate their effectiveness on our specific usage scenario. The application as a whole was evaluated in a qualitative way and examples of use were reported as a proof of concept.
With respect to future work, our goal for the next version of the application, aimed at Amazon Alexa, is to adjust / expand all the video analysis components so that the creation of more diverse personalised episodes using voice commands can be achieved. Specifically, we plan to expand the set of identifiable characters so as to include all of the Sandmännchen friends, and introduce a method to also automate the annotation of the intro and outro sections of an episode with information on the vehicle that the Sandmännchen uses.

ACKNOWLEDGMENTS
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV.