An Empirical Evaluation of Video Conferencing Systems Used in Industry, Academia, and Entertainment [Work-in-Progress]

Video Conferencing Systems (VCS) are used daily—at work, in on-line education, and for get-togethers with friends and family. Many new VCSs have emerged in the past decade and a new market-leader has risen during the coronavirus period of 2020. Understanding how these systems work could help us improve them rapidly. However, no experimental comparison of such systems currently exists. In this work we propose a method to compare VCSs in real-world operation and implement it as a tool. Our method considers four main kinds of real-world experiments. Each captures different aspects, such as communication channels (audio, video, audio-video) and types of network environments (e.g., Ethernet, WiFi, 4G), and reports system and network utilization. We further implement an automated tool to conduct these real-world experiments, and exper-iment with three popular VCSs, Zoom, Microsoft Team, and Discord. We find that there are significant performance differences between these systems, and their behavior in different environments.


INTRODUCTION
Video Conferencing Systems (VCSs) are essential to communication between people, in business, academia, and for entertainment purposes. For example, the ICPE 2020 conference took place online, in April 2020, and chose to leverage Zoom as its live (synchronous) communication platform [11]. However, relatively to other online * These authors have an equal and leading contribution. The others have proposed the research problem, helped manage the project, and contributed to writing the article.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).  : Performance results for Zoom during Experiment 1 from Section 3 (in red) and during a real-world lecture (in blue); performance variability is much higher for the latter. and interactive systems [12,20], we understand little about how these VCSs perform in the real world. The organizing team of ICPE considered many alternatives to Zoom and tested several, but these are anecdotal experiences and currently no peer-reviewed performance study of VCSs exists. To address this problem, we develop a method to compare VCS behavior in real-world settings. We further design and implement an automated tool around this method, and use it to conduct real-world experiments using three popular VCSs; Figure 1 depicts a typical result, for Zoom. VCSs are becoming prominent in many areas of business, academia, and entertainment. With in-person meetings limited in the entire world due to COVID-19, VCSs are becoming the de facto tool for meetings; Zoom and Microsoft Teams already service over 100 million users each [21]. Video conferencing is rapidly becoming the norm for internal teams at Microsoft [8] and a large percentage of recruitment interviews occur over video call [3]. In a recent survey by Zoom, respondents suggest using VCSs could help more people get advanced degrees and lower dropout rate [19]. Game streamers combine both in-game and real-world channels [13].
Existing work has studied VCSs from different perspectives, focusing on a particular video-encoding protocol [12], on using virtualization platforms to deploy VCSs [23], and from a societal perspective [22,26]. The ICPE 2020 organizers give a sample performance result-the bandwidth consumption of Zoom during a single conference-session [11, Fig.2]. However, no study analyzes or compares the performance of commonly used VCSs, such as Zoom, (Microsoft) Teams, and Discord.
As suggested by the high-level model for VCSs introduced in Section 2, comparing VCSs experimentally is challenging. First, VCSs allow the user to choose between different communication channels, i.e., audio and video, and combined. This choice, in turn, drastically influences the resource consumption and the performance of VCSs. Second, the resource consumption of VCS clients, their CPU, memory, and bandwidth, may show idiosyncratic characteristics relatively to other online applications. For example, Figure 1 depicts resource consumption for Zoom in two different settings, an experiment we control and a real-world online lecture. The figure summarizes four main metrics, each in a sub-plot. Among the possible observations supported by this figure, the one-on-one controlled experiment exhibits lower performance variability than the real-world lecture. Third, the interplay between network and VCS performance is essential but non-trivial. Fourth, and as a last focus of this work, also the codecs used by the VCS affect performance. The codec itself can trade-off local performance (i.e., CPU-limited) with remote operation (i.e., adding bandwidth consumption). Addressing these challenges, we focus in this work on the research question How to measure and compare the system-and applicationlevel performance of VCSs?
We also identify as a problem the lack of open-access datasets about VCS operation, ready for FAIR [25] consumption.
Addressing the research question and providing a public and Findable, Accessible, Interoperable, and Reusable (FAIR) dataset, the main contribution of this work is three-fold: (1) We design a method for comparing VCSs and implement it as a tool for real-world experimentation (Section 3

SYSTEM MODEL FOR VCS
VCSs such as Zoom, Teams, and Discord differ in many aspects, e.g., their architecture [4], how their traffic is routed [2] or redirected [14]. To facilitate the design of a method to experiment across multiple VCSs, in this section we propose a high-level model.
Our model captures the high-level structure and operation of VCSs. Figure 2 is a visual overview of the model. We focus on VCSs based on a client/server communication model. Due to the lack of publicly available information, this model focuses on the client-side and does not detail how the server operates (typically, inside the data center). The model considers two main roles for each client, as sender (component 2 in Figure 2), which captures the input feed ( 1 ) and uses an encoder on it ( 3 ); and as receiver ( 6 , mirroring 2 ), which uses a decoder on the data it receives ( 7 ) and provides for another participant in the meeting an output feed ( 9 ). The encoding and decoding processes use VCS-specific software.
To reach the network, each client uses a connection ( 4 ). The datacenter redirects the incoming data to a VCS server ( 5 ), which processes and possibly re-encodes it, and sends it to all receivers. Consequently, the (possibly re-encoded) data traverses again the Internet and another network connection, per client.
We do not make any assumptions about the architecture of the server, e.g., centralized or distributed, hierarchical or flat. However, we make the assumption that clients do not form their own interconnections, bypassing the data center round-trip.
Our model considers two main types of data-streams, regular and compressed. The former are unencrypted data, the audio and/or video streams the user experiences. Compressed data are encrypted, often smaller-sized, possibly re-encoded, streams.

DESIGN OF EXPERIMENTS AND AUTOMATED EXPERIMENTATION TOOL
In this section we propose a method to compare VCSs experimentally, and implement the method as an automated tool.

Method Overview
In our method, clients run on different machines in the network, and the input feed represents the benchmark workload. To evaluate the performance of a VCS, the measuring tool (component 8 in Figure 2) runs on one client, capturing traffic-metrics and other resource-consumption information, and later calculates performance metrics. It measures computer and network metrics, focusing on: CPU utilization; and network bandwidth, number of packets, and packet sizes, each of these network metrics are measured in and out. Our method considers four major operational aspects: regular operation of VCS systems, and operation under constrained bandwidth, with various connection types, and under a large workload. We design a specific experiment for each; Table 1 summarizes them. Experiment 1 captures regular operation with various input channels enabled, i.e., audio, video, and audio+video; for all other experiments, we use only the latter, audio+video. Experiment 2 throttles (limits) the download speed (bandwidth) available to clients, lowering it by 50 kBps every minute from the initial limit (of 500 kBps, for a 10-minute experiment). Experiment 3 compares the performance with a wired connection (Ethernet) against various wireless  connections, i.e., Wi-Fi 4 (802.11n) and 5 (802.11ac), and 4G LTE. 1 Experiment 4 consider a larger workload. Section 3.2 details the workload for each experiment.
Limitations: Our method represents work in progress toward understanding the performance characteristics of VCSs used prominently from 2020 on. We specifically do not consider within the scope of this work: throttling local resources other than bandwidth, setting encoding and re-encoding parameters for audio and/or video, uncovering the operation and architecture of VCS servers, analyzing delay and the impact of network conditions (e.g., jitter, delay) on it, assessing quality of experience from the user's perspective, considering a larger set of VCSs and deeper VCS models.

Workload for Each Experiment
All experiments maintain the default video and audio settings, for all systems under test. The video resolution defaults to 1920 × 1080, a commonly used resolution for 2020.
The basis of each workload consists of (1) for video-only, a video shared FAIRly [18], with a duration of 2:36 minutes played in a continuous loop; (2) for audio-only, an audio stream generated by turning on the public radio and recording using the microphone (channel NPO2 [1]); (3) for audio+video, both, combined and looped. These are real-world fragments, and workloads combining them are realistic and reproducible. Experiment 1 (see Table 1) considers each of the three types of workload, in turn. Experiments 2 and 4 consider only the combined workload (type 3). Last, Experiment 3 contrasts the combined workload with the workload produced by a real-world classroom (an interactive session in our Distributed Systems course); although the classroom workload is not fully reproducible, we conjecture that many classrooms or generally interactive sessions with a central moderator will generate very similar workloads. We repeat the experiments for every system 50 times. This is done by taking 1-minute samples from a benchmark that we run for 50 minutes.

Tool to Automate Experiments
In this section, we design and implement a tool to automate the experiments proposed in our method for comparing VCSs. We have tested the tool and found it able to work with popular VCSs, such as Zoom, Teams, and Discord. Figure 3 depicts the workflow our tool automates. First, the tool sets up the configuration (component 1 ); the user only gives the network interface as input. Then, the tool retrieves the IP address used by the VCS, and sets up a network sniffer for it ( 2 ). To achieve this, the tool leverages PyShark [15], which is a Python wrapper for 1 Wi-Fi 6 routers are relatively expensive and thus not widely deployed. the popular network analyzer Wireshark/Tshark. Over the course of the experiment, the tool collects all the metrics ( 3 and 4 ), using Psutils [10]. Wondershaper [16] throttles the bandwidth (used in Experiment 2, see Table 1). The tool writes the results to a file ( 5 ) and/or displays them for the experiment admin to see ( 6 ).

REAL WORLD EXPERIMENTS
We use our method discussed in Section 3 to evaluate the performance of three popular VCSs through real-world experiments. Table 1 gives an overview of our experiments, from which we derive the following Main Findings: MF1 There are large performance differences between different VCSs under the same workload. MF2 Bandwidth limitations lead to local CPU-usage limitations. MF3 Increasing the number of participants and video streams introduces high performance variability. MF4 Wireless networks significantly increase resource performance variability compared to wired Ethernet. Wi-Fi 5 is more stable than both Wi-Fi 4 and 4G LTE.

Experimental Setup
The workloads used in our experiments are shown in Table 1. Each experiment is conducted using a PC equipped with a Ryzen 7 2700X CPU (8 logical cores) and 16GB DDR4 memory. The wired connection is Ethernet-based, with the machine connected to the Internet via the Dutch ISP KPN NetwerkNL, an ADSL with 100 Mbps download bandwidth and 10 Mbps upload bandwidth. The wireless connections use a high-quality AC1750 Wi-Fi router, via and the same Internet connection. The 4G LTE connection carries through KPN, a high-quality provider.

Results of Real-World Experiments
In the remainder of this section we describe the results of real-world experiments leading to MF1 through MF4. For detailed analysis of our results please refer to our technical report [6].
1. Varying Input Channels (Leads to MF1). Figure 4 shows the results of Experiment 1, in which we observe the behavior of our VCSs while using an audio+video workload, a video-only workload, and an audio-only workload. The results show large performance differences between the observed VCSs. The CPU utilization is expressed as the number of utilized CPU cores, where a value of 1 is equivalent to 100% CPU utilization on a single core. The maximum value is equal to the number of logical CPU cores (8,  2. Throttling Network Bandwidth (MF2). We cover Experiment 2 in our technical report [6]. To experiment the bandwidth throttling we limited the receiving client. In a stream-oriented communication, codecs play a key role to achieve high-quality audio and video, yet with reduced resource usage. Zoom uses the codec H.264 Annex G(SVC) [24], which is a Scalable Video Codec (SVC). For it, we expect a decrease in CPU utilization when bandwidth throttling is applied. Teams uses the H.264 Advanced Video Codec (AVC) [17], which use a single bit-rate for a stream. We expect its CPU utilization to decrease when the bandwidth is throttled, as the codec will need to run the same algorithm on less data. The Audio Codec used by Teams and Zoom are SILK [17] and Opus. Since they both use the same technology for streaming audio frames we did expect the same results in terms of audio-only workload for the CPU utilization. Overall, our results are consistent with these expectations, but also we have encountered problems in properly applying bandwidth throttling. We leave solving this issue for future work.
3. Scalability Experiment (MF3). Experiment 3 compares two settings, (1) a small one-on-one discussion and (2) a large an interactive classroom of over 20 persons, both using Zoom. Figure 1 depicts the results; for more details, see our technical report [6]. Overall, the classroom-related results exhibit much higher variability-the curves have higher and more peaks. This matches the expectation of more dynamic nature of the classroom activity.

Different Connection Types (MF4).
We cover Experiment 4 in our technical report [6]. The main findings are: (1) When using a wireless connection type, both the variability and the maximum values increase for all three metrics, and (2) Wi-Fi 5 shows less variability than both Wi-Fi 4 and 4G LTE.

CONCLUSION AND FUTURE WORK
We designed an experimental method to analyze and compare VCSs such as Zoom, Microsoft Teams, and Discord. Our experiments consider this kind of software as black-boxes, and subjects each system under test to a set of controlled and uncontrolled experiments. As our results indicate, the method allows us to compare the resource consumption of VCSs in settings with different properties: in oneon-one and classroom environments, with one or both of the audio and video channels enabled, using various kinds of network connections, and under limited (throttled) bandwidth. We have presented four main findings. Among the detailed results, we see strong evidence that Discord uses significantly more CPU utilization and bandwidth compared to Teams and Zoom during video streaming. For future work, we would like to measure server-side metrics such as latency and jitter. A community effort should continue this task-join our effort!