Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services

Domain classification services have applications in multiple areas, including cybersecurity, content blocking, and targeted advertising. Yet, these services are often a black box in terms of their methodology for classifying domains, which makes it difficult to assess their strengths, aptness for specific applications, and limitations. In this work, we perform a large-scale analysis of 13 popular domain classification services on more than 4.4M hostnames. Our study empirically explores their methodologies, scalability limitations, label constellations, and their suitability for academic research as well as other practical applications such as content filtering. We find that the coverage varies enormously across providers, ranging from over 90% to below 1%. All services deviate from their documented taxonomy, hampering sound usage for research. Further, labels are highly inconsistent across providers, who show little agreement over domains, making it difficult to compare or combine these services. We also show how the dynamics of crowd-sourced efforts may be obstructed by scalability and coverage aspects as well as subjective disagreements among human labelers. Finally, through case studies, we showcase that most services are not fit for detecting specialized content for research or content-blocking purposes. We conclude with actionable recommendations on their usage based on our empirical insights and experience. In particular, we focus on how users should handle the significant disparities observed across services both in technical solutions and in research.


INTRODUCTION
The need to classify websites became apparent in the early days of the Web. The first generation of domain classification services appeared in the late 1990s in the form of web directories. Notable examples from this period are Yahoo! Directory [1] and DMOZ [3]. The main purpose of such services was to facilitate the discovery of web pages relevant to a certain topic of interest. To this end, human editors manually classified sites (often relying on suggested categories submitted by other users) into a purpose-specific taxonomy [4]. The quick expansion of the Internet soon put this approach to an end and led to the development of automated classification solutions [5-9].
As the Web grew in size, content, and applications, domain classification services became a valuable facilitator in multiple areas.
One key application is traffic filtering, i.e., networking solutions designed to block access to sites that are deemed dangerous (e.g., phishing or malware [10,11]) or inappropriate (e.g., adult content). Cybersecurity firms such as McAfee [12] and OpenDNS [13] (Cisco) rapidly developed their own products. These technologies are nowadays embedded in multiple applications and setups such as parental control solutions and traffic filters in schools [14], libraries, and enterprise networks [15-17]. The online marketing industry also found domain classification extremely useful, in particular to improve targeted contextual advertising [18-20]. This led the Interactive Advertising Bureau (IAB) to develop an open standardized taxonomy for real-time bidding protocols [21]. Finally, networking, privacy, and security researchers also rely on website classification services to conduct category-dependent measurements [22-24] or to discover websites falling in a given category [25-27].
To the best of our knowledge, no study so far has specifically analyzed the coverage, labels and applicability of domain classification services in different scenarios and research domains. Classifiers that were developed for different target applications or with different methodological approaches often exhibit disparate characteristics in terms of their coverage and taxonomies. This may have a substantial impact on how much the applications and studies that rely on them can be trusted. In fact, previous research studies reported the need for manual classification of websites due to the shortcomings of commercial services [9,28,29].
Unfortunately, the evaluation of these services is complicated by their opacity. While many services claim to apply machine learning algorithms, it is unclear how thoroughly they perform concrete analyses to validate their solutions, how comprehensive the underlying training data is, and, ultimately, how trustworthy and accurate the resulting classification is. Similarly, services such as DMOZ and OpenDNS that rely on human volunteers may be biased due to subjective opinions in the moderation process. Therefore, classification services may not succeed at adequately covering the large diversity of websites in both number and nature.
In this paper, we address the questions above by presenting a first analysis of domain classification services. Specifically, we make the following contributions:
• We analyze 13 popular services selected through purpose-specific web searches as well as through a survey of all the academic works published during 2019 (§ 2). We find that the results of 24 academic papers published in 9 relevant conferences (e.g., IMC, WWW) depend on the outcome of the domain classification services that they use. Then, we present a qualitative analysis of the approach followed by these domain classification services according to their documentation. We find that key differences in their approaches might affect coverage and accuracy (§ 3).
• We evaluate the coverage of these services for both popular and unpopular domains, their labeling methodology, their taxonomies, and the (dis-)agreements across services when labeling the same domains. We crawl the labels assigned to 4.4M domains and find that most services lack coverage (only two services have a coverage above 55%), especially for non-popular domains. Furthermore, we show that their complex taxonomies (in particular for marketing-oriented classification services, with sometimes over 7.5k observed labels) hinder sound interpretation (§ 4).
• We study how introducing humans in the labeling process might impact the coverage and label consistency of those services (§ 5). We find that manual classification is affected by disagreements, ambiguities, and mismatches in the labeling process as well as biases in the distribution of users that submit votes and the workload of editors. This translates into some domains receiving as many as 58 rejected labels. To gain a better understanding of these challenges, we run a controlled experiment involving manual domain labeling and find disagreements in 35.5% of the cases.
• We explore the performance of domain classification services as tools to identify websites of interest. To do so, we run three case studies in the areas of detecting (and filtering) advertisement and tracking, adult content, and CDN or hosting infrastructure (§ 6). We find that the accuracy and coverage of the studied services are extremely low, and that the choice of one service or another significantly affects the outcome because of differences in coverage, which ranges from over 95% to below 1%.
• Finally, we discuss the implications of our findings for both the technical and academic applications of these services (§ 7). We also provide recommendations on how users should handle the significant disparities observed across services and identify a number of research questions for future work.

USAGE IN ACADEMIC STUDIES
In this section we assess the relevance of domain classification services to academic studies. Given that the unknown properties of these services can impact research results, it is important to understand how widespread their usage is and what they are used for in the literature.
Survey approach. We survey all 1,014 papers published in 2019 at top venues in three areas: i) network measurements (IMC, PAM, TMA, CoNEXT, SIGCOMM); ii) security and privacy (CCS, NDSS, S&P, USENIX Security, PETS); and iii) Web (WWW). We first search for the names of domain classification services as well as keywords that indicate that such a service is used. We then discard obvious false positives, such as the Amazon Alexa voice assistant instead of the Alexa domain classification service.
Usage. We manually analyze the remaining papers and find 26 papers that use at least one domain classification service (Figure 1). We find that for 24 (92%) of these, their results depend on the choice of service as they use it to gather their initial dataset or validate their results. Papers accepted at WWW and IMC are the ones that tend to rely the most on domain classification services. VirusTotal is the most popular service among academic studies (12 papers [40] or gambling and dating websites [26]). One paper [41] also uses the list of top sites per country. Our analysis reveals one paper using SurfControl [42], but as this service was acquired by Websense in 2007 [43], we do not consider it further. Table 6 in Appendix A lists all analyzed publications per venue.
Purpose. The 26 papers using domain classification services do so for a wide range of purposes. We find that 9 (35%) of them focus on security topics, including mobile sensors attacks [39] and certifications in the online payment industry [44]. We find 4 (15%) papers studying privacy in specific website categories (e.g., tracking on pornographic websites [25]) or email tracking [31]. We identify 6 (23%) measurement papers, e.g., on resource reloading by third-party websites [24] or web complaints [32]. Finally, 4 papers question the accuracy and applicability of existing domain classification services and either choose not to rely on them [28,29] or manually validate the results [45,46].
Takeaway: We find that 26 papers published at top peer-reviewed conferences from 2019 use domain classification services. For 92% of these, their results depend on the choice of service, even though these services are sometimes questioned. As we will show later, in the absence of ground truth this dependence can introduce biases in the study results.

METHODOLOGY OF DOMAIN CLASSIFICATION SERVICES
We perform an analysis of the 13 domain classification services listed in Table 1 using publicly available information. We select them based on their usage in recent academic works (§ 2), extending the set with services found through targeted online searches. Note that 2 of the domain classification services that we consider (FortiGuard and Webshrinker) were not used by any of the surveyed academic papers published in 2019. Our list does not cover all commercially available services, but those omitted pose a high barrier for data collection because of technical or monetary reasons. Furthermore, VirusTotal is unique in that it does not provide its own classification, but instead aggregates category labels from third-party scanners. At the time of our data collection, these scanners were Alexa, Bitdefender, Dr.Web, Forcepoint, Trend Micro, and Websense. However, since July 2020, these consist of (at least) Bitdefender, Comodo Valkyrie Verdict, Dr.Web, Forcepoint ThreatSeeker, Sophos, and Yandex Safebrowsing. We consider the former services (independently) in our evaluation. In § 4.2, we evaluate the consistency of services across multiple available sources.
Our evaluation focuses on features and methodological aspects that might affect how these services can be used in technical solutions and academic studies. Table 1 shows the features exhibited by the selected services according to their documentation and websites. A more detailed description of each service is provided in Appendix B. We also register our own domain and set up a live website hosting a WordPress blog, and then request its classification from each provider to investigate their approach to classifying new domains. We consider the following properties:
Inputs. The granularity of input provided to the classifier affects the correctness of the classification: a subdomain may host a different kind of content than its base domain.
For example, subdomains of the base domain (yahoo.com) may host a search engine (search.yahoo.com), a sports news site (sports.yahoo.com), or a webmail service (mail.yahoo.com). Depending on the origin of domains to be classified, e.g., domain top lists often used by researchers that can include subdomains [63,64], this can impact the accuracy and perception of the labels. All evaluated services may provide a separate classification for a subdomain. However, Alexa does not have a way to retrieve the classification given a (sub)domain. Instead, it requires searching through its listings of the top 500 domains in one of 279,716 categories.
Outputs. The outputs affect the utility of the data to a study's purpose. If a service yields multiple categories for a given site, this may improve the applicability and correctness of the classification as it can be more nuanced, e.g., tagging a sports news website as both sports and news. However, this could also lead to an incoherent interpretation, e.g., double-counting when aggregating domains by category. All services except FortiGuard and Forcepoint can assign multiple categories to domains. Purpose. In many cases, the provider's intended purpose for a service (e.g., content filtering, threat protection, marketing or discovery of relevant content) influences the used taxonomy. For example, a content-filtering service may prefer to label youtube.com purely as a bandwidth-consuming site, but a marketing-oriented service may label it as a video sharing or advertising platform. Most of the classification services analyzed are intended for content filtering, usually being integrated into their consumer or business web security software. One exception is VirusTotal, which provides only a threat assessment. Further exceptions are Alexa, DMOZ, and Curlie, which are designed for discovering sites within categories of interest. Moreover, certain services also have other applications. For instance, Webshrinker can categorize domains according to the marketing-oriented taxonomy of the Interactive Advertising Bureau (IAB) [21].
Updates. The ability to update classification results affects both coverage and accuracy. Real-time classification, often enabled by a fully automated analysis, may improve coverage and maintain data relevance. In other words, new sites can be immediately assigned to a category, and the classification will reflect the most recent content. For example, a change in website ownership would not result in outdated labels. Automated approaches may also increase the scale at which domains can be classified, in particular when additional data is used to label uncrawlable domains (e.g., malware domains). The ability to request reclassification of a site may allow errors to be corrected, but it may also be leveraged to undeservedly receive a less "harmful" classification if requests are not adequately reviewed. For example, an adult website may attempt to get reclassified as a (non-adult) video streaming site in order to evade filtering. Only Forcepoint, Symantec and Webshrinker provide real-time results: we confirm through web server logs that upon request, they immediately visit and categorize a domain that we newly registered. Webshrinker even proactively visits the domain (likely due to its entry in the zone file), and is the only one to deploy a real browser. This behavior can be traced back to the methods that services claim to use, mostly consisting of automated classification through machine learning algorithms. [69] state in their documentation that they complement their crawler-based ML solution with domain metadata, security honeypots and scanners, and third-party feeds and logs, as well as human reviewers who inspect and amend automatically determined categories. OpenDNS, DMOZ, and Curlie rely on human volunteers to propose and confirm categories; Alexa uses a truncated version of DMOZ's data and taxonomy [54].
All services except VirusTotal, Bitdefender and Dr.Web provide a way to request domain reclassification: for our newly registered domain, the delay of several days before any change suggests that this process requires human intervention.
Access. Easy access to data and documentation improves usability for end users and researchers. For instance, clear descriptions and examples of sites that are considered part of a category aid in selecting the appropriate categories for other websites. Bitdefender and Dr.Web do not provide direct free access to their data, but they are available through VirusTotal. Dr.Web is the only service that does not document its taxonomy. VirusTotal does not document where and how it sources its data. In § 4.3, we compare the documented categories with those that we observe empirically.
Takeaway: The substantial differences in domain classification services' characteristics affect their applicability: label interpretation depends on a service's supported inputs and outputs as well as taxonomy differences due to their purpose, while coverage and accuracy benefit from easy access to up-to-date labels. These properties should therefore be well understood to ensure correct application. We assess the veracity of services' claims through our own empirical observations in § 4, to determine their effective suitability to different scenarios.

DOMAIN LABELING QUALITY
In this section we analyze domain classification services on their labeling coverage (§ 4.2), their individual taxonomies (§ 4.3), and the labeling consistency and relationships across providers (§ 4.4). In this analysis, we omit DMOZ and Curlie as they aspire to achieve a different goal, i.e., supporting content discovery instead of concisely classifying all domains. This affects their data retrieval strategy and interpretation, and we would need to reverse their mapping of deeply nested categories to relevant domains.

Data collection
Our data collection process consists of two stages:
(1) Compiling target domains. We compile a large list of domains starting from the union of all daily Alexa top sites rankings between September 1 and 30, 2019. To reduce possible biases caused by the instability of the Alexa ranking [22,63,64], we aggregate these rankings using the default method of the Tranco top list [64], which sums domain scores from individual lists following a Zipf-like distribution. We retain a ranked list of 4,424,142 domains that we could successfully collect from all non-rate-limited services. While these 4.4M domains represent a small fraction of all registered domains [72], they are considered to be popular by the Alexa traffic ranking service. Their popularity is further reflected by the fact that 47% of the 4.4M domains are indexed in the Chrome User Experience Report [70] and 0.5% by Common Crawl [71], both generated between August and October 2019. We therefore believe that our set is representative of domains regularly visited by end users and therefore also of interest to researchers.
(2) Crawling domain classification services. We retrieve the category labels for the 11 selected domain classification services. As each service differs in how its online portals retrieve data, we develop the most scalable and least resource-intensive method possible for each provider.
• For FortiGuard, McAfee, and OpenDNS, we retrieve labels through their publicly available portals. While these services are not rate-limited and their data is public, we perform our data collection at a non-intensive average rate of 40 requests per minute. We retrieve McAfee's labels for its "Real-Time Database" product.
• For VirusTotal, we retrieve labels through its API, which aggregates six services: Alexa, Bitdefender, Dr.Web, Forcepoint, Trend Micro, and Websense. We received access to VirusTotal's academic API, with a request limit of 20k queries per day and account.
• For Symantec, Trend Micro and Webshrinker, our data collection is subject to rate limiting. Therefore, we retrieve labels on these three services for the top-10k domains in our ranked list. We retrieve Webshrinker's labels from its default marketing-oriented IAB taxonomy.
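The aggregation step in stage (1) can be illustrated with a minimal sketch. It assumes that each daily list contributes a score of 1/rank per domain (a Zipf-like weighting) and that the scores are summed across days; the function name `aggregate_rankings` and the toy lists are hypothetical and only serve the illustration, they are not the pipeline used in this study.

```python
from collections import defaultdict


def aggregate_rankings(daily_lists):
    """Combine daily ranked lists by summing an assumed 1/rank score per domain."""
    scores = defaultdict(float)
    for ranked in daily_lists:
        for position, domain in enumerate(ranked, start=1):
            scores[domain] += 1.0 / position
    # Highest aggregate score first; ties are broken alphabetically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy input: three daily lists (hypothetical domains).
days = [
    ["a.com", "b.com", "c.com"],
    ["b.com", "a.com"],
    ["a.com", "c.com", "b.com"],
]
combined = aggregate_rankings(days)
print(combined)
```

A domain that is missing from a daily list simply receives no score for that day, so domains that appear only sporadically end up lower in the combined ranking.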

Coverage
One critical aspect to consider when using domain classification services is their coverage, defined as the number of websites for which they provide a meaningful label. This metric affects how comprehensively a service can both execute its original task and be deployed for large-scale applications and studies. As discussed in § 3, some domain classification services involve humans in the loop, while others try to achieve a larger scale or real-time classification using machine learning methods. As a result, not all services have the same ability to scale their labeling process. When measuring coverage, we apply a sanitization process to address the fact that five services (FortiGuard, OpenDNS, Websense, Forcepoint and Trend Micro) provide explicit labels for unclassified domains. We consider a domain "unlabeled" if we obtain an empty result, or a label explicitly stating that the service has not (yet) labeled the domain (e.g., Uncategorized for Forcepoint). Figure 2a shows for which percentage of our full set of 4.4M domains we obtain a valid label. The diagonal reveals that the coverage varies greatly between individual services. The off-diagonal values report the "intersection coverage", defined as the number of domains that both services label simultaneously, regardless of the label provided. FortiGuard and McAfee excel by labeling around 94% of domains, likely due to their deployment of machine learning techniques for automated classification. Contrarily, OpenDNS only achieves 15% coverage, with its manual submission and voting processes (§ 5) likely becoming a bottleneck when dealing with the millions of monthly domain registrations [72]. Alexa's coverage is even lower at 0.53%, possibly due to its data source DMOZ [54] containing human-volunteered labels in often highly specialized (and therefore less popular) categories designed for content discovery, as well as its limit of 500 websites per category.
Services retrieved through VirusTotal also have much lower coverage; we will show later on that this may in part reflect a service integration issue at VirusTotal, as services do yield a label when directly queried.
For completeness, we also compute the "union coverage" between pairs of providers. We define it as the percentage of websites for which at least one service provides a valid label (Appendix C). This analysis suggests that considering the union of two services does not necessarily increase the global coverage when their intersection is already high. For example, the union coverage for FortiGuard and McAfee increases slightly to over 98%. However, as we will discuss in § 7, the combination of labels from multiple services is non-trivial due to largely disjoint taxonomies. As a result, unless the objective of unifying providers is offering complementary perspectives, it might not necessarily benefit coverage.
The importance of being popular. Table 2 shows that service coverage differs depending on domain popularity. We expect automated services to achieve a higher coverage even for less popular domains, but we observe that while McAfee and FortiGuard maintain a consistent coverage of at least 93% throughout, Bitdefender and Forcepoint drop from 93% and 98% to 27% and 48%, respectively, when labeling domains from either the top-1k or unpopular domains found in the long tail over 1M. We observe a similar behavior for Dr.Web, Websense, Trend Micro, and Alexa, who have relatively low coverage overall but perform worse for non-popular websites. The human labeling efforts of OpenDNS appear to prioritize popular domains (an expected feature). Nevertheless, OpenDNS coverage across domains ranked over the top-1M may be inflated by the 15% subdomains within that interval. As we will discuss next, in OpenDNS, subdomains typically inherit the label of the base domain. Finally, Trend Micro (directly sourced), Symantec and Webshrinker achieve a very high coverage of over 96% for the top-10k, but their rate limits make large-scale data collection unfeasible. In summary, only two services are able to categorize both popular and non-popular domains.
Given the ever-increasing number of websites as well as the trend to conduct large-scale measurements, the choice of service impacts the capacity to classify potentially millions of visited or targeted domains, including undesired ones.
Base domain vs. Subdomains. We identify 582,230 (13%) subdomains among our 4.4M domains. Three services (OpenDNS, McAfee, and FortiGuard) provide labels for more than 99% of them. Yet, as we will see in § 4.3, there is no difference between base and subdomain labels in the majority of cases. In the case of OpenDNS, the improvement compared to its overall coverage (15%) stems from its approach to labeling subdomains. When humans do not offer a category for a subdomain, OpenDNS classifies it by default with the label of the base domain (if labeled). However, this coverage is skewed towards the 77% subdomains related to three base domains: blogspot.com, wordpress.com, and tumblr.com. For Alexa, Websense, and Trend Micro, subdomain coverage is below 1%. Depending on the source and selection of domains, overall coverage may therefore become worse.
Direct Source vs. VirusTotal. We verify labels collected through VirusTotal (which aggregates 6 existing services) by directly collecting labels for the top-10k domains at two services, Trend Micro and Alexa. As shown in Table 3, Trend Micro's coverage is much higher (98%) when directly queried than when using VirusTotal (28%). Moreover, only 27% of the domains are classified with the same label and only half of the distinct labels appear at both sources. As we will expand on in § 4.3, we suspect VirusTotal may be using a different or an older Trend Micro product, with a potentially lower coverage and different set of labels.
However, for Alexa we observe the opposite behavior: we obtain 12% more coverage through VirusTotal. Again, this may point to VirusTotal obtaining Alexa's data from an unknown source, different to our (one-time) search within the top 500 sites of Alexa's 279,716 categories. The inconsistencies between VirusTotal and a direct source indicate that the former might not be a fully reliable source. This is particularly worrisome given VirusTotal's popularity in recent academic work (§ 2).
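The coverage metrics used in this section (per-service coverage after sanitization, pairwise intersection coverage, and pairwise union coverage) can be sketched as follows. This is an illustrative example under stated assumptions: the set of placeholder labels in `PLACEHOLDERS`, the helper names, and the toy data are hypothetical, and each service is assumed to return at most one raw label string per domain.

```python
# Assumed placeholder labels that mean "not classified"; real services differ.
PLACEHOLDERS = {"", "uncategorized", "unknown", "not rated"}


def is_labeled(label):
    """A label counts only if it is non-empty and not an explicit placeholder."""
    return label is not None and label.strip().lower() not in PLACEHOLDERS


def labeled_set(results):
    """Domains for which a service returned a meaningful label."""
    return {domain for domain, label in results.items() if is_labeled(label)}


def pct(subset, all_domains):
    """Share of the target domains (in percent) contained in `subset`."""
    return 100.0 * len(subset & all_domains) / len(all_domains)


all_domains = {"a.com", "b.com", "c.com", "d.com"}
service_x = {"a.com": "News", "b.com": "Uncategorized", "c.com": "Sports"}
service_y = {"a.com": "News and Media", "b.com": "Blogs", "d.com": ""}

x, y = labeled_set(service_x), labeled_set(service_y)
coverage_x = pct(x, all_domains)            # a.com and c.com
coverage_y = pct(y, all_domains)            # a.com and b.com
intersection_cov = pct(x & y, all_domains)  # labeled by both services
union_cov = pct(x | y, all_domains)         # labeled by at least one service
print(coverage_x, coverage_y, intersection_cov, union_cov)
```

Intersection and union coverage only check whether a label exists; they say nothing about whether the two services agree on the category.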

Labels within services
In this section, we report on the distinct labels that we observe in each service, and the properties that affect their correct and tractable interpretation: their diversity, deviations from documentation, and uniqueness. We normalize all labels to lowercase, and we break down multi-labeled classifications into their individual units to reduce possible inconsistencies in the comparison.
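As a minimal sketch of this normalization step, the snippet below lowercases raw labels and splits multi-label results into individual units. The separator characters in `SEPARATORS` are an assumption made for illustration; each service may encode multiple labels differently, and the function name `normalize_labels` is hypothetical.

```python
import re

# Assumed separators between labels in a raw multi-label result.
SEPARATORS = re.compile(r"[,;|]")


def normalize_labels(raw):
    """Split a raw classification string into a set of lowercase labels."""
    parts = SEPARATORS.split(raw)
    return {part.strip().lower() for part in parts if part.strip()}


single = normalize_labels("News")
multi = normalize_labels("Blogs | Content Delivery Networks")
print(single, multi)
```

Representing each domain's labels as a set makes later comparisons (for example, between a subdomain and its base domain) independent of label order and capitalization.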
Label diversity. Table 4 shows that the number of observed labels per service varies significantly across services, but conforms to their intended purpose. Security and content filtering services have fewer labels (12 observed in Dr.Web to 125 observed/139 documented in Forcepoint), which may simplify the setup of security policies. Conversely, the larger diversity in marketing-oriented services (300 observed/401 documented in Webshrinker, and more than 7,500 observed in Alexa) may enable more fine-grained targeting. We also see that all services except Websense use at least one label that is unique to them, showing that their taxonomies are diverse and not trivial to merge. While some services offer hierarchical taxonomies that can reduce the diversity by replacing a label with that of an ancestor, this compromises precision and forces users to decide where to prune the tree. This complexity is best exemplified by labels for Alexa queried through VirusTotal, which will only yield the label of the leaf. This is often a non-English label, derived from that website's classification into the multilingual World tree. For example, a given domain may be labeled as Arts (English), Artes (Spanish), or Kultur (German). In short, it is hard to reduce the large set of labels, without affecting their usability and interpretability. Documented vs. Observed labels. In order to further understand how well these services document their taxonomy, we compare the documented categories with those that we observe in our dataset. As shown in Table 4 44 and 40; for the former, this is due to the low number of labels observed ( Table 4). The label pairs are often unevenly distributed, e.g., in Trend Micro, 2% of the labeled domains have the most popular pair disease vector-spam, while the next most popular pair financial services-business economy appears only on 0.2% of the domains. 
In McAfee and OpenDNS, the most popular pairs, personal pages-internet services and blogs-content delivery networks, appear on 1% and 39% of labeled domains respectively. Common pairs are also not always intuitively linked. For example, in Dr.Web, the most popular pair is adult content-social network, appearing in 65% of all domains labeled by Dr.Web, where 60% of them are subdomains of blogspot.com. When using aggregated labels from VirusTotal without taking into account individual services, a non-adult blog could, therefore, be inadvertently labeled as an adult site, impacting applications targeting adult content.
Base domain vs. Subdomains. We saw in the previous section that coverage on subdomains is better compared to the general coverage, in the case of OpenDNS with an improvement of 70%. We now analyze how meaningful these labels are. We see that for OpenDNS, McAfee, and FortiGuard, 99%, 98%, and 97% of subdomains, respectively, have at least the label of the base domain. However, since domains at McAfee and OpenDNS can be multi-labeled, we observe that the percentage of the subdomains that have the same labels as the base domain drops to 46% in OpenDNS, while in McAfee, below 1% of the subdomains have different labels. This drop in OpenDNS is because 90% of blogspot.com subdomains, which represent 51% of the total subdomains observed, have the original label of the base domain (Blogs) plus an extra label, typically Content Delivery Networks (90% of cases). We conclude that subdomains inherit the label of the base domain, without taking into account the actual content of the subdomain.
Labeling update. As discussed in § 3, the frequency of label updates affects the timeliness and, therefore, accuracy of labels. We analyze how common such updates are for the 9 services that do not rate limit (see § 3). We select 2,000 domains per service: half of them were previously labeled by (at least) that service, while the rest were unlabeled for the particular service.
We select domains that have been crawled at the beginning of our data collection, to increase the time that these services had to (re-)label the domains.
We find that in our second round, only OpenDNS, FortiGuard, and McAfee categorize domains that had not been previously labeled. However, the number of updates varies: while McAfee and FortiGuard now label 88 and 53 out of 1,000 previously unlabeled domains, OpenDNS only does so for 2 domains. Similarly, for domains that had been previously labeled, McAfee and FortiGuard relabel 15 and 10 domains, respectively. The majority of these changes concern the maliciousness of domains, with some of them gaining a related label (e.g., malicious sites) while others lose such a label. Finally, for OpenDNS, three domains gain a label, although two of those receive the label Content Delivery Networks outside of the regular voting process (§ 5). In summary, some services update labels over time, making it more likely that their classification better reflects the current state of a website.
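The comparison between the two collection rounds can be sketched as a simple diff over per-domain label sets. The snippet is illustrative only: the function name `diff_rounds` and the toy data are hypothetical, and labels are assumed to be already normalized into sets, with an empty set meaning "unlabeled".

```python
def diff_rounds(first, second):
    """Return (newly_labeled, relabeled) domains between two crawl rounds."""
    newly_labeled, relabeled = [], []
    for domain, old in first.items():
        new = second.get(domain, set())
        if not old and new:
            # Unlabeled in the first round, labeled in the second.
            newly_labeled.append(domain)
        elif old and new != old:
            # Previously labeled and the label set changed (gained or lost labels).
            relabeled.append(domain)
    return sorted(newly_labeled), sorted(relabeled)


round_1 = {"a.com": set(), "b.com": {"blogs"}, "c.com": {"news"}}
round_2 = {
    "a.com": {"malicious sites"},
    "b.com": {"blogs"},
    "c.com": {"news", "malicious sites"},
}
newly_labeled, relabeled = diff_rounds(round_1, round_2)
print(newly_labeled, relabeled)
```

Note that in this sketch a domain that loses all of its labels in the second round is also counted as relabeled.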

Labels across services
The differences in both label number and coverage (see Table 4) call for a better understanding of the relationships between services. This analysis is however hindered by inconsistencies in label syntax (e.g., News vs. News and Media), language (e.g., Arts vs. Artes), semantics (e.g., File sharing vs. File storage), and aggregation (e.g., sports vs. entertainment/sports). Furthermore, one provider may give multiple labels to a particular domain, requiring a comparison of sets of labels with different dimensions. Mutual information. In this section, we take a statistical approach to perform a label-agnostic analysis. A suitable metric is the mutual information, which describes the amount of information gained about a random variable upon observing another random variable [75]. Mutual information can be thought of as the reduction in one variable's entropy (level of uncertainty) if the output of another variable is observed. In our case, we treat each provider as a random variable whose distribution of values (i.e., labels) we estimate empirically. We can then interpret the mutual information as how similarly the labels are distributed between two services. Its normalized value will be 1 if one service assigns a common label to all domains (and none other) that are given a common label by the other, regardless of the exact label syntax. Conversely, it will be 0 if the services are completely independent, i.e., there is no information to be gained about the first when observing the labels of the second.
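As an illustration of this label-agnostic comparison, the following sketch estimates the normalized mutual information between two providers from their labels on a shared list of domains. Two choices here are assumptions that the text does not specify: each domain's label set is treated as a single categorical value, and the mutual information is normalized by the arithmetic mean of the two entropies.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_mutual_information(xs: Sequence[Hashable],
                                  ys: Sequence[Hashable]) -> float:
    """NMI between two providers; xs[i] and ys[i] are the labels of domain i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("both providers must label the same non-empty domain list")
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))   # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    normalizer = (h_x + h_y) / 2        # assumed normalization (arithmetic mean)
    # If both providers use a single label everywhere, nothing can be learned.
    return mutual_information / normalizer if normalizer > 0 else 0.0
```

For example, a provider labeling four domains as News, News, Arts, Arts and another labeling them Noticias, Noticias, Artes, Artes yield a value of 1, even though no label string matches.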
We select McAfee, OpenDNS, Bitdefender, Forcepoint, VirusTotal and FortiGuard for this analysis as they are the services with the largest coverage (see Table 2). VirusTotal is a special case: while it meets the coverage criterion, its labels are aggregated from other providers, including Bitdefender and Forcepoint. The normalized mutual information matrix is shown in Figure 3. Overall values are low, indicating disagreement between providers, which is due to several reasons. First, services such as OpenDNS and Bitdefender differ in specialization, providing either a content- or a security-oriented label, e.g., Online Service vs. Spam. Next, human-sourced services such as OpenDNS may suffer more from subjective labeling (§ 5) and therefore disagree more with automated services such as McAfee. Differences in the size and granularity of taxonomies (e.g., between VirusTotal and FortiGuard) can introduce further disagreement. Conversely, shared sources of labels or taxonomies may inflate agreement: we see the highest mutual information between VirusTotal and two of its aggregated providers, due to their partially shared data source. We observe consistent results when repeating our analysis using the conditional entropy.
Label frequency. Next, we compare the distribution of labels over domains, in order to understand the label coverage as well as service specialization. Figure 4 presents the normalized label frequencies for the top-1k, 10k and 100k domains in our ranked list. In all three subsets, but in particular for the top-100k, there is a significant number of outlier labels that appear with a much higher frequency, indicating that labels are distributed unevenly. With the exception of VirusTotal, the median frequency for labels across domains is relatively consistent. On the top-1k domains, OpenDNS shows the smallest granularity in terms of coverage, while VirusTotal shows the highest. The trend is partially maintained when considering larger domain sets, where Bitdefender, FortiGuard and McAfee span the considered domains with the smallest number of labels.
Label distribution. Finally, we observe two trends in the concrete distributions of labels between providers. First, we see that, especially when considering more than two providers, one fixed set of domains corresponds to largely varying sets of labels that cannot trivially be combined into one category: e.g., Nudity, Society and Lifestyle, and Adult Content are overlapping but not equivalent categories. We provide a visual example of these inter-service label relationships in Appendix D. Secondly, we find that labels are distributed unevenly across pairs of providers: e.g., for McAfee, the lower granularity of its taxonomy means that few labels cover the set of domains generated by a large number of labels from other services while for VirusTotal far more labels are needed. This distribution of labels across services is further explored in Appendix E. In summary, differences in service purpose, taxonomy size and label distribution cause large disagreements between services, making it difficult to compare and combine their classifications.
Takeaway: We find that commonly used domain classification services exhibit traits that affect their suitability, both for technical solutions and for research. Only a few services attain a level of coverage that is sufficient to cover non-popular or non-base domains. Services may return multiple or undocumented labels, requiring careful data processing and even manual validation. Breaking down multi-labeled classifications may ease the label comparison between services as well as improve the interpretation of the results. However, it may also bias the results, overestimating the presence of labels that do not provide information about the real purpose of the service. The large diversity in labels, both within and across services, may harm their accurate and tractable interpretation. Efforts to combine labels from multiple services to achieve a higher agreement on label accuracy might be thwarted by labeling inconsistencies. Labeling updates may also have an impact on accuracy and timeliness. Researchers should be aware of these phenomena and renew their datasets to reduce possible misclassifications, especially when dealing with malicious services. In summary, sound deployment and usage of domain classification services requires a thorough understanding of the (desired) characteristics and resulting biases to select the most appropriate sources.

HUMAN PERCEPTIONS
As described in § 3, OpenDNS, DMOZ and Curlie leverage a network of human volunteers to label domains. In OpenDNS, moderators approve or reject labels voted on by users, while in DMOZ and Curlie, editors add suggested sites to their managed categories. In this section, we harvest historical data from OpenDNS' voting process to further measure the effect that human decisions have on (1) OpenDNS' labeling process, in terms of user and editor temporal dynamics, and (2) the resulting classifications. For comparison and completeness, we also study Curlie's labeling dynamics by crawling and analyzing their publicly available data.

Labeling dynamics
OpenDNS. OpenDNS relies on a voting process that allows users to submit labels ('tags') for domains, which then receive positive and negative votes from other users. After sufficient votes, a trusted moderator approves or rejects these submitted labels [76]. OpenDNS publicly releases historical data from this voting process, including the labels proposed for every domain, the user who proposed them, whether they are accepted or not, and the moderator who took the final decision. All items are timestamped, which allows us to analyze the evolution of submitted labels over time. This data allows us to inspect the OpenDNS voting process for 794.8k domains, as well as the behavior, agreements and disagreements between 19k users and 292 moderators from February 2008 until January 2020.
First, we analyze who is submitting labels for observed domains. The first observation that stands out is that most users are "casual," as 95% of users only submit a label for 10 domains or fewer. Nevertheless, there is a group of 160 highly engaged users who submitted labels for more than 100 domains. As for moderation, the workload distribution is more even: around 40% of moderators have approved 10 labels or fewer. Nevertheless, a small subset of the 292 moderators is very prolific, being responsible for the approval of over 10k labels.
Figure 5 shows the number of approved and not approved labels submitted quarterly. We can observe that the majority of labels were submitted during 2008 and 2009. Interestingly, at the beginning, the majority of labels were accepted. However, starting in 2009 there is a sharp decline in the number of accepted labels and an increase in those that are not accepted. Our intuition is that because at the beginning of the project all major sites lacked a label, the probability of people correctly labeling those is higher. As time passes, only a long tail of unpopular domains remains unlabeled, so users are more likely to submit an incorrect label or no label at all.
Curlie. As in the case of DMOZ, Curlie has no open voting process. Instead, trusted editors fully manage categories and decide which user suggestions they include. Review may come from other editors for the same category and its parent categories, or those with the right to edit all categories [77,78]. Because of its content discovery purpose, Curlie has a large and deep hierarchical taxonomy, consisting of 671,715 observed categories. By analyzing the assignments of categories to editors, we examine whether these editing and reviewing processes can be effective considering this deep taxonomy.
Only 985 categories (0.1%) are explicitly managed by at least one out of 294 active editors. When we account for the editing rights to subcategories, 515,791 (76.8%) categories have at least one "implicit" editor. However, 565,812 categories have at most one implicit editor, which means that 84.2% of categories can only be peer reviewed by the editors with rights to all categories. The opportunity for peer review may be further affected by the breadth of certain editors' scope, with the top "implicit" editor managing over 300k categories. In summary, the large number of categories managed by only a few editors may prevent these editors from conducting a regular review for accuracy and recency. Figure 6 shows that around half of all categories have been updated since the evolution of DMOZ into Curlie in 2017. Moreover, it shows more recent activity higher in the tree: lower levels may either inherently require fewer updates, or may be less actively maintained by their editors. While there is steady ongoing activity on Curlie, many categories have not been updated for years, potentially leading to their entries being outdated or inaccurate.
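The notion of "implicit" editors can be made concrete with a short sketch. It assumes that categories are identified by slash-separated paths and that an editor assigned to a category also holds editing rights on every category below it; the path format, function name, and example data are illustrative assumptions rather than Curlie's actual data model.

```python
from typing import Dict, Iterable, Set


def implicit_editors(categories: Iterable[str],
                     assignments: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each category to all editors holding rights on it or on an ancestor."""
    result: Dict[str, Set[str]] = {}
    for category in categories:
        parts = category.split("/")
        editors: Set[str] = set()
        # Walk from the top-level category down to the category itself,
        # collecting editors explicitly assigned at every level.
        for depth in range(1, len(parts) + 1):
            ancestor = "/".join(parts[:depth])
            editors |= assignments.get(ancestor, set())
        result[category] = editors
    return result


def share_with_at_most_one_editor(editors_by_category: Dict[str, Set[str]]) -> float:
    """Fraction of categories where peer review by a second implicit editor is impossible."""
    total = len(editors_by_category)
    return sum(len(e) <= 1 for e in editors_by_category.values()) / total
```

Under these assumptions, the second function corresponds to the kind of statistic reported above for categories with at most one implicit editor.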

Labeling (dis-)agreements
One key issue with human-in-the-loop labeling is that the task of classifying domains is not completely objective, and thus different users might suggest different labels for the same website. Therefore, we measure how often this happens in the labeling process of OpenDNS. While the median number of accepted and rejected labels in OpenDNS is one, we have shown in § 4.3 that some domains have as many as 17 accepted labels. In the case of labels that do not get approved, we can find domains with a high level of disagreement among voters, with as many as 58 labels that were not accepted.
We further investigate the type of labels that create most agreement and disagreement in OpenDNS. To do so, for all domains with a given approved label, we measure how often other proposed labels are approved and rejected for the same domain. Selected clusters of labels where the disagreement is high are shown in Figure 7. Some of the labels that often appear together seem to be a product of honest mistakes by the users, as they are closely related (such as Adult themes and Sexuality, or Travel and Business Services). An interesting case is the label Pornography, which often appears proposed (and rejected) in addition to other labels. While this might make sense for some categories (such as Lingerie or Sexuality), it is surprising that over 30% of Social Media sites and over 40% of dating sites were also labeled (and rejected) as Pornography. Another apparent issue is that domains related to URL Shorteners, Video Sharing or File Storage can often be related to other categories, such as Music, Movies or Pornography. This shows that deciding the correct label for a given domain can be hard, with the differences between categories being vague. Furthermore, not all users might behave honestly, as some could mislabel domains to pollute the system or gain advantages over competitors, e.g., a pornographic site trying to be labeled as a video sharing site, or a company labeling a competitor's website as malicious or pornographic.
In the case of labels that often appear accepted together, we also find a high correlation among categories that could be related to sexual or nudity content (e.g., Pornography, Nudity, Bikini/Lingerie). Another interesting case is the pair Advertising and Business Services, which are accepted together over 30% of the time. This can be a result of many of these Business services acting as third parties offering advertising and tracking services too. Similarly, News and Media and Television often appear together since television stations often act as news outlets.
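The measurement described in this subsection can be sketched as follows for the rejected case. The input format, a list of (domain, label, approved) records, is an assumption about how the voting data could be flattened; the labels in the usage note are toy values, not results from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rejection_rates(votes: Iterable[Tuple[str, str, bool]]) -> Dict[Tuple[str, str], float]:
    """For each (approved label, other label) pair, return the fraction of
    domains carrying the approved label on which the other label was rejected."""
    approved = defaultdict(set)   # domain -> labels that were approved
    rejected = defaultdict(set)   # domain -> labels that were rejected
    for domain, label, is_approved in votes:
        (approved if is_approved else rejected)[domain].add(label)

    domains_with = defaultdict(int)  # approved label -> number of domains
    pair_counts = defaultdict(int)   # (approved, rejected) -> number of domains
    for domain, labels in approved.items():
        for accepted in labels:
            domains_with[accepted] += 1
            for other in rejected.get(domain, ()):
                if other != accepted:
                    pair_counts[(accepted, other)] += 1

    return {pair: count / domains_with[pair[0]]
            for pair, count in pair_counts.items()}
```

Replacing the `rejected` lookup with a second lookup in `approved` would give the analogous rate for labels that are accepted together.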

Is labeling domains a trivial process?
We perform an experiment using the authors of this study in order to gain a better idea of the aforementioned challenges behind OpenDNS' labeling process. One member of the research team manually selected 200 hostnames, including 50 for which OpenDNS and McAfee provide semantically equivalent labels; 50 for which they disagree; 50 from the top-1k domains in our normalized rank; and 50 unpopular sites. For ethical reasons, we discarded domains with labels that could be uncomfortable or harmful for our human labelers (e.g., child pornography, nudity, violence, drugs, weapons, and malware-related ones). The remaining authors manually visited each website and labeled it using the OpenDNS taxonomy and definitions. Each domain was labeled by two authors, adding a third labeler when there was disagreement in the first stage.
Disagreement between two labelers is relatively high at 35.5% of domains, reaching 90.5% agreement between at least two reviewers when a third labeler is introduced. When the final results are compared to OpenDNS categories, we observe that our process could only achieve 71% accuracy; in 80.5% of the cases, at least one labeler reported the same category as OpenDNS. This experiment, while not representative, illustrates some of the challenges that arise when humans are involved in the process, even for experts in network measurements and cybersecurity. Disagreement is the result of subjective factors caused by different perceptions and sensitivities, but also by the inherent ambiguity of many of the categories forming the taxonomy and the dual nature of many websites, for instance, blogs offering political content [79] or tourism boards advertising casinos [80].
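The quantities reported for this experiment can be computed along the following lines. The sketch assumes that each domain has a list of two or three author labels (the third present only after an initial disagreement) and that accuracy compares the majority label against the OpenDNS label; the exact scoring rule used in the experiment is not spelled out in the text, so this is one plausible reading.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by at least two labelers, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


def summarize_experiment(rounds: List[Sequence[str]],
                         reference: List[str]) -> Dict[str, float]:
    """rounds[i] holds the author labels for domain i; reference[i] is the service label."""
    n = len(rounds)
    finals = [majority_label(labels) for labels in rounds]
    return {
        # The first two labelers chose different categories.
        "initial_disagreement": sum(l[0] != l[1] for l in rounds) / n,
        # At least two labelers (out of two or three) agree.
        "majority_agreement": sum(f is not None for f in finals) / n,
        # The agreed label equals the reference label (assumed definition).
        "accuracy": sum(f == r for f, r in zip(finals, reference)) / n,
        # Any single labeler reported the reference label.
        "any_labeler_match": sum(r in l for l, r in zip(rounds, reference)) / n,
    }
```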
Takeaway: We analyze OpenDNS's ecosystem of voters and editors, and find that most labels were submitted during the early stages of the project. We show that most users (95%) submit labels for only a few domains but that, in general, workload is evenly distributed among moderators. In the case of Curlie, we find that peer review may suffer from the low number of editors, but that categories are still being updated regularly. Furthermore, we find that labeling strategies involving humans are bound to generate disagreements. In OpenDNS, there are domains with 58 not approved labels. Moreover, the slight differences among labels generate clusters of related labels that often appear rejected together (e.g., Adult themes, Lingerie/Bikini, Pornography and Sexuality). We show that labeling is a non-trivial job by running a small-scale manual classification experiment, in which we only achieve 71% accuracy compared to OpenDNS and find that two labelers disagree on 35.5% of domains, highlighting the subjective nature of labeling.

CASE STUDIES
In this paper, we have shown that researchers often rely on domain classification providers to either understand the type of domains that they observe in their study [31] (i.e., to better characterize their results) or to gather a field-specific corpus of domains [25] (§ 2). Beyond that, core applications of domain classification services lie outside academic circles. They are often used in technical solutions for content filtering and threat intelligence, for example in parental control apps [81] and school networks [14], which require accurate identification of specific types of domains.
Therefore, in this section we aim to understand whether choosing one domain classification service over another can yield different results when selecting target domains or when classifying domains specific to a given category. We analyze the usefulness and aptness of domain classification services for three types of domains that are often analyzed by the research community: (1) advertising and tracking services; (2) websites offering adult content (i.e., pornography and gambling sites); and (3) domains that belong to a Content Delivery Network (CDN) and hosting providers. Our approach starts with obtaining available sanitized domain category sets to identify which domains belong to each one of these categories. Then, we analyze the coverage as well as the labels assigned to these domains by different classification services to identify potential errors and inconsistencies. While such specialized lists are more appropriate for choosing a pool of websites that belong to a given category, we have seen that it is still common for academic papers to rely on classification services for website selection or classification [23,25,32].
Advertising and tracking services. As ground truth, we take a list of manually sanitized domains indexed in EasyList [82] and EasyPrivacy [83]. However, these lists allow blocking traffic at a full URL level. To reduce bias in our case study, we opt to account only for domains that are fully blocked by these lists, regardless of the full URL path. After a manual sanitization process, we study the labels from different classification services for the resulting 24,825 advertisement and tracking-related domains and manually extract the resulting labels semantically related to advertising and tracking applications (e.g., Web Marketing or Advertisement). Table 5 (two leftmost columns) shows that none of these services are able to correctly label most domains as tracking or advertising. Forcepoint presents the highest accuracy, which is barely higher than 15%, at the cost of sacrificing coverage (51.6%). While McAfee and FortiGuard have a higher coverage, they classify fewer than 10% of the domains as trackers. Most of the errors arise from tracking- or advertising-specific subdomains. For instance, all providers classify airpushmarketing.s3.amazonaws.com and tracking.eurosports.com using labels related to hosting/CDNs and news/media/sports, respectively.
Identifying adult content. We rely on two resources to gather domains related to adult content [14]. First, we rely on a manually labeled and sanitized list of pornographic websites from Vallina et al. [25]. Additionally, we compose a list of gambling sites extracted from three government websites [87–89]. By combining these two sources, we compile a manually vetted list of 3,519 domains related to web services typically considered as "adult content". The results (Table 5, middle columns) show that 5 services do a good job at identifying and correctly labeling webpages that host adult content: OpenDNS, McAfee, FortiGuard, Forcepoint and Dr.Web. Yet, there are substantial differences across services. Alexa, Trend Micro and Websense do not provide a label for the majority of the websites analyzed. Therefore, this case study also demonstrates that the choice of one provider over another can have severe implications for the number of domains classified as adult content. We also examine which other labels are usually assigned to adult content domains, finding a high correlation with those related to video sharing and streaming media. These labels are, in most cases, technically correct, but they do not allow these domains to be identified as pornographic. We also see that some services assign labels that imply maliciousness of adult domains (e.g., malicious, spam, or not recommended).
CDN and hosting provider related domains. Content delivery networks (CDNs) remain the dominant means for serving popular content and represent Internet infrastructure. While most domain classification services (e.g., McAfee and FortiGuard) contain labels referring to CDNs or hosting providers, the content classification is often mixed with an infrastructure classification. As an example, one service can classify a CDN-hosted site as content delivery network while another derives a label from the site's content (e.g., news or personal blog). In order to measure differences in the classification strategies of different services, we select those domains in our dataset that are related to CDNs and hosting services. To do so, we pattern match the CNAME record of all domains against more than 80 CDN signatures from WebPageTest [90]. In total, we obtain a corpus of 2,858 domains, for which we compare the coverage across domain classification services. Table 5 (rightmost columns) shows that only McAfee and FortiGuard provide a label for the majority of these domains. Both services classify these domains based on their function rather than on their content (e.g., Internet Services, information technology, and content services). For the other services, the coverage is so low that it is difficult to discover a trend in the labels. Yet, it is still possible to find examples of labels related to the actual content of webpages hosted on these services (e.g., News, Adult content, or Business) as well as to the type of service provided. None of these classification strategies are right or wrong, but the choice of service translates into differences in terms of coverage and labels for CDN and hosting provider related domains.
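The CNAME-based selection of CDN-related domains can be sketched as a suffix match. The signature table below is a tiny illustrative stand-in, not the WebPageTest list, and the function expects an already resolved CNAME target so that the example stays free of network access; both are assumptions made for this sketch.

```python
from typing import Dict, Optional, Tuple

# Illustrative subset of CDN CNAME suffixes (assumed for this example only).
CDN_SIGNATURES: Dict[str, Tuple[str, ...]] = {
    "Akamai": (".akamaiedge.net", ".edgekey.net"),
    "Amazon CloudFront": (".cloudfront.net",),
    "Fastly": (".fastly.net",),
}


def match_cdn(cname: str) -> Optional[str]:
    """Return the CDN whose signature matches the CNAME target, if any."""
    # Normalize: drop whitespace and the trailing root dot, lowercase, and
    # prepend a dot so that suffixes only match on label boundaries.
    target = "." + cname.strip().rstrip(".").lower()
    for cdn, suffixes in CDN_SIGNATURES.items():
        if target.endswith(suffixes):
            return cdn
    return None
```

Domains whose CNAME target yields a non-`None` result would form the CDN-related corpus; matching on label boundaries avoids counting a name such as `notfastly.net` as a Fastly host.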
Takeaway: For specialized use cases, the choice of one domain classification service over another can significantly impact the accuracy of academic studies and the effectiveness of solutions relying on them.
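A comparison of this kind against a curated ground-truth list can be sketched as below. The definitions are assumptions: coverage is the share of ground-truth domains for which the service returns any label, and accuracy is the share of all ground-truth domains that receive at least one label from a manually chosen set of category-related labels. The text does not state which denominator the reported accuracy uses.

```python
from typing import Dict, Set, Tuple


def coverage_and_accuracy(ground_truth: Set[str],
                          service_labels: Dict[str, Set[str]],
                          target_labels: Set[str]) -> Tuple[float, float]:
    """Return (coverage, accuracy) of one service on a ground-truth domain list."""
    # Domains for which the service returned at least one label.
    labeled = [d for d in ground_truth if service_labels.get(d)]
    # Labeled domains carrying at least one category-related label.
    correct = [d for d in labeled if service_labels[d] & target_labels]
    total = len(ground_truth)
    return len(labeled) / total, len(correct) / total
```

Running this once per service, with `target_labels` set to, for instance, the advertising-related labels of that service's taxonomy, gives one row of a table such as Table 5.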

DISCUSSION
In this section, we extract actionable insights from our empirical results, discuss best practices for using domain classification services, and propose various solutions as future work to overcome their limitations.
Dealing with insufficient accuracy. The key observations of our study are that i) coverage varies substantially between services (§ 4.2) and ii) the classification accuracy is marred by inconsistent taxonomies (§ 4.3) and low agreement among providers (§ 4.4). These inherent limitations set a high barrier for their effectiveness in real-world applications as well as their usage in research. For highly targeted use cases, general-purpose classification services may fall short. For example, as shown in our case studies (§ 6), the choice of service impacts the number of correctly identified adult domains. It may therefore be necessary to either search or develop curated and manually labeled domain-specific lists. Furthermore, end users and researchers should carefully consider the implications of errors. In applications like content filtering, errors can lead to inappropriately restricting access to legitimate resources ('overblocking') or, conversely, allowing access to undesirable resources ('underblocking') [91,92]. For example, aggressive adult content filters could block sexual health information [93] or, as in the recent case of Cloudflare's DNS resolver, LGBTQIA+ sites [94].
In the academic domain, researchers can also take into account how important classification is to their studies, e.g., using domain categories to provide context for a minor result vs. generating the list of domains on which they base their whole study. There are a few documented cases in which authors preferred their own classification over those of commercial services due to concerns regarding their accuracy and coverage [9,28,29,45,46].
Dealing with biases. Coverage and accuracy suffer from selection and interpretive biases respectively. Service purpose determines which domains are classified and how: a filtering service may better cover and differentiate malicious domains, while marketing- or discovery-oriented services may provide a more fine-grained label for popular sites. How labels are sourced also introduces biases. For automated solutions, these stem from deficiencies in the training sets for machine learning algorithms. In a manual classification process, these are induced by maintainability challenges as well as human interpretation (§ 5). There are cases where using a domain classification service can produce sound results. Yet, researchers should gain a proper understanding of potential biases in their chosen services to assess the limitations of applying them in specific domains, e.g., by consulting the documentation. To empirically gauge the coverage and accuracy of the used service specifically for their studied domains, researchers can additionally manually inspect random subsets to determine whether the labeling is of sufficient quality to make its usage appropriate.
Dealing with inconsistencies. When using domain classification services, results must be interpreted and reported with care, to avoid introducing errors due to inconsistencies. Domain classification services exhibit varying characteristics, e.g., whether they provide multiple labels, label subdomains differently, or regularly update labels (§ 3). Moreover, they may behave unexpectedly, such as by deviating from their documented taxonomies (§ 4.3). Users should therefore verify the output of the services, e.g., by analyzing aggregate statistics or a randomly selected sample. Furthermore, the specific applications of services affect their taxonomies. The granularity and exact meaning of a label (even if it is syntactically the same) thus largely differs between services and directly impacts the effectiveness of any application or the results of any study. Studies based on domain classification should thus examine the labeling taxonomy in detail and report the meaning of the selected labels to prevent wrong or incomplete conclusions.
Aggregation of multiple domain classifiers. Many websites are complex entities: it is hard to reduce them into a single label. Researchers might be tempted to overcome the limitations of individual domain classifiers (both in terms of coverage and label accuracy) by combining the output of multiple services in a single analysis pipeline. While this might be useful in some scenarios (e.g., threat intelligence aggregators such as VirusTotal), we identify multiple challenges that rule out simplistic aggregation strategies:
(1) If the goal is to improve overall coverage, aggregating various classifiers might not necessarily achieve this purpose, as we showed in § 4.2. The choice of classifiers should be informed by the size of the intersecting set. In addition, we found coverage to vary greatly depending on factors such as domain popularity or freshness.
(2) Different classifiers might provide complementary perspectives on a domain's nature, but the aggregation of their labels can be difficult since they come from different taxonomies with radically different purposes. Simply taking the union of the outputs might unnecessarily increase the constellation of labels and increase redundancy, since two services might use semantically-equivalent labels to reflect the same purpose or abstract concept. This could be aggravated by services developing multilingual taxonomies. Reconciling multiple taxonomies coherently might be cumbersome and difficult to scale, particularly if it must be done semantically.
(3) Determining what is a discrepancy among classifiers and what is just a different perspective on the nature of a website could also be challenging. A site can simultaneously be labeled as porn, streaming, and CDN by three different providers. Understanding the focus, sensitivities, limitations, classification methods, and intended label usage of each classification service is an unavoidable step to properly contextualize and meaningfully aggregate their outputs.
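For the first challenge, whether adding a service actually improves coverage can be checked before building an aggregation pipeline. The sketch below assumes that the set of domains labeled by each service is known; the function name and the service names in the usage note are illustrative.

```python
from typing import Dict, Set


def coverage_gain(universe: Set[str],
                  covered: Dict[str, Set[str]],
                  base: str) -> Dict[str, float]:
    """Share of the studied domains that each other service labels
    but the base service does not."""
    base_set = covered[base] & universe
    return {
        name: len((domains & universe) - base_set) / len(universe)
        for name, domains in covered.items()
        if name != base
    }
```

With `covered = {"S1": {"a", "b"}, "S2": {"a", "b"}, "S3": {"b", "c"}}` over the universe `{"a", "b", "c", "d"}`, adding S2 to S1 gains nothing while adding S3 gains 25%. A gain near zero indicates that the two services largely cover the same domains, so combining them adds labels but not coverage.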

Limitations and Future work
While we showed how domain classification services' characteristics can vary significantly and often tend to be unfavorable, we are unable to quantify the quality of individual services due to a lack of comprehensive ground truth. We therefore avoid putting forward specific guidance on which services end users as well as researchers should prefer. We provide directions for future work that would bring us closer to such an evaluation. While we have been able to compare labels between services by analyzing their diversity, understanding the semantic agreement of these labels would require developing a new taxonomy to which all labels across all services need to be translated, similar to how AVClass [95] automatically annotates malware samples with one semantically-equivalent label generated from multiple antivirus labels. This translation could occur manually, which may be more accurate, but comes at a higher maintenance cost when taxonomies change or an additional service is to be integrated. Alternatively, this taxonomy development could be (partially) automated through methods such as label normalization, heuristics [96], determining strongly coupled label pairs between services, or a semantic interpretation of existing labels through natural language processing. Anecdotally, we explored the latter method, but it generated a high false positive rate (e.g., web spam and web hosting could be reported as equivalent).
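As a small illustration of what label normalization could look like, the following sketch only addresses syntactic and aggregation differences (case, punctuation, "&" vs. "and", and slash-separated compound labels). It is an assumed starting point rather than a method evaluated in this work: it does not handle language or semantic differences, and it strips non-ASCII characters.

```python
import re
from typing import FrozenSet


def normalize_label(label: str) -> FrozenSet[str]:
    """Split an aggregated label into a set of normalized component labels."""
    components = set()
    # Aggregated labels such as "entertainment/sports" become two components.
    for part in re.split(r"[/,]", label.lower()):
        part = part.replace("&", " and ")
        part = re.sub(r"[^a-z0-9 ]+", " ", part)   # drop remaining punctuation
        part = re.sub(r"\s+", " ", part).strip()   # collapse whitespace
        if part:
            components.add(part)
    return frozenset(components)
```

Two labels can then be considered syntactically compatible when one normalized set is contained in the other; any such match would still need manual validation, given the false positives noted above for automated approaches.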
Beyond case studies, we do not broadly evaluate label correctness: even if all services agree on a label, it might still be wrong. An independently developed classifier can serve as a more trustworthy source of labels, against which the labels from other services could be compared. In order to cope with the large scale of the Internet, such a classifier would need to rely on automated methods, based on those developed in the state of the art, such as topic modeling [97]. Potential sources of ground truth are human-developed directories such as DMOZ and Curlie (as used in previous work [4, 98–100]) or the categorization of pages on Wikipedia (idem [101]). While an automated model may not be able to achieve perfect accuracy, its methods and performance can be disclosed transparently, improving the soundness of research that depends on it and enabling unbiased evaluation.
These steps could result in a classification service that researchers can rely upon to retrieve category labels obtained through a well-documented process and embedded in a vetted taxonomy. Such a service could either translate the set of labels from existing third-party classification services into labels from a custom taxonomy, or output the labels from a custom independent classifier. We consider both challenges to be interesting avenues for future work.

RELATED WORK
Web classification. One direction of research concerns methodologies and tools to classify websites automatically [5–9, 102]. In this regard, Qi et al. [5] studied features and algorithmic approaches used for automatic website classification. Beyond textual features, another approach [6] proposes using the web site's visual content for classification. Closer to our study are recent efforts to understand VirusTotal. Here, Peng et al. [103] studied how it classifies "phishing" domains, as well as quantified the quality of the results, finding discrepancies between results from VirusTotal and those provided by direct sources [7]. Despite these efforts, the question of how domain classification services that are in widespread use work and differ, and how their different approaches impact study results, remains open; this is the question that we study in this paper.
Internet measurement research methodology. Recent work has critically analyzed data sets and tools that are regularly used in Internet measurement research, in terms of the soundness and representativeness of results stemming from their usage as well as their enabling of reproducible studies [104]. Moreover, these studies formulate recommendations for how researchers should use them or propose improved solutions. For the selection of a representative sample of the Web from popular domains, the regularly used Alexa top sites list was shown to be unstable and easily manipulable [22,63,64]. Le Pochat et al. proposed the Tranco list as an alternative [64]. For retrieving Web site contents through Web crawls, Ahmad et al. developed a framework to compare crawlers based on varying technologies, finding that the choice of crawler may significantly impact measurements [105]. Zeber et al. compared crawlers with each other and with human user traffic, and found results to vary over time as well as across platforms [106]. We provide a similar assessment of domain classification services, as they can equally impact the results of research studies.

CONCLUSIONS
In this paper, we empirically and comprehensively analyze 13 domain classification services in order to study their labeling strategy and performance. We find that their limitations and shortcomings heavily affect their suitability and applicability, both for practical solutions and for academic studies, as demonstrated through our case studies. Coverage varies greatly between services and is insufficient for many types of domains. The lack of a common taxonomy and labeling behavior prevents a fair comparison and combination of services. Meanwhile, services using human labeling suffer from potential disagreements. We conclude with recommendations on how these services should improve, as well as a discussion on how to limit their deficiencies when using them.

B PROVIDER ANALYSIS
We examine the claims made by classification services (if available) in terms of their purpose, methods used for classification, coverage of URLs and languages, and development of their taxonomy. We retrieve these details through a manual inspection of their own documentation.
OpenDNS. OpenDNS provides DNS-based content filtering, sourcing website categorization from its human volunteer-based “Domain Tagging” project [13]. Participants submit domains and their categories, on which other participants may vote; once the mapping of a domain to a category receives sufficient votes, it is available for approval by a community moderator before it is propagated to the content filtering system [76]. These moderators also review reports of incorrect categorization as well as categories of popular sites [113]. We expand on the effects of this voting procedure in § 5. OpenDNS has at least one confirmed category for almost 4 million domains, out of 12.7 million submitted domains [13]. A list of categories and short descriptions is available [48]. Users had the ability to suggest the addition of categories to the taxonomy [113]; it is unclear who approved these new categories.
McAfee. McAfee provides the “TrustedSource” online service (previously called “SmartFilter”) for obtaining both the category and a reputation score-based risk assessment for a URL [12], mainly with the goal of client-side content filtering. A user of the service must choose one of eight 'products', which affects the '
FortiGuard. FortiGuard provides an online tool for retrieving content-based URL categorization [50], which supports the content filtering functionality in its FortiOS-based FortiGate firewall [65]. Websites are classified through a “combination of proprietary methods including text analysis, exploitation of the web structure, and human raters” [65]. FortiGuard's service is said to include over 45 million website ratings that cover over two billion URLs [65]. Categories are divided into seven high-level groups (adult, bandwidth-consuming, business, personal, potentially liable, security, and unrated), and short descriptions and test pages are available [51].
VirusTotal. VirusTotal is an online service providing analysis of potentially malicious files and URLs by aggregating the results from a large set of detection engines [52,114]. It also lists the domain's category, but it is unique among the services in that it does not establish its own categorization. Instead, it collects labels from existing services: at the time of our data collection, these were Alexa, Bitdefender, Dr.Web, Forcepoint, Trend Micro, and Websense; since July 2020, they have been (at least) Bitdefender, Comodo Valkyrie Verdict, Dr.Web, Forcepoint ThreatSeeker, Sophos, and Yandex Safebrowsing. For each service, VirusTotal displays at most one distinct label, without combining labels any further, i.e., a domain can have as many categories as there are services. Categories are only provided for domains, even though a user can also request scanning for URLs.
Alexa. Alexa offers the ability to view the 500 most popular websites for a specific category [53], with a focus on marketing and content discovery. Its results are based on the human volunteer-based categorization from DMOZ [54], but in contrast to DMOZ, Alexa's lists only contain domains, not URLs. Alexa's taxonomy is also based on DMOZ's taxonomy, but pruned to around 280,000 categories. Alexa does not allow searching for the category of a specific domain. The ranking within a category is calculated using the same methodology as the main Alexa top list, but, where applicable, using only the data for the specific subdomain [115]. As the main Alexa top list only lists base domains, this may result in a different relative ranking for two domains [115].
Bitdefender. Bitdefender provides content category-based website filtering in its consumer- and business-oriented products [10]. There is no free online categorization tool, but VirusTotal integrates Bitdefender's categorization into its domain analysis. Its database is said to cover millions of URLs in multiple languages [66]. A list of (ungrouped) categories, short descriptions, and examples is available [55].
Forcepoint/Websense. Forcepoint (renamed from Websense in 2016 [47]) provides an online tool for website threat and content analysis [56]. The tool shows both a static (i.e., previously determined) and a real-time classification. The former results from a combination of automated and manual inspection [57], while the latter is based purely on an automated machine learning-based approach [56]. Forcepoint will classify the specific page of a given URL, not its base domain [116]. Categories are divided into six high-level groups (reputation, security, bandwidth-consuming, productivity-inhibiting, social networks, and baseline) for which short descriptions are available [57].
Dr.Web. Dr.Web includes a category-based website filter in its client-side anti-virus software, but its online tool only provides a binary classification of a URL's maliciousness [117]. A more detailed categorization is accessible through VirusTotal, but appears to only cover types of malicious behavior. No documentation is available on the categorization process or the possible categories.
Trend Micro. Trend Micro's security-oriented classification service is available online through its “Site Safety Center” [59]. Next to a content-based category, they establish a threat rating denoting whether a website is 'safe', 'dangerous', 'suspicious' or 'untested' [59]. Their database is said to include over 35 million URLs, and they acknowledge that “a few URL rating errors” may occur [118]. Trend Micro publishes two lists of available categories with short descriptions. One was last updated in late 2019 and appears to be used for their “Worry-Free Business Security” and “OfficeScan” web threat protection products [60]; its categories are grouped into seven 'filtering groups'. The other was published at the latest in November 2011 [119] and has not been updated since [74]; its categories are ungrouped.
Symantec. Symantec (now part of Broadcom) provides an online tool to retrieve the URL categorization from its WebPulse system [11], which powers its web gateway content filtering. The categorization system is said to use manual and automated (machine learning) analysis, with several modules voting towards the final categorization [67,68]. The tool indicates how recently the URL was categorized; previously unknown URLs are purported to be classified in real time [67]. Its URL database is said to cover “millions of entries”, and supports over 60 languages [68]. A URL can be classified into up to four categories [67]. A listing of categories, descriptions, examples and test sites is available [61]. The taxonomy was last updated in August 2019 [120].
Webshrinker. Webshrinker provides an online demo tool of their URL categorization service [19]. Their service targets two audiences: a purely content-based categorization aimed at advertisers, and a security-oriented service which combines custom heuristics, machine learning, and internal and external data feeds to assess web threats [19,69]. Classification is said to occur in real time [62], their database covers over 97.2 million 'entries' [69], and they support over 12 languages [62]. The two target audiences are also reflected in the two available taxonomies [62]. One is a custom list of 42 'standard' categories designed for content filtering, while the other uses the taxonomy of over 390 categories developed for marketing purposes by the Interactive Advertising Bureau (IAB) [21]. For the latter, Webshrinker computes a confidence score [121].
DMOZ. DMOZ (also known as the Open Directory Project) operated a directory of web pages, where users could navigate the category structure to find URLs in that category [3]. Its owner AOL took down DMOZ in 2017 after 19 years of operation [122]. DMOZ's rich taxonomy consisted of sixteen top-level categories, each being the root of a large hierarchy of gradually more fine-grained subcategories, amounting to over a million categories encompassing 3.86 million URLs [3]. All users could suggest the addition of a URL to a category, but this had to be approved by one of the 91,929 category-specific editors [123]. Editors were also responsible for developing subcategories of the categories they maintained, which they were advised to do once a category reached 20 links [77]. DMOZ had strong multilingual support, with separate directories for 90 languages [3]. DMOZ allowed users to search whether and where URLs appeared in the directory.

E LABELS ACROSS SERVICES
When comparing pairs of services, it is instructive to also look at the cumulative distribution functions of one service over a corresponding one. These are presented in Figure 10. The horizontal axes contain all labels of a particular provider split into buckets, while the vertical axes represent the fraction of labels from the corresponding provider that is covered by all the buckets up to the considered point. As expected, the curves for McAfee and OpenDNS (read row-wise) show a fast increase, as a small number of buckets contains the majority of labels, while Forcepoint and VirusTotal have a much more gradual increase. In some cases, a plateau appears at a point in the curve, as in the case of the Bitdefender-Forcepoint pair, or at the very beginning, as in the case of Bitdefender-McAfee. This is an artifact of the bucketing procedure, which shows that the corresponding buckets cover a very small number of labels from the paired provider. This does, however, offer interesting information regarding labels that correspond on a one-to-one or one-to-few basis, even in the case of services that have a relatively small number of labels overall. Since a single McAfee label, for example, corresponds to a considerable number of VirusTotal labels, the conditional probability between pairs of labels from the two services is small, which explains the low values of conditional entropy as well as the low mutual information. This is valid in all such one-to-many correspondences between providers.
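To make the information-theoretic quantities mentioned above concrete, the following minimal sketch estimates the conditional entropy H(Y|X) and the mutual information I(X;Y) from label pairs that two providers assign to the same domains. The function name `entropy_stats` and the input format are illustrative assumptions; the sketch uses plain empirical frequencies and is not necessarily the exact estimator or direction of conditioning behind the reported values.

```python
import math
from collections import Counter


def entropy_stats(pairs):
    """Estimate (H(Y|X), I(X;Y)) in bits from co-occurring labels.

    pairs: list of (label_x, label_y) tuples, one per domain labeled by
    both providers (hypothetical input format).
    """
    n = len(pairs)
    joint = Counter(pairs)                # counts of (x, y) label pairs
    px = Counter(x for x, _ in pairs)     # marginal counts for provider X
    py = Counter(y for _, y in pairs)     # marginal counts for provider Y
    h_y_given_x = 0.0
    mutual_info = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(y|x) = c / px[x]
        h_y_given_x -= p_xy * math.log2(c / px[x])
        # p(x, y) / (p(x) p(y)) = c * n / (px[x] * py[y])
        mutual_info += p_xy * math.log2(c * n / (px[x] * py[y]))
    return h_y_given_x, mutual_info
```

Note that conditional entropy is asymmetric: when one label of X corresponds to many labels of Y, H(Y|X) is large while H(X|Y) can be zero, so the direction of conditioning matters when reading such values.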