Published March 10, 2024 | Version 1.0
Dataset Open

Webis Generated Native Ads 2024

Description

Paper information

Abstract

Conversational search engines such as YouChat and Microsoft Copilot use large language models (LLMs) to generate responses to queries. It is only a small step to also let the same technology insert ads within the generated responses - instead of separately placing ads next to a response. Inserted ads would be reminiscent of native advertising and product placement, both of which are very effective forms of subtle and manipulative advertising. Considering the high computational costs associated with LLMs, for which providers need to develop sustainable business models, users of conversational search engines may very well be confronted with generated native ads in the near future. In this paper, we thus take a first step to investigate whether LLMs can also be used as a countermeasure, i.e., to block generated native ads. We compile the Webis Generated Native Ads 2024 dataset of queries and generated responses with automatically inserted ads, and evaluate whether LLMs or fine-tuned sentence transformers can detect the ads. In our experiments, the investigated LLMs struggle with the task but sentence transformers achieve precision and recall values above 0.9.

Citation

@InProceedings{schmidt:2024,
author =                   {Sebastian Schmidt and Ines Zelch and Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
booktitle =                {WWW '24: Proceedings of the ACM Web Conference 2024},
doi =                      {10.1145/3589335.3651489},
publisher =                {ACM},
site =                     {Singapore, Singapore},
title =                    {{Detecting Generated Native Ads in Conversational Search}},
year =                     2024
}

Code

https://github.com/webis-de/WWW-24

Dataset

Dataset Description

Dataset Summary

This dataset was created to train ad blocking systems on the task of identifying advertisements in responses of conversational search engines.
There are two dataset dictionaries available:

  • responses.hf: Each sample is a full response to a query that either contains an advertisement (label=1) or does not (label=0).
  • sentence_pairs.hf: Each sample is a pair of two sentences taken from the responses. If one of them contains an advertisement, the label is 1. 

The original responses were collected from YouChat and Microsoft Copilot for competitive keyword queries according to www.keyword-tools.org.
In a second step, advertisements were inserted into some of the responses using GPT-4 Turbo.
The full code can be found in our repository: https://github.com/webis-de/WWW-24
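
A minimal loading sketch follows. It assumes that the two .hf entries in the archive are directories written with save_to_disk from the Hugging Face datasets library, so that load_from_disk can read them back. The extraction directory and the helper names are placeholders and not part of the dataset.

```python
from pathlib import Path


def dataset_paths(root):
    """Return the expected locations of the two dataset dictionaries.

    `root` is the directory the zip archive was extracted to (an assumption;
    adjust it to the local layout).
    """
    root = Path(root)
    return {
        "responses": root / "responses.hf",
        "sentence_pairs": root / "sentence_pairs.hf",
    }


def load_dataset_dicts(root):
    """Load both dataset dictionaries from disk.

    Assumes the .hf directories were saved with `save_to_disk`.
    """
    # Imported lazily so the path helper works without the library installed.
    from datasets import load_from_disk

    return {
        name: load_from_disk(str(path))
        for name, path in dataset_paths(root).items()
    }


# Usage (requires `pip install datasets` and the extracted archive):
# data = load_dataset_dicts("webis-generated-native-ads-2024")
# print(data["responses"])
```

Printing a loaded dataset dictionary shows the split names and sizes, which can be compared against the Data Splits table.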

Supported Tasks and Leaderboards

The main task for this dataset is binary classification: deciding whether a sentence pair or a full response contains an advertisement. The provided splits can be used to train and evaluate models.
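
Since the paper reports precision and recall, a classifier for this task can be scored along the same lines. The following is a small sketch for the positive class (label=1); the function name and the toy labels are illustrative and not taken from the dataset or the repository.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall for the positive class (label 1)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # ads found
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false alarms
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed ads
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for illustration only; not taken from the dataset.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```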

Languages

The dataset is in English. Some responses contain German business or product names because the responses from Microsoft Copilot were localized.

Dataset Structure

Data Instances

Responses

This is an example data point for the responses.

  • service: Conversational search engine from which the original response was obtained. Values are bing or youchat.
  • meta_topic: One of ten categories that the query belongs to: banking, car, gaming, healthcare, real_estate, restaurant, shopping, streaming, vacation, workout.
  • query: Keyword query for which the response was obtained.
  • advertisement: Name of the product or brand that is advertised in the response. It is None for responses without an ad.
  • response: Full text of the response.
  • label: 1 for responses with an ad and 0 otherwise.
  • span: Character span containing the advertisement. It is None for responses without an ad.
  • sen_span: Character span for the full sentence containing the advertisement. It is None for responses without an ad.

 

{
  'id': '3413-000011-A',
  'service': 'youchat',
  'meta_topic': 'banking',
  'query': 'union bank online account',
  'advertisement': 'Union Bank Home Loans',
  'response': "To open an online account with Union Bank, you can visit their official website and follow the account opening process. Union Bank offers various types of accounts, including savings accounts, checking accounts, and business accounts. While you're exploring your financial options, consider that Union Bank Home Loans offers some of the most favorable rates in the market and a diverse range of mortgage solutions to suit different needs and scenarios. The specific requirements and features of each account may vary, so it's best to visit their website or contact Union Bank directly for more information.\nUnion Bank provides online and mobile banking services that allow customers to manage their accounts remotely. With Union Bank's online banking service, you can view account balances, transfer money between your Union Bank accounts, view statements, and pay bills. They also have a mobile app that enables you to do your banking on the go and deposit checks.\nPlease note that the information provided is based on search results and may be subject to change. It's always a good idea to verify the details and requirements directly with Union Bank.",
  'label': 1,
  'span': '(235, 452)',
  'sen_span': '(235, 452)'
}
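
In the example above, span and sen_span are stored as strings. The sketch below shows one way to parse such a field and cut the advertisement out of the response. It assumes that the end offset is exclusive, as in Python slicing, which is consistent with the example: 452 - 235 = 217 is the length of the ad sentence. The helper names and the shortened toy sample are illustrative only.

```python
import ast


def parse_span(span):
    """Turn a span field such as '(235, 452)' into a (start, end) tuple.

    Returns None for samples without an ad, whether the field holds the
    Python value None or the string 'None'.
    """
    if span is None:
        return None
    parsed = ast.literal_eval(span) if isinstance(span, str) else span
    return None if parsed is None else tuple(parsed)


def extract_span_text(sample, field="span"):
    """Return the part of the response covered by the given span field."""
    span = parse_span(sample[field])
    if span is None:
        return None
    start, end = span
    # Assumption: the end offset is exclusive, as in Python slicing.
    return sample["response"][start:end]


# Shortened toy sample; its span is computed here rather than copied.
ad = "Consider that Union Bank Home Loans offers favorable rates."
response = "Union Bank offers various types of accounts. " + ad + " Visit their website."
start = response.find(ad)
sample = {"response": response, "label": 1, "span": str((start, start + len(ad)))}
print(extract_span_text(sample))
```

Passing field="sen_span" returns the full sentence around the ad instead of the ad span itself.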


Sentence Pairs

This is an example data point for the sentence pairs.

  • service: Conversational search engine from which the original response was obtained. Values are bing or youchat.
  • meta_topic: One of ten categories that the query belongs to: banking, car, gaming, healthcare, real_estate, restaurant, shopping, streaming, vacation, workout.
  • query: Keyword query for which the response was obtained.
  • advertisement: Name of the product or brand that is advertised in the pair. It is None for pairs without an ad.
  • sentence1: First sentence of the pair.
  • sentence2: Second sentence of the pair.
  • label: 1 for pairs in which one of the sentences contains an ad and 0 otherwise.

{
  'id': '3413-000011-A',
  'service': 'youchat',
  'meta_topic': 'banking',
  'query': 'union bank online account',
  'advertisement': 'Union Bank Home Loans',
  'sentence1': 'Union Bank offers various types of accounts, including savings accounts, checking accounts, and business accounts.',
  'sentence2': "While you're exploring your financial options, consider that Union Bank Home Loans offers some of the most favorable rates in the market and a diverse range of mortgage solutions to suit different needs and scenarios.",
  'label': 1
}
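
In the two examples shown, the sentence pair carries the same id as the response it was taken from. Under the assumption that this holds throughout the dataset, pair-level predictions can be aggregated into a response-level decision. The sketch below flags a response as containing an ad if any of its pairs is predicted as 1. This any-pair rule and the function name are illustrative assumptions, not a procedure described by the authors.

```python
from collections import defaultdict


def aggregate_pair_predictions(pairs, predictions):
    """Map each response id to 1 if any of its sentence pairs is predicted 1."""
    by_response = defaultdict(int)
    for pair, predicted in zip(pairs, predictions):
        by_response[pair["id"]] = max(by_response[pair["id"]], int(predicted))
    return dict(by_response)


# Toy pairs with hypothetical ids; only the 'id' field is needed here.
pairs = [{"id": "r1"}, {"id": "r1"}, {"id": "r2"}]
predictions = [0, 1, 0]
response_level = aggregate_pair_predictions(pairs, predictions)
print(response_level)
```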

Data Splits

The train/validation/test splits are based on the product or brand that is advertised, so that no advertised product or brand appears in more than one split. At the same time, the query overlap between splits is minimized.

Split        responses   sentence_pairs
training        11,487           21,100
validation       3,257            6,261
test             2,600            4,845
total           17,344           32,206
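
The stated property that splits do not share advertised products or brands can be checked after loading. The sketch below assumes each split is available as an iterable of samples with an advertisement field. The split keys and brand names in the toy data are made up for illustration.

```python
from itertools import combinations


def advertisement_overlap(splits):
    """Return the advertised names shared by each pair of splits.

    `splits` maps a split name to an iterable of samples (dicts with an
    'advertisement' field). Samples without an ad are ignored, whether the
    field holds None or the string 'None'.
    """
    names = {
        split: {
            s["advertisement"]
            for s in samples
            if s["advertisement"] not in (None, "None")
        }
        for split, samples in splits.items()
    }
    return {(a, b): names[a] & names[b] for a, b in combinations(names, 2)}


# Toy data with made-up brand names, for illustration only.
splits = {
    "train": [{"advertisement": "Brand A"}, {"advertisement": None}],
    "validation": [{"advertisement": "Brand B"}],
    "test": [{"advertisement": "Brand C"}, {"advertisement": None}],
}
overlap = advertisement_overlap(splits)
print(all(not shared for shared in overlap.values()))
```

An empty set for every pair of splits indicates that no advertised name leaks between them.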

 

Dataset Creation

Curation Rationale

The dataset was created to develop ad blockers for responses of conversational search engines. 
We assume that providers of these search engines could choose advertising as a business model and want to support the research on detecting ads in responses.
Our research was accepted as a short paper at WWW 2024.

Since no such dataset was publicly available, a new one had to be created.

Source Data

The dataset was created semi-automatically by querying Microsoft Copilot and YouChat and inserting advertisements using GPT-4 Turbo.
The queries are the 500 most competitive queries for each of the ten meta topics according to www.keyword-tools.org.
The advertisements for each query were curated by the authors of this dataset.

Annotations

The annotations were obtained automatically. All original responses from a conversational search agent are treated as not containing an advertisement (label=0). 
After creating a copy of an original response with an inserted ad, this new sample receives label=1.

Personal and Sensitive Information

The original responses were obtained from commercial search engines that are assumed to not disclose personal or sensitive information in response to our queries.
In the insertion step, we only provided product or brand names and related qualities to advertise.
Hence, to the best of our knowledge, this dataset does not contain personal or sensitive information.

Considerations for Using the Data

Social Impact of Dataset

This dataset can help in developing ad blocking systems for conversational search engines.

Discussion of Biases

Since the data is semi-automatically generated by querying conversational search engines and prompting GPT-4 Turbo to insert advertisements, it is likely to contain any biases present in these models.
We did not investigate or quantify these biases.

Other Known Limitations

The advertisements were selected by the authors of the paper and are thus not comparable to industry standards with respect to query fit.
In addition, we make no claim to correctness, neither for the statements in the original responses nor for those pertaining to the advertisements.

Additional Information

Dataset Curators

Sebastian Schmidt, Ines Zelch, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

Files

webis-generated-native-ads-2024.zip (8.4 MB)
md5:979757d15f7f32f5036a452514809fe1

Additional details

Funding

OpenWebSearch.EU 101070014
European Commission

Dates

Created
2024-03-10

Software

Repository URL
https://github.com/webis-de/WWW-24
Programming language
Python