Published June 1, 2024 | Version v1
Dataset Restricted

CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

  • 1. University of Oulu
  • 2. Indian Institute of Technology Roorkee
  • 3. Zhejiang University

Description

Overview
The Controllable Multimodal Feedback Synthesis (CMFeed) dataset is designed to enable the generation of sentiment-controlled feedback from multimodal inputs, including text and images. It can be used to train feedback synthesis models in both uncontrolled and sentiment-controlled manners. CMFeed supports the development of human-like feedback synthesis, a novel task defined by the dataset's authors. The corresponding feedback synthesis models and benchmark results are presented in the associated code and research publication.

Task Uniqueness: The task of controllable multimodal feedback synthesis is unique: it is distinct from LLM-based tasks such as VisDial and is not addressed by multimodal LLMs. LLMs often exhibit errors and hallucinations, owing to their auto-regressive and black-box nature, which can obscure the influence of different modalities on the generated responses [Ref1; Ref2]. Our approach includes an interpretability mechanism, detailed in the supplementary material of the corresponding research publication, demonstrating how metadata and multimodal features shape responses and how sentiments are learned. This controllability and interpretability aim to inspire new methodologies in related fields.

Data Collection and Annotation
Data was collected by crawling Facebook posts from major news outlets, adhering to ethical and legal standards. The comments were annotated using four sentiment analysis models: FLAIR, SentimentR, RoBERTa, and DistilBERT. Facebook was chosen for dataset construction for the following reasons:
• Facebook uniquely provides metadata such as the news article link, post shares, post reactions, comment likes, comment rank, comment reaction rank, and relevance scores, which are not available on other platforms.
• Facebook is the most used social media platform, with 3.07 billion monthly users, compared to 550 million for Twitter and 500 million for Reddit. [Ref]
• Facebook is popular across all age groups (18-29, 30-49, 50-64, 65+), with at least 58% usage in each, compared to 6% for Twitter and 3% for Reddit [Ref]. Trends are similar for gender, race, ethnicity, income, education, community, and political affiliation [Ref].
• The male-to-female user ratio on Facebook is 56.3% to 43.7%; on Twitter, it is 66.72% to 23.28%; Reddit does not report this data. [Ref]

Filtering Process: To ensure high-quality and reliable data, the dataset underwent two levels of filtering:
a) Model Agreement Filtering: Retained only comments where at least three out of the four models agreed on the sentiment.
b) Probability Range Safety Margin: Comments with a sentiment probability between 0.49 and 0.51, indicating low confidence in sentiment classification, were excluded.
After filtering, 4,512 samples were marked as XX. Though these samples have been released for the reader's understanding, they were not used in training the feedback synthesis model proposed in the corresponding research paper.
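The two-level filtering described above can be sketched in a few lines of Python. The helper below is a hypothetical illustration, not the released code: the function name, the 1/0 label encoding, and the assumption that the probability safety margin is checked per model are ours.

```python
def passes_filters(labels, probs, min_agree=3, margin=(0.49, 0.51)):
    """Apply the two-level filtering to one comment.

    labels: per-model sentiment labels (1 = positive, 0 = negative)
    probs:  per-model sentiment probabilities
    Returns True if the comment is retained.
    """
    # a) Model agreement: at least `min_agree` of the models must concur.
    positives = sum(labels)
    if max(positives, len(labels) - positives) < min_agree:
        return False
    # b) Probability safety margin: drop low-confidence scores near 0.5.
    if any(margin[0] <= p <= margin[1] for p in probs):
        return False
    return True

# Example with four models (e.g., FLAIR, SentimentR, RoBERTa, DistilBERT):
print(passes_filters([1, 1, 1, 0], [0.9, 0.8, 0.95, 0.3]))  # kept: 3 of 4 agree
print(passes_filters([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.3]))   # dropped: 2-2 split
print(passes_filters([1, 1, 1, 1], [0.9, 0.8, 0.50, 0.9]))  # dropped: margin
```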

Dataset Description
• Total Samples: 61,734
• Total Samples Annotated: 57,222 after filtering.
• Total Posts: 3,646
• Average Likes per Post: 65.1
• Average Likes per Comment: 10.5
• Average Length of News Text: 655 words
• Average Number of Images per Post: 3.7

Components of the Dataset
The dataset comprises two main components:
CMFeed.csv File: Contains metadata, comment, and reaction details related to each post.
Images Folder: Contains folders with images corresponding to each post.

Data Format and Fields of the CSV File
The dataset is structured in the CMFeed.csv file, with corresponding images in related folders. The CSV file includes the following fields:
Id: Unique identifier.
Post: The heading of the news article.
News_text: The text of the news article.
News_link: URL link to the original news article.
News_Images: A path to the folder containing images related to the post.
Post_shares: Number of times the post has been shared.
Post_reaction: A JSON object capturing reactions (like, love, etc.) to the post and their counts.
Comment: Text of the user comment.
Comment_like: Number of likes on the comment.
Comment_reaction_rank: A JSON object detailing the type and count of reactions the comment received.
Comment_link: URL link to the original comment on Facebook.
Comment_rank: Rank of the comment based on engagement and relevance.
Score: Sentiment score computed based on the consensus of sentiment analysis models.
Agreement: Indicates the consensus level among the sentiment models, ranging from -4 (all four negative) to 4 (all four positive). For example, three negative and one positive results in -2, and three positive and one negative results in +2.
Sentiment_class: Categorizes the sentiment of the comment into 1 (positive) or 0 (negative).
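To illustrate how these fields can be consumed, here is a minimal sketch using only Python's standard library. The two sample rows and their values are invented for illustration; only the field names come from the list above. A real workflow would open CMFeed.csv instead of the in-memory string.

```python
import csv
import io
import json

# Hypothetical miniature of CMFeed.csv with a subset of the fields above.
sample_csv = io.StringIO(
    'Id,Comment,Comment_like,Post_reaction,Agreement,Sentiment_class\n'
    '1,"Great reporting!",12,"{""like"": 40, ""love"": 5}",4,1\n'
    '2,"Not convinced.",3,"{""like"": 2}",-2,0\n'
)

rows = []
for row in csv.DictReader(sample_csv):
    # Post_reaction is stored as a JSON object; parse it into a dict per row.
    row["Post_reaction"] = json.loads(row["Post_reaction"])
    row["Agreement"] = int(row["Agreement"])
    rows.append(row)

# |Agreement| >= 2 corresponds to at least a 3-1 model split (see the
# Agreement field above: 3-1 gives +/-2, unanimity gives +/-4).
kept = [r for r in rows if abs(r["Agreement"]) >= 2]
print(len(kept), kept[0]["Post_reaction"])
```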

More Considerations During Dataset Construction
We thoroughly considered issues such as the choice of social media platform for data collection, bias and generalizability of the data, selection of news handles/websites, ethical protocols, privacy and potential misuse before beginning data collection. While achieving completely unbiased and fair data is unattainable, we endeavored to minimize biases and ensure as much generalizability as possible. Building on these considerations, we made the following decisions about data sources and handling to ensure the integrity and utility of the dataset:

• Why not merge data from different social media platforms?
We chose not to merge data from platforms such as Reddit and Twitter with Facebook because those platforms lack comprehensive metadata, clear ethical guidelines, and control mechanisms such as who can comment and whether users' anonymity is maintained. These factors are critical for our analysis, and focusing on Facebook alone ensured consistency in data quality and format.

• Choice of four news handles: We selected four news handles—BBC News, Sky News, Fox News, and NY Daily News—to ensure diversity and comprehensive regional coverage. These outlets were chosen for their distinct regional focuses and editorial perspectives: BBC News is known for its global coverage with a centrist view, Sky News offers geographically targeted and politically varied content leaning centre/right in the UK/EU/US, Fox News is recognized for its right-leaning content in the US, and NY Daily News provides left-leaning coverage in New York. Other large-scale news handles such as NDTV, The Hindu, Xinhua, and SCMP were not selected because they may contain content in regional languages such as Indian languages and Chinese. This selection ensures a broad spectrum of political discourse and audience engagement.

• Dataset Generalizability and Bias: With 3.07 billion of the roughly 5 billion social media users worldwide, Facebook's extensive user base, reflective of broader social media engagement patterns, helps ensure that the insights gained are applicable across platforms, reducing bias and strengthening the generalizability of our findings. Additionally, the geographic and political diversity of these news sources, ranging from local (NY Daily News) to international (BBC News) and spanning political spectra from left (NY Daily News) to right (Fox News), ensures a balanced representation of global and political viewpoints in our dataset. This approach mitigates regional and ideological biases and enriches the dataset with a wide array of perspectives, strengthening the robustness and applicability of our research.

• Dataset size and diversity: Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we manually scraped publicly available data. This labor-intensive process, requiring around 800 hours of manual effort, limited our data volume but allowed for precise selection. Following ethical protocols for scraping Facebook data, we selected 1,000 posts from each of the four news handles to enhance diversity and reduce bias. Of the 4,000 posts initially collected, 3,646 remained after preprocessing (detailed in Section 3.1). We then processed all associated comments, resulting in a total of 61,734 comments. This manual method ensures adherence to Facebook's policies and the integrity of our dataset.

Ethical considerations, data privacy and misuse prevention
The data collection adheres to Facebook's ethical guidelines [Ref]. In compliance with Facebook's prohibition of automatic scraping, we manually collected only publicly available data, specifically data corresponding to publicly accessible news articles, following established protocols for ethically scraping Facebook data [Ref]. The human-generated comments are included without identifying information. To prevent potential misuse, we proactively designed our feedback synthesis system with an integral interpretability module. This module not only explains how decisions are made within the system but also helps detect and prevent misuse, such as the creation of misleading or manipulative content. Our motivation for integrating this technology was to ensure that it is used responsibly and ethically, enhancing its positive impact while minimizing risks. By focusing on robust and transparent systems, we aim to foster trust and encourage the responsible use of technology in line with our ethical commitments.

Code and Citation
• Code Repository: https://github.com/MIntelligence-Group/CMFeed/
• Citing the Dataset: Users of the dataset should cite the corresponding paper described at the above GitHub Repository.

License & Access
• This dataset is released for academic research only and is free to researchers from educational or research institutes for non-commercial purposes.
• Note that you download this corpus at your own risk. No guarantee is provided regarding, for example, the quality of the corpus or any effects of its use. You may use it free of charge and modify it as you wish, but clearly specify any modifications if you pass the modified material on.

Contact
Please send any questions about this dataset to:
• Puneet Kumar (puneet.kumar@oulu.fi),
• Sarthak Malik (sarthak_m@mt.iitr.ac.in),
• Balasubramanian Raman (bala@cs.iitr.ac.in),
• Xiaobai Li (xiaobai.li@zju.edu.cn).

Files

Restricted

The record is publicly accessible, but files are restricted and require requesting access via Zenodo.

Additional details

Dates

Valid
2024-06-01

Software

Repository URL
https://github.com/MIntelligence-Group/CMFeed
Programming language
Python