The dataset and software for Simulating Human Responses to Environmental Messaging by Dr Ian Drumm and Dr Atefeh Tate, University of Salford, UK.
Description
This dataset will be referenced from a corresponding journal article.
The data presented pertains to ongoing work to implement and evaluate virtual humans whose responses to environmental messaging are shaped by their media diets and social interactions. The project scraped thousands of social media post–comment pairs related to environmental issues, classified them by viewpoint through the large-scale orchestration of multiple instances of large language models, and built a vector database of embedded interactions with associated classification metadata to serve as a knowledge source for a chatbot. Dynamic, metadata-based filtering of this knowledge source, in conjunction with retrieval-augmented generation, enabled a chatbot with selectable personas that generate responses to new social media posts based on stereotypical attitudes grounded in current news and zeitgeists. A qualitative and quantitative evaluation was conducted to demonstrate the validity of the approach, though its full potential remains to be explored.
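The combination of metadata-based filtering and retrieval described above can be sketched in a few lines. This is an illustrative toy, not the CLIMATE_BOT implementation; the record fields (`category`, `embedding`, `text`) and the `retrieve` function are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, records, persona_category, k=3):
    """Filter the knowledge base by persona category metadata,
    then rank the remaining post/comment pairs by embedding similarity."""
    filtered = [r for r in records if r["category"] == persona_category]
    ranked = sorted(filtered,
                    key=lambda r: cosine(query_embedding, r["embedding"]),
                    reverse=True)
    return [r["text"] for r in ranked[:k]]

# Toy knowledge base of embedded post/comment pairs with classification metadata.
kb = [
    {"category": "Sceptical", "embedding": [1.0, 0.0], "text": "sceptical reply"},
    {"category": "Concerned", "embedding": [1.0, 0.0], "text": "concerned reply"},
    {"category": "Concerned", "embedding": [0.0, 1.0], "text": "off-topic reply"},
]
print(retrieve([0.9, 0.1], kb, "Concerned", k=1))  # → ['concerned reply']
```

The retrieved texts would then be passed as RAG context to the chatbot, so that each persona only "sees" interactions matching its viewpoint category.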
Data Content & Compliance Note
Original Content Redaction: Due to strict data-redistribution restrictions in Reddit's Terms of Service and API policies, the original user-generated content (posts and comments) used to generate this dataset has been entirely redacted.
Data Provided: To enable analysis and evaluation of the metric scoring system, the dataset includes synthetic or "fake" comments and all associated quantitative and qualitative metrics (e.g., BERT scores, perplexity, and category justifications). This allows for the verification of scoring algorithms without violating the original content license.
The code for generating the database of scored comments is provided via the GitHub repository below.
Source Code
https://github.com/iduos/CLIMATE_BOT
Reddit Search
The data relates to a Reddit search and classification based on the following parameters:
| Parameter | Value | Notes / Description |
| --- | --- | --- |
| --subreddits | "worldnews, politics, conservative, liberal, libertarian" | List of subreddits to collect data from |
| --query | "climate change OR global warming OR net zero OR renewable energy OR carbon tax OR sea level rise OR extreme weather" | Search query terms used to filter posts |
| --start_date | "2025-01-01" | Start date for data collection |
| --end_date | "2025-11-01" | End date for data collection |
| --bin_by_period | "month" | Searches respective months in the search period |
| --scoring_prompts | "rubrics/climateUK4.json" | Prompt file for scoring climate-related discussions |
| --scorer_llm | "gemini-2.5-flash" | Model used for scoring content |
| --embed_model | "nomic-embed-text:latest" | Embedding model used for storing post/comment pairs |
| --sample_size | 5000 | Samples from 30,000+ items to build the database |
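Assuming a collection script that accepts the flags above (the script name `collect_and_score.py` is hypothetical, not taken from the repository), the full invocation could be assembled like this:

```python
import shlex

# Parameters copied from the table above; only the script name is illustrative.
params = {
    "--subreddits": "worldnews, politics, conservative, liberal, libertarian",
    "--query": ("climate change OR global warming OR net zero OR renewable energy "
                "OR carbon tax OR sea level rise OR extreme weather"),
    "--start_date": "2025-01-01",
    "--end_date": "2025-11-01",
    "--bin_by_period": "month",
    "--scoring_prompts": "rubrics/climateUK4.json",
    "--scorer_llm": "gemini-2.5-flash",
    "--embed_model": "nomic-embed-text:latest",
    "--sample_size": "5000",
}

argv = ["python", "collect_and_score.py"]
for flag, value in params.items():
    argv += [flag, value]

# Shell-quote the multi-word values so the command can be pasted into a terminal.
print(" ".join(shlex.quote(a) for a in argv))
```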
Classified Knowledge Base
An export of the vector database with 5,000 classified items is given in
climate_uk4_5000GF_database_export.csv
Rubric for classification
climateUK4.json
Evaluation Metrics
Evaluation metrics are given for 400 items sampled from the database (100 per category).
We conducted both qualitative and quantitative evaluations of the chatbot. For a given post, a real comment and a generated comment were compared. The real comment's id was used to filter it from the RAG context, ensuring that no real comment used in the evaluation contributed to formulating the chatbot's generated comment. Each generated comment was added to a new dataset of items of the form {post / real comment / chatbot comment}. The {real comment / chatbot comment} pairs were presented as reference and generated texts, respectively, to a variety of tools for calculating chatbot metrics, and averages were then computed. We aimed to assess linguistic similarity, semantic alignment, language predictability, and emotional alignment. Our automated evaluation used the following quantitative metrics: BERTScore F1, embedding similarity (ES), real comment perplexity (RP), chatbot comment perplexity (BP), emotional similarity (EM), and sentiment difference (SD). The baselines represent random pairings of real comments, serving as an empirical chance-level reference. The table gives the overall evaluation.
| Category | BERT F1 | ES | RP median [IQR] | BP median [IQR] | EM | SD |
| --- | --- | --- | --- | --- | --- | --- |
| Concerned | 0.706 ±0.026 | 0.204 ±0.132 | 69.6 [45.3-150.9] | 64.9 [45.1-96.1] | 0.722 ±0.152 | 0.658 ±0.427 |
| Sceptical | 0.702 ±0.032 | 0.182 ±0.126 | 92.6 [51.2-165.6] | 72.0 [51.4-154.4] | 0.736 ±0.135 | 0.553 ±0.386 |
| Paradoxical | 0.700 ±0.033 | 0.170 ±0.127 | 66.4 [44.3-129.7] | 65.3 [46.5-111.8] | 0.760 ±0.154 | 0.510 ±0.419 |
| Baselines (real comment pairs) | 0.695 | 0.129 | n/a | n/a | 0.757 | 0.559 |
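The per-category aggregates in the table (mean ± standard deviation) can be reproduced from the per-item evaluation CSVs with a sketch like the following. The inline CSV below is a toy stand-in; only the column name `bert_score_f1` is taken from the per-item fields documented in this README.

```python
import csv
import io
import statistics

def category_summary(csv_text, metric):
    """Mean and sample standard deviation of one metric column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(r[metric]) for r in rows]
    return statistics.mean(values), statistics.stdev(values)

# Toy stand-in for one of the gf_<category>.csv files.
demo = """bert_score_f1,embedding_similarity
0.70,0.20
0.72,0.18
0.68,0.22
"""
mean, sd = category_summary(demo, "bert_score_f1")
print(f"BERT F1: {mean:.3f} ±{sd:.3f}")  # → BERT F1: 0.700 ±0.020
```

For the perplexity columns the same loop would report `statistics.median` and the interquartile range rather than mean ± standard deviation.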
Here is the data we used to ascertain these metrics.
/evaluation_data/
gf_concerned.csv
gf_sceptical.csv
gf_paradoxical.csv
gf_irrelevant.csv
Each evaluated item includes:
- the Reddit post
- the original comment, its classification by the LLM, and the LLM's justification
- the fake (bot-generated) comment, its classification by the LLM, and the LLM's justification
- metrics comparing the original and fake comments:
  - bert_score_f1, bert_score_precision, bert_score_recall
  - embedding_similarity
  - emotional_similarity
  - sentiment_difference
  - generated_perplexity (fake comment perplexity)
  - reference_perplexity (original comment perplexity)
Human Evaluation
Human evaluation was conducted by two coders on 100 samples equally distributed across the categories (Concerned, Paradoxical, Sceptical, and Irrelevant). The file gives the LLM (Gemini 2.5 Flash) classifications of original comments on Reddit posts alongside human classifications of the same comments. Both apply the rubric given in climateUK4.json.
/evaluation_data/
Majority_LLM_CAT_v_HUMAN.csv
FAKE_v_HUMAN gives the category filters used for generating the fake comments on Reddit posts, alongside human classifications of those fake comments applying the rubric given in climateUK4.json.
/evaluation_data/
Majority_FAKE_v_HUMAN.csv
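Agreement between the LLM and human classifications in these files can be summarised with Cohen's kappa. The minimal stdlib sketch below assumes the two label columns have already been read into aligned lists; it is not code from the repository.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy aligned labels standing in for the LLM and human columns.
llm   = ["Concerned", "Sceptical", "Concerned",   "Irrelevant"]
human = ["Concerned", "Sceptical", "Paradoxical", "Irrelevant"]
print(round(cohens_kappa(llm, human), 3))  # → 0.667
```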
Lineup tests
We evaluated whether chatbot comments were distinguishable from real Reddit comments using a lineup-style human judgment task (Wickham et al., 2010). For each post, panels of five comments were shown to raters, four real and one chatbot-generated, within one of four viewpoint categories (Sceptical, Paradoxical, Concerned, Irrelevant). Comments were anonymized and standardized, and the chatbot comment's position was randomized. Three raters each evaluated 50 panels (100 total trials) and selected the comment they believed was written by the chatbot. Performance near the chance level (1/5 = 20%) indicated that the chatbot's comments were operationally indistinguishable from human ones.
/evaluation_data/
lineup_CODER_A.csv
lineup_CODER_B.csv
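Lineup performance can be scored against the 1-in-5 chance level as follows. The data here is a toy stand-in, not the actual format of the lineup CSVs.

```python
import math

def lineup_accuracy(picks, truths):
    """Fraction of panels where the rater picked the chatbot's true position."""
    return sum(p == t for p, t in zip(picks, truths)) / len(picks)

def binomial_p_at_least(hits, n, p=0.2):
    """One-sided probability of scoring >= hits out of n under 1-in-5 chance."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(hits, n + 1))

# Toy example: 10 panels; the rater's pick vs the chatbot's true position (1-5).
picks  = [1, 4, 2, 3, 5, 1, 2, 2, 4, 3]
truths = [1, 4, 3, 3, 1, 2, 5, 5, 1, 5]
acc = lineup_accuracy(picks, truths)  # 0.3, near the 0.2 chance level
p_value = binomial_p_at_least(int(acc * len(picks)), len(picks))
print(acc, round(p_value, 3))
```

A large p-value here means the rater's hit rate is statistically consistent with guessing, i.e. the chatbot comments were not reliably identifiable.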
Files
Additional details
Software
- Repository URL: https://github.com/iduos/CLIMATE_BOT
- Programming language: Python
- Development Status: Active