Hyperreal Talk (Polish clear web message board) messages data
Description
General Information
1. Title of Dataset
Hyperreal Talk (Polish clear web message board) messages data.
2. Data Collectors
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
3. Funding Information
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
4. Data Source
Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).
5. Purpose
This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The Hyperreal Talk forum is a major online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the project goals listed above.
6. Collection Method
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
7. Collection Date
The data was collected in two periods: September 2023 and November 2023.
Data Content
8. Data Description
The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains posts that are accessible without logging in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.
9. Data Cleaning, Processing, and Anonymization
The data has been cleaned and processed using regular expressions in Python. Personal information was also removed with regular expressions. Identifiers related to instant messaging apps and email addresses were hashed so that the original values no longer appear in the data. Furthermore, all usernames appearing in messages have been eliminated.
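As an illustration of the kind of regex-based hashing described above, here is a minimal sketch. It is not the project's actual code (that is documented in the GitHub repositories listed under Related Documentation). The email pattern, the `<email:...>` placeholder format, the salt handling, and the function names `hash_token` and `anonymize` are all assumptions made for this example.

```python
import hashlib
import re

# Assumed pattern: a deliberately simple email matcher,
# not the regular expression used by the project.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_token(token: str, salt: bytes) -> str:
    """Return a truncated, salted SHA-256 digest of the lower-cased token."""
    digest = hashlib.sha256(salt + token.lower().encode("utf-8")).hexdigest()
    return digest[:16]


def anonymize(text: str, salt: bytes) -> str:
    """Replace every email-like match with a hashed placeholder."""
    return EMAIL_RE.sub(
        lambda match: f"<email:{hash_token(match.group(0), salt)}>", text
    )
```

With this approach the same address always maps to the same placeholder within one run, so references to one contact stay linkable without exposing the address itself. Generating the salt randomly and discarding it after processing would make the mapping harder to reverse.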
10. File Formats and Variables/Fields
The dataset consists of the following files:
- Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.
- Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. As with hyperreal.zip, these files are organized into directories corresponding to the website’s folder structure.
- A .csv file that lists all the messages, including file names and the content of each post.
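One possible way to read the zipped .txt files without unpacking the whole archive is sketched below. The helper name `iter_posts` is hypothetical, and UTF-8 encoding is an assumption; the record does not state the encoding of the files.

```python
import zipfile
from typing import Iterator, Tuple


def iter_posts(zip_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (path inside the archive, text) for every .txt member."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything that is not a .txt file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with archive.open(info) as handle:
                # Assumed encoding; undecodable bytes are replaced, not raised.
                yield info.filename, handle.read().decode(encoding, errors="replace")
```

Because the directories in each archive mirror the message board's folder structure, the path yielded with each post can be split on `/` to recover the board section it came from.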
Accessibility and Usage
11. Access Conditions
The data can be accessed without any restrictions.
12. Related Documentation
Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”
Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:
- https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
- https://github.com/HaitaoShi/Scrapy_hyperreal
Ethical Considerations
13. Ethics Statement
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
Files
Total size: 6.1 GB
Checksum | Size |
---|---|
md5:05bce7ae2759f15d25e47bbd45a634dd | 870.3 MB |
md5:17ca6f60cde6730034fa8754a105f95d | 2.3 GB |
md5:431f57dd1af344d8c1b06ebb7559b5eb | 3.7 kB |
md5:7d35296094647b3d1304a6c36005f08b | 2.9 GB |
md5:d5601957f841a42b4a8345e3df8fecc9 | 6.8 kB |
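The MD5 checksums listed for the files can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_checksum` are hypothetical, and the caller supplies the local path of the downloaded file and the matching checksum string.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

The checksum string can be pasted in as listed, including the `md5:` prefix. MD5 is adequate here for detecting transfer errors, though it is not suitable as protection against deliberate tampering.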
Additional details
Related works
- Continues
- Journal article: 10.1093/jcmc/zmac023 (DOI)
- Conference proceeding: 10125/103073 (Handle)
Dates
- Collected: 2023-09
- Collected: 2023-11
Software
- Repository URL: https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
- Programming language: Python