Published March 12, 2024 | Version v1
Dataset Open

Hyperreal Talk (Polish clear web message board) messages data

  • 1. Kazimierz Wielki University in Bydgoszcz
  • 2. University of Edinburgh
  • 1. Kazimierz Wielki University in Bydgoszcz
  • 2. Jagiellonian University
  • 3. University of Warsaw
  • 4. Tampere University
  • 5. University of Edinburgh
  • 6. Opole University
  • 7. Collegium Civitas

Description

General Information

1. Title of Dataset

Hyperreal Talk (Polish clear web message board) messages data.

2. Data Collectors

Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

3. Funding Information

The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

Data Collection Context

4. Data Source

Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).

5. Purpose

This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

The Hyperreal Talk forum is a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the project goals listed above.

6. Collection Method

The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

7. Collection Date

The data was collected in two periods, i.e., in September 2023 and November 2023.

Data Content

8. Data Description

The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains posts that are accessible without logging in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can be viewed only by logged-in users. For each directory, a .txt file details the structure of the message board folders from which the posts were extracted. In total, the dataset includes 6,248,842 posts.

9. Data Cleaning, Processing, and Anonymization

The data has been cleaned and processed using regular expressions in Python. All personal information was also removed using regular expressions. Identifiers related to instant messaging apps and email addresses have been hashed so that the original values do not appear in the data. Furthermore, all usernames appearing in messages have been removed.
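
The regex-plus-hashing step described above can be illustrated with a minimal sketch. It is not the project's actual code, which lives in the GitHub repository listed under Software. The email pattern, the choice of SHA-256, the optional salt, and the function names `hash_identifier` and `anonymize` are all assumptions made for illustration.

```python
import hashlib
import re

# Assumed, simplified email pattern; the project's real patterns may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str = "") -> str:
    """Return an irreversible SHA-256 hex digest of a lowercased identifier."""
    return hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()


def anonymize(text: str, salt: str = "") -> str:
    """Replace every email-like match in the text with its hash."""
    return EMAIL_RE.sub(lambda m: hash_identifier(m.group(0), salt), text)
```

Because the identifier is lowercased before hashing, the same address written with different capitalization maps to the same digest, which keeps repeated mentions linkable without exposing the original value.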

10. File Formats and Variables/Fields

The dataset consists of the following files:

  • Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.
  • Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.
  • A .csv file that lists all the messages, including file names and the content of each post.
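
As a rough illustration of working with the zipped .txt files, the sketch below iterates over the messages in one archive without unpacking it to disk. The helper name `iter_messages` and the UTF-8 encoding are assumptions; the record does not state the encoding of the files or the column layout of the .csv file.

```python
import zipfile
from typing import Iterator, Tuple


def iter_messages(zip_path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (path inside the archive, message text) for every .txt file."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything that is not a .txt file.
            if info.is_dir() or not info.filename.endswith(".txt"):
                continue
            with archive.open(info) as handle:
                # The encoding is an assumption; undecodable bytes are replaced.
                yield info.filename, handle.read().decode(encoding, errors="replace")
```

A call such as `for name, text in iter_messages("hyperreal.zip"): ...` would then walk the archive one message at a time, with `name` reflecting the folder structure of the message board.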

Accessibility and Usage

11. Access Conditions

The data can be accessed without any restrictions.

12. Related Documentation

Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”

Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the repository URL listed under Software below.

Ethical Considerations

13. Ethics Statement

A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

Files

Files (6.1 GB)

MD5 checksum | Size
md5:05bce7ae2759f15d25e47bbd45a634dd | 870.3 MB
md5:17ca6f60cde6730034fa8754a105f95d | 2.3 GB
md5:431f57dd1af344d8c1b06ebb7559b5eb | 3.7 kB
md5:7d35296094647b3d1304a6c36005f08b | 2.9 GB
md5:d5601957f841a42b4a8345e3df8fecc9 | 6.8 kB
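
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_checksum` are hypothetical, and because the listing does not show which checksum belongs to which file, the caller has to supply the matching pair.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 with an expected value, with or without 'md5:'."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.lower()
```

MD5 is used here only as an integrity check against the published values, not as a security guarantee.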

Additional details

Related works

Continues
Journal article: 10.1093/jcmc/zmac023 (DOI)
Conference proceeding: 10125/103073 (Handle)

Funding

National Science Center
Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade 2021/43/B/HS6/00710

Dates

Collected
2023-09
Collected
2023-11

Software

Repository URL
https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
Programming language
Python