Published February 7, 2023 | Version 1.0.
Dataset Restricted

TRACES Bulgarian Twitter Dataset on Famous Bulgarian Political Cases of Suspected Lies, Annotated with Linguistic Markers of Lies

Description

This dataset was created within Project TRACES (more information: https://traces.gate-ai.eu/). It contains the IDs of 15850 tweets written in Bulgarian, together with their annotations. It can be used for general purposes or for building applications that detect lies and disinformation.

Note: this dataset is not fact-checked; the social media messages were retrieved via keywords. For fact-checked datasets, see our other datasets.

The tweets (written between 1 Jan 2020 and 7 July 2022) were collected via the Twitter API under academic access in June-July 2022, excluding retweets, with the following keywords:

  • (ваксиниран депутат) OR (ваксинирани депутати) 

  • (язовири премиер) OR (язовири прокуратура) OR (язовири прокуратурата)

  • ((мвр хемус) OR мвр) (прокуратура OR прокуратурата)

  • (шефът тотото) OR (изпълнителният директор Българския спортен тотализатор)

  • (кирил петков двойно гражданство) OR (премиер двойно гражданство) OR (премиер гражданство)

  • ((Пътна OR загубена OR загуби OR изчезнала) карта газпром)

  • (министър плагиат плагиатство) OR (плагиат плагиатство)

  • ((изслушване главния прокурор) OR (иван гешев)) 

  • (фалшива диплома)

  • (златни паспорти)

  • (апартаментгейт OR (къща за гости) OR (къщи за гости))

  • (оръжия OR оръжие) (Украйна OR украина)

  • ((цена OR цени) (газ OR ток OR нафта OR бензин))

  • (мвр OR данс) (фалшиви новини)

  • (данъци OR данъчни OR данък)

  • ((кораб Царевна) OR Царевна)

  • (Северна Македония) 
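
As an illustration only, the keyword lines above could be turned into request parameters for a full-archive tweet search roughly as follows. This is a minimal sketch and not the TRACES team's collection code, which is not part of this record. The function name `build_search_params` is hypothetical, and the use of the `-is:retweet` operator and of the `start_time`/`end_time` parameters of the Twitter API v2 search endpoint is an assumption about how "without retweets" and the date range might have been expressed. No network request is made.

```python
from datetime import datetime, timezone

# Two of the keyword lines listed above, copied verbatim.
KEYWORD_QUERIES = [
    "(фалшива диплома)",
    "(оръжия OR оръжие) (Украйна OR украина)",
]

# Date range stated in the description. end_time is treated as exclusive,
# so 8 July is used to include tweets written on 7 July 2022 (assumption).
START = datetime(2020, 1, 1, tzinfo=timezone.utc)
END = datetime(2022, 7, 8, tzinfo=timezone.utc)


def build_search_params(keywords, start, end):
    """Return query parameters for one search request (hypothetical helper)."""
    timestamp_format = "%Y-%m-%dT%H:%M:%SZ"
    return {
        # Wrap the keywords in parentheses so that the retweet filter
        # applies to every OR branch of the keyword line.
        "query": f"({keywords}) -is:retweet",
        "start_time": start.strftime(timestamp_format),
        "end_time": end.strftime(timestamp_format),
    }


for keywords in KEYWORD_QUERIES:
    print(build_search_params(keywords, START, END))
```

Wrapping each keyword line in an outer pair of parentheses keeps the retweet filter from binding only to the last OR alternative.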

Explanations of which fields can be used as markers of lies (or of intentional disinformation) are provided in our paper:

Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.

Notes

The project TRACES has indirectly received funding from the European Union's Horizon 2020 research and innovation action programme, via the AI4Media Open Call #1 issued and executed under the AI4Media project (Grant Agreement no. 951911).

 

The dataset is shared with the License: Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).

In accordance with European Union laws, user profiling of the authors of these texts is forbidden. 

The Project Sponsors (European Commission and the AI4Media project), researchers, users, or subjects shall not be liable or otherwise responsible for any damages (including pecuniary or moral damages) arising out of or in relation to the uses of this dataset.

When using the dataset, please cite this article:

Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.


Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

This dataset is provided under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA) with the additional terms below. After reading the terms, please state that you have read and accept the conditions, and describe the entity you are applying from (if any, e.g. academic institution, company, government agency) and your intended use of the dataset.

The TRACES team members will review your application and decide whether to grant you access.

If you have questions, please contact us at: irina.temnikova@gmail.com

Conditions for using the dataset TRACES_Dts9.1.4_socialmedia_AutomaticAnnTwitterSuspectedLies_(version).csv:

To download or use this dataset, you must agree to and abide by the following terms and conditions, which are in line with applicable legislation, including but not limited to the General Data Protection Regulation (GDPR), the Artificial Intelligence Act (AI Act, current draft as of 01 November 2022, pending adoption and entry into force), the TRACES Project Data Management Plan, and the Twitter requirements:

  • The dataset has been created with Twitter Academic Access and thus cannot be used for commercial purposes.

  • The dataset is anonymized, and all personal information revealing the authors has been removed. The dataset cannot be used for profiling Twitter users or for any applications that breach the AI Act’s provisions. Reconstructing the identity of the authors of the social media datasets is forbidden.

  • Upon request of the authors of the tweets or of the TRACES team, specific tweet IDs and their annotations must be deleted.

  • The social media posts included in this dataset are annotated with linguistic markers *potentially* signaling lies. The presence of such markers does not indicate with 100% confidence that the social media posts contain disinformation, misinformation, lies, untrue facts, and/or other inconsistencies; it indicates only a lower degree of confidence (a certain likelihood, but not certainty).

  • The linguistic markers are currently being developed and are provided purely and solely for scientific purposes. They cannot be used as conclusive evidence, as arguments on the merits of the dataset, as evidence in judicial or administrative proceedings or in any other way not directly related to Project TRACES.

  • No legal action should or could be taken against the authors of the social media posts, included in the dataset, solely based on the presence of linguistic markers, potentially signaling lies.

  • The dataset is not suitable to be used and shall not be used for governmental or public authority purposes, including for investigations, government surveillance, intelligence work, analysis, criminal investigation, court or administrative proceedings. 

  • The presence of linguistic markers of potential lies in social media posts is not a statement, belief, or affirmation of the Project's team members or affiliated institutions.

  • The Project Sponsors (AI4Media, F6S, and the European Commission), the members of the TRACES team, users, or subjects shall not be liable or otherwise responsible for any consequences and/or damages (including pecuniary or moral damages) arising out of or in relation to the Project, the data collected, and the methods used for their analysis and/or the results/outcomes.

  • This notice, as well as all the activities of the TRACES Project and of its Project Sponsors, team members, users, or subjects, including any contractual and/or non-contractual liability, are governed exclusively by the European Union laws and by the laws of the Republic of Bulgaria.

  • You agree to provide attribution to the TRACES project in the following format:

    • The TRACES project (https://traces.gate-ai.eu/)

    • Dataset name: TRACES_Dts9.1.4_socialmedia_AutomaticAnnTwitterSuspectedLies_1

    • Data source: Twitter.

    • Research article to cite: Irina Temnikova, Silvia Gargova, Ruslana Margova, Veneta Kireva, Ivo Dzhumerov, Tsvetelina Stefanova and Hristiana Nikolaeva (2023) New Bulgarian Resources for Detecting Disinformation. 10th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'23). Poznań. Poland.

    • Link to the original dataset: https://zenodo.org/record/7614357.


Additional details

Funding

European Commission
AI4Media – A European Excellence Centre for Media, Society and Democracy 951911