Published November 27, 2024 | Version v1.0
Dataset Open

AIT OSINT Summer2024 Data Set

Description

The cyber security news items in this dataset were gathered from publicly accessible sources using the OSINT platform Taranis AI (https://taranis.ai/). The dataset spans the period from May 13, 2024, to July 31, 2024, and consists of daily JSON files containing around 12,000 news items, each in either German or English.

Metadata
Taranis AI automatically collects news items every eight hours and processes them for further use. During this process, the content, author, publication date, and title are taken directly from the RSS feed without modification. The platform then adds the following attributes to the news item: an ID, the link to the collected news item, the ID and URL of the open source, and a hash generated from the author, title, and link.
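The hash attribute can be illustrated with a minimal sketch. The news item field list in this description states that it is a SHA-256 hash over author, title, and link. The concatenation order, the absence of a separator, and the UTF-8 encoding used below are assumptions, and news_item_hash is a hypothetical helper name, so recomputed values may not match the hashes stored in the dataset.

```python
import hashlib


def news_item_hash(author: str, title: str, link: str) -> str:
    """Return a SHA-256 hex digest over author, title, and link.

    Assumption: the three fields are concatenated in this order without a
    separator and encoded as UTF-8. The dataset description does not
    specify these details.
    """
    combined = author + title + link
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

Because the digest depends only on these three fields, two collected items with identical author, title, and link would produce the same value under this sketch, which is one way such a hash could support duplicate detection.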

One of the primary objectives of Taranis AI is to present news items to human analysts in a way that optimizes their analysis time. To achieve this, similar news items need to be grouped together. Therefore, news items are wrapped in a data item, which enables multiple news items to be stored within a single data item when clustering is applied. However, since no clustering algorithm was applied to the provided dataset, each data item contains only one news item.
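The wrapping step without clustering can be sketched as follows, using the field names from the lists in this description. This is an illustration only and not the Taranis AI implementation: wrap_news_item is a hypothetical helper, the use of uuid4 for the data item ID is an assumption, and the empty tags and attributes placeholders assume that tagging happens in a later step.

```python
import uuid
from typing import Any


def wrap_news_item(news_item: dict[str, Any]) -> dict[str, Any]:
    """Wrap a single news item in a new data item (no clustering applied)."""
    # Assumption: a random UUID is acceptable as the data item ID.
    data_item_id = str(uuid.uuid4())
    # Copy the news item so the caller's dictionary is not modified, and
    # point its story_id at the enclosing data item.
    wrapped = dict(news_item, story_id=data_item_id)
    return {
        "id": data_item_id,
        "created": wrapped.get("published"),  # timestamp of the first news item
        "title": wrapped.get("title"),        # title of the first news item
        "news_items": [wrapped],
        "tags": {},        # assumption: filled in later by the tagging step
        "attributes": [],  # assumption: filled in later by the tagging step
    }
```

With clustering enabled, the news_items list would hold several entries, while created and title would still come from the first one.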

Each data item consists of:

  • id -- ID of data item (Format: UUID)
  • created -- Publish timestamp of first news item stored in news_items (Format: Timestamp)
  • news_items -- News item (Format: List of JSON)
  • title -- Title of first news item stored in news_items (Format: String)
  • tags -- Tags (Format: Nested JSON)
  • attributes -- Information about tag creation (Format: List of JSON)

Each news item, encapsulated in a data item, consists of:

  • id -- ID of news item (Format: UUID)
  • author -- Author of collected news item (Format: String)
  • content -- News item's content (Format: String)
  • hash -- SHA-256 hash of following attributes: Author, title, and link (Format: SHA-256)
  • link -- Link to news item (Format: URL)
  • osint_source_id -- ID of open source (Format: UUID)
  • published -- Timestamp of news item’s publication (Format: Timestamp)
  • source -- URL of open source (Format: URL)
  • story_id -- ID of data item (same as id in data item) (Format: UUID)
  • title -- News item’s title (Format: String)
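The two structures above can be read with a short sketch like the one below. It assumes that the archive has been extracted to a local directory and that each daily file contains a JSON list of data items; the description does not state the top-level layout, so a single top-level object is also tolerated. The function names load_data_items and iter_news_items are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def load_data_items(directory: str) -> Iterator[dict[str, Any]]:
    """Yield every data item from all *.json files below `directory`."""
    for path in sorted(Path(directory).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
        # Assumption: each daily file holds a JSON list of data items.
        # A single top-level object is treated as a one-element list.
        items = payload if isinstance(payload, list) else [payload]
        yield from items


def iter_news_items(data_items: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield the news items contained in the given data items."""
    for data_item in data_items:
        # In this dataset each data item is expected to hold one news item,
        # but the loop also works for clustered data items.
        yield from data_item.get("news_items", [])
```

For example, `[n["title"] for n in iter_news_items(load_data_items("AIT-OSINT-Summer2024"))]` would collect all news item titles, where the directory name is a placeholder for wherever the archive was extracted.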

Each data item in the dataset is assigned tags that represent its content. Tagging is carried out using Named Entity Recognition (NER), a word-matching algorithm, and/or by extracting Indicators of Compromise (IoCs) and Common Vulnerabilities and Exposures (CVE) IDs from the text. The attributes field of each data item specifies the tagging technique used, providing clarity on the method applied for content classification.
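As an illustration of the CVE extraction mentioned above, the following minimal sketch finds CVE IDs with a regular expression. It is not the Taranis AI implementation: the pattern follows the common CVE ID syntax (the prefix "CVE", a four-digit year, and a sequence number of at least four digits), and extract_cve_ids is a hypothetical helper name.

```python
import re

# Common CVE ID syntax: "CVE-", a four-digit year, "-", and at least four digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(text: str) -> list[str]:
    """Return the distinct CVE IDs in `text`, upper-cased, in order of first appearance."""
    seen: dict[str, None] = {}  # dict keys keep insertion order and drop duplicates
    for match in CVE_PATTERN.finditer(text):
        seen.setdefault(match.group(0).upper(), None)
    return list(seen)
```

Applied to the content field of a news item, the returned IDs could serve as candidate tags for the enclosing data item.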

The associated PDF in this data set provides further information on the number of data items collected per day, as well as the distribution across the most common authors, sources, and tags. It also contains a full list of the public sources used for the collection.
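Distributions of this kind can also be recomputed from the JSON files. The sketch below counts the values of a plain string field of the news items, such as author or source; field_distribution is a hypothetical helper, and the input is assumed to be an iterable of already loaded data items.

```python
from collections import Counter
from typing import Any, Iterable


def field_distribution(data_items: Iterable[dict[str, Any]], field: str) -> Counter:
    """Count how often each value of `field` occurs across all news items."""
    counts: Counter = Counter()
    for data_item in data_items:
        for news_item in data_item.get("news_items", []):
            value = news_item.get(field)
            if value:  # skip missing or empty values
                counts[value] += 1
    return counts
```

For example, `field_distribution(items, "author").most_common(10)` lists the ten most frequent authors. Counting tags would require knowledge of the nested tag structure, which this description does not spell out.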

If you use the AIT-OSINT-Summer2024 data set, please cite the following publication:

[1] Skopik, F., Akhras, B., Woisetschläger, E., Andresel, M., Wurzenberger, M., Landauer, M. (2024). On the Application of Natural Language Processing for Advanced OSINT Analysis in Cyber Defence. In Proceedings of the 19th International Conference on Availability, Reliability and Security (pp. 1-10).

Files

AIT-OSINT-Summer2024-v1.0.zip

Files (28.1 MB)

md5:c678cc469c845c203719d6a7c0f11bd6 (27.8 MB)
md5:1ededb680142a5ad93dbf17e86ac9854 (285.5 kB)

Additional details

Dates

Collected
2024-05-13 / 2024-07-31