Published July 10, 2024 | Version v1.0
Dataset Open

StergiosChatzikyriakidis/Greek_dialect_corpus: v1.0

Description

A collection of raw text from various Greek dialects. Contains data from the following dialects:

  • Cypriot Greek
  • Cretan Greek
  • Pontic Greek
  • Northern Greek
  • Part of the Modern Greek Wikipedia

The repository contains data collected from the web and other textual resources (blogs, websites, and theatrical plays, among other things). The folder SMG_CG contains Twitter data in Standard Modern Greek and Cypriot Greek that were originally collected by Hanna Sababa for her project A Classifier to Distinguish Between Cypriot Greek and Standard Modern Greek. Mr Sfakianakis is thanked for providing us with his Cretan translations of a number of Ancient Greek tragedies and comedies. The folder all_dialects contains a zip file that holds the collected data, with minimal pre-processing and annotation, for each respective dialect. Stergios Chatzikyriakidis gratefully acknowledges funding from the European Commission under TALOS AI for SSH, Grant agreement ID: 101087269.
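
Since the data are distributed as zip archives organised into folders, a quick way to get an overview is to count the files under each folder before extracting anything. The following is a minimal sketch using only the Python standard library. The function name files_per_folder and the file names in the demonstration archive are hypothetical placeholders chosen for illustration; they do not describe the actual layout of the release.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def files_per_folder(archive):
    """Count regular files under each folder of a zip archive.

    `archive` may be a path on disk or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            # Zip member names always use forward slashes.
            parent = str(PurePosixPath(info.filename).parent)
            counts[parent] += 1
    return counts


# Demonstration on a small in-memory archive.
# All member names below are hypothetical, not the real corpus layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("corpus/SMG_CG/sample_a.txt", "placeholder")
    zf.writestr("corpus/SMG_CG/sample_b.txt", "placeholder")
    zf.writestr("corpus/all_dialects/all.zip", "placeholder")
buffer.seek(0)

demo_counts = files_per_folder(buffer)
for folder, n in sorted(demo_counts.items()):
    print(f"{folder}: {n} file(s)")
```

To inspect a downloaded copy, pass the local path of the archive to files_per_folder instead of the in-memory buffer.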

Files

StergiosChatzikyriakidis/Greek_dialect_corpus-v1.0.zip

Files (66.9 MB)

Additional details