Published May 11, 2021 | Version v4
Dataset Open

CTAB: Corpus of Tunisian Arabizi

  • 1. Data Engineering and Semantics Research Unit (DES-Unit), Faculty of Sciences of Sfax, University of Sfax, Tunisia
  • 2. Université de Moncton, Moncton, Canada

Description

This dataset was created between 2017 and 2021 to provide a textual resource for studying how Tunisians write Tunisian Arabic (ISO 639-3: aeb) in Latin script. The corpus consists of messages written in the Tunisian Arabic Chat Alphabet, or Arabizi, and was developed to address the lack of NLP resources on the use of Latin script for transcribing Tunisian Arabic. The messages were collected automatically by scraping public Facebook pages and are kept as they are, without annotation, spelling correction, or morphological or syntactic labeling. Messages written in Latin script but not in Tunisian Arabic were then removed manually. Finally, all messages retrieved from the same Facebook page in the same period are stored in a single text file, with one message per line.
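Because each file holds one message per line, a corpus file can be read into a list of messages with very little code. The following is a minimal sketch, not a loader shipped with the dataset: the function names `load_messages` and `load_corpus` are hypothetical, and UTF-8 encoding is an assumption that should be checked against the downloaded files.

```python
from pathlib import Path


def load_messages(path, encoding="utf-8"):
    """Return the messages in one corpus file, one per line.

    Assumption: files are UTF-8 text. Blank lines are skipped.
    """
    messages = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            message = line.strip()
            if message:
                messages.append(message)
    return messages


def load_corpus(directory, pattern="*.txt"):
    """Map each matching file name in `directory` to its list of messages."""
    return {
        path.name: load_messages(path)
        for path in sorted(Path(directory).glob(pattern))
    }
```

Keeping the messages grouped by file preserves the page-and-period grouping described above, which may matter for analyses that compare sources.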

Notes

This corpus was developed by the Data Engineering and Semantics Research Unit (DES-Unit), University of Sfax, Tunisia, to increase the coverage of Latin script in NLP resources for Tunisian Arabic. It is included as part of the Tunisian Arabic Corpus (http://www.tunisiya.org/).

Files

CTAB-SAMPLE0001.txt

Files (356.6 kB)

Checksum                                Size
md5:dc42ee73cf7e5b125188785d0e26306f    22.4 kB
md5:28a1c0d0bae0ccb61427446748e714bb    5.9 kB
md5:ea1a3a417892d9935db8ba70d0fcf03e    3.1 kB
md5:db569c9391eadb4b5f796facdf853802    22.9 kB
md5:3b8f8919a844931bee9070d12f5de8a7    23.3 kB
md5:b01d16e95e674e3ded630e92589cb33e    21.7 kB
md5:6f3a59e99c2330a45f520569c38bb18d    5.1 kB
md5:17e4180b6af6965217a5a0638353d0a8    60.1 kB
md5:2d51ebbeeaa5431c36765f76e438f3c8    20.5 kB
md5:7424d805e2502b222805c03bf956a1e5    27.8 kB
md5:89c64af867793fe690531722348edc1d    39.8 kB
md5:263ac4be6e096a3078e08ae8a5c4fb3f    32.7 kB
md5:a4f6b25fe7a97df601750f538046bb4c    21.3 kB
md5:5dee23175faaaf83fffea0d4bcda78de    50.1 kB
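
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses Python's standard `hashlib` module; the names `md5_hex` and `is_listed` are hypothetical helpers. Since this listing does not pair checksums with file names, the check only tests whether a file's digest appears anywhere in the published set.

```python
import hashlib

# MD5 digests copied from the file listing of this record.
LISTED_MD5 = {
    "dc42ee73cf7e5b125188785d0e26306f",
    "28a1c0d0bae0ccb61427446748e714bb",
    "ea1a3a417892d9935db8ba70d0fcf03e",
    "db569c9391eadb4b5f796facdf853802",
    "3b8f8919a844931bee9070d12f5de8a7",
    "b01d16e95e674e3ded630e92589cb33e",
    "6f3a59e99c2330a45f520569c38bb18d",
    "17e4180b6af6965217a5a0638353d0a8",
    "2d51ebbeeaa5431c36765f76e438f3c8",
    "7424d805e2502b222805c03bf956a1e5",
    "89c64af867793fe690531722348edc1d",
    "263ac4be6e096a3078e08ae8a5c4fb3f",
    "a4f6b25fe7a97df601750f538046bb4c",
    "5dee23175faaaf83fffea0d4bcda78de",
}


def md5_hex(path, chunk_size=65536):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """Return True if the file's MD5 digest is one of the published checksums."""
    return md5_hex(path) in LISTED_MD5
```

A file for which `is_listed` returns False was either altered in transit or belongs to a different version of the record.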