Conference paper Open Access

Time-Aware Word Embeddings of Three Lebanese News Archives

Doughman, Jad; Abu Salem, Fatima; Elbassuoni, Shady

Abstract: Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google's Tesseract 4.0 OCR engine was used to transcribe the scanned news archives, and both archive-level and decade-level word embeddings were learnt. The accuracy of the learnt word embeddings was evaluated on a benchmark of analogy tasks.
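As a rough illustration of what an analogy benchmark measures, the following minimal sketch scores questions of the form "a is to b as c is to d" with the common vector-offset method. It is not the evaluation code used for this dataset: the toy vectors, the function names, and the choice to skip questions containing out-of-vocabulary words are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Assumption: questions with any out-of-vocabulary word are skipped.
    """
    answerable = [q for q in questions if all(w in vectors for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)

# Hypothetical toy vectors, chosen by hand so the offsets line up exactly.
TOY_VECTORS = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

QUESTIONS = [
    ("man", "woman", "king", "queen"),
    ("man", "king", "woman", "queen"),
    ("man", "woman", "prince", "princess"),  # skipped: words not in vocabulary
]

accuracy = analogy_accuracy(TOY_VECTORS, QUESTIONS)
```

A per-relation accuracy can be obtained by grouping the questions by relation and calling `analogy_accuracy` on each group.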

Folder Navigation: The two zipped folders are models and evaluations.

  • The models folder contains three subdirectories: assafir_models, hayat_models, and nahar_models. Each directory corresponds to one news archive. These directories contain the decade-level and archive-level Word2Vec (CBOW) models for that archive, named [min year]_[max year].model. Each model has an accompanying [min year]_[max year].txt file, which lists the filenames of the transcribed documents used to train that model and ends with the set of years covered and the number of documents used.
  • The evaluations folder contains three xls files and three text files. Each xls file is a workbook containing several spreadsheets, one per trained model; each spreadsheet reports that model's evaluation across all the relations of the benchmark file, together with a total accuracy. The spreadsheets are also named [min year]_[max year]. The three text files are the log files generated during evaluation and are named logger_[archive_name].txt.
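The naming scheme above lends itself to simple scripting. The sketch below indexes the models in one archive directory by their year range. It assumes four-digit years in the filenames, and the optional loader assumes the .model files were saved with gensim's `Word2Vec.save`, which the description does not state; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: years in the filenames are four digits, e.g. 1933_1939.model
MODEL_NAME = re.compile(r"^(\d{4})_(\d{4})\.model$")

def parse_model_name(filename):
    """Return (min_year, max_year) for a model filename, or None if it does not match."""
    match = MODEL_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def list_models(archive_dir):
    """Return (min_year, max_year, path) tuples for one archive directory, sorted by years."""
    found = []
    for path in Path(archive_dir).iterdir():
        years = parse_model_name(path.name)
        if years is not None:
            found.append((years[0], years[1], path))
    return sorted(found, key=lambda item: (item[0], item[1]))

def load_model(path):
    """Load one model with gensim (third-party).

    Assumption: the file was written by gensim's Word2Vec.save.
    """
    from gensim.models import Word2Vec
    return Word2Vec.load(str(path))
```

For example, `list_models("models/nahar_models")` would return the year ranges available for that archive, assuming the zipped models folder was extracted to `models/`. The widest range is presumably the archive-level model.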




