Published November 25, 2019 | Version v1
Conference paper Open

Time-Aware Word Embeddings of Three Lebanese News Archives

  • 1. American University of Beirut

Description

Abstract: Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and cover a combined total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google’s Tesseract 4.0 OCR engine was employed to transcribe the scanned news archives, and archive-level as well as decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used.
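
For readers unfamiliar with analogy benchmarks, the following is a minimal sketch of how a single analogy question ("a is to b as c is to ?") can be scored with the commonly used vector-offset method. It is not taken from the paper, which does not state its exact scoring procedure here: the function names, the toy words, and the hand-made 2-d vectors are all illustrative assumptions, and toolkits often normalize vectors before comparing them, which this sketch skips for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        1 for a, b, c, expected in questions
        if solve_analogy(vectors, a, b, c) == expected
    )
    return correct / len(questions)

# Toy vectors made up for illustration; they do not come from the released models.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}
```

With these toy vectors, `solve_analogy(toy, "man", "woman", "king")` picks "queen", because the offset from "man" to "woman" added to "king" lands exactly on the "queen" vector. A per-relation accuracy is then the share of such questions answered correctly.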

Folder Navigation: The two zipped folders are models and evaluations.

  • The models folder contains three subdirectories: assafir_models, hayat_models, and nahar_models. Each directory corresponds to one news archive. These directories contain decade-level and archive-level Word2Vec (CBOW) models named [min year]_[max year].model for each archive. Each model has an accompanying [min year]_[max year].txt file, which lists the filenames of the transcribed documents used to train that model and ends with the set of years and the count of documents used.
  • The evaluations folder contains three xls files and three text files. Each xls file is a workbook containing several spreadsheets, one per trained model; each spreadsheet reports that model's evaluation across all the relations of the benchmark file, along with a total accuracy. The spreadsheet names also follow the form [min year]_[max year]. The three text files are the log files generated during evaluation and are named logger_[archive_name].txt.
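
The naming scheme described above can be navigated programmatically. The sketch below lists the models in one archive directory by year range and finds each model's companion document list. It assumes that years are written as four digits, and the helper names (`parse_year_range`, `list_models`, `document_list_for`, `load_model`) are hypothetical. The `load_model` helper further assumes that the models were saved with gensim, which the description does not state.

```python
import re
from pathlib import Path

# Pattern from the description: [min year]_[max year].model
# Assumption: both years are four-digit numbers.
MODEL_NAME = re.compile(r"^(\d{4})_(\d{4})\.model$")

def parse_year_range(filename):
    """Return (min_year, max_year) for a name like '1933_1939.model', else None."""
    match = MODEL_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def list_models(archive_dir):
    """Map (min_year, max_year) -> Path for each model file in one archive directory."""
    found = {}
    for path in sorted(Path(archive_dir).glob("*.model")):
        years = parse_year_range(path.name)
        if years is not None:
            found[years] = path
    return found

def document_list_for(model_path):
    """Path of the companion .txt file listing the documents used for training."""
    return Path(model_path).with_suffix(".txt")

def load_model(model_path):
    """Load one model. Assumption: the file is in gensim's native Word2Vec format."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim
    return Word2Vec.load(str(model_path))
```

For example, `list_models("models/nahar_models")` would return one entry per decade-level or archive-level model found in that directory, keyed by year range. The relative path shown here is a guess at how the unzipped folder is laid out.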

 

 

Files

evaluations.zip

Files (3.7 GB)

Checksum (size)
md5:284327ab270754634b151d1d7cd051f3 (41.8 kB)
md5:4f10253ae27fa4e88ae0a861e25e2d01 (3.7 GB)
md5:85e63f488ad62f1d09125b810a207876 (1.4 kB)