Time-Aware Word Embeddings of Three Lebanese News Archives

doi:10.5281/zenodo.3538880

Published November 25, 2019 | Version v1

Conference paper Open

1. American University of Beirut

Abstract: Word embeddings have proven to be an effective method for capturing semantic relations among distinct terms within a large corpus. In this paper, we present a set of word embeddings learnt from three large Lebanese news archives, which collectively consist of 609,386 scanned newspaper images and span a total of 151 years, ranging from 1933 to 2011. To train the word embeddings, Google’s Tesseract 4.0 OCR engine was used to transcribe the scanned news archives, and both archive-level and decade-level word embeddings were learnt. To evaluate the accuracy of the learnt word embeddings, a benchmark of analogy tasks was used.

Folder Navigation: The two zipped folders are models and evaluations.

The models folder contains three subdirectories: assafir_models, hayat_models, and nahar_models. Each directory corresponds to one news archive. The contents of these directories are decade-level and archive-level Word2Vec (CBOW) models, named [min year]_[max year].model for each archive. Each model has an accompanying [min year]_[max year].txt file, which lists the filenames of the transcribed documents used to train that model and ends with the set of years covered and the number of documents used.
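The naming scheme above lends itself to a small indexing helper. The following is a minimal sketch, not part of the dataset: the function name index_models, the four-digit year pattern, and the returned tuple layout are all assumptions made for illustration. It only pairs each [min year]_[max year].model file with its .txt listing and does not load the embeddings themselves.

```python
import re
from pathlib import Path

# Assumed pattern: four-digit years, e.g. "<min year>_<max year>.model".
# Auxiliary files that a library may store next to a model are ignored.
MODEL_NAME = re.compile(r"^(\d{4})_(\d{4})\.model$")


def index_models(models_root):
    """Map each archive directory name to a sorted list of
    (min_year, max_year, model_path, listing_path_or_None) tuples."""
    index = {}
    for archive_dir in sorted(Path(models_root).iterdir()):
        if not archive_dir.is_dir():
            continue
        entries = []
        for path in archive_dir.iterdir():
            match = MODEL_NAME.match(path.name)
            if match is None:
                continue
            min_year = int(match.group(1))
            max_year = int(match.group(2))
            # The accompanying document listing shares the model's stem.
            listing = path.with_suffix(".txt")
            entries.append(
                (min_year, max_year, path, listing if listing.exists() else None)
            )
        # Sort by year range only, so paths and None are never compared.
        index[archive_dir.name] = sorted(entries, key=lambda e: (e[0], e[1]))
    return index
```

Calling index_models on the unzipped models folder should return one key per archive directory. If the models were saved with gensim (an assumption; the record does not name the library), each model_path could then be passed to gensim's Word2Vec.load.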
The evaluations folder contains three xls files and three text files. Each xls file is a workbook containing several spreadsheets; each spreadsheet holds the evaluation of one trained model across all the relations of the benchmark file, along with a total accuracy. The spreadsheet names also follow the form [min year]_[max year]. The three text files are the log files generated during evaluation, named logger_[archive_name].txt.
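To make the relationship between per-relation results and a total accuracy concrete, here is a minimal sketch. It assumes that the total is the pooled fraction of correctly solved analogies across all relations; the record does not state how the total in the spreadsheets was computed, and the function name summarize and the (correct, total) input layout are hypothetical.

```python
def summarize(relation_counts):
    """Given {relation_name: (correct, total)}, return a pair
    (per-relation accuracy dict, pooled overall accuracy)."""
    per_relation = {}
    correct_sum = 0
    total_sum = 0
    for name, (correct, total) in relation_counts.items():
        # Guard against relations with no answerable analogy questions.
        per_relation[name] = correct / total if total else 0.0
        correct_sum += correct
        total_sum += total
    # Assumption: the overall score pools all questions (micro-average).
    overall = correct_sum / total_sum if total_sum else 0.0
    return per_relation, overall
```

Under this pooling assumption, relations with more questions weigh more heavily in the total than they would in a simple mean of the per-relation accuracies.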

Files

Files (3.7 GB)

Name	Size	MD5
evaluations.zip	41.8 kB	284327ab270754634b151d1d7cd051f3
models.zip	3.7 GB	4f10253ae27fa4e88ae0a861e25e2d01
README.txt	1.4 kB	85e63f488ad62f1d09125b810a207876

