3908507
doi
10.5281/zenodo.3908507
oai:zenodo.org:3908507
PTNews Corpus
Nunes, Davide
info:eu-repo/semantics/openAccess
Other (Non-Commercial)
corpus
dataset
portuguese
language modelling
<p>The <strong>PTNews Corpus</strong> is a collection of over 19 million tokens extracted from 10 years of Portuguese-language political news articles published by the <strong>Portuguese</strong> newspaper <a href="https://www.publico.pt/">PÚBLICO</a>. The corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike Licence. The material contained in the PTNews Corpus is © 2010-2020 <a href="https://www.publico.pt/">PÚBLICO Comunicação Social SA</a>.</p>
<p>In size, the corpus falls between the preprocessed version of the Penn Treebank (PTB) and <a href="https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/">WikiText-103</a>. Like WikiText, PTNews has a larger vocabulary than PTB and retains the original case, punctuation and numbers. The corpus contains over 31,000 publicly available full articles, which makes it well suited for models that can take advantage of long-term dependencies.<br>
<br>
The corpus is available as a <strong>word-level</strong> collection of articles in two versions. The first (ptnews_origin) contains a single file with all the articles in the form <strong>title</strong>, <strong>URL</strong>, <strong>date</strong>, <strong>body</strong>. The second contains only the <strong>title</strong> and <strong>body</strong> of each news article and is split into <strong>train</strong>, <strong>test</strong> and <strong>validation</strong> sets. In this processed version, words with fewer than 3 occurrences are mapped to the <em><unk></em> token. Each sentence in an article body occupies a single line of the dataset, and the end of a paragraph is marked with the <em><eop></em> tag at the end of its last sentence. Portuguese contractions such as <em>"desta"</em> or <em>"nesta"</em> are split into <em>"d"</em>, <em>"esta"</em> and <em>"n"</em>, <em>"esta"</em>, respectively.<br>
</p>
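<p>The rare-word mapping described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing script: the function name <em>map_rare_words</em> and the toy sentences are hypothetical, and it assumes that counts are taken over whatever sentences are passed in (the record does not say whether the original counts came from the whole corpus or only from the training split).</p>

```python
from collections import Counter

UNK = "<unk>"
MIN_COUNT = 3  # per the description: words with fewer than 3 occurrences become <unk>


def map_rare_words(sentences, min_count=MIN_COUNT):
    """Replace every token seen fewer than `min_count` times with UNK.

    `sentences` is a list of whitespace-tokenised strings (one sentence each).
    Counts are computed over the given sentences only (an assumption).
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return [
        " ".join(tok if counts[tok] >= min_count else UNK for tok in sent.split())
        for sent in sentences
    ]


# Toy input (hypothetical, not taken from the corpus files).
toy = ["o país ouviu", "o país", "o presidente ouviu o país"]
processed = map_rare_words(toy)
for line in processed:
    print(line)
```

<p>In the toy input only "o" and "país" reach three occurrences, so every other token is replaced.</p>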
<p><strong>Sample article</strong>:<br>
</p>
<pre><code class="language-bash">Carlos César : Cavaco " cansado e sem entusiasmo " quis afastar responsabilidades sobre a crise
https://publico.pt/2010/06/10/politica/noticia/carlos-cesar-cavaco-cansado-e-sem-entusiasmo-quis-afastar-responsabilidades-sobre-a-crise-1441369
2010-06-10 15:38:00
O presidente do Governo Regional dos Açores , Carlos César , considerou hoje que Cavaco Silva esteve " cansado e sem entusiasmo " no discurso do Dia de Portugal , onde afastou responsabilidades sobre a actual crise . <eop>
" O país ouviu um Presidente cansado e sem entusiasmo , que andou às voltas com os papéis para dizer que não tinha nada a ver com as razões da crise " , afirmou Carlos César , num comentário à Lusa sobre o discurso do Presidente da República na cerimónia oficial do 10 de Junho , realizada em Faro . <eop>
Carlos César considerou , no entanto , " positivo " que Cavaco Silva tenha feito " um discurso alinhado com um tema recorrente na apreciação do momento que vivemos , o da coesão e da corresponsabilização " . <eop>
No mesmo sentido , manifestou concordância com o apelo que Cavaco Silva fez " à responsabilidade dos empregadores e empregados " , mas deixou um alerta relativamente à referência do Presidente da República à necessidade de " limpar Portugal " . <eop>
Para Carlos César , se essa referência " for despida de conteúdo institucional útil , tratou-se de mais um discurso que se perderá na babugem política d aquilo que Cavaco Silva entendeu recordar como o ' rectângulo ' " . <eop></code></pre>
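<p>A record laid out like the sample above could be read with a sketch along these lines. It assumes, based only on the sample, that the first three lines hold the title, URL and date (formatted as <em>YYYY-MM-DD HH:MM:SS</em>) and that every following line is one sentence of the body. The <em>Article</em> class, the <em>parse_article</em> function and the example record are hypothetical, and how consecutive articles are separated inside the ptnews_origin file is not covered here.</p>

```python
from dataclasses import dataclass
from datetime import datetime

EOP = "<eop>"  # end-of-paragraph tag used by the corpus


@dataclass
class Article:
    title: str
    url: str
    date: datetime
    paragraphs: list  # list of paragraphs, each a list of sentence strings


def parse_article(record: str) -> Article:
    """Parse one record: title, URL, date, then one body sentence per line."""
    lines = [ln.strip() for ln in record.strip().splitlines() if ln.strip()]
    title, url, date_str, *body = lines

    paragraphs, current = [], []
    for sentence in body:
        if sentence.endswith(EOP):
            # The tag closes the current paragraph; drop it from the text.
            current.append(sentence[: -len(EOP)].rstrip())
            paragraphs.append(current)
            current = []
        else:
            current.append(sentence)
    if current:  # trailing sentences without a closing tag
        paragraphs.append(current)

    date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    return Article(title, url, date, paragraphs)


# Hypothetical record with a placeholder URL, mirroring the sample layout.
record = """\
Título de exemplo
https://example.org/noticia
2010-06-10 15:38:00
Primeira frase .
Segunda frase . <eop>
Terceira frase . <eop>
"""
article = parse_article(record)
print(article.title, article.date.isoformat(), len(article.paragraphs))
```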
<p><strong>Reporting Results</strong><br>
If you wish to report results or other resources obtained on the PTNews Corpus, contact <a href="mailto:davidenunes@pm.me">Davide Nunes</a> with the following information:</p>
<ul>
<li><strong>Task</strong>: e.g. Language Modelling, Semantic Similarity, etc;</li>
<li><strong>Publication URL</strong>: URL of the published article or preprint;</li>
<li><strong>Type of Model</strong>: LSTM Neural Network, n-grams, GloVe vectors, etc;</li>
<li><strong>Evaluation Metrics</strong>: e.g. validation and testing perplexities in the case of language modelling.</li>
</ul>
<p>Reported results will be displayed <a href="https://davidenunes.com/ptnews">here</a>.</p>
<p><strong>Preprocessed Corpus Statistics</strong></p>
<ul>
<li>articles: 31,919</li>
<li>articles by split:
<ul>
<li>train: 25,537</li>
<li>test: 3,191</li>
<li>val: 3,191</li>
</ul>
</li>
<li>unique tokens: 68,318</li>
<li>unique OoV tokens: 76,157</li>
<li>total tokens: 19,021,661</li>
<li>total OoV tokens: 95,043</li>
<li>OoV rate: 0.5%</li>
<li>tokens by split:
<ul>
<li>train: 15,242,995</li>
<li>test: 1,895,184</li>
<li>val: 1,883,482</li>
</ul>
</li>
</ul>
</li>
</ul>
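<p>As a quick consistency check, the figures listed above can be recombined as in the sketch below. The numbers are copied from the list; the variable names are illustrative only. It confirms that the per-split counts add up to the totals and reproduces the reported OoV rate as total OoV tokens divided by total tokens.</p>

```python
# Figures copied from the statistics list above.
articles = {"train": 25537, "test": 3191, "val": 3191}
tokens = {"train": 15242995, "test": 1895184, "val": 1883482}
total_articles = 31919
total_tokens = 19021661
oov_tokens = 95043

# The per-split counts should add up to the reported totals.
assert sum(articles.values()) == total_articles
assert sum(tokens.values()) == total_tokens

# OoV rate = OoV tokens / all tokens.
oov_rate = oov_tokens / total_tokens

# Share of articles in each split (roughly 80/10/10).
split_share = {name: count / total_articles for name, count in articles.items()}

print(f"OoV rate: {oov_rate:.1%}")
for name, share in split_share.items():
    print(f"{name}: {share:.1%} of articles")
```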
<p><strong>Contact Information</strong></p>
<p>If you have questions about the corpus or want to report benchmark results, contact <a href="mailto:davidenunes@pm.me">Davide Nunes</a>.</p>
Zenodo
2020-06-25
info:eu-repo/semantics/other
3908506
1
1593279234.860857
35914171
md5:a6feb6bbba29913daee1e855793bab1f
https://zenodo.org/records/3908507/files/ptnews.tar.gz
36895663
md5:a2082d999ec5394f8de3edc866ba7670
https://zenodo.org/records/3908507/files/ptnews_origin.tar.gz
21027
md5:27671fca8c18e8ee61aad115bb041d9e
https://zenodo.org/records/3908507/files/LICENCE.txt
public
10.5281/zenodo.3908506
isVersionOf
doi