Text Simplification from Professionally Produced Corpora

Carolina Scarton; Gustavo Henrique Paetzold; Lucia Specia

doi:10.5281/zenodo.1410451

Published May 7, 2018 | Version v1

Conference paper Open

1. University of Sheffield

The lack of large and reliable datasets has been hindering progress in Text Simplification (TS). We investigate the application of the recently created Newsela corpus, the largest collection of professionally written simplifications available, in TS tasks. Using new alignment algorithms, we extract 550,644 complex-simple sentence pairs from the corpus. This data is explored in different ways: (i) we show that traditional readability metrics capture the different complexity levels in this corpus surprisingly well, (ii) we build machine learning models that classify sentences as complex or simple and predict complexity levels, outperforming their respective baselines, (iii) we introduce a lexical simplifier that uses the corpus to generate candidate simplifications and outperforms state-of-the-art approaches, and (iv) we show that the corpus can be used to learn sentence simplification patterns more effectively than corpora used in previous work.

Files

1063.pdf (242.5 kB)
md5:0c84211c53b25ed696a374ec0e64d673

Additional details

Funding: European Commission, SIMPATICO – SIMplifying the interaction with Public Administration Through Information technology for Citizens and cOmpanies (grant 692819)


