Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs

DOI: 10.5281/zenodo.1042505
URL: https://zenodo.org/records/1042505
OAI identifier: oai:zenodo.org:1042505
Related identifier: 10.5281/zenodo.1042504

Authors:
Fernando Alva-Manchego, University of Sheffield
Joachim Bingel, University of Copenhagen
Gustavo Henrique Paetzold, University of Sheffield
Carolina Scarton, University
Lucia Specia, Univer

Publisher: Zenodo
Publication year: 2017
Dates: 2017-11-27, 2020-01-20
Language: eng
Communities: https://zenodo.org/communities/h2020-simpatico-692819, https://zenodo.org/communities/eu
License: Creative Commons Attribution 4.0 International

Abstract:
Current research in text simplification (TS) has been hampered by two central problems: (i) the small amount of high-quality parallel simplification data available, and (ii) the lack of explicit annotations of simplification operations, such as deletions or substitutions, on existing data. While the recently introduced Newsela corpus has alleviated the first problem, simplifications still need to be learned directly from parallel text using black-box, end-to-end approaches rather than from explicit annotations. These complex-simple parallel sentence pairs often differ to such a high degree that generalization becomes difficult. End-to-end models also make it hard to interpret what is actually learned from data. We propose a method that decomposes the task of TS into its sub-problems. We devise a way to automatically identify operations in a parallel corpus and introduce a sequence-labeling approach based on these annotations. Finally, we provide insights on the types of transformations that different approaches can model.

Funding: European Commission, 692819, SIMplifying the interaction with Public Administration Through Information technology for Citizens and cOmpanies
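
The abstract mentions automatically identifying simplification operations in a parallel corpus and using them as token-level labels. The record does not describe how the authors do this, so the following is only a minimal sketch under stated assumptions: it aligns the two token sequences with Python's standard `difflib.SequenceMatcher` as a stand-in for a real word aligner, and the function name `label_operations` and the label names `KEEP`, `DELETE`, and `REPLACE` are hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def label_operations(complex_tokens, simple_tokens):
    """Label each token of the complex sentence with an edit operation.

    Assumption: a generic sequence diff stands in for a word aligner.
    Returns (labels, insertions), where labels has one entry per complex
    token and insertions lists (position, tokens) pairs for material that
    appears only in the simplified sentence.
    """
    labels = ["KEEP"] * len(complex_tokens)
    insertions = []
    matcher = SequenceMatcher(a=complex_tokens, b=simple_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Tokens present only in the complex sentence.
            labels[i1:i2] = ["DELETE"] * (i2 - i1)
        elif tag == "replace":
            # Tokens substituted by different tokens in the simple sentence.
            labels[i1:i2] = ["REPLACE"] * (i2 - i1)
        elif tag == "insert":
            # Tokens present only in the simple sentence; record where they go.
            insertions.append((i1, simple_tokens[j1:j2]))
    return labels, insertions


complex_sentence = "The committee subsequently endorsed the proposal".split()
simple_sentence = "The committee approved the proposal".split()
labels, insertions = label_operations(complex_sentence, simple_sentence)
print(list(zip(complex_sentence, labels)))
```

Labels of this shape could then serve as targets for a sequence labeler that predicts, for each input token, which operation to apply. A plain diff cannot detect reordering or paraphrase, so a real pipeline would likely need a proper alignment step.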