Project deliverable Open Access

Deliverable 3.1. Scalable NLP pipelines

Aitor Gonzalez Aguirre; Joan Llop Palao; Marc Pàmies Massip; Marta Villegas

The Intelcomp NLP pipeline can be defined as a collection of tools that apply the requested
transformations to unstructured textual data. The Intelcomp services (document classification,
subcorpus generation, topic modeling, etc.) will use it as a preliminary step when processing
the datasets of interest. It has been designed to carry out standard text preprocessing
tasks (e.g. n-gram detection, keyword extraction, lemmatization) in a High Performance
Computing (HPC) environment, allowing the efficient and scalable processing of large numbers of
documents. The final version of the pipeline will be deployed over the HPC infrastructure
provided by the Barcelona Supercomputing Center and fully integrated with Intelcomp's Data
Space. This document should serve not only as a report of the work performed by the
members of WP3 but also as a complete guide for the targeted users and operators of the
