Project deliverable Open Access

Deliverable 3.1. Scalable NLP pipelines

Aitor Gonzalez Aguirre; Joan Llop Palao; Marc Pàmies Massip; Marta Villegas

The Intelcomp NLP pipeline is a collection of tools that apply the requested
transformations to unstructured textual data. The Intelcomp services (document
classification, subcorpus generation, topic modeling, etc.) will use these tools as a
preliminary step to process the datasets of interest. The pipeline has been designed to carry
out standard text preprocessing tasks (e.g. n-gram detection, keyword extraction,
lemmatization) in a High Performance Computing (HPC) environment, allowing the efficient and
scalable processing of large numbers of documents. The final version of the pipeline will be
deployed on the HPC infrastructure provided by the Barcelona Supercomputing Center and fully
integrated with Intelcomp's Data Space. This document serves not only as a report of the work
performed by the members of WP3 but also as a complete guide for the intended users and
operators of the pipeline.
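
To make the idea of a pipeline as "a collection of tools that apply the requested transformations" more concrete, the sketch below chains a few standard preprocessing steps over raw text. It is only an illustration under stated assumptions, not the Intelcomp implementation: the function names (tokenize, remove_stopwords, make_bigram_merger, run_pipeline, top_keywords), the tiny stopword list, and the frequency-based bigram and keyword heuristics are all hypothetical, and lemmatization is left out because it would require a language model or dictionary.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

Tokens = List[str]
Step = Callable[[Tokens], Tokens]

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"the", "of", "a", "and", "to", "in", "is"}


def tokenize(text: str) -> Tokens:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens: Tokens) -> Tokens:
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def make_bigram_merger(corpus: Iterable[Tokens], min_count: int = 2) -> Step:
    """Build a step that joins adjacent token pairs seen at least
    min_count times in the corpus (a simple frequency heuristic)."""
    counts: Counter = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    def merge(tokens: Tokens) -> Tokens:
        out: Tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                out.append(tokens[i])
                i += 1
        return out

    return merge


def run_pipeline(text: str, steps: List[Step]) -> Tokens:
    """Tokenize the text, then apply the requested steps in order."""
    tokens = tokenize(text)
    for step in steps:
        tokens = step(tokens)
    return tokens


def top_keywords(tokens: Tokens, k: int = 3) -> Tokens:
    """Return the k most frequent tokens as a crude keyword list."""
    return [word for word, _ in Counter(tokens).most_common(k)]


if __name__ == "__main__":
    docs = [
        "Topic modeling of the corpus",
        "Topic modeling is a preliminary step",
    ]
    merge = make_bigram_merger(remove_stopwords(tokenize(d)) for d in docs)
    for doc in docs:
        print(run_pipeline(doc, [remove_stopwords, merge]))
```

Because each document is processed independently once the bigram statistics are known, a design of this kind can in principle be distributed across many workers, which is the property an HPC deployment relies on.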

Files (327.7 kB)
D3.1. Scalable NLP pipelines.pdf (327.7 kB)
md5:9adfeff0fd1e3869abd127c803ca532d
