Search and classify topics in a corpus of text using the latent dirichlet allocation model

Orlando Iparraguirre-Villanueva; Fernando Sierra-Liñan; Jose Luis Herrera Salazar; Saul Beltozar-Clemente; Félix Pucuhuayla-Revatta; Joselyn Zapata-Paulini; Michael Cabanillas-Carbonell

doi:10.11591/ijeecs.v30.i1.pp246-256

Published April 1, 2023 | Version v1

Journal article (open access)

1. Facultad de Ingeniería y Arquitectura, Universidad Autónoma del Perú, Lima, Perú
2. Facultad de Ingeniería, Universidad Privada del Norte, Lima, Perú
3. Facultad de Ingeniería, Ciencias y Administración, Universidad Autónoma de Ica, Lima, Perú
4. Dirección de Cursos Básicos, Universidad Científica del Sur, Lima, Perú
5. Facultad de Ingeniería, Universidad Tecnológica del Perú, Lima, Perú
6. Escuela de Posgrado, Universidad Continental, Lima, Perú
7. Vicerrectorado de Investigación, Universidad Privada Norbert Wiener, Lima, Perú

This work aims to discover topics in a text corpus and to classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used to analyze and structure the topics. After processing, 12 topics were generated, which made it possible to rank the most relevant terms and identify the skills of each candidate. This work concludes that candidates interested in data science must have technical skills: mastery of structured query language (SQL), mastery of programming languages such as R, Python, and Java, and data management, among other tools associated with the technology.
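To make the topic-discovery and term-ranking steps concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the authors' implementation: the record does not say which library or inference method they used. The function names `fit_lda` and `top_terms`, the hyperparameters `alpha` and `beta`, and the six-document toy corpus are all assumptions made for illustration.

```python
import random

def fit_lda(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on already-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word

def top_terms(vocab, topic_word, n=3):
    """Return the n terms most often assigned to each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]

# Hypothetical toy corpus standing in for the curriculum documents.
docs = [
    "python sql pandas data pipeline".split(),
    "sql database query data warehouse".split(),
    "python r statistics model regression".split(),
    "r statistics regression model inference".split(),
    "java python programming software data".split(),
    "database sql query java software".split(),
]

vocab, doc_topic, topic_word = fit_lda(docs, n_topics=2)
for k, terms in enumerate(top_terms(vocab, topic_word)):
    print(f"topic {k}: {terms}")
```

Each row of `doc_topic` counts how many tokens of a document were assigned to each topic, so the largest entry gives a simple topic label for that document. On a corpus of the size reported here, a library implementation would be the practical choice; the sketch only shows the mechanics of the model.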

Files

30256-61127-1-PB.pdf (646.3 kB, md5:9612a19922a6b02e74c30e5467962abb)

