LDA-Mallet topic model for CPV-45 Spanish tender titles from Procurement Metadata

Carlos III University of Madrid

doi:10.5281/zenodo.11089900

Published April 30, 2024 | Version v1.0

Resource type: Model | Access: Open

Contributors

Researchers:

Supervisor:

Arenas García, Jerónimo¹

1. Universidad Carlos III de Madrid

This record contains an LDA-Mallet topic model trained on the titles of Spanish tenders classified under CPV 45 (construction work), taken from the Spanish Procurement metadata dataset. The data was collected by crawling the Spanish government’s “Plataforma de contratación del sector público”. Before modeling, the tender titles were lemmatized and stopwords were removed.
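The preprocessing described above can be illustrated with a minimal sketch. The record does not say which lemmatizer or stopword list was used, so the tiny `STOPWORDS` set and `LEMMAS` lookup table below are hypothetical stand-ins, chosen only to show the order of operations (tokenize, drop stopwords, lemmatize).

```python
import re

# Hypothetical stand-ins: a real pipeline would use a full Spanish stopword
# list and a trained lemmatizer instead of these hand-written tables.
STOPWORDS = {"de", "la", "el", "en", "y", "para", "del", "las", "los"}
LEMMAS = {"obras": "obra", "edificios": "edificio", "calles": "calle"}


def preprocess_title(title: str) -> list[str]:
    """Lowercase, tokenize, remove stopwords, then map tokens to lemmas."""
    tokens = re.findall(r"[a-záéíóúüñ]+", title.lower())
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    # Tokens missing from the toy lemma table are passed through unchanged.
    return [LEMMAS.get(tok, tok) for tok in kept]


if __name__ == "__main__":
    print(preprocess_title("Obras de reforma en edificios municipales"))
```

Words absent from the lookup table are left as they are, which is why a real lemmatizer is needed for anything beyond illustration.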

The number of topics (15, as the model folder name indicates) was selected by optimizing both the Cv coherence score and the dispersion among the identified topics.
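The record does not define how dispersion was measured or how it was combined with coherence. As one possible reading, the sketch below treats dispersion as the mean pairwise Jensen-Shannon divergence between topic-word distributions and picks the candidate with the best weighted sum of rescaled coherence and dispersion. The function names, the `weight` parameter and the combination rule are all assumptions made for illustration.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = float(np.sum(p * np.log2(p / m)))
    kl_qm = float(np.sum(q * np.log2(q / m)))
    return 0.5 * kl_pm + 0.5 * kl_qm


def topic_dispersion(betas):
    """Mean pairwise JS divergence over the rows (topics) of betas."""
    betas = np.asarray(betas, dtype=float)
    k = betas.shape[0]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return float(np.mean([js_divergence(betas[i], betas[j]) for i, j in pairs]))


def pick_num_topics(scores, weight=0.5):
    """scores maps a topic count to (coherence, dispersion).

    Assumed rule: min-max rescale each criterion across the candidates and
    return the topic count with the highest weighted sum.
    """
    ks = sorted(scores)
    coh = np.array([scores[k][0] for k in ks], dtype=float)
    disp = np.array([scores[k][1] for k in ks], dtype=float)

    def rescale(x):
        span = x.max() - x.min()
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span

    combined = weight * rescale(coh) + (1 - weight) * rescale(disp)
    return ks[int(np.argmax(combined))]


if __name__ == "__main__":
    # Toy topic-word matrix, not taken from the published model.
    toy_betas = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
    print(topic_dispersion(toy_betas))
```

Higher dispersion means the topics overlap less in vocabulary, so favoring it alongside coherence discourages near-duplicate topics.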

The model archive contains the following folders and files:

model_data: Stores model-related data and outputs.
train_data: Contains data used for training the model.

es_Mallet_all_CPV_45_15_topics
├── model_data                     # Stores model-related data and outputs
│   ├── TMmodel                    # Directory for various model-related files
│   │   ├── alphas.npy             # NumPy array storing alpha parameters (topics' size)
│   │   ├── alphas_orig.npy        # Original alpha parameters (in case modifications to the alphas file are made)
│   │   ├── betas.npy              # NumPy array storing beta parameters
│   │   ├── betas_ds.npy           # Downsampled beta parameters
│   │   ├── betas_orig.npy         # Original beta parameters
│   │   ├── edits.txt              # Text file for documenting edits
│   │   ├── ndocs_active.npy       # NumPy array storing the number of active documents
│   │   ├── pyLDAvis.html          # HTML file for visualizing topic models (PyLDAvis)
│   │   ├── thetas.npz             # NumPy file storing theta parameters (document-topic distribution)
│   │   ├── thetas_orig.npz        # Original theta parameters
│   │   ├── topic_coherence.npy    # NumPy array storing topic coherence scores
│   │   ├── tpc_coords.txt         # Text file storing topic coordinates
│   │   ├── tpc_descriptions.txt   # Text file for topic descriptions
│   │   ├── tpc_labels.txt         # Text file storing curated topic labels (the ChatGPT-generated labels were kept unchanged)
│   │   └── vocab.txt              # Text file storing vocabulary
│   ├── corpus_train.mallet        # Training corpus in Mallet format
│   ├── corpus_train.txt           # Training corpus in text format
│   ├── diagnostics.xml            # XML file for model diagnostics obtained from Mallet
│   ├── dictionary.gensim          # Gensim dictionary file
│   ├── doc-topics.txt             # Document-topic distribution after training
│   ├── inferencer.mallet          # Mallet inferencer file
│   ├── model.pickle               # Pickle file storing the trained model
│   ├── topic-keys.json            # JSON file storing topic keys
│   ├── topic-keys.txt             # Text file storing topic keys
│   ├── topic-report.xml           # XML file for topic report
│   ├── vocab_freq.txt             # Text file storing vocabulary frequency
│   ├── vocabulary.txt             # Text file storing vocabulary
│   └── word-topic-counts.txt      # Text file storing word-topic counts
├── train_data                     # Contains data used for training the model
│   ├── corpus.mallet              # Corpus in Mallet format for training
│   ├── corpus.txt                 # Raw text data for training
│   ├── corpus_aux.txt             # Auxiliary text file for the corpus
│   ├── import.pipe                # Pipe file for importing inference data
│   └── train.config               # Configuration file for training
└── trainconfig.json               # JSON file containing the training configuration
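As a hedged illustration of how the `TMmodel` files in the tree above might be read, the sketch below loads `alphas.npy`, `betas.npy` and `vocab.txt` with NumPy and lists the highest-weight words per topic. It assumes that `betas.npy` has shape (topics, vocabulary size), that `vocab.txt` holds one term per line in the same column order, and that the helper names `load_tmmodel` and `top_words` are free to choose; none of this is confirmed by the record.

```python
from pathlib import Path

import numpy as np


def load_tmmodel(tm_dir):
    """Load topic sizes, topic-word weights and vocabulary from a TMmodel folder."""
    tm_dir = Path(tm_dir)
    alphas = np.load(tm_dir / "alphas.npy")
    betas = np.load(tm_dir / "betas.npy")
    # Assumption: one vocabulary term per line, aligned with the columns of betas.
    vocab = (tm_dir / "vocab.txt").read_text(encoding="utf-8").splitlines()
    if betas.shape[1] != len(vocab):
        raise ValueError("betas columns and vocabulary length do not match")
    return alphas, betas, vocab


def top_words(betas, vocab, n=10):
    """Return, for each topic, its n highest-weight vocabulary terms."""
    order = np.argsort(-betas, axis=1)[:, :n]
    return [[vocab[idx] for idx in row] for row in order]


if __name__ == "__main__":
    # Path assumes the archive was extracted into the current directory.
    folder = Path("es_Mallet_all_CPV_45_15_topics/model_data/TMmodel")
    if folder.exists():
        alphas, betas, vocab = load_tmmodel(folder)
        for topic_id, words in enumerate(top_words(betas, vocab)):
            print(topic_id, float(alphas[topic_id]), " ".join(words))
```

The shape check fails early if the vocabulary file uses a different layout than assumed, which is safer than silently mislabeling words.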

Files (451.8 MB)

Name	Size	MD5
es_Mallet_all_CPV_45_15_topics.gz	451.8 MB	89bf353463d140c2c8dd78d499482838
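Because the record lists an MD5 checksum for the archive, a download can be checked before extraction. The following is a minimal sketch using Python's standard `hashlib`; the local file path and the helper names `md5_of_file` and `verify_download` are assumptions.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record for es_Mallet_all_CPV_45_15_topics.gz.
EXPECTED_MD5 = "89bf353463d140c2c8dd78d499482838"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path of the downloaded archive.
    archive = Path("es_Mallet_all_CPV_45_15_topics.gz")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum mismatch")
```

Reading in chunks keeps memory use flat, which matters for an archive of this size.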

Additional details

Updated: 2024-04-29

Repository URL: https://github.com/nextprocurement/NP-Search-Tool
Programming language: Python
Development Status: Active

	All versions	This version
Views	61	61
Downloads	16	16
Data volume	7.2 GB	7.2 GB
