LDA-Mallet topic model for CPV-45 Spanish tender titles from Procurement Metadata
Description
This record contains an LDA-Mallet topic model trained on the titles of Spanish tenders with CPV 45 from the Spanish Procurement metadata dataset. The data were collected by crawling the Spanish government’s “Plataforma de contratación del sector público”. Before modeling, the tender titles were lemmatized and stopwords were removed.
The number of topics used for training was chosen by optimizing both coherence (Cv) and dispersion among the identified topics; judging by the model folder name, the selected model has 15 topics.
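The record does not say how dispersion was measured or how it was weighed against coherence. The following is only a minimal sketch of one plausible procedure, assuming dispersion is the mean pairwise Jensen-Shannon divergence between topic-word distributions and that the two scores are min-max normalised and averaged. The function names `js_divergence`, `topic_dispersion` and `select_n_topics` are hypothetical and are not taken from the NP-Search-Tool code.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing to the sum
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_dispersion(betas):
    """Mean pairwise JS divergence between the rows (topics) of `betas`.

    Assumption: `betas` has shape (n_topics, vocab_size), each row sums to 1,
    and there are at least two topics.
    """
    betas = np.asarray(betas, dtype=float)
    k = betas.shape[0]
    values = [
        js_divergence(betas[i], betas[j])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    return float(np.mean(values))


def select_n_topics(scores):
    """Pick the candidate with the best average of normalised scores.

    `scores` maps a candidate number of topics to a pair
    (mean Cv coherence, dispersion). The equal weighting used here is an
    assumption made for illustration only.
    """
    candidates = sorted(scores)
    arr = np.array([scores[k] for k in candidates], dtype=float)
    span = arr.max(axis=0) - arr.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero when all values are equal
    normalised = (arr - arr.min(axis=0)) / span
    return candidates[int(np.argmax(normalised.mean(axis=1)))]
```

With this definition, dispersion is 0 when all topics are identical and 1 when every pair of topics has disjoint support, so higher values indicate better-separated topics.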
The model is composed of the following folders and files:
- model_data: Stores model-related data and outputs.
- train_data: Contains data used for training the model.
es_Mallet_all_CPV_45_15_topics
├── model_data # Stores model-related data and outputs
│ ├── TMmodel # Directory for various model-related files
│ │ ├── alphas.npy # NumPy array storing alpha parameters (topics' size)
│ │ ├── alphas_orig.npy # Original alpha parameters (in case modifications to the alphas file are made)
│ │ ├── betas.npy # NumPy array storing beta parameters
│ │ ├── betas_ds.npy # Downsampled beta parameters
│ │ ├── betas_orig.npy # Original beta parameters
│ │ ├── edits.txt # Text file for documenting edits
│ │ ├── ndocs_active.npy # NumPy array storing the number of active documents
│ │ ├── pyLDAvis.html # HTML file for visualizing topic models (PyLDAvis)
│ │ ├── thetas.npz # NumPy file storing theta parameters (document-topic distribution)
│ │ ├── thetas_orig.npz # Original theta parameters
│ │ ├── topic_coherence.npy # NumPy array storing topic coherence scores
│ │ ├── tpc_coords.txt # Text file storing topic coordinates
│ │ ├── tpc_descriptions.txt # Text file for topic descriptions
│ │ ├── tpc_labels.txt # Text file storing curated topic labels (no changes to the ChatGPT ones were made)
│ │ └── vocab.txt # Text file storing vocabulary
│ ├── corpus_train.mallet # Training corpus in Mallet format
│ ├── corpus_train.txt # Training corpus in text format
│ ├── diagnostics.xml # XML file for model diagnostics obtained from Mallet
│ ├── dictionary.gensim # Gensim dictionary file
│ ├── doc-topics.txt # Document-topic distribution after training
│ ├── inferencer.mallet # Mallet inferencer file
│ ├── model.pickle # Pickle file storing the trained model
│ ├── topic-keys.json # JSON file storing topic keys
│ ├── topic-keys.txt # Text file storing topic keys
│ ├── topic-report.xml # XML file for topic report
│ ├── vocab_freq.txt # Text file storing vocabulary frequency
│ ├── vocabulary.txt # Text file storing vocabulary
│ └── word-topic-counts.txt # Text file storing word-topic counts
├── train_data # Contains data used for training the model
│ ├── corpus.mallet # Corpus in Mallet format for training
│ ├── corpus.txt # Raw text data for training
│ ├── corpus_aux.txt # Auxiliary text file for the corpus
│ ├── import.pipe # Pipe file for importing inference data
│ └── train.config # Configuration file for training
└── trainconfig.json # JSON file containing the training configuration
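As an illustration of how the files in TMmodel might be used, the sketch below loads `betas.npy` and `vocab.txt` and lists the highest-weighted words for each topic. It assumes that `betas.npy` holds a dense array of shape (number of topics, vocabulary size) and that `vocab.txt` contains one term per line in the same order as the columns of that array; neither assumption is confirmed by this record, so check the files before relying on it. The helper names `load_betas_and_vocab` and `top_words` are hypothetical.

```python
from pathlib import Path

import numpy as np


def load_betas_and_vocab(tmmodel_dir):
    """Load the topic-word matrix and the vocabulary from a TMmodel folder."""
    tmmodel_dir = Path(tmmodel_dir)
    betas = np.load(tmmodel_dir / "betas.npy")
    with open(tmmodel_dir / "vocab.txt", encoding="utf-8") as fh:
        # Assumption: one vocabulary term per non-empty line.
        vocab = [line.strip() for line in fh if line.strip()]
    if betas.ndim != 2 or betas.shape[1] != len(vocab):
        raise ValueError(
            f"betas shape {betas.shape} does not match {len(vocab)} vocabulary terms"
        )
    return betas, vocab


def top_words(betas, vocab, n=10):
    """Return, for each topic, its n highest-weighted words in descending order."""
    order = np.argsort(-betas, axis=1)[:, :n]
    return [[vocab[j] for j in row] for row in order]


if __name__ == "__main__":
    # Path taken from the folder layout listed above.
    folder = Path("es_Mallet_all_CPV_45_15_topics") / "model_data" / "TMmodel"
    if folder.exists():
        betas, vocab = load_betas_and_vocab(folder)
        for topic_id, words in enumerate(top_words(betas, vocab)):
            print(topic_id, " ".join(words))
```

The shape check is there to fail early if the vocabulary file has a different layout (for example, extra columns or a header) from what this sketch assumes.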
Files
| Name | Size |
|---|---|
| md5:89bf353463d140c2c8dd78d499482838 | 451.8 MB |
Additional details
Dates
- Updated: 2024-04-29
Software
- Repository URL: https://github.com/nextprocurement/NP-Search-Tool
- Programming language: Python
- Development Status: Active