Published April 30, 2024 | Version v1.0
Model · Open Access

LDA-Mallet topic model for Spanish Tender titles from Procurement metadata

  • Universidad Carlos III de Madrid

Description

This record contains an LDA-Mallet topic model trained on the subset of tender titles identified as Spanish by a language detection tool applied to the Spanish Procurement metadata dataset. That dataset was gathered by crawling the Spanish government’s “Plataforma de contratación del sector público”.

Before modeling, the tender titles were lemmatized and stripped of stopwords, and a filter retained only titles containing more than two words. Inference was then run on the processed titles with the trained LDA-Mallet model.
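The preprocessing steps above (lemmatization, stopword removal, and the minimum-length filter) can be sketched as follows. The lemma map and stopword list here are toy stand-ins for illustration only; the record does not specify which NLP toolkit was used, and a real pipeline would rely on a tool with a proper Spanish model.

```python
# Sketch of the title preprocessing: lemmatize, drop stopwords,
# and keep only titles with more than two remaining words.
# STOPWORDS and LEMMAS below are hypothetical stand-ins.

STOPWORDS = {"de", "del", "la", "el", "los", "las", "y", "para", "en"}
LEMMAS = {"servicios": "servicio", "obras": "obra", "suministros": "suministro"}

def preprocess_title(title: str) -> list[str]:
    """Lowercase, lemmatize, and remove stopwords from one tender title."""
    tokens = title.lower().split()
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in lemmas if tok not in STOPWORDS]

def filter_corpus(titles: list[str]) -> list[list[str]]:
    """Keep only titles with more than two words after cleaning."""
    cleaned = (preprocess_title(t) for t in titles)
    return [toks for toks in cleaned if len(toks) > 2]

corpus = filter_corpus([
    "Suministros de material de oficina para el ayuntamiento",
    "Obras de reforma",  # only two words survive cleaning: dropped
    "Servicios de limpieza de edificios municipales",
])
print(corpus)
```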

The number of topics was selected by jointly optimizing topic coherence (Cv) and the dispersion among the identified topics.
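The selection loop can be illustrated as follows. The record reports Cv coherence, which in practice is usually computed with gensim's CoherenceModel; as a self-contained stand-in, the sketch below scores hypothetical candidate topic sets with a simple UMass-style coherence derived from document co-occurrence counts and picks the topic count with the best average score.

```python
import math
from itertools import combinations

def umass_coherence(top_words: list[str], docs: list[set[str]]) -> float:
    """UMass-style coherence: sum of log((D(wi, wj) + 1) / D(wj)) over word pairs,
    where D counts documents containing the word(s)."""
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        d_j = sum(1 for d in docs if wj in d)
        d_ij = sum(1 for d in docs if wi in d and wj in d)
        if d_j:
            score += math.log((d_ij + 1) / d_j)
    return score

# Toy corpus of preprocessed titles (as bags of words).
docs = [{"obra", "reforma", "edificio"},
        {"servicio", "limpieza", "edificio"},
        {"suministro", "material", "oficina"}]

# Hypothetical top words per topic for two candidate topic counts.
candidates = {
    2: [["obra", "reforma"], ["servicio", "limpieza"]],
    3: [["obra", "edificio"], ["servicio", "limpieza"], ["suministro", "oficina"]],
}

# Average coherence per candidate; the best-scoring count is selected.
scores = {k: sum(umass_coherence(t, docs) for t in topics) / len(topics)
          for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```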

The model is composed of the following folders and files:

  • infer_data: Contains data and outputs related to the inference process.

  • model_data: Stores model-related data and outputs.

  • train_data: Contains data used for training the model.
es_Mallet_all_55_topics
├── infer_data                     # Contains data and outputs related to the inference process
│   ├── corpus.txt                 # Raw text data used for inference (documents not included in the training corpus)
│   ├── corpus_inf.mallet          # Processed input for inference in Mallet format
│   ├── doc-topics-inf.txt         # Document-topic distribution after inference
│   └── thetas.npz                 # NumPy file storing inferred topic distributions
├── model_data                     # Stores model-related data and outputs
│   ├── TMmodel                    # Directory for various model-related files
│   │   ├── alphas.npy             # NumPy array storing alpha parameters (topics' size)
│   │   ├── alphas_orig.npy        # Original alpha parameters (in case modifications to the alphas file are made)
│   │   ├── betas.npy              # NumPy array storing beta parameters
│   │   ├── betas_ds.npy           # Downsampled beta parameters
│   │   ├── betas_orig.npy         # Original beta parameters
│   │   ├── edits.txt              # Text file for documenting edits
│   │   ├── ndocs_active.npy       # NumPy array storing the number of active documents
│   │   ├── pyLDAvis.html          # HTML file for visualizing topic models (PyLDAvis)
│   │   ├── thetas.npz             # NumPy file storing theta parameters (document-topic distribution)
│   │   ├── thetas_orig.npz        # Original theta parameters
│   │   ├── topic_coherence.npy    # NumPy array storing topic coherence scores
│   │   ├── tpc_coords.txt         # Text file storing topic coordinates
│   │   ├── tpc_descriptions.txt   # Text file for topic descriptions
│   │   ├── tpc_labels.txt         # Text file storing curated topic labels
│   │   ├── tpc_labels.txt.backup.txt  # Backup of ChatGPT-generated topic labels
│   │   └── vocab.txt              # Text file storing vocabulary
│   ├── corpus_train.mallet        # Training corpus in Mallet format
│   ├── corpus_train.txt           # Training corpus in text format
│   ├── diagnostics.xml            # XML file for model diagnostics obtained from Mallet
│   ├── dictionary.gensim          # Gensim dictionary file
│   ├── doc-topics.txt             # Document-topic distribution after training
│   ├── inferencer.mallet          # Mallet inferencer file
│   ├── model.pickle               # Pickle file storing the trained model
│   ├── topic-keys.json            # JSON file storing topic keys
│   ├── topic-keys.txt             # Text file storing topic keys
│   ├── topic-report.xml           # XML file for topic report
│   ├── vocab_freq.txt             # Text file storing vocabulary frequency
│   ├── vocabulary.txt             # Text file storing vocabulary
│   └── word-topic-counts.txt      # Text file storing word-topic counts
├── train_data                     # Contains data used for training the model
│   ├── corpus.mallet              # Corpus in Mallet format for training
│   ├── corpus.txt                 # Raw text data for training
│   ├── corpus_aux.txt             # Auxiliary text file for the corpus
│   ├── import.pipe                # Pipe file for importing inference data
│   └── train.config               # Configuration file for training
└── trainconfig.json               # JSON file containing the training configuration
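The document-topic distributions in doc-topics.txt (and doc-topics-inf.txt) follow Mallet's tab-separated output: a document index, the document name, and one proportion per topic. A minimal parser is sketched below on a synthetic two-document, three-topic example; the exact column layout can vary across Mallet versions.

```python
import io

# Parse Mallet doc-topics output: <doc-index>\t<doc-name>\t<p_topic0>\t...
# Synthetic example data; the real files live under model_data/ (training)
# and infer_data/ (inference).
sample = io.StringIO(
    "#doc name proportions...\n"
    "0\tdoc0\t0.70\t0.20\t0.10\n"
    "1\tdoc1\t0.05\t0.15\t0.80\n"
)

def parse_doc_topics(fh):
    """Return {doc_name: [p_topic0, p_topic1, ...]} from a doc-topics file."""
    thetas = {}
    for line in fh:
        if line.startswith("#"):  # skip the header comment line
            continue
        fields = line.rstrip("\n").split("\t")
        thetas[fields[1]] = [float(p) for p in fields[2:]]
    return thetas

thetas = parse_doc_topics(sample)
# Dominant topic per document:
dominant = {doc: props.index(max(props)) for doc, props in thetas.items()}
print(dominant)  # → {'doc0': 0, 'doc1': 2}
```

The thetas.npz files store the same distributions in compressed NumPy form for faster loading.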

Files

Files (985.0 MB)

md5:73262b6d76f86086fb30bdd11764c338

Additional details

Dates

Updated
2024-04-29

Software

Repository URL
https://github.com/nextprocurement/NP-Search-Tool
Programming language
Python
Development Status
Active