MaartenGr/BERTopic: v0.9

Maarten Grootendorst; Nils Reimers

doi:10.5281/zenodo.5168575

Published August 7, 2021 | Version v0.9.0

Software Open


1. Van Spaendonck
2. Huggingface

Highlights

Implemented Guided BERTopic: use seed words to steer the topic modeling
Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
- This lets users see which documents represent a topic well and better understand the topics that were created
Added the normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time to make relative topic frequencies easier to compare between topics
Return flat probabilities by default; the probabilities of all topics per document are only calculated if calculate_probabilities is True
Added several FAQs
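
As a small illustration of the get_representative_docs highlight, the sketch below wraps that call in a helper that shortens each returned document for quick inspection. The helper name preview_topic and its max_chars parameter are hypothetical and not part of BERTopic; the only library call assumed is get_representative_docs(topic=...) as shown in the highlights, and it is assumed to return a list of strings.

```python
def preview_topic(topic_model, topic_id, max_chars=80):
    """Return the representative documents of one topic, shortened for display.

    `topic_model` is assumed to be a fitted BERTopic instance. Only its
    get_representative_docs(topic=...) method is used here; this helper
    itself is hypothetical and not part of the library.
    """
    docs = topic_model.get_representative_docs(topic=topic_id)
    # Keep short documents whole and cut long ones at max_chars characters.
    return [doc if len(doc) <= max_chars else doc[:max_chars] + "..."
            for doc in docs]
```

With a fitted model, preview_topic(topic_model, 1) would give a compact view of the documents behind topic 1.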

Fixes

Fix loading pre-trained BERTopic model
Fix mapping of probabilities
Fix #190

Guided BERTopic

Guided BERTopic works in two ways:

First, we create an embedding for each seeded topic by joining its seed words and passing the result through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label. If a document is most similar to a seeded topic, it gets that topic's label. If it is most similar to the average document embedding, it gets the -1 label. These labels are then passed to UMAP to create a semi-supervised approach that should nudge topic creation toward the seeded topics.
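
The labeling step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BERTopic's implementation: the function names cosine and assign_seed_labels are hypothetical, and the toy 2-D vectors stand in for real document and seed-topic embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_seed_labels(doc_embeddings, seed_embeddings):
    """Label each document with its closest seeded topic, or -1.

    A document gets -1 when the average document embedding is more
    similar to it than any seeded topic embedding.
    """
    n_docs = len(doc_embeddings)
    dim = len(doc_embeddings[0])
    mean_embedding = [sum(doc[i] for doc in doc_embeddings) / n_docs
                      for i in range(dim)]
    # The average embedding is the last candidate and maps to label -1.
    candidates = seed_embeddings + [mean_embedding]
    labels = []
    for doc in doc_embeddings:
        sims = [cosine(doc, cand) for cand in candidates]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if best < len(seed_embeddings) else -1)
    return labels

# Toy 2-D embeddings (assumed values, for illustration only).
seed_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
labels = assign_seed_labels(doc_embeddings, seed_embeddings)
print(labels)  # [0, 1, -1]
```

The third document lies between both seeded topics and is closest to the average embedding, so it receives -1 and places no constraint on the semi-supervised UMAP step.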

Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. These multipliers are used to increase the IDF values of the words across all topics, thereby increasing the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance that an irrelevant topic picks up seed words unrelated to it. In practice, this should not be an issue since the IDF value is likely to remain low regardless of the multiplier. The multiplier is currently a fixed value but may change to something more elegant, such as taking the distribution of IDF values and a word's position in it into account when defining the multiplier.
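
The effect of the multiplier can be sketched as follows. This is an illustration only: it uses a textbook-style IDF over tokenized documents rather than BERTopic's class-based c-TF-IDF, the function names plain_idf and boost_seed_words are hypothetical, and the multiplier value 1.2 is an assumed example of "a fixed value larger than 1".

```python
import math

def plain_idf(doc_tokens):
    """Textbook-style IDF per word: log(1 + n_docs / document_frequency)."""
    n_docs = len(doc_tokens)
    vocab = {tok for doc in doc_tokens for tok in doc}
    return {
        word: math.log(1 + n_docs / sum(1 for doc in doc_tokens if word in doc))
        for word in vocab
    }

def boost_seed_words(idf, seed_topic_list, multiplier=1.2):
    """Multiply the IDF of every seed word by a fixed factor larger than 1.

    Words outside seed_topic_list keep their IDF; seed words missing
    from the vocabulary are ignored.
    """
    seed_words = {word for topic in seed_topic_list for word in topic}
    return {word: value * multiplier if word in seed_words else value
            for word, value in idf.items()}

# Toy corpus and seeds (assumed values, for illustration only).
toy_docs = [["wheat", "price"], ["oil", "price"], ["wheat", "corn"]]
toy_seeds = [["grain", "wheat", "corn"]]

idf = plain_idf(toy_docs)
boosted = boost_seed_words(idf, toy_seeds, multiplier=1.2)
```

Because the boost is applied to the word's IDF rather than to a single topic, every topic containing that word is affected, which is the side effect the paragraph above mentions.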

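from bertopic import BERTopic

# docs: a list of document strings to model (not defined in these notes)

# Each inner list holds the seed words for one topic
seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

# Passing seed_topic_list enables Guided BERTopic
topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)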

Files

Name	Size
MaartenGr/BERTopic-v0.9.0.zip (md5:60f155bbfac774a56b187663b8dfd18a)	6.2 MB

Additional details

Is supplement to: https://github.com/MaartenGr/BERTopic/tree/v0.9.0 (URL)

Statistic	All versions	This version
Views	10,822	484
Downloads	736	24
Data volume	3.9 GB	147.9 MB
