Published April 30, 2022 | Version v0.10.0
Software Open

MaartenGr/BERTopic: v0.10.0

  • 1. IKNL
  • 2. Huggingface

Description

Highlights
  • Use any dimensionality reduction technique instead of UMAP:
    from bertopic import BERTopic
    from sklearn.decomposition import PCA

    dim_model = PCA(n_components=5)
    topic_model = BERTopic(umap_model=dim_model)
  • Use any clustering technique instead of HDBSCAN:
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    cluster_model = KMeans(n_clusters=50)
    topic_model = BERTopic(hdbscan_model=cluster_model)
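
The PCA example above suggests that the dimensionality reduction step only needs an object that can be fitted and then used to transform embeddings. As a minimal sketch under that assumption, the class below implements a Gaussian random projection in plain NumPy. The class name RandomProjection, its parameters, and the 384-dimensional random embeddings are hypothetical and not part of BERTopic; whether BERTopic requires anything beyond fit and transform is an assumption here.

```python
import numpy as np


class RandomProjection:
    """Hypothetical reducer exposing a scikit-learn style fit/transform interface."""

    def __init__(self, n_components=5, seed=0):
        self.n_components = n_components
        self.seed = seed

    def fit(self, X, y=None):
        # y is accepted but ignored, so supervised callers do not break.
        rng = np.random.default_rng(self.seed)
        n_features = np.asarray(X).shape[1]
        # Gaussian projection matrix, scaled by 1/sqrt(n_components).
        self.components_ = rng.normal(
            size=(n_features, self.n_components)
        ) / np.sqrt(self.n_components)
        return self

    def transform(self, X):
        # Project the embeddings onto the lower-dimensional space.
        return np.asarray(X) @ self.components_


# Stand-in for document embeddings: 100 documents, 384 dimensions.
embeddings = np.random.default_rng(42).normal(size=(100, 384))
reduced = RandomProjection(n_components=5).fit(embeddings).transform(embeddings)
print(reduced.shape)  # (100, 5)

# Assumed usage, mirroring the PCA example:
# topic_model = BERTopic(umap_model=RandomProjection(n_components=5))
```

An established implementation such as PCA from scikit-learn is usually the safer choice; the sketch is only meant to make the expected shape of such an object concrete.
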
Documentation
  • Added a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
  • Added pages on how to use other dimensionality reduction and clustering algorithms
    • Additional instructions on how to reduce outliers in the FAQ:
      import numpy as np
      probability_threshold = 0.01
      new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
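
The outlier snippet above leaves probs undefined. As a small worked sketch, assume probs is a documents-by-topics probability matrix, such as the one BERTopic returns when probabilities are calculated; the values below are made up for illustration. The helper name assign_topics is hypothetical, and it applies the same rule in vectorized form: a document keeps its most likely topic only if that probability reaches the threshold, otherwise it is labeled -1 (outlier).

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 3 topics.
probs = np.array([
    [0.700, 0.200, 0.100],
    [0.004, 0.003, 0.002],
    [0.050, 0.900, 0.050],
])
probability_threshold = 0.01


def assign_topics(probs, threshold):
    """Return the most likely topic per document, or -1 below the threshold."""
    best = probs.argmax(axis=1)                  # index of the most likely topic
    confident = probs.max(axis=1) >= threshold   # is that probability high enough?
    return np.where(confident, best, -1)


new_topics = assign_topics(probs, probability_threshold)
print(new_topics.tolist())  # [0, -1, 1]
```

The second document stays an outlier because its highest probability (0.004) is below the threshold of 0.01. Lowering the threshold assigns more documents to topics, at the cost of less certain assignments.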
      
Fixes
  • Fixed None being returned for probabilities when transforming unseen documents
  • Replaced all instances of arg: with Arguments: for consistency
  • Before saving a fitted BERTopic instance, the stop words stored in the fitted CountVectorizer model are removed, because that collection can get quite large when min_df is set to a value larger than 1 and many words end up as stop words
  • Set "hdbscan>=0.8.28" to prevent numpy issues
    • Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 alongside BERTopic, which leads to these numpy issues
  • Update gensim dependency to >=4.0.0 (#371)
  • Fix topic 0 not appearing in visualizations (#472)
  • Fix #506
  • Fix #429

Files

MaartenGr/BERTopic-v0.10.0.zip (6.2 MB)
md5:db73a29436eccc8caef7b936982a9e3e
