Published April 30, 2022 | Version v0.10.0
Software
Open
MaartenGr/BERTopic: v0.10.0
Description
Highlights
- Use any dimensionality reduction technique instead of UMAP:
from bertopic import BERTopic
from sklearn.decomposition import PCA
dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
- Use any clustering technique instead of HDBSCAN:
from bertopic import BERTopic
from sklearn.cluster import KMeans
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
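A practical consequence of the two swaps above can be checked without fitting a topic model. The sketch below is illustrative only: it runs PCA and KMeans directly on random vectors that stand in for document embeddings (the array shape, the seed, and `n_clusters=10` are assumptions made for the demo, not values from the release). It shows that KMeans assigns every point to a cluster, so no `-1` outlier label appears as it can with HDBSCAN.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random vectors standing in for 200 document embeddings of size 384
# (hypothetical data; real embeddings would come from an embedding model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

# Dimensionality reduction step: project down to 5 components.
reduced = PCA(n_components=5).fit_transform(embeddings)

# Clustering step: KMeans labels every point with a cluster id in 0..9.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(reduced)

print(reduced.shape)       # (200, 5)
print(labels.min() >= 0)   # True: there is no -1 outlier label
```

Because every document receives a cluster, choosing KMeans is one way to avoid outliers altogether, at the cost of having to pick the number of clusters up front.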
Documentation
- Add a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
- Added pages on how to use other dimensionality reduction and clustering algorithms
- Additional instructions on how to reduce outliers in the FAQ:
import numpy as np
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
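To see what the threshold rule does, it can be traced on a small hand-made matrix. This is a minimal sketch: the `probs` values are invented for illustration, and in real use they would be the per-document topic probabilities returned by a fitted model. The `np.where` line is a vectorized equivalent added here for comparison, not part of the FAQ snippet.

```python
import numpy as np

# Hypothetical probabilities for 4 documents over 3 topics (made-up values).
probs = np.array([
    [0.70, 0.20, 0.05],     # clearly topic 0
    [0.002, 0.004, 0.003],  # every value below the threshold -> stays -1
    [0.10, 0.15, 0.60],     # clearly topic 2
    [0.01, 0.005, 0.0],     # maximum equals the threshold -> topic 0
])
probability_threshold = 0.01

# Same rule as the FAQ snippet: take the most likely topic if it is
# likely enough, otherwise mark the document as an outlier (-1).
new_topics = [
    int(np.argmax(prob)) if max(prob) >= probability_threshold else -1
    for prob in probs
]

# Vectorized equivalent of the list comprehension.
new_topics_vec = np.where(
    probs.max(axis=1) >= probability_threshold,
    probs.argmax(axis=1),
    -1,
)

print(new_topics)  # [0, -1, 2, 0]
```

Lowering `probability_threshold` assigns more documents to a topic, while raising it leaves more of them as outliers.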
Fixes
- Fixed None being returned for probabilities when transforming unseen documents
- Replaced all instances of arg: with Arguments: for consistency
- Before saving a fitted BERTopic instance, we remove the stopwords in the fitted CountVectorizer model, as it can get quite large due to the number of words that end up as stopwords if min_df is set to a value larger than 1
- Set "hdbscan>=0.8.28" to prevent numpy issues
  - Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 with BERTopic, which leads to these numpy issues
- Update gensim dependency to >=4.0.0 (#371)
- Fix topic 0 not appearing in visualizations (#472)
- Fix #506
- Fix #429
Files
| Name | Size |
|---|---|
| MaartenGr/BERTopic-v0.10.0.zip (md5:db73a29436eccc8caef7b936982a9e3e) | 6.2 MB |
Additional details
Related works
- Is supplement to
- https://github.com/MaartenGr/BERTopic/tree/v0.10.0 (URL)