Published September 29, 2023 | Version v0.8.0
Software | Open Access

MaartenGr/KeyBERT: v0.8

  • 1. IKNL
  • 2. IIT Kanpur
  • 3. @Gruveo
  • 4. IIIT, Hyderabad
  • 5. The University of Tokyo
  • 6. @explosion

Description

Highlights

  • Use keybert.KeyLLM to leverage LLMs for extracting keywords 🔥
    • Use it either with or without candidate keywords generated through KeyBERT
    • Efficient implementation that calculates embeddings once and only passes a representative subset of the documents to the LLM
  • Multiple LLMs are integrated: OpenAI, Cohere, LangChain, 🤗 Transformers, and LiteLLM (a sketch with a non-OpenAI backend follows below)
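
All of the examples in these notes use the OpenAI wrapper. As a point of comparison, here is a hedged sketch of wiring up one of the other integrated backends, the 🤗 Transformers one; the TextGeneration wrapper name and its pipeline argument are assumptions based on the keybert.llm naming pattern rather than something stated in these notes:

from transformers import pipeline
from keybert.llm import TextGeneration
from keybert import KeyLLM

# A small open-source generator loaded through 🤗 Transformers (model choice is illustrative only)
generator = pipeline("text-generation", model="gpt2")

# Wrap the pipeline so KeyLLM can call it (wrapper name assumed, mirroring the OpenAI wrapper below)
llm = TextGeneration(generator)
kw_model = KeyLLM(llm)

# Extract keywords for a toy document
keywords = kw_model.extract_keywords(["KeyBERT extracts keywords using BERT embeddings."])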
1. Create Keywords with KeyLLM

A minimal method for keyword extraction with Large Language Models (LLMs). There are a number of implementations that allow you to mix and match KeyBERT with KeyLLM, and you can also choose to use KeyLLM without KeyBERT. The example below first prepares document and word embeddings with KeyBERT so that they can be reused; a standalone KeyLLM sketch follows after it.

from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
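
As mentioned above, KeyLLM can also be used without KeyBERT. A minimal sketch of that standalone variant, reusing the OpenAI wrapper and placeholder API key shown in the sections below:

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM (the API key is a placeholder)
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM and extract keywords directly;
# `documents` is a list of strings, as in the sections below
kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(documents)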
2. Efficient KeyLLM

If you already have embeddings of your documents, you can use those to find the documents that are most similar to one another. Those documents then all receive the same keywords, and only one document per group needs to be passed to the LLM. This makes computation much faster, as the LLM only has to generate keywords for a subset of the documents.

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents, convert_to_tensor=True)

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.75)
3. Efficient KeyLLM + KeyBERT

This is the best of both worlds: KeyBERT generates a first pass of keywords and embeddings and gives those to KeyLLM for a final pass. As before, the most similar documents are clustered together and all receive the same keywords. You can tune this behavior with the threshold: a higher value reduces the number of documents that are clustered together, while a lower value increases it.

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM, KeyBERT

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load the LLM into KeyBERT
kw_model = KeyBERT(llm=llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)

See the KeyBERT documentation for the full overview of KeyLLM use cases and for the list of implemented Large Language Models.

Fixes
  • Enable Guided KeyBERT for seed keywords differing among docs by @shengbo-ma in #152 (sketched below)
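
A hedged sketch of what this fix enables, assuming the seed_keywords parameter of extract_keywords accepts a nested list with one list of seeds per document; the documents and seed terms here are made up for illustration:

from keybert import KeyBERT

kw_model = KeyBERT()

docs = [
    "Supervised learning is the machine learning task of learning a function from labeled data.",
    "The stock market rallied after the central bank cut interest rates.",
]

# One list of seed keywords per document; previously a single shared list
# applied to all documents (per the PR title above)
seed_keywords = [
    ["machine learning", "supervised"],
    ["finance", "interest rates"],
]

keywords = kw_model.extract_keywords(docs, seed_keywords=seed_keywords)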

Files

MaartenGr/KeyBERT-v0.8.0.zip (457.3 kB)
md5:99cca342bdf8fa217c9b85abdfd916ad
