TL;DR

Construct a lemma graph, then perform entity linking based on: spaCy, transformers, SpanMarkerNER, spaCy-DBpedia-Spotlight, REBEL, OpenNRE, qwikidata, pulp

In other words, this hybrid approach integrates NLP parsing, LLMs, graph algorithms, semantic inference, and operations research, and also provides UX affordances for human-in-the-loop practices. The following demo illustrates a small problem, but the approach addresses a much broader class of AI problems in industry.

This step is a prelude to leveraging topological transforms, large language models, graph representation learning, and human-in-the-loop domain expertise to infer the nodes, edges, properties, and probabilities needed for the semi-automated construction of a knowledge graph from raw unstructured text sources.

In addition to providing a library for production use cases, TextGraphs creates a "playground" or "gym" in which to prototype and evaluate abstractions based on "Graph Levels Of Detail".

  1. use spaCy to parse a document, with SpanMarkerNER LLM assist
  2. add noun chunks in parallel to entities, as "candidate" phrases for subsequent human-in-the-loop (HITL) review
  3. perform entity linking: spaCy-DBpedia-Spotlight, Wikimedia API
  4. infer relations, plus graph inference: REBEL, OpenNRE, qwikidata
  5. build a lemma graph in NetworkX from the parse results
  6. run a modified TextRank algorithm plus graph analytics
  7. approximate a Pareto archive (hypervolume) to re-rank extracted entities with pulp
  8. visualize the lemma graph interactively in PyVis
  9. cluster communities within the lemma graph
  10. apply topological transforms to enhance embeddings (in progress)
  11. run graph representation learning on the graph of relations (in progress)

...

  1. PROFIT!
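
Step 7 above re-ranks entities across several metrics at once rather than by a single score. As a rough illustration of the underlying idea only, the sketch below peels off successive non-dominated (Pareto) fronts in plain Python. It is not the library's pulp-based hypervolume approximation: the metric names, the scores, and the helper functions `dominates` and `pareto_layers` are all assumptions invented for this example.

```python
# Hypothetical per-entity metrics, invented for illustration:
# (TextRank score, NER confidence, mention count). Higher is better.
SCORES = {
    "werner herzog": (0.16, 0.99, 2),
    "film": (0.17, 0.00, 2),
    "germany": (0.08, 0.95, 1),
    "director": (0.07, 0.00, 1),
}


def dominates(a, b):
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_layers(scores):
    """Peel off successive non-dominated fronts; earlier layers rank higher."""
    remaining = dict(scores)
    layers = []
    while remaining:
        # An entity is in the current front if no other remaining entity dominates it.
        front = [
            name
            for name, metrics in remaining.items()
            if not any(
                dominates(other_metrics, metrics)
                for other, other_metrics in remaining.items()
                if other != name
            )
        ]
        layers.append(sorted(front))
        for name in front:
            del remaining[name]
    return layers


LAYERS = pareto_layers(SCORES)
for depth, names in enumerate(LAYERS, start=1):
    print(depth, names)
```

Entities in the same layer are not comparable without a further tie-break, which is one reason a scalar measure such as hypervolume is useful in practice.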
More details...

Implementation of an LLM-augmented textgraph algorithm for constructing a lemma graph from raw, unstructured text sources.
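
To make the idea of a lemma graph concrete, here is a minimal, dependency-free sketch. It assumes the text has already been parsed into (lemma, part-of-speech) pairs, which spaCy would supply in the real pipeline, links lemmas that co-occur within a small window as in the original TextRank paper, and ranks them with plain weighted PageRank. The sample sentences, the `KEEP_POS` filter, the window size, and the function names are assumptions for illustration. The library itself builds its graph in NetworkX and runs a modified TextRank, which this sketch does not reproduce.

```python
from collections import defaultdict

# Hypothetical pre-parsed input: (lemma, part-of-speech) pairs per sentence.
SENTENCES = [
    [("werner", "PROPN"), ("herzog", "PROPN"), ("be", "AUX"), ("a", "DET"),
     ("german", "ADJ"), ("film", "NOUN"), ("director", "NOUN")],
    [("herzog", "PROPN"), ("direct", "VERB"), ("many", "ADJ"),
     ("documentary", "NOUN"), ("film", "NOUN")],
]
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB"}


def build_lemma_graph(sentences, window=2):
    """Link lemmas that co-occur within `window` kept tokens of each other."""
    graph = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        lemmas = [lemma for lemma, pos in sent if pos in KEEP_POS]
        for i, src in enumerate(lemmas):
            for dst in lemmas[i + 1 : i + 1 + window]:
                if src != dst:
                    # Undirected edge; weight counts co-occurrences.
                    graph[src][dst] += 1.0
                    graph[dst][src] += 1.0
    return {node: dict(nbrs) for node, nbrs in graph.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1.0 - damping) / len(nodes)
            + damping * sum(
                rank[nbr] * graph[nbr][node] / out_weight[nbr]
                for nbr in graph[node]
            )
            for node in nodes
        }
    return rank


GRAPH = build_lemma_graph(SENTENCES)
RANKS = pagerank(GRAPH)
for lemma, score in sorted(RANKS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {lemma}")
```

Because both sentences share the lemmas `herzog` and `film`, the two sentence-level chains merge into a single connected graph, which is what lets ranking signal flow across sentences.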

The TextGraphs library is based on work developed by Derwen in 2023 Q2 for customer apps and used in our Cysoni product. This demo integrates code from:

For more details about this approach, see these talks:

Other good tutorials (during 2023) which include related material:

Bibliography...

"Automatic generation of hypertext knowledge bases"
Udo Hahn, Ulrich Reimer
ACM SIGOIS 9:2 (1988-04-01)
https://doi.org/10.1145/966861.45429

The condensation process transforms the text representation structures resulting from the text parse into a more abstract thematic description of what the text is about, filtering out irrelevant knowledge structures and preserving only the most salient concepts.

Graph Representation Learning
William Hamilton
Morgan and Claypool (pre-print 2020)
https://www.cs.mcgill.ca/~wlh/grl_book/

A brief but comprehensive introduction to graph representation learning, including methods for embedding graph data, graph neural networks, and deep generative models of graphs.

"REDFM: a Filtered and Multilingual Relation Extraction Dataset"
Pere-Lluís Huguet Cabot, Simone Tedeschi, Axel-Cyrille Ngonga Ngomo, Roberto Navigli
ACL (2023-06-19)
https://arxiv.org/abs/2306.09802

Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems.

"InGram: Inductive Knowledge Graph Embedding via Relation Graphs"
Jaejun Lee, Chanyoung Chung, Joyce Jiyoung Whang
ICML (2023-08-17)
https://arxiv.org/abs/2305.19987

In this paper, we propose an INductive knowledge GRAph eMbedding method, InGram, that can generate embeddings of new relations as well as new entities at inference time.

"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
EMNLP (2004-07-25)
https://aclanthology.org/W04-3252

In this paper, the authors introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications.