Published May 11, 2026 | Version 1.0.0
Preprint Open

Estimating Levenshtein Distance With Signatures

Authors/Creators

Description

Levenshtein Distance (LD) is an intuitive measure of lexical similarity, but computing it exactly runs in time proportional to the product of the string lengths, limiting practical use to strings of about a thousand characters. This paper describes a technique for estimating LD between much larger texts by applying LD to compact signatures: short strings generated by a sliding-window hash that function as thumbnails of the originals. Two parameters control the trade-off: a compression factor C determines signature length (approximately file_size/C), and a neighborhood size n controls sensitivity to dense character-level differences. Signatures are two to three orders of magnitude shorter than the source documents, making LD estimation on documents of hundreds of thousands of characters practical on commodity hardware. At 25 KB with C=50, normalized estimation error stays below 13% even for completely unrelated files, and the estimator reliably distinguishes identical, near-duplicate, modified, and unrelated documents across all tested compression factors. Because signatures are self-contained artifacts that support all subsequent operations without access to the source document, they enable privacy-preserving architectures in which neither party to a comparison need expose its original content. Applications include web-scale deduplication, content security and leak detection, double-blind similarity search, digital forensics, and scholarly analysis of manuscript traditions.
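
The pipeline described above (hash each neighborhood of n characters, keep roughly one in C of them as a signature character, then run exact LD on the signatures and scale by C) can be illustrated with a minimal Go sketch. This is not the code from the linked repository. The FNV-1a hash, the 62-character output alphabet, the "keep when hash mod C is zero" selection rule, and the names levenshtein, signature, and estimateLD are all assumptions made for illustration; the paper and repository define the actual scheme.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// alphabet is a hypothetical output alphabet for signature characters.
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the exact byte-wise edit distance between a and b
// using the standard two-row dynamic program: O(len(a)*len(b)) time.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// signature slides a window of n bytes over text and hashes each window.
// ASSUMPTION: a window contributes a character only when its hash is
// divisible by c, so the expected signature length is about len(text)/c.
// Each window is rehashed from scratch for clarity, not speed.
func signature(text string, c, n int) string {
	var sig []byte
	for i := 0; i+n <= len(text); i++ {
		h := fnv.New32a()
		h.Write([]byte(text[i : i+n]))
		v := h.Sum32()
		if v%uint32(c) == 0 {
			sig = append(sig, alphabet[(v/uint32(c))%uint32(len(alphabet))])
		}
	}
	return string(sig)
}

// estimateLD scales the LD of two signatures by c to approximate the LD
// of the original texts (a simple linear scaling, assumed for this sketch).
func estimateLD(sigA, sigB string, c int) int {
	return levenshtein(sigA, sigB) * c
}

func main() {
	a := "the quick brown fox jumps over the lazy dog, again and again and again"
	b := "the quick brown cat jumps over the lazy dog, again and again and again"
	const c, n = 4, 5 // tiny values so that short demo strings yield signatures
	sa, sb := signature(a, c, n), signature(b, c, n)
	fmt.Println("signatures:", sa, sb)
	fmt.Println("exact LD:", levenshtein(a, b), "estimated LD:", estimateLD(sa, sb, c))
}
```

Because only the signatures are passed to estimateLD, two parties could exchange signatures and compare them without exchanging the source texts, which is the property the description relies on for privacy-preserving use. On inputs as short as the ones in main the estimate is very noisy; the accuracy figures quoted above refer to much larger files.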

Files

levenshtein.pdf

Files (250.5 kB)

Name: levenshtein.pdf
Size: 250.5 kB
md5:770f716dbb58b8f1eb3f1dd80f74a37b

Additional details

Related works

Is supplemented by
Software: https://github.com/coatespt/eld_2026 (URL)

Dates

Available
2026-05-11

Software

Repository URL
https://github.com/coatespt/eld_2026
Programming language
Go
Development Status
Active