Estimating Levenshtein Distance With Signatures

Coates, Peter Taylor

doi:10.5281/zenodo.20125438

Published May 11, 2026 | Version 1.0.0

Preprint | Open access

Levenshtein Distance (LD) is an intuitive measure of lexical similarity, but computing it exactly runs in time proportional to the product of the string lengths, limiting practical use to strings of about a thousand characters. This paper describes a technique for estimating LD between much larger texts by applying LD to compact signatures, short strings generated by a sliding-window hash that function as thumbnails of the originals. Two parameters control the trade-off: a compression factor C determines signature length (approximately file_size/C), and a neighborhood size n controls sensitivity to dense character-level differences. Signatures are two to three orders of magnitude shorter than the source documents, making LD estimation on documents of hundreds of thousands of characters practical on commodity hardware. At 25 KB with C=50, normalized estimation error stays below 13% even for completely unrelated files, and the estimator reliably distinguishes identical, near-duplicate, modified, and unrelated documents across all tested compression factors. Because signatures are self-contained artifacts that support all subsequent operations without access to the source document, they enable privacy-preserving architectures in which neither party to a comparison need expose its original content. Applications include web-scale deduplication, content security and leak detection, double-blind similarity search, digital forensics, and scholarly analysis of manuscript traditions.
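
The following Go program is a minimal sketch of the idea described in the abstract, not the author's implementation (see the linked repository for that). Several details are assumptions made for illustration: that a signature character is emitted whenever the hash of an n-byte window is divisible by C, that the character is picked from a fixed alphabet using the same hash, that FNV-1a is the hash, and that the signature distance is scaled by C to estimate the distance between the originals. The names `Signature`, `Levenshtein`, `EstimateLD`, and `alphabet` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// alphabet is a hypothetical set of output characters for signatures.
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// Signature slides an n-byte window over text and hashes each window.
// Assumption: a character is emitted only when hash % c == 0, so the
// result has roughly len(text)/c characters.
func Signature(text []byte, n int, c uint64) string {
	if n <= 0 || c == 0 || len(text) < n {
		return ""
	}
	sig := make([]byte, 0, len(text)/int(c)+1)
	for i := 0; i+n <= len(text); i++ {
		h := fnv.New64a() // a rolling hash would be faster; this keeps the sketch short
		h.Write(text[i : i+n])
		v := h.Sum64()
		if v%c == 0 {
			sig = append(sig, alphabet[(v/c)%uint64(len(alphabet))])
		}
	}
	return string(sig)
}

// min3 returns the smallest of three ints.
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}

// Levenshtein computes the exact byte-wise edit distance with the standard
// two-row dynamic program, in O(len(a)*len(b)) time.
func Levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// EstimateLD applies LD to the two signatures and scales by c.
// Assumption: the scaling step is illustrative, not taken from the paper.
func EstimateLD(a, b []byte, n int, c uint64) int {
	return Levenshtein(Signature(a, n, c), Signature(b, n, c)) * int(c)
}

func main() {
	// Deterministic pseudo-random text so the run is repeatable.
	rng := rand.New(rand.NewSource(1))
	orig := make([]byte, 5000)
	for i := range orig {
		orig[i] = byte('a' + rng.Intn(26))
	}
	// A modified copy: overwrite a short run in the middle.
	edited := append([]byte(nil), orig...)
	copy(edited[2000:], []byte("some inserted-over text"))

	const n, c = 8, 50
	fmt.Printf("signature lengths: %d, %d\n",
		len(Signature(orig, n, c)), len(Signature(edited, n, c)))
	fmt.Printf("estimated LD: %d\n", EstimateLD(orig, edited, n, c))
	fmt.Printf("exact LD:     %d\n", Levenshtein(string(orig), string(edited)))
}
```

Because the quadratic cost applies to the signatures rather than the originals, the comparison step in this sketch does roughly C² times less work than exact LD on the source texts. Only the signatures are needed at comparison time, which is the property the abstract relies on for comparisons that do not expose the original content.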

Files

Name	Size
levenshtein.pdf (md5:770f716dbb58b8f1eb3f1dd80f74a37b)	250.5 kB

Additional details

Is supplemented by: Software: https://github.com/coatespt/eld_2026 (URL)

Available: 2026-05-11

Repository URL: https://github.com/coatespt/eld_2026
Programming language: Go
Development Status: Active
