TL;DR - Possibly the fastest way of computing the Levenshtein distance between two fairly short ASCII strings in Python...
I didn’t study computer science so the usefulness of bit shifts and other bitwise operations outside of low-level programming have always been a bit of a gray area for me. I finally gave in and decided to spend some time reading up on the subject and ended up implementing the Myers bit-vector algorithm1 for computing Levenshtein edit distance. I was fairly happy with the results so I wrapped the original C code in Python and uploaded it to PyPI. Coincidentally, it’s also pretty fast.
To start using Turboshtein, simply install it via PyPi:
pip install turboshtein
Once installed, computing Levenshtein edit distance then becomes as easy as:
>>> from turboshtein import levenshtein
>>> levenshtein("saturday", "sunday")
3
Note, that this particular implementation is limited to ASCII strings of less than 64 characters.
After implementing the algorithm, I was curious as to what the real world performance looked like. Given a query of length m and a string length of n, its been shown that the edit distance can be computed in O(mn) time.2 The approach as described by Myers offers a considerable improvement and requires only O(mn/w) where w is the register size (64-bit in this case). For comparison, I included a handful of the most popular libraries that include a Levenshtein distance function.
These values are obviously hardware dependent, but on my dusty laptop I was consistently getting a higher throughput with Turboshtein. For example, with strings of less than 24 characters, Turboshtein is able to crunch through a minimum of 600k additional string pairs per second relative to the next fastest library. For short strings (16 characters or less), that figure jumps up to an additional 800k string pairs per second. Test setup was as follows:
laptop model | XPS 15 9570 | |
processor | Intel i7-8750H | |
operating system | Fedora Linux 36 (Workstation Edition) |
If you’re curious as to what the implementation looks like, here is the repo. I suppose at some point, I ought to implement support for non-ASCII characters so maybe check back in the future…
Gene Myers. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3 (May 1999), 395-415. https://doi.org/10.1145/316542.316550 ↩︎
Sellers, P. H. 1980. The theory and computation of evolutionary distances: Pattern recognition, Journal of Algorithms 1, 4 (December 1980), 359-373. https://doi.org/10.1016/0196-6774(80)90016-4 ↩︎