Turboshtein

TL;DR - Possibly the fastest way of computing the Levenshtein distance between two fairly short ASCII strings in Python...


Background

I didn’t study computer science so the usefulness of bit shifts and other bitwise operations outside of low-level programming have always been a bit of a gray area for me. I finally gave in and decided to spend some time reading up on the subject and ended up implementing the Myers bit-vector algorithm1 for computing Levenshtein edit distance. I was fairly happy with the results so I wrapped the original C code in Python and uploaded it to PyPI. Coincidentally, it’s also pretty fast.

Usage

To start using Turboshtein, simply install it via PyPi:

pip install turboshtein

Once installed, computing Levenshtein edit distance then becomes as easy as:

>>> from turboshtein import levenshtein
>>> levenshtein("saturday", "sunday")
3

Note, that this particular implementation is limited to ASCII strings of less than 64 characters.

Performance

After implementing the algorithm, I was curious as to what the real world performance looked like. Given a query of length m and a string length of n, its been shown that the edit distance can be computed in O(mn) time.2 The approach as described by Myers offers a considerable improvement and requires only O(mn/w) where w is the register size (64-bit in this case). For comparison, I included a handful of the most popular libraries that include a Levenshtein distance function.

turboshtein performance

These values are obviously hardware dependent, but on my dusty laptop I was consistently getting a higher throughput with Turboshtein. For example, with strings of less than 24 characters, Turboshtein is able to crunch through a minimum of 600k additional string pairs per second relative to the next fastest library. For short strings (16 characters or less), that figure jumps up to an additional 800k string pairs per second. Test setup was as follows:

laptop modelXPS 15 9570
processorIntel i7-8750H
operating systemFedora Linux 36 (Workstation Edition)

Final thoughts

If you’re curious as to what the implementation looks like, here is the repo. I suppose at some point, I ought to implement support for non-ASCII characters so maybe check back in the future…


  1. Gene Myers. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3 (May 1999), 395-415. https://doi.org/10.1145/316542.316550 ↩︎

  2. Sellers, P. H. 1980. The theory and computation of evolutionary distances: Pattern recognition, Journal of Algorithms 1, 4 (December 1980), 359-373. https://doi.org/10.1016/0196-6774(80)90016-4 ↩︎