Edit Distance - The Levenshtein Algorithm Explained with Python Examples

What is Edit Distance?

As developers, we constantly deal with strings. Whether it’s comparing user input, correcting typos, aligning text, or even analyzing DNA sequences, understanding string similarity is fundamental. This is where Edit Distance comes into play.

At its core, Edit Distance is a metric for quantifying how dissimilar two strings are to one another. It’s defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

The most common and widely used algorithm for calculating edit distance is the Levenshtein Distance, often just called “Edit Distance” interchangeably.

Why is Edit Distance Important?

  • Spell Checkers: The classic example. When you mistype a word, your spell checker uses edit distance to suggest corrections (e.g., “kiten” to “kitten”).
  • “Did you mean…?” Features: Search engines and e-commerce sites use it to provide suggestions when a user’s query doesn’t match perfectly.
  • DNA Sequence Alignment: In bioinformatics, it’s used to compare DNA or protein sequences to find similarities, which can indicate evolutionary relationships or functional resemblances.
  • Plagiarism Detection: While not the sole method, it can be a component in identifying similar text passages.
  • Diff Utilities: Tools like git diff or diff itself, although often using more specialized algorithms, are conceptually related to finding the “edits” between two versions of a file.
  • Natural Language Processing (NLP): Used in various tasks like text normalization, tokenization, and even in some machine translation systems.

The Levenshtein Algorithm: How it Works

The Levenshtein algorithm uses a technique called Dynamic Programming. If you’re new to dynamic programming, think of it as solving a complex problem by breaking it down into simpler, overlapping subproblems and storing the results of these subproblems to avoid redundant calculations.

Let’s define the three basic operations that contribute to the Levenshtein distance:

  1. Insertion: Adding a character. (e.g., “cat” to “cart” - insert ‘r’)
  2. Deletion: Removing a character. (e.g., “cart” to “cat” - delete ‘r’)
  3. Substitution: Changing one character to another. (e.g., “cat” to “cut” - substitute ‘a’ with ‘u’)

Each of these operations has a “cost” of 1. The goal is to find the sequence of operations with the minimum total cost.

The Dynamic Programming Approach

We’ll build a 2D matrix (or grid) dp, where dp[i][j] represents the Levenshtein distance between the first i characters of string1 and the first j characters of string2.

Let s1 be string1 and s2 be string2. Let m be the length of s1 and n be the length of s2. Our dp matrix will have dimensions (m+1) x (n+1).

1. Initialization (Base Cases)

  • First Row (dp[0][j]): If s1 is an empty string, the distance to transform it into s2 (up to j characters) is simply j insertions. So, dp[0][j] = j.
  • First Column (dp[i][0]): Similarly, if s2 is an empty string, the distance to transform s1 (up to i characters) into an empty string is i deletions. So, dp[i][0] = i.

This essentially sets up the cost of transforming an empty string to a prefix of the other string.

2. Filling the Matrix (Recurrence Relation)

For every other cell dp[i][j] (where i > 0 and j > 0):

  • Case 1: Characters Match (s1[i-1] == s2[j-1]) If the current characters at s1[i-1] and s2[j-1] are the same, no operation is needed for these characters. The distance dp[i][j] is simply the distance of the preceding substrings: dp[i-1][j-1].

  • Case 2: Characters Don’t Match (s1[i-1] != s2[j-1]) If the characters are different, we have three options, and we choose the one that results in the minimum cost:

    1. Deletion: Delete s1[i-1]. The cost is 1 + dp[i-1][j] (1 for deletion, plus the distance between s1 up to i-1 and s2 up to j).
    2. Insertion: Insert s2[j-1] into s1. The cost is 1 + dp[i][j-1] (1 for insertion, plus the distance between s1 up to i and s2 up to j-1).
    3. Substitution: Substitute s1[i-1] with s2[j-1]. The cost is 1 + dp[i-1][j-1] (1 for substitution, plus the distance between s1 up to i-1 and s2 up to j-1).

    So, dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]).

3. The Result

The final Levenshtein distance between s1 and s2 will be dp[m][n], which is the bottom-right cell of our matrix.

Step-by-Step Example: “kitten” vs “sitting”

Let s1 = "kitten" (m=6) and s2 = "sitting" (n=7).

Initialize the (m+1) x (n+1) matrix (7x8):

s i t t i n g
0 1 2 3 4 5 6 7
k 1
i 2
t 3
t 4
e 5
n 6

Now, let’s fill it:

dp[1][1] (k vs s): Mismatch. 1 + min(dp[0][1], dp[1][0], dp[0][0]) = 1 + min(1, 1, 0) = 1 dp[1][2] (k vs si): Mismatch. 1 + min(dp[0][2], dp[1][1], dp[0][1]) = 1 + min(2, 1, 1) = 2

…and so on.

Let’s look at a crucial cell: dp[3][3] (kit vs sit). s1[2] is ’t’, s2[2] is ’t’. They match! So, dp[3][3] = dp[2][2] (distance for ki vs si).

If s1[i-1] and s2[j-1] are the same, cost = 0 for that char, otherwise cost = 1. dp[i][j] = min(dp[i-1][j] + 1, // Deletion from s1 dp[i][j-1] + 1, // Insertion into s1 dp[i-1][j-1] + cost) // Substitution or Match

Let’s fill the table for “kitten” vs “sitting” (values are for understanding, not literal steps):

s i t t i n g
0 1 2 3 4 5 6 7
k 1 1 2 3 4 5 6 7
i 2 2 1 2 3 4 5 6
t 3 3 2 1 2 3 4 5
t 4 4 3 2 1 2 3 4
e 5 5 4 3 2 2 3 4
n 6 6 5 4 3 3 2 3

Let’s trace dp[6][7] (kitten vs sitting): s1[5] is ’n’, s2[6] is ‘g’. Mismatch. dp[6][7] = 1 + min(dp[5][7] /*delete n*/, dp[6][6] /*insert g*/, dp[5][6] /*substitute n for g*/) dp[6][7] = 1 + min(4, 2, 3) = 1 + 2 = 3

The final distance is 3. (kitten -> sitting)

  1. k -> s (sub)
  2. e -> i (sub)
  3. n -> g (sub) In this case, it happens to be three substitutions.

Python Implementation

Here’s a robust Python function that implements the Levenshtein distance using dynamic programming:

def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Calculates the Levenshtein (edit) distance between two strings.

    The Levenshtein distance is the minimum number of single-character edits
    (insertions, deletions, or substitutions) required to change one word
    into the other.

    Args:
        s1 (str): The first string.
        s2 (str): The second string.

    Returns:
        int: The Levenshtein distance between s1 and s2.
    """
    m, n = len(s1), len(s2)

    # Create a DP table to store results of subproblems
    # dp[i][j] will be the Levenshtein distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the first row and column of the DP table
    # If s1 is empty, distance to s2 is the length of s2 (insertions)
    for j in range(n + 1):
        dp[0][j] = j
    
    # If s2 is empty, distance to s1 is the length of s1 (deletions)
    for i in range(m + 1):
        dp[i][0] = i

    # Fill the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # If characters match, no cost for this step.
            # Take the distance from the diagonal element (previous subproblem).
            if s1[i - 1] == s2[j - 1]:
                cost = 0
            else:
                # If characters don't match, there's a cost of 1 for an operation.
                # Choose the minimum of:
                # 1. Deletion: dp[i-1][j] + 1 (delete s1[i-1])
                # 2. Insertion: dp[i][j-1] + 1 (insert s2[j-1])
                # 3. Substitution: dp[i-1][j-1] + 1 (substitute s1[i-1] with s2[j-1])
                cost = 1
            
            dp[i][j] = min(dp[i - 1][j] + 1,      # Deletion
                           dp[i][j - 1] + 1,      # Insertion
                           dp[i - 1][j - 1] + cost) # Substitution or Match

    # The bottom-right cell contains the final Levenshtein distance
    return dp[m][n]

# Helper function to visualize the DP table (for debugging/understanding)
def print_dp_table(s1: str, s2: str, dp_table: list[list[int]]):
    """Prints the DP table for visualization."""
    m, n = len(s1), len(s2)
    print("\nDP Table:")
    
    # Print header row for s2
    header = [" "] + list(s2)
    print("      " + "   ".join(header)) # Adjust spacing for aesthetic
    
    for i in range(m + 1):
        row_label = s1[i-1] if i > 0 else " " # Label for s1 character
        row_values = [str(val).rjust(2) for val in dp_table[i]]
        print(f"{row_label.ljust(2)} {i:^2} | {' | '.join(row_values)} |")
    print("-" * (6 + n * 4)) # Separator line

Code Examples and Outputs

Let’s test our levenshtein_distance function with various scenarios.

Example 1: Classic “kitten” vs “sitting”

s1 = "kitten"
s2 = "sitting"
distance = levenshtein_distance(s1, s2)
print(f"Distance between '{s1}' and '{s2}': {distance}")
Distance between 'kitten' and 'sitting': 3

Example 2: Identical Strings

s1 = "hello"
s2 = "hello"
distance = levenshtein_distance(s1, s2)
print(f"Distance between '{s1}' and '{s2}': {distance}")
Distance between 'hello' and 'hello': 0

Example 3: One Empty String

s1 = "apple"
s2 = ""
distance = levenshtein_distance(s1, s2)
print(f"Distance between '{s1}' and '{s2}': {distance}")
Distance between 'apple' and '': 5

Example 4: Completely Different Strings

s1 = "developer"
s2 = "algorithm"
distance = levenshtein_distance(s1, s2)
print(f"Distance between '{s1}' and '{s2}': {distance}")
Distance between 'developer' and 'algorithm': 8

Example 5: Different Lengths, Some Similarity

s1 = "programming"
s2 = "programmer"
distance = levenshtein_distance(s1, s2)
print(f"Distance between '{s1}' and '{s2}': {distance}")
Distance between 'programming' and 'programmer': 2

(One deletion ‘i’, one substitution ‘g’ to ‘r’ or deletion ‘g’, insertion ‘r’)

Let’s use the print_dp_table helper for a smaller example to illustrate the matrix:

s1 = "ab"
s2 = "ac"
distance = levenshtein_distance(s1, s2) # This calls the original function
print(f"Distance between '{s1}' and '{s2}': {distance}")

# To visualize, we'd need to modify the function to return the dp table,
# or trace it mentally/manually as done in the theory section.
# Note: The provided `levenshtein_distance` function only returns the distance.
# To use `print_dp_table`, you'd need to change `levenshtein_distance`
# to return `dp` as well, or modify it to print internally.
# For example, if we modify `levenshtein_distance` to:
# def levenshtein_distance(...):
#     ...
#     return dp # return the whole table

# And then call:
# dp_table_result = levenshtein_distance_with_table(s1, s2)
# print_dp_table(s1, s2, dp_table_result)

For “ab” vs “ac”, the table would look like:

a c
0 1 2
a 1 0 1
b 2 1 1

The distance is 1 (substitute ‘b’ with ‘c’).

Practical Applications in More Detail

Simple Spell Checker

We can use Levenshtein distance to build a rudimentary spell checker. Given a misspelled word, we can compare it against a dictionary of correctly spelled words and suggest the one with the minimum edit distance.

def simple_spell_checker(word: str, dictionary: list[str], max_distance: int = 2) -> list[str]:
    """
    Suggests correctly spelled words from a dictionary based on Levenshtein distance.

    Args:
        word (str): The potentially misspelled word.
        dictionary (list[str]): A list of correctly spelled words.
        max_distance (int): Maximum edit distance to consider for suggestions.

    Returns:
        list[str]: A list of suggested words, sorted by distance.
    """
    suggestions = []
    for dict_word in dictionary:
        distance = levenshtein_distance(word, dict_word)
        if distance <= max_distance:
            suggestions.append((distance, dict_word))
    
    # Sort suggestions by distance (ascending)
    suggestions.sort()
    
    # Return just the words
    return [word for dist, word in suggestions]

# Example Usage:
common_words = ["apple", "apply", "appetite", "banana", "bandana", "aple"]

misspelled_word = "appl"
print(f"\nSuggestions for '{misspelled_word}':")
suggestions = simple_spell_checker(misspelled_word, common_words)
if suggestions:
    for s in suggestions:
        print(f"- {s}")
else:
    print("No close suggestions found.")

misspelled_word = "bannana"
print(f"\nSuggestions for '{misspelled_word}':")
suggestions = simple_spell_checker(misspelled_word, common_words, max_distance=3)
if suggestions:
    for s in suggestions:
        print(f"- {s}")
else:
    print("No close suggestions found.")
Suggestions for 'appl':
- apple
- apply

Suggestions for 'bannana':
- banana
- bandana

“Did you mean…?” for Search

Similar to spell checking, you can use edit distance to suggest alternative search queries.

def search_suggestion(query: str, available_terms: list[str], threshold: int = 2) -> list[str]:
    """
    Suggests similar search terms based on Levenshtein distance.

    Args:
        query (str): The user's search query.
        available_terms (list[str]): A list of valid search terms/keywords.
        threshold (int): Maximum edit distance for a suggestion to be considered.

    Returns:
        list[str]: A list of suggested terms, sorted by distance.
    """
    matches = []
    for term in available_terms:
        distance = levenshtein_distance(query.lower(), term.lower()) # Case-insensitive
        if distance <= threshold:
            matches.append((distance, term))
    
    matches.sort()
    return [term for dist, term in matches]

# Example Usage:
product_catalog = ["Laptop", "Desktop PC", "Monitor", "Keyboard", "Mouse", "Headphones", "Webcam"]

user_query = "laptoop"
print(f"\nFor query '{user_query}', did you mean:")
suggestions = search_suggestion(user_query, product_catalog)
if suggestions:
    for s in suggestions:
        print(f"- {s}")
else:
    print("No suggestions found.")

user_query = "moniter"
print(f"\nFor query '{user_query}', did you mean:")
suggestions = search_suggestion(user_query, product_catalog)
if suggestions:
    for s in suggestions:
        print(f"- {s}")
else:
    print("No suggestions found.")
For query 'laptoop', did you mean:
- Laptop

For query 'moniter', did you mean:
- Monitor

Performance Considerations

The Levenshtein algorithm, as implemented using dynamic programming, has:

  • Time Complexity: O(m * n) Where m and n are the lengths of the two strings. This is because we fill an (m+1) x (n+1) matrix, and each cell takes constant time to compute.

  • Space Complexity: O(m * n) This is due to storing the entire (m+1) x (n+1) DP matrix.

Note: For very long strings (e.g., comparing entire documents), the O(m*n) complexity can become a bottleneck, both in terms of computation time and memory usage.

Space Optimization

It’s possible to reduce the space complexity to O(min(m, n)). This is because to calculate the current row of the DP table, you only need the values from the previous row. You don’t need to store the entire m rows. You can typically optimize this by only keeping two rows (current and previous) or even just one row if you’re careful about update order.

For practical purposes, for string lengths up to a few thousand characters, the O(m*n) space approach is usually fine and easier to implement correctly. For very long sequences (like DNA), more advanced algorithms or specialized libraries that incorporate these space optimizations are typically used.

Variations & Related Concepts

While Levenshtein distance is prominent, it’s part of a larger family of string metrics:

  • Damerau-Levenshtein Distance: This is an extension of Levenshtein that also includes transposition (swapping two adjacent characters) as a single edit operation. For example, “acb” to “abc” would be 1 using Damerau-Levenshtein, but 2 (substitute c for b, then b for c, or delete c, insert b, delete b, insert c) using standard Levenshtein. It’s often more suitable for human typoes.
  • Hamming Distance: Only applicable to strings of equal length. It counts the number of positions at which the corresponding symbols are different. It only allows substitution.
  • Jaro-Winkler Distance: A similarity measure (higher value means more similar, unlike edit distance) designed for short strings like personal names. It’s often used in record linkage and de-duplication.
  • Longest Common Subsequence (LCS): While not a distance metric directly, it’s closely related. LCS finds the longest sequence of characters common to both strings, in the same order, but not necessarily contiguous. Edit distance can be derived from LCS.

Conclusion

Edit distance, particularly the Levenshtein algorithm, is a powerful and fundamental tool in a developer’s toolkit. By understanding its dynamic programming foundation, you can tackle a wide range of problems involving string comparison and similarity. While its O(m*n) complexity means it’s not always the fastest for extremely long strings, for common use cases like spell checking, search suggestions, and data cleaning, it remains an indispensable algorithm. Experiment with the provided code, modify it, and think about how you might apply this concept in your own projects!

Last updated on