this post was submitted on 23 Jun 2025
10 points (100.0% liked)

PHP

704 readers
1 users here now

Welcome to /c/php! This is a community for PHP developers and enthusiasts to share and discuss anything related to PHP. From the latest updates and tutorials, to your burning questions and amazing personal projects, we welcome all contributions.

Let's foster an environment of respect, learning, and mutual growth. Whether you're an experienced PHP developer, a beginner, or just interested in learning more about PHP, we're glad to have you here!

Let's code, learn, and grow together!

founded 2 years ago
MODERATORS
 

cross-posted from: https://chrastecky.dev/post/15

PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.

If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

For example, the following code returns 2 instead of the correct result, 1:

var_dump(levenshtein('göthe', 'gothe'));

There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.

With the new grapheme_levenshtein() function in PHP 8.5, the code above now correctly returns 1.

Grapheme-Based Comparison

What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:

$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";

Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0 — meaning no difference.

This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.

Just for fun: what do you think the original levenshtein() function will return for the example above?

var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));
no comments (yet)
sorted by: hot top controversial new old
there doesn't seem to be anything here