Module unicode_segmentation

Splitting strings on grapheme cluster, word, and sentence boundaries.


unicode-segmentation provides iterators for splitting text according to the rules of Unicode Standard Annex #29 (UAX #29). It handles the complexity of segmenting Unicode text, where a single user-perceived character may be composed of multiple code points.

The crate's primary trait is UnicodeSegmentation, which provides methods for segmenting strings by grapheme clusters, words, and sentences. Grapheme clusters represent what users think of as single characters, which is essential for correctly counting characters, truncating strings, or implementing text editors.

Note that although Unicode segmentation is an important algorithm, most software does not need it. It is mostly used by GUI toolkits for laying out text and by software that needs to understand the human concepts of "words" and "sentences".

For modern background on Unicode units, see "Let's Stop Ascribing Meaning to Code Points" by Manish Goregaokar.

Examples

Count user-perceived characters correctly:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
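    // "Hello 👨‍👩‍👧‍👦!": the family emoji is four emoji code points
    // joined by three zero-width joiners (U+200D).
    let text = "Hello \u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}!";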

    // Wrong: counting bytes
    assert_eq!(text.len(), 32);

    // Wrong: counting codepoints
    assert_eq!(text.chars().count(), 14);

    // Correct: counting grapheme clusters
    assert_eq!(text.graphemes(true).count(), 8);
}

Split text into words:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "Hello, world! How are you?";

    let words: Vec<&str> = text.unicode_words().collect();
    assert_eq!(words, vec!["Hello", "world", "How", "are", "you"]);
}

Traits