Module unicode_segmentation
Splitting strings on grapheme cluster, word, and sentence boundaries.
unicode-segmentation provides iterators for splitting text according to Unicode Standard Annex #29 rules.
It handles the complexity of dividing Unicode text into the units a reader perceives,
where a single user-perceived character may be composed of multiple code points.
The crate's primary trait is UnicodeSegmentation,
which provides methods for segmenting strings by grapheme clusters, words, and sentences.
Grapheme clusters represent what users think of as single characters,
which is essential for correctly counting characters,
truncating strings, or implementing text editors.
Note that while Unicode segmentation is a crucial algorithm, it is rarely the right tool for most software; it is mostly used by GUI toolkits for laying out text, or by software that needs to understand the human concepts of "words" and "sentences".
For modern background on Unicode units, see "Let's Stop Ascribing Meaning to Code Points" by Manish Goregaokar.
Examples
Count user-perceived characters correctly:
use unicode_segmentation::UnicodeSegmentation;
Split text into words:
use unicode_segmentation::UnicodeSegmentation;
Traits
- UnicodeSegmentation: Methods for segmenting strings according to Unicode Standard Annex #29.