Crate icu_segmenter
Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
This module contains segmenter implementations for the following rules:
- Line segmenter that is compatible with Unicode Standard Annex #14, Unicode Line Breaking Algorithm, with options to tailor line-breaking behavior for CSS line-break and word-break properties.
- Grapheme cluster segmenter, word segmenter, and sentence segmenter that are compatible with Unicode Standard Annex #29, Unicode Text Segmentation.
§Examples
§Line Break
Find line break opportunities:
use icu::segmenter::LineSegmenter;
let segmenter = LineSegmenter::new_auto();
let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);
See LineSegmenter for more examples.
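The breakpoints in this example are byte offsets into the UTF-8 input (the last one, 36, is the byte length of the string), so each pair of consecutive offsets delimits one segment. A minimal sketch, using only the standard library, of turning those offsets into string slices; the `segments` helper is a hypothetical name for illustration and is not part of icu_segmenter:

```rust
/// Hypothetical helper (not part of icu_segmenter): returns the slices of
/// `text` delimited by each pair of consecutive breakpoints.
/// Panics if an offset is out of range or not on a `char` boundary.
fn segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    // Same text as the example; accented letters are written as escapes so
    // the precomposed code points are unambiguous.
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Breakpoints copied from the line break example.
    let breakpoints = [0, 6, 13, 17, 23, 29, 36];

    let parts = segments(text, &breakpoints);
    assert_eq!(parts.len(), 6);
    assert_eq!(parts[0], "Hello ");
    assert_eq!(parts[1], "World. ");
    // The segments cover the input without gaps or overlap.
    assert_eq!(parts.concat(), text);

    for part in &parts {
        println!("{part:?}");
    }
}
```

Note that each segment keeps its trailing space, because a line break opportunity falls after the whitespace rather than before it.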
§Grapheme Cluster Break
Find all grapheme cluster boundaries:
use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();
let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(
    &breakpoints,
    &[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
    ]
);
See GraphemeClusterSegmenter for more examples.
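The gaps in the list above (19 to 21, 25 to 28, 31 to 34) appear because the offsets count UTF-8 bytes and the accented letters occupy two or three bytes each. A standard-library sketch that makes this visible by listing the `char` boundaries of the same text; `char_boundaries` is a hypothetical helper. It matches the grapheme breakpoints here only because every grapheme cluster in this string is a single precomposed code point, so it is not a substitute for the segmenter:

```rust
/// Hypothetical helper: byte offsets at which each code point starts,
/// followed by the byte length of the string.
fn char_boundaries(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(offset, _)| offset)
        .chain(std::iter::once(text.len()))
        .collect()
}

fn main() {
    // Precomposed forms: U+00E0 (2 bytes), U+1EBF and U+1EDB (3 bytes each).
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    let boundaries = char_boundaries(text);

    // 31 code points give 32 boundaries, but the string is 36 bytes long.
    assert_eq!(boundaries.len(), 32);
    assert_eq!(text.len(), 36);

    // U+00E0 starts at byte 19 and spans two bytes, so 20 is not a boundary.
    assert!(boundaries.contains(&19) && !boundaries.contains(&20));
    // U+1EBF starts at byte 25 and spans three bytes.
    assert!(!boundaries.contains(&26) && !boundaries.contains(&27));
}
```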
§Word Break
Find all word boundaries:
use icu::segmenter::WordSegmenter;
let segmenter = WordSegmenter::new_auto();
let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(
    &breakpoints,
    &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
);
See WordSegmenter for more examples.
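Word boundaries also enclose the spaces and punctuation between words, so the 14 breakpoints above delimit 13 segments, of which only 6 are words. The crate reports a word type through WordBreakIterator::word_type() for telling these apart; the sketch below is only a rough standard-library approximation that keeps segments containing at least one letter or digit. The `word_like_segments` helper is a hypothetical name, not a crate API:

```rust
/// Hypothetical helper: slices `text` at the given breakpoints and keeps
/// only the segments that contain at least one alphanumeric character.
fn word_like_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}

fn main() {
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Breakpoints copied from the word break example.
    let breakpoints = [0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36];

    let words = word_like_segments(text, &breakpoints);
    assert_eq!(words.len(), 6);
    assert_eq!(words[0], "Hello");
    assert_eq!(words[1], "World");
    assert_eq!(words[2], "Xin");
}
```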
§Sentence Break
Segment the string into sentences:
use icu::segmenter::SentenceSegmenter;
let segmenter = SentenceSegmenter::new();
let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(&breakpoints, &[0, 13, 36]);
See SentenceSegmenter for more examples.
Re-exports§
pub use SegmenterError as Error;
Modules§
- 🚧 [Unstable] Data provider struct definitions for this ICU4X component.
Structs§
- Implements the Iterator trait over the grapheme cluster boundaries of the given string.
- Segments a string into grapheme clusters.
- Implements the Iterator trait over the line break opportunities of the given string.
- Options to tailor line-breaking behavior.
- Supports loading line break data, and creating line break iterators for different string encodings.
- Implements the Iterator trait over the sentence boundaries of the given string.
- Supports loading sentence break data, and creating sentence break iterators for different string encodings.
- Implements the Iterator trait over the word boundaries of the given string.
- Supports loading word break data, and creating word break iterators for different string encodings.
Enums§
- An enum that specifies the strictness of line-breaking rules. It can be passed as an argument when creating a line segmenter.
- An enum that specifies the line break opportunities between letters. It can be passed as an argument when creating a line segmenter.
- A list of error outcomes for various operations in this module.
- The word type tag that is returned by WordBreakIterator::word_type().
Type Aliases§
- Grapheme cluster break iterator for a Latin-1 (8-bit) string.
- Grapheme cluster break iterator for a potentially invalid UTF-8 string.
- Grapheme cluster break iterator for an str (a UTF-8 string).
- Grapheme cluster break iterator for a UTF-16 string.
- Line break iterator for a Latin-1 (8-bit) string.
- Line break iterator for a potentially invalid UTF-8 string.
- Line break iterator for an str (a UTF-8 string).
- Line break iterator for a UTF-16 string.
- Sentence break iterator for a Latin-1 (8-bit) string.
- Sentence break iterator for a potentially invalid UTF-8 string.
- Sentence break iterator for an str (a UTF-8 string).
- Sentence break iterator for a UTF-16 string.
- Word break iterator for a Latin-1 (8-bit) string.
- Word break iterator for a potentially invalid UTF-8 string.
- Word break iterator for an str (a UTF-8 string).
- Word break iterator for a UTF-16 string.
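Because there is a separate iterator per encoding, the same text can yield different breakpoint values depending on how it is stored. The sketch below assumes that offsets are counted in the code units of the input encoding (bytes for UTF-8, 16-bit units for UTF-16); that is an assumption here, not something this page states. It uses only the standard library to convert the UTF-8 line breakpoints from the examples into UTF-16 offsets, and `utf16_offset` is a hypothetical helper:

```rust
/// Hypothetical helper: converts a UTF-8 byte offset in `text` to the
/// equivalent offset counted in UTF-16 code units.
/// Panics if `byte_offset` is not on a `char` boundary.
fn utf16_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].encode_utf16().count()
}

fn main() {
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Line breakpoints from the UTF-8 example, in bytes.
    let utf8_breakpoints = [0, 6, 13, 17, 23, 29, 36];

    let utf16_breakpoints: Vec<usize> = utf8_breakpoints
        .iter()
        .map(|&offset| utf16_offset(text, offset))
        .collect();

    // Each accented letter is one UTF-16 code unit but 2 or 3 UTF-8 bytes,
    // so the later offsets shrink.
    assert_eq!(utf16_breakpoints, [0, 6, 13, 17, 22, 26, 31]);
    assert_eq!(text.encode_utf16().count(), 31);
}
```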