Crate unicode_width

Determine displayed width of char and str types according to Unicode Standard Annex #11 and other portions of the Unicode standard. See the Rules for determining width section for the exact rules.

This crate is #![no_std].

use unicode_width::UnicodeWidthStr;

let teststr = "Ｈｅｌｌｏ, ｗｏｒｌｄ!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);

§`"cjk"` feature flag

This crate has one Cargo feature flag, "cjk" (enabled by default). It enables the UnicodeWidthChar::width_cjk and UnicodeWidthStr::width_cjk methods, which perform an alternate width calculation more suited to CJK contexts. The flag also unseals the UnicodeWidthChar and UnicodeWidthStr traits.

Disabling the flag (with default-features = false in Cargo.toml) will reduce the amount of static data needed by the crate.

use unicode_width::UnicodeWidthStr;

let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);

#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);

§Rules for determining width

This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.

In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
- The sequence "\r\n" has width 1.
- Emoji-specific ligatures:
  - Well-formed, fully-qualified emoji ZWJ sequences have width 2.
  - Emoji modifier sequences have width 2.
  - Emoji presentation sequences have width 2.
  - Outside of an East Asian context, text presentation sequences have width 1 if their base character:
    - Has the Emoji_Presentation property, and
    - Is not in the Enclosed Ideographic Supplement block.
- Script-specific ligatures:
  - For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
  - Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا‎, لآ‎, ڸا‎, لٟٞأ
  - Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗ‍ᨐ) has total width 1.
  - Hebrew: "א\u{200D}ל" (Alef-Lamed, א‍ל) has total width 1.
  - Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
  - Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
  - Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲‍𐰃) has total width 1.
  - Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
- In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
  - Has Canonical_Combining_Class greater than 1, or
  - Is a default-ignorable combining mark.
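A few of the sequence rules above can be sketched as standalone predicates. This is only an illustration under stated assumptions, not the crate's implementation: the function names are hypothetical, only three of the listed sequences are covered, and the sketch ignores the allowance for interleaved default-ignorable combining marks and ZERO WIDTH JOINERs.

```rust
/// Hypothetical helper: true if `c` may follow '\u{17D2}' to form a
/// zero-width Khmer coeng sign, using the ranges listed above.
fn is_khmer_coeng_base(c: char) -> bool {
    matches!(
        c,
        '\u{1780}'..='\u{1782}'
            | '\u{1784}'..='\u{1787}'
            | '\u{1789}'..='\u{178C}'
            | '\u{178E}'..='\u{1793}'
            | '\u{1795}'..='\u{1798}'
            | '\u{179B}'..='\u{179D}'
            | '\u{17A0}'
            | '\u{17A2}'
            | '\u{17A7}'
            | '\u{17AB}'..='\u{17AC}'
            | '\u{17AF}'
    )
}

/// Hypothetical helper: true if the two characters form a Lisu tone
/// letter combination as described above.
fn is_lisu_tone_pair(first: char, second: char) -> bool {
    matches!(first, '\u{A4F8}'..='\u{A4FB}') && matches!(second, '\u{A4FC}'..='\u{A4FD}')
}

/// Hypothetical helper: if `chars` starts with one of three sample
/// sequences, returns `(characters consumed, total width)`.
/// Everything else falls through to per-character widths (`None`).
fn sample_sequence_width(chars: &[char]) -> Option<(usize, usize)> {
    match chars {
        // "\r\n" has width 1.
        ['\r', '\n', ..] => Some((2, 1)),
        // Khmer coeng sign: '\u{17D2}' plus a listed consonant, width 0.
        ['\u{17D2}', c, ..] if is_khmer_coeng_base(*c) => Some((2, 0)),
        // Lisu tone letter combination, width 1.
        [a, b, ..] if is_lisu_tone_pair(*a, *b) => Some((2, 1)),
        _ => None,
    }
}
```

A string-width routine built along these lines would try the sequence rules first at each position and only fall back to summing character widths when none of them matches.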
In all other cases, the width of the string equals the sum of its character widths:
1. '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
2. '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
3. '\u{17D8}' KHMER SIGN BEYYAL has width 3.
4. The following have width 0:
  - Characters with the Default_Ignorable_Code_Point property.
  - Characters with the Grapheme_Extend property.
  - The following 8 characters, all of which have NFD decompositions consisting of two Grapheme_Extend characters:
  - Characters with a Hangul_Syllable_Type of Vowel_Jamo (V) or Trailing_Jamo (T).
  - The following Prepended_Concatenation_Marks:
  - Characters with Grapheme_Cluster_Break=Prepend that are not also Prepended_Concatenation_Marks.
  - '\u{A8FA}' DEVANAGARI CARET.
5. Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
6. Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
  - Has an East_Asian_Width of Ambiguous, or has a canonical decomposition to an Ambiguous character followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY, or is '\u{0387}' GREEK ANO TELEIA, and
  - Does not have a General_Category of Letter or Modifier_Symbol.
7. All other characters have width 1.
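The precedence order of the per-character rules can be illustrated with a toy model. This is a minimal sketch, not the crate's implementation: the names toy_char_width and the three *_sample helpers are hypothetical, and the Unicode property lookups in rules 4 to 6 are replaced by small hand-picked samples of code points that have those properties.

```rust
/// Hand-picked sample of zero-width characters (rule 4); an assumption
/// standing in for real property tables.
fn is_zero_width_sample(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'     // combining diacritical marks
            | '\u{200D}'            // ZERO WIDTH JOINER
            | '\u{1160}'..='\u{11FF}' // Hangul vowel and trailing jamo
            | '\u{A8FA}'            // DEVANAGARI CARET
    )
}

/// Hand-picked sample of Fullwidth and Wide characters (rule 5).
fn is_wide_sample(c: char) -> bool {
    matches!(c, '\u{FF01}'..='\u{FF60}' | '\u{4E00}'..='\u{9FFF}')
}

/// Hand-picked sample of characters covered by rule 6.
fn is_ambiguous_sample(c: char) -> bool {
    matches!(c, '\u{201C}' | '\u{201D}' | '\u{0387}')
}

/// Toy per-character width; match arms are tried top to bottom, which
/// mirrors the decreasing precedence of rules 1 through 7.
fn toy_char_width(c: char, east_asian: bool) -> usize {
    match c {
        '\u{2D7F}' => 1,                   // rule 1
        '\u{115F}' | '\u{17A4}' => 2,      // rule 2
        '\u{17D8}' => 3,                   // rule 3
        _ if is_zero_width_sample(c) => 0, // rule 4
        _ if is_wide_sample(c) => 2,       // rule 5
        _ if is_ambiguous_sample(c) => {
            // rule 6: depends on East Asian context
            if east_asian { 2 } else { 1 }
        }
        _ => 1,                            // rule 7
    }
}
```

With these samples, toy_char_width('“', false) gives 1 and toy_char_width('“', true) gives 2, which is the same East Asian distinction that separates width from width_cjk for the quotation marks in the "cjk" example earlier on this page.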

§Canonical equivalence

Canonically equivalent strings are assigned the same width (CJK and non-CJK).

Constants§

UNICODE_VERSION: The version of Unicode that this version of unicode-width is based on.

Traits§

UnicodeWidthChar: Methods for determining displayed width of Unicode characters.
UnicodeWidthStr: Methods for determining displayed width of Unicode strings.