Crate unicode_width

Determine displayed width of char and str types according to Unicode Standard Annex #11 and other portions of the Unicode standard. See the Rules for determining width section for the exact rules.

This crate is #![no_std].

use unicode_width::UnicodeWidthStr;

let teststr = "Hello, world!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);

§"cjk" feature flag

This crate has one Cargo feature flag, "cjk" (enabled by default). It enables the UnicodeWidthChar::width_cjk and UnicodeWidthStr::width_cjk methods, which perform an alternate width calculation better suited to CJK contexts. The flag also unseals the UnicodeWidthChar and UnicodeWidthStr traits.

Disabling the flag (by setting default-features = false for this dependency in Cargo.toml) reduces the amount of static data the crate needs.

use unicode_width::UnicodeWidthStr;

let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);

#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);

§Rules for determining width

This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.

  1. In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
    • The sequence "\r\n" has width 1.
    • Emoji-specific ligatures:
    • Script-specific ligatures:
      • For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
      • Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا‎, لآ‎, ڸا‎, لٟٞأ
      • Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗ‍ᨐ) has total width 1.
      • Hebrew: "א\u{200D}ל" (Alef-Lamed, א‍ל) has total width 1.
      • Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
      • Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
      • Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲‍𐰃) has total width 1.
      • Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
    • In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
  2. In all other cases, the width of the string equals the sum of its character widths:
    1. '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
    2. '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
    3. '\u{17D8}' KHMER SIGN BEYYAL has width 3.
    4. The following have width 0:
    5. Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
    6. Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
    7. All other characters have width 1.
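
To illustrate how the precedence ordering above plays out, here is a minimal sketch that does not use this crate at all. It covers only a handful of the listed rules: the "\r\n" sequence, the Lisu tone letter pairs, the Khmer coeng signs, and the fixed widths for '\u{17D8}', '\u{115F}', and '\u{17A4}'. The names sketch_width and is_coeng_target are hypothetical, and treating every remaining character as width 1 is a deliberate simplification: it ignores East_Asian_Width, zero-width characters, default-ignorable marks inside ligatures, and every other rule, so it is not how the crate computes widths.

```rust
/// Hypothetical helper: true for the Khmer characters that the rules above
/// list as forming a zero-width coeng sign when preceded by U+17D2.
fn is_coeng_target(c: char) -> bool {
    matches!(
        c,
        '\u{1780}'..='\u{1782}'
            | '\u{1784}'..='\u{1787}'
            | '\u{1789}'..='\u{178C}'
            | '\u{178E}'..='\u{1793}'
            | '\u{1795}'..='\u{1798}'
            | '\u{179B}'..='\u{179D}'
            | '\u{17A0}'
            | '\u{17A2}'
            | '\u{17A7}'
            | '\u{17AB}'..='\u{17AC}'
            | '\u{17AF}'
    )
}

/// Illustrative only: implements a few of the documented rules and falls
/// back to width 1 for everything else (a simplification, see above).
fn sketch_width(s: &str) -> usize {
    let chars: Vec<char> = s.chars().collect();
    let mut total = 0;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let next = chars.get(i + 1).copied();

        // Rule 1: string-level sequences take precedence over
        // per-character widths, so they are checked first.
        match (c, next) {
            // "\r\n" has width 1.
            ('\r', Some('\n')) => {
                total += 1;
                i += 2;
                continue;
            }
            // Lisu tone letter combination has width 1.
            ('\u{A4F8}'..='\u{A4FB}', Some('\u{A4FC}'..='\u{A4FD}')) => {
                total += 1;
                i += 2;
                continue;
            }
            // Khmer coeng sign has width 0, so nothing is added.
            ('\u{17D2}', Some(n)) if is_coeng_target(n) => {
                i += 2;
                continue;
            }
            _ => {}
        }

        // Rule 2: otherwise, add the width of the single character.
        total += match c {
            '\u{17D8}' => 3,              // KHMER SIGN BEYYAL
            '\u{115F}' | '\u{17A4}' => 2, // HANGUL CHOSEONG FILLER, KHMER QAA
            _ => 1,                       // simplification, not the crate's rule
        };
        i += 1;
    }
    total
}
```

Under these assumptions, sketch_width("a\r\nb") evaluates to 3 and sketch_width("\u{A4F9}\u{A4FC}") evaluates to 1, because the two-character sequences are matched before any per-character width is added.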

§Canonical equivalence

Canonically equivalent strings are assigned the same width (CJK and non-CJK).