Crate unicode_width
Determine displayed width of char and str types according to Unicode Standard Annex #11 and other portions of the Unicode standard. See the Rules for determining width section for the exact rules.
This crate is #![no_std].
use unicode_width::UnicodeWidthStr;
let teststr = "Hello, world!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);
§"cjk" feature flag
This crate has one Cargo feature flag, "cjk" (enabled by default). It enables the UnicodeWidthChar::width_cjk and UnicodeWidthStr::width_cjk methods, which perform an alternate width calculation more suited to CJK contexts. The flag also unseals the UnicodeWidthChar and UnicodeWidthStr traits.
Disabling the flag (with default-features = false in Cargo.toml) will reduce the amount of static data needed by the crate.
use unicode_width::UnicodeWidthStr;
let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);
#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);
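The arithmetic behind the 4 and 6 above can be reproduced with a small std-only sketch. It assumes that the two quotation marks are East Asian Ambiguous (width 1 normally, 2 in a CJK context) and that the middle character is Wide (width 2 in both). ToyClass, toy_class, and toy_width are hypothetical names used only for illustration; they are not part of this crate, and the hard-coded table covers only these three characters.

```rust
/// Hypothetical width classes for illustration; not the crate's internals.
enum ToyClass {
    Narrow,
    Wide,
    Ambiguous,
}

/// Classify only the characters used in the example string (assumed classes).
fn toy_class(c: char) -> ToyClass {
    match c {
        '“' | '”' => ToyClass::Ambiguous,
        '𘀀' => ToyClass::Wide,
        _ => ToyClass::Narrow,
    }
}

/// Sum per-character widths; ambiguous characters widen only when `cjk` is set.
fn toy_width(s: &str, cjk: bool) -> usize {
    s.chars()
        .map(|c| match toy_class(c) {
            ToyClass::Narrow => 1,
            ToyClass::Wide => 2,
            ToyClass::Ambiguous => {
                if cjk {
                    2
                } else {
                    1
                }
            }
        })
        .sum()
}

fn main() {
    let teststr = "“𘀀”";
    // 1 + 2 + 1 outside a CJK context, 2 + 2 + 2 inside one.
    assert_eq!(toy_width(teststr, false), 4);
    assert_eq!(toy_width(teststr, true), 6);
}
```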
§Rules for determining width
This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.
- In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
  - The sequence "\r\n" has width 1.
  - Emoji-specific ligatures:
    - Well-formed, fully-qualified emoji ZWJ sequences have width 2.
    - Emoji modifier sequences have width 2.
    - Emoji presentation sequences have width 2.
    - Outside of an East Asian context, text presentation sequences have width 1 if their base character:
      - Has the Emoji_Presentation property, and
      - Is not in the Enclosed Ideographic Supplement block.
  - Script-specific ligatures:
    - For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
    - Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا, لآ, ڸا, لٟٞأ
    - Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗᨐ) has total width 1.
    - Hebrew: "א\u{200D}ל" (Alef-Lamed, אל) has total width 1.
    - Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
    - Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
    - Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲𐰃) has total width 1.
    - Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
  - In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
    - Has Canonical_Combining_Class greater than 1, or
    - Is a default-ignorable combining mark.
- In all other cases, the width of the string equals the sum of its character widths:
  - '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
  - '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
  - '\u{17D8}' KHMER SIGN BEYYAL has width 3.
  - The following have width 0:
    - Characters with the Default_Ignorable_Code_Point property.
    - Characters with the Grapheme_Extend property.
    - The following 8 characters, all of which have NFD decompositions consisting of two Grapheme_Extend characters: '\u{0CC0}' KANNADA VOWEL SIGN II, '\u{0CC7}' KANNADA VOWEL SIGN EE, '\u{0CC8}' KANNADA VOWEL SIGN AI, '\u{0CCA}' KANNADA VOWEL SIGN O, '\u{0CCB}' KANNADA VOWEL SIGN OO, '\u{1B3B}' BALINESE VOWEL SIGN RA REPA TEDUNG, '\u{1B3D}' BALINESE VOWEL SIGN LA LENGA TEDUNG, and '\u{1B43}' BALINESE VOWEL SIGN PEPET TEDUNG.
    - Characters with a Hangul_Syllable_Type of Vowel_Jamo (V) or Trailing_Jamo (T).
    - The following Prepended_Concatenation_Marks:
    - Characters with the Grapheme_Extend=Prepend property, that are not also Prepended_Concatenation_Marks.
    - '\u{A8FA}' DEVANAGARI CARET.
  - Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
  - Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
    - Has an East_Asian_Width of Ambiguous, or has a canonical decomposition to an Ambiguous character followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY, or is '\u{0387}' GREEK ANO TELEIA, and
    - Does not have a General_Category of Letter or Modifier_Symbol.
  - All other characters have width 1.
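To make the precedence order concrete, here is a deliberately tiny std-only sketch: string-level rules are tried first, and only if none matches does the per-character width apply. It is not this crate's implementation. The names toy_char_width and toy_str_width are hypothetical, the table covers only code points the rules above name explicitly, and everything else falls through to width 1. It also ignores property-based rules, emoji sequences, East Asian contexts, and the allowance for default-ignorable marks inside ligatures.

```rust
/// Toy per-character width, limited to code points named explicitly in the rules.
fn toy_char_width(c: char) -> usize {
    match c {
        // TIFINAGH CONSONANT JOINER outside of a ligature.
        '\u{2D7F}' => 1,
        // HANGUL CHOSEONG FILLER and KHMER INDEPENDENT VOWEL QAA.
        '\u{115F}' | '\u{17A4}' => 2,
        // KHMER SIGN BEYYAL.
        '\u{17D8}' => 3,
        // The 8 listed Kannada and Balinese vowel signs, plus DEVANAGARI CARET.
        '\u{0CC0}' | '\u{0CC7}' | '\u{0CC8}' | '\u{0CCA}' | '\u{0CCB}' => 0,
        '\u{1B3B}' | '\u{1B3D}' | '\u{1B43}' | '\u{A8FA}' => 0,
        // Simplification: every other character is treated as width 1.
        _ => 1,
    }
}

/// Toy string width: two-character sequence rules win over per-character sums.
fn toy_str_width(s: &str) -> usize {
    let mut total = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        let pair_width = match (c, chars.peek().copied()) {
            // The sequence "\r\n" has width 1.
            ('\r', Some('\n')) => Some(1),
            // Lisu tone letter combination has width 1.
            ('\u{A4F8}'..='\u{A4FB}', Some('\u{A4FC}'..='\u{A4FD}')) => Some(1),
            _ => None,
        };
        match pair_width {
            Some(w) => {
                chars.next(); // consume the second character of the pair
                total += w;
            }
            None => total += toy_char_width(c),
        }
    }
    total
}

fn main() {
    assert_eq!(toy_str_width("a\r\nb"), 3);
    assert_eq!(toy_str_width("\u{A4F9}\u{A4FC}"), 1);
    assert_eq!(toy_str_width("\u{17D8}\u{115F}"), 5);
}
```

Checking the sequence rules before falling back to the per-character table mirrors the "decreasing precedence" ordering of the list; a fuller version would need Unicode property data rather than a hand-written match.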
§Canonical equivalence
Canonically equivalent strings are assigned the same width (CJK and non-CJK).
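One way to see why this can hold is a precomposed letter and its decomposed form. U+00E9 decomposes canonically to "e" followed by U+0301 COMBINING ACUTE ACCENT; if the combining mark has width 0, both spellings sum to 1. The sketch below is std-only and purely illustrative: sketch_char_width and sketch_str_width are hypothetical helpers, and treating the whole Combining Diacritical Marks block as width 0 is a simplifying assumption standing in for the Grapheme_Extend rule.

```rust
/// Hypothetical helper: width 0 for U+0300..=U+036F, width 1 for everything else.
fn sketch_char_width(c: char) -> usize {
    match c {
        '\u{0300}'..='\u{036F}' => 0,
        _ => 1,
    }
}

/// Sum the per-character widths of a string.
fn sketch_str_width(s: &str) -> usize {
    s.chars().map(sketch_char_width).sum()
}

fn main() {
    let precomposed = "\u{00E9}"; // single code point
    let decomposed = "e\u{0301}"; // base letter plus combining mark
    // The two strings differ as code point sequences...
    assert_ne!(precomposed, decomposed);
    // ...but the sketch assigns them the same width.
    assert_eq!(sketch_str_width(precomposed), sketch_str_width(decomposed));
}
```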
Constants§
- The version of Unicode that this version of unicode-width is based on.
Traits§
- Methods for determining displayed width of Unicode characters.
- Methods for determining displayed width of Unicode strings.