This crate contains an assortment of utilities to deal with paths and their conversions.
Generally git treats paths as bytes, but inherently assumes well-formed UTF-8 as encoding on windows. Internally, it expects slashes to be used as path separators, and paths in files must have slashes, with conversions being performed on windows accordingly.
§Research
- windows
  - dirent.c contains (seemingly) the entire implementation of opening directories and reading their entries, along with all path conversions (UTF-16 for windows). This is done on the fly so git can work in UTF-8.
  - mingw is used for the conversion, and it appears to handle surrogates during the conversion, maybe with some sort of non-strict UTF-8 converter? Actually it uses WideCharToMultiByte under the hood, which by now does fail if the resulting UTF-8 would be invalid unicode, i.e. would contain unpaired surrogates.
  - OsString on windows already stores strings as WTF-8, which can encode surrogates, something that UTF-8 isn't allowed to do for security reasons. After all, surrogates are UTF-16 specific and exist only to extend the encodable code-points.
  - informative reading on WTF-8, the encoding used by Rust internally, which deals with surrogates and non-well-formed surrogates (those that aren't in pairs).
- unix
  - It uses opendir and readdir respectively. There is no encoding specified, except that these paths are null-terminated.
§Learnings
Surrogate pairs are a way to extend the encodable value range in UTF-16 encodings, used primarily on windows and in JavaScript. For a long time the code points that need surrogates, which are always to be used in pairs, were not assigned, until they were, for rare emojis and the like. The unicode standard does not require surrogates to occur in pairs, even though by now unpaired surrogates in UTF-16 are considered ill-formed and aren't supposed to be converted to UTF-8, for example.
This is the reason we have to deal with to_string_lossy(); it exists just for that quirk.
This also means the only platform ever eligible to see conversion errors is windows, and there it's only older pre-Vista windows versions which incorrectly allow ill-formed UTF-16 strings. Newer versions don't perform such conversions anymore: when going from UTF-16 to UTF-8, for example, they will trigger an error.
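The difference between strict and lossy handling of unpaired surrogates can be illustrated with the standard library alone. The following is a minimal sketch, not code from this crate. The helper names decode_strict, decode_lossy, ill_formed_sample and well_formed_sample are made up for illustration.

```rust
/// Strictly decode UTF-16. Returns `None` if the input is ill-formed,
/// for instance because it contains an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossily decode UTF-16. Each unpaired surrogate is replaced with
/// U+FFFD (the replacement character), so information is lost.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// "a", followed by a lone high surrogate (0xD800), followed by "b".
fn ill_formed_sample() -> Vec<u16> {
    vec![0x0061, 0xD800, 0x0062]
}

/// U+1F600 encoded as the surrogate pair 0xD83D 0xDE00.
fn well_formed_sample() -> Vec<u16> {
    vec![0xD83D, 0xDE00]
}
```

A proper pair decodes to a single code point, while the lone surrogate makes strict decoding fail and lossy decoding substitute U+FFFD. That substitution is the kind of silent change that a fallible conversion avoids.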
§Conclusions
Since WideCharToMultiByte by now is fixed (Vista onward) to produce valid UTF-8, lone surrogate codepoints will cause failure, which git doesn't care about.
We will, though, which means from now on we can just convert to UTF-8 on windows and bubble up errors where necessary, preventing potentially mismatched surrogate pairs from ever being saved on disk by gitoxide.
Even though the error only exists on older windows versions, we will represent it in the type system through fallible function calls.
Callers may .expect() on the result to indicate they don't wish to handle this special and rare case. Note that servers should never get into a code-path that panics, though.
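The pattern described above, a fallible conversion plus a panicking convenience wrapper, might look roughly like the sketch below. It uses only the standard library. The names NotUtf8, path_to_bytes and path_to_bytes_or_panic are hypothetical stand-ins, not this crate's actual types or signatures.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Hypothetical error type for this sketch; not the crate's `Utf8Error`.
#[derive(Debug, PartialEq)]
struct NotUtf8;

/// On unix, paths are arbitrary bytes, so the conversion cannot fail.
#[cfg(unix)]
fn path_to_bytes(path: &Path) -> Result<Cow<'_, [u8]>, NotUtf8> {
    use std::os::unix::ffi::OsStrExt;
    Ok(Cow::Borrowed(path.as_os_str().as_bytes()))
}

/// Elsewhere (notably windows), `to_str()` returns `None` if the path is
/// not valid unicode, e.g. due to unpaired surrogates, and we report that.
#[cfg(not(unix))]
fn path_to_bytes(path: &Path) -> Result<Cow<'_, [u8]>, NotUtf8> {
    path.to_str()
        .map(|s| Cow::Borrowed(s.as_bytes()))
        .ok_or(NotUtf8)
}

/// Convenience for callers that do not wish to handle the rare error case.
/// Panics where `path_to_bytes` would return an error.
fn path_to_bytes_or_panic(path: &Path) -> Cow<'_, [u8]> {
    path_to_bytes(path).expect("path is well-formed unicode")
}
```

Long-running processes such as servers would call the fallible variant and propagate the error, while short-lived tools may prefer the panicking one.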
Modules§
Structs§
- Utf8Error - The error type returned by into_bstr() and others that may suffer from failed conversions from or to bytes.
Functions§
- from_bstr - Similar to try_from_bstr(), but panics if malformed surrogates are encountered on windows.
- from_bstring - Similar to try_from_bstring(), but will panic if there is ill-formed UTF-8 in the input.
- from_byte_slice - Similar to try_from_byte_slice(), but will panic if there is ill-formed UTF-8 in the input.
- into_bstr - Similar to try_into_bstr(), but panics if malformed surrogates are encountered on windows.
- is_absolute - Return true if path is absolute, which depends on the platform but is always true if it starts with a slash, hence looks like a linux path.
- join_bstr_unix_pathsep - Join path to base such that they are separated with a /, i.e. base/path.
- normalize - Resolve relative components virtually without accessing the file system, e.g. turn a/./b/c/.././.. into a, without keeping intermediate .., and /a/../b/.. becomes /. If the input path was relative and ends up being the current_dir, . is returned instead of the full path to current_dir. Note that single . components as well as duplicate separators are left untouched.
- os_str_into_bstr - Like into_bstr(), but takes OsStr as input for a lossless, but fallible, conversion.
- os_string_into_bstring - Like into_bstr(), but takes OsString as input for a lossless, but fallible, conversion.
- realpath - Check each component of path and see if it is a symlink. If so, resolve it. Do not fail for non-existing components, but assume these are as is.
- realpath_opts - The same as realpath(), but allows configuring max_symlinks, i.e. how many symbolic links we are going to follow. This serves to avoid running into cycles or doing unreasonable amounts of work.
- relativize_with_prefix - Rebuild the worktree-relative relative_path to be relative to prefix, which is the worktree-relative path equivalent to the position of the user, or current working directory. This is a no-op if prefix is empty.
- to_native_path_on_windows - Convert paths with slashes to backslashes on windows and do nothing on unix, but panics if malformed surrogates are encountered on windows.
- to_native_separators - Assures the given bytes use the native path separator.
- to_unix_separators - Replaces windows path separators with slashes, unconditionally.
- to_unix_separators_on_windows - Replaces windows path separators with slashes, but only does so on windows.
- to_windows_separators - Find slashes and replace them with backslashes, which typically resembles a windows path, unconditionally.
- try_from_bstr - Similar to from_byte_slice(), but takes either borrowed or owned input.
- try_from_bstring - Similar to try_from_bstr(), but takes and produces owned data.
- try_from_byte_slice - Given input bytes, produce a Path from them, ignoring encoding entirely if on unix.
- try_into_bstr - Convert the given path either into its raw bytes on unix or its UTF-8 encoded counterpart on windows.
- try_os_str_into_bstr - Like into_bstr(), but takes Cow<OsStr> as input for a lossless, but fallible, conversion.