pub struct Options<R: Replace> {
    pub length_limit: usize,
    pub reserve_extra: usize,
    pub extension_cleverness: bool,
    pub most_fs_safe: bool,
    pub windows_safe: bool,
    pub url_safe: bool,
    pub normalise_whitespace: bool,
    pub trim_spaces_and_full_stops: bool,
    pub trim_more_punctuation: bool,
    pub remove_control_characters: bool,
    pub remove_reordering_characters: bool,
    pub replace_with: R,
    pub collapse_replacements: bool,
    pub six_measures_of_barley: &'static str,
}
Sanitisation options. Defaults marked on each field.
Take a look around, but I think everything’s pretty sane by default; the ones I think you’re most likely to want to change are url_safe and windows_safe, though replace_with, collapse_replacements and six_measures_of_barley can be interesting too for yielding prettier results.
If you set length_limit to usize::MAX, all the bool fields to false, and six_measures_of_barley to an empty string, sanitise will not alter the input string in any way. But that would be a rather expensive alternative to .clone(). In practice, I doubt you ever want to disable most_fs_safe, which is a good baseline.
Fields
length_limit: usize
Limit the complete file name to this many UTF-8 code units. The default is 255, which is suitable for all practical platforms.
(Some file systems limit lengths in UTF-8 code units and some in UTF-16 code units, but UTF-16 never takes more code units than UTF-8 to encode a given Unicode string, so we can ignore it.)
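To see why the UTF-16 case can be ignored, here is a small sketch using only the standard library that counts both kinds of code units for a few names. The helper name `code_unit_counts` and the sample names are illustrative assumptions, not part of the crate.

```rust
/// Returns (UTF-8 code units, UTF-16 code units) for a string.
/// Hypothetical helper for illustration only.
fn code_unit_counts(s: &str) -> (usize, usize) {
    (s.len(), s.encode_utf16().count())
}

fn main() {
    for name in ["report.txt", "fïle_ñamê", "日本語.txt", "😀.png"] {
        let (utf8, utf16) = code_unit_counts(name);
        // UTF-16 never needs more code units than UTF-8 for the same string,
        // so a limit that holds in UTF-8 code units also holds in UTF-16.
        assert!(utf16 <= utf8);
        println!("{name:?}: {utf8} UTF-8 units, {utf16} UTF-16 units");
    }
}
```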
Reasons you might want to reduce it:
- You haven’t appended the extension yet, and so want to subtract the extension’s length. (In that case I suggest writing Options::DEFAULT.length_limit instead of hard coding 255; that’ll work in const context.)
- You want smoother Windows support, for on Windows some things start falling over if the total path length is greater than 260 characters; so measuring or estimating the path length could potentially be useful. But unless you know, probably don’t worry too much: someone’ll probably drop it deep in a node_modules tree at some point and then you’ll be in trouble anyway. 😀
One other mildly significant note here: if you care about Apple’s pre-2017 HFS+ file system, you should perform Unicode normalisation to NFD (most likely via the unicode-normalization crate) before performing sanitisation, because the decomposed form may be longer; if you don’t, then the path will be normalised to NFD by the file system when you try to write it, which could take it over 255 and make it fail. I don’t think there are any popular file systems that normalise any more, though APFS kinda prefers NFC, so you might want to normalise to NFC. I do not know if normalising to NFC will ever lengthen a UTF-8 string, but the spec allows it to (UAX #15, goal 3.2).
The minimum permitted value is 10, for reasons of implementation convenience and because I don’t think there’s any legitimate use case for a smaller value. If you provide a value less than ten, you’ll get an empty string back every time.
Truncations are performed at char granularity (Unicode scalar value), which means that extended grapheme clusters could be broken. This could change in the future (it’ll be an optional dependency on unicode-segmentation), but for now it was just too much thought. If I ever implement this, I’ll probably ditch the minimum value of 10 too.
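The difference between char granularity and grapheme-cluster granularity can be sketched with the standard library alone. The helper `truncate_at_char_boundary` below is hypothetical and is not the crate’s implementation; it only shows how a cut can keep valid UTF-8 while still splitting a grapheme cluster.

```rust
/// Truncate `s` to at most `limit` bytes without splitting a `char`.
/// Hypothetical helper; not the crate's actual code.
fn truncate_at_char_boundary(s: &str, limit: usize) -> &str {
    if s.len() <= limit {
        return s;
    }
    let mut end = limit;
    // Walk backwards until `end` lands on a char boundary
    // (index 0 is always a boundary, so this terminates).
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme cluster, two chars, three bytes.
    let name = "cafe\u{301}";
    // A 4-byte limit keeps valid UTF-8 but drops the accent,
    // breaking the grapheme cluster.
    assert_eq!(truncate_at_char_boundary(name, 4), "cafe");
    // A 5-byte limit falls in the middle of U+0301, so it backs off to 4.
    assert_eq!(truncate_at_char_boundary(name, 5), "cafe");
    // A 6-byte limit fits the whole name.
    assert_eq!(truncate_at_char_boundary(name, 6), name);
}
```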
reserve_extra: usize
When allocating the string (since it allocates as small a string as possible), reserve at least this many extra bytes. This is good for efficiency when you append the extension after sanitisation (in which case, also disable extension_cleverness). Default 0.
extension_cleverness: bool
Make other options try to be clever about a file extension in the input. Default true.
Specifically, if a file extension is detected (done by looking for the last full stop in the name, and splitting at that point into base name and extension):
- length_limit will try to keep the extension intact, truncating the base name rather than the extension. “Try”, because if the extension is longer than six code units less than the length limit, it will be deemed unsalvageable. (Why six? The base name must retain at least one character, so for convenience that’s four UTF-8 code units, plus one more for the dot, and if windows_safe is on, the longest reserved name causes a five code unit base name like LPT1_, and ridiculously long extensions are a corner case anyway so I decided to just call it a day at six. If I subsequently implement grapheme-cluster-aware truncation, this six will increase if the first grapheme cluster in the base name is more than five code units long.) An unsalvageable extension is the only case where sanitisation may take two steps to quiesce, rather than one: if the extension is entirely truncated and the base name contains a dot which in a subsequent run will be interpreted as the extension separator, trimming will happen around it on that subsequent run but not the first.
- windows_safe will detect reserved names with extensions.
- trim_spaces_and_full_stops and trim_more_punctuation will trim those characters from the end of the base name and the start of the extension, in addition to the start and end of the full name. (Expressed otherwise, the base name and extension will be trimmed independently.)
If you’re appending the extension after sanitisation, you should turn this to false.
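The interaction between extension detection and the length limit might look roughly like the following. This is a simplified sketch under stated assumptions: the function names are hypothetical, the extension length is measured without its dot, and the crate’s real logic (reserved names, trimming, the minimum limit of 10) is left out.

```rust
/// Cut `s` to at most `limit` bytes on a char boundary (hypothetical helper).
fn truncate(s: &str, limit: usize) -> &str {
    let mut end = limit.min(s.len());
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Truncate to `limit` bytes, keeping the extension when it is salvageable.
fn truncate_keeping_extension(name: &str, limit: usize) -> String {
    // The extension is whatever follows the last full stop.
    match name.rsplit_once('.') {
        // Salvageable: the extension is at most six code units
        // less than the length limit.
        Some((base, ext)) if ext.len() + 6 <= limit => {
            let base_budget = limit - ext.len() - 1; // 1 for the dot
            format!("{}.{}", truncate(base, base_budget), ext)
        }
        // No extension, or an unsalvageable one: cut the whole name.
        _ => truncate(name, limit).to_string(),
    }
}

fn main() {
    // The base name is shortened; the extension survives.
    assert_eq!(
        truncate_keeping_extension("a-very-long-base-name.txt", 12),
        "a-very-l.txt"
    );
    // A ridiculously long extension is simply cut off.
    assert_eq!(
        truncate_keeping_extension("x.averyveryverylongextension", 12),
        "x.averyveryv"
    );
}
```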
most_fs_safe: bool
Remove characters that are not safe on just about any file system. Default true, and if you actually want to disable it you’re probably using the wrong crate.
This plus length_limit is enough to satisfy most platforms other than Windows, though cleaning somewhat more is probably a good idea.
Characters removed:
- / (slash)
- ␀ (null, character zero)
Also disallows names comprising exclusively dots (".", "..", "...", &c.), NOT using replace_with on them but yielding an empty string.
This is a tiny subset of windows_safe.
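As a rough illustration of the two rules above, here is a standard-library sketch. The function `most_fs_safe_sketch` is hypothetical, and modelling the replacement as an `Option<char>` where `None` means plain removal is an assumption; the order in which the crate applies these steps may differ.

```rust
/// Replace '/' and NUL, then reject names made only of dots.
/// Hypothetical sketch; not the crate's implementation.
fn most_fs_safe_sketch(name: &str, replace_with: Option<char>) -> String {
    let cleaned: String = name
        .chars()
        .filter_map(|c| match c {
            // `None` drops the character, `Some(r)` substitutes it.
            '/' | '\0' => replace_with,
            other => Some(other),
        })
        .collect();
    // Names comprising exclusively dots yield an empty string;
    // the replacement is deliberately not used here.
    if !cleaned.is_empty() && cleaned.chars().all(|c| c == '.') {
        String::new()
    } else {
        cleaned
    }
}

fn main() {
    assert_eq!(most_fs_safe_sketch("a/b\0c", Some('_')), "a_b_c");
    assert_eq!(most_fs_safe_sketch("...", Some('_')), "");
}
```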
windows_safe: bool
Ensure the file name is safe on Windows. Default true.
- These characters are removed (and replace_with employed):
  - < (less than)
  - > (greater than)
  - : (colon)
  - " (double quote)
  - / (forward slash)
  - \ (backslash)
  - | (vertical bar/pipe)
  - ? (question mark)
  - * (asterisk)
  - The C0 control characters, 0–31 and 127 (U+0000–U+001F, U+007F); note that U+007F isn’t actually part of C0, but Microsoft included it in this list so I do too.
- Names must not end with a space or a dot (so these are removed recursively; for reasons of technical convenience, replace_with is NOT employed).
- These names are reserved (and so a trailing underscore is added to the base name), including with an extension if extension_cleverness is enabled:
  - CON, PRN, AUX, NUL,
  - COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9,
  - LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9
Most of these restrictions are actually not quite universal in Windows, but getting around them requires switching into POSIX mode or using long UNC paths (e.g. \\.\C:\CON, \\?\D:\aux.h), and your life will certainly be miserable if you try using them; so they’re all considered not Windows-safe.
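The reserved-name rule can be sketched as follows. Both function names are hypothetical, and the case-insensitive comparison is an assumption based on Windows treating file names case-insensitively; the crate’s own matching may be implemented differently.

```rust
/// Is `base` (the name without its extension) one of the reserved names?
/// Hypothetical check, compared case-insensitively (an assumption).
fn is_reserved_on_windows(base: &str) -> bool {
    let upper = base.to_ascii_uppercase();
    match upper.as_str() {
        "CON" | "PRN" | "AUX" | "NUL" => true,
        // COM1..COM9 and LPT1..LPT9
        s if s.len() == 4 && (s.starts_with("COM") || s.starts_with("LPT")) => {
            matches!(s.as_bytes()[3], b'1'..=b'9')
        }
        _ => false,
    }
}

/// Add a trailing underscore to a reserved base name, keeping any extension.
fn dodge_reserved(name: &str) -> String {
    let (base, ext) = match name.rsplit_once('.') {
        Some((base, ext)) => (base, Some(ext)),
        None => (name, None),
    };
    let mut out = String::from(base);
    if is_reserved_on_windows(base) {
        out.push('_');
    }
    if let Some(ext) = ext {
        out.push('.');
        out.push_str(ext);
    }
    out
}

fn main() {
    assert_eq!(dodge_reserved("aux.h"), "aux_.h");
    assert_eq!(dodge_reserved("LPT1"), "LPT1_");
    assert_eq!(dodge_reserved("console.log"), "console.log");
}
```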
url_safe: bool
Remove characters that may be problematic in the usual places in URLs. Default false.
If you want something URL-safe, consider slugifying instead (see below).
This removes any character that is not what’s called a URL code point, also removes the characters &, / and ?, and forbids the names "." and ".." which have a special meaning in paths. The result is either an empty string, or suitable for use as a path component, query string value or fragment, without generally needing percent-encoding: such a URL will be correctly parsed by a WHATWG URL Standard parser, though nominally invalid¹, but older or poorer-quality URL parsers may need percent-encoding to cope with the non-ASCII that is retained.
Some notable characters that are removed: /, \, %, ?, #, &, ", and space.
Almost all non-ASCII is retained.
Notes on using these URLs in some common formats:
- In HTML, no escaping is needed in <a href="http://www.example/fïle_ñamê">, because & and " are the only two characters needing escaping in a double-quoted attribute value, and both are removed by url_safe.
- In plain text formats following the longstanding convention of angle bracket delimitation (<http://www.example/lïke_τhis>), no escaping should be required as > is removed by url_safe. This includes Markdown. However, some such parsers could be stricter about what’s allowed inside the angle brackets, so you may need or want to use a URL Standard serialiser to do percent-encoding of the non-ASCII.
- In Markdown [text](href) links, you’ll want to manually percent-encode ( to %28 and ) to %29. This is yet another bad choice in Markdown’s technical foundation: parentheses aren’t percent-encoded, never have been; so using a URL Standard serialiser won’t help you, you’ll instead need to manually encode them, or unpaired parentheses will break the link and possibly eat your laundry².
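The manual parenthesis encoding for Markdown links is a two-step string replacement. A minimal sketch, with a hypothetical helper name and an example URL on the same placeholder host used above:

```rust
/// Percent-encode parentheses so a URL can sit inside a Markdown
/// `[text](href)` link. Hypothetical helper for illustration.
fn escape_markdown_parens(url: &str) -> String {
    url.replace('(', "%28").replace(')', "%29")
}

fn main() {
    let href = escape_markdown_parens("http://www.example/notes_(draft)");
    assert_eq!(href, "http://www.example/notes_%28draft%29");
    println!("[notes]({href})");
}
```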
Given that this produces nominally-invalid URLs, you may be wondering why to bother at all; it really comes down to characters like ?, / and #: you can include them in paths by percent-encoding, but it’s too likely that somewhere along the way, something will mangle your path, not encoding it properly, and everything will break. Basically the entire system has to process the URL correctly; ever tried a path component containing %2F? But if you’ve removed the genuinely problematic characters, then in theory things can no longer go wrong once you’re past the parser. And being able to skip percent-encoding your URLs when you know you’ll be using a proper URL parser is nice.
I deliberately haven’t provided an option for removing characters that would make a URL nominally invalid (which is “non-ASCII”), because I think that goes too far: in such a case, I don’t think you should strip such characters, but rather slugify the whole thing (which can do things like Voilà! → voila).
replace_with is used for the character removals, but NOT for the forbidding of the names "." and "..", for which it will instead yield an empty string.
—⁂—
¹ “Invalid” is just a label in WHATWG specs; it doesn’t change anything, and parsing is still well-defined, it’s generally just a hint that either you may have made a mistake, or that older tools might not handle this case the same way.
² When Americans say “eat your laundry” they mean the clothes. An Australian seeking to express that concept would say “eat your washing” (and probably be looked at strangely because it’s not an expression in common use). The laundry is the room in which clothes are washed; so when I say injection attacks might eat your laundry——
normalise_whitespace: bool
Replace all sequences of whitespace with one space. Default true.
This uses the Unicode White_Space property to decide (char::is_whitespace).
This is done in two phases:
- Before safety character replacements, each whitespace character is normalised to a U+0020 SPACE; replace_with is not invoked.
- After all character replacements, adjacent spaces (including any produced by replace_with, independent of collapse_replacements) are collapsed to just one.
trim_spaces_and_full_stops: bool
Remove spaces and full stops (.) from the start and end of the name. Default true.
normalise_whitespace is performed before this; with it on, this will trim all whitespace, with it off it’ll only trim U+0020 SPACE.
All things that invoke replace_with are performed before this; thus, if you replace a character with a space or full stop, that could get trimmed. replace_with is not invoked on any characters removed by this.
If extension_cleverness is enabled (which it is by default), on names with an extension this trims from the start and end of the base name and extension independently, rather than just the start and end of the full string. That is, " foo . bar . baz " will become "foo . bar.baz" with extension_cleverness, and "foo . bar . baz" without.
This is independent of windows_safe, which also trims trailing spaces and dots from the complete name.
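The " foo . bar . baz " example can be reproduced with a short standard-library sketch. The function `trim_name` is hypothetical and ignores edge cases such as names made only of dots or an empty extension.

```rust
/// Characters trimmed by this option.
const TRIMMED: &[char] = &[' ', '.'];

/// Trim spaces and full stops, optionally treating the base name and
/// extension independently. Hypothetical sketch, not the crate's code.
fn trim_name(name: &str, extension_cleverness: bool) -> String {
    if extension_cleverness {
        // Split at the last full stop into base name and extension.
        if let Some((base, ext)) = name.rsplit_once('.') {
            return format!(
                "{}.{}",
                base.trim_matches(TRIMMED),
                ext.trim_matches(TRIMMED)
            );
        }
    }
    name.trim_matches(TRIMMED).to_string()
}

fn main() {
    let name = " foo . bar . baz ";
    assert_eq!(trim_name(name, true), "foo . bar.baz");
    assert_eq!(trim_name(name, false), "foo . bar . baz");
}
```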
trim_more_punctuation: bool
Remove a few more punctuationy characters from the start and end of the name. Default true.
This is a more aggressive supplement to trim_spaces_and_full_stops, trimming from the same places in the same way. These characters are removed:
- _ (underscore; especially significant because replace_with defaults to an underscore)
- - (hyphen/dash/minus)
- , (comma)
- ; (semicolon)
remove_control_characters: bool
Remove control characters. Default true.
This removes all characters with the general category Control: C0 controls U+0000–U+001F, control character U+007F, and C1 controls U+0080–U+009F.
replace_with is invoked on these removals.
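In the standard library, `char::is_control` tests for exactly this general category, so the rule can be sketched in a few lines. The function name is hypothetical, and modelling the replacement as an `Option<char>` (with `None` meaning plain removal) is an assumption.

```rust
/// Replace (or drop) every character in general category Control.
/// Hypothetical sketch; `None` drops the character.
fn replace_controls(input: &str, replacement: Option<char>) -> String {
    input
        .chars()
        .filter_map(|c| if c.is_control() { replacement } else { Some(c) })
        .collect()
}

fn main() {
    // U+0007 is a C0 control, U+0085 a C1 control, U+007F is DELETE.
    let name = "a\u{7}b\u{85}c\u{7f}d";
    assert_eq!(replace_controls(name, Some('_')), "a_b_c_d");
    assert_eq!(replace_controls(name, None), "abcd");
}
```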
remove_reordering_characters: bool
Remove BiDi control characters that are relevant to reordering attacks. Default true.
https://trojansource.codes/trojan-source.pdf is a paper with info about the attack.
This removes U+202A–U+202E and U+2066–U+2069. It does NOT remove the remaining three Bidi_Control characters U+061C, U+200E and U+200F (ALM, LRM, RLM), which are not implicated in the attack and are conceivably useful in file names.
replace_with is invoked on these removals.
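The two ranges translate directly into a character predicate. A minimal sketch with hypothetical function names; it drops the characters outright rather than going through a replacement.

```rust
/// True for the BiDi controls implicated in reordering attacks:
/// U+202A–U+202E and U+2066–U+2069.
fn is_reordering_char(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// Drop reordering characters (hypothetical helper, no replacement applied).
fn strip_reordering(input: &str) -> String {
    input.chars().filter(|c| !is_reordering_char(*c)).collect()
}

fn main() {
    // U+202E RIGHT-TO-LEFT OVERRIDE can make a name display misleadingly.
    assert_eq!(strip_reordering("invoice\u{202E}fdp.exe"), "invoicefdp.exe");
    // LRM, RLM and ALM are deliberately left alone.
    assert!(!is_reordering_char('\u{200E}'));
    assert!(!is_reordering_char('\u{200F}'));
    assert!(!is_reordering_char('\u{061C}'));
}
```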
replace_with: R
Where characters are removed (except as marked), replace them with this. Default Some('_').
If you provide a character that would normally be removed, it will not be removed: that processing is done once only.
If you provide a character that would be trimmed, it may or may not be trimmed: matches at the end will be trimmed, but matches at the start will only be trimmed if ridiculously long names and/or extensions force unusual truncation, exposing the start of the string (so that it gets trimmed to nothing).
collapse_replacements: bool
Where multiple adjacent characters are to be replaced, only replace the first, and remove any subsequent ones. Default false.
See also normalise_whitespace, which can collapse replacements if you replace with whitespace.
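The effect of collapsing can be sketched with a small loop that remembers whether the previous character was replaced. Both function names are hypothetical, and the predicate only covers the punctuation from the windows_safe list for the sake of the example.

```rust
/// Punctuation removed by windows_safe, used here as the example predicate.
fn is_windows_unsafe(c: char) -> bool {
    matches!(c, '<' | '>' | ':' | '"' | '/' | '\\' | '|' | '?' | '*')
}

/// Replace unsafe characters, optionally collapsing adjacent replacements.
/// Hypothetical sketch; not the crate's implementation.
fn replace_unsafe(input: &str, replacement: char, collapse: bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut last_was_replaced = false;
    for c in input.chars() {
        if is_windows_unsafe(c) {
            // With `collapse`, only the first of a run gets a replacement.
            if !(collapse && last_was_replaced) {
                out.push(replacement);
            }
            last_was_replaced = true;
        } else {
            out.push(c);
            last_was_replaced = false;
        }
    }
    out
}

fn main() {
    assert_eq!(replace_unsafe("a<:>b?c", '_', false), "a___b_c");
    assert_eq!(replace_unsafe("a<:>b?c", '_', true), "a_b_c");
}
```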
six_measures_of_barley: &'static str
If sanitisation would leave the path empty, return this string instead. Default "_".
This exists because I found myself writing if name.is_empty() { name.push('_') } after every time I called sanitise. I think most of the time you don’t want to be left with an empty string, and inserting something is tolerable, so this is on by default as something fairly neutral that aligns with the replace_with default as well. You can effectively disable this by setting this to an empty string.
length_limit is not taken into account on this. If you put something ridiculously long in it, you brought it on yourself and I wash my hands of it, as Pontius Pilate of old.
(Read Ruth 3:15–17 from the Bible to understand the name of this option.)
Implementations
impl<R: Replace> Options<R>
pub fn with_replace_with<R2: Replace>(self, new_replace_with: R2) -> Options<R2>
A workaround for an otherwise-messy type situation with filling in defaults.
This solves the problem that you can’t write this:
Options { replace_with: |c| /* … */, ..Options::DEFAULT }
… because struct update syntax doesn’t currently allow you to change types, and Options::DEFAULT is an Options<Option<char>>, but with a closure for replace_with you’re needing to change it to Options<[closure@…]>. So instead, write like one of these:
Options::DEFAULT.with_replace_with(|c| /* … */)
Options { /* … */, ..Options::DEFAULT }.with_replace_with(|c| /* … */)
If you’re using nightly rustc, you can try the incomplete type-changing-struct-update feature instead, which lets the first code work (so long as this unstable and incomplete feature is working):
#![feature(type_changing_struct_update)]
use sanitise_file_name::Options;
fn main() {
    // Bind the value so that `main` still returns `()`.
    let _options = Options { replace_with: |c| /* … */, ..Options::DEFAULT };
}
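The shape of the workaround can be seen in a self-contained miniature that compiles on stable Rust. Everything here is hypothetical: `MiniOptions` is a two-field stand-in for `Options<R>`, there is no `Replace` bound, and the closure signature is arbitrary rather than what the crate’s `Replace` trait expects.

```rust
/// Hypothetical two-field stand-in for `Options<R>`.
struct MiniOptions<R> {
    length_limit: usize,
    replace_with: R,
}

impl MiniOptions<Option<char>> {
    /// The defaults are only defined for the `Option<char>` replacer type.
    const DEFAULT: Self = MiniOptions {
        length_limit: 255,
        replace_with: Some('_'),
    };
}

impl<R> MiniOptions<R> {
    /// Carry every other field across and swap in the new replacer,
    /// changing the type parameter from `R` to `R2`.
    fn with_replace_with<R2>(self, new_replace_with: R2) -> MiniOptions<R2> {
        MiniOptions {
            length_limit: self.length_limit,
            replace_with: new_replace_with,
        }
    }
}

fn main() {
    // Struct update syntax keeps the type as MiniOptions<Option<char>>;
    // the method call afterwards changes it to MiniOptions<closure>.
    let options = MiniOptions { length_limit: 100, ..MiniOptions::DEFAULT }
        .with_replace_with(|c: char| c.to_ascii_uppercase());
    assert_eq!(options.length_limit, 100);
    assert_eq!((options.replace_with)('a'), 'A');
}
```

Setting the other fields first and converting last keeps the struct update expression within a single type, which is why the second form shown earlier works.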