sanitise_file_name

Struct Options

pub struct Options<R: Replace> {
    pub length_limit: usize,
    pub reserve_extra: usize,
    pub extension_cleverness: bool,
    pub most_fs_safe: bool,
    pub windows_safe: bool,
    pub url_safe: bool,
    pub normalise_whitespace: bool,
    pub trim_spaces_and_full_stops: bool,
    pub trim_more_punctuation: bool,
    pub remove_control_characters: bool,
    pub remove_reordering_characters: bool,
    pub replace_with: R,
    pub collapse_replacements: bool,
    pub six_measures_of_barley: &'static str,
}

Sanitisation options. Defaults marked on each field.

Take a look around, but I think everything’s pretty sane by default; the ones I think you’re most likely to want to change are url_safe and windows_safe, though replace_with, collapse_replacements and six_measures_of_barley can be interesting too for yielding prettier results.

If you set length_limit to usize::MAX, all the bool fields to false, and six_measures_of_barley to an empty string, sanitise will not alter the input string in any way. But that would be a rather expensive alternative to .clone(). In practice, I doubt you ever want to disable most_fs_safe, which is a good baseline.

Fields§

§length_limit: usize

Limit the complete file name to this many UTF-8 code units. The default is 255, which is suitable for all practical platforms.

(Some file systems limit lengths in UTF-8 code units and some in UTF-16 code units, but UTF-16 never takes more code units than UTF-8 to encode a given Unicode string, so we can ignore it.)
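
To see why the UTF-16 count can be ignored, here is a small sketch using only the standard library. The helper name `code_unit_counts` is made up for illustration and is not part of this crate.

```rust
/// Returns (UTF-8 code units, UTF-16 code units) for a string.
/// Hypothetical helper, for illustration only.
fn code_unit_counts(s: &str) -> (usize, usize) {
    // `str::len` counts bytes, which are the UTF-8 code units;
    // `encode_utf16` yields one item per UTF-16 code unit.
    (s.len(), s.encode_utf16().count())
}
```

For ASCII the two counts are equal; for everything else UTF-8 needs more code units (two or three against one in the Basic Multilingual Plane, four against two beyond it), so a limit of 255 UTF-8 code units also satisfies a limit of 255 UTF-16 code units.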

Reasons you might want to reduce it:

  1. You haven’t appended the extension yet, and so want to subtract the extension’s length. (In that case I suggest writing Options::DEFAULT.length_limit instead of hard coding 255—that’ll work in const context.)

  2. You want smoother Windows support, for on Windows some things start falling over if the total path length is greater than 260 characters; so measuring or estimating the path length could potentially be useful—but unless you know, probably don’t worry too much, someone’ll probably drop it deep in a node_modules tree at some point and then you’ll be in trouble anyway. 😀

One other mildly significant note here: if you care about Apple’s pre-2017 HFS+ file system, you should perform Unicode normalisation to NFD (most likely via the unicode-normalization crate) before performing sanitisation, because the decomposed form may be longer; if you don’t, then the path will be normalised to NFD by the file system when you try to write it, which could take it over 255 and make it fail. I don’t think there are any popular file systems that normalise any more, though APFS kinda prefers NFC, so you might want to normalise to NFC. I do not know if normalising to NFC will ever lengthen a UTF-8 string, but the spec allows it to (UAX #15, goal 3.2).

The minimum permitted value is 10, for reasons of implementation convenience and because I don’t think there’s any legitimate use case for a smaller value. If you provide a value less than ten, you’ll get an empty string back every time.

Truncations are performed at char granularity (Unicode scalar value), which means that extended grapheme clusters could be broken. This could change in the future (it’ll be an optional dependency on unicode-segmentation), but for now it was just too much thought. If I ever implement this, I’ll probably ditch the minimum value of 10 too.
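
As a rough illustration of char-granularity truncation (a sketch of the idea, not the crate's actual code; `truncate_at_char_boundary` is a hypothetical name):

```rust
/// Truncates `s` to at most `limit` UTF-8 code units without splitting a
/// `char`. Hypothetical helper, not the crate's implementation.
fn truncate_at_char_boundary(s: &str, limit: usize) -> &str {
    if s.len() <= limit {
        return s;
    }
    // Walk back from the limit until we land between two chars.
    // Index 0 is always a boundary, so this loop terminates.
    let mut end = limit;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Given "e" followed by U+0301 COMBINING ACUTE ACCENT (three code units in total) and a limit of 2, this returns just "e": the char boundary is respected, but the grapheme cluster is broken, which is the caveat described above.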

§reserve_extra: usize

When allocating the string (since it allocates as small a string as possible), reserve at least this many extra bytes. This is good for efficiency when you append the extension after sanitisation (in which case, also disable extension_cleverness). Default 0.
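
The effect is roughly that of allocating with some headroom. The following standard-library sketch shows the idea only; `with_room` is a hypothetical name and this is not how the crate builds its string.

```rust
/// Copies `sanitised` into a String with room for `reserve_extra` more
/// bytes. Hypothetical helper, for illustration only.
fn with_room(sanitised: &str, reserve_extra: usize) -> String {
    let mut out = String::with_capacity(sanitised.len() + reserve_extra);
    out.push_str(sanitised);
    out
}
```

With `reserve_extra` set to 4, appending ".txt" afterwards fits in the existing allocation, so no reallocation is needed.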

§extension_cleverness: bool

Make other options try to be clever about a file extension in the input. Default true.

Specifically, if a file extension is detected (done by looking for the last full stop in the name, and splitting at that point into base name and extension):

  1. length_limit will try to keep the extension intact, truncating the base name rather than the extension. “Try”, because if the extension is longer than six code units less than the length limit, it will be deemed unsalvageable. (Why six? The base name must retain at least one character, so for convenience that’s four UTF-8 code units, plus one more for the dot, and if windows_safe is on, the longest reserved name causes a five code unit base name like LPT1_, and ridiculously long extensions are a corner case anyway so I decided to just call it a day at six. If I subsequently implement grapheme-cluster-aware truncation, this six will increase if the first grapheme cluster in the base name is more than five code units long.) An unsalvageable extension is the only case where sanitisation may take two steps to quiesce, rather than one: if the extension is entirely truncated and the base name contains a dot which in a subsequent run will be interpreted as the extension separator, trimming will happen around it on that subsequent run but not the first.

  2. windows_safe will detect reserved names with extensions.

  3. trim_spaces_and_full_stops and trim_more_punctuation will trim those characters from the end of the base name and the start of the extension, in addition to the start and end of the full name. (Expressed otherwise, the base name and extension will be trimmed independently.)

If you’re appending the extension after sanitisation, you should turn this to false.
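
A simplified sketch of the extension detection and of extension-preserving truncation follows. All function names are hypothetical, and the sketch deliberately does not reproduce the six-code-unit rule or the Windows reserved-name handling described above; it only shows the general shape.

```rust
/// Splits at the last full stop into (base name, extension).
/// Hypothetical helper; the crate may treat edge cases differently.
fn split_extension(name: &str) -> (&str, Option<&str>) {
    match name.rsplit_once('.') {
        Some((base, ext)) => (base, Some(ext)),
        None => (name, None),
    }
}

/// Truncates to `limit` UTF-8 code units without splitting a char.
fn truncate_at_char_boundary(s: &str, limit: usize) -> &str {
    if s.len() <= limit {
        return s;
    }
    let mut end = limit;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Shortens the base name rather than the extension where possible.
/// Simplification: the extension is kept whenever it leaves any room at
/// all for the base name, which is looser than the rule in the text.
fn truncate_keeping_extension(name: &str, limit: usize) -> String {
    if name.len() <= limit {
        return name.to_owned();
    }
    match split_extension(name) {
        (base, Some(ext)) if ext.len() + 1 < limit => {
            // Budget for the base name: the limit minus ".ext".
            let budget = limit - ext.len() - 1;
            format!("{}.{}", truncate_at_char_boundary(base, budget), ext)
        }
        // No extension, or an unsalvageable one: truncate the whole name.
        _ => truncate_at_char_boundary(name, limit).to_owned(),
    }
}
```

Note that a tiny budget combined with a multi-byte first character can leave this sketch with an empty base name, which is the kind of corner case the six-code-unit reservation is there to avoid.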

§most_fs_safe: bool

Remove characters that are not safe on just about any file system. Default true, and if you actually want to disable it you’re probably using the wrong crate.

This plus length_limit is enough to satisfy most platforms other than Windows, though cleaning somewhat more is probably a good idea.

Characters removed:

  • / (slash)
  • ␀ (null, character zero)

Also disallows names comprising exclusively dots (".", "..", "...", &c.), NOT using replace_with on them but yielding an empty string.

This is a tiny subset of windows_safe.
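
The rules above can be sketched in a few lines of standard-library Rust. This is an illustration of the described behaviour under the assumption that `replace_with` is a plain `Option<char>`; the function name is hypothetical and the real implementation interleaves this with the other options.

```rust
/// Sketch of the most_fs_safe rules. Hypothetical helper, not the
/// crate's code.
fn most_fs_safe_sketch(name: &str, replace_with: Option<char>) -> String {
    // Names consisting only of dots yield an empty string; the
    // replacement is not used for them.
    if name.chars().all(|c| c == '.') {
        return String::new();
    }
    let mut out = String::with_capacity(name.len());
    for c in name.chars() {
        if c == '/' || c == '\0' {
            // Removed character: substitute the replacement, if any.
            if let Some(r) = replace_with {
                out.push(r);
            }
        } else {
            out.push(c);
        }
    }
    out
}
```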

§windows_safe: bool

Ensure the file name is safe on Windows. Default true.

These are the rules applied:

  • These characters are removed (and replace_with employed):

    • < (less than)
    • > (greater than)
    • : (colon)
    • " (double quote)
    • / (forward slash)
    • \ (backslash)
    • | (vertical bar/pipe)
    • ? (question mark)
    • * (asterisk)
    • The C0 control characters, 0–31 and 127 (U+0000–U+001F, U+007F); note that U+007F isn’t actually part of C0, but Microsoft included it in this list so I do too.
  • Names must not end with a space or a dot (so these are removed recursively—for reasons of technical convenience, replace_with is NOT employed).

  • These names are reserved (and so a trailing underscore is added to the base name), including with an extension if extension_cleverness is enabled:

    • CON, PRN, AUX, NUL,
    • COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9,
    • LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9

Most of these restrictions are actually not quite universal in Windows, but getting around them requires switching into POSIX mode or using long UNC paths (e.g. \\.\C:\CON, \\?\D:\aux.h), and your life will certainly be miserable if you try using them; so they’re all considered not Windows-safe.
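
The reserved-name rule might be sketched as below. The function name is hypothetical, and the case-insensitive comparison is an assumption made here (Windows itself compares these names case-insensitively); the sketch behaves as if extension_cleverness were enabled.

```rust
/// The reserved device names listed above.
const RESERVED: [&str; 22] = [
    "CON", "PRN", "AUX", "NUL",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
];

/// Appends an underscore to the base name if it is reserved.
/// Hypothetical helper, not the crate's implementation.
fn escape_reserved_name(name: &str) -> String {
    // Split at the last full stop, as extension detection does.
    let (base, ext) = match name.rsplit_once('.') {
        Some((base, ext)) => (base, Some(ext)),
        None => (name, None),
    };
    // Assumption: names are matched ASCII case-insensitively.
    let reserved = RESERVED.iter().any(|r| r.eq_ignore_ascii_case(base));
    match (reserved, ext) {
        (true, Some(ext)) => format!("{}_.{}", base, ext),
        (true, None) => format!("{}_", base),
        (false, _) => name.to_owned(),
    }
}
```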

§url_safe: bool

Remove characters that may be problematic in the usual places in URLs. Default false.

If you want something URL-safe, consider slugifying instead (see below).

This removes any character that is not what’s called a URL code point, also removes the characters &, / and ?, and forbids the names . and .. which have a special meaning in paths. The result is either an empty string, or suitable for use as a path component, query string value or fragment, without generally needing percent-encoding: such a URL will be correctly parsed by a WHATWG URL Standard parser, though nominally invalid¹, but older or poorer-quality URL parsers may need percent-encoding to cope with the non-ASCII that is retained.

Some notable characters that are removed: /, \, %, ?, #, &, ", and space.

Almost all non-ASCII is retained.

Notes on using these URLs in some common formats:

  • In HTML, no escaping is needed in <a href="http://www.example/fïle_ñamê">, because & and " are the only two characters needing escaping in a double-quoted attribute value, and both are removed by url_safe.

  • In plain text formats following the longstanding convention of delimiting URLs with angle brackets (<http://www.example/lïke_τhis>), no escaping should be required as > is removed by url_safe. This includes Markdown. However, some such parsers could be stricter about what’s allowed inside the angle brackets, so you may need or want to use a URL Standard serialiser to do percent-encoding of the non-ASCII.

  • In Markdown [text](href) links, you’ll want to manually percent-encode ( to %28 and ) to %29. This is yet another bad choice in Markdown’s technical foundation: parentheses aren’t percent-encoded, never have been; so using a URL Standard serialiser won’t help you, you’ll instead need to manually encode them, or unpaired parentheses will break the link and possibly eat your laundry².
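
For the Markdown case in the last point, the manual encoding can be as small as this (a standard-library sketch; `escape_markdown_href` is a hypothetical name):

```rust
/// Percent-encodes parentheses so that a URL can sit inside a Markdown
/// `[text](href)` link. Hypothetical helper, for illustration only.
fn escape_markdown_href(url: &str) -> String {
    url.replace('(', "%28").replace(')', "%29")
}
```

This assumes the rest of the URL has already been through url_safe sanitisation, so parentheses are the only characters left that the link syntax cares about.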

Given that this produces nominally-invalid URLs, you may be wondering why to bother at all; it really comes down to characters like ?, / and #: you can include them in paths by percent-encoding, but it’s too likely that somewhere along the way, something will mangle your path, not encoding it properly, and everything will break—basically the entire system has to process the URL correctly; ever tried a path component containing %2F? But if you’ve removed the genuinely problematic characters, then in theory things can no longer go wrong once you’re past the parser. And being able to skip percent-encoding your URLs when you know you’ll be using a proper URL parser is nice.

I deliberately haven’t provided an option for removing characters that would make a URL nominally invalid (which is “non-ASCII”), because I think that goes too far: in such a case, I don’t think you should strip such characters, but rather slugify the whole thing (which can do things like Voilà! → voila).

replace_with is used for the character removals, but NOT for the forbidding of the names "." and "..", for which it will instead yield an empty string.

—⁂—

¹ “Invalid” is just a label in WHATWG specs; it doesn’t change anything, and parsing is still well-defined, it’s generally just a hint that either you may have made a mistake, or that older tools might not handle this case the same way.

² When Americans say “eat your laundry” they mean the clothes. An Australian seeking to express that concept would say “eat your washing” (and probably be looked at strangely because it’s not an expression in common use). The laundry is the room in which clothes are washed; so when I say injection attacks might eat your laundry——

§normalise_whitespace: bool

Replace all sequences of whitespace with one space. Default true.

This uses the Unicode White_Space property to decide (char::is_whitespace).

This is done in two phases:

  1. Before safety character replacements, each whitespace character is normalised to a U+0020 SPACE; replace_with is not invoked.

  2. After all character replacements, adjacent spaces (including any produced by replace_with, independent of collapse_replacements) are collapsed to just one.
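
A condensed sketch of the two phases, with the character replacements that happen in between left out, could look like this (the function name is hypothetical and this is not the crate's code):

```rust
/// Maps every White_Space character to U+0020 and collapses runs of
/// spaces to one. Hypothetical helper; the real option performs the
/// two phases separately, with other replacements in between.
fn normalise_whitespace_sketch(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut previous_was_space = false;
    for c in name.chars() {
        // Phase 1: any whitespace character becomes a plain space.
        let c = if c.is_whitespace() { ' ' } else { c };
        // Phase 2: skip a space that directly follows another space.
        if c == ' ' && previous_was_space {
            continue;
        }
        previous_was_space = c == ' ';
        out.push(c);
    }
    out
}
```

Leading and trailing spaces are collapsed but not removed here; removing them is the job of trim_spaces_and_full_stops.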

§trim_spaces_and_full_stops: bool

Remove spaces and full stops (.) from the start and end of the name. Default true.

normalise_whitespace is performed before this; with it on, this will trim all whitespace, while with it off it’ll only trim U+0020 SPACE.

All things that invoke replace_with are performed before this; thus, if you replace a character with a space or full stop, that could get trimmed. replace_with is not invoked on any characters removed by this.

If extension_cleverness is enabled (which it is by default), on names with an extension this trims from the start and end of the base name and extension independently, rather than just the start and end of the full string. That is, " foo . bar . baz " will become "foo . bar.baz" with extension_cleverness, and "foo . bar . baz" without.

This is independent of windows_safe, which also trims trailing spaces and dots from the complete name.

§trim_more_punctuation: bool

Remove a few more punctuationy characters from the start and end of the name. Default true.

This is a more aggressive supplement to trim_spaces_and_full_stops, trimming from the same places in the same way. These characters are removed:

  • _ (underscore; especially significant because replace_with defaults to an underscore)
  • - (hyphen/dash/minus)
  • , (comma)
  • ; (semicolon)

§remove_control_characters: bool

Remove control characters. Default true.

This removes all characters with the general category Control: C0 controls U+0000–U+001F, control character U+007F, and C1 controls U+0080–U+009F.

replace_with is invoked on these removals.
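
The standard library's `char::is_control` tests for exactly this general category, so the behaviour can be sketched as follows (hypothetical function name, assuming `replace_with` is a plain `Option<char>`):

```rust
/// Replaces (or, with `None`, removes) every character in the general
/// category Control. Hypothetical helper, not the crate's code.
fn replace_control_characters(name: &str, replace_with: Option<char>) -> String {
    let mut out = String::with_capacity(name.len());
    for c in name.chars() {
        if c.is_control() {
            // C0, U+007F or C1: substitute the replacement, if any.
            if let Some(r) = replace_with {
                out.push(r);
            }
        } else {
            out.push(c);
        }
    }
    out
}
```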

§remove_reordering_characters: bool

Remove BiDi control characters that are relevant to reordering attacks. Default true.

https://trojansource.codes/trojan-source.pdf is a paper with info about the attack.

This removes U+202A–U+202E and U+2066–U+2069. It does NOT remove the remaining three Bidi_Control characters U+061C, U+200E and U+200F (ALM, LRM, RLM), which are not implicated in the attack and are conceivably useful in file names.

replace_with is invoked on these removals.
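
The two ranges translate directly into a predicate. The sketch below uses hypothetical function names and assumes `replace_with` is a plain `Option<char>`:

```rust
/// True for the BiDi embedding, override and isolate controls:
/// U+202A to U+202E and U+2066 to U+2069.
fn is_reordering_character(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// Replaces (or, with `None`, removes) reordering characters.
/// Hypothetical helper, not the crate's code.
fn replace_reordering_characters(name: &str, replace_with: Option<char>) -> String {
    name.chars()
        .filter_map(|c| {
            if is_reordering_character(c) {
                replace_with
            } else {
                Some(c)
            }
        })
        .collect()
}
```

A name such as "abc" + U+202E + "txt.exe", which a BiDi-aware renderer could display with its tail reversed, comes out as "abc_txt.exe" with the default replacement, while U+200E (LRM) passes through untouched.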

§replace_with: R

Where characters are removed (except as marked), replace them with this. Default Some('_').

If you provide a character that would normally be removed, it will not be removed: that processing is done once only.

If you provide a character that would be trimmed, it may or may not be trimmed: matches at the end will be trimmed, but matches at the start will only be trimmed if ridiculously long names and/or extensions force unusual truncation, exposing the start of the string (so that it gets trimmed to nothing).

§collapse_replacements: bool

Where multiple adjacent characters are to be replaced, only replace the first, and remove any subsequent ones. Default false.

See also normalise_whitespace, which can collapse replacements if you replace with whitespace.
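
The effect of this flag can be sketched as follows. The function and parameter names are hypothetical; the set of unsafe characters is passed in as a predicate purely to keep the example self-contained.

```rust
/// Replaces characters matching `is_unsafe`, optionally collapsing each
/// run of adjacent replacements into one. Hypothetical helper, not the
/// crate's code.
fn replace_unsafe(
    name: &str,
    is_unsafe: impl Fn(char) -> bool,
    replace_with: char,
    collapse: bool,
) -> String {
    let mut out = String::with_capacity(name.len());
    let mut last_was_replacement = false;
    for c in name.chars() {
        if is_unsafe(c) {
            // Only the first of a run is replaced when collapsing.
            if !(collapse && last_was_replacement) {
                out.push(replace_with);
            }
            last_was_replacement = true;
        } else {
            out.push(c);
            last_was_replacement = false;
        }
    }
    out
}
```

With `<`, `>` and `:` treated as unsafe, "a<>:b" becomes "a___b" without collapsing and "a_b" with it.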

§six_measures_of_barley: &'static str

If sanitisation would leave the path empty, return this string instead. Default "_".

This exists because I found myself writing if name.is_empty() { name.push('_') } after every time I called sanitise. I think most of the time you don’t want to be left with an empty string, and inserting something is tolerable, so this is on by default as something fairly neutral that aligns with the replace_with default as well. You can effectively disable this by setting this to an empty string.

length_limit is not taken into account on this. If you put something ridiculously long in it, you brought it on yourself and I wash my hands of it, as Pontius Pilate of old.

(Read Ruth 3:15–17 from the Bible to understand the name of this option.)
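
In other words, the option folds the following pattern into the sanitiser itself (a sketch with a hypothetical function name):

```rust
/// Returns the fallback when sanitisation left nothing behind.
/// Hypothetical helper mirroring the behaviour described above.
fn or_fallback(sanitised: String, six_measures_of_barley: &'static str) -> String {
    if sanitised.is_empty() {
        // No length check is applied to the fallback.
        six_measures_of_barley.to_owned()
    } else {
        sanitised
    }
}
```

Passing an empty fallback makes this a no-op, which is how the option is effectively disabled.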

Implementations§

impl Options<Option<char>>

pub const DEFAULT: Self = _

The default options. This is more useful than Options::default() (which just returns this) because it’s const, so you can access Options::DEFAULT.length_limit in const context.

impl<R: Replace> Options<R>

pub fn with_replace_with<R2: Replace>(self, new_replace_with: R2) -> Options<R2>

A workaround for an otherwise-messy type situation with filling in defaults.

This solves the problem that you can’t write this:

Options { replace_with: |c| /* … */, ..Options::DEFAULT }

… because struct update syntax doesn’t currently allow you to change types, and Options::DEFAULT is an Options<Option<char>>, but with a closure for replace_with you’re needing to change it to Options<[closure@…]>. So instead, write like one of these:

Options::DEFAULT.with_replace_with(|c| /* … */)
Options { /* … */, ..Options::DEFAULT }.with_replace_with(|c| /* … */)

If you’re using nightly rustc, you can try the incomplete type-changing-struct-update feature instead, which lets the first code work (so long as this unstable and incomplete feature is working):

#![feature(type_changing_struct_update)]
use sanitise_file_name::Options;

fn main() {
    // Bind the value so that `main` still returns `()`.
    let _options = Options { replace_with: |c| /* … */, ..Options::DEFAULT };
}
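
To make the type situation concrete without depending on this crate, here is a toy stand-in: `Config` is a hypothetical two-field struct invented for this sketch, with the same shape of problem and the same shape of workaround.

```rust
/// Toy stand-in for a generic options struct (hypothetical, not this
/// crate's type).
struct Config<R> {
    limit: usize,
    replace_with: R,
}

impl Config<Option<char>> {
    /// Defaults exist only for the `Option<char>` instantiation.
    const DEFAULT: Self = Config { limit: 255, replace_with: Some('_') };
}

impl<R> Config<R> {
    /// Consumes `self` and rebuilds it with a field of a different type,
    /// which struct update syntax cannot do on stable Rust.
    fn with_replace_with<R2>(self, new_replace_with: R2) -> Config<R2> {
        Config { limit: self.limit, replace_with: new_replace_with }
    }
}

/// Builds a config whose replacer is a closure, starting from defaults.
fn shouting_config() -> Config<impl Fn(char) -> char> {
    // The struct update stays within `Config<Option<char>>`; only the
    // method call changes the type parameter.
    Config { limit: 100, ..Config::DEFAULT }
        .with_replace_with(|c: char| c.to_ascii_uppercase())
}
```

Other fields are set through struct update first, while the type is still the default one, and the type-changing field is swapped last.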

Trait Implementations§

impl<R: Debug + Replace> Debug for Options<R>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for Options<Option<char>>

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

impl<R> Freeze for Options<R>
where R: Freeze,

impl<R> RefUnwindSafe for Options<R>
where R: RefUnwindSafe,

impl<R> Send for Options<R>
where R: Send,

impl<R> Sync for Options<R>
where R: Sync,

impl<R> Unpin for Options<R>
where R: Unpin,

impl<R> UnwindSafe for Options<R>
where R: UnwindSafe,

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.