Inspired by "How to make a valid Windows filename from an arbitrary string?", I've written a function that takes an arbitrary string and makes it a valid filename.
My function should technically be an answer to that question, but I want to make sure I've not done anything stupid, or overlooked anything, before posting it as an answer.
I wrote this as part of tvnamer, a utility which takes TV episode filenames and renames them nicely and consistently, using an episode name pulled from http://www.thetvdb.com. The source filename must already be a valid filename, but the series name is corrected and the episode name is looked up, so both could theoretically contain any characters. I'm not so much concerned about security as usability: it's mainly to prevent files being renamed to `.some.series - [01x01].avi` and "disappearing" (rather than to thwart evil people).
It makes a few assumptions:
- The filesystem supports Unicode filenames. HFS+ and NTFS both do, which will cover a majority of users. There is also a `normalize_unicode` argument to strip out Unicode characters (in tvnamer, this is set via the config XML file)
- The platform is either Darwin or Linux; everything else is treated as Windows
- The filename is intended to be visible (not a dotfile like `.bashrc`) - it would be simple enough to modify the code to allow `.abc` format filenames, if desired
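
To make the first two assumptions concrete, here is a minimal sketch rather than the actual tvnamer code. The helper names `detect_platform_family` and `strip_unicode` are made up for illustration, and using NFKD decomposition followed by an ASCII encode is just one assumed way of stripping Unicode characters.

```python
import platform
import unicodedata


def detect_platform_family():
    """Return 'Darwin' or 'Linux', falling back to 'Windows' for anything else."""
    sysname = platform.system()
    if sysname in ("Darwin", "Linux"):
        return sysname
    return "Windows"


def strip_unicode(name):
    """Hypothetical helper: decompose accented characters, then drop non-ASCII."""
    # NFKD splits "å" into "a" plus a combining ring; the ring is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters with no ASCII decomposition simply vanish in this sketch, so a name made up entirely of such characters comes back as an empty string.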
Things I've (hopefully) handled:
- Prepend underscore if filename starts with `.` (prevents filenames `.` `..` and files from disappearing)
- Remove directory separators: `/` on Linux, and `/` and `:` on OS X
- Removing invalid Windows filename characters `\/:*?"<>|` (when on Windows, or forced with `windows_safe=True`)
- Prepend reserved filenames with underscore (`COM2` becomes `_COM2`, `NUL` becomes `_NUL` etc)
- Optional normalisation of Unicode data, so `å` becomes `a` and non-convertable characters are removed
- Truncation of filenames over 255 characters on Linux/Darwin, and 32 characters on Windows
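
Roughly, the non-Unicode steps in that list could be sketched as follows. This is a simplified illustration, not the code from the gist: the function name `make_valid_filename`, the `system` parameter, the reserved-name set, and the choice to compare only the part before the first dot (case-insensitively) are all assumptions. Truncation here is a plain slice that makes no attempt to preserve the extension.

```python
# Characters Windows rejects in filenames: \ / : * ? " < > |
WINDOWS_INVALID_CHARS = '\\/:*?"<>|'

# Assumed reserved-name list (device names); the real code may differ.
WINDOWS_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)


def make_valid_filename(value, system="Linux", windows_safe=False):
    """Illustrative sketch: `system` is 'Darwin', 'Linux', or anything else (Windows)."""
    treat_as_windows = windows_safe or system not in ("Darwin", "Linux")

    # Pick the characters to remove for this platform.
    if treat_as_windows:
        invalid = WINDOWS_INVALID_CHARS  # already includes "/" and ":"
    elif system == "Darwin":
        invalid = "/:"
    else:
        invalid = "/"
    value = "".join(ch for ch in value if ch not in invalid)

    # Keep the file visible: no leading dot.
    if value.startswith("."):
        value = "_" + value

    # Prefix reserved device names (assumption: only the part before the first dot matters).
    if treat_as_windows and value.split(".")[0].upper() in WINDOWS_RESERVED_NAMES:
        value = "_" + value

    # Truncate last, so the added underscores count towards the limit.
    max_len = 255 if system in ("Darwin", "Linux") else 32
    return value[:max_len]
```

For example, `make_valid_filename(".some.series - [01x01].avi")` gives `_.some.series - [01x01].avi`, and `make_valid_filename("COM2", system="Windows")` gives `_COM2`.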
The code and a bunch of test-cases can be found and fiddled with at http://gist.github.com/256270. The "production" code can be found in tvnamer/utils.py
Are there any errors in this function? Any conditions I've missed?