ToUpperInvariant() – is MSDN wrong on its recommendation?

Question

In Best Practices for Using Strings in the .NET Framework, StringComparison.OrdinalIgnoreCase is recommended for case-insensitive file paths. (Let's call it Statement A.)

I can agree with that, because I can create two files in the same directory:

é.txt
é.txt

Their filenames are not the same: the second one is composed of e followed by a combining accent, so it actually consists of two characters. (You can try it yourself using copy-paste.)

If an invariant-culture comparison (rather than an ordinal one) were in effect, NTFS wouldn't allow both files, because the same article explains that in the invariant culture a + ̊ = å.
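The difference between the two comparison kinds can be checked with a short sketch. It assumes the second file name is e followed by U+0301 (combining acute accent), which is a guess about the exact characters used; the class and method names are hypothetical.

```csharp
using System;

public static class ComparisonDemo
{
    // Precomposed form: the single code point U+00E9.
    public const string Precomposed = "\u00E9.txt";

    // Decomposed form (assumed): 'e' followed by U+0301, the combining acute accent.
    public const string Decomposed = "e\u0301.txt";

    // Compares the two file names under the requested comparison kind.
    public static bool EqualUnder(StringComparison kind) =>
        string.Equals(Precomposed, Decomposed, kind);

    public static void Main()
    {
        // Ordinal comparisons look at code units, so the names differ.
        Console.WriteLine(EqualUnder(StringComparison.OrdinalIgnoreCase));

        // Culture-aware comparisons apply linguistic rules to the two forms.
        Console.WriteLine(EqualUnder(StringComparison.InvariantCultureIgnoreCase));
    }
}
```

The ordinal comparison reports the names as different. The invariant-culture comparison normally treats them as equal, although that result depends on the globalization data available to the runtime.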

But the article on String.ToUpperInvariant() makes a different recommendation (Statement B):

If you need the lowercase or uppercase version of an operating system identifier, such as a file name, named pipe, or registry key, use the ToLowerInvariant or ToUpperInvariant methods.

I need to create a collection of file paths (in fact a HashSet) to detect duplicates. If I follow Statement B when building the collection, I could end up with false positives, because the above-mentioned filenames é.txt and é.txt would be considered the same. Am I understanding correctly that Statement B in MSDN is misleading? Or am I missing something?

I'm about to build a library, preferably without known bugs from the start, so I don't want to neglect this.

Update:

Statement B seems to have one more issue: ToLowerInvariant() cannot actually be used. The reason, quoting the Best Practices article: DO: Use ToUpperInvariant rather than ToLowerInvariant when normalizing strings for comparison. The underlying reason: There is a small range of characters that do not roundtrip, and going to lowercase will make these characters unavailable. (source)

I am not entirely sure "the lowercase or uppercase version of an operating system identifier" is meant to be the same as "an unambiguous mapping of an operating system identifier to a lowercase or uppercase version". It could also mean "a mapping of an operating system identifier to a non-unique lowercase or uppercase version that will work the same way regardless of the system's locale". — O. R. Mapper, Sep 23 '15 at 13:12
OT, but who knows what your library does: NTFS also allows `:`, `*` or `?` in file names. It's just Windows that doesn't support it. It's quite easy to create such files on NTFS under Linux. — Thomas Weller, Sep 23 '15 at 13:15
@O.R.Mapper – a good way of reading of that statement... In this context it looks logical. On the other hand, they could either leave out mentioning file names or add a short note on (non-)uniqueness. — miroxlav, Sep 23 '15 at 13:59

score 5 · Accepted Answer · answered Sep 23 '15 at 13:31

Neither uppercasing nor lowercasing is correct when you want to compare strings for equality case-insensitively. In both variants there are characters that mess this up.

The correct way to compare strings case-insensitively is to use one of the insensitive StringComparison options (you know that).

The right way to use a data structure case-insensitively is to use one of StringComparer.*IgnoreCase. For example:

new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)

Do not uppercase strings before adding them to a data structure. I would fail that in any code review.
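As a minimal sketch of the comparer-based approach, the following stores the names unmodified and lets the set's comparer decide what counts as a duplicate. The comparer is a parameter because which one matches a given file system is an assumption (the comments below discuss OrdinalIgnoreCase for file names); DuplicateDemo and FindDuplicates are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateDemo
{
    // Returns every name that the comparer considers equal to an earlier name.
    public static List<string> FindDuplicates(IEnumerable<string> names, StringComparer comparer)
    {
        var seen = new HashSet<string>(comparer);
        var duplicates = new List<string>();
        foreach (var name in names)
        {
            // Add returns false when an equal element is already in the set.
            if (!seen.Add(name))
            {
                duplicates.Add(name);
            }
        }
        return duplicates;
    }

    public static void Main()
    {
        // The last two names are the precomposed and the (assumed) decomposed form.
        var names = new[] { "report.txt", "REPORT.TXT", "\u00E9.txt", "e\u0301.txt" };
        foreach (var duplicate in FindDuplicates(names, StringComparer.OrdinalIgnoreCase))
        {
            Console.WriteLine(duplicate);
        }
    }
}
```

With StringComparer.OrdinalIgnoreCase only REPORT.TXT is flagged, and the two accented names stay distinct. Passing a culture-aware comparer instead would change which names collide, so the choice should follow the rules of the file system being modelled.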

If you need the lowercase or uppercase version of an operating system identifier

You do not need such a thing. This statement does not apply to your case.

usr

So in the case of NTFS filenames, this means `new HashSet<string>(StringComparer.OrdinalIgnoreCase)` (or just `StringComparer.Ordinal`, depending on how NTFS case sensitivity is configured in the specific case). – miroxlav Sep 23 '15 at 13:45
I don't know what kind of comparison NTFS uses. It can be configured. There is a hidden file on each NTFS volume that stores the Unicode case mapping table. I guess it could be arbitrary. Not sure what it is in practice. – usr Sep 23 '15 at 13:47
Yes I know that... It means we might actually need something like `NtfsIgnoreCase` comparison, working based on content of that hidden `$UpCase` file :) – miroxlav Sep 23 '15 at 13:49
See [this answer of mine](http://stackoverflow.com/a/26231047/3764814) (for short: use `OrdinalIgnoreCase` for file names). – Lucas Trzesniewski Sep 24 '15 at 07:31
@LucasTrzesniewski – I've actually seen it :) and also [this](http://stackoverflow.com/q/1061224/2392157) and [this](http://stackoverflow.com/q/410502/2392157) noteworthy Q&A. Finally I have used `Dictionary(Of T1, T2)(StringComparer.OrdinalIgnoreCase)` for my specific need. – miroxlav Sep 24 '15 at 08:02
