While working on some set operations in Python with numpy and pandas, I came across a strange phenomenon which, I would argue, results in an inconsistency in NaN handling.
Let's assume a very simple situation, with plain Python sets as our points of interest:
import numpy as np
import pandas as pd
a = {1, 2, 3, np.nan}
b = {1, 2, 4, np.nan}
print(a - b)
Out:
{3}
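As a side note, my guess is that this works because both sets hold the very same np.nan object, and set lookups presumably short-circuit on object identity before falling back to equality; a quick check of identity versus equality:

print(np.nan is np.nan)  # True  - the module-level np.nan is a single object
print(np.nan == np.nan)  # False - NaN never compares equal to NaN, per IEEE 754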
This is perfectly fine and we would expect it to be so, but let's continue with a somewhat more complicated example involving a pandas Series (or DataFrame):
series = pd.Series([1, 2, 3, np.nan, 1, 2, 3])
d = set(series)
print(d)
Out:
{nan, 1.0, 2.0, 3.0}
Once again, perfectly fine. However, when we call:
print(d - b) # the same applies to a single column of a data frame in place of a series
the result is (quite unexpectedly to me):
{nan, 3.0}
There is still a nan value in the output.
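Digging a bit, the nan inside d does not appear to be the same object as np.nan; the nan_from_d extraction below is just for illustration:

nan_from_d = next(x for x in d if x != x)  # NaN is the only value unequal to itself
print(nan_from_d is np.nan)  # False - a distinct NaN object
print(nan_from_d == np.nan)  # False - and two NaNs never compare equal anyway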
I do understand that when we create the series variable, all of the input values are cast under the hood to float64, including the nan value:
type(series.iloc[3])
Out:
numpy.float64
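For comparison, here are a few quick checks (reusing the objects defined above) contrasting how numbers and NaNs behave across the two types:

print(type(np.nan))  # <class 'float'> - a plain Python float
print(np.isnan(series.iloc[3]), np.isnan(np.nan))  # True True - both are NaN
print(1 == np.float64(1.0))  # True  - numeric equality works across types
print(hash(1) == hash(np.float64(1.0)))  # True  - equal hashes, so sets deduplicate them
print(np.nan == float('nan'))  # False - two NaN objects never compare equal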
The type of a freely created np.nan, on the other hand, is just float, and np.isnan() returns True in both cases, as the checks above show.

I still see this as an inconsistency, because I would assume that all basic Python operations (to which set operations undoubtedly belong) treat NaNs in the same manner as numbers. Even though the numbers in the sets undergo the same type conversion as the NaNs (in pure Python they are ints, whereas in a pandas series they are floats), set operations still consider them the same entities and remove them accordingly. NaN is supposed to be (quasi-)numeric as well, and yet it is handled differently. Is this a feature, a bug, or an acknowledged situation which for some reason cannot be resolved?
Python version: 3.6.6. Numpy version: 1.16.2. Pandas version: 0.24.2.