Caveats:
- DataFrames have a lot of attributes. If a DataFrame attribute is a number, you probably just want to return that number. But if the DataFrame attribute is a DataFrame, you probably want to return a Container. What should we do if the DataFrame attribute is a Series or a descriptor? To implement Container.__getattr__ properly, you really have to write unit tests for each and every attribute.
- Unit testing is also needed for __getitem__.
- You'll also have to define and unit test __setattr__, __setitem__, __iter__, __len__, etc.
- Pickling is a form of serialization, so if DataFrames are picklable, I'm not sure how Containers really help with serialization.
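To illustrate why methods such as __len__ and __iter__ have to be defined explicitly, here is a minimal sketch. Delegator and DelegatorWithLen are hypothetical stand-ins for Container, and a plain list stands in for the DataFrame, so the example runs without pandas. Python looks up special methods on the type when they are invoked implicitly, so __getattr__ is not consulted for len() or iteration.

```python
class Delegator:
    """Hypothetical stand-in for Container; wraps any object."""

    def __init__(self, contained):
        self.contained = contained

    def __getattr__(self, item):
        return getattr(self.contained, item)


d = Delegator([1, 2, 3])

# Explicit attribute access falls through to __getattr__ and is delegated.
count_result = d.count(2)
print(count_result)

# len() looks for __len__ on type(d), finds nothing, and never asks __getattr__.
try:
    len(d)
    len_failed = False
except TypeError as exc:
    len_failed = True
    print("len() failed:", exc)


class DelegatorWithLen(Delegator):
    """Same delegation, plus the special methods defined explicitly."""

    def __len__(self):
        return len(self.contained)

    def __iter__(self):
        return iter(self.contained)


e = DelegatorWithLen([1, 2, 3])
print(len(e), list(e))
```

Each special method added this way is one more piece of behavior that needs its own unit test.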
Some comments:
__getattr__ is only called when normal attribute lookup fails, so it is never reached for an attribute that is in self.__dict__. You therefore do not need the "if item in self.__dict__" check in your __getattr__.
self.contained.__getattr__(item) calls self.contained's __getattr__ method directly. That is usually not what you want to do, because it circumvents the whole Python attribute lookup mechanism. For example, it ignores the possibility that the attribute could be in self.contained.__dict__, or in the __dict__ of one of the bases of self.contained.__class__, or that item refers to a descriptor. Instead, use getattr(self.contained, item).
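The difference can be seen in a short sketch. Plain and WithFallback are hypothetical classes used only for illustration; neither is part of pandas.

```python
class Plain:
    """Defines no __getattr__ of its own."""

    def __init__(self):
        self.value = 42


class WithFallback:
    """Defines __getattr__ as a fallback for missing attributes."""

    def __init__(self):
        self.value = 42

    def __getattr__(self, name):
        return "fallback"


p = Plain()
via_getattr = getattr(p, "value")  # normal lookup finds the instance attribute

# Calling __getattr__ directly fails because Plain does not define it.
try:
    p.__getattr__("value")
    direct_call_failed = False
except AttributeError:
    direct_call_failed = True

w = WithFallback()
normal = getattr(w, "value")     # instance attribute wins
direct = w.__getattr__("value")  # skips the instance __dict__ entirely

print(via_getattr, direct_call_failed, normal, direct)
```

In the second case the direct call silently returns the fallback even though the attribute exists, which is the kind of bug that getattr avoids.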
import pandas
import numpy as np


def tocontainer(func):
    # Wrap func so that whatever it returns is placed in a Container.
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return Container(result)
    return wrapper


class Container(object):
    def __init__(self, df):
        self.contained = df

    def __getitem__(self, item):
        result = self.contained[item]
        # Re-wrap only when indexing returns the same type as the contained object.
        if isinstance(result, type(self.contained)):
            result = Container(result)
        return result

    def __getattr__(self, item):
        # getattr applies the normal attribute lookup to the contained object.
        result = getattr(self.contained, item)
        if callable(result):
            result = tocontainer(result)
        return result

    def __repr__(self):
        return repr(self.contained)
Here is some random code to test if -- at least superficially -- Container delegates to DataFrames properly and returns Containers:
df = pandas.DataFrame(
    [(1, 2), (1, 3), (1, 4), (2, 1), (2, 2)], columns=['col1', 'col2'])
df = Container(df)
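# Chained assignment: whether this writes through to the contained frame
# depends on the pandas version. The output shown here has the frame unmodified.
df['col1'][3] = 0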
print(df)
#    col1  col2
# 0     1     2
# 1     1     3
# 2     1     4
# 3     2     1
# 4     2     2
gp = df.groupby('col1').aggregate(np.count_nonzero)
print(gp)
#       col2
# col1
# 1        3
# 2        2
print(type(gp))
# <class '__main__.Container'>
print(type(gp[gp.col2 > 2]))
# <class '__main__.Container'>
tf = gp[gp.col2 > 2].reset_index()
print(type(tf))
# <class '__main__.Container'>
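# df.col1 == tf.col1 compares Series with different indexes, which recent
# pandas versions reject with a ValueError, so isin is substituted here.
result = df[df.col1.isin(tf.col1)]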
print(type(result))
# <class '__main__.Container'>
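Returning to the first caveat: the Container above wraps the result of every callable, whatever it returns. One possible refinement, sketched here under the assumption that only DataFrame results should be wrapped, is to decide based on the type of the result. SelectiveContainer and wrap_if_frame are hypothetical names introduced for this illustration, and leaving Series unwrapped is a design choice, not a recommendation.

```python
import pandas as pd


def wrap_if_frame(value, wrapper):
    """Hypothetical helper: wrap DataFrame results, pass everything else through."""
    if isinstance(value, pd.DataFrame):
        return wrapper(value)
    return value


class SelectiveContainer:
    """Illustrative variant of Container that only wraps DataFrame results."""

    def __init__(self, df):
        self.contained = df

    def __getattr__(self, item):
        result = getattr(self.contained, item)
        if callable(result):
            # Defer the decision until the method has actually been called.
            def method(*args, **kwargs):
                return wrap_if_frame(result(*args, **kwargs), SelectiveContainer)
            return method
        return wrap_if_frame(result, SelectiveContainer)


frame = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 1]})
c = SelectiveContainer(frame)

print(type(c.shape))    # a plain tuple, returned as-is
print(type(c.T))        # a DataFrame attribute, so it is wrapped
print(type(c.head(2)))  # a method returning a DataFrame, so it is wrapped
print(type(c.col1))     # a Series, returned as-is in this sketch
```

This covers four attributes out of hundreds, and some pandas attributes (the loc indexer, for example) are themselves callable, so per-attribute unit tests remain necessary.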