Resampling is similar to grouping with a pd.TimeGrouper. While resample's how parameter only allows you to specify one aggregator per column, the GroupBy object returned by df.groupby(...) has an agg method which can be passed various functions (e.g. mean, sum, or count) to aggregate the groups in various ways. You can use these results to build the desired DataFrame:
import datetime as DT
import numpy as np
import pandas as pd

np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)

resampled = df.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
print(resampled[['radiation', 'tamb']])
# radiation tamb
# 2012-04-05 08:00:00 279.432788 4.549235
# 2012-04-05 09:00:00 310.032188 4.414302
# 2012-04-05 10:00:00 257.504226 5.056613
# 2012-04-05 11:00:00 299.594032 4.652067
# 2012-04-05 12:00:00 8.109946 7.795668
def using_agg(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    return pd.DataFrame(
        {'radiation': grouped['radiation'].agg('sum'),
         'tamb': grouped['tamb'].agg('mean'),
         'new_column': grouped['tamb'].agg('count')})

print(using_agg(df))
yields
new_column radiation tamb
2012-04-05 08:00:00 60 279.432788 4.549235
2012-04-05 09:00:00 60 310.032188 4.414302
2012-04-05 10:00:00 60 257.504226 5.056613
2012-04-05 11:00:00 60 299.594032 4.652067
2012-04-05 12:00:00 1 8.109946 7.795668
Note that my first answer suggested using groupby/apply:
def using_apply(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    result = grouped.apply(foo).unstack(-1)
    result = result.sortlevel(axis=1)
    return result[['radiation', 'tamb', 'new_column']]

def foo(grp):
    radiation = grp['radiation'].sum()
    tamb = grp['tamb'].mean()
    cnt = grp['tamb'].count()
    return pd.Series([radiation, tamb, cnt],
                     index=['radiation', 'tamb', 'new_column'])
It turns out that using apply here is much slower than using agg. If we benchmark using_agg versus using_apply on a 1681-row DataFrame:
np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 6, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)
then, using IPython's %timeit function, I find:
In [83]: %timeit using_apply(df)
100 loops, best of 3: 16.9 ms per loop
In [84]: %timeit using_agg(df)
1000 loops, best of 3: 1.62 ms per loop
using_agg is significantly faster than using_apply and (based on additional
%timeit tests) the speed advantage in favor of using_agg grows as len(df)
grows.
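To see how the gap scales, here is a minimal benchmark sketch that times both functions on growing frames; the sizes and the use of the standard-library timeit module are my own choices, not part of the original tests:

import timeit

# A sketch, not the original benchmark: reuses using_agg/using_apply
# from above and doubles the time span a few times.
for nhours in (4, 28, 112):
    date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                               periods=nhours * 60 + 1, freq='1min')
    data = np.random.sample((date_times.size, 2)) * 10.0
    big = pd.DataFrame(data, index=date_times,
                       columns=['tamb', 'radiation'])
    t_agg = timeit.timeit(lambda: using_agg(big), number=10)
    t_apply = timeit.timeit(lambda: using_apply(big), number=10)
    print(len(big), t_apply / t_agg)  # the ratio grows with len(big)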
By the way, regarding
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
                          'new_column': count()})
besides the problem that the how dict does not accept non-existent column names, the parentheses after count are problematic. The values in the how dict should be function objects. count is a function object, but count() is the value returned by calling count.
Since Python evaluates arguments before calling functions, count() is getting called before frame.resample(...), and the return value of count() is then associated with the key 'new_column' in the dict bound to the how parameter. That's not what you want.
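Here is a tiny, self-contained illustration of that evaluation order; resample_like is a hypothetical stand-in, just to show the mechanics:

def count():
    print('count() was called')
    return 42

def resample_like(how):
    # By the time this body runs, the argument has been fully evaluated:
    # how['new_column'] is already 42, not the count function.
    print(how['new_column'])

resample_like(how={'new_column': count()})  # prints 'count() was called', then 42
resample_like(how={'new_column': count})    # passes the function object itself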
Regarding the updated question: Precompute the values that you will need before calling groupby/agg:
Instead of
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
     ...
# raises an error, since grouped['col1'] is a SeriesGroupBy, which does
# not support boolean indexing
use
df['col1_pos'] = df['col1'].clip(lower=0)
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1_pos'].agg('sum')
     ...
See the bottom of this post for more on why pre-computation helps performance.
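For concreteness, here is a small runnable sketch of the precompute pattern; the column names name1 and col1 follow the question, but the sample data is made up:

import datetime as DT
import numpy as np
import pandas as pd

np.random.seed(2016)
idx = pd.date_range(DT.datetime(2012, 1, 1), periods=300, freq='D')
df = pd.DataFrame({'name1': np.random.choice(['a', 'b'], idx.size),
                   'col1': np.random.randn(idx.size)}, index=idx)

# Precompute the clipped column once, as a single vectorized operation...
df['col1_pos'] = df['col1'].clip(lower=0)

# ...so that each group only needs a cheap built-in aggregator.
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
result = pd.DataFrame({'new_col1': grouped['col1_pos'].agg('sum')})
print(result.head())

Built-in aggregators like 'sum' run over each group in compiled code, whereas a Python-level comparison inside the groupby would have to be re-evaluated group by group.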