Group by to return entirety of the data not just what I am grouping by

Question

Is it possible to return the entirety of data not just part of which we are grouping by?

I mean for example - I have a dataframe with 5 columns and one of those columns contains distance, the other one is timestamp and the last important one is name. I grouped dataframe by timestamp - agg function I applied is (min) on distance. As a return i get correctly grouped dataframe with timestamp and distance - how can i add columns name there. If I group it by name as well then timestamp becomes duplicated - it has to stay unique. As a final result I need to get dataframe like this:

timestamp	name	distance
2020-03-03 15:30:235	Billy	123
2020-03-03 15:30:435	Johny	111

But instead i get this:

timestamp	distance
2020-03-03 15:30:235	123
2020-03-03 15:30:435	111

Whole table has more than 700k rows so joining it back on distance gives me that amount of rows which my PC can't even handle.

Here is my groupby which gives me 2nd table:

grouped_df = df1.groupby('timestamp')['distance'].min()

Here is what i tried to do in order to get name inside the table:

grouped_df.merge(df1, how='left', left_on=['timestamp','distance'],    
right_on = ['timestamp','distance'])

can you post the relevant lines of your code? then it'll be easy — RusJaI, Jun 13 '21 at 14:58
i hope this will answer your problem https://stackoverflow.com/questions/30244952/how-do-i-create-a-new-column-from-the-output-of-pandas-groupby-sum — RusJaI, Jun 13 '21 at 15:09

BENY · Accepted Answer · 2021-06-13T15:43:29.337

1

Just try

out = df.sort_values('distance').drop_duplicates('timestamp')

Then try with transform

m = df.groupby('timestamp')['distance'].transform('min')
dout = df[df.distance==m]

edited Jun 13 '21 at 15:43

answered Jun 13 '21 at 15:37

BENY

317,841
20
164
234

Ank · Answer 2 · 2021-06-15T03:48:22.517

0

You can use GroupBy.agg method to apply min on the distance column, and apply a relevant function on name column (lambda x:x to simply return its data). This will return the dataframe with both columns back to you:

grouped_df = df1.groupby('timestamp').agg({'distance':'min','name':lambda x:x})

Also if you want timestamp back as a column and not as index, you can set as_index=False during groupby.

Output:

>>> grouped_df = df1.groupby('timestamp', as_index=False).agg({'distance':'min','name':lambda x:x})
>>> grouped_df
              timestamp  distance   name
0  2020-03-03 15:30:235       123  Billy
1  2020-03-03 15:30:435       111  Johny

edited Jun 15 '21 at 03:48

answered Jun 13 '21 at 15:19

Ank

1,704
9
14

Hi, thanks. Thought it doesn't return entire dataframe, column name is still not returned. It gives only timestamp and distance but doesnt show the person who did the distance. – Aleksander Kuś Jun 13 '21 at 15:24
ohh Yes, posted code too early, you can apply a relevant function on name column as well. Edited the answer, try now. – Ank Jun 13 '21 at 15:36

Group by to return entirety of the data not just what I am grouping by

2 Answers2