My df looks like this:
sid, class_start, class_end, graduated, grad_date, college_name, degree, major
123 2010-01-01 2010-04-15 NA NA ABC NA NA
123 2010-06-01 2010-09-15 NA NA ABC NA NA
123 2010-10-01 2010-12-15 NA NA ABC NA NA
123 NA NA Y 2010-12-15 ABC BS Biology
123 2011-01-01 2011-04-15 NA NA WRT NA NA
123 2011-06-01 2011-09-15 NA NA WRT NA NA
123 2011-10-01 2011-12-15 NA NA WRT NA NA
123 NA NA Y 2011-12-15 ABC BS Chem
123 2012-01-01 2011-04-15 NA NA ABC NA NA
123 2012-06-01 2011-09-15 NA NA ABC NA NA
123 2012-10-01 2011-12-15 NA NA ABC NA NA
123 NA NA Y 2012-12-15 ABC MS Biology
765 2010-01-01 2010-04-15 NA NA EGF NA NA
765 2010-06-01 2010-09-15 NA NA EGF NA NA
765 2010-10-01 2010-12-15 NA NA EGF NA NA
765 NA NA N NA EGF BS Physics
765 2014-01-01 2014-04-15 NA NA RSE NA NA
765 2014-06-01 2014-09-15 NA NA RSE NA NA
765 2014-10-01 2014-12-15 NA NA RSE NA NA
765 NA NA N NA RSE BS Physics
what I am trying to achieve is for each sid I want to get the first graduated value from from the first college. If a student does not graduate then we get the value of the last college.
Also, note that that Desired Output has a new columns first_start and last_end
first_start= students first enrollment datelast_end= student last enrollment date
For student graduated == Y, we only take this from college they first graduated.
For student graduated == N, first_start is the first college start date, last_end is the date last_end of last college
Desired Output
sid, first_start, last_end, graduated, grad_date, college_name, degree, major
123 2010-01-01 2010-12-15 Y 2010-12-15 ABC BS Biology
765 2010-01-01 2014-12-15 N NA RSE BS Physics
My Approach:
- the closest I got is when I group by ['sid', 'college_name'] and then aggregrate value for other columns such as
minvalue forfirst_startandmaxforlast_endand so on.. that does not get me to the Desired Output
Could someone please help achieve this result?