I have the following pandas DataFrame:
import pandas as pd
import numpy as np
pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})
I'd like to apply where() to only two columns, Qu1 and Qu2, and leave the rest unchanged (see the original Stack Overflow question), so I created pd1:
pd1 = pd_df.where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
"other")[['Qu1', 'Qu2']]
Then I added the remaining column of pd_df, pd_df['Qu3'], to pd1:
pd1['Qu3'] = pd_df['Qu3']
pd_df = []
My question is: I originally wanted to apply where() to part of the DataFrame and keep the remaining columns as they are. Could the code above be dangerous for a large dataset? Can I harm the original data this way? If so, what is the best way to do it?
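For comparison, here is a minimal sketch of one possible alternative: restrict where() to the selected columns and write the result into a copy, so that pd_df itself is never reassigned or emptied. The names cols, counts, and result are illustrative only and do not appear in the code above; I am not sure this is the best approach.

```python
import numpy as np
import pandas as pd

pd_df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

cols = ['Qu1', 'Qu2']   # hypothetical name: the columns to filter
result = pd_df.copy()   # work on a copy so pd_df stays untouched

# Per-column frequency of each value; NaN maps to NaN, which fails the >= 2 test
counts = result[cols].apply(lambda s: s.map(s.value_counts()))

# Keep values seen at least twice, replace the rest with "other"; Qu3 is not touched
result[cols] = result[cols].where(counts >= 2, "other")

print(result)
```

Since where() returns a new object unless inplace=True is passed, pd_df keeps its original values here, and the extra copy costs one additional DataFrame in memory.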
Thanks a lot !