Some column in dataframe df, df.column, is stored as datatype int64.
The values are all 1s or 0s.
Is there a way to replace these values with boolean values?
df['column_name'] = df['column_name'].astype('bool')
For example:
import pandas as pd
import numpy as np
# randint's upper bound is exclusive, so (0, 2) yields only 0s and 1s
df = pd.DataFrame(np.random.randint(0, 2, size=5),
                  columns=['foo'])
print(df)
# foo
# 0 0
# 1 1
# 2 0
# 3 1
# 4 1
df['foo'] = df['foo'].astype('bool')
print(df)
yields
foo
0 False
1 True
2 False
3 True
4 True
Given a list of column_names, you could convert multiple columns to bool dtype using:
df[column_names] = df[column_names].astype(bool)
If you don't have a list of column names, but wish to convert, say, all numeric columns, then you could use
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)
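As a small illustration of the select_dtypes approach above, here is a minimal sketch on a made-up frame (the column names flag, score and name are assumptions for the example only). It shows that every numeric column is converted, including float columns, while non-numeric columns are left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one 0/1 integer column, one float column, one string column
df = pd.DataFrame({
    "flag": [1, 0, 1],
    "score": [0.0, 2.5, 0.0],
    "name": ["a", "b", "c"],
})

# Select every numeric column (ints and floats alike) and convert them all
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)

print(df.dtypes)
print(df)
```

Note that score becomes bool as well (any non-zero value turns into True), so this is only appropriate when every numeric column really is a 0/1 flag.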
There are several ways to do this. The options covered below are:
Using pandas.Series.map
Using pandas.Series.astype
Using pandas.Series.replace
Using pandas.Series.apply
Using numpy.where
Since the OP didn't specify a dataframe, this answer uses the following one:
import pandas as pd
df = pd.DataFrame({'col1': [1, 0, 0, 1, 0], 'col2': [0, 0, 1, 0, 1], 'col3': [1, 1, 1, 0, 1], 'col4': [0, 0, 0, 0, 1]})
[Out]:
col1 col2 col3 col4
0 1 0 1 0
1 0 0 1 0
2 0 1 1 0
3 1 0 0 0
4 0 1 1 1
We will convert only the values in col1 to boolean. To convert the whole dataframe, see the notes at the end.
The Time Comparison section measures the execution time of each option.
Option 1
Using pandas.Series.map as follows
df['col1'] = df['col1'].map({1: True, 0: False})
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 2
Using pandas.Series.astype as follows
df['col1'] = df['col1'].astype(bool)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
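Options 1 and 2 give the same result on clean 0/1 data, but they differ when a column contains anything else. The sketch below uses a made-up Series with a stray 2 (an assumption for illustration) to show the difference: astype(bool) treats every non-zero value as True, while map leaves unmapped values as NaN.

```python
import pandas as pd

# Hypothetical column with an unexpected value (2) alongside the 0s and 1s
s = pd.Series([1, 0, 2])

as_bool = s.astype(bool)              # any non-zero value becomes True
mapped = s.map({1: True, 0: False})   # values missing from the dict become NaN

# A simple guard one could run before converting
only_zeros_and_ones = s.isin([0, 1]).all()

print(as_bool.tolist())
print(mapped.tolist())
print(only_zeros_and_ones)
```

If silent coercion of stray values is a concern, a check like the isin guard can be run first.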
Option 3
Using pandas.Series.replace, with one of the following options
# Option 3.1
df['col1'] = df['col1'].replace({1: True, 0: False})
# or
# Option 3.2
df['col1'] = df['col1'].replace([1, 0], [True, False])
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 4
Using pandas.Series.apply and a custom lambda function as follows
df['col1'] = df['col1'].apply(lambda x: x == 1)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
Option 5
Using numpy.where as follows
import numpy as np
df['col1'] = np.where(df['col1'] == 1, True, False)
[Out]:
col1 col2 col3 col4
0 True 0 1 0
1 False 0 1 0
2 False 1 1 0
3 True 0 0 0
4 False 1 1 1
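As a side note on Option 5, the condition passed to numpy.where is already a boolean Series, so the comparison alone gives the same values. A minimal sketch, using the col1 values from the example dataframe:

```python
import numpy as np
import pandas as pd

col1 = pd.Series([1, 0, 0, 1, 0])

via_where = np.where(col1 == 1, True, False)  # returns a NumPy array
via_compare = col1 == 1                       # returns a boolean Series

print(via_where.dtype, via_compare.dtype)
print((via_where == via_compare.to_numpy()).all())
```

The only practical difference here is the return type: np.where hands back a NumPy array, while the comparison keeps the pandas index.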
Time Comparison
Execution time was measured with time.perf_counter(), which reports seconds.
       method                time (s)
0    Option 1  0.00000120000913739204
1    Option 2  0.00000220000219997019
2  Option 3.1  0.00000179999915417284
3  Option 3.2  0.00000200000067707151
4    Option 4  0.00000400000135414302
5    Option 5  0.00000210000143852085
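Single measurements on a five-row column are in the microsecond range and are likely dominated by noise, so the ranking above may not hold on larger data. The following is a rough sketch of how one might repeat the comparison with timeit on a bigger column. The Series size, the seed, and the choice to benchmark only four of the options are assumptions made for this example, not part of the original measurements.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 100,000 random 0/1 values
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 2, size=100_000))

candidates = {
    "map": lambda: s.map({1: True, 0: False}),
    "astype": lambda: s.astype(bool),
    "apply": lambda: s.apply(lambda x: x == 1),
    "np.where": lambda: np.where(s == 1, True, False),
}

# Best of 3 repeats, 5 calls per repeat, reported as seconds per call
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds:.6f} s per call")
```

Results depend on the machine and the pandas version, so no expected numbers are given here.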
Notes:
There are strong opinions on using .apply(), so it is worth reading up on it before relying on it.
There are other ways to measure execution time. See: How do I get time of a Python program's execution?
To convert the whole dataframe, one can use, for example:
df = df.astype(bool)
[Out]:
col1 col2 col3 col4
0 True False True False
1 False False True False
2 False True True False
3 True False False False
4 False True True True
Reference: Stack Overflow unutbu (Jan 9 at 13:25), BrenBarn (Sep 18 2017)
I had numerical columns such as age and ID that I did not want to convert to Boolean. So, after identifying the numerical columns as unutbu showed, I kept only the columns whose maximum is 1, filtering out those with a maximum greater than 1.
import numpy as np
# Numeric columns, as per unutbu
column_names = df.select_dtypes(include=[np.number]).columns
# Take the max of each numeric column (np.number matches every numeric dtype) and store it in a temporary variable m.
m = df[column_names].max().reset_index(name='max')
# Filter, as BrenBarn showed in another post, to keep the rows of m where max == 1, and store them in a temporary variable n.
n = m.loc[m['max'] == 1, 'max']
# The index of n gives positions into column_names, so p holds the names of the columns whose max is 1.
p = column_names[n.index]
# Convert only those columns. Using column_names directly instead of p would turn every numeric column into Booleans.
df[p] = df[p].astype(bool)
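A possible variation on the same idea, sketched under the assumption that a column should be converted only when all of its values are 0 or 1: checking membership with isin instead of the maximum also catches all-zero columns (whose max is 0) and skips columns with negative values. The frame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: id and age should stay numeric, the flags should become bool
df = pd.DataFrame({
    "id": [10, 11, 12],
    "age": [30, 41, 25],
    "active": [1, 0, 1],
    "churned": [0, 0, 0],
})

numeric = df.select_dtypes(include=[np.number])

# True for each numeric column whose values are all 0 or 1
is_binary = numeric.isin([0, 1]).all()
binary_cols = is_binary[is_binary].index

df[binary_cols] = df[binary_cols].astype(bool)
print(df.dtypes)
```

Here churned is all zeros, so a max == 1 filter would leave it as an integer column, while the isin check converts it.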