Acquire raw data --> transform it and join it with other files --> email to end-users for review
What is the best approach?
If 'employee_id' + 'customer_id' + 'timestamp' is long, and you want something shorter that is unlikely to collide, you can replace it with a hash. The range and quality of the hash determine the probability of collisions. Perhaps the simplest option is the built-in hash; note that in Python 3, string hashes are randomized per process unless PYTHONHASHSEED is set, so the values are not stable across runs. Assuming your DataFrame is df and the columns are strings, this is
(df.employee_id + df.customer_id + df.timestamp).apply(hash)
If you want greater control over the size and collision probability, see this piece on non-cryptographic hash functions in Python.
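To get a rough feel for how hash size affects collision probability, here is a minimal sketch using the standard birthday-bound approximation. The function name collision_probability and the row count of one million are illustrative assumptions, and the estimate only holds for a hash whose outputs are close to uniformly distributed.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound estimate of P(at least one collision).

    Assumes the hash spreads inputs uniformly over 2**hash_bits values.
    """
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

# Compare a few hash sizes for one million rows (hypothetical row count).
# 60 bits corresponds to 10 base64 characters (6 bits per character).
for bits in (32, 60, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

With a million rows, a 32-bit hash is almost certain to collide somewhere, while 60 bits and above make a collision very unlikely.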
Edit
Building on an answer to this question, you could build 10-character hashes like this:
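import base64
import hashlib
df['survey_id'] = (df.employee_id + df.customer_id + df.timestamp).apply(
    # Python 3: md5 needs bytes; base64-encode the digest and keep 10 characters
    lambda s: base64.b64encode(hashlib.md5(s.encode('utf-8')).digest()).decode('ascii')[:10])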
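One thing to watch with plain concatenation: different column values can produce the same combined string (for example 'ab' + 'c' and 'a' + 'bc'), which gives identical hashes regardless of the hash function. A possible workaround, sketched here with made-up sample data and an assumed separator character that does not occur in the values, is to join the columns with a delimiter before hashing. The helper name short_id is hypothetical.

```python
import hashlib

import pandas as pd

# Made-up rows whose plain concatenation is identical: "abc2020-01-01"
df = pd.DataFrame({
    "employee_id": ["ab", "a"],
    "customer_id": ["c", "bc"],
    "timestamp": ["2020-01-01", "2020-01-01"],
})

def short_id(s: str) -> str:
    """Return the first 10 hex characters of the md5 digest of s."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()[:10]

# Plain concatenation: both rows hash the same string
plain = (df.employee_id + df.customer_id + df.timestamp).apply(short_id)

# Assumption: the ASCII unit separator never appears in the data
SEP = "\x1f"
delimited = (df.employee_id + SEP + df.customer_id + SEP + df.timestamp).apply(short_id)

print(plain.tolist())
print(delimited.tolist())
```

The two plain hashes are equal because the inputs are the same string; the delimited inputs differ, so their hashes should differ as well.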
If anyone is looking for a modularized function, save this into a file for use where needed (for pandas DataFrames).
df is your dataframe, columns is a list of columns to hash over, and name is the name of your new column with hash values.
Returns a copy of the original dataframe with a new column containing the hash of each row.
from hashlib import sha256

def hash_cols(df, columns, name="hash"):
    new_df = df.copy()

    def func(row, cols):
        # Concatenate the string form of each selected column, then hash it
        col_data = []
        for col in cols:
            col_data.append(str(row.at[col]))
        col_combined = ''.join(col_data).encode()
        hashed_col = sha256(col_combined).hexdigest()
        return hashed_col

    # Apply row by row and store the hex digest in the new column
    new_df[name] = new_df.apply(lambda row: func(row, columns), axis=1)
    return new_df
I had a similar problem, which I solved like this:
import hashlib
import pandas as pd
df = pd.DataFrame.from_dict({'mine': ['yours', 'amazing', 'pajamas'], 'have': ['something', 'nothing', 'between'], 'num': [1, 2, 3]})
hashes = []
for index, row in df.iterrows():
    hashes.append(hashlib.md5(str(row).encode('utf-8')).hexdigest())
# if you want the hashes in the df,
# in my case, I needed them to form a JSON entry per row
df['hash'] = hashes
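Each result is an md5 hex digest, but you can use any hash function you need. Note that str(row) includes the row's index label (shown as the Series name), so two rows with identical values but different index labels get different hashes.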
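If a 64-bit integer hash is enough and you would rather avoid the Python-level loop, pandas ships a helper, pandas.util.hash_pandas_object, that hashes each row in one call. This is a sketch of that alternative, not the approach used above; it reuses the same sample data, and passing index=False is a choice made here so that only the column values feed the hash.

```python
import pandas as pd

df = pd.DataFrame({
    "mine": ["yours", "amazing", "pajamas"],
    "have": ["something", "nothing", "between"],
    "num": [1, 2, 3],
})

# One uint64 hash per row; index=False leaves the row label out of the hash
df["hash"] = pd.util.hash_pandas_object(df[["mine", "have", "num"]], index=False)

print(df)
```

The result is a numeric column rather than a hex string, so convert it (for example with format(value, 'x')) if you need text for a JSON entry.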