Here is one possible approach that avoids using apply
Raw DataFrame
date tableNameFrom tableNameJoin attributeName
0 29-03-2019 film language [film.languageId, language.languageID]
1 30-03-2019 inventory rental [invetory.inventoryId, rental.filmId]
Step 1 - (1) Split attributeName into 2 separate columns using , as the separator, (2) remove unwanted square brackets ([ or ]), (3) drop unwanted columns
# 1
df[['tableName','attributeName2']] = df['attributeName'].str.split(',', expand=True)
# 2
df['tableName'] = df['tableName'].str.strip('[')
df['attributeName2'] = df['attributeName2'].str.strip(']')
# 3
df.drop(['attributeName','tableNameFrom','tableNameJoin'], axis=1, inplace=True)
print(df)
date tableName attributeName2
0 29-03-2019 film.languageId language.languageID
1 30-03-2019 invetory.inventoryId rental.filmId
Step 2 - Finally, use this SO post to split the rows from the columns tableName and attributeName2 into separate columns
df_match = (df.set_index(['date'])
.stack()
.str.split('.', expand=True)
.stack()
.unstack(-1)
.reset_index(-1, drop=True)
.reset_index()
)
df_match.columns = ['date','tableName','attributeName']
print(df_match)
date tableName attributeName
0 29-03-2019 film languageId
1 29-03-2019 language languageID
2 30-03-2019 invetory inventoryId
3 30-03-2019 rental filmId
Details
.set_index(['date'] - set the columns that are to be kept as the index of the DataFrame
.stack() - stack rows
.str.split('.', expand=True) - call str.split on period (.) to split those cells into separate columns
.stack() - get rid of NULL values, so call stack again
.unstack(-1)
- since we want the last level of the index to become our columns, so unstack using
unstack(-1) (unstack on the last level)
- I have shown this in detail below1
.reset_index(-1, drop=True) - eliminate superfluous last level using reset_index(-1)
1 Here is the output before the .unstack(-1) step and an explanation of why we need to use -1 inside unstack()
df_intermediate = (df.set_index(['date'])
.stack()
.str.split('.', expand=True)
.stack()
)
print(df_intermediate)
date
29-03-2019 tableName 0 film
1 languageId
attributeName2 0 language
1 languageID
30-03-2019 tableName 0 invetory
1 inventoryId
attributeName2 0 rental
1 filmId
dtype: object
- this shows that we want the last index level (with values 0, 1) to become the columns so we choose
-1 in .unstack(-1)