I have a DataFrame of companies, called slave (<class 'pandas.core.frame.DataFrame'>):
Date  ID  Name  Company    Email
2018   1  AB    Apple      apple@apple.com
2017   2  BC    MicroSoft  ms@ms.com
2016   3  CD    Amazon     amz@amz.com
The Company column has dtype object, not string.
I also have another, much bigger DataFrame (about 2 GB) of master data (<class 'pandas.core.frame.DataFrame'>):
code  company_name  product
 123  MicroSoft     Windows
 456  Apple         iphone
 789  Amazon        cloud
The company_name column is also dtype object, the same as Company, not string.
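For reference, small frames like these reproduce the structure shown above (the real master has, of course, 20 million rows):

import pandas as pd

# toy versions of the two tables from the question
slave = pd.DataFrame({
    'Date': [2018, 2017, 2016],
    'ID': [1, 2, 3],
    'Name': ['AB', 'BC', 'CD'],
    'Company': ['Apple', 'MicroSoft', 'Amazon'],
    'Email': ['apple@apple.com', 'ms@ms.com', 'amz@amz.com'],
})

master = pd.DataFrame({
    'code': [123, 456, 789],
    'company_name': ['MicroSoft', 'Apple', 'Amazon'],
    'product': ['Windows', 'iphone', 'cloud'],
})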
I want to look up every value of Company in the company_name column of the second DataFrame and append the matching code to a list:
def finder(slave, master):
    finalList = []
    for company in slave['Company']:
        # some_specific_type is a placeholder for the type I actually check for
        if isinstance(company, some_specific_type):
            for idx, info in master['company_name'].items():
                if company in info:
                    # take the code from the master row whose name contains the company
                    finalList.append(master.loc[idx, 'code'])
                    break
        else:
            # values of any other type are kept as they are
            finalList.append(company)
    return finalList
How can I make the lookup much faster in this situation? The master table has 20 million rows and the current loop is extremely slow.
slave['newCode'] = finder(slave, master)  # very slow
The expected result is:
Date  ID  Name  Company    Email            newCode
2018   1  AB    Apple      apple@apple.com      456
2017   2  BC    MicroSoft  ms@ms.com            123
2016   3  CD    Amazon     amz@amz.com          789
Note that I am checking the type of each value, so this is not a simple merge operation. The slave and master DataFrames also have very different row counts: master has 20 million rows, while slave has only about 1,000. I want to compare against the company_name column, but retrieve the value from the code column of the matching row and store it as a new column in the slave table.
My goal: how can I do this with NumPy instead, using vectorization over NumPy arrays?
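This is roughly the direction I have been experimenting with (a rough sketch only: it ignores the type check, assumes an exact match on the company name rather than my substring check, and assumes each name appears at most once in master):

import numpy as np

master_names = master['company_name'].to_numpy()
master_codes = master['code'].to_numpy()

# sort master once so every slave company can be found by binary search
order = np.argsort(master_names)
sorted_names = master_names[order]
sorted_codes = master_codes[order]

companies = slave['Company'].to_numpy()
pos = np.searchsorted(sorted_names, companies)
pos = np.clip(pos, 0, len(sorted_names) - 1)

# take the code where an exact match is found, otherwise keep the original value
found = sorted_names[pos] == companies
slave['newCode'] = np.where(found, sorted_codes[pos], companies)

Is something like this the right approach, and how would I adapt it to my actual matching logic?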