Each day I get a pandas dataframe that has five columns, called column1, column2, column3, column4, column5. I want to add the rows I have not previously received to a file where I keep the unique rows, called known_data.csv. To do so, I wrote some code that should:

1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data df
3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge[merge.existing.isnull()] (the complement of the new data and the existing data)
5. Append the new rows to the known_data.csv file
My code looks like this:

import pandas as pd

# Flag every known row so unmatched rows show up as NaN after the merge
existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'

merge_data = pd.merge(new_data, existing_data,
                      on=['column1', 'column2', 'column3', 'column4', 'column5'],
                      how='left')

# Rows with NaN in 'existing' did not match any known row
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not work as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?
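To make that last question concrete: pandas infers dtypes when reading a CSV back, so a column of strings can come back as integers. A minimal sketch of that round trip, using a throwaway tmp.csv:

import pandas as pd

# column1 starts out as strings ('object' dtype)
df = pd.DataFrame({'column1': ['001', '002']})
df.to_csv('tmp.csv', index=False)

# Read back: pandas infers int64, so '001' has become 1
print(pd.read_csv('tmp.csv').dtypes)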
Edit: Adding a new column called 'existing' to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion, it would be hugely appreciated! One alternative is sketched below.
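Assuming a pandas version that supports merge's indicator argument (0.17+), the merge itself can flag where each row came from, so the extra 'existing' column is unnecessary. A sketch:

key_cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(new_data, existing_data, on=key_cols, how='left', indicator=True)

# Rows present only in new_data are the complement
complement = merged.loc[merged['_merge'] == 'left_only', key_cols]
complement.to_csv("known_data.csv", mode='a', index=False, header=False)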
Edit 2: The problem was that, although the dataframes looked identical, some values were of a different type. Somehow the error only showed up when I tried to merge a subset of the new dataframe for which this was the case.
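For anyone who runs into the same thing: comparing the dtypes of the two frames makes the mismatch visible, and casting the new frame's key columns to the existing frame's dtypes before merging avoids it. A minimal sketch, assuming the frames from above:

key_cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# Spot mismatches, e.g. int64 in one frame vs. object in the other
print(existing_data[key_cols].dtypes)
print(new_data[key_cols].dtypes)

# Align dtypes so equal-looking values actually compare equal in the merge
new_data[key_cols] = new_data[key_cols].astype(existing_data[key_cols].dtypes.to_dict())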