I have a data frame that includes a column of messy strings. Each messy string includes the name of a single country somewhere in it. Here's a toy version:
df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)
Thanks to the countrycode package, I also have a second data set that includes two useful columns: one with regular expressions for country names (regex) and another with the associated country name (country.name). We can load this data set like this:
library(countrycode)
data(countrycode_data)
I would like to write code that uses the regular expressions in countrycode_data$regex to spot the country name in each row of df$string; associates that regex with the proper country name in countrycode_data$country.name; and, finally, writes that name to the relevant position in a new column, df$country. After performing this TBD operation, df would look like this:
                        string                                country
1       Russia is cool (2015)                      Russian Federation
2               I like - China                                  China
3 Stuff happens in North Korea Korea, Democratic People's Republic of
I can't quite wrap my head around how to do this. I have tried using various combinations of grepl, which, tolower, and %in%, but I'm getting the direction or dimensions (or both) wrong.
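To make the target operation concrete, here is a minimal sketch of the three steps described above: test every pattern against one string, pick the matching pattern, and return its associated name. It is only an illustration under stated assumptions. The `lookup` table is a hypothetical stand-in for countrycode_data with the same two column names, its three patterns are simplified ones written for this toy example rather than the package's actual regexes, and `match_country` is a made-up helper name. `perl = TRUE` is set on the assumption that the real patterns may use Perl-style constructs such as lookaheads.

```r
# Toy data from the question
df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for countrycode_data (assumption: same column names,
# simplified patterns that are NOT the package's real regexes)
lookup <- data.frame(regex = c("russia", "china", "north.?korea"),
                     country.name = c("Russian Federation",
                                      "China",
                                      "Korea, Democratic People's Republic of"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: return the name tied to the first pattern that
# matches a single string, or NA if no pattern matches
match_country <- function(s, patterns, country_names) {
  hit <- vapply(patterns,
                function(p) grepl(p, s, ignore.case = TRUE, perl = TRUE),
                logical(1))
  if (any(hit)) country_names[which(hit)[1]] else NA_character_
}

# Apply the helper to every row and store the result in a new column
df$country <- vapply(df$string,
                     function(s) match_country(s, lookup$regex, lookup$country.name),
                     character(1),
                     USE.NAMES = FALSE)
```

With the real data, `lookup$regex` and `lookup$country.name` would presumably be swapped for countrycode_data$regex and countrycode_data$country.name. Taking only the first hit is a design choice of this sketch; a string that matches several patterns would need a different tie-breaking rule.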