Here is a solution, followed by some thoughts on the problem.
A working solution if, as in your example, the strings in Text are always separated by whitespace:
import pandas as pd
df = pd.DataFrame({'Text': ['A++ python', 'Teapot warmeR'],})
languages = ["Python", "R", "A++", "TEA"]
# Extracting column as list and convert to lower case
text_col = df['Text'].tolist()
text_col = [x.lower() for x in text_col]
# To lower case too
languages = [x.lower() for x in languages]
# Finding "whole words"
to_add = [lang for lang in languages for langs_list in text_col if lang in langs_list.split(" ")]
# Adding columns
for lang in to_add:
    df[lang] = pd.Series(dtype='int')
print(df)
Output:
            Text  python  a++
0     A++ python     NaN  NaN
1  Teapot warmeR     NaN  NaN
Thoughts:
In fact this is an interesting multi-causal problem.
1st cause: "A++" ends with two plus signs, which are regex special characters and need to be escaped.
2nd: You need to find whole words, so we would use the regex word boundary \b "as usual", but:
3rd: \b will match around "Python", but it won't match between the final plus sign of "A++" (a non-word character) and the whitespace after it, because \b is a zero-width assertion that matches only between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.
4th: We could replace the closing \b with \B, and then the regex will match "A++", because \B is the negation of \b. But this time it no longer matches "Python", and it matches "TEA" (inside "Teapot")...
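To make the 3rd and 4th points concrete, here is a small sketch using only the standard re module (no pandas). It lists the positions in "A++ python" where \b and \B succeed; the variable names are only illustrative.

```python
import re

text = "A++ python"

# Positions where the zero-width \b and \B assertions succeed
word_boundaries = [m.start() for m in re.finditer(r"\b", text)]
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(word_boundaries)  # [0, 1, 4, 10]
print(non_boundaries)   # [2, 3, 5, 6, 7, 8, 9]
```

"A++" ends at position 3, which is a non-boundary (plus sign followed by a space), while "python" ends at position 10, which is a boundary. That is why no single choice of \b or \B at the end suits both.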
We can analyse it as follows. Here is the "final" (non-working) code, followed by an explanation of the steps taken to write it:
import re

for lang in languages:
    if lang not in df.columns:
        # Escape regex special characters such as the "+" in "A++"
        needle = re.escape(lang)
        # Leading \b, trailing \B (see the steps below)
        needle = r'\b{}\B'.format(needle)
        if df['Text'].str.contains(needle, case=False, regex=True).any():
            df[lang] = pd.Series(dtype='int')
- For clarity, we use case=False and remove .str.lower() and lang.lower().
- We set regex=True in order to use a regex to match whole words. But as is, the regex will fail because "A++" needs to be escaped.
- We escape the strings with needle = re.escape(lang). But now we match substrings: Python, R, A++ and TEA.
- So we use the word boundary \b on both sides: needle = r'\b{}\b'.format(needle). But now we only get Python...
- So we use the non-word boundary \B at the end: needle = r'\b{}\B'.format(needle). Now we get A++, but this no longer matches Python, and we also get TEA...
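The last three steps can be reproduced without pandas. This is only a sketch: the helper matched() is a hypothetical name, and it uses re.search with re.IGNORECASE in place of str.contains(..., case=False).

```python
import re

texts = ["A++ python", "Teapot warmeR"]
languages = ["Python", "R", "A++", "TEA"]

def matched(template):
    """Return the languages whose pattern matches at least one text."""
    hits = []
    for lang in languages:
        # Escape the language name, then wrap it in the given boundaries
        pattern = re.compile(template.format(re.escape(lang)), re.IGNORECASE)
        if any(pattern.search(t) for t in texts):
            hits.append(lang)
    return hits

escaped_only = matched(r"{}")      # escaping only, no boundaries
both_b = matched(r"\b{}\b")        # \b on both sides
trailing_B = matched(r"\b{}\B")    # \b in front, \B at the end

print(escaped_only)  # ['Python', 'R', 'A++', 'TEA']
print(both_b)        # ['Python']
print(trailing_B)    # ['A++', 'TEA']
```

None of the three variants returns the wanted result, which would be Python and A++ only.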
To conclude, we can't use a simple regex that works in all cases. BUT you can use a more complex regex (adaptive word boundaries, from https://stackoverflow.com/a/45145800/3832970), as in the answer by @Wiktor Stribiżew.
And if, as in your example, the strings in Text are always separated by whitespace, we can split on whitespace and check whether the whole words are in the resulting lists using the in operator.
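For reference, here is a minimal sketch of the adaptive word boundary idea mentioned above, again using only the re module. The pattern is one common formulation and may differ in detail from the linked answer; the helper name adaptive_pattern is hypothetical.

```python
import re

texts = ["A++ python", "Teapot warmeR"]
languages = ["Python", "R", "A++", "TEA"]

def adaptive_pattern(term):
    """Build a case-insensitive whole-term pattern with adaptive boundaries."""
    # Leading side: require \b only when the term starts with a word character
    left = r"(?:(?!\w)|\b(?=\w))"
    # Trailing side: require \b only when the term ends with a word character
    right = r"(?:(?<=\w)\b|(?<!\w))"
    return re.compile(left + re.escape(term) + right, re.IGNORECASE)

found = [lang for lang in languages
         if any(adaptive_pattern(lang).search(t) for t in texts)]
print(found)  # ['Python', 'A++']
```

The boundary check is applied only on the sides where the term begins or ends with a word character, so "A++" is accepted before a space while "TEA" inside "Teapot" and "R" inside "warmeR" are rejected.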