Matching a substring from a substring list in a list of strings

Question

I have a list of substrings which has about 10000 entries -

substr_ls = ['N_COULT16_1 1', 'S_COULT2', 'XBG_F 1', 'FAIRWY_3', .....]

I have a list of strings which has about 100 entries -

main_str_ls = ['N_COULT16_1 1XF', 'S_COULT2_RT', 'XBG_F TX300 1', 'FAIRWY_34_AG', ....]

As you can see, the substrings are not exact substrings of the strings in main_str_ls. The sequence of letters, digits, etc. in a substring has to match the sequence in the string for it to count as a match. For example, 'XBG_F 1' matches 'XBG_F TX300 1' because the sequence matches even though there is a 'TX300' between 'XBG_F' and '1'. What I'm currently doing is using this function -

import itertools

def is_subsequence(pattern, items_to_use):
    # One shared iterator: each character search resumes where the last one stopped.
    items_to_use = iter(items_to_use)
    return all(any(x == y for y in items_to_use) for x, _ in itertools.groupby(pattern))

from "Finding a substring in a jumbled string". I iterate over main_str_ls (its contents used as items_to_use) and substr_ls (its contents used as pattern), and when I find a match, I do some stuff and break out of the loop. Something like this -

for main_str in main_str_ls:
    main_str = main_str.strip()
    for substr in substr_ls:
        substr = substr.strip()
        if is_subsequence(substr, main_str):
            ...  # do stuff
            break  # stop at the first matching substring

Is there a better way or a pythonic approach for doing this?

I would change the `substr_ls` list into a list of regex's `re_str_ls`. `"XBG_F 1"` could become `r"XBG_F.*1"`, then use `if re.match(re_str, test_str): ...` — flakes, Feb 10 '21 at 16:44
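A minimal sketch of the regex idea from this comment, assuming that extra characters may only appear where the substring contains whitespace (so `'XBG_F 1'` becomes `XBG_F.*1`). The helper names `to_gap_regex` and `first_match` are hypothetical, and the lists hold only the four sample entries shown in the question:

```python
import re

def to_gap_regex(substr):
    # Assumption: gaps are allowed only at whitespace in the substring.
    # Each whitespace-separated piece is escaped, then joined with ".*".
    parts = substr.split()
    return re.compile(".*".join(re.escape(p) for p in parts))

substr_ls = ['N_COULT16_1 1', 'S_COULT2', 'XBG_F 1', 'FAIRWY_3']
main_str_ls = ['N_COULT16_1 1XF', 'S_COULT2_RT', 'XBG_F TX300 1', 'FAIRWY_34_AG']

# Compile once up front so the patterns are not rebuilt for every main string.
re_str_ls = [to_gap_regex(s) for s in substr_ls]

def first_match(main_str):
    # re.match anchors at the start of the string but not at the end.
    for substr, rx in zip(substr_ls, re_str_ls):
        if rx.match(main_str.strip()):
            return substr
    return None

matches = {m: first_match(m) for m in main_str_ls}
```

Because `re.match` only anchors at the start, `'FAIRWY_3'` also matches `'FAIRWY_34_AG'`. Whether that is the intended behaviour depends on the data.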

score 1 · Answer 1 · answered Feb 10 '21 at 18:06

One difference between what you need and the jumbled-string question is that the latter is concerned with allowing repeats. I don't think you can use that design directly. Instead, try this link: https://www.geeksforgeeks.org/given-two-strings-find-first-string-subsequence-second/
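As a rough illustration of a plain subsequence check (not necessarily the exact code behind the link), the sketch below walks the target string once and requires every character of the pattern, repeats included, to appear in order. The name `is_strict_subsequence` is hypothetical:

```python
def is_strict_subsequence(pattern, text):
    # A single iterator over text: `ch in it` consumes characters until it
    # finds ch, so the next lookup resumes just after the previous hit.
    it = iter(text)
    return all(ch in it for ch in pattern)

print(is_strict_subsequence('XBG_F 1', 'XBG_F TX300 1'))
print(is_strict_subsequence('aab', 'ab'))
```

Unlike the `groupby`-based version, repeated characters in the pattern are not collapsed, so `'aab'` is not considered a subsequence of `'ab'`.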

answered Feb 10 '21 at 18:06

Bing Wang


The best way to go about it will still be to iterate over both lists, right? – dhruv gami Feb 10 '21 at 19:27

Yes, the complexity is O(MN(L1+L2)). Given your problem size, I think it is workable. – Bing Wang Feb 10 '21 at 19:54
