I'm working with a Python source code corpus. I would like the strings to be replaced with STRING. Python strings are annoying because they allow so many delimiters. Here is what I've tried and the issues I've run into.
- r'"(\\"|[^"])*"' and r"'(\\'|[^'])*'"
  These don't work when a string contains the opposite delimiter.
- r'(\'|"|\'\'\'|""")(?:\\\1|(?!\1))*\1'
  This was my attempt at a catch-all, but the lookahead doesn't work. I basically wanted r'(\'|"|\'\'\'|""")(?:\\\1|[^\1])*\1', if that were possible.

Issues I've run into:

- Multiline strings mess stuff up. You can't use [^"""] because """ is not one character.
- Strings that contain the other delimiter, like "'".
- Strings that escape the delimiter, like '\''.
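
A minimal sketch of what the catch-all attempt seems to be reaching for, assuming the standard re module. A lookahead consumes nothing, so (?!\1) has to be followed by something that actually matches a character. Listing the triple-quoted delimiters first lets them win over the one-character ones. The names STRING_RE and replace_strings_regex are made up for illustration. The pattern knows nothing about comments or string prefixes, so a stray quote in a comment can still confuse it, and r'...' would come out as rSTRING.

```python
import re

# Alternation order matters: try ''' and """ before ' and ".
# Body: either an escaped character (backslash + anything), or any
# non-backslash character that does not begin the closing delimiter.
STRING_RE = re.compile(
    r"""('{3}|"{3}|'|")(?:\\.|(?!\1)[^\\])*\1""",
    re.DOTALL,  # let "." match a newline after a backslash
)


def replace_strings_regex(source: str, placeholder: str = "STRING") -> str:
    """Replace every quoted span the regex finds with the placeholder."""
    # A function is used so the placeholder is never read as a template.
    return STRING_RE.sub(lambda match: placeholder, source)


if __name__ == "__main__":
    print(replace_strings_regex('print("\'blah\' \\"blah\\"")'))
```

Because the source is searched as one string, this version needs the whole file read at once rather than line by line.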
These are the kinds of strings that need to be matched. Each line below is one complete string, delimiters included.

    '/$\'"`'
    '\\'
    '^__[\'\\"]([^\'\\"]*)[\'\\"]'
    "Couldn't do that"

These are all valid strings, but you can probably see where it might be hard to match them. Essentially, I want this:

    def hello_world():
        print("'blah' \"blah\"")

To become:

    def hello_world():
        print( STRING )

For simplicity's sake, let's say the entire Python file is inside of a string. Right now I am reading the file line by line, but I could treat it as one string if necessary. It really doesn't matter how the file is read. If your solution requires a specific method, I will use that. I am not sure this problem can be solved entirely with regex. If you have a solution that involves other code, that would be much appreciated as well.
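
One non-regex route, sketched here under the assumption that the corpus is valid Python 3 that the standard tokenize module can read: let Python's own tokenizer find the string literals and splice a placeholder over each one. The function name replace_strings and the placeholder argument are hypothetical. Note that on Python 3.12 and later, f-strings are split into separate FSTRING_* tokens and are not covered by this sketch.

```python
import io
import tokenize


def replace_strings(source: str, placeholder: str = "STRING") -> str:
    """Return source with every STRING token replaced by the placeholder."""
    lines = source.splitlines(keepends=True)

    # Each token carries (row, col) start and end positions; rows are 1-based.
    spans = [
        (tok.start, tok.end)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.STRING
    ]

    # Work backwards so positions of earlier tokens stay valid.
    for (start_row, start_col), (end_row, end_col) in reversed(spans):
        head = lines[start_row - 1][:start_col]
        tail = lines[end_row - 1][end_col:]
        # A multiline string collapses into a single line here.
        lines[start_row - 1:end_row] = [head + placeholder + tail]

    return "".join(lines)


if __name__ == "__main__":
    sample = 'def hello_world():\n    print("\'blah\' \\"blah\\"")\n'
    print(replace_strings(sample), end="")
```

Since the tokenizer already understands comments, prefixes such as r'' and b'', escapes, and triple-quoted strings, none of the delimiter cases listed above need special handling. The trade-off is that a file which does not tokenize cleanly raises an error instead of being partially processed.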