Parsing HTML with regexes is usually a bad idea, but that's not exactly what you're trying to do here. All you really want is to strip out the HTML tags. Your example tries to match the tags and parse out their attributes, which you don't need to do.
If the following assumptions hold:
- You don't need to get rid of HTML entities
- Your tags don't define any whitespace (i.e. you don't care that <p> delimits paragraphs)
- You don't have any comments or doctypes
Then all you need to do is to strip the pattern </?[^>]+>.
Escaped for vim (prefix the command with % to run it on every line of the buffer rather than just the current one), this is:
s/<\/\?[^>]\+>//g
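
If you want to sanity-check the pattern outside vim, here is a minimal Python sketch of the same idea. The function name strip_tags and the sample string are made up for illustration, and it relies on the same assumptions listed above (entities are left alone, no comments or doctypes).

```python
import re

# Same pattern as above: "<", an optional "/", one or more non-">" characters, then ">".
TAG_PATTERN = re.compile(r"</?[^>]+>")


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag; entities are left untouched."""
    return TAG_PATTERN.sub("", text)


# Hypothetical sample input, not taken from the question.
sample = '<p class="intro">Hello, <b>world</b> &amp; friends</p>'
print(strip_tags(sample))  # prints: Hello, world &amp; friends
```

Note that the pattern only looks at angle brackets, so a stray "<" in ordinary text that is followed later by a ">" (as in "1 < 2 and 3 > 2") will be removed along with everything in between.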