Why does `\p{P}`, when used in
(^|\p{P})(?!,alpha,),*alpha,*
behave differently from `\p{Ps}`, when used in
(^|\p{Ps})(?!,alpha,),*alpha,*
to process the input
(,alpha,
The `\p{P}` version matches, whereas the `\p{Ps}` version does not.
The problem is with \p{Po}. For example, `(^|[\p{P}-[\p{Po}]])(?!,alpha,),*alpha,*` works as wanted. But I would still be interested in knowing why. – asr Oct 18 '20 at 11:01
See the [list of chars matched with `\p{Ps}`](https://www.fileformat.info/info/unicode/category/Ps/list.htm). List of all categories: https://www.fileformat.info/info/unicode/category/index.htm – Wiktor Stribiżew Oct 18 '20 at 11:11
1 Answer
Because `\p{P}` matches the `,` in your string, not the `(`.
This is because you have a negative lookahead `(?!,alpha,)` after `\p{P}`. It means that after "any punctuation" there must not be the string `,alpha,`. Well, there is `,alpha,` after `(`, so the attempt in which `\p{P}` consumes `(` fails. The regex engine moves forward one character and tries again. This time, `\p{P}` matches `,`, and there is no `,alpha,` after that `,` (there is only `alpha,`!). The rest of the pattern succeeds too, so the whole match succeeds. The matched string is `,alpha,`, without the `(`.
If you change `\p{P}` to `\p{Ps}`, the attempt at `(` fails just like before, but `\p{Ps}` also cannot match `,` (a comma is in category `Po`, not `Ps`), causing the whole match to fail. Note that the `^` alternative doesn't get chosen either: even though the lookahead passes there, your regex requires `alpha` (optionally preceded by commas) to follow immediately, but after the start of the string there is a `(` instead.
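The behavior described above can be checked with a short program. This is only a sketch: the question does not name its regex flavor, so Java's `java.util.regex` is assumed here, and the class name `PunctuationLookaheadDemo` and the helper `firstMatch` are made up for illustration. Other engines that support `\p{...}` and lookahead should give the same result for these two patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
final class PunctuationLookaheadDemo {
    // Patterns copied from the question; only the Unicode property differs.
    static final String ANY_PUNCT = "(^|\\p{P})(?!,alpha,),*alpha,*";
    static final String OPEN_PUNCT = "(^|\\p{Ps})(?!,alpha,),*alpha,*";

    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "(,alpha,";
        // \p{P}: the attempt at '(' is rejected by the lookahead,
        // then ',' (category Po) is accepted one position later.
        System.out.println(firstMatch(ANY_PUNCT, input));  // prints ,alpha,
        // \p{Ps}: '(' is rejected the same way, and ',' is not in Ps.
        System.out.println(firstMatch(OPEN_PUNCT, input)); // prints null
    }
}
```

The first call returns `,alpha,` without the leading `(`, which shows that the match starts at the comma rather than at the parenthesis.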