Using strsplit() in R, ignoring anything in parentheses

Question

I'm trying to use strsplit() in R to break a string into pieces based on commas, but I don't want to split up anything in parentheses. I think the answer is a regex but I'm struggling to get the code right.

So for example:

x <- "This is it, isn't it (well, yes)"
> strsplit(x, ", ")
[[1]]
[1] "This is it"     "isn't it (well" "yes)"

When what I would like is:

[1] "This is it"     "isn't it (well, yes)"

You are trying to treat the parentheses `(...)` as an unsplittable block, so that intent has to be encoded in the splitting regexp. This is not a simple task. — huckfinn, Feb 11 '16 at 18:59

score 15 · Answer 1 · edited Aug 08 '16 at 03:12

We can use a PCRE regex with (*SKIP)(*FAIL) to discard any , that follows a ( and comes before the closing ), and then split on the remaining , followed by zero or more whitespace characters (\\s*):

 strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl=TRUE)[[1]]
 #[1] "This is it"           "isn't it (well, yes)"
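
As a quick check of how the pattern behaves with more than one comma inside the parentheses, here is a minimal sketch. The helper name split_outside_parens and the second test string are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (the name is illustrative): split on commas that are
# not inside a "(...)" group, using the pattern from this answer.
split_outside_parens <- function(s) {
  pattern <- "\\([^)]+,(*SKIP)(*FAIL)|,\\s*"
  strsplit(s, pattern, perl = TRUE)[[1]]
}

split_outside_parens("This is it, isn't it (well, yes)")
# pieces: "This is it", "isn't it (well, yes)"

# [^)]+ is greedy, so it runs up to the last comma before ")" and all
# commas inside the group are skipped together.
split_outside_parens("a, b (c, d, e), f")
# pieces: "a", "b (c, d, e)", "f"
```

Note that the pattern only protects commas that come after a ( and before the next ), so an unclosed ( would presumably still be split on; that case is not covered here.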

edited Aug 08 '16 at 03:12

Rich Scriven

answered Feb 11 '16 at 18:52

akrun

score 6 · Answer 2 · edited May 23 '17 at 12:22

I would suggest another regex with (*SKIP)(*F) to ignore all the (...) substrings and only match the commas outside of parenthesized substrings:

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*", perl = TRUE)

See IDEONE demo

You can read more in "How do (*SKIP) or (*F) work on regex?". The regex matches:

\\( - an opening parenthesis
[^()]* - zero or more characters other than ( and )
\\) - a closing parenthesis
(*SKIP)(*F) - verbs that make the match fail at this point and tell the engine to resume searching after the closing parenthesis, so nothing inside (...) can be matched
| - or...
\\h*,\\h* - a comma surrounded by zero or more horizontal whitespace characters.
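
To make the breakdown concrete, here is a short sketch of what the pattern returns on the sample string, and where it stops working. The variable names flat, nested and y are illustrative, and the recursive variant at the end is an assumption added for comparison, not something from the answer.

```r
# Pattern from the answer: skip whole "(...)" groups, split on the commas left over.
flat <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too)"
strsplit(x, flat, perl = TRUE)[[1]]
# pieces: "This is it", "isn't it (well, yes)",
#         "and (well, this, that, and this, too)"

# Nested parentheses defeat [^()]*: the inner group is skipped, but the
# comma after it (still inside the outer group) is split on.
y <- "a (b (c, d), e), f"
strsplit(y, flat, perl = TRUE)[[1]]
# pieces: "a (b (c, d)", "e)", "f"

# Assumed variant (not from the answer): capture group 1 recurses into
# itself via (?1), so balanced nested groups are skipped as one unit.
nested <- "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)|\\h*,\\h*"
strsplit(y, nested, perl = TRUE)[[1]]
# pieces: "a (b (c, d), e)", "f"
```

If the input never nests parentheses, the simpler pattern from the answer should be enough.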

edited May 23 '17 at 12:22

Community

answered Feb 11 '16 at 19:18

Wiktor Stribiżew

Did you hijack *stribizhev*'s account? o_O – Bhargav Rao Feb 11 '16 at 20:38
@BhargavRao: It is my account, I just changed the name. You can do it once a month on SO :) – Wiktor Stribiżew Feb 11 '16 at 20:55
This is great. Thanks! – John Smith Feb 12 '16 at 13:38

score 1 · Answer 3 · answered Feb 11 '16 at 22:31

A different approach:

Adding on to @Wiktor's sample string,

x <- "This is it, isn't it (well, yes), and (well, this, that, and this, too). Let's look, does it work?"

Now the magic:

> strsplit(x, ", |(?>\\(.*?\\).*?\\K(, |$))", perl = TRUE)
[[1]]
[1] "This is it"                                       
[2] "isn't it (well, yes)"                             
[3] "and (well, this, that, and this, too). Let's look"
[4] "does it work?"

So how does , |(?>\\(.*?\\).*?\\K(, |$)) match?

| matches either of the alternatives on its two sides:
- on the left, the literal string ", " (a comma followed by a space),
- and on the right, (?>\\(.*?\\).*?\\K(, |$)):
  - (?> ... ) sets up an atomic group, which does not allow backtracking to reevaluate what it matches.
  - In this case, it looks for an open parenthesis (\\(),
  - then any character (.) repeated from 0 to infinity times (*), but as few as possible (?), i.e. . is evaluated lazily.
  - The previous . repetition is then limited by the first close parenthesis (\\)),
  - followed by another run of any character, again repeated zero or more times but as few as possible (.*?),
  - with a \\K at the end, which throws away the match so far and sets the starting point of a new match.
  - The previous .*? is limited by a capturing group (( ... )) with an | that either
    - selects an actual text string, ", ",
    - or moves \\K to the end of the line, $, if there are no more commas.

*Whew.*

If my explanation is confusing, see the docs linked above, and check out regex101.com, where you can put in the above regex (single escaped—\—instead of R-style double escaped—\\) and a test string to see what it matches and get an explanation of what it's doing. You'll need to set the g (global) modifier in the box next to the regex box to show all matches and not just the first.
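
One test string worth trying has two parenthesized groups before the first outside comma. As far as I can trace the pattern, the lazy .*? after the first closing parenthesis runs on until the next ", ", even when that comma sits inside a second group. The sketch below compares it with the (*SKIP)(*F) pattern from Answer 2; the variable names and the test string z are illustrative assumptions.

```r
atomic <- ", |(?>\\(.*?\\).*?\\K(, |$))"       # pattern from this answer
skip   <- "\\([^()]*\\)(*SKIP)(*F)|\\h*,\\h*"  # pattern from Answer 2

z <- "a (b, c) and (d, e), f"

# After "(b, c)", the lazy .*? expands over " and (d" until it meets ", ",
# so the split lands inside the second group.
strsplit(z, atomic, perl = TRUE)[[1]]
# pieces: "a (b, c) and (d", "e)", "f"

# The (*SKIP)(*F) pattern skips each "(...)" group separately.
strsplit(z, skip, perl = TRUE)[[1]]
# pieces: "a (b, c) and (d, e)", "f"
```

So this approach seems safest when each comma-separated piece contains at most one parenthesized group.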

Happy strspliting!
