Using dplyr, assign unique names to groups of rows defined by multiple criteria

Question

In my reprex below:

RSA is the output of a process that is to be analyzed and whose results is to be grouped.
Each RSA group has varying range of days (datenum) each is observed.
var1 varies less frequently, but each is observed for the same 8 days consecutively.
The RSA groups are to be numbered sequentially within the var1 group; when a new var1 is encountered the RSA group numbering begins anew.
idx_objective is the index that I am looking for.

Reprex:

var1 <- c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ccc", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd", "ddd")
RSA <- c(1,1,1,0,-1,-1,0,-1, 
         0,0,0,-1,-1,-1,1,1,
        -1,-1,0,1,1,-1,-1,1, 
        1,-1,-1,1,1,0,-1,1)
idx_objective <- c(1,1,1,2,3,3,4,5, 
                   1,1,1,2,2,2,3,3, 
                   1,1,2,3,3,4,4,5, 
                   1,2,2,3,3,4,5,6)
objective.df <- data.frame(var1, RSA, idx_objective) %>%
  group_by(var1) %>%
  mutate  (datenum = 1:n()) %>%
  relocate (datenum, .after = var1)

I have reviewed many SO posts that appear to be similar...

1 dplyr: group variables then assign unique names based on unique grouping

revolves around correct use of cumsum, which I think I am using correctly

[https://stackoverflow.com/questions/40519129/how-to-assign-unique-id-for-group-of-duplicates]

[2]How to divide between groups of rows using dplyr

The last two don't seem applicable; two others referenced in the following:

Approach #1: using a change flag and cumsum

objective.try1 <- objective.df %>%
  group_by(var1) %>%
  mutate(chg_flg = ifelse(lag(RSA) != RSA, 1, 0) %>%
    coalesce(0)) %>%
  relocate(chg_flg, .after = RSA) %>%
  relocate (datenum, .after var1) %>%
  group_by(var1, chg_flg) %>%
  mutate (idx_objective_try = cumsum(chg_flg) +1) %>%

Results:

objective.try1 <- c(1, 1, 1, 2, 3, 1, 4, 5, 1, 1, 1, 2, 1, 1, 3, 1, 1, 2, 3, 1, 4, 1, 5, 1, 2, 1, 3, 1, 4, 5, 6)
objective.df <- data.frame(var1, RSA, idx_objective, objective.try1 %>%
  group_by(var1) %>%
  mutate (datenum = 1: n()) %>%
  relocate(datenum, .after = var1)

Observation for objective.try1: rows 1-5 work, but row 6 incorrectly restarts the idx numbering over again, but then resumes correctly reflecting the chg_flg until rows 13 and 14 at which time the idx numbering is again incorrectly restarted, but then resumes again being correct for one row until being incorrect again at rows 16, 21, 23, 27, and 29.

Following the logic at row 6, for example -- the previous idx_objective_try (row 5) is 3 and the chg_flg value at row 6 is zero, so the idx_objecitve_try ought to be the correct value of 3. Why isn't it?

Approach #2: Using match and duplicated:

objective.try2 <- objective.df %>%
  group_by(var1) %>%. # var1 corresponds to "prop" in the SO post (both the slower moving variables)
  mutate(well_rep1 = match(RSA, unique(RSA)), # "RSA" corresponds to "well" in the SO post (both the faster changing variables)
  well_rep2 = cumsum(!duplicated(RSA))) # approach similar to above

Observation for objective.try2: most rows work, but there again are rows that do not work, though the rows that don't work are different from those in the first try.

I would appreciate it if someone would point out what I am doing wrong.

Try something like `objective.df %>% group_by(var1) %>% mutate(new_id_objective = cumsum(c(1, diff(RSA)) != 0))`? — Ben, Aug 02 '22 at 00:31
@Ben, Yep, that works. Thanks. I wish I understood why it works: I don't understand the purpose of the `1` term in the argument of the `cumsum`. i realize it doesn't work without it, but I cannot get my head around why. Care to comment? Also, if you want to position your comment as an answer, I'll accept it. — W Barker, Aug 02 '22 at 13:07
@W Barker See answer and let me know if this helps with an explanation. — Ben, Aug 02 '22 at 13:49
@Ben, sorry for the delay, but yes, the explanation was what I needed. Thanks — W Barker, Aug 26 '22 at 20:22

Ben · Accepted Answer · 2022-08-02T20:01:19.453

Focusing on the first group in var1 which is "aaa", you have the following column data for RSA:

objective.df[objective.df$var1 == "aaa", "RSA"]
    RSA
  <dbl>
1     1
2     1
3     1
4     0
5    -1
6    -1
7     0
8    -1

If you use the diff function on this to get differences between one value and the next, you will get 7 differences for the 8 data elements in RSA for "aaa". In other words, if you have a length "n" vector you will get back "n-1" values in the end (without changing arguments).

Since we want to add a new column new_id_objective, we will need to include 8 values for this group, not 7. The combine c will help create the vector for the new column, combining an initial number with the 7 differences from diff, for a total of 8.

So for "aaa", the diff will return 0, 0, -1, -1, 0, 1, -1. These values are compared with 0 (!= 0), with the result of the evaluation being either TRUE or FALSE. The cumsum of TRUE will increment the counter. In other words, the value of new_id_objective will increase row-by-row when there is a difference between consecutive row values (up or down, or anything but zero).

In this example, we used combine c with an initial value of 1. Note that we can use any number that is not zero to start with (while in this case we used 1, we could have used 5, 100, -10, or anything not zero). By doing this, the first evaluation will be TRUE (not equal to zero) and the new_id_objective counter value will start with 0 + 1 = 1. Otherwise, if you did try to use zero, i.e., c(0, diff(RSA)) != 0, this will evaluate to FALSE, and the cumulative sum would start with 0, not 1.

Let me know if this helps.

library(tidyverse)

objective.df %>% 
  group_by(var1) %>% 
  mutate(new_id_objective = cumsum(c(1, diff(RSA)) != 0))

Output

   var1  datenum   RSA idx_objective new_id_objective
   <chr>   <int> <dbl>         <dbl>            <int>
 1 aaa         1     1             1                1
 2 aaa         2     1             1                1
 3 aaa         3     1             1                1
 4 aaa         4     0             2                2
 5 aaa         5    -1             3                3
 6 aaa         6    -1             3                3
 7 aaa         7     0             4                4
 8 aaa         8    -1             5                5
 9 bbb         1     0             1                1
10 bbb         2     0             1                1
# … with 22 more rows

Big help. Thanks. I especially like the phrase "`cumsum` of TRUE". :) — W Barker, Aug 06 '22 at 09:26

Using dplyr, assign unique names to groups of rows defined by multiple criteria

1 Answers1