Why am I getting a different result when I run this mean in a slightly different way?

Question

In the first block of code, I am grouping by VAR1 and I get a mean for VAR2 based on the two categories of VAR1. However, when I select only for the category of interest from VAR1, in order to obtain a mean of people for VAR2 who fall into category 1 in VAR1, I get a different result.

What am I doing wrong?

 DF %>%
   na.omit() %>%
   group_by(VAR1) %>% 
   summarise(mean(VAR2))

# A tibble: 2 × 2
  VAR1 `mean(VAR2)`
                     <dbl>                         <dbl>
1                        0                          12.1
2                        1                          11.6

> mean(DF[DF$VAR1 == 1, 'VAR2',],na.rm=TRUE)

[1] 11.95238

`na.omit` first will remove the whole row if there's an `NA` value, not just the `NA` value in the column. — thelatemail, Mar 04 '22 at 05:28
Adding on to @thelatemail's explanation: DF must have other variables in it with NAs. When you run `na.omit()` you are losing rows relevant to the current calculation because, e.g., VAR3 has a bunch of NAs in it. — DanY, Mar 04 '22 at 05:38
Welcome to SO! Please see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It's easier to answer questions when you provide some data. You can provide your data via the output of `dput(DF)` or to give a minimal subset use `dput(head(DF))`. — AndrewGB, Mar 04 '22 at 07:23

Abdur Rohman · Answer 1 · 2022-03-04T08:21:09.683

The discrepancy in the mean values of VAR1 in those cases probably comes from the presence of NAs in another column, as @DanY explained briefly. I will try to elaborate this through the following example. Suppose we have DF that consists of three columns, two of which come from mtcars and another one is VAR1. Suppose there are 15 rows with VAR1 of 0.

library(dplyr) 
DF <- mtcars %>% select(mpg, cyl)
DF$VAR1 <- c(rep(0, 15),rep(1,17))
DF
#                     mpg cyl VAR1
#Mazda RX4           21.0   6    0
#Mazda RX4 Wag       21.0   6    0
#Datsun 710          22.8   4    0
#Hornet 4 Drive      21.4   6    0
#Hornet Sportabout   18.7   8    0
#Valiant             18.1   6    0
#Duster 360          14.3   8    0
#Merc 240D           24.4   4    0
#Merc 230            22.8   4    0
#Merc 280            19.2   6    0
#Merc 280C           17.8   6    0
#Merc 450SE          16.4   8    0
#Merc 450SL          17.3   8    0
#Merc 450SLC         15.2   8    0
#Cadillac Fleetwood  10.4   8    0
#Lincoln Continental 10.4   8    1
#Chrysler Imperial   14.7   8    1
#Fiat 128            32.4   4    1
#Honda Civic         30.4   4    1
#Toyota Corolla      33.9   4    1
#Toyota Corona       21.5   4    1
#Dodge Challenger    15.5   8    1
#AMC Javelin         15.2   8    1
#Camaro Z28          13.3   8    1
#Pontiac Firebird    19.2   8    1
#Fiat X1-9           27.3   4    1
#Porsche 914-2       26.0   4    1
#Lotus Europa        30.4   4    1
#Ford Pantera L      15.8   8    1
#Ferrari Dino        19.7   6    1
#Maserati Bora       15.0   8    1
#Volvo 142E          21.4   4    1

Now, let's change three values of cyl into NA.

DF$cyl[3:5] <- NA
head(DF)
#                   mpg cyl VAR1
#Mazda RX4         21.0   6    0
#Mazda RX4 Wag     21.0   6    0
#Datsun 710        22.8  NA    0
#Hornet 4 Drive    21.4  NA    0
#Hornet Sportabout 18.7  NA    0
#Valiant           18.1   6    0

Now DF still have 15 values of mpg with VAR1 of 0 because neither mpg nor VAR1 has NA. Now we try to compare the mean of mpg in the two cases you have compared:

DF %>% na.omit() %>% group_by(VAR1) %>% summarise(mean(mpg))
# A tibble: 2 x 2
#   VAR1 `mean(mpg)`
#  <dbl>       <dbl>
#1     0        18.2
#2     1        21.3

mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72
mean(DF[DF$VAR1 == 1, 'mpg',], na.rm = TRUE)
#[1] 21.3

Clearly you got different mean values. It is because in DF %>% na.omit() operates on DF so any NA value in any row in DF will cause omission of that row. That's why the three rows in cyl that contain NAs are omitted so that mpg column has only 12 values, of which the mean is then computed.

DF %>% na.omit()
#                     mpg cyl VAR1
#Mazda RX4           21.0   6    0
#Mazda RX4 Wag       21.0   6    0
#Valiant             18.1   6    0
#Duster 360          14.3   8    0
#Merc 240D           24.4   4    0
#Merc 230            22.8   4    0
#Merc 280            19.2   6    0
#Merc 280C           17.8   6    0
#Merc 450SE          16.4   8    0
#Merc 450SL          17.3   8    0
#Merc 450SLC         15.2   8    0
#Cadillac Fleetwood  10.4   8    0
#Lincoln Continental 10.4   8    1
#Chrysler Imperial   14.7   8    1
#Fiat 128            32.4   4    1
#Honda Civic         30.4   4    1
#Toyota Corolla      33.9   4    1
#Toyota Corona       21.5   4    1
#Dodge Challenger    15.5   8    1
#AMC Javelin         15.2   8    1
#Camaro Z28          13.3   8    1
#Pontiac Firebird    19.2   8    1
#Fiat X1-9           27.3   4    1
#Porsche 914-2       26.0   4    1
#Lotus Europa        30.4   4    1
#Ford Pantera L      15.8   8    1
#Ferrari Dino        19.7   6    1
#Maserati Bora       15.0   8    1
#Volvo 142E          21.4   4    1

DF %>% na.omit() %>% filter(VAR1 == 0) %>% pull(mpg) %>% mean
# [1] 18.15833

On the other hand, DF[DF$VAR1 == 0, 'mpg',] operates only on mpg column, and has nothing to do with cyl. That's why it has 15 values of mpg with no NA so na.rm will not omit anything.

DF[DF$VAR1 == 0, 'mpg',]
#[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4
#[13] 17.3 15.2 10.4

mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72

Why am I getting a different result when I run this mean in a slightly different way?

1 Answers1