I have a data frame of ~100 observations with 38 demographic variables plus pre- and post-test scores in six domains (var1 to var6). I fitted a linear model with lm(): test.lm <- lm(var1_post ~ var1_pre + dem1 + dem2 + ... + dem38, data=test.df). The data frame test.df is a subset of a larger data frame, fulldata.df. In fulldata.df, 17 observations lack complete post data for var3_post and var4_post. However, test.df does not include those columns; it contains only var1_pre, the demographic variables, and var1_post. There are no missing values at all in test.df.
When I run summary(test.lm), it reports that 17 observations were deleted due to missingness. Presumably these are the 17 from fulldata.df?
Coefficients: (11 not defined because of singularities)
            Estimate   Std. Error  t value  Pr(>|t|)
intercept   -2.36E+01  9.84E+00    -2.403   0.02076   *
var1_pre     7.96E+00  1.48E+00     5.368   3.20E-06  ***
dem1         1.90E+00  1.16E+00     1.631   0.11037
dem2         1.43E-04  1.02E-01     0.001   0.99889
dem3        -7.52E-01  1.14E+00    -0.66    0.51277
dem4         7.65E-02  1.65E-01     0.463   0.6459
...
dem38       -2.50E+00  2.89E+00    -0.866   0.39135
Residual standard error: 3.93 on 42 degrees of freedom
(17 observations deleted due to missingness)
Multiple R-squared: 0.6452, Adjusted R-squared: 0.4003
F-statistic: 2.634 on 29 and 42 DF, p-value: 0.002075
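As a sanity check, the degrees of freedom in this output imply how many rows lm() actually received. The short sketch below just redoes that arithmetic; the variable names are made up for illustration, and it assumes the model includes an intercept (as the output suggests).

```r
# Numbers copied from the summary() output above
resid_df  <- 42   # residual degrees of freedom
model_df  <- 29   # numerator degrees of freedom of the F-statistic
n_deleted <- 17   # observations deleted due to missingness

# Rows used in the fit = residual df + estimated slopes + 1 for the intercept
n_used  <- resid_df + model_df + 1
# Rows handed to lm() before any were dropped
n_total <- n_used + n_deleted

print(c(used = n_used, total = n_total))
```

So the fit used 72 rows out of 89, which can be compared against nrow(test.df) to confirm which data frame lm() really saw.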
It makes no sense to me that lm() would pick up missingness from the larger data frame, but I cannot figure out where else the 17 "missing" observations could be coming from. Even which(!complete.cases(test.df)) returns integer(0).
Does anyone have any thoughts on where those 17 observations could be, or how I might go about identifying them?
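For reference, here is a minimal sketch of the kind of check that can locate the dropped rows. It runs on a toy data frame (toy.df and toy.lm are hypothetical stand-ins, not my real data) with two NAs planted on purpose; the same calls should work on test.lm, though I am not claiming they explain the cause. It relies on na.action() to list the rows the fit removed, on model.frame() with na.action = na.pass to rebuild the frame without dropping anything, and on the fact that formula variables not found in the data argument are looked up in the formula's environment.

```r
# Toy data standing in for test.df (hypothetical values)
set.seed(1)
toy.df <- data.frame(
  var1_pre  = rnorm(20),
  dem1      = rnorm(20),
  var1_post = rnorm(20)
)
toy.df$dem1[c(3, 7)] <- NA   # plant two missing values so lm() drops rows

toy.lm <- lm(var1_post ~ var1_pre + dem1, data = toy.df)

# Row indices that lm() removed (their names are the row names of the data)
dropped <- na.action(toy.lm)
print(as.integer(dropped))

# Rebuild the model frame without dropping rows, then count NAs per column
mf <- model.frame(formula(toy.lm), data = toy.df, na.action = na.pass)
na_per_column <- colSums(is.na(mf))
print(na_per_column[na_per_column > 0])

# Formula variables that are NOT columns of the data frame; any listed here
# would have been taken from the formula's environment instead
missing_from_data <- setdiff(all.vars(formula(toy.lm)), names(toy.df))
print(missing_from_data)
```

If the last check lists any names for the real model, those variables are not coming from test.df at all, which would be one way for NAs from elsewhere to reach the fit.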