I have a data.frame orig, which is subset and assigned to cpy.
library(data.table)
orig <- data.frame(id=letters[c(2,1,2,1)], col1=c(300,46,89,2),
col2=1:4, col3=1:4)
print(orig)
# id col1 col2 col3
# b 300 1 1
# a 46 2 2
# b 89 3 3
# a 2 4 4
cpy <- orig[,c("id","col1","col2")]
cpy is a shallow copy of orig: for the columns it keeps (id, col1, col2) it references the same column vectors as orig, but it knows nothing about the omitted column col3.
Because cpy is a separate object that holds only a subset of the columns, setDT(cpy) converts only cpy to a data.table; orig stays a data.frame, so the by-reference conversion does not carry over to it. This leaves orig and cpy in a potentially dangerous state: they share pointers to some of their columns, but not all.
setDT(cpy)
.Internal(inspect(orig))
.Internal(inspect(cpy))
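Instead of reading the raw inspect() output, the sharing can also be checked by comparing memory addresses column by column. This is only a sketch, assuming data.table's address() helper is available; the variable name shared is mine and not part of the original code.

```r
library(data.table)

orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)
cpy <- orig[, c("id", "col1", "col2")]
setDT(cpy)

# Compare the address of each column in cpy with its counterpart in orig.
# TRUE means both objects point at the same underlying vector.
shared <- vapply(names(cpy),
                 function(nm) address(cpy[[nm]]) == address(orig[[nm]]),
                 logical(1))
print(shared)
```

col3 does not appear in the result because it is not part of cpy at all.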
If setkey is now applied to cpy, its columns are sorted by reference, and because orig shares those column vectors, they end up sorted in orig as well. The omitted column (col3) is not part of cpy, so it is not reordered and is then out of sync with the rest of orig.
setkey(cpy,id,col1)
print(cpy)
# id col1 col2
# a 2 4
# a 46 2
# b 89 3
# b 300 1
print(orig)
# id col1 col2 col3
# a 2 4 1
# a 46 2 2
# b 89 3 3
# b 300 1 4
Any action that forces a deep copy instead of a shallow one when assigning cpy (e.g. copy()) avoids this behaviour: it breaks the reference to orig and so prevents orig from being messed up.
Is there any way that cpy does not lose the reference to the object orig itself and its omitted columns?
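For reference, a minimal sketch of the deep-copy workaround, assuming copy() from data.table is applied to the subset before setDT():

```r
library(data.table)

orig <- data.frame(id = letters[c(2, 1, 2, 1)], col1 = c(300, 46, 89, 2),
                   col2 = 1:4, col3 = 1:4)

# copy() duplicates the column vectors, so cpy no longer shares memory with orig
cpy <- copy(orig[, c("id", "col1", "col2")])
setDT(cpy)
setkey(cpy, id, col1)

print(cpy)   # sorted by id, then col1
print(orig)  # row order unchanged, col3 still in sync
```

The price is that cpy is now fully detached from orig, which is exactly what I would like to avoid.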