Being new to R, I am looking for an efficient way to perform a loop with an analogue of VLOOKUP with two conditions. VLOOKUP allows to look up for a specific value throughout a column and apply it to each row of my data frame.
I have a long data.frame DF of 3 variables:
Car: identification number of the observed items (cars). Unique for each car, but not for each row.Date: date of the observation, format="%Y-%m-%d"Area: logic variable showing if an observation (Car) on thisDatewas in a certain area (TRUE) or not (FALSE)
I need to create a new binary variable AreaChange that shows if the Area changed in the next 10 days for this Car: if yes then 1, if no change then 0. I am also interested in one direction of change: from FALSE to TRUE.
It is possible that Area changes several times in the next 10 days, if at least one of the changes is from FALSE to TRUE, the AreaChange should equal 1.
It is also possible that some Cars were observed for less than 10 days at certain periods, in these cases the AreaChange calculation is also needed.
A sample dataset can look like:
set.seed(1)
DF <- data.frame(
Cars=as.integer(sample(127345:127346, 2000, replace=T)), #2 cars sample
Date=as.Date
(seq(from = as.Date("2015-12-21"), to=as.Date("2017-01-30"), length.out = 2000)),
Area=as.logical(sample(x=c(0,1), prob=c(.7, .3), size=2000, replace=T)))
DF <- DF[!duplicated(DF[,c("Cars","Date")]),] #795 observations
For me it looks as:
- Extracting 10
FutureAreavalues for each row, matching on two parameters: sameCarandDatebetween (DateandDate+10). I suppose that it can be done in a loop format for the 10 days. - Creating the binary new variable
AreaChangeequaling 0 if all availableFutureAreavalues are the same, or if the currentAreafor this row is TRUE.
I have found suggestions on cases with merging 2 data frames or for matching on just 1 condition or without extracting the Area values on future days, but did not manage to combine them for my case.
For now, I have only managed to get the AreaChange, ignoring the need to match Car and comparing the Area only with the Area in 10 days, not for every day in the next 10 days.
DF$Date10 <- DF$Date+10
library(expss)
DF$Area10 <- vlookup(DF$Date10, DF[,1:3], result_column = 3, lookup_column = 2)
DF$AreaChange10 <- ifelse(DF$Area10!=DF$Area & DF$Area==FALSE, 1, 0)
The desired output is the AreaChange column, for instance as following:
- equals 1 if a switch of
Areafrom FALSE to TRUE occurred between currentDateandDate+10 for the givenCar, no matter what is the number of NA values during these days, - equals 0 otherwise.
Cars Date Area AreaDay0 AreaDay+1 AreaDay+2 AreaDay+3 AreaDay+4 AreaDay+5 AreaDay+6 AreaDay+7 AreaDay+8 AreaDay9 AreaDay+10 AreaChange Comment
127345 12/21/15 TRUE 1 0 0 0 1 1 0 0 NA 1 0 1 yes,_as_includes_switch_from_0_to_1
127346 12/21/15 TRUE 1 1 1 0 0 0 0 0 0 0 0 0 no,_as_the_switch_is_from_1_to_0
127347 12/22/15 FALSE 0 0 0 0 0 0 0 0 0 0 0 0 no,_as_no_switch
127348 12/22/15 FALSE 0 0 0 0 0 0 0 NA 1 0 0 1 yes,_as_includes_switch_from_0_to_1
127349 12/23/15 TRUE 1 1 1 1 1 1 NA 1 1 1 1 0 no,_as_no_switch
127350 12/21/15 FALSE 0 NA NA NA NA NA NA NA NA NA 1 1 yes,_as_includes_switch_from_0_to_1
Many thanks for any suggestions on how to optimize and proceed.