Two basic approaches: (1) do a yearly sequence, conditioning on whether the start is before and if the end is after June (06), used in the base R approach; (2) do a monthly sequence, conditioning (group-filtering) each year on whether "06" is in one of the months. They both work, which you choose depends on comfort/preference.
base R
Since this example uses "sequence by year", I'll use the fact that POSIXlt (notice the l, as opposed to the more common POSIXct) is really a list with elements for year, month, etc:
dput(as.POSIXlt(Sys.time()))
# structure(list(sec = 21.7977600097656, min = 30L, hour = 8L,
# mday = 2L, mon = 5L, year = 123L, wday = 5L, yday = 152L,
# isdst = 1L, zone = "EDT", gmtoff = -14400L), class = c("POSIXlt",
# "POSIXt"), tzone = c("", "EST", "EDT"))
With that, we can easily $-index the list and compare the month as a value. NOTE: ?POSIXlt shows that mon is ‘mon’ 0-11: months after the first of the year, meaning that June is 5 in 0-based months. (Ergo the use of 5 below.)
fun <- function(sd, ed) {
sdlt <- as.POSIXlt(sd)
edlt <- as.POSIXlt(ed)
if (sdlt$mon > 5) sdlt$year <- sdlt$year+1
if (edlt$mon < 5) edlt$year <- edlt$year-1
if (sdlt <= edlt) unique(format(seq.Date(as.Date(sdlt), as.Date(edlt), by = "year"), format="%Y"))
}
years <- Map(fun, mydata$startdate, mydata$enddate)
str(years)
# List of 5
# $ : chr [1:5] "2008" "2009" "2010" "2011" ...
# $ : chr [1:4] "2008" "2009" "2010" "2011"
# $ : chr "2012"
# $ : chr "2012"
# $ : NULL
out <- data.frame(
id = rep(mydata$id, lengths(years)),
statenumber = rep(mydata$statenumber, lengths(years)),
year = unlist(years))
out
# id statenumber year
# 1 R007 1 2008
# 2 R007 1 2009
# 3 R007 1 2010
# 4 R007 1 2011
# 5 R007 1 2012
# 6 R008 2 2008
# 7 R008 2 2009
# 8 R008 2 2010
# 9 R008 2 2011
# 10 R008 3 2012
# 11 R009 4 2012
If there's a chance to have duplicates (repeated states for an id that gap within the same year), then you can use unique(out).
Map just "zips" the data together. The individual calls to fun unrolled would look like this:
list(
fun(mydata$startdate[1], mydata$enddate[1]),
fun(mydata$startdate[2], mydata$enddate[2]),
fun(mydata$startdate[3], mydata$enddate[3]),
fun(mydata$startdate[4], mydata$enddate[4]),
fun(mydata$startdate[5], mydata$enddate[5])
)
The inclusion of ed on the end of c(seq.Date(..), ed) is to guard against the fact that seq may not include the year of the enddate. In this case, it ensures that R008 in state 2 sees 2012.
dplyr
In this (and the data.table) section, we'll use the monthly sequence instead, using format="%m" as the month. Dissimilar from POSIXlt above (June is 5), reading ?%strptime for the %-codes defines %m as ‘%m’ Month as decimal number (01-12), so June is back to "06".
library(dplyr)
mydata %>%
rowwise() %>%
summarize(
id, statenumber,
dates = seq(startdate, enddate, by = "month"),
year = format(dates, format = "%Y")) %>%
group_by(id, statenumber, year) %>%
filter(any(format(dates, format = "%m") == "06")) %>%
distinct(id, statenumber, year) %>%
ungroup()
# # A tibble: 11 × 3
# id statenumber year
# <chr> <dbl> <chr>
# 1 R007 1 2008
# 2 R007 1 2009
# 3 R007 1 2010
# 4 R007 1 2011
# 5 R007 1 2012
# 6 R008 2 2008
# 7 R008 2 2009
# 8 R008 2 2010
# 9 R008 2 2011
# 10 R008 3 2012
# 11 R009 4 2012
I generally try to avoid rowwise when able, but this problem does need to be executed one row at a time (which is effectively what Map is doing in the base R solution above).
data.table
library(data.table)
as.data.table(mydata)[, .(id, statenumber, dates = seq(startdate, enddate, by = "month")), by = .(seq(nrow(mydata)))
][, year := format(dates, format="%Y")
][, .SD[any(format(dates, format="%m") == "06"),], by = .(id, statenumber, year)
][, c("seq", "dates") := NULL
][, unique(.SD)]
# id statenumber year
# <char> <num> <char>
# 1: R007 1 2008
# 2: R007 1 2009
# 3: R007 1 2010
# 4: R007 1 2011
# 5: R007 1 2012
# 6: R008 2 2008
# 7: R008 2 2009
# 8: R008 2 2010
# 9: R008 2 2011
# 10: R008 3 2012
# 11: R009 4 2012