Alright. I've heavily edited this question to a) make it make more sense, and b) reflect where I'm at with the problem.
I have two data sets -- let's call them set1 and set2 -- each of about 6 million rows. Currently, I have them loaded into R as data.tables.
>set1<-data.table(read.csv('~/file1.csv', stringsAsFactors=F))
>setkey(set1, id1)
>head(set1)
id1 start_unixtime end_unixtime seconds_diff id2
1: 1674 1354741858 1354741858 0 227167461
2: 1674 1354752386 1354752951 565 227246263
3: 1674 1354764412 1354764412 0 227358796
4: 1674 1354773044 1354773776 732 227421295
5: 1674 1354778651 1354778651 0 227448774
6: 1674 1354810424 1354810424 0 227631113
>set2<-data.table(read.csv('~/file2.csv', stringsAsFactors=F))
>setkey(set2, id1)
>head(set2)
id1 unix_timestamp event_name
1: 1674 1355202784 join
2: 1674 1354351118 join
3: 1674 1354349648 play
4: 1674 1354780517 join
5: 1674 1355278891 join
6: 1674 1354617262 join
One problematic detail to point out: set2 has no unique keys. Only the tuple of each row is actually unique. In set1, id2 is unique. Fun times!
The operation I'm performing goes like this: for every row in set2, I need to take the unix_timestamp, find the row in set1 where start_unixtimestamp <= unix_timestamp <= end_unixtimestamp and id1 matches, then assign the corresponding set1.id2 to the appropriate row in set2. Every row in set2 has an entry in set1, but not every row in set1 has an entry in set2. One id2 can be assigned to many rows in set2. What I need to wind up with is this (NOTE: the following data is fake, as I haven't been able to produce any actual success yet.):
>head(set2)
id1 unix_timestamp event_name id2
1: 1674 1355202784 join 227167461
2: 1674 1354351118 join 227157309
3: 1674 1354349648 play 227157309
4: 1674 1354780517 join 227157309
5: 1674 1355278891 join 271089456
6: 1674 1354617262 join 221729485
Here is a mess of data table I've cooked up:
set2[, id2 := set1[set2[, id1], list(start_unixtime, end_unixtime, id2)][(start_unixtime <= unix_timestamp & unix_timestamp <= end_unixtime), id2, by=id2]][, list(id2)][, id2:= id2]
To talk through what I understand to be going on:
set2calls the assignment operator:=- The right hand side invokes
set1, which starts byjoiningid1 rows from set2. - The columns
start_unixtime,end_unixtime, andid2are selected. - From that result, a second set of selections are done, which gets
id2where theutc_timestampofid2is betweenstart_unixtimeandend_unixtime. - ...and here, I think I'm doing something badly wrong -- because at this step, I always seem to have two columns, each labeled
id2and containing identical results. So, I select one column... - ...and specify it for assignment. (I do not know why this is done twice. I found this SO post, which uses this second
:=, and this one which does not, and I have simply no idea why.
...which does not work. @mnel has proposed something like:
set2[set1, nomatch=0][unix_timestamp %between c(start_unixtime, end_unixtime, incbounds=T)]
...which works when I try it with his test data, but not with my data. It occurs to me that my data might be of some type (character?) which data.table (or R period) might not be coercing properly? I might be dense, but I can't seem to figure out how to call as.integer() on a specified column of a data.table.
Edit: yeah, my data was all character, and I forgot that data.table inherits from data.frame. So, a little set1$start_unixtime <- as.integer($set1$start_unixtime) and at least I'm sure everything is all integers. However, when I run that command, I still get this:
>head(set2)
Empty data.table (0 rows) of 8 cols: id1,utc_timestamp,event_name,start_unixtime,end_unixtime,seconds_diff...
Addition Here are snippets of my actual data:
set1 <- as.data.table(list(id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
start_unixtime = c(1354741858L, 1354752386L, 1354764412L, 1354773044L, 1354778651L, 1354810424L),
end_unixtime = c(1354741858L, 1354752951L, 1354764412L, 1354773776L, 1354778651L, 1354810424L),
seconds_diff = c(0L, 565L, 0L, 732L, 0L, 0L),
id2 = c(227167461L, 227246263L, 227358796L, 227421295L, 227448774L, 227631113L))
set2 <- as.data.table(list(
id1 = c(1674L, 1674L, 1674L, 1674L, 1674L, 1674L),
utc_timestamp = c(1354752431L, 1354780517L, 1354811978L, 1354824385L, 1354833271L, 1354862753L),
event_name = c("joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2", "joinRegularTable_2","joinRegularTable_2", "joinRegularTable_2"))