This question builds on the answers to this question: Join based on a concatenated 'origin-destination' variable and consecutive individual sites
I have a for loop that needs to iterate over more than 100 thousand rows, and it is too slow. The task is as follows: for each row of pairs, find the rows of STATIONS with the same Vonat_ID, then use match() to extract the stations in STATIONS that run from that row's HONNAN value to its HOVA value. (A single-row sketch of this lookup is shown below, after the desired output.)
This is the pairs dataframe (167 thousand rows):
head(pairs)
# A tibble: 6 x 4
# Groups: Vonat_ID [2]
Vonat_ID HONNAN HOVA OD_pár
<chr> <chr> <chr> <chr>
1 20210216-40517 Hegyeshalom Hegyeshalom oh. Hegyeshalom -- Hegyeshalom oh.
2 20210216-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
3 20210216-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
4 20210223-40517 Püspökladány Ferencváros Püspökladány -- Ferencváros
5 20210223-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
6 20210223-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
This is the STATIONS dataframe (2.7 million rows):
head(STATIONS)
# A tibble: 6 x 2
# Groups: Vonat_ID [1]
Vonat_ID Állomás
<chr> <chr>
1 20210216-40517 Biharkeresztes oh.
2 20210216-40517 Biharkeresztes
3 20210216-40517 Mezőpeterd
4 20210216-40517 Berettyóújfalu
5 20210216-40517 Sáp
6 20210216-40517 Báránd
...
54 20210216-40517 Öttevény
55 20210216-40517 Lébény-Mosonszentmiklós
56 20210216-40517 Kimle-Károlyháza
57 20210216-40517 Mosonmagyaróvár
58 20210216-40517 Hegyeshalom
59 20210216-40517 Hegyeshalom oh.
60 20210223-40517 Biharkeresztes oh.
61 20210223-40517 Biharkeresztes
62 20210223-40517 Mezőpeterd
63 20210223-40517 Berettyóújfalu
64 20210223-40517 Sáp
65 20210223-40517 Báránd
...
88 20210223-40517 Öttevény
89 20210223-40517 Lébény-Mosonszentmiklós
90 20210223-40517 Kimle-Károlyháza
91 20210223-40517 Mosonmagyaróvár
92 20210223-40517 Hegyeshalom
93 20210223-40517 Hegyeshalom oh.
...
To speed up the for loop, I extracted each variable of pairs into its own vector: pairs_ID, pairs_HN, pairs_HV and pairs_OD, respectively.
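For clarity, this is roughly how I built them (the exact code does not matter, only that each is a plain character vector aligned with the rows of pairs):

pairs_ID <- pairs$Vonat_ID
pairs_HN <- pairs$HONNAN
pairs_HV <- pairs$HOVA
pairs_OD <- pairs$OD_pár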
The desired output looks like this: stat holds every Állomás from STATIONS between the start and end point of the OD_pár (HONNAN means FROM, HOVA means TO).
OD stat
1 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom #"OD_pár" from the first row of `pairs`
2 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom oh.
3 Biharkeresztes oh. -- Püspökladány Biharkeresztes oh. #"OD_pár" from the second row of `pairs`
4 Biharkeresztes oh. -- Püspökladány Biharkeresztes
5 Biharkeresztes oh. -- Püspökladány Mezőpeterd
6 Biharkeresztes oh. -- Püspökladány Berettyóújfalu
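For concreteness, this is the core lookup for a single row, say the second row of pairs; it is essentially what the loop body below does for x = 2 (the intermediate names here are just for illustration):

x <- 2
stations_x <- STATIONS[STATIONS$Vonat_ID == pairs_ID[x], ][[2]]  # Állomás column for this train
from <- match(pairs_HN[x], stations_x)   # position of HONNAN
to   <- match(pairs_HV[x], stations_x)   # position of HOVA
stat <- stations_x[from:to]              # all stations between them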
The for loop I am using is the following. At the end I need one huge dataframe with every OD and its matching station names, hence building a list and binding it together after the loop. I also wrapped the lookup in tryCatch in case any errors appear.
bridge_total <- list()
for (x in 1:nrow(pairs)) {
  tryCatch({
    OD <- pairs_OD[x]
    # Állomás column of STATIONS for the current Vonat_ID,
    # sliced from the position of HONNAN to the position of HOVA
    stat <- STATIONS[STATIONS$Vonat_ID == pairs_ID[x], ][[2]][
      match(pairs_HN[x], STATIONS[STATIONS$Vonat_ID == pairs_ID[x], ][[2]]) :
      match(pairs_HV[x], STATIONS[STATIONS$Vonat_ID == pairs_ID[x], ][[2]])]
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
  df <- data.frame("OD" = OD, "stat" = stat)
  bridge_total[[x]] <- df
  print(x)
}
bridge_total <- do.call(rbind, bridge_total)
Is there a way I could speed up the loop further? Right now, based on test runs, iterating over the 100k+ rows takes 2-3 hours at minimum.
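I suspect the repeated subsetting of the 2.7-million-row STATIONS inside the loop is the bottleneck. Would pre-splitting it by Vonat_ID once up front help? A rough sketch of what I mean (names are illustrative, error handling omitted, and I have not measured whether this is actually faster):

# split the Állomás column by train ID once, instead of subsetting 2.7M rows per iteration
stations_by_id <- split(STATIONS$Állomás, STATIONS$Vonat_ID)

bridge_total <- lapply(seq_len(nrow(pairs)), function(x) {
  stations_x <- stations_by_id[[pairs_ID[x]]]
  idx <- match(pairs_HN[x], stations_x):match(pairs_HV[x], stations_x)
  data.frame(OD = pairs_OD[x], stat = stations_x[idx])
})
bridge_total <- do.call(rbind, bridge_total)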