I need a fast intersection of ordered 'list-like' variables
set offers the faster intersection in python, when compared to numpy.array or looping a list, see timing code below for reference or the numerous posts on intersection on SO (1, 2, 3,...)
As set.intersection is a for loop, I'd guess it should be applicable in principle to ordered structures.
Is there an high-speed alternative that I am missing?
edits: adding the frozenset solution, which does appears to be quite fast and maintain the order
Alternatives explored so far
I. First thing that came to mind is collections.OrderedDict, however it does not seem to fit the bill because using a dict would be a waste of space and the keys of an OrderedDict don't offer an intersection method, which brings you back to converting them to a set.
II. Second option is to use a classic dict, intersect on its keys and maintain the sorting order in its values, however this results in many more operations to extract the sorted values and defeats the idea of doing the sorting off-line in a pre-arranged dictionary.
III. A "nuclear alternative" that came to mind is looping in cython as per this google groups post, however I'd favour a pure-python solution if possible for project guidelines (and admittedly like this other user here my C is a bit rusty)
Timing code for intersection in Python
import numpy as np
import timeit
# === Create some data ========================================================
x_array = np.random.randint(0,1000,100)
y_array = np.random.randint(0,1000,100)
x_set = set(x_array)
y_set = set(y_array)
x_list = list(x_array)
y_list = list(y_array)
x_frozenset = frozenset(x_list)
y_frozenset = frozenset(y_list)
# === Prepare the test Functions ==============================================
def f_set():
return x_set.intersection(y_set)
def f_numpy():
return np.intersect1d(x_array,y_array)
def f_list():
return [i for i in x_list if i in y_list]
def f_frozenset_loop():
return [x for x in x_list if x in y_frozenset]
def f_frozenset_list_filter_1():
list(filter(y_frozenset.__contains__, x_list))
def f_frozenset_list_filter_2():
return [*filter(y_frozenset.__contains__, x_list)]
# === Output ==================================================================
print('set.intersection ', timeit.timeit(f_set, number=1000))
print('frozenset/list filter 2 ', timeit.timeit(f_frozenset_list_filter_2, number=1000))
print('frozenset/list filter 1 ', timeit.timeit(f_frozenset_list_filter_1, number=1000))
print('frozenset/list loop ', timeit.timeit(f_frozenset_loop, number=1000))
print('np.intersect1d ', timeit.timeit(f_numpy, number=1000))
print('list loop (slooow) ', timeit.timeit(f_list, number=1000))
Output looks like this on my machine currently. There's 1 order of magnitude of difference between set and numpy:
set.intersection 0.001270713739261699
frozenset/list loop 0.006882539311596062
frozenset/list filter 2 0.01010412826037355
frozenset/list filter 1 0.019078864781689093
np.intersect1d 0.019451117983342953
list loop (slooow) 0.20725976641605037