Find sets that contain at least one element from other sets

Question

Suppose we are given n sets and want to construct all minimal sets that have at least one element in common with each of the input sets. A set S is called minimal if there is no admissible set S' that is a proper subset of S.

An example:

In: s1 = {1, 2, 3}; s2 = {3, 4, 5}; s3 = {5, 6}

Out: [{1, 4, 6}, {1, 5}, {2, 4, 6}, {2, 5}, {3, 5}, {3, 6}]
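To make the definition concrete, here is a small sketch of a checker for a single candidate. The function names `hits_all` and `is_minimal_hitting_set` are made up for illustration. It relies on the fact that removing elements can only lose intersections, so it is enough to test the removal of one element at a time.

```python
def hits_all(candidate, sets):
    """True if candidate shares at least one element with every input set."""
    return all(candidate & s for s in sets)


def is_minimal_hitting_set(candidate, sets):
    """True if candidate is admissible and no single element can be dropped."""
    candidate = set(candidate)
    if not hits_all(candidate, sets):
        return False
    # If dropping any one element keeps the set admissible, it is not minimal.
    return not any(hits_all(candidate - {e}, sets) for e in candidate)


sets = [{1, 2, 3}, {3, 4, 5}, {5, 6}]
print(is_minimal_hitting_set({1, 5}, sets))     # True
print(is_minimal_hitting_set({1, 3, 5}, sets))  # False, because {3, 5} already suffices
```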

My idea was to iteratively add one set after the other to the solution:

result = f(s1, f(s2, f(s3, ...)))

whereby f is a merge function that could look as follows:

function f(newSet, setOfSets):
   Step 1: 
      return all elements of setOfSets that share an element with newSet

   Step 2: 
      for each remaining element setE of setOfSets:
         for each element e of newSet:
            return union(setE, {e})

The issue with the above approach is that the Cartesian product computed in step 2 may contain supersets of sets returned in step 1. I was thinking of going through all already returned sets (see Find minimal set of subsets that covers a given set), but this seems to be too complicated and inefficient, and I hope that there is a better solution in my special case.

How could I achieve the goal without determining the full cartesian product in step 2?

Note that this question is related to the question of finding the smallest set only, but I need to find all sets that are minimal in the way specified above. I am aware that the number of solutions will not be polynomial.

The number n of input sets will be several hundred, but the sets contain only elements from a limited range (e.g. about 20 different values), which also limits the sets' sizes. It would be acceptable if the algorithm ran in O(n^2), but its runtime should be essentially linear (maybe with a log factor) in the number of output sets.

Should each output set be a subset of the union of the input sets? I don't see that stated explicitly, but without that constraint, the number of possible output sets will be uncountably infinite, and therefore not constructable! — Nick Russo, May 28 '20 at 15:42
_At least one_ or _exactly one_? In other words, why `{1, 3, 5}` is not in `Out`? — user58697, May 28 '20 at 17:15
@user58697 it can't (always) be exactly one, as the input sets may not allow it. E.g., input: {1, 2}, {1}, {2}; output: {1, 2}. Here the (only) output set must have both 1 and 2, even though that's more than one of the elements in the first input set. But perhaps the intention is "at least one, and no more than necessary"? After all, the word "minimal" is mentioned twice. — Nick Russo, May 28 '20 at 17:20
@BishalG Suppose `n` will be several hundred, though it would be nice to have the algorithm scale upwards. Order of `n` squared would be acceptable, though. — Samufi, May 28 '20 at 18:06
@user58697 The comment by Nick Russo is correct, it is "at least" and not "exactly" (see also the example in the question). "No more than necessary" follows from the condition that none of the returned sets shall be a superset of any other returned set. — Samufi, May 28 '20 at 18:16
@NickRusso Yes, the sets shall be minimal and thus not contain elements that are in none of the input sets. I will edit the question to clarify. (Comment renewed due to an error) — Samufi, May 28 '20 at 19:46
"the sets are of moderate size (e.g. about 20) and contain only elements from a limited range (e.g. about 20 different values)." This sounds like the sets are all close to identical. There aren't very many ways to choose about 20 unique items given about 20 choices. Is this correct? — Nick Russo, May 29 '20 at 03:14
@NickRusso The 'moderate size' 20 is the limit of the sets' sizes, but the sets can be smaller. There are, for example, many ways to choose subsets of size 10 from 20 items. — Samufi, May 29 '20 at 03:29

score 1 · Answer 1 · answered May 31 '20 at 06:31

Since your space is so constrained -- only 20 values from which to choose -- beat this thing to death with a blunt instrument:

1. Convert each of your target sets (the ones to be covered) to a bit-map. In your given case, this will correspond to an integer of 20 bits, one bit position for each of the 20 values.
2. Create a list of candidate covering bitmaps, the integers 0 through (2^20-1).
3. Take the integers in order. Use bit operations to determine whether each target set has a 1 bit in common with the candidate. If all satisfy the basic condition, the candidate is validated.
4. When you validate a candidate, remove all super-set integers from the list of candidates.
5. When you run out of candidates, your validated candidates are the desired collection. In the code below, I simply print each as it is identified.

Code:

from time import time
start = time()

s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}

# Convert each set to its bit-map
point_set = [7, 28, 48]

# make list of all possible covering bitmaps
cover = list(range(2**20))

while cover:
    # Pop any item from remaining covering sets
    candidate = cover.pop(0)
    # Does this bitmap have a bit in common with each target set?
    if all((candidate & point) for point in point_set):
        print(candidate)

        # Remove all candidates that are supersets of the successful covering one.
        superset = set([other for other in cover if (candidate & ~other) == 0])
        cover = [item for item in cover if item not in superset]
        print(time() - start, "lag time")

print(time() - start, "seconds")

Output -- I have not converted the candidate integers back to their constituent elements. This is a straightforward task.
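For completeness, a minimal sketch of the conversion in both directions, assuming the same mapping as in `point_set` above (value `v` occupies bit position `v - 1`, so `{1, 2, 3}` becomes 7). The helper names `to_bitmap` and `from_bitmap` are hypothetical.

```python
def to_bitmap(values):
    """Encode a set of positive integers as a bitmap (value v -> bit v - 1)."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << (v - 1)
    return bitmap


def from_bitmap(bitmap):
    """Decode a bitmap back into the set of values it represents."""
    values = set()
    position = 0
    while bitmap:
        if bitmap & 1:
            values.add(position + 1)
        bitmap >>= 1
        position += 1
    return values


print([to_bitmap(s) for s in ({1, 2, 3}, {3, 4, 5}, {5, 6})])  # [7, 28, 48]
print(from_bitmap(41))                                         # {1, 4, 6}
```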

Note that most of the time in this example is spent in exhausting the list of integers that were not supersets of a validated cover set, such as all multiples of 64 (the lower 6 bits are all zero, and thus are disjoint from every target set).

This 33 seconds is on my aging desktop computer; your laptop or other platform is almost certainly faster. I trust that any improvement from a more efficient algorithm is easily offset in that this algorithm is quick to implement and easier to understand.

17
0.4029195308685303 lag time
18
0.6517734527587891 lag time
20
0.8456630706787109 lag time
36
1.0555419921875 lag time
41
1.2604553699493408 lag time
42
1.381387710571289 lag time
33.005757570266724 seconds
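A possible variant of the same brute-force idea, sketched under the assumption that the number of bits is passed in explicitly: instead of materializing the candidate list and deleting supersets from it, each valid candidate is compared against the covers found so far. This works because a proper subset always has a smaller integer value than its superset, so visiting the integers in increasing order guarantees that any subset has been seen first. The function name `minimal_covers` is made up here, and the loop still visits all 2^n_bits integers.

```python
def minimal_covers(point_set, n_bits):
    """Return all minimal covering bitmaps, in increasing integer order."""
    found = []
    for candidate in range(1, 2 ** n_bits):
        # Skip candidates that miss at least one target set.
        if not all(candidate & point for point in point_set):
            continue
        # Skip candidates that contain an already accepted (smaller) cover.
        if any(prev & ~candidate == 0 for prev in found):
            continue
        found.append(candidate)
    return found


# Only 6 bits are needed for the example, since the values are 1..6.
print(minimal_covers([7, 28, 48], 6))  # [17, 18, 20, 36, 41, 42]
```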

This is an interesting idea. I wonder, however, how well the algorithm scales. 20 was just an order of magnitude, and I would not like the program to break down if there were, say, 40 values. That is, it would be acceptable if the runtime analysis includes a factor of the number of input variables somewhere. As far as I understand your algorithm, however, it runs in exponential time of the number of input variables. Correct me if I am wrong. — Samufi, May 31 '20 at 06:58
You're correct; this is exponential in both time and space. The candidate list is 2^N in length, and will eventually delete or accept each of those elements. Although the deletions are done in an efficient list comprehension, the algorithm is still **O(2^N)**. Since you specified a limit of "about 20", the theoretical complexity seems to be overridden by practical considerations. — Prune, May 31 '20 at 18:29

Samufi · Answer 2 · 2020-06-03T21:53:23.913

I have come up with a solution based on the trie data structure as described here. Tries make it relatively fast to determine whether one of the stored sets is a subset of another given set (Savnik, 2013).

The solution then looks as follows:

Create a trie
Iterate through the given sets
- In each iteration, go through the sets in the trie and check whether they are disjoint from the new set.
- If a stored set shares an element with the new set, keep it as is; if it is disjoint, remove it and add its extensions by each element of the new set to the trie, unless an extension is a superset of a set in the trie.
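The steps above can be sketched in pure Python, with the trie replaced by a plain list and a linear scan for the subset lookup. This is only an illustration of the iteration, not the trie-based implementation; the names `merge` and `minimal_hitting_sets` are hypothetical. An extension is compared only against the sets that were kept, since an extension of one disjoint set cannot contain an extension of a different disjoint set as long as the stored sets are minimal.

```python
def merge(new_set, hitting_sets):
    """Update the minimal hitting sets so that they also hit new_set."""
    new_set = frozenset(new_set)
    # Sets that already share an element with new_set stay unchanged.
    kept = [h for h in hitting_sets if h & new_set]
    result = list(kept)
    for h in hitting_sets:
        if h & new_set:
            continue
        # A disjoint set is replaced by its extensions with each new element,
        # skipping extensions that contain one of the kept sets.
        for e in new_set:
            candidate = h | {e}
            if not any(k <= candidate for k in kept):
                result.append(candidate)
    return result


def minimal_hitting_sets(sets):
    """Process the input sets one after the other, starting from the empty set."""
    result = [frozenset()]
    for s in sets:
        result = merge(s, result)
    return set(result)


for solution in sorted(minimal_hitting_sets([{1, 2, 3}, {3, 4, 5}, {5, 6}]), key=sorted):
    print(sorted(solution))
```

The linear scan over `kept` is the part that the trie is meant to speed up.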

The worst-case runtime is O(n m c), whereby m is the maximal number of solutions if we consider only n' <= n of the input sets, and c is the time factor from the subset lookups.

The code is below. I have implemented the algorithm based on the Python package datrie, which is a wrapper around an efficient C implementation of a trie. The code below is in Cython but can be converted to pure Python easily by removing or exchanging the Cython-specific commands.

The extended trie implementation:

from math import inf  # used as the default value of maxSize below

from datrie cimport BaseTrie, BaseState, BaseIterator

cdef bint has_subset_c(BaseTrie trie, BaseState trieState, str setarr, 
                        int index, int size):
    cdef BaseState trieState2 = BaseState(trie)
    cdef int i
    trieState.copy_to(trieState2)
    for i in range(index, size):
        if trieState2.walk(setarr[i]):
            if trieState2.is_terminal() or has_subset_c(trie, trieState2, setarr, 
                                                        i, size): 
                return True
            trieState.copy_to(trieState2)
    return False


cdef class SetTrie():
    def __init__(self, alphabet, initSet=[]):
        if not hasattr(alphabet, "__iter__"):
            alphabet = range(alphabet)
        self.trie = BaseTrie("".join(chr(i) for i in alphabet))
        self.touched = False
        for i in initSet:
            self.trie[chr(i)] = 0
            if not self.touched:
                self.touched = True

    def has_subset(self, superset):
        cdef BaseState trieState = BaseState(self.trie)
        setarr = "".join(chr(i) for i in superset)
        return bool(has_subset_c(self.trie, trieState, setarr, 0, len(setarr)))

    def extend(self, sets):
        for s in sets:
            self.trie["".join(chr(i) for i in s)] = 0
            if not self.touched:
                self.touched = True

    def delete_supersets(self):
        cdef str elem 
        cdef BaseState trieState = BaseState(self.trie)
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        if trieIter.next():
            elem = trieIter.key()
            while trieIter.next():
                self.trie._delitem(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, 0)
                elem = trieIter.key()
            if has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                val = self.trie.pop(elem)
                if not has_subset_c(self.trie, trieState, elem, 0, len(elem)):
                    self.trie._setitem(elem, val)


    def update_by_settrie(self, SetTrie setTrie, maxSize=inf, initialize=True):
        cdef BaseIterator trieIter = BaseIterator(BaseState(setTrie.trie))
        cdef str s
        if initialize and not self.touched and trieIter.next():
            for s in trieIter.key():
                self.trie._setitem(s, 0)
            self.touched = True

        while trieIter.next():
            self.update(set(trieIter.key()), maxSize, True)

    def update(self, otherSet, maxSize=inf, isStrSet=False):
        if not isStrSet:
            otherSet = set(chr(i) for i in otherSet)
        cdef str subset, newSubset, elem
        cdef list disjointList = []
        cdef BaseTrie trie = self.trie
        cdef int l
        cdef BaseIterator trieIter = BaseIterator(BaseState(self.trie))
        if trieIter.next():
            subset = trieIter.key()
            while trieIter.next():
                if otherSet.isdisjoint(subset):
                    disjointList.append(subset)
                    trie._delitem(subset)
                subset = trieIter.key()
            if otherSet.isdisjoint(subset):
                disjointList.append(subset)
                trie._delitem(subset)

        cdef BaseState trieState = BaseState(self.trie)
        for subset in disjointList:
            l = len(subset)
            if l < maxSize:
                if l+1 > self.maxSizeBound:
                    self.maxSizeBound = l+1
                for elem in otherSet:
                    newSubset = subset + elem
                    trieState.rewind()
                    if not has_subset_c(self.trie, trieState, newSubset, 0, 
                                        len(newSubset)):
                        trie[newSubset] = 0

    def get_frozensets(self):
        return (frozenset(ord(t) for t in subset) for subset in self.trie)

    def clear(self):
        self.touched = False
        self.trie.clear()

    def prune(self, maxSize):
        cdef bint changed = False
        cdef BaseIterator trieIter 
        cdef str k
        if self.maxSizeBound > maxSize:
            self.maxSizeBound = maxSize
            trieIter = BaseIterator(BaseState(self.trie))
            k = ''
            while trieIter.next():
                if len(k) > maxSize:
                    self.trie._delitem(k)
                    changed = True
                k = trieIter.key()
            if len(k) > maxSize:
                self.trie._delitem(k)
                changed = True
        return changed

    def __nonzero__(self):
        return self.touched

    def __repr__(self):
        return str([set(ord(t) for t in subset) for subset in self.trie])

This can be used as follows:

def cover_sets(sets):
    # initSet is iterated element by element in __init__, so pass the first set directly
    strie = SetTrie(range(10), sets[0])
    for s in sets[1:]:
        strie.update(s)
    return strie.get_frozensets()

Timing:

from timeit import timeit
s1 = {1, 2, 3}
s2 = {3, 4, 5}
s3 = {5, 6}
%timeit cover_sets([s1, s2, s3])

Result:

37.8 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
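A minimal sketch of such an index shift, assuming it is applied wherever the code above converts between integers and characters. The helper names `encode` and `decode` are made up, and the sorting of the elements is an additional assumption to make the key deterministic.

```python
def encode(int_set):
    """Map a set of non-negative integers to a key string, shifting by one
    so that the element 0 does not become the NUL character."""
    return "".join(chr(i + 1) for i in sorted(int_set))


def decode(key):
    """Invert encode: map a key string back to the set of integers."""
    return frozenset(ord(c) - 1 for c in key)


key = encode({0, 3, 5})
print(decode(key) == {0, 3, 5})  # True
```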
