Set Cover and Aliens

The Scenario

The year is 3000. Aliens have invaded and enslaved nearly the entire human race. Thankfully, you have somehow managed to stay safe and out of sight from your would-be alien overlords, and are currently holed up in a small house in a rural part of the country. However, supplies are beginning to wear thin, and before long you know you're going to need to retrieve these 4 different items from the abandoned buildings around you if you want to survive.

You've been hiding out here for a long time, and as such are rather familiar with the area. You know exactly where to find these items, but traveling to the different buildings for supplies is going to be dangerous. While traveling between buildings, you're at risk of being abducted by a roving UFO's tractor beam, vaporized, or worse! Analyzing a map of the area indicates that there are 4 buildings within equal distance from your safe house that combined have all of the resources you need.

Additionally, your current circumstances indicate that you only need one of each item. Retrieving more than one copy of each item is simply not going to increase your chances of survival. With this in mind, which of these buildings should you visit to get all the items you need in the minimum amount of trips? At a glance, it should be easy to tell that you should visit building 1 and building 4 to get the supplies you need, and that the minimum number of trips that you'll need to make to get all of the supplies is 2. However, what if you wanted to gather 100 items from 20 different buildings? Hidden within this relatively silly scenario is a deceptively hard problem known as the Set Cover Search Problem. Before we try and come up with an algorithm however, let's take a step back and see if we can turn this problem into something a bit more generic.

Abstracting The Problem

To start off with, we can think of our items and buildings as a set of subsets \(S = \{S_0, S_1, S_2\ ...\}\), and the union of all subsets in \(S\) as the universal set \(\mathbb{U}\). In mathematical notation:

\[ \mathbb{U} = \bigcup\limits^n _{i=0} S_i \]

where \(n\) is the number of subsets in \(S\).

\[e.g.\]

The universal set \(\mathbb{U}\) of \(\{\{1, 2\},\ \{4, 5\},\ \{2, 3\}\} = \{1, 2, 3, 4, 5\}\)

Next, let's use \(O\) to denote the optimal (or minimum) set cover of any given \(S\). This is in contrast to a suboptimal set cover, which is a set cover solution that contains more than the minimum amount of subsets needed to achieve full set coverage.

\[e.g.\]

The optimal set cover solution \(O\) of \(S\) where \(S = \{\{1, 2\}, \{3, 4\}, \{2, 3\}\}\) is \(\{\{1, 2\}, \{3, 4\}\}\)

Lastly, we're going to use the function \(p(S)\) to denote the power set of \(S\). The power set is defined as the set of all subsets of any given \(S\), including the empty set \(\emptyset\) and \(S\) itself.

\[e.g.\]

\[ p(\{1, 2\}) = \{\{\emptyset\},\ \{1\},\ \{2\},\ \{1, 2\}\} \]

Devising An Algorithm

With these tools at our disposal, lets see if we can come up with an algorithm. To start, one can readily deduce that if the power set \(p(S)\) contains all the subsets of \(S\), and our optimal solution \(O \subseteq S\), then \(O \in p(S)\). Therefore, every possible set cover in \(S\) (be it optimal or suboptimal) can be found within \(p(S)\) as well. As a result, finding \(O\) is as simple as iterating through all of the sets in \(p(S)\) to find the set \(S\) with the least amount of subsets where \(\bigcup\limits^n _i S_i = \mathbb{U}\)

Ultimately, our solution is going to look something like this:

def optimal_set_cover(S)
    P = p(S)
    U = universe(S)
    O = S
    for each set Pi in P
        Pi_U = universe(Pi)
        if Pi_U == U and len(Pi) < len(O)
            O = Pi
    return O

Let's try and implement a working solution in Python.

Implementation

To start, we're going to need to write a function to generate power sets. As generating power sets isn't the focus of this post, we're simply going to use a recipe directly from the python itertools recipe book to do it for us.

1
2
3
4
5


from itertools import chain, combinations

def power_set(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

>>> [s for s in power_set([1, 2, 3])]
[(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]

As you can see from the sample output, this is going to generate a list of tuples instead of sets like we might expect. We could easily modify this function to convert these tuples into sets, but doing so would significantly increase our final run time complexity. Given that our code will be able handle these tuples the same as if they were sets (as they're both iterables), this is an unnecessary step that I have intentionally left out.

We're also going to need a function to generate the universe of any given \(S\). This will work pretty much exactly as described in the previous section of this post; simply iterate through \(S\) (labeled as "sos" in our code) and union each subset together to form the universe \(\mathbb{U}\).

1
2
3
4
5


def universe(sos):
    u = set()
    for s in sos:
        u = u.union(s)
    return u

>>> universe([{1, 2, 3}, {4, 5}, {6}])
{1, 2, 3, 4, 5, 6}

With these two functions at our disposal, we now have everything we need to implement our optimal set cover algorithm:

1
2
3
4
5
6
7
8


def optimal_set_cover(sos):
    u = universe(sos)
    power_sos = power_set(sos)
    min_sos = sos # The minimum size set cover found so far.
    for s in power_sos:
        if universe(s) == u and len(s) < len(min_sos):
            min_sos = s
    return list(min_sos)

>>> optimal_set_cover([{1, 2}, {2, 3}, {3, 4}])
[{1, 2}, {3, 4}]

Analyzing Our Implementation

Unfortunately, as you may have guessed when I mentioned deriving our answer from the power set of \(S\), this algorithm is very, very slow. So slow that any attempt to generate a solution for more than 50 subsets is going to be realistically impossible for any computer to calculate; and unfortunately, it gets worse. This is about as fast as generating optimal set covers gets. Unfortunately, no algorithm exists that can generate an optimal set cover without having to calculate \(p(S)\) in the worst case. As generating \(p(S)\) has a worst case run time complexity of \(O(2^n)\), all optimal set cover algorithms are doomed to run in non polynomial time. Because of this, the set cover search problem is said to fall under the class of NP-Complete problems, a class of problems that are all extremely difficult to compute.

To put into perspective how incredibly slow an algorithm with a worst case run time complexity of to \(O(2^n)\) really is, lets write some benchmarks. To aid in benchmarking our code, I've written a little helper function generate random sets of subsets.

1
2
3
4
5
6


import random

def generate_random_sos(num_elements, num_sets):
    return [set([random.choice([z for z in range(1, num_elements + 1)])
        for y in range(1, random.randint(1, num_elements))])
        for x in range(num_sets)]

>>> generate_random_sos(5, 3)
[{4, 5}, {3, 4}, {1, 5}]

Using the timeit library, we were able to generate the following results.

optimal_set_cover(generate_random_sos(10, 10)) # 0.0028 seconds
optimal_set_cover(generate_random_sos(15, 15)) # 0.113 seconds
optimal_set_cover(generate_random_sos(20, 20)) # 5.08 seconds
optimal_set_cover(generate_random_sos(25, 25)) # 251.313 seconds
optimal_set_cover(generate_random_sos(30, 30)) # 6092.618 seconds

All benchmarks were recorded on a 2014 Macbook Pro.

Extrapolating upon this trend, we can estimate that calculating the optimal set cover for any given \(S\) with 50 subsets and 50 unique elements would take over 200 years to compute!

Instead of trying to find the absolute optimal set cover for any given set of subsets, what if we were okay with a solution that was close-enough? In other words, a suboptimal solution. If so, we're back in business because there just so happens to be an algorithm that can generate a pretty good suboptimal set covers (how good is beyond the scope of this post, but you can read more here if you'd like) that runs in \(O(n^2)\) time!

Taking A Greedy Approach

Key to most approximation algorithms is a solid heuristic. One simple heuristic we could use is simply the size of each subset. After all, the larger the subset, the larger the amount of potentially uncovered elements that might exist in that subset. Using this heuristic, lets try implementing a greedy algorithm that sorts the subsets from largest to smallest, and then iteratively unions them together until we've achieved full set coverage.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


def suboptimal_set_cover_1(sos):
    u = universe(sos)
    sos = sorted(sos, key=len)
    min_sos = [] # The minimum set cover solution found so far.
    ku = set() # The universe of elements covered so far, the "known universe".
    while ku != u:
        s = sos.pop()
        # Sanity check: Don't add a set if it doesn't increase our coverage.
        if len(ku.union(s)) > len(ku):
            min_sos.append(s)
            ku = ku.union(s)
    return min_sos

>>> suboptimal_set_cover_1([{1, 2, 3}, {4, 5}, {3}])
[{1, 2, 3}, {4, 5}] # Perfect!

suboptimal_set_cover_1([{1, 2}, {3, 4}, {2, 3}, {2}])
[{2, 3}, {3, 4}, {1, 2}] # Okay...

>>> suboptimal_set_cover_1([{1, 2, 3}, {2, 3, 4}, {4, 5, 6}])
[{4, 5, 6}, {2, 3, 4}, {1, 2, 3}] # Yikes!

So, our solution isn't perfect, but it's not the worst! Additionally, it's lightning fast compared to our previous algorithm, operating with a worst case run time complexity of \(O(nlog(n))\) (the time complexity of sorting \(S\) by the size of each subset). Notably, our algorithm really suffers when faced with an \(S\) where most of the subsets are of a similar size. In the special case where all subsets are the exact same size, our algorithm completely falls flat! In such a situation, all subsets are weighted equally according to our heuristic and our algorithm is unable make any meaningful decisions.

Fine Tuning Our Heuristic

Fortunately, we can do better than this by slightly changing our heuristic. What if instead of picking the largest subsets first, we instead picked the subsets that contain the largest amount of uncovered elements first? This should help us avoid situations where most of the subsets are of the same size.

1
2
3
4
5
6
7
8
9


def suboptimal_set_cover_2(sos):
    u = universe(sos)
    min_sos = [] # The minimum set cover solution found so far.
    ku = set() # The universe of elements covered so far, the "known universe".
    while ku != u:
        max_s = max(sos, key=lambda s: s.difference(ku))
        ku = ku.union(max_s)
        min_sos.append(max_s)
    return min_sos

>>> suboptimal_set_cover_2([{1, 2, 3}, {4, 5}, {3}])
[{1, 2, 3}, {4, 5}] # Perfect!

suboptimal_set_cover_2([{1, 2}, {3, 4}, {2, 3}, {2}])
[{1, 2}, {3, 4}] # Perfect!

>>> suboptimal_set_cover_2([{2, 3, 4, 5}, {1, 2, 3, 4}, {4, 5, 6}, {1, 2, 3}])
[{2, 3, 4, 5}, {1, 2, 3, 4}, {4, 5, 6}] # Not perfect but I'll take it.

One unfortunate side effect of this implementation is that it does increase the run time complexity to \(O(n^2)\) This is a result of having to re-calculate the subset with the largest set difference from the known universe at each iteration. However, this is a fine price to pay for greatly improved performance.